Social Media

Preoţiuc-Pietro, Daniel, Maarten Sap, Andrew H. Schwartz, and Lyle Ungar. Mental Illness Detection at the World Well-Being Project for the CLPsych 2015 Shared Task. In Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (CLPsych). NAACL, 2015.

This article is a system description and report on the submission of the World Well-Being Project from the University of Pennsylvania in the 'CLPsych 2015' shared task. The goal of the shared task was to automatically identify Twitter users who self-reported having one of two mental illnesses: post-traumatic stress disorder (PTSD) and depression. Our system employs user metadata and textual features derived from Twitter posts. To reduce the feature space and avoid data sparsity, we consider several word clustering approaches. We explore the use of linear classifiers based on different feature sets as well as their combination using a linear ensemble. This method is agnostic of illness-specific features, such as lists of medicines, thus making it readily applicable in other scenarios. Our approach ranked second in all tasks on average precision and showed the best results at 0.1 false positive rates.

Preoţiuc-Pietro, Daniel, and Trevor Cohn. Mining user behaviours: A study of check-in patterns in Location Based Social Networks. WebSci, 2013.

Understanding the patterns underlying human mobility is of essential importance to applications like recommender systems. In this paper we investigate the behaviour of around 10,000 frequent users of Location Based Social Networks (LBSNs), making use of their full movement patterns. We analyse the metadata associated with the whereabouts of the users, with emphasis on the type of places and their evolution over time. We uncover patterns across different temporal scales for venue category usage. Then, focusing on individual users, we apply this knowledge in two tasks: 1) clustering users based on their behaviour and 2) predicting users’ future movements. By this, we demonstrate both qualitatively and quantitatively that incorporating temporal regularities is beneficial for making better sense of user behaviour.

Lampos, Vasileios, Nikolaos Aletras, Daniel Preoţiuc-Pietro, and Trevor Cohn. Predicting and characterising user impact on Twitter. EACL, 2014.

The open structure of online social networks and their uncurated nature give rise to problems of user credibility and influence. In this paper, we address the task of predicting the impact of Twitter users based only on features under their direct control, such as usage statistics and the text posted in their tweets. We approach the problem as regression and apply linear as well as nonlinear learning methods to predict a user impact score, estimated by combining the numbers of the user’s followers, followees and listings. The experimental results show that strong prediction performance is achieved, especially for models based on the Gaussian Processes framework. Hence, we can interpret various modelling components, transforming them into indirect ‘suggestions’ for impact boosting.

Preoţiuc-Pietro, Daniel, and Trevor Cohn. A temporal model of text periodicities using Gaussian Processes. EMNLP, 2013.

Temporal variations of text are usually ignored in NLP applications. However, text use changes with time, which can affect many applications. In this paper we model periodic distributions of words over time. Focusing on hashtag frequency in Twitter, we first automatically identify the periodic patterns. We use this for regression in order to forecast the volume of a hashtag based on past data. We use Gaussian Processes, a state-of-the-art Bayesian non-parametric model, with a novel periodic kernel. We demonstrate this in a text classification setting, assigning the tweet hashtag based on the rest of its text. This method shows significant improvements over competitive baselines.
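As a rough illustration of what a periodic covariance function encodes, the sketch below implements the standard exp-sine-squared periodic kernel. This is the textbook kernel, not the novel periodic kernel the paper proposes; the function name, parameter names and the 24-hour period in the usage note are assumptions made for the example.

```python
import math


def periodic_kernel(t1, t2, period, length_scale=1.0, variance=1.0):
    """Standard exp-sine-squared (periodic) covariance between two time points.

    The covariance is highest when t1 and t2 are a whole number of periods
    apart and lowest when they are half a period apart.
    NOTE: illustrative textbook kernel, not the kernel from the paper.
    """
    s = math.sin(math.pi * abs(t1 - t2) / period)
    return variance * math.exp(-2.0 * s * s / (length_scale ** 2))
```

With an assumed daily period of 24 hours, `periodic_kernel(0, 24, 24)` is larger than `periodic_kernel(0, 12, 24)`: a hashtag's volume at a given hour is treated as more similar to the same hour on the next day than to the hour twelve hours later.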

Preoţiuc-Pietro, Daniel, Sina Samangooei, Trevor Cohn, Nick Gibbins, and Mahesan Niranjan. Trendminer: an architecture for real time analysis of social media text. In Workshop on Real-Time Analysis and Mining of Social Streams (RAMSS). ICWSM, 2012.

The emergence of online social networks (OSNs) and the accompanying availability of large amounts of data pose a number of new natural language processing (NLP) and computational challenges. Data from OSNs is different to data from traditional sources (e.g. newswire). The texts are short, noisy and conversational. Another important issue is that data occurs in real-time streams, needing immediate analysis that is grounded in time and context. In this paper we describe a new open-source framework for efficient text processing of streaming OSN data (available at www.trendminer-project.eu). Whilst researchers have made progress in adapting or creating text analysis tools for OSN data, a system to unify these tasks has yet to be built. Our system is focused on a real-world scenario where fast processing and accuracy are paramount. We use the MapReduce framework for distributed computing and present running times for our system in order to show that scaling to online scenarios is feasible. We describe the components of the system and evaluate their accuracy. Our system supports easy integration of future modules in order to extend its functionality.

Lampos, Vasileios, Daniel Preoţiuc-Pietro, and Trevor Cohn. A user-centric model of voting intention from Social Media. ACL, 2013.

Social Media contain a multitude of user opinions which can be used to predict real-world phenomena in many domains including politics, finance and health. Most existing methods treat these problems as linear regression, learning to relate word frequencies and other simple features to a known response variable (e.g., voting intention polls or financial indicators). These techniques require very careful filtering of the input texts, as most Social Media posts are irrelevant to the task. In this paper, we present a novel approach which performs high quality filtering automatically, through modelling not just words but also users, framed as a bilinear model with a sparse regulariser. We also consider the problem of modelling groups of related output variables, using a structured multi-task regularisation method. Our experiments on voting intention prediction demonstrate strong performance over large-scale input from Twitter on two distinct case studies, outperforming competitive baselines.
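To make the idea of a bilinear model over users and words more concrete, here is a minimal sketch of how such a model could produce a prediction once its weights are known. It assumes a user-by-word frequency matrix and one weight vector each for users and words; the function name, argument names and toy numbers are hypothetical, and the training procedure with its sparse regulariser is not shown.

```python
def bilinear_predict(X, user_weights, word_weights, bias=0.0):
    """Bilinear prediction: bias + u^T X w.

    X[i][j] is the frequency of word j in the posts of user i (assumed layout).
    A user or word whose weight is exactly zero contributes nothing, which is
    how sparse weights can act as an automatic filter.
    """
    total = bias
    for i, u_i in enumerate(user_weights):
        if u_i == 0.0:
            continue  # filtered-out user
        row_score = sum(X[i][j] * w_j for j, w_j in enumerate(word_weights))
        total += u_i * row_score
    return total


# Toy example: the second user has zero weight, so their words are ignored.
X = [[1.0, 0.0],
     [0.0, 2.0]]
score = bilinear_predict(X, user_weights=[1.0, 0.0], word_weights=[0.5, 3.0])
```

In this toy case only the first user's row contributes, so the score depends solely on that user's word frequencies and the corresponding word weights.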

Rout, Dominic, Daniel Preoţiuc-Pietro, Kalina Bontcheva, and Trevor Cohn. Where's @wally: A classification approach to geolocating users based on their social ties. HT, 2013.

This paper presents an approach to geolocating users of online social networks, based solely on their ‘friendship’ connections. We observe that users interact more regularly with those closer to themselves and hypothesise that, in many cases, a person’s social network is sufficient to reveal their location. The geolocation problem is formulated as a classification task, where the most likely city for a user without an explicit location is chosen amongst the known locations of their social ties. Our method uses an SVM classifier and a number of features that reflect different aspects and characteristics of Twitter user networks. The SVM classifier is trained and evaluated on a dataset of Twitter users with known locations. Our method outperforms a state-of-the-art method for geolocating users based on their social ties.

Preoţiuc-Pietro, Daniel, Justin Cranshaw, and Tae Yano. Exploring venue-based city-to-city similarity measures. In Workshop on Urban Computing (UrbComp). SIGKDD, 2013.

In this work we explore the use of incidentally generated social network data for the folksonomic characterization of cities by the types of amenities located within them. Using data collected about venue categories in various cities, we examine the effect of different granularities of spatial aggregation and data normalization when representing a city as a collection of its venues. We introduce three vector-based representations of a city, where aggregations of the venue categories are done within a grid structure, within the city’s municipal neighborhoods, and across the city as a whole. We apply our methods to a novel dataset consisting of Foursquare venue data from 17 cities across the United States, totaling over 1 million venues. Our preliminary investigation demonstrates that different assumptions in the urban perception could lead to qualitative, yet distinctive, variations in the induced city description and categorization.
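The city-as-a-whole representation can be pictured with a small sketch: each city becomes a normalised vector of venue-category frequencies, and two cities are compared with a vector similarity. Cosine similarity is an assumption chosen for illustration (the abstract does not name the measure), and the function names and category labels are hypothetical.

```python
import math
from collections import Counter


def city_vector(venue_categories):
    """Represent a city as relative frequencies of its venue categories."""
    counts = Counter(venue_categories)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty city has no meaningful similarity
    return dot / (norm_a * norm_b)


# Hypothetical venue lists for two cities.
city_a = city_vector(["cafe", "cafe", "museum", "bar"])
city_b = city_vector(["cafe", "bar", "bar", "stadium"])
similarity = cosine_similarity(city_a, city_b)
```

Normalising by the total number of venues keeps a large city from dominating the comparison purely through venue count; the grid-based and neighbourhood-based representations would apply the same aggregation within smaller spatial units.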

Lampos, Vasileios, Daniel Preoţiuc-Pietro, Sina Samangooei, Douwe Gelling, and Trevor Cohn. Extracting socioeconomic patterns from the news: Modelling text and outlet importance jointly. In Workshop on Language Technologies and Computational Social Science (LACSS). ACL, 2014.

Information from news articles can be used to study correlations between textual discourse and socioeconomic patterns. This work focuses on the task of understanding how words contained in the news as well as the news outlets themselves may relate to a set of indicators, such as economic sentiment or unemployment rates. The bilinear nature of the applied regression model facilitates jointly learning word and outlet importance, supervised by these indicators. By evaluating the predictive ability of the extracted features, we can also assess their relevance to the target socioeconomic phenomena. Therefore, our approach can be formulated as a potential NLP tool, particularly suitable to the computational social science community, as it can be used to interpret connections between vast amounts of textual content and measurable society-driven factors.