Computational social science

Fulgoni, Dean, Jordan Carpenter, Lyle Ungar, and Daniel Preoţiuc-Pietro. An Empirical Exploration of Moral Foundations Theory in Partisan News Sources In LREC., 2016.

News sources frame issues in different ways in order to appeal to or control the perception of their readers. We present a large-scale study of news articles from partisan sources in the US across a variety of different issues. We first highlight that differences between sides exist by predicting the political leaning of articles of unseen political bias. Framing can be driven by different types of morality that each group values. We emphasize differences in framing across news sources, building on moral foundations theory quantified using hand-crafted lexicons. Our results show that partisan sources frame political issues differently both in terms of word usage and through the moral foundations they relate to.
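As a rough illustration of lexicon-based quantification, the sketch below counts how often tokens from each foundation's word list appear in a text. The lexicon here is a tiny invented example and the function name `foundation_rates` is hypothetical; neither is taken from the paper, whose actual lexicons and scoring are not described in the abstract.

```python
import re
from collections import Counter

# Toy lexicon invented for illustration only; it is NOT the hand-crafted
# lexicon used in the paper.
TOY_LEXICON = {
    "care": {"protect", "harm", "suffer"},
    "fairness": {"fair", "equal", "justice"},
    "loyalty": {"nation", "betray", "loyal"},
}


def foundation_rates(text, lexicon=TOY_LEXICON):
    """Return, per foundation, the fraction of tokens found in its word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        for foundation, words in lexicon.items():
            if tok in words:
                counts[foundation] += 1
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {foundation: counts[foundation] / total for foundation in lexicon}


rates = foundation_rates("We must protect equal justice for the nation")
print(rates)
```

Comparing such rates between articles from different sources is one simple way to contrast framing; normalising by document length keeps long and short articles comparable.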

Preoţiuc-Pietro, Daniel, Wei Xu, and Lyle Ungar. Discovering User Attribute Stylistic Differences via Paraphrasing In AAAI., 2016.

User attribute prediction from social media text has proven successful and useful for downstream tasks. In previous studies, user trait differences have been limited primarily to the presence or absence of words that indicate topical preferences. In this study, we aim to find linguistic style distinctions across three different user attributes: gender, age and occupational class. By combining paraphrases with a simple yet effective method, we capture a wide set of stylistic differences that are exempt from topic bias. We show their predictive power in user profiling, conformity with human perception and psycholinguistic hypotheses, and potential use in generating natural language tailored to specific user traits.

Preoţiuc-Pietro, Daniel, Svitlana Volkova, Vasileios Lampos, Yoram Bachrach, and Nikolaos Aletras. "Studying User Income through Language, Behaviour and Affect in Social Media." PLoS ONE 10 (2015).

Automatically inferring user demographics from social media posts is useful for both social science research and a range of downstream applications in marketing and politics. We present the first extensive study where user behaviour on Twitter is used to build a predictive model of income. We apply non-linear methods for regression, i.e. Gaussian Processes, achieving strong correlation between predicted and actual user income. This allows us to shed light on the factors that characterise income on Twitter and analyse their interplay with user emotions and sentiment, perceived psycho-demographics and language use expressed through the topics of their posts. Our analysis uncovers correlations between different feature categories and income. Some reflect common belief (e.g. higher perceived education and intelligence indicate higher earnings) or known differences (e.g. gender and age differences), while others are novel findings (e.g. higher income users express more fear and anger, whereas lower income users express more of the time emotion and opinions).

Preoţiuc-Pietro, Daniel, Vasileios Lampos, and Nikolaos Aletras. An analysis of the user occupational class through Twitter content In ACL., 2015.

Social media content can be used as a complementary source to the traditional methods for extracting and studying collective social attributes. This study focuses on the prediction of the occupational class for a public user profile. Our analysis is conducted on a new annotated corpus of Twitter users, their respective job titles, posted textual content and platform-related attributes. We frame our task as classification using latent feature representations such as word clusters and embeddings. The employed linear and, especially, non-linear methods can predict a user’s occupational class with strong accuracy for the coarsest level of a standard occupation taxonomy which includes nine classes. Combined with a qualitative assessment, the derived results confirm the feasibility of our approach in inferring a new user attribute that can be embedded in a multitude of downstream applications.

Preoţiuc-Pietro, Daniel, Johannes Eichstaedt, Gregory Park, Maarten Sap, Laura Smith, Victoria Tobolsky, Andrew H. Schwartz, and Lyle Ungar. The Role of Personality, Age and Gender in Tweeting about Mental Illnesses In Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (CLPsych). NAACL, 2015.

Mental illnesses, such as depression and post traumatic stress disorder (PTSD), are highly underdiagnosed globally. Populations sharing similar demographics and personality traits are known to be more at risk than others. In this study, we characterise the language use of users disclosing their mental illness on Twitter. Language-derived personality and demographic estimates show surprisingly strong performance in distinguishing users that tweet a diagnosis of depression or PTSD from random controls, reaching an area under the receiver operating characteristic curve – AUC – of around .8 in all our binary classification tasks. In fact, when distinguishing users disclosing depression from those disclosing PTSD, the single feature of estimated age shows nearly as strong performance (AUC = .806) as using thousands of topics (AUC = .819) or tens of thousands of n-grams (AUC = .812). We also find that differential language analyses, controlled for demographics, recover many symptoms associated with the mental illnesses in the clinical literature.
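To make the single-feature AUC comparison concrete, here is a minimal sketch that computes AUC directly from one numeric feature using the pairwise-ranking definition. The ages and group labels are made up for illustration and are not data or results from the paper.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive example scores above
    a random negative example, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical estimated ages and binary group labels (illustrative only).
estimated_age = [22, 25, 31, 35, 40, 33]
group = [0, 0, 1, 1, 1, 0]
print(round(auc(estimated_age, group), 3))
```

Because AUC depends only on how the feature ranks the two groups, a single scalar such as estimated age can be scored the same way as the output of a classifier trained on thousands of features.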

Preoţiuc-Pietro, Daniel, Maarten Sap, Andrew H. Schwartz, and Lyle Ungar. Mental Illness Detection at the World Well-Being Project for the CLPsych 2015 Shared Task In Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (CLPsych). NAACL, 2015.

This article is a system description and report on the submission of the World Well-Being Project from the University of Pennsylvania in the 'CLPsych 2015' shared task. The goal of the shared task was to automatically determine Twitter users who self-reported having one of two mental illnesses: post traumatic stress disorder (PTSD) and depression. Our system employs user metadata and textual features derived from Twitter posts. To reduce the feature space and avoid data sparsity, we consider several word clustering approaches. We explore the use of linear classifiers based on different feature sets as well as their combination using a linear ensemble. This method is agnostic of illness-specific features, such as lists of medicines, thus making it readily applicable in other scenarios. Our approach ranked second in all tasks on average precision and showed best results at .1 false positive rates.

Lampos, Vasileios, Nikolaos Aletras, Daniel Preoţiuc-Pietro, and Trevor Cohn. Predicting and characterising user impact on Twitter. EACL., 2014.

The open structure of online social networks and their uncurated nature give rise to problems of user credibility and influence. In this paper, we address the task of predicting the impact of Twitter users based only on features under their direct control, such as usage statistics and the text posted in their tweets. We approach the problem as regression and apply linear as well as nonlinear learning methods to predict a user impact score, estimated by combining the numbers of the user’s followers, followees and listings. The experimental results indicate that strong prediction performance is achieved, especially for models based on the Gaussian Processes framework. Hence, we can interpret various modelling components, transforming them into indirect ‘suggestions’ for impact boosting.

Lampos, Vasileios, Daniel Preoţiuc-Pietro, and Trevor Cohn. A user-centric model of voting intention from Social Media. ACL., 2013.

Social Media contain a multitude of user opinions which can be used to predict real-world phenomena in many domains including politics, finance and health. Most existing methods treat these problems as linear regression, learning to relate word frequencies and other simple features to a known response variable (e.g., voting intention polls or financial indicators). These techniques require very careful filtering of the input texts, as most Social Media posts are irrelevant to the task. In this paper, we present a novel approach which performs high quality filtering automatically, through modelling not just words but also users, framed as a bilinear model with a sparse regulariser. We also consider the problem of modelling groups of related output variables, using a structured multi-task regularisation method. Our experiments on voting intention prediction demonstrate strong performance over large-scale input from Twitter on two distinct case studies, outperforming competitive baselines.
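The bilinear idea can be sketched as a prediction of the form u^T X w + b, where X is a users-by-words frequency matrix, u weights users and w weights words. The code below is a minimal sketch under that assumption; the L1 penalty stands in for "a sparse regulariser" as one common choice, and the function names, penalty form and numbers are illustrative rather than the paper's exact formulation.

```python
def bilinear_predict(X, u, w, bias=0.0):
    """Compute u^T X w + bias for a users-by-words matrix X (list of rows)."""
    return sum(
        u_i * sum(x_ij * w_j for x_ij, w_j in zip(row, w))
        for u_i, row in zip(u, X)
    ) + bias


def l1_penalised_loss(Xs, ys, u, w, bias, lam_u, lam_w):
    """Squared error over all time steps plus L1 penalties on u and w.

    The L1 penalty is an assumed stand-in for a sparsity-inducing regulariser;
    it pushes many user and word weights to exactly zero.
    """
    sq_err = sum((bilinear_predict(X, u, w, bias) - y) ** 2 for X, y in zip(Xs, ys))
    penalty = lam_u * sum(abs(v) for v in u) + lam_w * sum(abs(v) for v in w)
    return sq_err + penalty


# Toy example: 2 users, 3 words, one observation (all values hypothetical).
X = [[1, 0, 2],
     [0, 3, 1]]
u = [0.5, 0.0]          # second user is filtered out (zero weight)
w = [1.0, 0.0, -1.0]    # second word is filtered out (zero weight)
print(bilinear_predict(X, u, w, bias=2.0))
print(l1_penalised_loss([X], [1.0], u, w, 2.0, lam_u=0.1, lam_w=0.1))
```

Zero entries in u and w act as the automatic filter: a user or word with zero weight contributes nothing to the prediction, so irrelevant posts drop out without manual curation.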

Lampos, Vasileios, Daniel Preoţiuc-Pietro, Sina Samangooei, Douwe Gelling, and Trevor Cohn. Extracting socioeconomic patterns from the news: Modelling text and outlet importance jointly In Workshop on Language Technologies and Computational Social Science (LACSS). ACL., 2014.

Information from news articles can be used to study correlations between textual discourse and socioeconomic patterns. This work focuses on the task of understanding how words contained in the news as well as the news outlets themselves may relate to a set of indicators, such as economic sentiment or unemployment rates. The bilinear nature of the applied regression model facilitates jointly learning word and outlet importance, supervised by these indicators. By evaluating the predictive ability of the extracted features, we can also assess their relevance to the target socioeconomic phenomena. Therefore, our approach can be formulated as a potential NLP tool, particularly suitable to the computational social science community, as it can be used to interpret connections between vast amounts of textual content and measurable society-driven factors.