Social Media

Showing results in 'Publications'. Show all posts
Flekova, Lucie, Lyle Ungar, and Daniel Preoţiuc-Pietro. Exploring Stylistic Variation with Age and Income on Twitter. ACL., 2016. AbstractPDFSlides

Writing style allows NLP tools to adjust to the traits of an author. In this paper, we explore the relation between stylistic and syntactic features and authors’ age and income. We confirm our hypothesis that for numerous feature types writing style is predictive of income even beyond age. We analyze the predictive power of writing style features in a regression task on two data sets of around 5,000 Twitter users each. Additionally, we use our validated features to study daily variations in writing style of users from distinct income groups. Temporal stylistic patterns not only provide novel psychological insight into user behavior, but are useful for future research and applications in social media.

Flekova, Lucie, Jordan Carpenter, Salvatore Giorgi, Lyle Ungar, and Daniel Preoţiuc-Pietro. Analysing Biases in Human Perception of User Age and Gender from Text. ACL., 2016. AbstractPDFPoster

User traits disclosed through written text, such as age and gender, can be used to personalize applications such as recommender systems or conversational agents. However, human perception of these traits is not perfectly aligned with reality. In this paper, we conduct a large-scale crowdsourcing experiment on guessing age and gender from tweets. We systematically analyze the quality and possible biases of these predictions. We identify the textual cues which lead to miss-assessments of traits or make workers more or less confident in their choice. Our study demonstrates that differences between real and perceived traits are noteworthy and elucidates inaccurately used stereotypes in human perception.

Preoţiuc-Pietro, Daniel, Jordan Carpenter, Salvatore Giorgi, and Lyle Ungar. Studying the Dark Triad of Personality using Twitter Behavior. CIKM., 2016. AbstractPDF

Research into the darker traits of human nature is growing in interest especially in the context of increased social media usage. This allows users to express themselves to a wider online audience. We study the extent to which the standard model of dark personality – the dark triad – consisting of narcissism, psychopathy and Machiavellianism, is related to observable Twitter behavior such as platform usage, posted text and profile image choice. Our results show that we can map various behaviors to psychological theory and study new aspects related to social media usage. Finally, we build a machine learning algorithm that predicts the dark triad of personality in out-of-sample users with reliable accuracy.

Carpenter, Jordan, Daniel Preoţiuc-Pietro, Lucie Flekova, Salvatore Giorgi, Courtney Hagan, Margaret Kern, Anneke Buffone, Lyle Ungar, and Martin Seligman. "Real Men don’t say 'cute': Using Automatic Language Analysis to Isolate Inaccurate Aspects of Stereotypes." Social Psychological and Personality Science (2016). AbstractDraftSupplemental MaterialsWebsite

People associate certain behaviors with certain social groups. These stereotypical beliefs consist of both accurate and inaccurate associations. Using large-scale, data driven methods with social media as a context, we isolate stereotypes by using verbal expression. Across four social categories - gender, age, education level, and political orientation - we identify words and phrases that lead people to incorrectly guess the social category of the writer. Although raters often correctly categorize authors, they overestimate the importance of some stereotype-congruent signal. Findings suggest that data-driven approaches might be a valuable and ecologically valid tool for identifying even subtle aspects of stereotypes and highlighting the facets that are exaggerated or misapplied.

Srijith, PK, Kalina Bontcheva, Mark Hepple, and Daniel Preoţiuc-Pietro. "Sub-Story Detection in Twitter with Hierarchical Dirichlet Processes." Information Processing and Management (2016). AbstractPDFWebsite

Social media has now become the de facto information source on real world events. The challenge, however, due to the high volume and velocity nature of social media streams, is in how to follow all posts pertaining to a given event over time – a task referred to as story detection. Moreover, there are often several different stories pertaining to a given event, which we refer to as sub-stories and the corresponding task of their automatic detection – as sub-story detection. This paper proposes hierarchical Dirichlet processes (HDP), a probabilistic topic model, as an effective method for automatic sub-story detection. HDP can learn sub-topics associated with sub-stories which enables it to handle subtle variations in sub-stories. It is compared with state-of-the-art story detection approaches based on locality sensitive hashing and spectral clustering. We demonstrate the superior performance of HDP for sub-story detection on real world Twitter data sets using various evaluation measures. The ability of HDP to learn sub-topics helps it to recall the sub-stories with high precision. This has resulted in an improvement of up to 60% in the F-score performance of HDP based sub-story detection approach compared to standard story detection approaches. A similar performance improvement is also seen using an information theoretic evaluation measure proposed for the sub-story detection task. Another contribution of this paper is in demonstrating that considering the conversational structures within the Twitter stream can bring up to 200% improvement in sub-story detection performance.

Preotiuc-Pietro, Daniel, Maarten Sap, Andrew H. Schwartz, and Lyle Ungar. Mental Illness Detection at the World Well-Being Project for the CLPsych 2015 Shared Task In Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (CLPysch). NAACL, 2015. AbstractPDF

This article is a system description and report on the submission of the World Well-Being Project from the University of Pennsylvania in the `CLPsych 2015' shared task. The goal of the shared task was to automatically determine Twitter users who self-reported having one of two mental illnesses: post traumatic stress disorder (PTSD) and depression. Our system employs user metadata and textual features derived from Twitter posts. To reduce the feature space and avoid data sparsity, we consider several word clustering approaches. We explore the use of linear classifiers based on different feature sets as well as a combination use a linear ensemble. This method is agnostic of illness specific features, such as lists of medicines, thus making it readily applicable in other scenarios. Our approach ranked second in all tasks on average precision and showed best results at .1 false positive rates.

Preotiuc-Pietro, Daniel, Johannes Eichstaedt, Gregory Park, Maarten Sap, Laura Smith, Victoria Tobolsky, Andrew H. Schwartz, and Lyle Ungar. The Role of Personality, Age and Gender in Tweeting about Mental Illnesses In Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (CLPsych). NAACL, 2015. AbstractPDFSlides

Mental illnesses, such as depression and post traumatic stress disorder (PTSD), are highly underdiagnosed globally. Populations sharing similar demographics and personality traits are known to be more at risk than others. In this study, we characterise the language use of users disclosing their mental illness on Twitter. Language-derived personality and demographic estimates show surprisingly strong performance in distinguishing users that tweet a diagnosis of depression or PTSD from random controls, reaching an area under the receiver operating characteristic curve – AUC – of around .8 in all our binary classification tasks. In fact, when distinguishing users disclosing depression from those disclosing PTSD, the single feature of estimated age shows nearly as strong performance (AUC = .806) as using thousands of topics (AUC = .819) or tens of thousands of n-grams (AUC = .812). We also find that differential language analyses, controlled for demographics, recover many symptoms associated with the mental illnesses in the clinical literature.

Preoţiuc-Pietro, Daniel, Vasileios Lampos, and Nikolaos Aletras. An analysis of the user occupational class through Twitter content In ACL., 2015. AbstractPDFSlides

Social media content can be used as a complementary source to the traditional methods for extracting and studying collective social attributes. This study focuses on the prediction of the occupational class for a public user profile. Our analysis is conducted on a new annotated corpus of Twitter users, their respective job titles, posted textual content and platform-related attributes. We frame our task as classification using latent feature representations such as word clusters and embeddings. The employed linear and, especially, non-linear methods can predict a user’s occupational class with strong accuracy for the coarsest level of a standard occupation taxonomy which includes nine classes. Combined with a qualitative assessment, the derived results confirm the feasibility of our approach in inferring a new user attribute that can be embedded in a multitude of downstream applications.

Flekova, Lucie, Eugen Ruppert, and Daniel Preotiuc-Pietro. Analysing domain suitability of a sentiment lexicon by identifying distributionally bipolar words In Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA). EMNLP, 2015. AbstractPDFSlides

Contemporary sentiment analysis approaches rely heavily on lexicon based methods. This is mainly due to their simplicity, although the best empirical results can be achieved by more complex techniques. We introduce a method to assess suitability of generic sentiment lexicons for a given domain, namely to identify frequent bigrams where a polar word switches polarity. Our bigrams are scored using Lexicographers Mutual Information and leveraging large automatically obtained corpora. Our score matches human perception of polarity and demonstrates improvements in classification results using our enhanced context-aware method. Our method enhances the assessment of lexicon based sentiment detection algorithms and can be further used to quantify ambiguous words.

Flekova, Lucie, Daniel Preoţiuc-Pietro, Jordan Carpenter, Salvatore Giorgi, and Lyle Ungar. Analyzing crowdsourced assessment of user traits through Twitter posts In Work-in-Progress. HCOMP, 2015. AbstractPDFSup. MaterialsPoster

Social media allows any user to express themselves to the public through posting content. Using a crowdsourcing experiment, we aim to quantify and analyze which human attributes lead to better perceptions of the true identity of others. Using tweet content from a set of users with known age and gender information, we ask workers to rate their perception of these traits and we analyze those results in relation to the crowdsourcing workers’ age and gender. Results show that female workers are both more confident and more accurate at reporting gender, and workers in their thirties were most accurate but least confident for rating age. Our study is a first step in identifying the worker traits which contribute to a better understanding of others through their posted text content. Our findings help to identify the types of workers best suited for certain tasks.