Publications

Thesis
Preoţiuc-Pietro, Daniel. Temporal models of streaming social media data. University of Sheffield, 2014.

There are significant temporal dependencies between online behaviour and occurring real-world activities. Particularly in text modelling, these are usually ignored or at best dealt with in overly simplistic ways, such as assuming smooth variation with time. Social media is a new data source which presents collective behaviour much more richly than traditional sources, such as newswire, with a finer time granularity, timely reflection of activities, multiple modalities and large volume. Analysing temporal patterns in this data is important in order to discover newly emerging topics, periodic occurrences and correlation or causality to real-world indicators or human behaviour patterns. With these opportunities come many challenges, both engineering (i.e. data volume and processing) and algorithmic, namely the inconsistency and short length of the messages and the presence of large numbers of messages irrelevant to our goal. Equipped with a better understanding of the dynamics of the complex temporal dependencies, tasks such as classification can be augmented to provide temporally aware responses.

In this thesis we model the temporal dynamics of social media data. We first show that temporality is an important characteristic of this type of data. Further comparisons and correlation to real world indicators show that this data gives a timely reflection of real world events. Our goal is to use these variations to discover emerging or recurring user behaviours. We consider both the use of words and user behaviour in social media. With these goals in mind, we adapt existing and build novel machine learning techniques. These span a wide range of models: from Markov models to regularised regression models and from evolutionary spectral clustering which models smooth temporal variation to Gaussian Process regression which can identify more complex temporal patterns.

We introduce approaches which discover and predict words, topics or behaviours that change over time or occur with some regularity. These are modelled for the first time in the NLP literature using Gaussian Processes. We demonstrate that we can effectively pick out patterns, including periodicities, and achieve state-of-the-art forecasting results. We show that this performance gain transfers to improve tasks which do not take temporal information into account. We further analyse how temporal variation in the text can be used to discover and track new content. We develop a model that exploits the variation in word co-occurrences for clustering over time. Different collection and processing tools, as well as several datasets of social media data, have been developed and published as open-source software.

The thesis posits that temporal analysis of data, from social media in particular, provides us with insights into real-world dynamics. Incorporating this temporal information into other applications can benefit standard tasks in natural language processing and beyond.

Report
Preoţiuc-Pietro, Daniel, Sina Samangooei, Andrea Varga, Douwe Gelling, Trevor Cohn, and Mahesan Niranjan. Tools for mining non-stationary data - v2. Clustering models for discovery of regional and demographic variation - v2. Public Deliverable for Trendminer Project, 2014.

This document presents advanced research and software development work for Task 3.2 on tools for mining non-stationary data and for Task 3.3 on clustering models integrating regional and demographic information, with the aim of understanding streaming data. First, for modelling non-stationary data, a research experiment is presented for categorising and forecasting word frequency patterns using Gaussian Processes, with an emphasis on word periodicities. A new soft clustering method based on topic models is introduced, which learns topics and their temporal profiles jointly. For using regional and demographic user information, the predictive model presented in previous work (Samangooei et al., 2013) is extended. This is used to identify differences in voting intention between different regions of the United Kingdom and different genders. For discovering specific regional clusters, the soft clustering technique is extended to learn the topics and their regional and temporal profiles jointly. Finally, the predictive and clustering models developed on social media data are applied to a news summary dataset where richer linguistic features are also used.

Conference Proceedings
Preoţiuc-Pietro, Daniel, and Trevor Cohn. A temporal model of text periodicities using Gaussian Processes. EMNLP, 2013.

Temporal variations of text are usually ignored in NLP applications. However, text use changes with time, which can affect many applications. In this paper we model periodic distributions of words over time. Focusing on hashtag frequency in Twitter, we first automatically identify the periodic patterns. We use this for regression in order to forecast the volume of a hashtag based on past data. We use Gaussian Processes, a state-of-the-art Bayesian non-parametric model, with a novel periodic kernel. We demonstrate this in a text classification setting, assigning the tweet hashtag based on the rest of its text. This method shows significant improvements over competitive baselines.
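
As a rough illustration of forecasting a periodic frequency series with a Gaussian Process, the sketch below uses the standard exp-sine-squared periodic kernel. This is an assumption made for illustration and is not the novel kernel proposed in the paper. The function names, hyperparameter values and the toy hourly series are all hypothetical.

```python
import numpy as np

def periodic_kernel(x1, x2, period, length_scale=1.0, variance=1.0):
    """Standard exp-sine-squared kernel (assumed here; not the paper's kernel)."""
    diff = np.abs(x1[:, None] - x2[None, :])
    return variance * np.exp(-2.0 * np.sin(np.pi * diff / period) ** 2 / length_scale ** 2)

def gp_forecast(t_train, y_train, t_test, period, noise=0.1):
    """Posterior mean of a zero-mean GP with a periodic kernel and Gaussian noise."""
    K = periodic_kernel(t_train, t_train, period) + noise ** 2 * np.eye(len(t_train))
    K_star = periodic_kernel(t_test, t_train, period)
    alpha = np.linalg.solve(K, y_train)  # (K + noise^2 I)^{-1} y
    return K_star @ alpha

# Toy data: three days of hourly values with a 24-hour cycle (hypothetical).
t_train = np.arange(0.0, 72.0)
y_train = np.sin(2 * np.pi * t_train / 24.0)

# Forecast the fourth day from the first three.
t_test = np.arange(72.0, 96.0)
y_pred = gp_forecast(t_train, y_train, t_test, period=24.0)
```

In this sketch the period is fixed by hand. In practice it would be a hyperparameter to estimate from the data, for example by maximising the marginal likelihood.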

Preoţiuc-Pietro, Daniel, Sina Samangooei, Trevor Cohn, Nick Gibbins, and Mahesan Niranjan. Trendminer: an architecture for real time analysis of social media text. In Workshop on Real-Time Analysis and Mining of Social Streams (RAMSS). ICWSM, 2012.

The emergence of online social networks (OSNs) and the accompanying availability of large amounts of data pose a number of new natural language processing (NLP) and computational challenges. Data from OSNs is different to data from traditional sources (e.g. newswire). The texts are short, noisy and conversational. Another important issue is that data occurs in real-time streams, needing immediate analysis that is grounded in time and context. In this paper we describe a new open-source framework for efficient text processing of streaming OSN data (available at www.trendminer-project.eu). Whilst researchers have made progress in adapting or creating text analysis tools for OSN data, a system to unify these tasks has yet to be built. Our system is focused on a real-world scenario where fast processing and accuracy are paramount. We use the MapReduce framework for distributed computing and present running times for our system in order to show that scaling to online scenarios is feasible. We describe the components of the system and evaluate their accuracy. Our system supports easy integration of future modules in order to extend its functionality.
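
The MapReduce pattern mentioned above can be sketched in a few lines of single-process Python. This is only a toy illustration of the map and reduce steps for token counting, not the Trendminer implementation. The function names and sample messages are hypothetical, and whitespace splitting stands in for a real tokeniser.

```python
from collections import Counter
from functools import reduce

def map_tokens(message):
    """Map step: turn one message into per-token counts (naive whitespace split)."""
    return Counter(message.lower().split())

def reduce_counts(left, right):
    """Reduce step: merge two partial count tables."""
    return left + right

# Hypothetical sample messages standing in for a stream of posts.
messages = ["Rain in Sheffield", "rain again", "Sheffield wins"]

# In a distributed setting the map calls would run in parallel on separate workers.
totals = reduce(reduce_counts, map(map_tokens, messages), Counter())
```

Because merging count tables is associative, the partial results can be combined in any grouping, which is what allows the work to be spread across machines.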