Temporal models of streaming social media data

Citation:
Preoţiuc-Pietro, Daniel. Temporal models of streaming social media data. University of Sheffield, 2014.

Thesis Type:

PhD thesis

Abstract:

There are significant temporal dependencies between online behaviour and activities occurring in the real world. Particularly in text modelling, these are usually ignored or at best dealt with in overly simplistic ways, such as assuming smooth variation with time. Social media is a new data source which presents collective behaviour much more richly than traditional sources, such as newswire, with a finer time granularity, timely reflection of activities, multiple modalities and large volume. Analysing temporal patterns in this data is important in order to discover newly emerging topics, periodic occurrences and correlations with, or causal links to, real-world indicators or human behaviour patterns. With these opportunities come many challenges, both engineering (i.e. data volume and processing) and algorithmic, namely the inconsistency and short length of the messages and the presence of large amounts of messages irrelevant to our goal. Equipped with a better understanding of the dynamics of the complex temporal dependencies, tasks such as classification can be augmented to provide temporally aware responses.

In this thesis we model the temporal dynamics of social media data. We first show that temporality is an important characteristic of this type of data. Further comparisons and correlations with real-world indicators show that this data gives a timely reflection of real-world events. Our goal is to use these variations to discover emerging or recurring user behaviours. We consider both the use of words and user behaviour in social media. With these goals in mind, we adapt existing machine learning techniques and build novel ones. These span a wide range of models: from Markov models to regularised regression models, and from evolutionary spectral clustering, which models smooth temporal variation, to Gaussian Process regression, which can identify more complex temporal patterns.

We introduce approaches which discover and predict words, topics or behaviours that change over time or occur with some regularity. These are modelled for the first time in the NLP literature by using Gaussian Processes. We demonstrate that we can effectively pick out patterns, including periodicities, and achieve state-of-the-art forecasting results. We show that this performance gain transfers to improve tasks which do not take temporal information into account. We further analyse how temporal variation in the text can be used to discover and track new content. We develop a model that exploits the variation in word co-occurrences for clustering over time. Different collection and processing tools, as well as several datasets of social media data, have been developed and published as open-source software.

The thesis posits that temporal analysis of data, from social media in particular, provides us with insights into real-world dynamics. Incorporating this temporal information into other applications can benefit standard tasks in natural language processing and beyond.
