#tweetparsing: Twitter data mining introduction through parts-of-speech tagging, topic modelling and tweet generation techniques

The rapid growth of data on social media, particularly on Twitter, has prompted extensive research into categorising tweets by parts-of-speech tags and by topic, to better understand large datasets. Here we propose different methods for text interpretation through parts-of-speech tagging, topic modelling and tweet generation, using 4000 of the most recent tweets of the top 20 most followed Members of Parliament in the UK on Twitter as a dataset. Our approach uses Conditional Random Fields as an introductory foray into structured output predictive models for parts-of-speech tagging, which sorts the words in each tweet into lexical categories, and Latent Dirichlet Allocation as a framework for topic modelling, which assigns words in tweets to specific topic bands. We then show how simple Markov chains can be used to generate tweets in the style of a particular Member of Parliament. We evaluate our lexical categorisation model and find it to be 96.03% accurate. Our studies show that 70% of the top 20 Members of Parliament talk about 10 or more topics in their most recent tweets, with 28 and 2 being the upper and lower bounds respectively for the number of topics discussed. Insightful phrases such as ‘thoughts are with europe’ and ‘we want social distancing’ can be observed in tweets generated by our first-order Markov chain model.
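To illustrate the tweet generation step, the following is a minimal sketch of a first-order Markov chain text generator, in which each word is sampled using only the word immediately before it. It is not the authors' implementation: the function names `build_chain` and `generate`, the `START` and `END` markers, the whitespace tokenisation and the three toy tweets are all assumptions made for this example.

```python
import random
from collections import defaultdict

# Hypothetical boundary markers (not from the paper).
START, END = "<s>", "</s>"

def build_chain(tweets):
    """Map each word to the list of words observed directly after it."""
    chain = defaultdict(list)
    for tweet in tweets:
        # Naive lowercase whitespace tokenisation; an assumption for brevity.
        tokens = [START] + tweet.lower().split() + [END]
        for current, following in zip(tokens, tokens[1:]):
            chain[current].append(following)
    return chain

def generate(chain, rng, max_words=30):
    """Walk the chain from START, sampling each next word from the
    successors of the current word only (first-order dependence)."""
    word, words = START, []
    while len(words) < max_words:
        # Duplicates in the successor list make frequent pairs more likely.
        word = rng.choice(chain[word])
        if word == END:
            break
        words.append(word)
    return " ".join(words)

# Toy tweets invented for illustration; not taken from the dataset.
toy_tweets = [
    "our thoughts are with europe today",
    "we want social distancing to continue",
    "our thoughts are with the nhs",
]

chain = build_chain(toy_tweets)
print(generate(chain, random.Random(0)))
```

Storing successors as a list with repeats keeps the sketch short, since sampling uniformly from that list reproduces the observed transition frequencies without computing probabilities explicitly. Training one chain per Member of Parliament would be one way to imitate an individual style.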
