Automatic Labelling of Topic Models Learned from Twitter by Summarisation

Amparo Elizabeth Cano Basave    Yulan He    Ruifeng Xu§
Knowledge Media Institute, Open University, UK
School of Engineering and Applied Science, Aston University, UK
§ Key Laboratory of Network Oriented Intelligent Computation
Shenzhen Graduate School, Harbin Institute of Technology, China

Latent topics derived by topic models such as Latent Dirichlet Allocation (LDA) are the result of hidden thematic structures which provide further insights into the data. The automatic labelling of such topics derived from social media, however, poses new challenges, since topics may characterise novel events happening in the real world. Existing automatic topic labelling approaches which depend on external knowledge sources become less applicable here, since relevant articles/concepts of the extracted topics may not exist in external sources. In this paper we propose to address the problem of automatic labelling of latent topics learned from Twitter as a summarisation problem. We introduce a framework which applies summarisation algorithms to generate topic labels. These algorithms are independent of external sources and rely only on the identification of dominant terms in documents related to the latent topic. We compare the efficiency of existing state-of-the-art summarisation algorithms. Our results suggest that summarisation algorithms generate better topic labels, which capture event-related context, compared to the top-n terms returned by LDA.

1 Introduction

Topic model based algorithms applied to social media data have become a mainstream technique in performing various tasks including sentiment analysis [11] and event detection [34, 6]. However, one of the main challenges is the task of understanding the semantics of a topic. This task has been approached by investigating methodologies for identifying meaningful topics through semantic coherence [1, 24, 27] and for characterising the semantic content of a topic through automatic labelling techniques [12, 14, 22]. In this paper we focus on the latter.

Our research task of automatically labelling a topic consists of selecting a set of words that best describes the semantics of the terms involved in this topic. The most generic approach to automatic labelling has been to use as primitive labels the top-n words in a topic distribution learned by a topic model such as LDA [9, 2]. Such top words are usually ranked using the marginal probabilities $P(w_i|t_j)$ associated with each word $w_i$ for a given topic $t_j$. This task can be illustrated by considering the following topic derived from social media related to Education:

school protest student fee choic motherlod tuition teacher anger polic

where the top 10 words ranked by $P(w_i|t_j)$ for this topic are listed. The task is therefore to find the top-n terms which are most representative of the given topic. In this example, the topic certainly relates to a student protest, as revealed by the top 3 terms, which can be used as a good label for this topic.

However, previous work has shown that top terms are not enough for interpreting the coherent meaning of a topic [22]. More recent approaches have explored the use of external sources (e.g. Wikipedia, WordNet) to support the automatic labelling of topics, deriving candidate labels by means of lexical [14, 21, 22] or graph-based [12] algorithms applied to these sources.

Mei et al. [22] proposed an unsupervised probabilistic methodology to automatically assign a label to a topic model. Their proposed approach was defined as an optimisation problem involving the minimisation of the KL divergence between a given topic and the candidate labels while maximising the mutual information between these two word distributions. Lau et al. [15] proposed to label topics by selecting top-n terms to label the overall topic based on different ranking mechanisms including pointwise mutual information and conditional probabilities.

Methods relying on external sources for automatic labelling of topics include the work by Magatti et al. [21] which derived candidate topic labels for topics induced by LDA using the hierarchy obtained from the Google Directory service and expanded through the use of the OpenOffice English Thesaurus. Lau et al. [14] generated label candidates for a topic based on top-ranking topic terms and titles of Wikipedia articles. They then built a Support Vector Regression (SVR) model for ranking the label candidates. More recently, Hulpus et al. [12] proposed to make use of a structured data source (DBpedia) and employed graph centrality measures to generate semantic concept labels which can characterise the content of a topic.

Most previous topic labelling approaches focus on topics derived from well-formatted and static documents. In contrast to this type of content, however, the labelling of topics derived from tweets presents different challenges. By nature, micropost content is sparse and contains ill-formed words. Moreover, the use of Twitter as the "what's-happening-right-now" tool introduces new event-dependent relations between words which might not have a counterpart in existing knowledge sources (e.g. Wikipedia). Our original interest in labelling topics stems from work in topic model based event extraction from social media, in particular from tweets [32, 6]. As opposed to previous approaches, the research presented in this paper addresses the labelling of topics exposing event-related content that might not have a counterpart in existing external sources. Based on the observation that a short summary of a collection of documents can serve as a label characterising the collection, we propose to generate topic label candidates based on the summarisation of a topic's relevant documents. Our contributions are two-fold:
- We propose a novel approach for topic labelling that relies on term relevance of documents relating to a topic; and
- We show that summarisation algorithms, which are independent of external sources, can be used with success to label topics, presenting a higher performance than the top-n terms baseline.

2 Methodology

We propose to approach the topic labelling problem as a multi-document summarisation task. The following describes our proposed framework to characterise documents relevant to a topic.

2.1 Preliminaries

Given a set of documents, the problem to be solved by topic modelling is the posterior inference of the variables which determine the hidden thematic structures that best explain an observed set of documents. Focusing on the Latent Dirichlet Allocation (LDA) model [2, 9], let $\mathcal{D}$ be a corpus of documents denoted as $\mathcal{D}=\{\boldsymbol{d}_1,\boldsymbol{d}_2,\ldots,\boldsymbol{d}_D\}$, where each document consists of a sequence of $N_d$ words denoted by $\boldsymbol{d}=(w_1,w_2,\ldots,w_{N_d})$, and each word in a document is an item from a vocabulary index of $V$ different terms denoted by $\{1,2,\ldots,V\}$. Given $D$ documents containing $K$ topics expressed over $V$ unique words, the LDA generative process is described as follows:
- For each topic $k\in\{1,\ldots,K\}$, draw $\phi_k\sim\mathrm{Dirichlet}(\beta)$;
- For each document $d\in\{1,\ldots,D\}$:
  - draw $\theta_d\sim\mathrm{Dirichlet}(\alpha)$;
  - for each word $n\in\{1,\ldots,N_d\}$ in document $d$:
    - draw a topic $z_{d,n}\sim\mathrm{Multinomial}(\theta_d)$;
    - draw a word $w_{d,n}\sim\mathrm{Multinomial}(\phi_{z_{d,n}})$.

where $\phi_k$ is the word distribution for topic $k$, and $\theta_d$ is the distribution of topics in document $d$. Topics are interpreted using the top $N$ terms ranked by the marginal probability $p(w_i|t_j)$.
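The generative process above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: symmetric Dirichlet priors, a fixed document length, words indexed from 0 rather than 1, and hypothetical function names (`sample_dirichlet`, `sample_categorical`, `generate_corpus`) that do not come from the paper. It samples a synthetic corpus; it does not perform the posterior inference that LDA is actually used for.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalising Gamma samples."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw a single index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, num_topics, vocab_size, doc_len,
                    alpha, beta, seed=0):
    """Sample (phi, corpus) following the LDA generative story."""
    rng = random.Random(seed)
    # phi[k] is the word distribution of topic k.
    phi = [sample_dirichlet(beta, vocab_size, rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # theta is the topic mixture of the current document.
        theta = sample_dirichlet(alpha, num_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)   # topic for this position
            w = sample_categorical(phi[z], rng)  # word drawn from that topic
            doc.append(w)
        corpus.append(doc)
    return phi, corpus


if __name__ == "__main__":
    phi, corpus = generate_corpus(num_docs=3, num_topics=2, vocab_size=6,
                                  doc_len=8, alpha=0.5, beta=0.5)
    for doc in corpus:
        print(doc)
```

The fixed document length is a simplification; in the formulation above each document has its own length $N_d$.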

2.2 Automatic Labelling of Topic Models

Given $K$ topics over the document collection $\mathcal{D}$, the topic labelling task consists of discovering a sequence of words for each topic $k\in\mathcal{K}$. We propose to generate topic label candidates by summarising topic-relevant documents. Such documents can be derived using both the observed data from the corpus $\mathcal{D}$ and the inferred topic model variables. In particular, the prominent topic of a document $d$ can be found by

$$k_d=\arg\max_{k\in\mathcal{K}}\,p(k|d)\qquad(1)$$

Therefore, given a topic $k$, the set $\mathcal{C}$ of documents related to this topic can be obtained via equation 1, i.e. as the documents whose prominent topic is $k$.

Given the set of documents $\mathcal{C}$ relevant to topic $k$, we propose to generate a label of a desired length $x$ from the summarisation of $\mathcal{C}$.
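As a small illustration of equation 1, the sketch below collects the documents whose most probable topic is a given topic. It assumes the inferred document-topic probabilities are available as a list of per-document lists; the function name `documents_for_topic` and the toy values are hypothetical.

```python
def documents_for_topic(doc_topic_probs, topic):
    """Return indices of documents whose prominent topic is `topic`.

    doc_topic_probs[d][k] is assumed to hold p(k|d) for document d.
    Ties are resolved in favour of the lowest topic index.
    """
    relevant = []
    for doc_id, probs in enumerate(doc_topic_probs):
        prominent = max(range(len(probs)), key=probs.__getitem__)
        if prominent == topic:
            relevant.append(doc_id)
    return relevant


if __name__ == "__main__":
    theta = [
        [0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1],
        [0.5, 0.3, 0.2],
    ]
    print(documents_for_topic(theta, 0))  # [0, 2]
```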

2.3 Topic Labelling by Summarisation

We compare different summarisation algorithms based on their ability to provide a good label for a given topic. In particular, we investigate the use of lexical features by comparing four well-known multi-document summarisation algorithms against the top-n topic terms baseline. These algorithms include:

Sum Basic (SB)

This is a frequency based summarisation algorithm [25], which computes initial word probabilities for words in a text. It then weights each sentence in the text (in our case a micropost) by computing the average probability of the words in the sentence. In each iteration it picks the highest weighted document and from it the highest weighted word. It uses an update function which penalises words which have already been picked.
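The procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes microposts are already tokenised, and it uses squaring of a picked word's probability as the penalising update, which is the update commonly associated with SumBasic but is not spelled out in the text. The function name `sumbasic_label` is hypothetical.

```python
from collections import Counter


def sumbasic_label(docs, length):
    """Pick up to `length` label terms from tokenised microposts."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    label = []
    while len(label) < length:
        # Only microposts that still contain an unpicked word are useful.
        candidates = [d for d in docs if any(w not in label for w in d)]
        if not candidates:
            break
        # Weight of a micropost: average probability of its words.
        best_doc = max(candidates,
                       key=lambda d: sum(prob[w] for w in d) / len(d))
        # From that micropost, take the highest-weighted unpicked word.
        best_word = max((w for w in best_doc if w not in label),
                        key=lambda w: prob[w])
        label.append(best_word)
        # Assumed update: squaring penalises words already picked.
        prob[best_word] **= 2
    return label


if __name__ == "__main__":
    docs = [["mine", "coal", "blast"], ["mine", "rescu"], ["coal", "mine"]]
    print(sumbasic_label(docs, 2))
```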


TFIDF

This algorithm is similar to SB; however, rather than computing the initial word probabilities from word frequencies, it weights terms using TFIDF. In this case the document frequency is computed as the number of times a word appears in a micropost from the collection $\mathcal{C}$. Following the same procedure as SB, it returns the top $x$ weighted terms.
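One possible reading of this weighting is sketched below. It assumes that term frequency is taken over the whole collection, that document frequency counts the microposts containing a word, and that the inverse document frequency uses a natural logarithm; none of these details is fixed by the text, and the function name `tfidf_weights` is hypothetical. The resulting weights could replace the initial word probabilities in an SB-style selection loop.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Weight each term by collection-level TF times micropost-level IDF."""
    n_docs = len(docs)
    # Term frequency over the whole collection of microposts.
    tf = Counter(tok for doc in docs for tok in doc)
    # Document frequency: number of microposts containing the term.
    df = Counter(tok for doc in docs for tok in set(doc))
    total = sum(tf.values())
    return {w: (tf[w] / total) * math.log(n_docs / df[w]) for w in tf}


if __name__ == "__main__":
    docs = [["mine", "coal"], ["mine", "blast"], ["mine", "coal", "rescu"]]
    for term, weight in sorted(tfidf_weights(docs).items(),
                               key=lambda kv: kv[1], reverse=True):
        print(term, round(weight, 3))
```

Under these assumptions a term that occurs in every micropost receives a weight of zero, which is one way such a scheme differs from plain frequency counts.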

Maximal Marginal Relevance (MMR)

This is a relevance based ranking algorithm [4], which avoids redundancy in the documents used for generating a summary. It measures the degree of dissimilarity between the documents considered and previously selected ones already in the ranked list.
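A minimal sketch of MMR-style ranking is given below. It assumes Jaccard overlap between token sets as the similarity measure, a query made of representative topic terms, and a trade-off parameter `lam`; these choices, and the names `jaccard` and `mmr_rank`, are illustrative assumptions rather than details taken from the paper.

```python
def jaccard(a, b):
    """Jaccard overlap between two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr_rank(docs, query, k, lam=0.7):
    """Return indices of up to k documents ranked by MMR."""
    selected = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = jaccard(docs[i], query)
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max((jaccard(docs[i], docs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    docs = [["mine", "coal", "blast"],
            ["mine", "coal", "blast"],
            ["mine", "rescu", "miner"]]
    query = ["mine", "coal", "blast", "rescu"]
    print(mmr_rank(docs, query, k=2, lam=0.5))
```

In this toy run the duplicate of the first micropost is skipped in favour of the less similar third one, which is the redundancy-avoiding behaviour the description refers to.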

Text Rank (TR)

This is a graph-based summarisation method [23] where each word is a vertex. The relevance of a vertex (term) to the graph is computed based on global information recursively drawn from the whole graph. It uses the PageRank algorithm [3] to recursively update the weight of the vertices. The final score of a word is therefore dependent not only on the terms immediately connected to it but also on how these terms connect to others. To assign the weight of an edge between two terms, TextRank computes word co-occurrence in windows of $N$ words (in our case $N=10$). Once a final score is calculated for each vertex of the graph, TextRank sorts the terms in descending order of score and provides the top $T$ vertices in the ranking. Each of these algorithms produces a label of a desired length $x$ for a given topic $k$.
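The TextRank procedure described above can be sketched with a small weighted PageRank over a word co-occurrence graph. The damping factor, the fixed number of iterations, the undirected edges, and the function name `textrank_terms` are assumptions made for illustration; words that never co-occur with another word do not enter the graph in this sketch.

```python
from collections import defaultdict


def textrank_terms(docs, top_t, window=10, damping=0.85, iterations=50):
    """Rank words by weighted PageRank on a co-occurrence graph."""
    weights = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        for i, w in enumerate(doc):
            # Words within the same window of `window` tokens co-occur.
            for v in doc[i + 1:i + window]:
                if v != w:
                    weights[w][v] += 1.0
                    weights[v][w] += 1.0
    nodes = list(weights)
    if not nodes:
        return []
    score = {n: 1.0 for n in nodes}
    out_weight = {n: sum(weights[n].values()) for n in nodes}
    for _ in range(iterations):
        # Each vertex receives score from its neighbours, in proportion
        # to the weight of the connecting edge.
        score = {
            n: (1 - damping) + damping * sum(
                score[m] * weights[m][n] / out_weight[m]
                for m in weights[n])
            for n in nodes
        }
    return sorted(nodes, key=lambda n: score[n], reverse=True)[:top_t]


if __name__ == "__main__":
    docs = [["mine", "coal"], ["mine", "blast"], ["mine", "rescu"]]
    print(textrank_terms(docs, top_t=1))
```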

3 Experimental Setup

3.1 Dataset

Our Twitter Corpus (TW) was collected between November 2010 and January 2011. TW comprises over 1 million tweets. We used the OpenCalais document categorisation service to generate categorical sets. In particular, we considered four different categories which contain many real-world events, namely: War and Conflict (War), Disaster and Accident (DisAc), Education (Edu) and Law and Crime (LawCri). The final TW dataset, after removing retweets and short microposts (fewer than 5 words after removing stopwords), contains 7000 tweets in each category.

We preprocessed TW by first removing punctuation, numbers, non-alphabet characters, stop words, user mentions, and URL links. We then performed Porter stemming [30] in order to reduce the vocabulary size. Finally, to address the issue of data sparseness in the TW dataset, we removed words with a frequency lower than 5.
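The cleaning steps above can be approximated as in the sketch below. It is a simplified illustration under several assumptions: the stop word list is a tiny placeholder, the regular expressions are one possible way to strip URLs and user mentions, and Porter stemming is left out because the Python standard library has no stemmer (a stemming step, for instance with NLTK's `PorterStemmer`, would go between tokenisation and frequency filtering). The names `tokenize` and `preprocess` are hypothetical.

```python
import re
from collections import Counter

# Illustrative subset only; a real run would use a full stop word list.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "at", "to", "and", "is", "for"}


def tokenize(tweet):
    """Lowercase a tweet and strip URLs, mentions and non-letters."""
    text = re.sub(r"https?://\S+", " ", tweet.lower())  # URL links
    text = re.sub(r"@\w+", " ", text)                   # user mentions
    text = re.sub(r"[^a-z\s]", " ", text)               # punctuation, numbers
    return [tok for tok in text.split() if tok not in STOPWORDS]


def preprocess(tweets, min_freq=5):
    """Tokenise tweets and drop words rarer than `min_freq` in the corpus."""
    docs = [tokenize(t) for t in tweets]
    counts = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_freq] for doc in docs]


if __name__ == "__main__":
    print(tokenize("@bob Protest at the school! http://t.co/x 2010"))
```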

3.2 Generating the Gold Standard

Evaluation of automatic topic labelling has often relied on human assessment, which requires heavy manual effort [14, 12]. However, performing human evaluations of Social Media test sets comprising thousands of inputs becomes a difficult task. This is due to the corpus size, the diversity of event-related topics and the limited availability of domain experts. To alleviate this issue, we followed the distribution similarity approach, which has been widely applied in the automatic generation of gold standards (GSs) for summary evaluations [7, 16, 19, 20]. This approach compares two corpora: one for which no GS labels exist, and a reference corpus for which a GS exists. In our case these corpora correspond to TW and a Newswire dataset (NW). Since previous research has shown that headlines are good indicators of the main focus of a text, both in structure and content, and that they can act as a human-produced abstract [26], we used headlines as the GS labels of NW.

The News Corpus (NW) was collected during the same period of time as the TW corpus. NW consists of a collection of news articles crawled from traditional news media (BBC, CNN, and New York Times) comprising over 77,000 articles which include supplemental metadata (e.g. headline, author, publishing date). We also used the OpenCalais document categorisation service to automatically label news articles and considered the same four topical categories (War, DisAc, Edu and LawCri). The same preprocessing steps were performed on NW.

Therefore, following a similarity alignment approach, we performed the steps outlined in the algorithm below for generating the GS topic labels of a topic in TW.


Algorithm: GS for Topic Labels
Input: LDA topics for TW and LDA topics for NW, for category $c$.
Output: Gold standard topic label for each of the LDA topics for TW.
1: for each topic $i\in\{1,2,\ldots,100\}$ from TW do
2:     for each topic $j\in\{1,2,\ldots,100\}$ from NW do
3:         Compute the cosine similarity between the word distributions of topic $t_i$ and topic $t_j$.
4:     end for
5:     Select the topic $j$ which has the highest similarity to $i$ and whose similarity measure is greater than a threshold (in this case 0.7).
6: end for
7: for each of the extracted topic pairs ($t_i$-$t_j$) do
8:     Collect the relevant news articles $\mathcal{C}^{NW}_j$ of topic $t_j$ from the NW set.
9:     Extract the headlines of the news articles from $\mathcal{C}^{NW}_j$ and select the top $x$ most frequent words as the gold standard label for topic $t_i$ in the TW set.
10: end for

These steps can be outlined as follows: 1) We ran LDA on TW and NW separately for each category with the number of topics set to 100; 2) We then aligned the Twitter topics and Newswire topics by the similarity measurement of word distributions of these topics [8, 10, 33, 5]; 3) Finally, to generate the GS label for each aligned topic pair ($t_i$-$t_j$), we extracted the headlines of the news articles relevant to $t_j$ and selected the top $x$ most frequent words (after stop word removal and stemming). The generated label was used as the gold standard label for the corresponding Twitter topic $t_i$ in the topic pair.
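The alignment and label extraction steps can be sketched as follows. The sketch assumes that the Twitter and Newswire topic-word distributions are expressed over a shared vocabulary, so that position $v$ in both vectors refers to the same word; how the two vocabularies are reconciled is not stated in the text. The function names (`cosine`, `align_topics`, `headline_label`) and the toy vectors are hypothetical.

```python
import math
from collections import Counter


def cosine(p, q):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def align_topics(tw_topics, nw_topics, threshold=0.7):
    """Map each TW topic to its most similar NW topic above the threshold."""
    pairs = {}
    if not nw_topics:
        return pairs
    for i, tw in enumerate(tw_topics):
        sims = [cosine(tw, nw) for nw in nw_topics]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] > threshold:
            pairs[i] = j
    return pairs


def headline_label(headlines, x):
    """Top-x most frequent words over tokenised (preprocessed) headlines."""
    counts = Counter(tok for headline in headlines for tok in headline)
    return [word for word, _ in counts.most_common(x)]


if __name__ == "__main__":
    tw = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]
    nw = [[0.8, 0.2, 0.0], [0.3, 0.4, 0.3]]
    print(align_topics(tw, nw))
    print(headline_label([["mine", "blast"], ["mine", "rescu"],
                          ["mine", "blast"]], 2))
```

Twitter topics with no Newswire topic above the threshold simply receive no gold standard label in this sketch.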

4 Experimental Results

We compared the results of the summarisation techniques with the top terms (TT) of a topic as our baseline. This TT set corresponds to the top $x$ terms ranked based on the probability of the word given the topic ($p(w|k)$) from the topic model. We evaluated these summarisation approaches with the ROUGE-1 method [17], a widely used summarisation evaluation metric that correlates well with human evaluation [18]. This method measures the overlap of words between the generated summary and a reference, in our case the GS generated from the NW dataset.

The evaluation was performed at $x=\{1,\ldots,10\}$. Figure 1 presents the ROUGE-1 performance of the summarisation approaches as the length $x$ of the generated topic label increases. We can see in all four categories that the SB and TFIDF approaches provide better summarisation coverage as the length of the topic label increases. In particular, in both the Education and Law & Crime categories, both SB and TFIDF outperform TT and TR by a large margin. The obtained ROUGE-1 performance is within the same range of performance previously reported on Social Media summarisation [13, 28, 31].
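For illustration, unigram overlap can be computed as in the sketch below. ROUGE-1 is reported in recall, precision and F-score variants, and the text does not say which one was used, so the recall form here is an assumption; the function name `rouge1_recall` is hypothetical. The example terms are taken from the DisAc labels listed in Table 2.

```python
from collections import Counter


def rouge1_recall(candidate, reference):
    """Share of reference unigrams that also occur in the candidate."""
    if not reference:
        return 0.0
    cand, ref = Counter(candidate), Counter(reference)
    # Clipped counts: a reference word is matched at most as often
    # as it appears in the candidate.
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())


if __name__ == "__main__":
    gs = "mine zealand rescu miner coal fire blast kill man disast".split()
    sb = "mine coal explos river pike".split()
    print(rouge1_recall(sb, gs))
```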

Figure 1: Performance in ROUGE for Twitter-derived topic labels, where x is the number of terms in the generated label

Table 1 presents average results for ROUGE-1 in the four categories. In particular, the SB and TFIDF summarisation techniques consistently outperform the TT baseline across all four categories. SB gives the best results in every category except War.

Category  TT     SB     TFIDF  MMR    TR
War       0.162  0.184  0.192  0.154  0.141
DisAc     0.134  0.194  0.160  0.132  0.124
Edu       0.106  0.240  0.187  0.104  0.023
LawCri    0.035  0.159  0.149  0.034  0.115
Table 1: Average ROUGE-1 for topic labels at x={1..10}, generated from the TW dataset.

The generated labels with summarisation at x=5 are presented in Table 2, where GS represents the label generated from the Newswire headlines.

Different summarisation techniques reveal words which do not appear in the top terms but which are relevant to the information clustered by the topic. In this way, the labels generated for topics belonging to different categories generally extend the information provided by the top terms. For example, in Table 2, the DisAc headline is characteristic of New Zealand's Pike River coal mine blast accident, an event which occurred in November 2010.

Although the set of top 5 terms from the LDA topic extracted from TW (listed under TT) does capture relevant information related to the event, it does not provide information regarding the blast. In this sense the topic label generated by SB more accurately describes this event.

We can also notice that the GS labels generated from Newswire media presented in Table 2 appear, on their own, to be good labels for the TW topics. However, as we described in the introduction, we want to avoid relying on external sources for the derivation of topic labels.

This experiment shows that frequency-based summarisation techniques outperform graph-based and relevance-based summarisation techniques for generating topic labels that improve upon the top-terms baseline, without relying on external sources. This is an attractive property for automatically generating topic labels for tweets, where the event-related content might not have a counterpart in existing external sources.

Label | War | DisAc
GS    | protest brief polic afghanistan attack world leader bomb obama pakistan | mine zealand rescu miner coal fire blast kill man disast
TT    | polic offic milit recent mosqu | mine coal pike river zealand
SB    | terror war polic arrest offic | mine coal explos river pike
TFIDF | polic war arrest offic terror | mine coal pike safeti zealand
MMR   | recent milit arrest attack target | trap zealand coal mine explos
TR    | war world peac terror hope | mine zealand plan fire fda

Label | Edu | LawCri
GS    | school protest student fee choic motherlod tuition teacher anger polic | man charg murder arrest polic brief woman attack inquiri found
TT    | student univers protest occupi plan | man law child deal jail
SB    | student univers school protest educ | man arrest law kill judg
TFIDF | student univers protest plan colleg | man arrest law judg kill
MMR   | nation colleg protest student occupi | found kid wife student jail
TR    | student tuition fee group hit | man law child deal jail
Table 2: Labelling examples for topics generated from the TW Dataset. GS represents the gold-standard generated from the relevant Newswire dataset. All terms are Porter stemmed as described in subsection 3.1

5 Conclusions and Future Work

In this paper we proposed a novel alternative for topic labelling which does not rely on external data sources. To the best of our knowledge, no existing work has formally studied automatic topic labelling through summarisation. This experiment shows that existing summarisation techniques can be exploited to provide a better label for a topic, extending in this way a topic's information by providing a richer context than the top terms. These results show that there is room to further improve upon existing summarisation techniques to cater for generating candidate labels.


Acknowledgements

This work was supported by the EPSRC grant EP/J020427/1, the EU-FP7 project SENSE4US (grant no. 611242), and the Shenzhen International Cooperation Research Funding (grant number GJHZ20120613110641217).


References

[1] N. Aletras and M. Stevenson (2013-03) Evaluating topic coherence using distributional semantics. Potsdam, Germany, pp. 13–22.
[2] D. M. Blei, A. Y. Ng and M. I. Jordan (2003) Latent Dirichlet allocation. pp. 993–1022.
[3] S. Brin and L. Page (1998) The anatomy of a large-scale hypertextual web search engine. Vol. 30, pp. 107–117.
[4] J. Carbonell and J. Goldstein (1998) The use of MMR, diversity-based reranking for reordering documents and producing summaries. SIGIR ’98, New York, NY, USA, pp. 335–336.
[5] J. Delort and E. Alfonseca (2012) DualSum: a topic-model based approach for update summarization. EACL ’12, Stroudsburg, PA, USA, pp. 214–223.
[6] Q. Diao, J. Jiang, F. Zhu and E. Lim (2012-07) Finding bursty topics from microblogs. Jeju Island, Korea, pp. 536–544.
[7] R. L. Donaway, K. W. Drummey and L. A. Mather (2000) A comparison of rankings produced by summarization evaluation measures. NAACL-ANLP-AutoSum ’00, Stroudsburg, PA, USA, pp. 69–78.
[8] G. Ercan and I. Cicekli (2008) Lexical cohesion based topic modeling for summarization. CICLing’08, Berlin, Heidelberg, pp. 582–592.
[9] T. L. Griffiths and M. Steyvers (2004) Finding scientific topics. PNAS 101 (suppl. 1), pp. 5228–5235.
[10] A. Haghighi and L. Vanderwende (2009) Exploring content models for multi-document summarization. NAACL ’09, Stroudsburg, PA, USA, pp. 362–370.
[11] Y. He (2012-06) Incorporating sentiment prior knowledge for weakly supervised sentiment analysis. ACM Transactions on Asian Language Information Processing 11 (2), pp. 4:1–4:19.
[12] I. Hulpus, C. Hayes, M. Karnstedt and D. Greene (2013) Unsupervised graph-based topic labelling using DBpedia. WSDM ’13, New York, NY, USA, pp. 465–474.
[13] D. Inouye and J. K. Kalita (2011) Comparing Twitter summarization algorithms for multiple post summaries. See [29], pp. 298–306.
[14] J. H. Lau, K. Grieser, D. Newman and T. Baldwin (2011) Automatic labelling of topic models. HLT ’11, Stroudsburg, PA, USA, pp. 1536–1545.
[15] J. H. Lau, D. Newman, S. Karimi and T. Baldwin (2010) Best topic word selection for topic labelling. CoLing.
[16] C. Lin, G. Cao, J. Gao and J. Nie (2006) An information-theoretic approach to automatic evaluation of summaries. HLT-NAACL ’06, Stroudsburg, PA, USA, pp. 463–470.
[17] C. Lin (2004-07) ROUGE: a package for automatic evaluation of summaries. Barcelona, Spain, pp. 74–81.
[18] F. Liu and Y. Liu (2008) Correlation between ROUGE and human evaluation of extractive meeting summaries. HLT-Short ’08, Stroudsburg, PA, USA, pp. 201–204.
[19] A. Louis and A. Nenkova (2009) Automatically evaluating content selection in summarization without human models. EMNLP ’09, Stroudsburg, PA, USA, pp. 306–314.
[20] A. Louis and A. Nenkova (2013) Automatically assessing machine summary content without a gold standard. Computational Linguistics 39 (2), pp. 267–300.
[21] D. Magatti, S. Calegari, D. Ciucci and F. Stella (2009) Automatic labeling of topics. ISDA ’09, Washington, DC, USA, pp. 1227–1232.
[22] Q. Mei, X. Shen and C. Zhai (2007) Automatic labeling of multinomial topic models. KDD ’07, New York, NY, USA, pp. 490–499.
[23] R. Mihalcea and P. Tarau (2004) TextRank: bringing order into texts. EMNLP ’04, Barcelona, Spain, pp. 404–411.
[24] D. Mimno, H. M. Wallach, E. Talley, M. Leenders and A. McCallum (2011) Optimizing semantic coherence in topic models. EMNLP ’11, Stroudsburg, PA, USA, pp. 262–272.
[25] A. Nenkova and L. Vanderwende (2005) The impact of frequency on summarization. Microsoft Research, Redmond, Washington, Tech. Rep. MSR-TR-2005-101.
[26] A. Nenkova (2005) Automatic text summarization of newswire: lessons learned from the document understanding conference. AAAI’05, pp. 1436–1441.
[27] D. Newman, J. H. Lau, K. Grieser and T. Baldwin (2010) Automatic evaluation of topic coherence. HLT ’10, Stroudsburg, PA, USA, pp. 100–108.
[28] J. Nichols, J. Mahmud and C. Drews (2012) Summarizing sporting events using Twitter. IUI ’12, New York, NY, USA, pp. 189–198.
[29] (2011) PASSAT/SocialCom 2011, Privacy, Security, Risk and Trust (PASSAT), 2011 IEEE Third International Conference on and 2011 IEEE Third International Conference on Social Computing (SocialCom), Boston, MA, USA, 9-11 Oct., 2011. IEEE.
[30] M. Porter (1980) An algorithm for suffix stripping. Program 14 (3), pp. 130–137.
[31] Z. Ren, S. Liang, E. Meij and M. de Rijke (2013) Personalized time-aware tweets summarization. SIGIR ’13, New York, NY, USA, pp. 513–522.
[32] C. Shen, F. Liu, F. Weng and T. Li (2013) A participant-based approach for event summarization using Twitter streams. HLT ’13, Stroudsburg, PA, USA.
[33] D. Wang, S. Zhu, T. Li and Y. Gong (2009) Multi-document summarization using sentence-based topic models. ACLShort ’09, Stroudsburg, PA, USA, pp. 297–300.
[34] X. Zhao, B. Shu, J. Jiang, Y. Song, H. Yan and X. Li (2012-07) Identifying event-related bursts via social media activities. Jeju Island, Korea, pp. 1466–1477.