Improving Citation Polarity Classification with Product Reviews

Charles Jochim
IBM Research – Ireland
  This work was primarily conducted at the IMS – University of Stuttgart.
   Hinrich Schütze
Center for Information & Language Processing
University of Munich

Recent work classifying citations in scientific literature has shown that it is possible to improve classification results with extensive feature engineering. While this result confirms that citation classification is feasible, there are two drawbacks to this approach: (i) it requires a large annotated corpus for supervised classification, which in the case of scientific literature is quite expensive; and (ii) feature engineering that is too specific to one area of scientific literature may not be portable to other domains, even within scientific literature. In this paper we address these two drawbacks. First, we frame citation classification as a domain adaptation task and leverage the abundant labeled data available in other domains. Then, to avoid over-engineering specific citation features for a particular scientific domain, we explore a deep learning neural network approach that has been shown to generalize well across domains using unigram and bigram features. We achieve better citation classification results with this cross-domain approach than with in-domain classification.

1 Introduction

Citations have been categorized and studied for a half-century [15] to better understand when and how citations are used, and to record and measure how information is exchanged (e.g., networks of co-cited papers or authors [26]). Recently, the value of this information has been shown in practical applications such as information retrieval (IR) [25], summarization [24], and even identifying scientific breakthroughs [27]. We expect that by identifying and labeling the function of citations we can improve the effectiveness of these applications.

There has been no consensus on what aspects or functions of a citation should be annotated and how. Early citation classification focused more on citation motivation [16], while later classification considered more the citation function [9]. Recent studies using automatic classification have continued this tradition of introducing a new classification scheme with each new investigation into the use of citations [22, 29, 13, 1]. One distinction that has been more consistently annotated across recent citation classification studies is between positive and negative citations [3, 2, 1]. (Footnote 1: Dong and Schäfer (2011) also annotate polarity, which can be found in their dataset (described later), but this is not discussed in their paper.) The popularity of this distinction likely owes to the prominence of sentiment analysis in NLP [20]. We follow much of the recent work on citation classification and concentrate on citation polarity.

2 Domain Adaptation

By concentrating on citation polarity we are able to compare our classification to previous citation polarity work. This choice also allows us to access the wealth of existing data containing polarity annotation and then frame the task as a domain adaptation problem. Of course the risk in approaching the problem as domain adaptation is that the domains are so different that the representation of a positive instance of a movie or product review, for example, will not coincide with that of a positive scientific citation. On the other hand, because there is a limited amount of annotated citation data available, by leveraging large amounts of annotated polarity data we could potentially even improve citation classification.

We treat citation polarity classification as a sentiment analysis domain adaptation task and therefore must be careful not to define features that are too domain specific. Previous work in citation polarity classification focuses on finding new citation features to improve classification, borrowing a few from text classification in general (e.g., n-grams), and perhaps others from sentiment analysis problems (e.g., the polarity lexicon from Wilson et al. (2005)). We would like to do as little feature engineering as possible to ensure that the features we use are meaningful across domains. However, we do still want features that somehow capture the inherent positivity or negativity of our labeled instances, i.e., citations or Amazon product reviews. Currently a popular approach for accomplishing this is to use deep learning neural networks [4], which have been shown to perform well on a variety of NLP tasks using only bag-of-word features [10]. More specifically related to our work, deep learning neural networks have been successfully employed for sentiment analysis [28] and for sentiment domain adaptation [17]. In this paper we examine one of these approaches, marginalized stacked denoising autoencoders (mSDA) from Chen et al. (2012), which has been successful in classifying the polarity of Amazon product reviews across product domains. Since mSDA achieved state-of-the-art performance in Amazon product domain adaptation, we are hopeful it will also be effective when switching to a more distant domain like scientific citations.

3 Experimental Setup

3.1 Corpora

We are interested in domain adaptation for citation classification and therefore need a target dataset of citations and a non-citation source dataset. There are two corpora available that contain citation function annotation, the DFKI Citation Corpus [13] and the IMS Citation Corpus [19]. Both corpora have only about 2000 instances; unfortunately, there are no larger corpora available with citation annotation and this task would benefit from more annotated data. Due to the infrequent use of negative citations, a substantial annotation effort (annotating over 5 times more data) would be necessary to reach 1000 negative citation instances, which is the number of negative instances in a single domain in the multi-domain corpus described below.

The DFKI Citation Corpus has been used for classifying citation function [13], but the dataset also includes polarity annotation. The dataset has 1768 citation sentences with polarity annotation: 190 are labeled as positive, 57 as negative, and the vast majority, 1521, are left neutral. The second citation corpus, the IMS Citation Corpus, contains 2008 annotated citations: 1836 are labeled positive and 172 are labeled negative. Jochim and Schütze (2012) use annotation labels from Moravcsik and Murugesan (1975) where positive instances are labeled confirmative, negative instances are labeled negational, and there is no neutral class. Because each of the citation corpora is of modest size we combine them to form one citation dataset, which we will refer to as CITD. The two citation corpora comprising CITD both come from the ACL Anthology [5]: the IMS corpus uses the ACL proceedings from 2004 and the DFKI corpus uses parts of the proceedings from 2007 and 2008. Since mSDA also makes use of large amounts of unlabeled data, we extend our CITD corpus with citations from the proceedings of the remaining years of the ACL, 1979–2003, 2005–2006, and 2009.

There are a number of non-citation corpora available that contain polarity annotation. For these experiments we use the Multi-Domain Sentiment Dataset (henceforth MDSD), introduced by Blitzer et al. (2007). We use the version of the MDSD that includes positive and negative labels for product reviews taken from the following domains: books, dvd, electronics, and kitchen. For each domain there are 1000 positive reviews and 1000 negative reviews that comprise the “labeled” data, and then roughly 4000 more reviews in the “unlabeled” data. (Footnote 5: It is usually treated as unlabeled data even though it actually contains positive and negative labels, which have been used, e.g., in [8].) Reviews were preprocessed so that each review is represented as a list of unigrams and bigrams with their frequency within the review. Unigrams from a stop list of 55 stop words are removed, but stop words in bigrams remain.

Table 1 shows the distribution of polarity labels in the corpora we use for our experiments. We combine the DFKI and IMS corpora into the CITD corpus. We omit the citations labeled neutral from the DFKI corpus because the IMS corpus does not contain neutral annotation nor does the MDSD. It is the case in many sentiment analysis corpora that only positive and negative instances are included, e.g., [23].

Corpus   Instances   Pos.     Neg.     Neut.
DFKI     1768        190      57       1521
IMS      2008        1836     172      -
MDSD     27,677      13,882   13,795   -
Table 1: Polarity corpora.

The citation corpora presented above are both unbalanced and both have a highly skewed distribution. The MDSD on the other hand is evenly balanced and an effort was even made to keep the data treated as “unlabeled” rather balanced. For this reason, in line with previous work using MDSD, we balance the labeled portion of the CITD corpus. This is done by taking 179 unique negative sentences in the DFKI and IMS corpora and randomly selecting an equal number of positive sentences. The IMS corpus can have multiple labeled citations per sentence: there are 122 sentences containing the 172 negative citations from Table 1. The final CITD corpus comprises this balanced corpus of 358 labeled citation sentences plus another 22,093 unlabeled citation sentences.
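The balancing step can be illustrated with a minimal sketch. It assumes that instances are (sentence, label) pairs and that balancing means keeping every minority-class sentence and drawing an equally sized random sample, without replacement, from the majority class. The function name `balance_binary`, the seed, and the toy data (including the 1000 positive placeholders) are hypothetical; only the count of 179 negative sentences comes from the text.

```python
import random

def balance_binary(instances, seed=0):
    """Downsample the majority class so both classes are equally frequent.

    `instances` is a list of (sentence, label) pairs, label in {"pos", "neg"}.
    """
    rng = random.Random(seed)
    pos = [inst for inst in instances if inst[1] == "pos"]
    neg = [inst for inst in instances if inst[1] == "neg"]
    # Keep the whole minority class; sample the majority without replacement.
    minority, majority = (neg, pos) if len(neg) <= len(pos) else (pos, neg)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

# Toy stand-in: 179 negative sentences (as in the text) and an arbitrary
# number of positive placeholders.
toy = [(f"positive sentence {i}", "pos") for i in range(1000)]
toy += [(f"negative sentence {i}", "neg") for i in range(179)]
balanced = balance_binary(toy)
```

With 179 negative sentences this yields 358 labeled instances, matching the size of the balanced portion of CITD.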

3.2 Features

In our experiments, we restrict our features to unigrams and bigrams from the product review or citation context (i.e., the sentence containing the citation). This follows previous studies in domain adaptation [6, 17]. Chen et al. (2012) achieve state-of-the-art results on MDSD by testing the 5000 and 30,000 most frequent unigram and bigram features.
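A minimal sketch of this feature representation follows. It assumes whitespace tokenization, a bigram notation of the form `a_b`, and that "most frequent" refers to total corpus frequency; none of these details is specified above. The stop list shown is a hypothetical placeholder, not the 55-word list used for MDSD, and the function names are illustrative.

```python
from collections import Counter

# Hypothetical stop list for illustration only.
STOP_WORDS = {"the", "a", "of", "is", "in", "to", "and"}

def ngram_counts(text, stop_words=STOP_WORDS):
    """Count unigrams and bigrams in one review or citation sentence.

    Stop words are dropped as unigrams but kept inside bigrams.
    """
    tokens = text.lower().split()
    unigrams = [t for t in tokens if t not in stop_words]
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return Counter(unigrams + bigrams)

def top_k_vocabulary(all_counts, k):
    """Return the k most frequent features over a collection of documents."""
    total = Counter()
    for counts in all_counts:
        total.update(counts)
    return [feature for feature, _ in total.most_common(k)]

def vectorize(counts, vocab):
    """Map one document's counts onto a fixed vocabulary order."""
    return [counts.get(feature, 0) for feature in vocab]

counts = ngram_counts("the method is the best")
vocab = top_k_vocabulary([counts], k=5000)
vector = vectorize(counts, vocab)
```

In the experiments the vocabulary size k would be 5000 or 30,000 rather than being bounded by a single toy sentence.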

Previous work in citation classification has largely focused on identifying new features for improving classification accuracy. A significant amount of effort goes into engineering new features, in particular for identifying cue phrases, e.g., [30, 13]. However, there seems to be little consensus on which features help most for this task. For example, Abu-Jbara et al. (2013) and Jochim and Schütze (2012) find the list of polar words from Wilson et al. (2005) to be useful, and neither study lists dependency relations as significant features. Athar (2011) on the other hand reported significant improvement using dependency relation features and found that the same list of polar words slightly hurt classification accuracy. The classifiers and implementation of features varies between these studies, but the problem remains that there seems to be no clear set of features for citation polarity classification.

The lack of consensus on the most useful citation polarity features coupled with the recent success of deep learning neural networks [10] further motivate our choice to limit our features to the n-grams available in the product review or citation context and not rely on external resources or tools for additional features.

3.3 Classification with mSDA

For classification we use marginalized stacked denoising autoencoders (mSDA) from Chen et al. (2012) plus a linear SVM. (Footnote 6: We use their publicly available MATLAB implementation.) mSDA takes the concept of denoising – introducing noise to make the autoencoder more robust – from Vincent et al. (2008), but does the optimization in closed form, thereby avoiding iterating over the input vector to stochastically introduce noise. The result of this is faster run times and currently state-of-the-art performance on MDSD, which makes it a good choice for our domain adaptation task. The mSDA implementation comes with LIBSVM, which we replace with LIBLINEAR [14] for faster run times with no decrease in accuracy. LIBLINEAR, with default settings, also serves as our baseline.
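To make the closed-form idea concrete, the following NumPy sketch shows one marginalized denoising layer and its stacking, following the solution as described by Chen et al. (2012). It is an illustration under stated assumptions, not the MATLAB implementation used in the experiments: the function names `mda_layer` and `msda`, the regularization constant, and the choice to concatenate the input with all hidden layers are our assumptions here.

```python
import numpy as np

def mda_layer(X, p, reg=1e-5):
    """One marginalized denoising autoencoder layer.

    X: array of shape (d, n), features by instances.
    p: probability that a feature is corrupted (set to zero).
    Returns the mapping W of shape (d, d + 1) and the hidden
    representation tanh(W [X; 1]) of shape (d, n).
    """
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])   # append a constant bias row
    q = np.full(d + 1, 1.0 - p)            # survival probability per feature
    q[-1] = 1.0                            # the bias is never corrupted
    S = Xb @ Xb.T                          # scatter matrix of the clean input
    Q = S * np.outer(q, q)                 # expected corrupted scatter, off-diagonal
    np.fill_diagonal(Q, q * np.diag(S))    # diagonal entries scale by q_i only
    P = S * q[np.newaxis, :]               # expected clean-times-corrupted product
    # W = P[:d] (Q + reg I)^{-1}; Q is symmetric, so solve rather than invert.
    W = np.linalg.solve(Q + reg * np.eye(d + 1), P[:d].T).T
    return W, np.tanh(W @ Xb)

def msda(X, p, n_layers):
    """Stack layers and concatenate the input with all hidden representations."""
    reps = [X]
    for _ in range(n_layers):
        _, H = mda_layer(reps[-1], p)
        reps.append(H)
    return np.vstack(reps)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 50))
features = msda(X, p=0.5, n_layers=3)      # shape (20, 50)
```

The stacked representation would then be passed to the linear SVM. Because the mapping is obtained from a single linear solve per layer, no corrupted copies of the data are ever sampled.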

3.4 Outline of Experiments

Our initial experiments simply extend those of Chen et al. (2012) (and others who have used MDSD) by adding another domain, citations. We train on each of the domains from the MDSD – books, dvd, electronics, and kitchen – and test on the citation data. We split the labeled data 80/20 following Blitzer et al. (2007) (cf. Chen et al. (2012) train on all “labeled” data and test on the “unlabeled” data). These experiments should help answer two questions: does a larger amount of training data, even if out of domain, improve citation classification; and how well do the different product domains generalize to citations (i.e., which domains are most similar to citations)?

In contrast to previous work using MDSD, much of the work in domain adaptation also leverages a small amount of labeled target data. In our second set of experiments, we follow the domain adaptation approaches described in [12] and train on product review and citation data before testing on citations.

4 Results and Discussion

4.1 Citation mSDA

Figure 1: Cross domain macro-F1 results training on Multi-Domain Sentiment Dataset and testing on citation dataset (CITD). The horizontal line indicates macro-F1 for in-domain citation classification.

Domain        Baseline   All    Weight   Pred   LinInt   Augment   mSDA
books         54.5       54.8   52.0     51.9   53.4     53.4      57.1
dvd           53.2       50.9   56.0     53.4   51.9     47.5      51.6
electronics   53.4       49.0   50.5     53.4   54.8     51.9      59.2
kitchen       47.9       48.8   50.7     53.4   52.6     49.2      50.1
citations     51.9       -      -        -      -        -         54.9
Table 2: Macro-F1 results on CITD using different domain adaptation approaches.

Our initial results show that using mSDA for domain adaptation to citations actually outperforms in-domain classification. In Figure 1 we compare citation classification with mSDA to the SVM baseline. Each pair of vertical bars represents training on a domain from MDSD (e.g., books) and testing on CITD. The dark gray bar indicates the F1 scores for the SVM baseline using the 30,000 features and the lighter gray bar shows the mSDA results. The black horizontal line indicates the F1 score for in-domain citation classification, which sometimes represents the goal for domain adaptation. We can see that using a larger dataset, even if out of domain, does improve citation classification. For books, dvd, and electronics, even the SVM baseline improves on in-domain classification. mSDA does better than the baseline for all domains except dvd. Using a larger training set, along with mSDA, which makes use of the unlabeled data, leads to the best results for citation classification.
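For reference, macro-F1 can be computed as in the sketch below. It assumes the usual definition, the unweighted mean of the per-class F1 scores over the positive and negative classes; the exact evaluation script is not described here, and the function name and label strings are illustrative.

```python
def macro_f1(gold, pred, labels=("pos", "neg")):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)

gold = ["pos", "pos", "neg", "neg"]
pred = ["pos", "neg", "neg", "neg"]
score = macro_f1(gold, pred)
```

Because each class contributes equally, a classifier that ignores one of the two classes is penalized even when that class is rare.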

In domain adaptation we would expect the domains most similar to the target to lead to the highest results. Like Dai et al. (2007), we measure the Kullback-Leibler divergence between the source and target domains’ distributions. According to this measure, citations are most similar to the books domain. Therefore, it is not surprising that training on books performs well on citations, and intuitively, among the domains in the Amazon dataset, a book review is most similar to a scientific citation. This makes the good mSDA results for electronics a bit more surprising.
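A domain similarity measure of this kind can be sketched as follows. The sketch assumes unigram distributions over a shared vocabulary with add-alpha smoothing, and computes the divergence from the source distribution to the target distribution; the smoothing, the direction, and the feature set are assumptions, and the example sentences are invented for illustration.

```python
import math
from collections import Counter

def unigram_distribution(docs, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# Invented toy documents standing in for a source and the target domain.
source_docs = ["great book loved it", "boring book with a weak plot"]
target_docs = ["we extend the method of smith", "their method fails on long input"]

vocab = sorted({tok for doc in source_docs + target_docs for tok in doc.lower().split()})
p_source = unigram_distribution(source_docs, vocab)
p_target = unigram_distribution(target_docs, vocab)
divergence = kl_divergence(p_source, p_target)
```

Smoothing keeps every probability non-zero, so the logarithm is always defined. A lower value indicates a source domain whose word distribution is closer to that of the target.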

4.2 Easy Domain Adaptation

The results in Section 4.1 are for semi-supervised domain adaptation: the case where we have some large annotated corpus (Amazon product reviews) and a large unannotated corpus (citations). There have been a number of other successful attempts at fully supervised domain adaptation, where it is assumed that some small amount of data is annotated in the target domain [7, 12, 18]. To see how mSDA compares to supervised domain adaptation we take the various approaches presented by Daumé III (2007). The results of this comparison can be seen in Table 2. Briefly, “All” trains on source and target data; “Weight” is the same as “All” except that instances may be weighted differently based on their domain (weights are chosen on a development set); “Pred” trains on the source data, makes predictions on the target data, and then trains on the target data with the predictions; “LinInt” linearly interpolates predictions using the source-only and target-only models (the interpolation parameter is chosen on a development set); “Augment” uses a larger feature set with source-specific and target-specific copies of features; see [12] for further details.
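Two of these approaches are simple enough to sketch directly. The code below illustrates the feature augmentation behind “Augment” (general, source-specific, and target-specific copies of each feature) and the interpolation behind “LinInt”. It assumes dense NumPy feature matrices and real-valued classifier scores; the function names and the parameter `lam` are illustrative and not taken from [12].

```python
import numpy as np

def augment(X, domain):
    """Map features to <general, source-specific, target-specific> copies."""
    zeros = np.zeros_like(X)
    if domain == "source":
        return np.hstack([X, X, zeros])
    if domain == "target":
        return np.hstack([X, zeros, X])
    raise ValueError(f"unknown domain: {domain}")

def lin_int(source_scores, target_scores, lam):
    """Interpolate the scores of the source-only and target-only models."""
    source_scores = np.asarray(source_scores, dtype=float)
    target_scores = np.asarray(target_scores, dtype=float)
    return lam * source_scores + (1.0 - lam) * target_scores

X = np.array([[1, 2]])
source_row = augment(X, "source")   # [[1, 2, 1, 2, 0, 0]]
target_row = augment(X, "target")   # [[1, 2, 0, 0, 1, 2]]
mixed = lin_int([1.0, -1.0], [0.0, 1.0], lam=0.25)
```

With the augmented representation, a single linear classifier can learn which features behave the same way in both domains (the general copy) and which are domain-specific. The interpolation weight would be tuned on a development set, as noted above.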

We are only interested in citations as the target domain. Daumé’s source-only baseline corresponds to the “Baseline” column for domains: books, dvd, electronics, and kitchen; while his target-only baseline can be seen for citations in the last row of the “Baseline” column in Table 2.

The semi-supervised mSDA performs quite well with respect to the fully supervised approaches, obtaining the best results for books and electronics, which are also the highest scores overall. Weight and Pred have the highest F1 scores for dvd and kitchen respectively. Daumé III (2007) noted that the “Augment” algorithm performed best when the target-only results were better than the source-only results. When this was not the case in his experiments, i.e., for the treebank chunking task, both Weight and Pred were among the best approaches. In our experiments, training on source-only outperforms target-only, with the exception of the kitchen domain.

We have included the line for citations to see the results training only on the target data (F1=51.9) and to see the improvement when using all of the unlabeled data with mSDA (F1=54.9).

4.3 Discussion

These results are very promising. Although they are not quite as high as other published results for citation polarity [1], we have shown that it is possible to improve citation polarity classification by leveraging large amounts of annotated data from other domains and using a simple set of features. (Footnote 7: Their work included a CRF model to identify the citation context that gave them an increase of 9.2 percent F1 over a single sentence citation context. Our approach achieves similar macro-F1 on only the citation sentence, but using a different corpus.)

mSDA and fully supervised approaches can also be straightforwardly combined. We do not present those results here due to space constraints. The combination led to mixed results: adding mSDA to the supervised approaches tended to improve F1 over those approaches but results never exceeded the top mSDA numbers in Table 2.

5 Related Work

Teufel et al. (2006b) introduced automatic citation function classification, with classes that could be grouped as positive, negative, and neutral. They relied in part on a manually compiled list of cue phrases that cannot easily be transferred to other classification schemes or other scientific domains. Athar (2011) followed this and was the first to specifically target polarity classification on scientific citations. He found that dependency tuples contributed the most significant improvement in results. Abu-Jbara et al. (2013) also look at both citation function and citation polarity. A major contribution of their work is that they also train a CRF sequence tagger to find the citation context, which significantly improves results over using only the citing sentence. Their feature analysis indicates that lexicons for negation, speculation, and polarity were most important for improving polarity classification.

6 Conclusion

Robust citation classification has been hindered by the relative lack of annotated data. In this paper we successfully use a large, out-of-domain, annotated corpus to improve citation polarity classification. Our approach uses a deep learning neural network for domain adaptation with labeled out-of-domain data and unlabeled in-domain data. This semi-supervised domain adaptation approach outperforms in-domain citation polarity classification and other fully supervised domain adaptation approaches.

Acknowledgments. We thank the DFG for funding this work (SPP 1335 Scalable Visual Analytics).


References

  • [1] A. Abu-Jbara, J. Ezra and D. Radev (2013) Purpose and polarity of citation: towards NLP-based bibliometrics. pp. 596–606.
  • [2] A. Athar and S. Teufel (2012) Context-enhanced citation sentiment detection. pp. 597–601.
  • [3] A. Athar (2011) Sentiment analysis of citations using sentence structure-based features. pp. 81–87.
  • [4] Y. Bengio (2009) Learning deep architectures for AI. Foundations and Trends in Machine Learning 2 (1), pp. 1–127.
  • [5] S. Bird, R. Dale, B. Dorr, B. Gibson, M. Joseph, M. Kan, D. Lee, B. Powley, D. Radev and Y. F. Tan (2008) The ACL anthology reference corpus: a reference dataset for bibliographic research in computational linguistics. pp. 1755–1759.
  • [6] J. Blitzer, M. Dredze and F. Pereira (2007) Biographies, Bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. pp. 440–447.
  • [7] C. Chelba and A. Acero (2004) Adaptation of maximum entropy capitalizer: little data can help a lot. pp. 285–292.
  • [8] M. Chen, Z. E. Xu, K. Q. Weinberger and F. Sha (2012) Marginalized denoising autoencoders for domain adaptation. pp. 767–774.
  • [9] D. E. Chubin and S. D. Moitra (1975) Content analysis of references: adjunct or alternative to citation counting?. Social Studies of Science 5, pp. 423–441.
  • [10] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. P. Kuksa (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, pp. 2493–2537.
  • [11] W. Dai, G. Xue, Q. Yang and Y. Yu (2007) Transferring naive bayes classifiers for text classification. pp. 540–545.
  • [12] H. Daumé III (2007) Frustratingly easy domain adaptation. pp. 256–263.
  • [13] C. Dong and U. Schäfer (2011) Ensemble-style self-training on citation classification. pp. 623–631.
  • [14] R. Fan, K. Chang, C. Hsieh, X. Wang and C. Lin (2008) LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research 9, pp. 1871–1874.
  • [15] E. Garfield (1955) Citation indexes to science: A new dimension in documentation through association of ideas. Science 122, pp. 108–111.
  • [16] E. Garfield (1964) Can citation indexing be automated?. pp. 189–192.
  • [17] X. Glorot, A. Bordes and Y. Bengio (2011) Domain adaptation for large-scale sentiment classification: a deep learning approach. pp. 513–520.
  • [18] J. Jiang and C. Zhai (2007) Instance weighting for domain adaptation in NLP. pp. 264–271.
  • [19] C. Jochim and H. Schütze (2012) Towards a generic and flexible citation classifier based on a faceted classification scheme. pp. 1343–1358.
  • [20] B. Liu (2010) Sentiment analysis and subjectivity. In N. Indurkhya and F. J. Damerau (Eds.), Handbook of Natural Language Processing, Second Edition.
  • [21] M. J. Moravcsik and P. Murugesan (1975) Some results on the function and quality of citations. Social Studies of Science 5, pp. 86–92.
  • [22] H. Nanba and M. Okumura (1999) Towards multi-paper summarization using reference information. pp. 926–931.
  • [23] B. Pang, L. Lee and S. Vaithyanathan (2002) Thumbs up? sentiment classification using machine learning techniques. pp. 79–86.
  • [24] V. Qazvinian and D. R. Radev (2008) Scientific paper summarization using citation summary networks. pp. 689–696.
  • [25] A. Ritchie, S. Robertson and S. Teufel (2008) Comparing citation contexts for information retrieval. pp. 213–222.
  • [26] H. G. Small and B. C. Griffith (1974) The structure of scientific literatures I: identifying and graphing specialties. Science Studies 4 (1), pp. 17–40.
  • [27] H. Small and R. Klavans (2011) Identifying scientific breakthroughs by combining co-citation analysis and citation context.
  • [28] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng and C. D. Manning (2011) Semi-supervised recursive autoencoders for predicting sentiment distributions. pp. 151–161.
  • [29] S. Teufel, A. Siddharthan and D. Tidhar (2006) An annotation scheme for citation function. pp. 80–87.
  • [30] S. Teufel, A. Siddharthan and D. Tidhar (2006) Automatic classification of citation function. pp. 103–110.
  • [31] P. Vincent, H. Larochelle, Y. Bengio and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. pp. 1096–1103. ISBN 978-1-60558-205-4.
  • [32] T. Wilson, J. Wiebe and P. Hoffmann (2005) Recognizing contextual polarity in phrase-level sentiment analysis. pp. 347–354.