Infusion of Labeled Data into Distant Supervision for Relation Extraction

Maria Pershina + Bonan Min^  Wei Xu # Ralph Grishman +
+New York University, New York, NY
{pershina, grishman}
^Raytheon BBN Technologies, Cambridge, MA
#University of Pennsylvania, Philadelphia, PA
Most of the work was done while this author was at New York University.

Abstract

Distant supervision usually utilizes only unlabeled data and existing knowledge bases to learn relation extraction models. However, in some cases a small amount of human-labeled data is available. In this paper, we demonstrate how a state-of-the-art multi-instance multi-label model can be modified to make use of these reliable sentence-level labels in addition to the relation-level distant supervision from a database. Experiments show that our approach achieves a statistically significant increase of 13.5% in F-score and 37% in area under the precision-recall curve.

1 Introduction

Relation extraction is the task of tagging semantic relations between pairs of entities from free text. Recently, distant supervision has emerged as an important technique for relation extraction and has attracted increasing attention because of its effective use of readily available databases []. It automatically labels its own training data by heuristically aligning a knowledge base of facts with an unlabeled corpus. The intuition is that any sentence which mentions a pair of entities (e1 and e2) that participate in a relation, r, is likely to express the fact r(e1,e2) and thus forms a positive training example of r.

One of the most crucial problems in distant supervision is the inherent noise in the automatically generated training data []. Table 1 illustrates this problem with a toy example. Sophisticated multi-instance learning algorithms [] have been proposed to address the issue by loosening the distant supervision assumption. These approaches consider all mentions of the same pair (e1, e2) and assume only that at least one mention actually expresses the relation. On top of that, researchers have further improved performance by explicitly adding preprocessing steps [] or additional layers inside the model [] to reduce the effect of training noise.

True Positive    … to get information out of captured al-Qaida leader Abu Zubaydah.
False Positive   … Abu Zubaydah and former Taliban leader Jalaluddin Haqqani …
False Negative   … Abu Zubaydah is one of Osama bin Laden’s senior operational planners …

Table 1: Classic errors in the training data generated by a toy knowledge base containing only the single entry personTitle(Abu Zubaydah, leader).

However, the potential of these previously proposed approaches is limited by the inevitable gap between the relation-level knowledge and the instance-level extraction task. In this paper, we present the first effective approach, Guided DS (distant supervision), to incorporate labeled data into distant supervision for extracting relations from sentences. In contrast to simply taking the union of the hand-labeled data and the corpus labeled by distant supervision, as in the previous work by Zhang et al. [], we generalize the labeled data through feature selection and model this additional information directly in the latent variable approach. Unlike previous semi-supervised work that employs labeled and unlabeled data [], ours is a learning scheme that combines unlabeled text with two training sources whose quantity and quality are radically different [].

Guideline g = {g_i | i = 1, 2, 3}:
types of entities, dependency path, span word (optional)          Relation r(g)

person_person, nsubj → dobj, married                              personSpouse
person_organization, nsubj → prep_of, became                      personMemberOf
organization_organization, nsubj → prep_of, company               organizationSubsidiaries
person_person, poss → appos, sister                               personSiblings
person_person, poss → appos, father                               personParents
person_title, nn                                                  personTitle
organization_person, prep_of → appos                              organizationTopMembersEmployees
person_cause, nsubj → prep_of                                     personCauseOfDeath
person_number, appos                                              personAge
person_date, nsubjpass → prep_on → num                            personDateOfBirth

Table 2: Some examples from the final set G of extracted guidelines.

To demonstrate the effectiveness of our proposed approach, we extend MIML [], a state-of-the-art distant supervision model, and show a significant improvement of 13.5% in F-score on the relation extraction benchmark TAC-KBP [] dataset. While prior work employed tens of thousands of human-labeled examples [] and obtained only a 6.5% increase in F-score over a logistic regression baseline, our approach uses much less labeled data (about 1/8) yet achieves a much larger improvement over stronger baselines.

2 The Challenge

Simply taking the union of the hand-labeled data and the corpus labeled by distant supervision is not effective, since the hand-labeled data will be swamped by the much larger amount of distantly labeled data. An effective approach must recognize that the hand-labeled data is more reliable than the automatically labeled data and must therefore take precedence in cases of conflict. Conflicts cannot be limited to those cases where all the features of two examples are the same; this would almost never occur, because of the dozens of features used by a typical relation extractor []. Instead, we propose to perform feature selection to generalize the human-labeled data into training guidelines, and to integrate these guidelines into the latent variable model.

2.1 Guidelines

The sparse nature of the feature space dilutes the discriminative capability of useful features. Given the small amount of hand-labeled data, it is important to identify a small set of features that are general enough while still predicting quite accurately the type of relation that may hold between two entities.

We experimentally tested alternative feature sets by building supervised Maximum Entropy (MaxEnt) models using the hand-labeled data (Table 3), and selected an effective combination of three features from the full feature set used by Surdeanu et al. []:

  • the semantic types of the two arguments (e.g. person, organization, location, date, title, …);

  • the sequence of dependency relations along the path connecting the heads of the two arguments in the dependency tree;

  • a word in the sentence between the two arguments.

These three features are strong indicators of the type of relation between two entities. In some cases the semantic types of the arguments alone narrow the possibilities to one or two relation types. For example, entity types such as person and title often imply the relation personTitle. Some lexical items are clear indicators of particular relations, such as “brother” and “sister” for a sibling relationship.

Model          Precision   Recall   F-score
MaxEnt_all     18.6        6.3      9.4
MaxEnt_two     24.13       10.75    14.87
MaxEnt_three   40.27       12.40    18.97

Table 3: Performance of MaxEnt models trained on hand-labeled data using all features [] vs. a subset of two features (types of entities, dependency path) or three (adding a span word), evaluated on the test set.
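The comparison in Table 3 can be reproduced in spirit with any off-the-shelf MaxEnt implementation. Below is a minimal sketch using scikit-learn’s logistic regression in place of the original classifier; the feature-dictionary format and key names (arg_types, dep_path, span_word) are our own illustration, not the paper’s.

```python
# Minimal sketch of the feature-subset comparison behind Table 3,
# using scikit-learn's LogisticRegression as the MaxEnt model.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.pipeline import make_pipeline

def evaluate_subset(train, test, keys):
    """Train on hand-labeled `train` examples restricted to the feature
    `keys` (e.g. ("arg_types", "dep_path", "span_word")) and score on
    `test`.  Examples are (feature_dict, relation_label) pairs."""
    Xtr = [{k: f[k] for k in keys if k in f} for f, _ in train]
    Xte = [{k: f[k] for k in keys if k in f} for f, _ in test]
    ytr = [r for _, r in train]
    yte = [r for _, r in test]
    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(Xtr, ytr)
    # Micro-averaged precision/recall/F over all relation labels.
    return precision_recall_fscore_support(
        yte, model.predict(Xte), average="micro")[:3]
```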

We extract guidelines from the hand-labeled data. Each guideline g = {g_i | i = 1, 2, 3} consists of a pair of semantic types, a dependency path, and optionally a span word, and is associated with a particular relation r(g). We keep only those guidelines which make the correct prediction for all of, and for at least, k = 3 examples in the training corpus (the threshold 3 was obtained by running experiments on the development dataset). Table 2 shows some examples from the final set G of extracted guidelines.
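The extraction step can be summarized as follows; this is a minimal sketch under the assumption that each labeled example arrives as a set of feature strings (a format we invent here for illustration):

```python
from collections import defaultdict

K = 3  # minimum number of supporting examples (threshold tuned on the dev set)

def extract_guidelines(labeled_examples):
    """Generalize hand-labeled examples into guidelines.

    Each example is assumed to be (feats, relation), where `feats` is a
    set of strings such as {"types=person_title", "path=nn", "word=chief"}.
    A candidate guideline conjoins the entity-type pair and the dependency
    path, optionally extended with one span word.
    """
    candidates = defaultdict(list)
    for feats, relation in labeled_examples:
        # Assumes every example carries exactly one types= and one path= feature.
        types = next(f for f in feats if f.startswith("types="))
        path = next(f for f in feats if f.startswith("path="))
        candidates[frozenset({types, path})].append(relation)
        for word in (f for f in feats if f.startswith("word=")):
            candidates[frozenset({types, path, word})].append(relation)
    # Keep a guideline only if it matched at least K examples and predicted
    # the same (i.e. the correct) relation for every one of them.
    return {g: rels[0] for g, rels in candidates.items()
            if len(rels) >= K and len(set(rels)) == 1}
```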

3 Guided DS

Our goal is to jointly model human-labeled ground truth and structured data from a knowledge base in distant supervision. To do this, we extend the MIML model [] by adding a new layer as shown in Figure 1.

The input to the model consists of (1) distantly supervised data, represented as a list of n bags (a bag is the set of mentions sharing the same entity pair), with a vector y_i of binary gold-standard labels, either Positive (P) or Negative (N), for each relation r ∈ R; and (2) generalized human-labeled ground truth, represented as a set G of feature conjunctions g = {g_i | i = 1, 2, 3}, each associated with a unique relation r(g). Given a bag of sentences x_i which mention the i-th entity pair (e1, e2), our goal is to correctly predict which relation is mentioned in each sentence, or NR if none of the relations under consideration is mentioned. The vector z_i contains the latent mention-level classifications for the i-th entity pair. We introduce a set of latent variables h_i which model the human ground truth for each mention in the i-th bag and take precedence over the current model assignment z_i.
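To make the bookkeeping concrete, here is one possible container for a bag; the field layout is our illustration, not the paper’s actual data structure:

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

@dataclass
class Bag:
    pair: Tuple[str, str]      # the entity pair (e1, e2)
    mentions: List[Set[str]]   # x_i: one feature set per sentence mention
    y: Dict[str, bool]         # y_i: gold bag-level label P/N per relation r
    z: List[str]               # z_i: latent mention labels (model-assigned)
    h: List[str]               # h_i: labels after guideline-based relabeling
```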

Figure 1: Plate diagram of Guided DS.

Let i and j index the bag and the mention within the bag, respectively. We model mention-level extraction p(z_ij | x_ij; w_z), human relabeling h_ij(x_ij, z_ij), and multi-label aggregation p(y_i^r | h_i; w_y). We define:

  • y_i^r ∈ {P, N}: whether relation r holds for the i-th bag or not;

  • x_ij: the feature representation of the j-th relation mention in the i-th bag (we use the same set of features as Surdeanu et al. (2012));

  • z_ij ∈ R ∪ {NR}: a latent variable that denotes the relation of the j-th mention in the i-th bag;

  • h_ij ∈ R ∪ {NR}: a latent variable that denotes the refined relation of the mention x_ij.

We define the relabeled relations h_ij as follows:

$$
h_{ij} =
\begin{cases}
r(g), & \text{if } \exists!\, g \in \mathbf{G} : \{g_k\} \subseteq \{\mathbf{x}_{ij}\} \\
z_{ij}, & \text{otherwise}
\end{cases}
$$
Thus, the relation r(g) is assigned to h_ij iff there exists a unique guideline g ∈ G such that the feature vector x_ij contains all constituents of g, i.e. its entity types, its dependency path, and its span word, if it has one. We use the mention relation z_ij inferred by the model only if no such guideline exists or if more than one guideline matches.
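In code, the relabeling rule reduces to a uniqueness check; this sketch assumes the feature-string format and the extract_guidelines() output from the earlier sketch:

```python
def relabel(feats, z_ij, guidelines):
    """h_ij = r(g) iff exactly one guideline g (a frozenset of feature
    strings) is fully contained in the mention's feature set `feats`;
    otherwise fall back to the model-inferred label z_ij."""
    matches = [r for g, r in guidelines.items() if g <= feats]
    return matches[0] if len(matches) == 1 else z_ij
```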

We also define:

  • w_z: the weight vector for the multi-class mention-level relation classifier (all classifiers are implemented as L2-regularized logistic regression using the Stanford CoreNLP package);

  • w_y^r: the weight vector for the r-th binary top-level aggregation classifier (from mention labels to bag-level predictions); we write w_y for w_y^1, w_y^2, …, w_y^|R|.

Our approach is aimed at improving the mention-level classifier, while keeping the multi-instance multi-label framework to allow for joint modeling.

Iteration                  1      2     3     4     5     6     7     8
(a) Corrected relations    2052   718   648   596   505   545   557   535
(b) Retrieved relations    10219  860   676   670   621   599   594   592
Total relabelings          12271  1578  1324  1264  1226  1144  1153  1127

Table 4: Number of relabelings in each training iteration of Guided DS: (a) relabelings due to corrected relations, e.g. personChildren → personSiblings; (b) relabelings due to retrieved relations, e.g. notRelated (NR) → personTitle.

4 Training

We use a hard expectation-maximization algorithm to train the model. Our objective is to maximize the log-likelihood of the data:

$$
LL(\mathbf{w_y}, \mathbf{w_z}) = \sum_{i=1}^{n} \log p(\mathbf{y_i} \mid \mathbf{x_i}, \mathbf{w_y}, \mathbf{w_z}, \mathbf{G})
= \sum_{i=1}^{n} \log \sum_{\mathbf{h_i}} p(\mathbf{y_i} \mid \mathbf{h_i}, \mathbf{w_y}) \; p(\mathbf{h_i} \mid \mathbf{x_i}, \mathbf{w_z}, \mathbf{G})
$$
where the last equality holds because y_i and x_i are conditionally independent given h_i. Because of the non-convexity of LL(w_y, w_z), we approximate it and instead maximize the joint log-probability p(y_i, h_i | x_i, w_y, w_z, G) for each entity pair in the database:

$$
p(\mathbf{y_i}, \mathbf{h_i} \mid \mathbf{x_i}, \mathbf{w_y}, \mathbf{w_z}, \mathbf{G})
= \prod_{r \in R} p(y_i^r \mid \mathbf{h_i}, \mathbf{w_y^r}) \prod_{j=1}^{|\mathbf{x_i}|} p(h_{ij} \mid \mathbf{x_{ij}}, \mathbf{w_z}, \mathbf{G})
$$
Algorithm 1: Guided DS training

 1: Phase 1: build the set G of guidelines
 2: Phase 2: EM training
 3: for iteration = 1, …, T do
 4:   for i = 1, …, n do
 5:     for j = 1, …, |x_i| do
 6:       z*_ij = argmax_{z_ij} p(z_ij | x_i, y_i, w_z, w_y)
 7:       h*_ij = r(g) if ∃! g ∈ G : {g_k} ⊆ {x_ij}; z*_ij otherwise
 8:       update h_i with h*_ij
 9:     end for
10:   end for
11:   w*_z = argmax_w Σ_i Σ_j log p(h_ij | x_ij, w)
12:   for r ∈ R do
13:     w*_y^r = argmax_w Σ_{i : r ∈ P_i ∪ N_i} log p(y_i^r | h_i, w)
14:   end for
15: end for
16: return w_z, w_y

The pseudocode is presented as Algorithm 1.

The following approximation is used for the inference at step 6:

$$
p(z_{ij} \mid \mathbf{x_i}, \mathbf{y_i}, \mathbf{w_z}, \mathbf{w_y})
\propto p(\mathbf{y_i}, z_{ij} \mid \mathbf{x_i}, \mathbf{w_y}, \mathbf{w_z})
\approx p(z_{ij} \mid \mathbf{x_{ij}}, \mathbf{w_z}) \prod_{r \in R} p(y_i^r \mid \mathbf{h_i'}, \mathbf{w_y^r})
$$
where h_i' contains the previously inferred, and possibly further relabeled, mention labels for group i (lines 5-10), with the exception of component j, whose label is replaced by z_ij. In the M-step (lines 11-14) we optimize the model parameters w_z and w_y given the current assignment of mention-level labels h_i.
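Put together, one EM iteration looks roughly like the sketch below. It reuses the Bag container, relabel(), and pipeline-style classifiers from the earlier sketches, assumes both classifier layers were pre-initialized (e.g. fit once on the distant labels), and, for brevity, scores mentions with p(z_ij | x_ij, w_z) alone rather than the full joint approximation of step 6.

```python
from collections import Counter

def as_dict(feats):
    # Binary encoding of a feature set for the DictVectorizer pipelines.
    return {f: 1 for f in feats}

def bag_features(h_i):
    # Aggregation features for the top-level classifiers: counts of each
    # mention label in the bag (an assumed, MIML-style feature design).
    return dict(Counter(h_i))

def train_guided_ds(bags, guidelines, mention_clf, bag_clfs, T=8):
    for _ in range(T):
        # E-step (lines 4-10): infer each mention label, then let a
        # uniquely matching guideline override it.
        for b in bags:
            for j, feats in enumerate(b.mentions):
                b.z[j] = mention_clf.predict([as_dict(feats)])[0]
                b.h[j] = relabel(feats, b.z[j], guidelines)
        # M-step (lines 11-14): refit w_z on the relabeled mentions and
        # each w_y^r on the bag labels given the current h_i.
        mention_clf.fit([as_dict(f) for b in bags for f in b.mentions],
                        [h for b in bags for h in b.h])
        for r, clf in bag_clfs.items():
            clf.fit([bag_features(b.h) for b in bags],
                    [b.y[r] for b in bags])
    return mention_clf, bag_clfs
```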

Experiments show that Guided DS learns the new model efficiently, with the number of needed relabelings decreasing drastically in subsequent iterations (Table 4). At the inference step we first classify all mentions:

$$
z_{ij}^{*} = \arg\max_{z \in R \cup \{NR\}} \, p(z \mid \mathbf{x_{ij}}, \mathbf{w_z})
$$
Then the final relation labels for the i-th entity tuple are obtained via the top-level classifiers:

$$
y_i^{r*} = \arg\max_{y \in \{P, N\}} \, p(y \mid \mathbf{z_i^{*}}, \mathbf{w_y^r})
$$
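The two inference steps map directly onto the trained classifiers; a sketch reusing the helpers above:

```python
def infer(bags, mention_clf, bag_clfs):
    """Two-step inference: label every mention with the trained mention
    classifier (w_z), then decide each relation r for the entity tuple
    with the corresponding top-level classifier (w_y^r)."""
    predictions = []
    for b in bags:
        z = [mention_clf.predict([as_dict(f)])[0] for f in b.mentions]
        predictions.append({r: bool(clf.predict([bag_features(z)])[0])
                            for r, clf in bag_clfs.items()})
    return predictions
```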
5 Experiments

5.1 Data

We use the KBP [] dataset (available from the Linguistic Data Consortium), preprocessed by Surdeanu et al. [] using the Stanford parser []. This dataset is generated by mapping Wikipedia infoboxes into a large unlabeled corpus that consists of 1.5M documents from the KBP source corpus and a complete snapshot of Wikipedia.

The KBP 2010 and 2011 data includes 200 query named entities together with the relations they are involved in. We used 40 queries as the development set and the remaining 160 queries (3334 entity pairs that express a relation) as the test set. The official KBP evaluation is performed by pooling the system responses and manually reviewing each response, producing hand-checked assessment data. We used the KBP 2012 assessment data to generate guidelines, since queries from different years do not overlap. It contains about 2500 labeled sentences covering 41 relations, which is less than 0.09% of the size of the distantly labeled dataset of 2M sentences. The final set G consists of 99 guidelines (Section 2.1).

Figure 2: Performance of Guided DS on the KBP task compared to (a) baselines: MaxEnt, DS+upsampling, Semi-MIML []; (b) state-of-the-art models: Mintz++ [], MultiR [], MIML [].

5.2 Models

We implement Guided DS on top of the MIML [] code base. Training MIML on a simple fusion of the distantly-labeled and human-labeled datasets does not improve the maximum F-score, since the hand-labeled data is swamped by the much larger amount of distantly supervised data of much lower quality. Upsampling the labeled data did not improve performance either. We experimented with different upsampling ratios and report the best results, obtained with a 1:1 ratio, in Figure 2.

Our baselines: 1) MaxEnt is a supervised maximum entropy baseline trained on the human-labeled data; 2) DS+upsampling is an upsampling experiment, in which MIML was trained on a mix of distantly-labeled and human-labeled data; 3) Semi-MIML is a recent semi-supervised extension. We also compare Guided DS with three state-of-the-art models: 1) MultiR and 2) MIML are two distant supervision models that support multi-instance learning and overlapping relations; 3) Mintz++ is a single-instance learning algorithm for distant supervision. The difference between Guided DS and all other systems is significant with a p-value less than 0.05 according to a paired t-test, assuming a normal distribution.
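The paper names only the test, not the exact procedure; given matched per-query scores for two systems, such a check might look like this sketch using scipy:

```python
from scipy import stats

def significantly_better(scores_a, scores_b, alpha=0.05):
    """Paired t-test over matched (e.g. per-query) scores of two systems.
    Returns the t statistic, the two-sided p-value, and whether the
    difference clears the significance threshold."""
    t, p = stats.ttest_rel(scores_a, scores_b)
    return t, p, p < alpha
```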

5.3 Results

We scored our model against all 41 relations, thus replicating the actual KBP evaluation. Figure 2 shows that our model consistently outperforms all six algorithms at almost all recall levels. It improves the maximum F-score by more than 13.5% relative to MIML (from 28.35% to 32.19%) and increases the area under the precision-recall curve by more than 37% (from 11.74 to 16.1). Guided DS also improves the overall recall by more than 9% absolute (from 30.9% to 39.93%) at a comparable level of precision (24.35% for MIML vs. 23.64% for Guided DS), while increasing the running time of MIML by only 3%. Thus, our approach outperforms the state-of-the-art model for relation extraction using much less labeled data than was used by Zhang et al. [] to outperform a logistic regression baseline. The performance of Guided DS also compares favorably with the best hand-coded systems for similar tasks, such as the Sun et al. [] system for KBP 2011, which reports an F-score of 25.7%.

6 Conclusions and Future Work

We show that relation extractors trained with distant supervision can benefit significantly from a small number of human-labeled examples. We propose a strategy to generate and select guidelines so that they are generalized forms of the labeled instances, and we show how to incorporate these guidelines into an existing state-of-the-art model for relation extraction. Our approach significantly improves performance in practice and thus opens up many opportunities for further research in relation extraction where only a very limited amount of labeled training data is available.


Acknowledgments

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL, or the U.S. Government.