# Infusion of Labeled Data into Distant Supervision for Relation Extraction

Maria Pershina + Bonan Min^  Wei Xu # Ralph Grishman +
+New York University, New York, NY
{pershina, grishman}@cs.nyu.edu
^Raytheon BBN Technologies, Cambridge, MA
bmin@bbn.com
xwe@cis.upenn.edu
Most of the work was done when this author was at New York University
###### Abstract

Distant supervision usually utilizes only unlabeled data and existing knowledge bases to learn relation extraction models. However, in some cases a small amount of human labeled data is available. In this paper, we demonstrate how a state-of-the-art multi-instance multi-label model can be modified to make use of these reliable sentence-level labels in addition to the relation-level distant supervision from a database. Experiments show that our approach achieves a statistically significant increase of 13.5% in F-score and 37% in area under the precision recall curve.

## 1 Introduction

Relation extraction is the task of tagging semantic relations between pairs of entities from free text. Recently, distant supervision has emerged as an important technique for relation extraction and has attracted increasing attention because of its effective use of readily available databases []. It automatically labels its own training data by heuristically aligning a knowledge base of facts with an unlabeled corpus. The intuition is that any sentence which mentions a pair of entities ($e_{1}$ and $e_{2}$) that participate in a relation, $r$, is likely to express the fact $r(e_{1},\!e_{2})$ and thus forms a positive training example of $r$.

One of the most crucial problems in distant supervision is the presence of inherent errors in the automatically generated training data []. Table 1 illustrates this problem with a toy example. Sophisticated multi-instance learning algorithms [] have been proposed to address the issue by loosening the distant supervision assumption. These approaches consider all mentions of the same pair $(e_{1},e_{2})$ and assume that $at$-$least$-$one$ mention actually expresses the relation. On top of that, researchers have further improved performance by explicitly adding preprocessing steps [] or additional layers inside the model [] to reduce the effect of training noise.

| Label | Sentence |
|---|---|
| True Positive | … to get information out of captured al-Qaida leader Abu Zubaydah. |
| False Positive | … Abu Zubaydah and former Taliban leader Jalaluddin Haqqani … |
| False Negative | … Abu Zubaydah is one of Osama bin Laden’s senior operational planners |

Table 1: Classic errors in the training data generated by a toy knowledge base of only one entry $\mathsf{personTitle}$(Abu Zubaydah, leader).
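The labeling heuristic and the errors in Table 1 can be illustrated with a minimal sketch. It assumes that entity matching is plain substring search, which is a simplification made only for illustration (a real pipeline would rely on named-entity recognition); the names `KB`, `SENTENCES` and `distant_label` are hypothetical.

```python
# Toy knowledge base with a single fact, as in Table 1.
KB = {("Abu Zubaydah", "leader"): "personTitle"}

SENTENCES = [
    "... to get information out of captured al-Qaida leader Abu Zubaydah.",
    "... Abu Zubaydah and former Taliban leader Jalaluddin Haqqani ...",
    "... Abu Zubaydah is one of Osama bin Laden's senior operational planners",
]


def distant_label(sentence, kb):
    """Label a sentence with every KB fact whose two arguments both occur in it.

    Assumption: arguments are matched by substring search, not by NER.
    """
    return [
        (e1, e2, relation)
        for (e1, e2), relation in kb.items()
        if e1 in sentence and e2 in sentence
    ]


labels = [distant_label(s, KB) for s in SENTENCES]
for sentence, label in zip(SENTENCES, labels):
    print(label, "<-", sentence)
```

Under this heuristic the second sentence receives the $\mathsf{personTitle}$ label although "leader" refers to a different person (a false positive), and the third sentence receives no label although it does express a title (a false negative).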

However, the potential of these previously proposed approaches is limited by the inevitable gap between the relation-level knowledge and the instance-level extraction task. In this paper, we present the first effective approach, $\mathsf{{\bf G}uided\ DS}$ (distant supervision), to incorporate labeled data into distant supervision for extracting relations from sentences. In contrast to simply taking the union of the hand-labeled data and the corpus labeled by distant supervision, as in the previous work by Zhang et al. [], we generalize the labeled data through feature selection and model this additional information directly in a latent variable model. Unlike previous semi-supervised work that employs labeled and unlabeled data [, and others], this is a learning scheme that combines unlabeled text and two training sources whose quantity and quality are radically different [].

Guideline $g=\{g_{i}|i=1,2,3\}$: Relation $r(g)$ $\mathsf{personSpouse}$ $\mathsf{personMemberOf}$ $\mathsf{organizationSubsidiaries}$ $\mathsf{personSiblings}$ $\mathsf{personParents}$ $\mathsf{personTitle}$ $\mathsf{organizationTopMembersEmployees}$ $\mathsf{personCauseOfDeath}$ $\mathsf{personAge}$ $\mathsf{personDateOfBirth}$
Table 2: Some examples from the final set ${\bf G}$ of extracted guidelines.

To demonstrate the effectiveness of our proposed approach, we extend $\mathsf{MIML}$ [], a state-of-the-art distant supervision model, and show a significant improvement of 13.5% in F-score on the relation extraction benchmark TAC-KBP [] dataset. While prior work employed tens of thousands of human-labeled examples [] and obtained only a 6.5% increase in F-score over a logistic regression baseline, our approach uses much less labeled data (about 1/8) but achieves a much larger improvement in performance over stronger baselines.

## 2 The Challenge

Simply taking the union of the hand-labeled data and the corpus labeled by distant supervision is not effective, since the hand-labeled data will be swamped by the much larger amount of distantly labeled data. An effective approach must recognize that the hand-labeled data is more reliable than the automatically labeled data and so must take precedence in cases of conflict. Conflicts cannot be limited to those cases where all the features in two examples are the same; this would almost never occur, because of the dozens of features used by a typical relation extractor []. Instead, we propose to perform feature selection to generalize human-labeled data into training guidelines, and to integrate them into a latent variable model.

### 2.1 Guidelines

The sparse nature of the feature space dilutes the discriminative capability of useful features. Given the small amount of hand-labeled data, it is important to identify a small set of features that are general enough while still predicting quite accurately the type of relation that may hold between two entities.

We experimentally tested alternative feature sets by building supervised Maximum Entropy (MaxEnt) models using the hand-labeled data (Table 3), and selected an effective combination of three features from the full feature set used by Surdeanu et al. []:

• the semantic types of the two arguments (e.g. person, organization, location, date, title, …)

• the sequence of dependency relations along the path connecting the heads of the two arguments in the dependency tree.

• a word in the sentence between the two arguments

These three features are strong indicators of the type of relation between two entities. In some cases the semantic types of the arguments alone narrow the possibilities to one or two relation types. For example, entity types such as $\mathsf{person}$ and $\mathsf{title}$ often imply the relation $\mathsf{personTitle}$. Some lexical items are clear indicators of particular relations, such as “brother” and “sister” for a sibling relationship.

| Model | Precision | Recall | F-score |
|---|---|---|---|
| $\mathsf{MaxEnt}^{\rm all}$ | 18.6 | 6.3 | 9.4 |
| $\mathsf{MaxEnt}^{\rm two}$ | 24.13 | 10.75 | 14.87 |
| $\mathsf{MaxEnt}^{\rm three}$ | 40.27 | 12.40 | 18.97 |

Table 3: Performance of a MaxEnt model trained on hand-labeled data using all features [] vs. using a subset of two (types of entities, dependency path) or three (adding a span word) features, and evaluated on the test set.

We extract guidelines from hand-labeled data. Each guideline $g$={$g_{i}|i$=1,2,3} consists of a pair of semantic types, a dependency path, and optionally a span word, and is associated with a particular relation $r(g)$. We keep only those guidelines that match at least $k$=3 examples in the training corpus and make the correct prediction for all of them (the threshold of 3 was obtained by running experiments on the development dataset). Table 2 shows some examples in the final set ${\bf G}$ of extracted guidelines.
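The selection criterion above can be sketched as follows. This is only one possible reading of the procedure, not the authors' code: it assumes each hand-labeled example has already been reduced to the three selected features, and the dependency labels (`"appos"`, `"nn"`) and the function name `extract_guidelines` are hypothetical.

```python
from collections import defaultdict


def extract_guidelines(examples, k=3):
    """Keep feature conjunctions that always co-occur with one relation, at least k times.

    examples: iterable of (arg_types, dep_path, span_word, relation) tuples.
    Returns a dict mapping (arg_types, dep_path, span_word_or_None) -> relation.
    """
    votes = defaultdict(lambda: defaultdict(int))
    for arg_types, dep_path, span_word, relation in examples:
        # The span word is optional, so each example supports a candidate
        # with the word and one without it (a set avoids double counting).
        candidates = {(arg_types, dep_path, span_word), (arg_types, dep_path, None)}
        for key in candidates:
            votes[key][relation] += 1

    guidelines = {}
    for key, counts in votes.items():
        if len(counts) != 1:
            continue  # the conjunction is ambiguous: it mispredicts some example
        relation, support = next(iter(counts.items()))
        if support >= k:
            guidelines[key] = relation
    return guidelines


examples = (
    [(("person", "person"), "appos", "wife", "personSpouse")] * 3
    + [(("person", "person"), "appos", "brother", "personSiblings")] * 3
    + [(("person", "title"), "nn", None, "personTitle")] * 2
)
guidelines = extract_guidelines(examples, k=3)
```

In this toy input the conjunction without a span word is rejected for the person-person pair because it is consistent with two different relations, and the person-title conjunction is rejected because it is supported by only two examples.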

## 3 Guided DS

Our goal is to jointly model human-labeled ground truth and structured data from a knowledge base in distant supervision. To do this, we extend the MIML model [] by adding a new layer as shown in Figure 1.

The input to the model consists of (1) distantly supervised data, represented as a list of $n$ bags (a bag is a set of mentions sharing the same entity pair) with a vector ${\bf y_{i}}$ of binary gold-standard labels, either $Positive(P)$ or $Negative(N)$ for each relation $r\in R$; (2) generalized human-labeled ground truth, represented as a set ${\bf G}$ of feature conjunctions $g$={$g_{i}|i$=1,2,3}, each associated with a unique relation $r(g)$. Given a bag of sentences, ${\bf x_{i}}$, which mention the $i$th entity pair ($e_{1}$, $e_{2}$), our goal is to correctly predict which relation is mentioned in each sentence, or $N\!R$ if none of the relations under consideration is mentioned. The vector ${\bf z_{i}}$ contains the latent mention-level classifications for the $i$th entity pair. We introduce a set of latent variables ${\bf h_{i}}$ which model human ground truth for each mention in the $i$th bag and take precedence over the current model assignment ${\bf z_{i}}$.

Figure 1: Plate diagram of $\mathsf{{\bf G}uided\ DS}$

Let $i$ index the bags and $j$ index the mentions within a bag. We model mention-level extraction $p(z_{ij}|{\bf x}_{ij};{\bf w}_{z})$, human relabeling $h_{ij}({\bf x}_{ij},z_{ij})$ and multi-label aggregation $p(y_{i}^{r}|{\bf h}_{i};{\bf w_{y}})$. We define:

• $y_{i}^{r}\!\in\!\{P,N\}:r$ holds for the $i$th bag or not.

• ${\bf x}_{ij}$ is the feature representation of the $j$th relation mention in the $i$th bag. We use the same set of features as in Surdeanu et al. (2012).

• $z_{ij}\!\!\in\!\!R\cup N\!R$: a latent variable that denotes the relation of the $j$th mention in the $i$th bag

• $h_{ij}\!\in\!R\cup N\!R$: a latent variable that denotes the refined relation of the mention ${\bf x}_{ij}$

We define relabeled relations $h_{ij}$ as follows:

$$h_{ij}({\bf x}_{ij},z_{ij})=\begin{cases}r(g), & {\rm if\ }\exists!\ g\in{\bf G}{\rm\ s.t.\ }g=\{g_{k}\}\subseteq\{{\bf x}_{ij}\}\\ z_{ij}, & {\rm otherwise}\end{cases}$$

Thus, relation $r(g)$ is assigned to $h_{ij}$ iff there exists a unique guideline $g\in{\bf G}$ such that the feature vector ${\bf x}_{ij}$ contains all constituents of $g$, i.e. the entity types, the dependency path and, if $g$ has one, the span word. We use the mention relation $z_{ij}$ inferred by the model only when no such guideline exists or when more than one guideline matches. We also define:

• ${\bf w}_{z}$ is the weight vector for the multi-class relation mention-level classifier (all classifiers are implemented using L2-regularized logistic regression with the Stanford CoreNLP package).

• ${\bf w}_{y}^{r}$ is the weight vector for the $r$th binary top-level aggregation classifier (from mention labels to bag-level prediction). We use ${\bf w}_{y}$ to represent ${\bf w}_{y}^{1},{\bf w}_{y}^{2},\dots,{\bf w}_{y}^{|R|}$.

Our approach is aimed at improving the mention-level classifier, while keeping the multi-instance multi-label framework to allow for joint modeling.
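The relabeling rule for $h_{ij}$ defined above can be sketched in a few lines. The representation is an assumption made for illustration: a mention is a set of feature strings, a guideline is a pair of a feature set and a relation, and the feature strings and the name `relabel` are hypothetical.

```python
def relabel(features, z, guidelines):
    """Return r(g) if exactly one guideline g is contained in the mention's features.

    features: set of feature strings describing one mention.
    z: relation currently assigned to the mention by the model.
    guidelines: list of (frozenset_of_constituents, relation) pairs.
    """
    matches = [rel for constituents, rel in guidelines if constituents <= features]
    # Fall back to the model label when no guideline or several guidelines match.
    return matches[0] if len(matches) == 1 else z


G = [
    (frozenset({"types=person|title", "path=nn"}), "personTitle"),
    (frozenset({"types=person|person", "path=appos", "word=brother"}), "personSiblings"),
    (frozenset({"types=person|title", "word=leader"}), "personTitle"),
]

unique = relabel({"types=person|title", "path=nn", "word=chief"}, "NR", G)
none = relabel({"types=person|date", "path=prep_in"}, "personDateOfBirth", G)
several = relabel({"types=person|title", "path=nn", "word=leader"}, "NR", G)
```

The last call shows the strict reading of the uniqueness condition: two guidelines match, so the model label is kept even though both guidelines agree on the relation.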

| Iteration | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| (a) Corrected relations | 2052 | 718 | 648 | 596 | 505 | 545 | 557 | 535 |
| (b) Retrieved relations | 10219 | 860 | 676 | 670 | 621 | 599 | 594 | 592 |
| Total relabelings | 12271 | 1578 | 1324 | 1264 | 1226 | 1144 | 1153 | 1127 |

Table 4: Number of relabelings for each training iteration of $\mathsf{{\bf G}uided\ DS}$: (a) relabelings due to corrected relations, e.g. $\mathsf{personChildren}\rightarrow\mathsf{personSiblings}$; (b) relabelings due to retrieved relations, e.g. $\mathsf{notRelated}(N\!R)\rightarrow\mathsf{personTitle}$

## 4 Training

We use a hard expectation maximization algorithm to train the model. Our objective is to maximize the log-likelihood of the data:

$$\begin{aligned}LL({\bf w_{y}},{\bf w_{z}})&=\sum_{i=1}^{n}\log p({\bf y_{i}}|{\bf x_{i}},{\bf w_{y}},{\bf w_{z}},{\bf G})\\&=\sum_{i=1}^{n}\log\sum_{\bf h_{i}}p({\bf y_{i}},{\bf h_{i}}|{\bf x_{i}},{\bf w_{y}},{\bf w_{z}},{\bf G})\\&=\sum_{i=1}^{n}\log\sum_{\bf h_{i}}\prod_{j=1}^{|{\bf h_{i}}|}p(h_{ij}|{\bf x}_{ij},{\bf w_{z}},{\bf G})\prod_{r\in P_{i}\cup N_{i}}p(y_{i}^{r}|{\bf h}_{i},{\bf w}_{\bf y}^{r})\end{aligned}$$

where the last equality is due to conditional independence. Because of the non-convexity of $LL({\bf w_{y}},{\bf w_{z}})$, we approximate it and instead maximize the joint log-probability $\log p({\bf y_{i}},{\bf h_{i}}|{\bf x_{i}},{\bf w_{y}},{\bf w_{z}},{\bf G})$ for each entity pair in the database:

$$\log p({\bf y_{i}},{\bf h_{i}}|{\bf x_{i}},{\bf w_{y}},{\bf w_{z}},{\bf G})=\sum_{j=1}^{|{\bf h_{i}}|}\log p(h_{ij}|{\bf x}_{ij},{\bf w_{z}},{\bf G})+\sum_{r\in P_{i}\cup N_{i}}\log p(y_{i}^{r}|{\bf h}_{i},{\bf w}_{\bf y}^{r}).$$
Algorithm 1: Guided DS training

1. Phase 1: build set ${\bf G}$ of guidelines
2. Phase 2: EM training
3. for ${\rm iteration}=1,\dots,T$ do
4. for $i=1,\dots,n$ do
5. for $j=1,\dots,|{\bf x_{i}}|$ do
6. $z_{ij}^{*}={\rm argmax}_{z_{ij}}\ p(z_{ij}|{\bf x_{i}},{\bf y_{i}},{\bf w_{z}},{\bf w_{y}})$
7. $h_{ij}^{*}=\begin{cases}r(g), & {\rm if\ }\exists!\ g\in{\bf G}:\{g_{k}\}\subseteq\{{\bf x}_{ij}\}\\ z_{ij}^{*}, & {\rm otherwise}\end{cases}$
8. update ${\bf h_{i}}$ with $h_{ij}^{*}$
9. end for
10. end for
11. ${\bf w}_{\bf z}^{*}={\rm argmax}_{\bf w}\sum_{i=1}^{n}\sum_{j=1}^{|{\bf x_{i}}|}\log p(h_{ij}|{\bf x}_{ij},{\bf w})$
12. for $r\in R$ do
13. ${\bf w}_{\bf y}^{r*}={\rm argmax}_{\bf w}\sum_{1\leq i\leq n\ {\rm s.t.}\ r\in P_{i}\cup N_{i}}\log p(y_{i}^{r}|{\bf h_{i}},{\bf w})$
14. end for
15. end for
16. return ${\bf w_{z}},{\bf w_{y}}$

The pseudocode is presented as Algorithm 1.

The following approximation is used for inference at step 6:

$$\begin{aligned}p(z_{ij}|{\bf x_{i}},{\bf y_{i}},{\bf w_{z}},{\bf w_{y}})&\propto p({\bf y_{i}},z_{ij}|{\bf x_{i}},{\bf w_{y}},{\bf w_{z}})\\&\approx p(z_{ij}|{\bf x}_{ij},{\bf w_{z}})\,p({\bf y_{i}}|{\bf h_{i}^{\prime}},{\bf w_{y}})\\&=p(z_{ij}|{\bf x}_{ij},{\bf w_{z}})\prod_{r\in P_{i}\cup N_{i}}p(y_{i}^{r}|{\bf h_{i}^{\prime}},{\bf w}_{\bf y}^{r}),\end{aligned}$$

where ${\bf h_{i}^{\prime}}$ contains the previously inferred and possibly further relabeled mention labels for bag $i$ (steps 5-10), with the exception of component $j$, whose label is replaced by $z_{ij}$. In the M-step (lines 12-15) we optimize the model parameters ${\bf w_{z}},{\bf w_{y}}$, given the current assignment of mention-level labels ${\bf h_{i}}$.
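The E-step for a single bag can be sketched as below. This is an illustration under stated assumptions, not the MIML code base: the probability functions `p_z` and `p_y` are toy stand-ins (a lookup table and a fixed at-least-one heuristic) for the logistic regression classifiers, the `relabel` argument stands for the guideline rule, and all names are hypothetical.

```python
import math


def e_step_bag(x_bag, y_bag, h_bag, labels, p_z, p_y, relabel):
    """Hard-EM E-step for one bag: choose z*_ij, relabel it, update h_i in place."""
    for j, x in enumerate(x_bag):
        best, best_score = None, -math.inf
        for z in labels:
            # h' keeps the current labels of the other mentions; component j is set to z.
            h_prime = h_bag[:j] + [z] + h_bag[j + 1:]
            score = math.log(p_z(z, x)) + sum(
                math.log(p_y(r, y, h_prime)) for r, y in y_bag.items()
            )
            if score > best_score:
                best, best_score = z, score
        h_bag[j] = relabel(x, best)  # guidelines take precedence over the model label
    return h_bag


# Toy stand-ins (assumptions): mention-level probabilities from a table, and a
# bag-level classifier that favors P when the relation appears among the mentions.
P_Z = {"m1": {"personTitle": 0.4, "NR": 0.6}, "m2": {"personTitle": 0.2, "NR": 0.8}}


def p_z(z, x):
    return P_Z[x][z]


def p_y(r, y, h):
    p_pos = 0.9 if r in h else 0.1
    return p_pos if y == "P" else 1.0 - p_pos


h = e_step_bag(
    x_bag=["m1", "m2"],
    y_bag={"personTitle": "P"},
    h_bag=["NR", "NR"],
    labels=["personTitle", "NR"],
    p_z=p_z,
    p_y=p_y,
    relabel=lambda x, z: z,  # no guideline fires in this toy bag
)
```

In the toy bag the first mention is switched to $\mathsf{personTitle}$ even though the mention classifier alone prefers $N\!R$, because the bag-level label is positive and no other mention yet supports it.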

Experiments show that $\mathsf{{\bf G}uided\ DS}$ learns a new model efficiently: the number of relabelings needed drops drastically after the first iteration (Table 4). At the inference step we first classify all mentions:

 $z_{ij}^{*}={\rm argmax}_{z\in R\cup N\!R}\ \ p(z|x_{ij},{\bf w_{z}})$

Then the final relation labels for the $i$th entity tuple are obtained via the top-level classifiers:

 $y_{i}^{r*}={\rm argmax}_{y\in\{P,N\}}\ \ p(y|{\bf z_{i}^{*},w_{y}^{r}})$
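The two inference equations can be read as the following sketch. As before, `p_z` and `p_y` are hypothetical toy stand-ins for the trained mention-level and top-level classifiers, chosen only so that the example runs.

```python
def predict_bag(x_bag, labels, relations, p_z, p_y):
    """Classify every mention, then decide P/N per relation from the mention labels."""
    z_star = [max(labels, key=lambda z: p_z(z, x)) for x in x_bag]
    y_star = {
        r: max(("P", "N"), key=lambda y: p_y(r, y, z_star)) for r in relations
    }
    return z_star, y_star


# Toy stand-ins (assumptions, not trained models).
P_Z = {"m1": {"personTitle": 0.7, "NR": 0.3}, "m2": {"personTitle": 0.2, "NR": 0.8}}


def p_z(z, x):
    return P_Z[x][z]


def p_y(r, y, z_labels):
    p_pos = 0.9 if r in z_labels else 0.1
    return p_pos if y == "P" else 1.0 - p_pos


z_star, y_star = predict_bag(
    ["m1", "m2"], ["personTitle", "NR"], ["personTitle", "personAge"], p_z, p_y
)
```

Note that, following the equations above, the top-level classifiers consume the mention labels ${\bf z_{i}^{*}}$ directly at test time.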

## 5 Experiments

### 5.1 Data

We use the KBP [] dataset (available from the Linguistic Data Consortium (LDC) at http://projects.ldc.upenn.edu/kbp/data), which was preprocessed by Surdeanu et al. [] using the Stanford parser (http://nlp.stanford.edu/software/lex-parser.shtml) []. This dataset is generated by mapping Wikipedia infoboxes into a large unlabeled corpus that consists of 1.5M documents from the KBP source corpus and a complete snapshot of Wikipedia.

The KBP 2010 and 2011 data include 200 query named entities with the relations they are involved in. We used 40 queries as the development set and the remaining 160 queries (3334 entity pairs that express a relation) as the test set. The official KBP evaluation is performed by pooling the system responses and manually reviewing each response, producing hand-checked assessment data. We used the KBP 2012 assessment data to generate guidelines, since queries from different years do not overlap. It contains about 2500 labeled sentences of 41 relations, which is less than 0.09% of the size of the distantly labeled dataset of 2M sentences. The final set ${\bf G}$ consists of 99 guidelines (Section 2.1).

Figure 2: Performance of $\mathsf{{\bf G}uided\ DS}$ on KBP task compared to a) baselines: $\mathsf{MaxEnt}$, $\mathsf{DS}$+$\mathsf{upsampling}$, $\mathsf{Semi}$-$\mathsf{MIML}$ [] b) state-of-art models: $\mathsf{Mintz}$++ [], $\mathsf{MultiR}$ [], $\mathsf{MIML}$ []

### 5.2 Models

We implement $\mathsf{{\bf G}uided\ DS}$ on top of the $\mathsf{MIML}$ [] code base (available at http://nlp.stanford.edu/software/mimlre.shtml). Training $\mathsf{MIML}$ on a simple fusion of the distantly labeled and human-labeled datasets does not improve the maximum F-score, since the hand-labeled data is swamped by a much larger amount of distantly supervised data of much lower quality. Upsampling the labeled data did not improve the performance either. We experimented with different upsampling ratios and report the best results, obtained with ratio 1:1, in Figure 2.

Our baselines are: 1) $\mathsf{MaxEnt}$, a supervised maximum entropy baseline trained on the human-labeled data; 2) $\mathsf{DS}$+$\mathsf{upsampling}$, an upsampling experiment where $\mathsf{MIML}$ was trained on a mix of distantly labeled and human-labeled data; 3) $\mathsf{Semi}$-$\mathsf{MIML}$, a recent semi-supervised extension. We also compare $\mathsf{{\bf G}uided\ DS}$ with three state-of-the-art models: 1) $\mathsf{MultiR}$ and 2) $\mathsf{MIML}$ are two distant supervision models that support multi-instance learning and overlapping relations; 3) $\mathsf{Mintz}$++ is a single-instance learning algorithm for distant supervision. The difference between $\mathsf{{\bf G}uided\ DS}$ and all other systems is significant with $p$-value less than 0.05 according to a paired $t$-test assuming a normal distribution.

### 5.3 Results

We scored our model against all 41 relations and thus replicated the actual KBP evaluation. Figure 2 shows that our model consistently outperforms all six algorithms at almost all recall levels, improves the maximum $F$-score by more than 13.5$\%$ relative to $\mathsf{MIML}$ (from 28.35$\%$ to 32.19$\%$), and increases the area under the precision-recall curve by more than 37% (from 11.74 to 16.1). Also, $\mathsf{{\bf G}uided\ DS}$ improves the overall recall by more than 9% absolute (from 30.9% to 39.93%) at a comparable level of precision (24.35% for $\mathsf{MIML}$ vs 23.64% for $\mathsf{{\bf G}uided\ DS}$), while increasing the running time of $\mathsf{MIML}$ by only 3%. Thus, our approach outperforms a state-of-the-art model for relation extraction using much less labeled data than was used by Zhang et al. [] to outperform a logistic regression baseline. The performance of $\mathsf{{\bf G}uided\ DS}$ also compares favorably with the best-scoring hand-coded systems for a similar task, such as the Sun et al. [] system for KBP 2011, which reports an F-score of 25.7%.
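The relative and absolute gains quoted above follow directly from the reported scores; the short check below recomputes them (the helper name `relative_gain` is hypothetical).

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return 100.0 * (after - before) / before


f_gain = relative_gain(28.35, 32.19)    # maximum F-score, MIML vs Guided DS
auc_gain = relative_gain(11.74, 16.1)   # area under the precision-recall curve
recall_gain_abs = 39.93 - 30.9          # absolute recall difference, in points

print(f"F-score: +{f_gain:.2f}% relative")
print(f"AUC:     +{auc_gain:.2f}% relative")
print(f"Recall:  +{recall_gain_abs:.2f} points absolute")
```

The F-score and area-under-curve improvements are relative to the $\mathsf{MIML}$ values, whereas the recall improvement is an absolute difference in percentage points.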

## 6 Conclusions and Future Work

We show that relation extractors trained with distant supervision can benefit significantly from a small number of human-labeled examples. We propose a strategy to generate and select guidelines so that they are more generalized forms of labeled instances. We show how to incorporate these guidelines into an existing state-of-the-art model for relation extraction. Our approach significantly improves performance in practice and thus opens up many opportunities for further research in relation extraction where only a very limited amount of labeled training data is available.

## Acknowledgments

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL, or the U.S. Government.