# An Extension of BLANC to System Mentions

Xiaoqiang Luo
111 8th Ave, New York, NY 10011
Harvard Medical School
300 Longwood Ave., Boston, MA 02115
Marta Recasens
1600 Amphitheatre Pkwy,
Mountain View, CA 94043
Eduard Hovy
Carnegie Mellon University
5000 Forbes Ave.
Pittsburgh, PA 15213
hovy@cmu.edu
###### Abstract

BLANC is a link-based coreference evaluation metric for measuring the quality of coreference systems on gold mentions. This paper extends the original BLANC (“BLANC-gold” henceforth) to system mentions, removing the gold mention assumption. The proposed BLANC falls back seamlessly to the original one if system mentions are identical to gold mentions, and it is shown to strongly correlate with existing metrics on the 2011 and 2012 CoNLL data.

## 1 Introduction

Coreference resolution aims at identifying natural language expressions (or mentions) that refer to the same entity. It entails partitioning (often imperfect) mentions into equivalence classes. A critically important problem is how to measure the quality of a coreference resolution system. Many evaluation metrics have been proposed in the past two decades, including the MUC measure [9], B-cubed [1], CEAF [3] and, more recently, BLANC-gold [7]. B-cubed and CEAF treat entities as sets of mentions and measure the agreement between key (or gold standard) entities and response (or system-generated) entities, while MUC and BLANC-gold are link-based.

In particular, MUC measures the degree of agreement between key coreference links (i.e., links among mentions within entities) and response coreference links, while non-coreference links (i.e., links formed by mentions from different entities) are not explicitly taken into account. This leads to a phenomenon where coreference systems outputting large entities are scored more favorably than those outputting small entities [3]. BLANC [7], on the other hand, considers both coreference links and non-coreference links. It calculates recall, precision and F-measure separately on coreference and non-coreference links in the usual way, and defines the overall recall, precision and F-measure as the mean of the respective measures for coreference and non-coreference links.

The BLANC-gold metric was developed with the assumption that response mentions and key mentions are identical. In reality, however, mentions need to be detected from natural language text and the result is, more often than not, imperfect: some key mentions may be missing in the response, and some response mentions may be spurious—so-called “twinless” mentions by Stoyanov et al. (2009). Therefore, the identical-mention-set assumption limits BLANC-gold’s applicability when gold mentions are not available, or when one wants to have a single score measuring both the quality of mention detection and coreference resolution. The goal of this paper is to extend the BLANC-gold metric to imperfect response mentions.

We first briefly review the original definition of BLANC, and rewrite its definition using set notation. We then argue that the gold-mention assumption in Recasens and Hovy (2011) can be lifted without changing the original definition. In fact, the proposed BLANC metric subsumes the original one in that its value is identical to the original one when response mentions are identical to key mentions.

The rest of the paper is organized as follows. We introduce the notation used in this paper in Section 2. We then present the original BLANC-gold in Section 3 using the set notation defined in Section 2. This paves the way to generalizing it to imperfect system mentions, which is presented in Section 4. The proposed BLANC is applied to the CoNLL 2011 and 2012 shared task participants, and the scores and their correlations with existing metrics are shown in Section 5.

## 2 Notations

To facilitate the presentation, we define the notations used in the paper.

We use key to refer to gold standard mentions or entities, and response to refer to system mentions or entities. The collection of key entities is denoted by $K=\{k_{i}\}_{i=1}^{|K|}$, where $k_{i}$ is the $i^{th}$ key entity; accordingly, $R=\{r_{j}\}_{j=1}^{|R|}$ is the set of response entities, and $r_{j}$ is the $j^{th}$ response entity. We assume that mentions in $\{k_{i}\}$ and $\{r_{j}\}$ are unique; in other words, there is no duplicate mention.

Let $C_{k}(i)$ and $C_{r}(j)$ be the set of coreference links formed by mentions in $k_{i}$ and $r_{j}$:

$$\begin{aligned}
C_{k}(i) &= \{(m_{1},m_{2}): m_{1}\in k_{i},\ m_{2}\in k_{i},\ m_{1}\neq m_{2}\}\\
C_{r}(j) &= \{(m_{1},m_{2}): m_{1}\in r_{j},\ m_{2}\in r_{j},\ m_{1}\neq m_{2}\}
\end{aligned}$$

As can be seen, a link is an undirected edge between two mentions, and it can be equivalently represented by a pair of mentions. Note that when an entity consists of a single mention, its coreference link set is empty.

Let $N_{k}(i,j)$ $(i\neq j)$ be key non-coreference links formed between mentions in $k_{i}$ and those in $k_{j}$, and let $N_{r}(i,j)$ $(i\neq j)$ be response non-coreference links formed between mentions in $r_{i}$ and those in $r_{j}$, respectively:

$$\begin{aligned}
N_{k}(i,j) &= \{(m_{1},m_{2}): m_{1}\in k_{i},\ m_{2}\in k_{j}\}\\
N_{r}(i,j) &= \{(m_{1},m_{2}): m_{1}\in r_{i},\ m_{2}\in r_{j}\}
\end{aligned}$$

Note that the non-coreference link set is empty when all mentions are in the same entity.

We use the same letter and subscript without the index in parentheses to denote the union of sets, e.g.,

$$\begin{aligned}
C_{k} &= \cup_{i}C_{k}(i), & N_{k} &= \cup_{i\neq j}N_{k}(i,j)\\
C_{r} &= \cup_{j}C_{r}(j), & N_{r} &= \cup_{i\neq j}N_{r}(i,j)
\end{aligned}$$

We use $T_{k}=C_{k}\cup N_{k}$ and $T_{r}=C_{r}\cup N_{r}$ to denote the total set of key links and the total set of response links, respectively. Clearly, $C_{k}$ and $N_{k}$ form a partition of $T_{k}$, since $C_{k}\cap N_{k}=\emptyset$ and $T_{k}=C_{k}\cup N_{k}$. Likewise, $C_{r}$ and $N_{r}$ form a partition of $T_{r}$.

We say that a key link $l_{1}\in T_{k}$ equals a response link $l_{2}\in T_{r}$ if and only if the pair of mentions from which the links are formed are identical. We write $l_{1}=l_{2}$ if two links are equal. It is easy to see that the gold mention assumption—same set of response mentions as the set of key mentions—can be equivalently stated as $T_{k}=T_{r}$ (this does not necessarily mean that $C_{k}=C_{r}$ or $N_{k}=N_{r}$).

We also use $|\cdot|$ to denote the size of a set.
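The notation above can be made concrete with a minimal Python sketch. It assumes that mentions are hashable labels and that an entity is a set of such labels; the helper names `link`, `coref_links` and `noncoref_links` are illustrative and do not come from the CoNLL scorer.

```python
from itertools import combinations

def link(m1, m2):
    """An undirected link, stored as a frozenset so that (a, b) equals (b, a)."""
    return frozenset((m1, m2))

def coref_links(entities):
    """C: every pair of distinct mentions inside the same entity."""
    return {link(a, b) for e in entities for a, b in combinations(e, 2)}

def noncoref_links(entities):
    """N: every pair of mentions drawn from two different entities."""
    return {
        link(a, b)
        for e1, e2 in combinations(entities, 2)
        for a in e1
        for b in e2
    }

# Key entities {a, b, c} and {d}
key = [{"a", "b", "c"}, {"d"}]
C_k = coref_links(key)
N_k = noncoref_links(key)
T_k = C_k | N_k  # total set of key links

print(len(C_k), len(N_k), len(T_k))  # 3 3 6
```

Representing a link as a `frozenset` makes link equality coincide with equality of the underlying mention pair, which is how equality between key and response links is defined above.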

## 3 Original BLANC

BLANC-gold is adapted from Rand Index [6], a metric for clustering objects. Rand Index is defined as the ratio between the number of correct within-cluster links plus the number of correct cross-cluster links, and the total number of links.

When $T_{k}=T_{r}$, Rand Index can be applied directly since coreference resolution reduces to a clustering problem where mentions are partitioned into clusters (entities):

$$\text{Rand Index}=\frac{|C_{k}\cap C_{r}|+|N_{k}\cap N_{r}|}{|T_{k}|} \qquad (1)$$

Here the denominator $|T_{k}|$ is the total number of links, which equals $\frac{1}{2}n(n-1)$ for a document with $n$ mentions.

In practice, though, the simple-minded adoption of Rand Index is not satisfactory since the number of non-coreference links often overwhelms that of coreference links [7], that is, $|N_{k}|\gg|C_{k}|$ and $|N_{r}|\gg|C_{r}|$. Rand Index, if used without modification, would not be sensitive to changes in coreference links.
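A small numerical illustration of this insensitivity is sketched below. The ten-mention document and the mention names `m0` to `m9` are made up for the illustration; the Rand Index is computed as correct links divided by the total number of links, following the verbal definition above.

```python
from itertools import combinations

def links(entities):
    """Return (coreference links, non-coreference links) for a list of entities."""
    coref = {frozenset(p) for e in entities for p in combinations(sorted(e), 2)}
    mentions = sorted(m for e in entities for m in e)
    total = {frozenset(p) for p in combinations(mentions, 2)}
    return coref, total - coref

# Hypothetical document: one two-mention entity plus eight singletons.
key = [{"m0", "m1"}] + [{f"m{i}"} for i in range(2, 10)]
# Hypothetical response: every mention left as a singleton.
response = [{f"m{i}"} for i in range(10)]

C_k, N_k = links(key)
C_r, N_r = links(response)

rand_index = (len(C_k & C_r) + len(N_k & N_r)) / len(C_k | N_k)
print(len(C_k & C_r))        # 0
print(round(rand_index, 3))  # 0.978
```

The response recovers none of the key coreference links, yet the Rand Index stays close to 1 because 44 of the 45 links are non-coreference links.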

BLANC-gold solves this problem by averaging the F-measure computed over coreference links and the F-measure over non-coreference links. Using the notations in Section 2, the recall, precision, and F-measure on coreference links are:

$$\begin{aligned}
R_{c}^{(g)} &= \frac{|C_{k}\cap C_{r}|}{|C_{k}\cap C_{r}|+|C_{k}\cap N_{r}|} &\qquad (2)\\
P_{c}^{(g)} &= \frac{|C_{k}\cap C_{r}|}{|C_{r}\cap C_{k}|+|C_{r}\cap N_{k}|} &\qquad (3)\\
F_{c}^{(g)} &= \frac{2R_{c}^{(g)}P_{c}^{(g)}}{R_{c}^{(g)}+P_{c}^{(g)}} &\qquad (4)
\end{aligned}$$

Similarly, the recall, precision, and F-measure on non-coreference links are computed as:

$$\begin{aligned}
R_{n}^{(g)} &= \frac{|N_{k}\cap N_{r}|}{|N_{k}\cap C_{r}|+|N_{k}\cap N_{r}|} &\qquad (5)\\
P_{n}^{(g)} &= \frac{|N_{k}\cap N_{r}|}{|N_{r}\cap C_{k}|+|N_{r}\cap N_{k}|} &\qquad (6)\\
F_{n}^{(g)} &= \frac{2R_{n}^{(g)}P_{n}^{(g)}}{R_{n}^{(g)}+P_{n}^{(g)}} &\qquad (7)
\end{aligned}$$

Finally, the BLANC-gold metric is the arithmetic average of $F_{c}^{(g)}$ and $F_{n}^{(g)}$:

$$\text{BLANC}^{(g)}=\frac{F_{c}^{(g)}+F_{n}^{(g)}}{2} \qquad (8)$$

The superscript $(g)$ in these equations highlights the fact that they are meant for coreference systems with gold mentions.

Eqn. (8) indicates that BLANC-gold assigns equal weight to $F_{c}^{(g)}$, the F-measure from coreference links, and $F_{n}^{(g)}$, the F-measure from non-coreference links. This avoids the problem that $|N_{k}|\gg|C_{k}|$ and $|N_{r}|\gg|C_{r}|$, should the original Rand Index be used.

In Eqn. (2) - (3) and Eqn. (5) - (6), denominators are written as a sum of disjoint subsets so they can be related to the contingency table in [7]. Under the assumption that $T_{k}=T_{r}$, it is clear that $C_{k}=(C_{k}\cap C_{r})\cup(C_{k}\cap N_{r})$, $C_{r}=(C_{k}\cap C_{r})\cup(N_{k}\cap C_{r})$, and so on.
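The decomposition can be checked on a small case. The sketch below assumes single-character mention names and uses a hypothetical response over the same four mentions as the key; `link_sets` is an illustrative helper, not part of any released scorer.

```python
from itertools import combinations

def link_sets(entities):
    """Return (coreference links, non-coreference links) for a list of entities."""
    coref = {frozenset(p) for e in entities for p in combinations(e, 2)}
    mentions = sorted(m for e in entities for m in e)
    every = {frozenset(p) for p in combinations(mentions, 2)}
    return coref, every - coref

key = ["abc", "d"]        # entities {a, b, c} and {d}
response = ["ab", "cd"]   # hypothetical response over the same mentions

C_k, N_k = link_sets(key)
C_r, N_r = link_sets(response)

# Gold-mention assumption: both sides have the same total link set.
assert C_k | N_k == C_r | N_r
# Each coreference set splits into two disjoint parts.
assert C_k == (C_k & C_r) | (C_k & N_r)
assert C_r == (C_k & C_r) | (N_k & C_r)

# Eqn. (2) and Eqn. (3) evaluated on this example.
recall_c = len(C_k & C_r) / (len(C_k & C_r) + len(C_k & N_r))
precision_c = len(C_k & C_r) / (len(C_r & C_k) + len(C_r & N_k))
print(recall_c, precision_c)  # 0.3333333333333333 0.5
```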

## 4 BLANC for Imperfect Response Mentions

Under the assumption that the key and response mention sets are identical (which implies that $T_{k}=T_{r}$), Equations (2) to (7) make sense. For example, $R_{c}^{(g)}$ is the ratio of the number of correct coreference links over the number of key coreference links; $P_{c}^{(g)}$ is the ratio of the number of correct coreference links over the number of response coreference links, and so on.

However, when response mentions are not identical to key mentions, a key coreference link may not appear in either $C_{r}$ or $N_{r}$, so Equations (2) to (7) cannot be applied directly to systems with imperfect mentions. For instance, if the key entities are {a,b,c} {d,e}; and the response entities are {b,c} {e,f,g}, then the key coreference link (a,b) is not seen on the response side; similarly, it is possible that a response link does not appear on the key side either: (c,f) and (f,g) are not in the key in the above example.
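The example in this paragraph can be reproduced with a short sketch that lists the links present on one side only, using set differences against the total link set of the other side. Single-character mention names are assumed, and `link_sets` and `show` are illustrative helpers.

```python
from itertools import combinations

def link_sets(entities):
    """Return (coreference links, non-coreference links) for a list of entities."""
    coref = {frozenset(p) for e in entities for p in combinations(e, 2)}
    mentions = [m for e in entities for m in e]
    all_links = {frozenset(p) for p in combinations(mentions, 2)}
    return coref, all_links - coref

key = ["abc", "de"]       # key entities {a, b, c} and {d, e}
response = ["bc", "efg"]  # response entities {b, c} and {e, f, g}

C_k, N_k = link_sets(key)
C_r, N_r = link_sets(response)
T_k, T_r = C_k | N_k, C_r | N_r

def show(links):
    """Render links as sorted two-letter strings for printing."""
    return sorted("".join(sorted(l)) for l in links)

print(show(C_k - T_r))  # ['ab', 'ac', 'de']
print(show(N_k - T_r))  # ['ad', 'ae', 'bd', 'cd']
print(show(C_r - T_k))  # ['ef', 'eg', 'fg']
print(show(N_r - T_k))  # ['bf', 'bg', 'cf', 'cg']
```

The key link (a,b) shows up in the first list, and the response links (f,g) and (c,f) show up in the third and fourth lists, matching the discussion above.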

To account for missing or spurious links, we observe that

• $C_{k}\setminus T_{r}$ are key coreference links missing in the response;
• $N_{k}\setminus T_{r}$ are key non-coreference links missing in the response;
• $C_{r}\setminus T_{k}$ are response coreference links missing in the key;
• $N_{r}\setminus T_{k}$ are response non-coreference links missing in the key,

and we propose to extend the coreference F-measure and non-coreference F-measure as follows. Coreference recall, precision and F-measure are changed to:

$$\begin{aligned}
R_{c} &= \frac{|C_{k}\cap C_{r}|}{|C_{k}\cap C_{r}|+|C_{k}\cap N_{r}|+|C_{k}\setminus T_{r}|} &\qquad (9)\\
P_{c} &= \frac{|C_{k}\cap C_{r}|}{|C_{r}\cap C_{k}|+|C_{r}\cap N_{k}|+|C_{r}\setminus T_{k}|} &\qquad (10)\\
F_{c} &= \frac{2R_{c}P_{c}}{R_{c}+P_{c}} &\qquad (11)
\end{aligned}$$

Non-coreference recall, precision and F-measure are changed to:

$$\begin{aligned}
R_{n} &= \frac{|N_{k}\cap N_{r}|}{|N_{k}\cap C_{r}|+|N_{k}\cap N_{r}|+|N_{k}\setminus T_{r}|} &\qquad (12)\\
P_{n} &= \frac{|N_{k}\cap N_{r}|}{|N_{r}\cap C_{k}|+|N_{r}\cap N_{k}|+|N_{r}\setminus T_{k}|} &\qquad (13)\\
F_{n} &= \frac{2R_{n}P_{n}}{R_{n}+P_{n}} &\qquad (14)
\end{aligned}$$

The proposed BLANC continues to be the arithmetic average of $F_{c}$ and $F_{n}$:

$$\text{BLANC}=\frac{F_{c}+F_{n}}{2} \qquad (15)$$

We observe that the proposed BLANC, Eqns. (9)-(14), subsumes BLANC-gold, Eqns. (2)-(7), because of the following proposition:

Proposition. If $T_{k}=T_{r}$, then $\text{BLANC}=\text{BLANC}^{(g)}$.

Proof. We only need to show that $R_{c}=R_{c}^{(g)}$, $P_{c}=P_{c}^{(g)}$, $R_{n}=R_{n}^{(g)}$, and $P_{n}=P_{n}^{(g)}$. We prove the first one (the other proofs are similar and elided due to space limitations). Since $T_{k}=T_{r}$ and $C_{k}\subset T_{k}$, we have $C_{k}\subset T_{r}$; thus $C_{k}\setminus T_{r}=\emptyset$ and $|C_{k}\setminus T_{r}|=0$. The denominator of Eqn. (9) therefore reduces to that of Eqn. (2), which establishes that $R_{c}=R_{c}^{(g)}$.
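As an illustration of one of the elided cases, the argument for coreference precision would plausibly run as follows, applying the same reasoning to Eqns. (10) and (3). This is a sketch of the analogous step, not the authors' own text.

$$\begin{aligned}
T_{k}=T_{r},\ C_{r}\subset T_{r} \;&\Rightarrow\; C_{r}\subset T_{k} \;\Rightarrow\; C_{r}\setminus T_{k}=\emptyset \;\Rightarrow\; |C_{r}\setminus T_{k}|=0,\\
P_{c} &= \frac{|C_{k}\cap C_{r}|}{|C_{r}\cap C_{k}|+|C_{r}\cap N_{k}|+0} = P_{c}^{(g)}.
\end{aligned}$$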

Indeed, since $C_{k}$ is a union of three disjoint subsets, $C_{k}=(C_{k}\cap C_{r})\cup(C_{k}\cap N_{r})\cup(C_{k}\setminus T_{r})$, both $R_{c}^{(g)}$ and $R_{c}$ can be unified as $\frac{|C_{k}\cap C_{r}|}{|C_{k}|}$. Unification for the other component recalls and precisions can be done similarly. So the final definition of BLANC can be succinctly stated as:

$$\begin{aligned}
R_{c} &= \frac{|C_{k}\cap C_{r}|}{|C_{k}|}, & P_{c} &= \frac{|C_{k}\cap C_{r}|}{|C_{r}|} &\qquad (16)\\
R_{n} &= \frac{|N_{k}\cap N_{r}|}{|N_{k}|}, & P_{n} &= \frac{|N_{k}\cap N_{r}|}{|N_{r}|} &\qquad (17)\\
F_{c} &= \frac{2|C_{k}\cap C_{r}|}{|C_{k}|+|C_{r}|}, & F_{n} &= \frac{2|N_{k}\cap N_{r}|}{|N_{k}|+|N_{r}|} &\qquad (18)\\
\text{BLANC} &= \frac{F_{c}+F_{n}}{2} & & &\qquad (19)
\end{aligned}$$
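For completeness, the closed form of $F_{c}$ in (18) follows from (16) by direct substitution into the harmonic mean. The shorthand $x=|C_{k}\cap C_{r}|$ is introduced here only for brevity, and the derivation assumes $x>0$; the expression for $F_{n}$ is obtained in the same way.

$$\begin{aligned}
F_{c} &= \frac{2R_{c}P_{c}}{R_{c}+P_{c}}
= \frac{2\cdot\frac{x}{|C_{k}|}\cdot\frac{x}{|C_{r}|}}{\frac{x}{|C_{k}|}+\frac{x}{|C_{r}|}}
= \frac{\frac{2x^{2}}{|C_{k}||C_{r}|}}{\frac{x\,(|C_{k}|+|C_{r}|)}{|C_{k}||C_{r}|}}
= \frac{2x}{|C_{k}|+|C_{r}|}.
\end{aligned}$$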

### 4.1 Boundary Cases

Care has to be taken when counts of the BLANC definition are 0. This can happen when all key (or response) mentions are in one cluster or are all singletons: the former case will lead to $N_{k}=\emptyset$ (or $N_{r}=\emptyset$); the latter will lead to $C_{k}=\emptyset$ (or $C_{r}=\emptyset$). Observe that as long as $|C_{k}|+|C_{r}|>0$, $F_{c}$ in (18) is well-defined; as long as $|N_{k}|+|N_{r}|>0$, $F_{n}$ in (18) is well-defined. So we only need to augment the BLANC definition for the following cases:

(1) If $C_{k}=C_{r}=\emptyset$ and $N_{k}=N_{r}=\emptyset$, then $\text{BLANC}=I(M_{k}=M_{r})$, where $I(\cdot)$ is an indicator function whose value is 1 if its argument is true and 0 otherwise, and $M_{k}$ and $M_{r}$ are the key and response mention sets. This can happen when the key and the response each contain no more than one mention, so that there is no link.

(2) If $C_{k}=C_{r}=\emptyset$ and $|N_{k}|+|N_{r}|>0$, then $\text{BLANC}=F_{n}$. This is the case where the key and response sides both consist only of singleton entities. Since there is no coreference link, BLANC reduces to the non-coreference F-measure $F_{n}$.

(3) If $N_{k}=N_{r}=\emptyset$ and $|C_{k}|+|C_{r}|>0$, then $\text{BLANC}=F_{c}$. This is the case where all mentions in the key and response are in one entity. Since there is no non-coreference link, BLANC reduces to the coreference F-measure $F_{c}$.
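Putting Eqns. (16)-(19) and the three boundary cases together, a minimal Python sketch of the metric might look as follows. It is an illustration under stated assumptions, not the code of the CoNLL scorer: entities are iterables of unique, sortable mention labels, and the names `link_sets`, `f_measure` and `blanc` are hypothetical.

```python
from itertools import combinations

def link_sets(entities):
    """Return (mentions, coreference links, non-coreference links)."""
    coref = {frozenset(p) for e in entities for p in combinations(e, 2)}
    mentions = {m for e in entities for m in e}
    all_links = {frozenset(p) for p in combinations(sorted(mentions), 2)}
    return mentions, coref, all_links - coref

def f_measure(key_links, resp_links):
    """Eqn. (18); None signals that both link sets are empty."""
    denom = len(key_links) + len(resp_links)
    return 2 * len(key_links & resp_links) / denom if denom else None

def blanc(key, response):
    M_k, C_k, N_k = link_sets(key)
    M_r, C_r, N_r = link_sets(response)
    F_c = f_measure(C_k, C_r)
    F_n = f_measure(N_k, N_r)
    if F_c is None and F_n is None:  # boundary case (1): no links at all
        return 1.0 if M_k == M_r else 0.0
    if F_c is None:                  # boundary case (2): only singletons
        return F_n
    if F_n is None:                  # boundary case (3): one entity per side
        return F_c
    return (F_c + F_n) / 2           # Eqn. (19)

# Key entity {a, b, c}, response entity {b, c}: boundary case (3).
print(blanc(["abc"], ["bc"]))  # 0.5
```

Returning `None` from `f_measure` keeps an undefined F-measure distinct from a genuine score of 0, which is what lets the three boundary cases be told apart.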

### 4.2 Toy Examples

We walk through a few examples and show how BLANC is calculated in detail. In all the examples below, each lower-case letter represents a mention; mentions in an entity are enclosed in {}; two letters in () represent a link.

Example 1. Key entities are $\{abc\}$ and $\{d\}$; response entities are $\{bc\}$ and $\{de\}$. Obviously,
$C_{k}=\{(ab),(bc),(ac)\}$;
$N_{k}=\{(ad),(bd),(cd)\}$;
$C_{r}=\{(bc),(de)\}$;
$N_{r}=\{(bd),(be),(cd),(ce)\}$.
Therefore, $C_{k}\cap C_{r}=\{(bc)\}$, $N_{k}\cap N_{r}=\{(bd),(cd)\}$, and $R_{c}=\frac{1}{3}$, $P_{c}=\frac{1}{2}$, $F_{c}=\frac{2}{5}$; $R_{n}=\frac{2}{3}$, $P_{n}=\frac{2}{4}$, $F_{n}=\frac{4}{7}$. Finally, $\text{BLANC}=\frac{17}{35}$.
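The arithmetic behind the final value in Example 1 can be spelled out step by step from the recall and precision values just given:

$$\begin{aligned}
F_{c} &= \frac{2\cdot\frac{1}{3}\cdot\frac{1}{2}}{\frac{1}{3}+\frac{1}{2}} = \frac{\frac{1}{3}}{\frac{5}{6}} = \frac{2}{5}, \qquad
F_{n} = \frac{2\cdot\frac{2}{3}\cdot\frac{1}{2}}{\frac{2}{3}+\frac{1}{2}} = \frac{\frac{2}{3}}{\frac{7}{6}} = \frac{4}{7},\\
\text{BLANC} &= \frac{1}{2}\left(\frac{2}{5}+\frac{4}{7}\right) = \frac{1}{2}\cdot\frac{14+20}{35} = \frac{17}{35}\approx 0.486.
\end{aligned}$$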

Example 2. Key entity is $\{a\}$; response entity is $\{b\}$. This is boundary case (1): $\text{BLANC}=0$.

Example 3. Key entities are $\{a\}\{b\}\{c\}$; response entities are $\{a\}\{b\}\{d\}$. This is boundary case (2): there are no coreference links. Since
$N_{k}=\{(ab),(bc),(ca)\}$,
$N_{r}=\{(ab),(bd),(ad)\}$,
we have
$N_{k}\cap N_{r}=\{(ab)\}$, and $R_{n}=\frac{1}{3}$, $P_{n}=\frac{1}{3}$.
So $\text{BLANC}=F_{n}=\frac{1}{3}$.

Example 4. Key entity is $\{abc\}$; response entity is $\{bc\}$. This is boundary case (3): there are no non-coreference links. Since
$C_{k}=\{(ab),(bc),(ca)\}$, and $C_{r}=\{(bc)\}$,
we have
$C_{k}\cap C_{r}=\{(bc)\}$, and $R_{c}=\frac{1}{3}$, $P_{c}=1$.
So $\text{BLANC}=F_{c}=\frac{2}{4}=\frac{1}{2}$.

## 5 Results

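| Participant | R | P | BLANC |
|---|---|---|---|
| lee | 50.23 | 49.28 | 48.84 |
| sapena | 40.68 | 49.05 | 44.47 |
| | 47.83 | 44.22 | 45.95 |
| | 44.71 | 47.48 | 45.49 |
| | 49.37 | 29.80 | 34.58 |
| | 46.74 | 37.33 | 41.33 |
| | 36.88 | 39.69 | 30.92 |
| | 35.42 | 39.56 | 36.31 |
| | 47.95 | 29.12 | 36.09 |
| | 42.32 | 31.54 | 35.65 |
| | 45.41 | 32.75 | 36.98 |
| | 29.93 | 45.58 | 34.95 |
| | 32.29 | 33.01 | 32.57 |
| | 36.83 | 34.39 | 35.02 |
| | 34.84 | 29.53 | 30.98 |
| | 30.10 | 43.96 | 35.71 |
| | 26.40 | 15.32 | 15.37 |
| | 03.62 | 28.28 | 06.28 |
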
Table 1: The proposed BLANC scores of the CoNLL-2011 shared task participants.

### 5.1 CoNLL-2011/12

We have updated the publicly available CoNLL coreference scorer (http://code.google.com/p/reference-coreference-scorers) with the proposed BLANC, and used it to compute the proposed BLANC scores for all the CoNLL 2011 [5] and 2012 [4] participants in the official track, where participants had to automatically predict the mentions. Tables 1 and 2 report the updated results. (The order is kept the same as in Pradhan et al. (2011) and Pradhan et al. (2012) for easy comparison.)

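| Participant | R | P | BLANC |
|---|---|---|---|
| Language: Arabic | | | |
| fernandes | 33.43 | 44.66 | 37.99 |
| | 32.65 | 45.47 | 37.93 |
| | 31.62 | 35.26 | 33.02 |
| | 32.59 | 36.92 | 34.50 |
| | 31.81 | 31.52 | 30.82 |
| | 11.04 | 62.58 | 18.51 |
| | 04.60 | 56.63 | 08.42 |
| | 54.91 | 63.66 | 58.75 |
| | 52.00 | 58.84 | 55.04 |
| | 52.01 | 59.55 | 55.42 |
| | 52.85 | 55.03 | 53.86 |
| | 50.52 | 56.82 | 52.87 |
| | 51.19 | 55.47 | 52.65 |
| | 54.39 | 54.88 | 54.42 |
| | 50.58 | 54.29 | 52.11 |
| | 45.99 | 54.59 | 46.47 |
| | 49.55 | 52.46 | 50.44 |
| | 44.15 | 48.89 | 46.04 |
| | 40.60 | 50.85 | 45.10 |
| | 41.46 | 33.13 | 34.80 |
| | 44.39 | 32.79 | 36.54 |
| | 25.17 | 52.96 | 31.85 |
| | 48.45 | 62.44 | 54.10 |
| | 53.15 | 40.75 | 43.20 |
| | 47.58 | 45.93 | 44.22 |
| | 44.11 | 36.45 | 38.45 |
| | 42.36 | 61.72 | 49.63 |
| | 39.60 | 55.12 | 45.89 |
| | 33.44 | 56.01 | 41.88 |
| | 27.24 | 62.33 | 37.89 |
| | 37.43 | 36.18 | 36.77 |
| | 36.46 | 39.79 | 37.85 |
| | 21.61 | 62.94 | 30.37 |
| | 18.74 | 40.76 | 25.68 |
| | 21.50 | 37.18 | 22.89 |
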
Table 2: The proposed BLANC scores of the CoNLL-2012 shared task participants.

### 5.2 Correlation with Other Measures

| Measure | R | P | F1 |
|---|---|---|---|
| MUC | 0.975 | 0.844 | 0.935 |
| B-cubed | 0.981 | 0.942 | 0.966 |
| CEAF-m | 0.941 | 0.923 | 0.966 |
| CEAF-e | 0.797 | 0.781 | 0.919 |

Table 3: Pearson’s r correlation coefficients between the proposed BLANC and the other coreference measures based on the CoNLL 2011/2012 results. All $p$-values are significant at $<$ 0.001.

Figure 1 shows how the proposed BLANC measure works when compared with existing metrics such as MUC, B-cubed and CEAF, using the BLANC and F1 scores. The proposed BLANC is highly positively correlated with the other measures along R, P and F1 (Table 3), showing that BLANC is able to capture most entity-based similarities measured by B-cubed and CEAF. However, the CoNLL data sets come from OntoNotes [2], where singleton entities are not annotated, and BLANC has a wider dynamic range on data sets with singletons [7]. So the correlations will likely be lower on data sets with singleton entities.

Figure 1: Correlation plot between the proposed BLANC and the other measures based on the CoNLL 2011/2012 results. All values are F1 scores.

## 6 Conclusion

The original BLANC-gold [7] requires that system mentions be identical to gold mentions, which limits the metric’s utility since detected system mentions often have missing key mentions or spurious mentions. The proposed BLANC is free from this assumption, and we have shown that it subsumes the original BLANC-gold. Since BLANC works on imperfect system mentions, we have used it to score the CoNLL 2011 and 2012 coreference systems. The BLANC scores show strong correlation with existing metrics, especially B-cubed and CEAF-m.

## Acknowledgments

We would like to thank the three anonymous reviewers for their invaluable suggestions for improving the paper. This work was partially supported by grant R01LM10090 from the National Library of Medicine.

## References

• [1] A. Bagga and B. Baldwin (1998) Algorithms for scoring coreference chains. pp. 563–566.
• [2] E. Hovy, M. Marcus, M. Palmer, L. Ramshaw and R. Weischedel (2006-06) OntoNotes: the 90% solution. New York City, USA, pp. 57–60.
• [3] X. Luo (2005) On coreference resolution performance metrics.
• [4] S. Pradhan, A. Moschitti, N. Xue, O. Uryupina and Y. Zhang (2012-07) CoNLL-2012 shared task: modeling multilingual unrestricted coreference in OntoNotes. Jeju Island, Korea, pp. 1–40.
• [5] S. Pradhan, L. Ramshaw, M. Marcus, M. Palmer, R. Weischedel and N. Xue (2011-06) CoNLL-2011 shared task: modeling unrestricted coreference in OntoNotes. Portland, Oregon, USA, pp. 1–27.
• [6] W. M. Rand (1971) Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66 (336), pp. 846–850.
• [7] M. Recasens and E. Hovy (2011-10) BLANC: implementing the Rand index for coreference evaluation. Natural Language Engineering 17, pp. 485–510. ISSN 1469-8110.
• [8] V. Stoyanov, N. Gilbert, C. Cardie and E. Riloff (2009) Conundrums in noun phrase coreference resolution: making sense of the state-of-the-art. ACL ’09, Stroudsburg, PA, USA, pp. 656–664. ISBN 978-1-932432-46-6.
• [9] M. Vilain, J. Burger, J. Aberdeen, D. Connolly and L. Hirschman (1995) A model-theoretic coreference scoring scheme. pp. 45–52.