# Applying a Naive Bayes Similarity Measure to Word Sense Disambiguation

Tong Wang
University of Toronto
tong@cs.toronto.edu &Graeme Hirst
University of Toronto
gh@cs.toronto.edu
###### Abstract

We replace the overlap mechanism of the Lesk algorithm with a simple, general-purpose Naive Bayes model that measures many-to-many association between two sets of random variables. Even with simple probability estimates such as maximum likelihood, the model gains significant improvement over the Lesk algorithm on word sense disambiguation tasks. With additional lexical knowledge from WordNet, performance is further improved to surpass the state-of-the-art results.

## 1 Introduction

To disambiguate a homonymous word in a given context, proposed a method that measured the degree of overlap between the glosses of the target and context words. Known as the Lesk algorithm, this simple and intuitive method has since been extensively cited and extended in the word sense disambiguation (WSD) community. Nonetheless, its performance in several WSD benchmarks is less than satisfactory (). Among the popular explanations is a key limitation of the algorithm, that “Lesk’s approach is very sensitive to the exact wording of definitions, so the absence of a certain word can radically change the results.” ().

Compounding this problem is the fact that many Lesk variants limited the concept of overlap to the literal interpretation of string matching (with their own variants such as length-sensitive matching (), etc.), and it was not until recently that overlap started to take on other forms such as tree-matching () and vector space models (). To address this limitation, a Naive Bayes model (NBM) is proposed in this study as a novel, probabilistic treatment of overlap in gloss-based WSD.

## 2 Related Work

In the extraordinarily rich literature on WSD, we focus our review on those closest to the topic of Lesk and NBM. In particular, we opt for the “simplified Lesk” (), where inventory senses are assessed by gloss-context overlap rather than gloss-gloss overlap. This particular variant prevents proliferation of gloss comparison on larger contexts () and is shown to outperform the original Lesk algorithm ().

To the best of our knowledge, NBMs have been employed exclusively as classifiers in WSD — that is, in contrast to their use as a similarity measure in this study. used NB classifier resembling an information retrieval system: a WSD instance is regarded as a document $d$, and candidate senses are scored in terms of “relevance” to $d$. When evaluated on a WSD benchmark (), the algorithm compared favourably to Lesk variants (as expected for a supervised method). proposed an ensemble model with multiple NB classifiers differing by context window size. trained an unsupervised NB classifier using the EM algorithm and empirically demonstrated the benefits of WordNet-assisted () feature selection over local syntactic features.

Among Lesk variants, extended the gloss of both inventory senses and the context words to include words in their related synsets in WordNet. Senses were scored by the sum of overlaps across all relation pairs, and the effect of individual relation pairs was evaluated in a later work (). Overlap was assessed by string matching, with the number of matching words squared so as to assign higher scores to multi-word overlaps.

Breaking away from string matching, measured overlap as similarity between gloss- and context-vectors, which were aggregated word vectors encoding second order co-occurrence information in glosses. An extension by differentiated context word senses and extended shorter glosses with related glosses in WordNet. measured overlap by concept similarity () between each inventory sense and the context words. Gloss overlaps from their earlier work actually out-performed all five similarity-based methods.

More recently, proposed a tree-matching algorithm that measured gloss-context overlap as the weighted sum of dependency-induced lexical distance. constructed a sentential similarity measure () using lexical similarity measures (), and overlap was measured by the cosine of their respective sentential vectors. A related approach () also used Wikipedia-induced concepts to encoded sentential vectors. These systems compared favourably to existing methods in WSD performance, although by using sense frequency information, they are essentially supervised methods.

Distributional methods have been used in many WSD systems in quite different flavours than the current study. proposed a Lesk variant where each gloss word is weighted by its idf score in relation to all glosses, and gloss-context association was incremented by these weights rather than binary, overlap counts. used distributional thesauri as a knowledge base to increase overlaps, which were, again, assessed by string matching.

In conclusion, the majority of Lesk variants focused on extending the gloss to increase the chance of overlapping, while the proposed NBM aims to make better use of the limited lexical knowledge available. In contrast to string matching, the probabilistic nature of our model offers a “softer” measurement of gloss-context association, resulting in a novel approach to unsupervised WSD with state-of-the-art performance in more than one WSD benchmark (Section 4).

## 3 Model and Task Descriptions

### 3.1 The Naive Bayes Model

Formally, given two sets $\mathbf{e}=\{e_{i}\}$ and $\mathbf{f}=\{f_{j}\}$ each consisting of multiple random events, the proposed model measures the probabilistic association $p(\mathbf{f}|\mathbf{e})$ between $\mathbf{e}$ and $\mathbf{f}$. Under the assumption of conditional independence among the events in each set, a Naive Bayes treatment of the measure can be formulated as:

 $\displaystyle p(\mathbf{f}|\mathbf{e})=$ $\displaystyle\prod_{j}p(f_{j}|\{e_{i}\})=\prod_{j}\frac{p(\{e_{i}\}|f_{j})p(f_% {j})}{p(\{e_{i}\})}$ (1) $\displaystyle=$ $\displaystyle\frac{\prod_{j}[p(f_{j})\prod_{i}p(e_{i}|f_{j})]}{\prod_{j}\prod_% {i}p(e_{i})},$

In the second expression, Bayes’s rule is applied not only to take advantage of the conditional independence among $e_{i}$’s, but also to facilitate probability estimation, since $p(\{e_{i}\}|f_{j})$ is easier to estimate in the context of WSD, where sample spaces of $\mathbf{e}$ and $\mathbf{f}$ become asymmetric (Section 3.2).

### 3.2 Model Application in WSD

In the context of WSD, $\mathbf{e}$ can be regarded as an instance of a polysemous word $w$, while $\mathbf{f}$ represents certain lexical knowledge about the sense $s$ of $w$ manifested by $\mathbf{e}$.11Think of the notations $\mathbf{e}$ and $\mathbf{f}$ mnemonically as exemplars and features, respectively. WSD is thus formulated as identifying the sense $s^{*}$ in the sense inventory $\mathcal{S}$ of $w$ s.t.:

 $s^{*}=\operatorname*{arg\,max}_{s\in\mathcal{S}}p(\mathbf{f}|\mathbf{e})$ (2)

In one of their simplest forms, $e_{i}$’s correspond to co-occurring words in the instance of $w$, and $f_{j}$’s consist of the gloss words of sense $s$. Consequently, $p(\mathbf{f}|\mathbf{e})$ is essentially measuring the association between context words of $w$ and definition texts of $s$, i.e., the gloss-context association in the simplified Lesk algorithm (). A major difference, however, is that instead of using hard, overlap counts between the two sets of words from the gloss and the context, this probabilistic treatment can implicitly model the distributional similarity among the elements $e_{i}$ and $f_{j}$ (and consequently between the sets $\mathbf{e}$ and $\mathbf{f}$) over a wider range of contexts. The result is a “softer” proxy of association than the binary view of overlaps in existing Lesk variants.

The foregoing discussion offers a second motivation for applying Bayes’s rule on the second expression in Equation (1): it is easier to estimate $p(e_{i}|f_{j})$ than $p(f_{j}|e_{i})$, since the vocabulary for the lexical knowledge features ($f_{j}$) is usually more limited than that of the contexts ($e_{i}$) and hence estimation of the former suffices on a smaller amount of data than that of the latter.

### 3.3 Incorporating Additional Lexical Knowledge

The input of the proposed NBM is bags of words, and thus it is straightforward to incorporate various forms of lexical knowledge (LK) for word senses: by concatenating a tokenized knowledge source to the existing knowledge representation $\mathbf{f}$, while the similarity measure remains unchanged.

The availability of LK largely depends on the sense inventory used in a WSD task. WordNet senses are often used in Senseval and SemEval tasks, and hence senses (or synsets, and possibly their corresponding word forms) that are semantic related to the inventory senses under WordNet relations are easily obtainable and have been exploited by many existing studies.

As pointed out by , however, “not all of these relations are equally helpful.” Relation pairs involving hyponyms were shown to result in better F-measure when used in gloss overlaps (). The authors attributed the phenomenon to the the multitude of hyponyms compared to other relations. We further hypothesize that, beyond sheer numbers, synonyms and hyponyms offer stronger semantic specification that helps distinguish the senses of a given ambiguous word, and thus are more effective knowledge sources for WSD.

Senses Hypernyms Hyponyms Synonyms
factory building complex, complex brewery, factory, mill, … works, industrial plant
life form organism, being perennial, crop… flora, plant life
Table 1: Lexical knowledge for the word plant under its two meanings factory and life form.

Take the word plant for example. Selected hypernyms, hyponyms, and synonyms pertaining to its two senses factory and life form are listed in Table 1. Hypernyms can be overly general terms (e.g., being). Although conceptually helpful for humans in coarse-grained WSD, this generality is likely to inflate the hypernyms’ probabilistic estimation. Hyponyms, on the other hand, help specify their corresponding senses with information that is possibly missing from the often overly brief glosses: the many technical terms as hyponyms in Table 1 — though rare — are likely to occur in the (possibly domain-specific) contexts that are highly typical of the corresponding senses. Particularly for the NBM, the co-occurrence is likely to result in stronger gloss-definition associations when similar contexts appear in a WSD instance.

We also observe that some semantically related words appear under rare senses (e.g., still as an alcohol-manufacturing plant, and annual as a one-year-life-cycle plant; omitted from Table 1). This is a general phenomenon in gloss-based WSD and is beyond the scope of the current discussion.22We do, however, refer curious readers to the work of for a novel treatment of a similar problem. Overall, all three sources of LK may complement each other in WSD tasks, with hyponyms particularly promising in both quantity and quality compared to hypernyms and synonyms.33Note that LK expansion is a feature of our model rather than a requirement. What type of knowledge to include is eventually a decision made by the user based on the application and LK availability.

### 3.4 Probability Estimation

A most open-ended question is how to estimate the probabilities in Equation (1). In WSD in particular, the estimation concerns the marginal and conditional probabilities of and between word tokens. Many options are available to this end in statistical machine learning (MLE, MAP, etc.), information theory (), as well as the rich body of research in lexical semantic similarity ).

Here we choose maximum likelihood — not only for its simplicity, but also to demonstrate model strength with a relatively crude probability estimation. To avoid underflow, Equation (1) is estimated as the following log probability:

 \hskip{-4.267913pt}\begin{aligned}&\displaystyle\sum_{i}\log\frac{c(f_{j})}{c(% \cdot)}+\sum_{i}\sum_{j}\log\frac{c(e_{i},f_{j})}{c(f_{j})}-|\mathbf{f}|\sum_{% j}\log\frac{c(e_{i})}{c(\cdot)}\\ \displaystyle=&\displaystyle(1-|\mathbf{e}|)\sum_{i}\log c(f_{j})-|\mathbf{f}|% \sum_{j}\log c(e_{i})\\ &\displaystyle+\sum_{i}\sum_{j}\log c(e_{i},f_{j})+|\mathbf{f}|(|\mathbf{e}|-1% )\log c(\cdot),\end{aligned}

where $c(x)$ is the count of word $x$, $c(\cdot)$ is the corpus size, $c(x,y)$ is the joint count of $x$ and $y$, and $|\mathbf{v}|$ is the dimension of vector $\mathbf{v}$.

Nonetheless, we do investigate how model performance responds to estimation quality. Specifically in WSD, a source corpus is defined as the source of the majority of the WSD instances in a given dataset, and a baseline corpus of a smaller size and less resemblance to the instances is used for all datasets. The assumption is that a source corpus offers better estimates for the model than the baseline corpus, and difference in model performance is expected when using probability estimation of different quality.

## 4 Evaluation

### 4.1 Data, Scoring, and Pre-processing

Various aspects of the model discussed in Section 3 are evaluated in the English lexical sample tasks from Senseval-2 () and SemEval-2007 (). Training sections are used as development data and test sections held out for final testing. Model performance is evaluated in terms of WSD accuracy using Equation (2) as the scoring function. Accuracy is defined as the number of correct responses over the number of instances. Because it is a rare event for the NBM to produce identical scores,44This has never occurred in the hundreds of thousands of runs in our development process. the model always proposes a unique answer and accuracy is thus equivalent to F-score commonly used in existing reports.

Multiword expressions (MWEs) in the Senseval-2 sense inventory are not explicitly marked in the contexts. Several of the top-ranking systems implemented their own MWE detection algorithms (). Without digressing to the details of MWE detection — and meanwhile, to ensure fair comparison with existing systems — we implement two variants of the prediction module, one completely ignorant of MWE and defaulting to INCORRECT for all MWE-related answers, while the other assuming perfect MWE detection and performing regular disambiguation algorithm on the MWE-related senses (not defaulting to CORRECT). All results reported for Senseval-2 below are harmonic means of the two outcomes.

Each inventory sense is represented by a set of LK tokens (e.g., definition texts, synonyms, etc.) from their corresponding WordNet synset (or in the coarse-grained case, a concatenation of tokens from all synsets in a sense group). The MIT-JWI library () is used for accessing WordNet. Usage examples in glosses (included by the library by default) are removed in our experiments.55We also compared the two Lesk baselines (with and without usage examples) on the development data but did not observe significant differences as reported by .

Basic pre-processing is performed on the contexts and the glosses, including lower-casing, stopword removal, lemmatization on both datasets, and tokenization on the Senseval-2 instances.66The SemEval-2007 instances are already tokenized. Stanford CoreNLP is used for lemmatization and tokenization. Identical procedures are applied to all corpora used for probability estimation.

Binomial test is used for significance testing, and with one exception explicitly noted in Section 4.3, all differences presented are statistically highly significant ($p<0.001$).

### 4.2 Comparing Lexical Knowledge Sources

Dataset \arraybackslashglo \arraybackslash\arraybackslashsyn \arraybackslashhpr \arraybackslashhpo \arraybackslashall \arraybackslash1st \arraybackslash2nd \arraybackslashLesk
Senseval-2 Coarse \arraybackslash.475 \arraybackslash\arraybackslash.478 \arraybackslash.494 \arraybackslash.518 \arraybackslash.523 \arraybackslash.469 \arraybackslash.367 \arraybackslash.262
Senseval-2 Fine \arraybackslash.362 \arraybackslash\arraybackslash.371 \arraybackslash.326 \arraybackslash.379 \arraybackslash.388 \arraybackslash.412 \arraybackslash.293 \arraybackslash.163
SemEval-2007 \arraybackslash.494 \arraybackslash\arraybackslash.511 \arraybackslash.507 \arraybackslash.550 \arraybackslash.573 \arraybackslash.538 \arraybackslash.521 \arraybackslash
Table 2: Lexical knowledge sources and WSD performance (F-measure) on the Senseval-2 (fine- and coarse-grained) and the SemEval-2007 dataset.

To study the effect of different types of LK in WSD (Section 3.3), for each inventory sense, we choose synonyms (syn), hypernyms (hpr), and hyponyms (hpo) as extended LK in addition to its gloss. The WSD model is evaluated with gloss-only (glo), individual extended LK sources, and the combination of all four sources (all). The results are listed in Table 2 together with existing results (1st and 2nd correspond to the results of the top two unsupervised methods in each dataset).88We excluded the results of UNED () in Senseval-2 because, by using sense frequency information that is only obtainable from sense-annotated corpora, it is essentially a supervised system.

By using only glosses, the proposed model already shows statistically significant improvement over the basic Lesk algorithm (92.4% and 140.5% relative improvement in Senseval-2 coarse- and fine-grained tracks, respectively).99Comparisons are made against the simplified Lesk algorithm () without usage examples. The comparison is unavailable in SemEval2007 since we have not found existing experiments with this exact configuration. Moreover, comparison between coarse- and fine-grained tracks reveals interesting properties of different LK sources. Previous hypotheses (Section 3.3) are empirically confirmed that WSD performance benefits most from hyponyms and least from hypernyms. Specifically, highly similar, fine-grained sense candidates apparently share more hypernyms in the fine-grained case than in the coarse-grained case; adding to the generality of hypernyms (both semantic and distributional), we postulate that their probability in the NBM is uniformly inflated among many sense candidates, and hence they decrease in distinguishability. Synonyms might help with regard to semantic specification, though their limited quantity also limits their benefits. These patterns on the LK types are consistent in all three experiments.

When including all four LK sources, our model outperforms the state-of-the-art systems with statistical significance in both coarse-grained tasks. For the fine-grained track, it achieves 2nd place after that of , which used a decision list () on manually selected corpora evidence for each inventory sense, and thus is not subject to loss of distinguishability in the glosses as Lesk variants are.

### 4.3 Probability Estimation

Figure 1: Model response to probability estimates of different quality on the SemEval-2007 dataset. Error bars indicate confidence intervals ($p<.001$), and the dashed line corresponds to the best reported result.

To evaluate model response to probability estimation of different quality (Section 3.4), source corpora are chosen as the majority value of the doc-source attribute of instances in each dataset, namely, the British National Corpus for Senseval-2 (94%) and the Wall Street Journal for SemEval-2007 (86%). The Brown Corpus is shared by both datasets as the baseline corpus. Figure 1 shows the comparison on the SemEval-2007 dataset. Across all experiments, higher WSD accuracy is consistently witnessed using the source corpus; differences are statistically highly significant except for hpo (which is significant with $p<0.01$).

## 5 Conclusions and Future Work

We have proposed a general-purpose Naive Bayes model for measuring association between two sets of random events. The model replaced string matching in the Lesk algorithm for word sense disambiguation with a probabilistic measure of gloss-context overlap. The base model on average more than doubled the accuracy of Lesk in Senseval-2 on both fine- and coarse-grained tracks. With additional lexical knowledge, the model also outperformed state of the art results with statistical significance on two coarse-grained WSD tasks.

For future work, we plan to apply the model in other shared tasks, including open-text WSD, so as to compare with more recent Lesk variants. We would also like to explore how to incorporate syntactic features and employ alternative statistical methods (e.g., parametric models) to improve probability estimation and inference. Other NLP problems involving compositionality in general might also benefit from the proposed many-to-many similarity measure.

## Acknowledgments

This study is funded by the Natural Sciences and Engineering Research Council of Canada. We thank Afsaneh Fazly, Navdeep Jaitly, and Varada Kolhatkar for the many inspiring discussions, as well as the anonymous reviewers for their constructive advice.