# Improved Iterative Correction for Distant Spelling Errors

Sergey Gubanov     Irina Galinskaya     Alexey Baytin
Yandex
16 Leo Tolstoy St., Moscow, 119021 Russia
{esgv,galinskaya,baytin}@yandex-team.ru
###### Abstract

Noisy channel models, widely used in modern spellers, cope with typical misspellings, but do not work well with infrequent and difficult spelling errors. In this paper, we have improved the noisy channel approach by iterative stochastic search for the best correction. The proposed algorithm allowed us to avoid local minima problem and improve the $F_{1}$ measure by 6.6% on distant spelling errors.

## 1 Introduction

A speller is an essential part of any program associated with text input and processing — e-mail system, search engine, browser, form editor etc. To detect and correct spelling errors, the state of the art spelling correction systems use the noisy channel approach [6, 8, 1]. Its models are usually trained on large corpora and provide high effectiveness in correction of typical errors (most of which consist of 1-2 wrong characters per word), but does not work well for complex (multi-character) and infrequent errors.

In this paper, we improved effectiveness of the noisy channel for the correction of complex errors. In most cases, these are cognitive errors in loan words (folsvagen $\to$ volkswagen), names of drugs (vobemzin $\to$ wobenzym), names of brands (scatcher $\to$ skechers), scientific terms (heksagidron $\to$ hexahedron) and last names (Shwartzneger $\to$ Schwarzenegger). In all these cases, the misspelled word contains many errors and the corresponding error model penalty cannot be compensated by the LM weight of its proper form. As a result, either the misspelled word itself, or the other (less complicated, more frequent) misspelling of the same word wins the likelihood race.

To compensate for this defect of the noisy channel, the iterative approach [3] is typically used. The search for the best variant is repeated several times, what allows correcting rather complex errors, but does not completely solve the problem of falling into local minima. To overcome this issue we suggest to consider more correction hypotheses. For this purpose we used a method based on the simulated annealing algorithm. We experimentally demonstrate that the proposed method outperforms the baseline noisy channel and iterative spellers.

Many authors employ machine learning to build rankers that compensate for the drawbacks of the noisy channel model: [9, 5]. These techniques can be combined with the proposed method by replacing posterior probability of single correction in our method with an estimate obtained via discriminative training method.

In our work, we focus on isolated word-error correction [7], which, in a sense, is a harder task, than multi-word correction, because there is no context available for misspelled words. For experiments we used single-word queries to a commercial search engine.

## 2 Baseline speller

### 2.1 Noisy channel spelling correction

Noisy channel is a probabilistic model that defines posterior probability $P(q_{0}|q_{1})$ of $q_{0}$ being the intended word, given the observed word $q_{1}$; for such model, the optimal decision rule $\mu$ is the following:

 $\begin{split}\mu(q_{1})&=\arg\max_{q_{0}}P(q_{0}|q_{1});\\ P(q_{0}|q_{1})&\propto P_{\rm dist}({q_{0}}\to q_{1})P_{\rm LM}({q_{0}}),\end{split}$ (1)

where $P_{\rm LM}$ is the source (language) model, and $P_{\rm dist}$ is the error model. Given $P(q_{0}|q_{1})$ defined, to correct the word $q_{1}$ we could iterate through all ever-observed words, and choose the one, that maximizes the posterior probability. However, the practical considerations demand that we do not rank the whole list of words, but instead choose between a limited number of hypotheses $h_{1},...,h_{K}$:

1. 1.

Given $q_{1}$, generate a set of hypotheses $h_{1},...,h_{K}$, such that

 $\sum_{k=1}^{K}P(q_{0}=h_{k}|q_{1})\approx 1;$ (2)
2. 2.

Choose the hypothesis $h_{k}$ that maximizes $P(q_{0}=h_{k}|q_{1})$.

If hypotheses constitute a major part of the posterior probability mass, it is highly unlikely that the intended word is not among them.

### 2.2 Baseline speller setup

In baseline speller we use a substring-based error model $P_{\rm dist}(q_{0}~{}\to~{}q_{1})$ described in [1], the error model training method and the hypotheses generator are similar to [4].

For building language ($P_{\rm LM^{\prime}}$) and error ($P_{\rm dist^{\prime}}$) models, we use words collected from the 6-months query log of a commercial search engine.

Hypotheses generator is based on A* beam search in a trie of words, and yields $K$ hypotheses $h_{k}$, for which the noisy channel scores $P_{\rm dist}(h_{k}~{}\to~{}q_{1})P_{\rm LM}(h_{k})$ are highest possible. Hypotheses generator has high K-best recall (see Section 4.2) — in 91.8% cases the correct hypothesis is found when $K=30$, which confirms the assumption about covering almost all posterior probability mass (see Equation 2).

## 3 Improvements for noisy channel spelling correction

While choosing $\arg\max$ of the posterior probability is an optimal decision rule in theory, in practice it might not be optimal, due to limitations of the language and error modeling. For example, vobemzin is corrected to more frequent misspelling vobenzin (instead of correct form wobenzym) by the noisy channel, because $P_{\rm dist}(vobemzin\to\,wobenzym)$ is too low (see Table 1).

$c$ $-\log P_{\rm dist}$ $-\log P_{\rm LM}$ $\sum$
vobenzin 2.289 31.75 34.04
wobenzym 12.52 26.02 38.54
Table 1: Noisy-channel scores for two corrections of vobemzin

There have been attempts [3] to apply other rules, which would overcome limitations of language and error models with compensating changes described further.

### 3.1 Iterative correction

Iterative spelling correction with $E$ iterations uses standard noisy-channel to correct the query $q$ repeatedly $E$ times. It is motivated by the assumption, that we are more likely to successfully correct the query if we take several short steps instead of one big step [3] .

Iterative correction is hill climbing in the space of possible corrections: on each iteration we make a transition to the best point in the neighbourhood, i.e. to correction, that has maximal posterior probability $P(c|q)$. As any local search method, iterative correction is prone to local minima, stopping before reaching the correct word.

### 3.2 Stochastic iterative correction

A common method of avoiding local minima in optimization is the simulated annealing algorithm, key ideas from which can be adapted for spelling correction task. In this section we propose such an adaptation. Consider: we do not always transition deterministically to the next best correction, but instead transition randomly to a (potentially any) correction with transition probability being equal to the posterior $P(c_{i}|c_{i-1})$, where $c_{i-1}$ is the correction we transition from, $c_{i}$ is the correction we transition to, and $P(\cdot|\cdot)$ is defined by Equation 1. Iterative correction then turns into a random walk: we start at word $c_{0}=q$ and stop after $E$ random steps at some word $c_{E}$, which becomes our answer.

To turn random walk into deterministic spelling correction algorithm, we de-randomize it, using the following transformation. Described random walk defines, for each word $w$, a probability $P(c_{E}=w|q)$ of ending up in $w$ after starting a walk from the initial query $q$. With that probability defined, our correction algorithm is the following: given query $q$, pick $c=\arg\max_{c_{E}}P(c_{E}|q)$ as a correction.

Probability of getting from $c_{0}=q$ to some $c_{E}=c$ is a sum, over all possible paths, of probabilities of getting from $q$ to $c$ via specific path $q=c_{0}\to c_{1}\to...\to c_{E-1}\to c_{E}=c$:

 $P(c_{E}|c_{0})=\sum_{\substack{c_{1}\in W\\ ...\\ c_{E-1}\in W}}\prod_{i=1}^{E}P(c_{i}|c_{i-1}),$ (3)
 $P(c_{i}|c_{i-1})=\frac{P_{\rm dist}(c_{i}\to c_{i-1})P_{\rm LM}(c_{i})}{P_{\rm observe% }(c_{i-1})},$ (4)

where $W$ is the set of all possible words, and $P_{\rm observe}(w)$ is the probability of observing $w$ as a query in the noisy-channel model.

Example: if we start a random walk from vobemzin and make 3 steps, we most probably will end up in the correct form wobenzym with $P=0.361$. A few of the most probable random walk paths are shown in Table 2. Note, that despite the fact that most probable path does not lead to the correct word, many other paths to wobenzym sum up to 0.361, which is greater than probability of any other word. Also note, that the method works only because multiple misspellings of the same word are presented in our model; for related research see [2].

$c_{0}\to{}c_{1}\to{}c_{2}\to{}c_{3}$ $P$
vobemzin$\to$vobenzin$\to$vobenzin$\to$vobenzin 0.074
vobemzin$\to$vobenzim$\to$wobenzym$\to$wobenzym 0.065
vobemzin$\to$vobenzin$\to$vobenzim$\to$vobenzim 0.052
vobemzin$\to$vobenzim$\to$vobenzim$\to$wobenzym 0.034
vobemzin$\to$wobenzym$\to$wobenzym$\to$wobenzym 0.031
vobemzin$\to$wobenzim$\to$wobenzym$\to$wobenzym 0.028
vobemzin$\to$wobenzyn$\to$wobenzym$\to$wobenzym 0.022
Table 2: Most probable random walk paths starting from $c_{0}=q=vobemzin$ (the correct form is in bold).

Also note, that while Equation 3 uses noisy-channel posteriors, the method can use an arbitrary discriminative model, for example the one from [5], and benefit from a more accurate posterior estimate.

This section describes some common heuristic improvements, that, where possible, were applied both to the baseline methods and to the proposed algorithm.

Basic building block of every mentioned algorithm is one-step noisy-channel correction. Each basic correction proceeds as described in Section 2.1: a small number of hypotheses $h_{1},...,h_{K}$ is generated for the query $q$, hypotheses are scored, and scores are recomputed into normalized posterior probabilities (see Equation 5). Posterior probabilities are then either used to pick the best correction (in baseline and simple iterative correction), or are accumulated to later compute the score defined by Equation 3.

 $\begin{split}{\rm score}(h_{i})&=P_{\rm dist}(h_{i}~{}\to~{}q)^{\lambda}P_{\rm LM% }(h_{i})\\ P(h_{i}|q)&={\rm score}(h_{i})\Big/\sum_{j=1}^{K}{\rm score}(h_{j})\end{split}$ (5)

A standard log-linear weighing trick was applied to noisy-channel model components, see e.g. [9]. $\lambda$ is the parameter that controls the trade-off between precision and recall (see Section 4.2) by emphasizing the importance of either the high frequency of the correction or its proximity to the query.

We have also found, that resulting posterior probabilities emphasize the best hypothesis too much: best hypothesis gets almost all probability mass and other hypotheses get none. To compensate for that, posteriors were smoothed by raising each probability to some power $\gamma<1$ and re-normalizing them afterward:

 $P_{\rm smooth}(h_{i}|q)=P(h_{i}|q)^{\gamma}\Big/\sum_{j=1}^{K}P(h_{j}|q)^{% \gamma}.$ (6)

In a sense, $\gamma$ is like temperature parameter in simulated annealing – it controls the entropy of the walk and the final probability distribution. Unlike in simulated annealing, we fix $\gamma$ for all iterations of the algorithm.

Finally, if posterior probability of the best hypothesis was lower than threshold $\alpha$, then the original query $q$ was used as the spell-checker output. (Posterior is defined by Equation 6 for the baseline and simple iterative methods and by Equations 3 and 6 for the proposed method). Parameter $\alpha$ controls precision/recall trade-off (as well as $\lambda$ mentioned above).

## 4 Experiments

### 4.1 Data

To evaluate the proposed algorithm we have collected two datasets. Both datasets were randomly sampled from single-word user queries from the 1-week query log of a commercial search engine. We annotated them with the help of professional analyst. The difference between datasets is that one of them contained only queries with low search performance: for which the number of documents retrieved by the search engine was less than a fixed threshold (we will address it as the ”hard” dataset), while the other dataset had no such restrictions (we will call it ”common”). Dataset statistics are shown in Table 3.

Dataset Queries Misspelled Avg. $-\log P_{\rm dist}$
Common 2240 224 (10%) 5.98
Hard 2542 1484 (58%) 9.23
Table 3: Evaluation datasets.

Increased average error model score and error rate of ”common” dataset compared to ”hard” shows, that we have indeed managed to collect hard-to-correct queries in the ”hard” dataset.

### 4.2 Experimental results

First of all, we evaluated the recall of hypotheses generator using K-best recall — the number of correct spelling corrections for misspelled queries among $K$ hypotheses divided by the total number of misspelled queries in the test set. Resulting recall with $K=30$ is 91.8% on ”hard” and 98.6% on ”common”.

Next, three spelling correction methods were tested: noisy channel, iterative correction and our method (stochastic iterative correction).

For evaluation of spelling correction quality, we use the following metrics:

• Precision: The number of correct spelling corrections for misspelled words generated by the system divided by the total number of corrections generated by the system;

• Recall: The number of correct spelling corrections for misspelled words generated by the system divided by the total number of misspelled words in the test set;

For hypotheses generator, $K=30$ was fixed: recall of 91.8% was considered big enough. Precision/recall tradeoff parameters $\lambda$ and $\alpha$ (they are applicable to each method, including baseline) were iterated by the grid $(0.2,0.25,0.3,...,1.5)\times(0,0.025,0.05,...,1.0)$, and $E$ (applicable to iterative and our method) and $\gamma$ (just our method) were iterated by the grid $(2,3,4,5,7,10)\times(0.1,0.15,...1.0)$; for each set of parameters, precision and recall were measured on both datasets. Pareto frontiers for precision and recall are shown in Figures 1 and 2.

Figure 1: Precision/recall Pareto frontiers on ”hard” dataset
Figure 2: Precision/recall Pareto frontiers on ”common” dataset

We were not able to reproduce superior performance of the iterative method over the noisy channel, reported by [3]. Supposedly, it is because the iterative method benefits primarily from the sequential application of split/join operations altering query decomposition into words; since we are considering only one-word queries, such decomposition does not matter.

On the ”hard” dataset the performance of the noisy channel and the iterative methods is inferior to our proposed method, see Figure 1. We tested all three methods on the ”common” dataset as well to evaluate if our handling of hard cases affects the performance of our approach on the common cases of spelling error. Our method performs well on the common cases as well, as Figure 2 shows. The performance comparison for the ”common” dataset shows comparable performance for all considered methods.

Noisy channel and iterative methods’ frontiers are considerably inferior to the proposed method on ”hard” dataset, which means that our method works better. The results on ”common” dataset show, that the proposed method doesn’t work worse than baseline.

Next, we optimized parameters for each method and each dataset separately to achieve the highest $F_{1}$ measure. Results are shown in Tables 4 and 5. We can see, that, given the proper tuning, our method can work better on any dataset (but it cannot achieve the best performance on both datasets at once). See Tables 4 and 5 for details.

Method $\lambda$ $\alpha$ $\gamma$ $E$ $F_{1}$
Noisy channel 0.6 0.1 - - 55.8
Iterative 0.6 0.1 - 2 55.9
Stochastic iterative 0.9 0.2 0.35 3 62.5
Table 4: Best parameters and $F_{1}$ on ”hard” dataset
Method $\lambda$ $\alpha$ $\gamma$ $E$ $F_{1}$
Noisy channel 0.75 0.225 - - 62.06
Iterative 0.8 0.275 - 2 63.15
Stochastic iterative 1.2 0.4 0.35 3 63.9
Table 5: Best parameters and $F_{1}$ on ”common” dataset

Next, each parameter was separately iterated (by a coarser grid); initial parameters for each method were taken from Table 4. Such iteration serves two purposes: to show the influence of parameters on algorithm performance, and to show differences between datasets: in such setup parameters are virtually tuned using ”hard” dataset and evaluated using ”common” dataset. Results are shown in Table 6.

 $\lambda$ = 0.5 0.6 0.7 0.8 $F_{1}$, common $F_{1}$, hard N.ch. It. Our N.ch. It. Our 45.3 45.9 37.5 54.9 54.8 50.0 49.9 50.5 41.5 55.8 55.9 56.6 50.4 50.4 44.1 54.5 55.1 59.6 52.7 52.7 46.0 52.6 53.0 61.5 53.5 53.5 49.3 50.3 50.6 62.5 55.4 55.0 50.9 47.0 47.3 61.8 53.7 53.4 52.7 44.3 44.6 60.8 52.5 52.5 53.7 41.9 42.3 58.8 52.2 52.6 54.6 39.5 39.9 56.6 51.4 51.8 55.0 36.8 37.3 53.6 41.0 41.5 33.0 52.9 53.1 58.3 49.9 50.6 35.6 55.8 55.9 59.7 59.4 59.8 43.2 55.8 55.6 61.6 60.8 61.3 49.4 51.0 51.0 62.5 54.0 54.0 54.9 46.3 46.3 61.1 46.3 46.3 57.3 39.2 39.2 58.4 25.8 25.8 53.9 22.3 22.3 50.3 50.6 53.6 55.9 60.4 50.6 49.4 55.9 62.5 50.6 46.4 55.9 62.1 50.6 46.7 55.9 60.1 10.1 6.0 49.4 51.5 51.4 61.4 49.4 62.5 47.5 62.0 45.8 60.8 45.2 60.3
Table 6: Per-coordinate iteration of parameters from Table 4; per-method maximum is shown in italic, per-dataset in bold

The proposed method is able to successfully correct distant spelling errors with edit distance of 3 characters (see Table 7).

Query Noisy channel Proposed method
akwamarin akvamarin aquamarine
maccartni maccartni mccartney
ariflaim ariflaim oriflame
epika epica replica
grid grid creed
Table 7: Correction examples for the noisy channel and the proposed method.

However, if our method is applied to shorter and more frequent queries (as opposed to ”hard” dataset), it tends to suggest frequent words as false-positive corrections (for example, grid is corrected to creed – Assassin’s Creed is popular video game). As can be seen in Table 5, in order to fix that, algorithm parameters need to be tuned more towards precision.

## 5 Conclusion and future work

In this paper we introduced the stochastic iterative correction method for spell check corrections. Our experimental evaluation showed that the proposed method improved the performance of popular spelling correction approach – the noisy channel model – in the correction of difficult spelling errors. We showed how to eliminate the local minima issue of simulated annealing and proposed a technique to make our algorithm deterministic.

The experiments conducted on the specialized datasets have shown that our method significantly improves the performance of the correction of hard spelling errors (by 6.6% $F_{1}$) while maintaining good performance on common spelling errors.

In continuation of the work we are considering to expand the method to correct errors in multi-word queries, extend the method to work with discriminative models, and use a query performance prediction method, which tells for a query whether our algorithm needs to be applied.

## References

• [1] E. Brill and R. C. Moore(2000) An improved error model for noisy channel spelling correction. pp. 286–293. Cited by: 1, 2.2.
• [2] M. Choudhury, M. Thomas, A. Mukherjee, A. Basu and N. Ganguly(2007) How difficult is it to develop a perfect spell-checker? a cross-linguistic analysis through complex network approach. pp. 81–88. Cited by: 3.2.
• [3] S. Cucerzan and E. Brill(2004) Spelling correction as an iterative process that exploits the collective knowledge of web users.. Vol. 4, pp. 293–300. Cited by: 1, 3.1, 3, 4.2.
• [4] H. Duan and B. P. Hsu(2011) Online spelling correction for query completion. pp. 117–126. Cited by: 2.2.
• [5] J. Gao, X. Li, D. Micol, C. Quirk and X. Sun(2010) A large scale ranker-based system for search query spelling correction. pp. 358–366. Cited by: 1, 3.2.
• [6] M. D. Kernighan, K. W. Church and W. A. Gale(1990) A spelling correction program based on a noisy channel model. pp. 205–210. Cited by: 1.
• [7] K. Kukich(1992) Techniques for automatically correcting words in text. ACM Computing Surveys (CSUR) 24 (4), pp. 377–439. Cited by: 1.
• [8] E. Mays, F. J. Damerau and R. L. Mercer(1991) Context based spelling correction. Information Processing & Management 27 (5), pp. 517–522. Cited by: 1.
• [9] C. Whitelaw, B. Hutchinson, G. Y. Chung and G. Ellis(2009) Using the web for language independent spellchecking and autocorrection. pp. 890–899. Cited by: 1, 3.3.