# Two Knives Cut Better Than One: Chinese Word Segmentation with Dual Decomposition

###### Abstract

There are two dominant approaches to Chinese word segmentation: word-based and character-based models, each with respective strengths. Prior work has shown that gains in segmentation performance can be achieved from combining these two types of models; however, past efforts have not provided a practical technique to allow mainstream adoption. We propose a method that effectively combines the strength of both segmentation schemes using an efficient dual-decomposition algorithm for joint inference. Our method is simple and easy to implement. Experiments on SIGHAN 2003 and 2005 evaluation datasets show that our method achieves the best reported results to date on 6 out of 7 datasets.

{CJK}

UTF8gbsn

## 1 Introduction

Chinese text is written without delimiters between words; as a result, Chinese word segmentation (CWS) is an essential foundational step for many tasks in Chinese natural language processing. As demonstrated by [15, 2, 3, 10], the quality and consistency of segmentation has important downstream impacts on system performance in machine translation, POS tagging and parsing.

State-of-the-art performance in CWS is high, with F-scores in the upper 90s. Still, challenges remain. Unknown words, also known as out-of-vocabulary (oov) words, lead to difficulties for word- or dictionary-based approaches. Ambiguity can cause errors when the appropriate segmentation is determined contextually, such as æè½ (“talent”) and æ / è½ (“just able”) [8].

There are two primary classes of models: character-based, where the foundational units for processing are individual Chinese characters [23, 19, 24, 20], and word-based, where the units are full words based on some dictionary or training lexicon [1, 25]. \newciteSun:2010:COLING details their respective theoretical strengths: character-based approaches better model the internal compositional structure of words and are therefore more effective at inducing new oov words; word-based approaches are better at reproducing the words of the training lexicon and can capture information from significantly larger contextual spans. Prior work has shown performance gains from combining these two types of models to exploit their respective strengths, but such approaches are often complex to implement and computationally expensive.

In this work, we propose a simple and principled joint decoding method for combining character-based and word-based segmenters based on dual decomposition. This method has strong optimality guarantees and works very well empirically. It is easy to implement and does not require retraining of existing character- and word-based segmenters. Perhaps most importantly, this work presents a much more practical and usable form of classifier combination in the CWS context than existing methods offer.

Experimental results on standard SIGHAN 2003 and 2005 bake-off evaluations show that our model outperforms the character and word baselines by a significant margin. In particular, out approach improves oov recall rates and segmentation consistency, and gives the best reported results to date on 6 out of 7 datasets.

## 2 Models for CWS

In this section, we describe the character-based and word-based models we use as baselines, review existing approaches to combination, and describe our algorithm for joint decoding with dual decomposition.

### 2.1 Character-based Models

In the most commonly used contemporary approach to character-based segmentation, first proposed by [23], CWS is seen as a character sequence tagging task, where each character is tagged on whether it is at the beginning, middle, or end of a word. Conditional random fields (CRF) [11] have been widely adopted for this task, and give state-of-the-art results [19]. In a first-order linear-chain CRF model, the conditional probability of a label sequence $\mathbf{y}$ given a word sequence $\mathbf{x}$ is defined as:

 $\displaystyle\vspace{-10em}P(\mathbf{y}|\mathbf{x})=\frac{1}{Z}{\sum\limits_{t% =1}^{|\mathbf{y}|}{\exp\left(\theta\cdot{f}(x,y_{t},y_{t+1})\right)}}$

$f(x,y_{t},y_{t-1})$ are feature functions that typically include surrounding character n-gram and morphological suffix/prefix features. These types of features capture the compositional properties of characters and are likely to generalize well to unknown words. However, the Markov assumption in CRF limits the context of such features; it is difficult to capture long-range word features in this model.

### 2.2 Word-based Models

Word-based models search through lists of word candidates using scoring functions that directly assign scores to each. Early word-based segmentation work employed simple heuristics like dictionary-lookup maximum matching [4]. More recently, \newciteZhang:2007:ACL reported success using a linear model trained with the average perceptron algorithm [5]. Formally, given input $\mathbf{x}$, their model seeks a segmentation $F(\mathbf{x})$ such that:

 $\displaystyle F(\mathbf{y}|\mathbf{x})=\underset{\mathbf{y}\in\mathrm{GEN}(% \mathbf{x})}{\operatorname{argmax}}{\left(\alpha\cdot\phi(\mathbf{\mathbf{y}})% \right)}$

Searching through the entire $\mathrm{GEN}(\mathbf{x})$ space is intractable even with a local model, so a beam-search algorithm is used. The search algorithm consumes one character input token at a time, and iterates through the existing beams to score two new alternative hypotheses by either appending the new character to the last word in the beam, or starting a new word at the current position.

{algorithm}

[!ht] Dual decomposition inference algorithm, and modified Viterbi and beam-search algorithms. {algorithmic} \STATE$\forall i\in\{1\;\mathrm{to}\;|\mathbf{x}|\}\colon\;\forall k\in\{0,1\}\colon u% _{i}(k)=0$ \FOR$t\leftarrow 1$ to $T$ \STATE$\mathbf{y^{c*}}=\underset{\mathbf{y}}{\operatorname{argmax}}\;P(\mathbf{y^{% \textit{c}}}|\mathbf{x})+\sum\limits_{i\in|\mathbf{x}|}{u_{i}(y_{i}^{\textit{c% }})}$ \STATE$\mathbf{y^{w*}}=\underset{\mathbf{y}\in\mathrm{GEN}(\mathbf{x})}{\operatorname% {argmax}}\;F(\mathbf{y^{\textit{w}}}|\mathbf{x})-\sum\limits_{j\in|\mathbf{x}|% }{u_{j}(y_{j}^{\textit{w}})}$ \IF $\mathbf{y^{c*}}=\mathbf{y^{w*}}$ \RETURN$\left(\mathbf{y^{\textit{c}*}},\mathbf{y^{\textit{w}*}}\right)$ \ENDIF\FORALL$i\in\{1\;\mathrm{to}\;|\mathbf{x}|\}$ \STATE$\forall k\in\{0,1\}:u_{i}(k)=u_{i}(k)+\alpha_{t}(2k-1)(y_{i}^{w*}-y_{i}^{c*})$ \ENDFOR\ENDFOR\RETURN$\left(\mathbf{y^{c*}},\mathbf{y^{w*}}\right)$
\STATEViterbi: \STATE$V_{1}(1)=1,V_{1}(0)=0$ \FOR$i=2\;\mathrm{to}\;|\mathbf{x}|$ \STATE$\forall k\in\{0,1\}:V_{i}(k)=\underset{\mathbf{k^{\prime}}}{\operatorname{% argmax}}P_{i}(k|k^{\prime})V_{i-1}{k^{\prime}}+u_{i}(k)$ \ENDFOR
\STATEBeam-Search: \FOR$i=1\;\mathrm{to}\;|\mathbf{x}|$ \FOR item $v=\{w_{0},\cdots,w_{j}\}$ in $\mathrm{beam}(i)$ \STATEappend $x_{i}$ to $w_{j}$, $\mathrm{score}(v)\stackrel{+}{=}u_{i}(0)$ \STATE$v=\{w_{0},\cdots,w_{j},x_{i}\}$, $\mathrm{score}(v)\stackrel{+}{=}u_{i}(1)$ \ENDFOR\ENDFOR

 Char-based CRF Word-based Perceptron Dual-decomp Char-based CRF Word-based Perceptron Academia Sinica Peking Univ. R P F${}_{1}$ R${}_{\mathrm{oov}}$ C R P F${}_{1}$ R${}_{\mathrm{oov}}$ C 95.2 93.6 94.4 58.9 0.064 94.6 95.3 94.9 77.8 0.089 95.8 95.0 95.4 69.5 0.060 94.1 95.5 94.8 76.7 0.099 95.9 94.9 95.4 67.7 0.055 94.8 95.7 95.3 78.7 0.086 City Univ. of Hong Kong Microsoft Research R P F${}_{1}$ R${}_{\mathrm{oov}}$ C R P F${}_{1}$ R${}_{\mathrm{oov}}$ C 94.7 94.0 94.3 76.1 0.065 96.4 96.6 96.5 71.3 0.074 94.3 94.0 94.2 71.7 0.073 97.0 97.2 97.1 74.6 0.063 95.0 94.4 94.7 75.3 0.062 97.3 97.4 97.4 76.0 0.055
Table 1: Results on SIGHAN 2005 datasets. R${}_{\mathrm{oov}}$ denotes oov recall, and C denotes segmentation consistency. Best number in each column is highlighted in bold.

### 2.3 Combining Models with Dual Decomposition

Various mixing approaches have been proposed to combine the above two approaches [22, 12, 18, 17, 20]. These mixing models perform well on standard datasets, but are not in wide use because of their high computational costs and difficulty of implementation.

Dual decomposition (DD) [14] offers an attractive framework for combining these two types of models without incurring high costs in model complexity (in contrast to [18]) or decoding efficiency (in contrast to bagging in [22, 17]). DD has been successfully applied to similar situations for combining local with global models; for example, in dependency parsing [9], bilingual sequence tagging [21] and word alignment [6].

The idea is that jointly modelling both character-sequence and word information can be computationally challenging, so instead we can try to find outputs that the two models are most likely to agree on. Formally, the objective of DD is:

 $\displaystyle\max_{\mathbf{y^{\textit{c}}},\mathbf{y^{\textit{w}}}}\;P(\mathbf% {y^{\textit{c}}}|\mathbf{x})+F(\mathbf{y^{\textit{w}}}|\mathbf{x})\;s.t.\;% \mathbf{y^{\textit{c}}}=\mathbf{y^{\textit{w}}}$ (1)

where $\mathbf{y^{\textit{c}}}$ is the output of character-based CRF, $\mathbf{y^{\textit{w}}}$ is the output of word-based perceptron, and the agreements are expressed as constraints. s.t. is a shorthand for “such that”.

Solving this constrained optimization problem directly is difficult. Instead, we take the Lagrangian relaxation of this term as:

 $\displaystyle L\left(\mathbf{y^{\textit{c}}},\mathbf{y^{\textit{w}}},\mathbf{U% }\right)=$ (2) $\displaystyle P(\mathbf{y^{\textit{c}}}|\mathbf{x})+F(\mathbf{y^{\textit{w}}}|% \mathbf{x})+\sum\limits_{i\in|\mathbf{x}|}{u_{i}(y_{i}^{\textit{c}}-y_{i}^{% \textit{w}})}$

where $\mathbf{U}$ is the set of Lagrangian multipliers that consists of a multiplier $u_{i}$ at each word position $i$.

We can rewrite the original objective with the Lagrangian relaxation as:

 $\displaystyle\max_{\mathbf{y^{\textit{c}}},\mathbf{y^{\textit{w}}}}\min_{% \mathbf{U}}\;L\left(\mathbf{y^{\textit{c}}},\mathbf{y^{\textit{w}}},\mathbf{U}\right)$ (3)

We can then form the dual of this problem by taking the $\min$ outside of the $\max$, which is an upper bound on the original problem. The dual form can then be decomposed into two sub-components (the two $\max$ problems in Eq. 4), each of which is local with respect to the set of Lagrangian multipliers:

 $\displaystyle\min_{\mathbf{U}}\Biggl($ $\displaystyle\max_{y^{\textit{c}}}\left[P(\mathbf{y^{\textit{c}}}|\mathbf{x})+% \sum\limits_{i\in|\mathbf{x}|}{u_{i}(y_{i}^{\textit{c}}})\right]$ (4) $\displaystyle+$ $\displaystyle\max_{y^{\textit{w}}}\left[F(\mathbf{y^{\textit{w}}}|\mathbf{x})-% \sum\limits_{j\in|\mathbf{x}|}{u_{j}(y_{j}^{\textit{w}})}\right]\Biggr)$

This method is called dual decomposition (DD) [14]. Similar to previous work [13], we solve this DD problem by iteratively updating the sub-gradient as depicted in Algorithm 2.2.11See \newciteRush:2012:JAIR for a full introduction to DD. In each iteration, if the best segmentations provided by the two models do not agree, then the two models will receive penalties for the decisions they made that differ from the other. This penalty exchange is similar to message passing, and as the penalty accumulates over iterations, the two models are pushed towards agreeing with each other. We also give an updated Viterbi decoding algorithm for CRF and a modified beam-search algorithm for perceptron in Algorithm 2.2. $T$ is the maximum number of iterations before early stopping, and $\alpha_{t}$ is the learning rate at time $t$. We adopt a learning rate update rule from \newciteKoo:2010:EMNLP where $\alpha_{t}$ is defined as ${\frac{1}{N}}$, where $N$ is the number of times we observed a consecutive dual value increase from iteration $1$ to $t$.

## 3 Experiments

We conduct experiments on the SIGHAN 2003 [16] and 2005 [7] bake-off datasets to evaluate the effectiveness of the proposed dual decomposition algorithm. We use the publicly available Stanford CRF segmenter [19]22http://nlp.stanford.edu/software/segmenter.shtml as our character-based baseline model, and reproduce the perceptron-based segmenter from \newciteZhang:2007:ACL as our word-based baseline model.

We adopted the development setting from [25], and used CTB sections 1-270 for training and sections 400-931 for development in hyper-parameter setting; for all results given in tables, the models are trained and evaluated on the standard train/test split for the given dataset. The optimized hyper-parameters used are: $\ell_{2}$ regularization parameter $\lambda$ in CRF is set to $3$; the perceptron is trained for 10 iterations with beam size 200; dual decomposition is run to max iteration of 100 ($T$ in Algo. 2.2) with step size 0.1 ($\alpha_{t}$ in Algo. 2.2).

Beyond standard precision (P), recall (R) and F${}_{1}$ scores, we also evaluate segmentation consistency as proposed by [3], who have shown that increased segmentation consistency is correlated with better machine translation performance. The consistency measure calculates the entropy of segmentation variations — the lower the score the better. We also report out-of-vocabulary recall (R${}_{\mathrm{oov}}$) as an estimation of the model’s generalizability to previously unseen words.

## 4 Results

Table 1 shows our empirical results on SIGHAN 2005 dataset. Our dual decomposition method outperforms both the word-based and character-based baselines consistently across all four subsets in both F${}_{1}$ and oov recall (R${}_{\mathrm{oov}}$). Our method demonstrates a robustness across domains and segmentation standards regardless of which baseline model was stronger. Of particular note is DD’s is much more robust in R${}_{\mathrm{oov}}$, where the two baselines swing a lot. This is an important property for downstream applications such as entity recognition. The DD algorithm is also more consistent, which would likely lead to improvements in applications such as machine translation [3].

The improvement over our word- and character-based baselines is also seen in our results on the earlier SIGHAN 2003 dataset. Table 2 puts our method in the context of earlier systems for CWS. Our method achieves the best reported score on 6 out of 7 datasets.

SIGHAN 2005
AS PU CU MSR
Best 05 95.2 95.0 94.3 96.4
Zhang et al. 06 94.7 94.5 94.6 96.4
Z&C 07 94.6 94.5 95.1 97.2
Sun et al. 09 - 95.2 94.6 97.3
Sun 10 95.2 95.2 95.6 96.9
Dual-decomp 95.4 95.3 94.7 97.4
SIGHAN 2003
Best 03 96.1 95.1 94.0
Peng et al. 04 95.6 94.1 92.8
Z&C 07 96.5 94.0 94.6
Dual-decomp 97.1 95.4 94.9
Table 2: Performance of dual decomposition in comparison to past published results on SIGHAN 2003 and 2005 datasets. Best reported F${}_{1}$ score for each dataset is highlighted in bold. Z&C 07 refers to \newciteZhang:2007:ACL. Best 03, 05 are results of the winning systems for each dataset in the respective shared tasks.

## 5 Discussion and Error Analysis

On the whole, dual decomposition produces state-of-the-art segmentations that are more accurate, more consistent, and more successful at inducing oov words than the baseline systems that it combines. On the SIGHAN 2005 test set, in over 99.1% of cases the DD algorithm converged within 100 iterations, which gives an optimality guarantee. In 77.4% of the cases, DD converged in the first iteration. The number of iterations to convergence histogram is plotted in Figure 1.

#### Error analysis

In many cases the relative confidence of each model means that dual decomposition is capable of using information from both sources to generate a series of correct segmentations better than either baseline model alone. The example below shows a difficult-to-segment proper name comprised of common characters, which results in undersegmentation by the character-based CRF and oversegmentation by the word-based perceptron, but our method achieves the correct middle ground.

 Gloss Tian Yage / ’s / creations . Gold . ç°éå / ç / åä½ . CRF . ç°éåç / åä½ . PCPT . ç°é / å / ç / åä½ . DD . ç°éå / ç / åä½

A powerful feature of the dual decomposition approach is that it can generate correct segmentation decisions in cases where a voting or product-of-experts model could not, since joint decoding allows the sharing of information at decoding time. In the following example, both baseline models miss the contextually clear use of the word ç¹å¿ (“sweets / snack food”) and instead attach ç¹ to the prior word to produce the otherwise common compound ä¸ç¹ç¹ (“a little bit”); dual decomposition allows the model to generate the correct segmentation.

 Gloss Enjoy / a bit of / snack food / , … . Gold . äº«å / ä¸ç¹ / ç¹å¿ / ï¼ . CRF . äº«å / ä¸ç¹ç¹ / å¿ / ï¼ . PCPT . äº«å / ä¸ç¹ç¹ / å¿ / ï¼ . DD . äº«å / ä¸ç¹ / ç¹å¿ / ï¼
We found more than 400 such surprisingly accurate instances in our dual decomposition output.

Finally, since dual decomposition is a method of joint decoding, it is still liable to reproduce errors made by the constituent systems.

Figure 1: No. of iterations till DD convergence.

## 6 Conclusion

In this paper we presented an approach to Chinese word segmentation using dual decomposition for system combination. We demonstrated that this method allows for joint decoding of existing CWS systems that is more accurate and consistent than either system alone, and further achieves the best performance reported to date on standard datasets for the task. Perhaps most importantly, our approach is straightforward to implement and does not require retraining of the underlying segmentation models used. This suggests its potential for broader applicability in real-world settings than existing approaches to combining character-based and word-based models for Chinese word segmentation.

## References

• [1] G. Andrew(2006) A hybrid Markov/semi-Markov conditional random field for sequence segmentation. Cited by: 1.
• [2] M. Bai, K. Chen and J. S. Chang(2008) Improving word alignment by adjusting chinese word segmentation. Cited by: 1.
• [3] P. Chang, M. Galley and C. Manning(2008) Optimizing chinese word segmentation for machine translation performance. Cited by: 1, 3, 4.
• [4] K. Chen and S. Liu(1992) Word identification for mandarin chinese sentences. Cited by: 2.2.
• [5] M. Collins(2002) Discriminative training methods for hidden markov models: theory and experiments with perceptron algorithms. Cited by: 2.2.
• [6] J. DeNero and K. Macherey(2011) Model-based aligner combination using dual decomposition. Cited by: 2.3.
• [7] T. Emerson(2005) The second international Chinese word segmentation bakeoff. Cited by: 3.
• [8] J. Gao, M. Li and C. Huang(2003) Improved source-channel models for Chinese word segmentation. Cited by: 1.
• [9] T. Koo, A. M. Rush, M. Collins, T. Jaakkola and D. Sontag(2010) Dual decomposition for parsing with non-projective head automata. Cited by: 2.3.
• [10] J. K. Kummerfeld, D. Tse, J. R. Curran and D. Klein(2013) An empirical examination of challenges in chinese parsing. Cited by: 1.
• [11] J. Lafferty, A. McCallum and F. Pereira(2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. Cited by: 2.1.
• [12] D. Lin(2009) Combining language modeling and discriminative classification for word segmentation. Cited by: 2.3.
• [13] A. M. Rush and M. Collins(2012) A tutorial on dual decomposition and Lagrangian relaxation for inference in natural language processing. JAIR 45, pp. 305–362. Cited by: 2.3.
• [14] A. M. Rush, D. Sontag, M. Collins and T. Jaakkola(2010) On dual decomposition and linear programming relaxations for natural language processing. Cited by: 2.3, 2.3.
• [15] Y. Shi and M. Wang(2007) A dual-layer crfs based joint decoding method for cascaded segmentation and labeling tasks. Cited by: 1.
• [16] R. Sproat and T. Emerson(2003) The first international Chinese word segmentation bakeoff. Cited by: 3.
• [17] W. Sun(2010) Word-based and character-basedword segmentation models: comparison and combination. Cited by: 2.3, 2.3.
• [18] X. Sun, Y. Zhang, T. Matsuzaki, Y. Tsuruoka and J. Tsujii(2009) A discriminative latent variable chinese segmenter with hybrid word/character information. Cited by: 2.3, 2.3.
• [19] H. Tseng, P. Chang, G. Andrew, D. Jurasfky and C. Manning(2005) A conditional random field word segmenter for sighan bakeoff 2005. Cited by: 1, 2.1, 3.
• [20] K. Wang, C. Zong and K. Su(2010) A character-based joint model for chinese word segmentation. Cited by: 1, 2.3.
• [21] M. Wang, W. Che and C. D. Manning(2013) Joint word alignment and bilingual named entity recognition using dual decomposition. Cited by: 2.3.
• [22] X. Wang, X. Lin, D. Yu, H. Tian and X. Wu(2006) Chinese word segmentation with maximum entropy and n-gram language model. Cited by: 2.3, 2.3.
• [23] N. Xue(2003) Chinese word segmentation as character tagging. International Journal of Computational Linguistics and Chinese Language Processing, pp. 29–48. Cited by: 1, 2.1.
• [24] R. Zhang, G. Kikui and E. Sumita(2006) Subword-based tagging by conditional random fields for Chinese word segmentation. Cited by: 1.
• [25] Y. Zhang and S. Clark(2007) Chinese segmentation with a word-based perceptron algorithm. Cited by: 1, 3.