# Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora

Xiaolin Wang    Masao Utiyama    Andrew Finch    Eiichiro Sumita
National Institute of Information and Communications Technology
{xiaolin.wang,mutiyama,andrew.finch,eiichiro.sumita}@nict.go.jp
###### Abstract

Unsupervised word segmentation (UWS) can provide domain-adaptive segmentation for statistical machine translation (SMT) without annotated data, and bilingual UWS can even optimize segmentation for alignment. Monolingual UWS approaches of explicitly modeling the probabilities of words through Dirichlet process (DP) models or Pitman-Yor process (PYP) models have achieved high accuracy, but their bilingual counterparts have only been carried out on small corpora such as basic travel expression corpus (BTEC) due to the computational complexity. This paper proposes an efficient unified PYP-based monolingual and bilingual UWS method. Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a 0.96 BLEU relative increase on NTCIR PatentMT corpus which is out-of-domain.

## 1 Introduction

Many languages, especially Asian languages such as Chinese, Japanese and Myanmar, have no explicit word boundaries, thus word segmentation (WS), that is, segmenting the continuous texts of these languages into isolated words, is a prerequisite for many natural language processing applications including SMT.

Though supervised-learning approaches which involve training segmenters on manually segmented corpora are widely used [], yet the criteria for manually annotating words are arbitrary, and the available annotated corpora are limited in both quantity and genre variety. For example, in machine translation, there are various parallel corpora such as BTEC for tourism-related dialogues [] and PatentMT in the patent domain [], but researchers working on Chinese-related tasks often use the Stanford Chinese segmenter [] which is trained on a small amount of annotated news text.

In contrast, UWS, spurred by the findings that infants are able to use statistical cues to determine word boundaries [], relies on statistical criteria instead of manually crafted standards. UWS learns from unsegmented raw text, which are available in large quantities, and thus it has the potential to provide more accurate and adaptive segmentation than supervised approaches with less development effort being required.

The approaches of explicitly modeling the probability of words[] significantly outperformed a heuristic approach [] on the monolingual Chinese SIGHAN-MSR corpus [], which inspired the work of this paper.

However, bilingual approaches that model word probabilities suffer from computational complexity. Xu et al. [] proposed a bilingual method by adding alignment into the generative model, but was only able to test it on small-scale BTEC data. Nguyen et al. [] used the local best alignment to increase the speed of the Gibbs sampling in training but the impact on accuracy was not explored.

This paper is dedicated to bilingual UWS on large-scale corpora to support SMT. To this end, we model bilingual UWS under a similar framework with monolingual UWS in order to improve efficiency, and replace Gibbs sampling with expectation maximization (EM) in training.

We aware that variational bayes (VB) may be used for speeding up the training of DP-based or PYP-based bilingual UWS. However, VB requires formulating the $m$ expectations of $(m-1)$-dimensional marginal distributions, where $m$ is the number of hidden variables. For UWS, the hidden variables are indicators that identify substrings of sentences in the corpus as words. These variables are large in number and it is not clear how to apply VB to UWS, and as far the authors aware there is no previous work related to the application of VB to monolingual UWS. Therefore, we have not explored VB methods in this paper, but we do show that our method is superior to the existing methods.

The contributions of this paper include,

• state-of-the-art accuracy in monolingual UWS;

• the first bilingual UWS method practical for large corpora;

• improvement of BLEU scores compared to supervised Stanford Chinese word segmenter.

## 2 Methods

This section describes our unified monolingual and bilingual UWS scheme. Table 1 lists the main notation. The set $\mathcal{F}$ is chosen to represent an unsegmented foreign language sentence (a sequence of characters), because an unsegmented sentence can be seen as the set of all possible segmentations of the sentence denoted $F$, i.e. $F\in\mathcal{F}$.

Notation Meaning
$\mathcal{F}$ an unsegmented foreign sentence
$\mathcal{F}_{k}^{k^{\prime}}$ unsegmented substring of the un-
derlying string of $\mathcal{F}$ from $k$ to $k^{\prime}$
$F$ a segmented foreign sentence
$f_{j}$ the $j$-th foreign word
$\mathcal{M}$ monolingual segmentation model
$P_{\small\mathcal{M}}(x)$ probability of $x$ being a word ac-
cording to $M$
$E$ a tokenized English sentence
$e_{i}$ the $i$-th English word
($\mathcal{F}$,$E$) a bilingual sentence pair
$\mathcal{B}$ bilingual segmentation model
$P_{\small\mathcal{B}}(x|e_{i})$ probability of $x$ being a word ac-
cording to $B$ given $e_{i}$
Table 1: Main Notation.

Monolingual and bilingual WS can be formulated as follows, respectively,

 $\displaystyle\hat{F}(\mathcal{F})=\mathop{argmax}_{F\in\mathcal{F}}P(F|% \mathcal{F},\mathcal{M}),$ (1) $\displaystyle\hat{F}(\mathcal{F},E)=\mathop{argmax}_{F\in\mathcal{F}}\sum_{a}P% (F,a|\mathcal{F},E,\mathcal{B}),$ (2)

where $a$ is an alignment between $F$ and $E$. The English sentence $E$ is used in the generation of a segmented sentence $F$.

UWS learns models by maximizing the likelihood of the unsegmented corpus, formulated as,

 $\displaystyle\hat{\mathcal{M}}=\mathop{\mathrm{argmax}}_{\mathcal{M}}\prod_{% \mathcal{F}\in\mathbb{F}}\Big(\sum_{F\in\mathcal{F}}P(F|\mathcal{M})\Big),$ (3) $\displaystyle\hat{\mathcal{B}}=\mathop{\mathrm{argmax}}_{\mathcal{B}}\prod_{(% \mathcal{F},E)\in\mathbb{B}}\Big(\sum_{F\in\mathcal{F}}\sum_{a}P(F,a|\mathcal{% F},E,\mathcal{B})\Big).$ (4)

Our method of learning $\mathcal{M}$ and $\mathcal{B}$ proceeds in a similar manner to the EM algorithm. The following two operations are performed iteratively for each sentence (pair).

• Exclude the previous expected counts of the current sentence (pair) from the model, and then derive the current sentence in all possible ways, calculating the new expected counts for the words (see Section 2.1), that is, we calculate the expected probabilities of the $\mathcal{F}_{k}^{k^{\prime}}$ being words given the data excluding $\mathcal{F}$, i.e. $\mathbf{E}_{\mathbb{F}/\{\mathcal{F}\}}(P(\mathcal{F}_{k}^{k^{\prime}}|% \mathcal{F}))=P(\mathcal{F}_{k}^{k^{\prime}}|\mathcal{F},\mathcal{M})$ in a similar manner to the marginalization in the Gibbs sampling process which we are replacing;

• Update the respective model $\mathcal{M}$ or $\mathcal{B}$ according to these expectations (see Section2.2).

### 2.1 Expectation

#### 2.1.1 Monolingual Expectation

$P(\mathcal{F}_{k}^{k^{\prime}}|\mathcal{F},\mathcal{M})$ is the marginal probability of all the possible $F\in\mathcal{F}$ that contain $\mathcal{F}_{k}^{k^{\prime}}$ as a word, which can be calculated efficiently through dynamic programming (the process is similar to the foreward-backward algorithm in training a hidden Markov model (HMM) []):

 $\displaystyle\quad P_{a}(k)=\sum_{u=1}^{U}P_{a}(k-u)P_{\mathcal{M}}(\mathcal{F% }_{k-u}^{k})$ $\displaystyle\quad P_{b}(k^{\prime})=\sum_{u=1}^{U}P_{b}(k^{\prime}+u)P_{% \mathcal{M}}(\mathcal{F}_{k^{\prime}}^{k^{\prime}+u})$ $\displaystyle P(\mathcal{F}_{k}^{k^{\prime}}|\mathcal{F},\mathcal{M})=P_{a}(k)% P_{\mathcal{M}}(\mathcal{F}_{k}^{k^{\prime}})P_{b}(k^{\prime}),$ (5)

where $U$ is the predefined maximum length of foreign language words, $P_{a}(k)$ and $P_{b}(k^{\prime})$ are the forward and backward probabilities, respectively. This section uses a unigram model for description convenience, but the method can be extended to $n$-gram models.

#### 2.1.2 Bilingual Expectation

$P(\mathcal{F}_{k}^{k^{\prime}}|\mathcal{F},E,\mathcal{B})$ is the marginal probability of all the possible $F\in\mathcal{F}$ that contain $\mathcal{F}_{k}^{k^{\prime}}$ as a word and are aligned with $E$, formulated as:

 $\displaystyle P(\mathcal{F}_{k}^{k^{\prime}}|\mathcal{F},E,\mathcal{B})=\sum_{% \substack{F\in\mathcal{F}\\ \mathcal{F}_{k}^{k^{\prime}}\in F}}\sum_{a}P(F,a|E,\mathcal{B})$ $\displaystyle\quad\approx\sum_{\substack{F\in\mathcal{F}\\ F_{j_{k}}=\mathcal{F}_{k}^{k^{\prime}}}}\sum_{a}\prod_{j=1}^{J}P(a_{j}|j,I,J)P% _{\mathcal{B}}(f_{j}|e_{a_{j}})$ $\displaystyle\quad=\sum_{\substack{F\in\mathcal{F}\\ f_{j_{k}}=\mathcal{F}_{k}^{k^{\prime}}}}\prod_{j=1}^{J}\sum_{a}P(a_{j}|j,I,J)P% _{\mathcal{B}}(f_{j}|e_{a_{j}}),$ (6)

where $J$ and $I$ are the number of foreign and English words, respectively, and $a_{j}$ is the position of the English word that is aligned to $f_{j}$ in the alignment $a$. For the alignment we employ an approximation to IBM model 2 [] described below.

We define the conditional probability of $f_{j}$ given the corresponding English sentence $E$ and the model $\mathcal{B}$ as:

 $\displaystyle P_{\mathcal{B}}(f_{j}|E)$ $\displaystyle=\sum_{a}P(a_{j}|j,I,J)P_{\mathcal{B}}(f_{j}|e_{a_{j}})$ (7)

Then, the previous dynamic programming method can be extended to the bilingual expectation

 $\displaystyle P_{a}(k|E)=\sum_{u=1}^{U}P_{a}(k-u|E)P_{\mathcal{B}}(\mathcal{F}% _{k-u}^{k}|E)$ $\displaystyle P_{b}(k^{\prime}|E)=\sum_{u=1}^{U}P_{b}(k^{\prime}+u|E)P_{% \mathcal{B}}(\mathcal{F}_{k^{\prime}}^{k^{\prime}+u}|E)$ $\displaystyle P(F_{k}^{k^{\prime}}|\mathcal{F},E,\mathcal{B})=P_{a}(k|E)P_{% \mathcal{B}}(\mathcal{F}_{k}^{k^{\prime}}|E)P_{b}(k^{\prime}|E).$ (8)

Eq. 7 can be rewritten (as in IBM model 2):

 $\displaystyle P_{\mathcal{B}}(f_{j}|E)$ $\displaystyle=\sum_{i=1}^{I}P^{*}(i|j,I,J)P_{\mathcal{B}}(f_{j}|e_{i})$ (9) $\displaystyle P^{*}(i|j,I,J)$ $\displaystyle=\sum_{a:a_{j}=i}P(a_{j}|,j,I,J)$

In order to maintain both speed and accuracy, the following window function is adopted

 $\displaystyle P^{*}(i|j,I,J)\approx P^{*}(i|k,I,K)=$ $\displaystyle\,\left\{\begin{array}[]{ll}e^{-|i-kI/K|}/\sigma&|i-kI/K|% \leqslant\delta_{b}/2\\ \lambda_{\phi}&e_{i}\textrm{ is empty word}\\ 0&\textrm{otherwise}\\ \end{array}\right.$ (10)

where $K$ is the number of characters in $\mathcal{F}$, and the $k$-th character is the start of the word $f_{j}$, since $j$ and $J$ are unknown during the computation of dynamic programming. $\delta_{b}$ is the window size, $\lambda_{\phi}$ is the prior probability of an empty English word, and $\sigma$ ensures all the items sum to 1.

### 2.2 Maximization

Inspired by [], we employ a Pitman-Yor process model to build the segmentation model $\mathcal{M}$ or $\mathcal{B}$. The monolingual model $\mathcal{M}$ is

 $\displaystyle P_{\mathcal{M}}(f_{j})=$ $\displaystyle\quad\quad\frac{\mathrm{max}\big(n\small(f_{j}\small)-d,0\big)+(% \theta+d\cdot n_{\mathcal{M}})G_{0}\small(f_{j}\small)}{\sum_{f_{j}^{\prime}}n% \small(f_{j}^{\prime}\small)+\theta}$ $\displaystyle\quad n_{\mathcal{M}}=\big|\{f_{j}\,|\,n(f_{j})\geqslant d\}\big|,$ (11)

where $f_{j}$ is a foreign language word, and $n(f_{j})$ is the observed counts of $f_{j}$, $\theta$ is named the strength parameter, $G_{0}(f_{j})$ is named the base distribution of $f_{j}$, and $d$ is the discount.

The bilingual model is

 $\displaystyle P_{\mathcal{B}}(f_{j}|e_{i})=$ $\displaystyle\quad\frac{\mathrm{max}\big(n\small(f_{j},e_{i}\small)-d,0\big)+(% \theta+d\cdot n_{e_{i}})G_{0}\small(f_{j}|e_{i}\small)}{\sum_{f_{j}^{\prime}}n% \small(f_{j}^{\prime},e_{i}\small)+\theta}$ $\displaystyle\qquad\quad n_{e_{i}}=\big|\{x\,|\,n(x,e_{i})\geqslant d\}\big|.$ (12)

In Eqs. 11 and 12,

 $\displaystyle\quad\,n(f_{j})=\sum_{\substack{\mathcal{F}\in\mathbb{F}}}P(f_{j}% |\mathcal{F},\mathcal{M})$ (13) $\displaystyle n(f_{j},e_{i})=$ $\displaystyle\sum_{\substack{(\mathcal{F},E)\in\mathbb{B}}}P(f_{j}|\mathcal{F}% ,E,\mathcal{B})\frac{P^{*}(i|j,I,J)P_{\mathcal{B}}(f_{j}|e_{i})}{\sum_{i^{% \prime}=1}^{I}P^{*}(i^{\prime}|j,I,J)P_{\mathcal{B}}(f_{j}|e_{i^{\prime}})}.$ (14)

## 3 Complexity Analysis

The computational complexity of our method is linear in the number of iterations, the size of the corpus, and the complexity of calculating the expectations on each sentence or sentence pair. In practical applications, the size of the corpus is fixed, and we found empirically that the number of iterations required by the proposed method for convergence is usually small (less than five iterations). We now look in more detail at the complexity of the expectation calculation in monolingual and bilingual models.

The monolingual expectation is calculated according to Eq. 5; the complexity is linear in the length of sentences and the square of the predefined maximum length of words. Thus its overall complexity is

 $O_{\textrm{monoling}}^{\textrm{unigram}}=O(N_{\textrm{i}}|\mathbb{F}|KU^{2}),$ (15)

where $N_{\textrm{i}}$ is the number of iterations, $K$ is the average number of characters per sentence, and $U$ is the predefined maximum length of words.

For the monolingual bigram model, the number of states in the HMM is $U$ times more than that of the monolingual unigram model, as the states at specific position of $F$ are not only related to the length of the current word, but also related to the length of the word before it. Thus its complexity is $U^{2}$ times the unigram model’s complexity:

 $O_{\textrm{monoling}}^{\textrm{bigram}}=O(N_{\textrm{i}}|\mathbb{F}|KU^{4}).$ (16)

The bilingual expectation is given by Eq. 8, whose complexity is the same as the monolingual case. However, the complexity of calculating the transition probability, in Eqs. 9 and 10, is $O(\delta_{b})$. Thus its overall complexity is:

 $O_{\textrm{biling}}^{\textrm{unigram}}=O(N_{\textrm{i}}|\mathbb{F}|KU^{2}% \delta_{b}).$ (17)

## 4 Experiments

In this section, the proposed method is first validated on monolingual segmentation tasks, and then evaluated in the context of SMT to study whether the translation quality, measured by BLEU, can be improved.

### 4.1 Experimental Settings

#### 4.1.1 Experimental Corpora

Two monolingual corpora and two bilingual corpora are used (Table 2). CHILDES [] is the most common test corpus for UWS methods. The SIGHAN-MSR corpus [] consists of manually segmented simplified Chinese news text, released in the SIGHAN bakeoff 2005 shared tasks.

The first bilingual corpus: OpenMT06 was used in the NIST open machine translation 2006 Evaluation . We removed the United Nations corpus and the traditional Chinese data sets from the constraint training resources. The data sets of NIST Eval 2002 to 2005 were used as the development for MERT tuning []. This data set mainly consists of news text 33It also contains a small number of web blogs. PatentMT9 is from the shared task of NTCIR-9 patent machine translation . The training set consists of 1 million parallel sentences extracted from patent documents, and the development set and test set both consist of 2000 sentences.

Corpus Type # Sentences # Characters
CHILDES Mono. 9,790 95,809
SIGHAN-MSR Mono. 90,903 4,234,824
OpenMT06 Biling. 437,004 19,692,605
PatentMT9 Biling. 1,004,000 63,130,757
Table 2: Experimental Corpora

#### 4.1.2 Performance Measurement and Baseline Methods

For the monolingual tasks, the F${}_{1}$ score against the gold annotation is adopted to measure the accuracy. The results reported in related papers are listed for comparison.

For the bilingual tasks, the publicly available system of Moses [] with default settings is employed to perform machine translation, and BLEU [] was used to evaluate the quality. Character-based segmentation, LDC segmenter and Stanford Chinese segmenters were used as the baseline methods.

#### 4.1.3 Parameter settings

The parameters are tuned on held-out data sets. The maximum length of foreign language words is set to 4. For the PYP model, the base distribution adopts the formula in [], and the strength parameter is set to $1.0$, and the discount is set to $1.0\times 10^{-6}$.

For bilingual segmentation,the size of the alignment window is set to $6$; the probability $\lambda_{\phi}$ of foreign language words being generated by an empty English word, was set to $0.3$.

The training was started from assuming that there was no previous segmentations on each sentence (pair), and the number of iterations was fixed. It was set to 3 for the monolingual unigram model, and 2 for the bilingual unigram model, which provided slightly higher BLEU scores on the development set than the other settings. The monolingual bigram model, however, was slower to converge, so we started it from the segmentations of the unigram model, and using 10 iterations.

### 4.2 Monolingual Segmentation Results

In monolingual segmentation, the proposed methods with both unigram and bigram models were tested. Experimental results show that they are competitive to state-of-the-art baselines in both accuracy and speed (Table LABEL:tab:mono:acc). Note that the comparison of speed is only for reference because the times are obtained from their respective papers.

{threeparttable}
 Method Accuracy Time CHILD. MSR CHILD. MSR NPY(bigram)\tnotea 0.750 0.802 17 m – NPY(trigram)\tnotea 0.757 0.807 – – HDP(bigram)\tnoteb 0.723 – 10 h – Fitness\tnotec – 0.667 – – Prop.(unigram) 0.729 0.804 3 s 50 s Prop.(bigram) 0.774 0.806 15 s 2530 s
{tablenotes}
• a

by (Mochihashi et al.,2009);

• b

by (Goldwater et al.,2009);

• c

by (Zhao and Kit, 2008).

• Table 3: Results on Monolingual Corpora.

### 4.3 Bilingual Segmentation Results

Table 4 presents the BLEU scores for Moses using different segmentation methods. Each experiment was performed three times. The proposed method with monolingual bigram model performed poorly on the Chinese monolingual segmentation task; thus, it was not tested. We intended to test [], but found it impracticable on large-scale corpora.

The experimental results show that the proposed UWS methods are comparable to the Stanford segmenters on the OpenMT06 corpus, while achieves a 0.96 BLEU increase on the PatentMT9 corpus. This is because this corpus is out-of-domain for the supervised segmenters. The CTB and PKU Stanford segmenter were both trained on annotated news text, which was the major domain of OpenMT06.

Method BLEU OpenMT06 PatentMT9 29.50 $\pm$ 0.03 28.36 $\pm$ 0.09 31.33 $\pm$ 0.10 30.22 $\pm$ 0.14 31.68 $\pm$ 0.25 30.77 $\pm$ 0.13 31.54 $\pm$ 0.13 30.86 $\pm$ 0.04 31.47 $\pm$ 0.18 31.62 $\pm$ 0.06 31.61 $\pm$ 0.14 31.73 $\pm$ 0.05
Table 4: Results on Bilingual Corpora.

Table 5 presents the run times of the proposed methods on the bilingual corpora. The program is single threaded and implemented in C++. The time cost of the bilingual models is about 5 times that of the monolingual model, which is consistent with the complexity analysis in Section 3.

Method Time
OpenMT06 PatentMT9
Prop.(mono.) 28 m 1 h 01 m
Prop.(biling.) 2 h 25 m 5 h 02 m
Table 5: Time Costs on Bilingual Corpora.

## 5 Conclusion

This paper is devoted to large-scale Chinese UWS for SMT. An efficient unified monolingual and bilingual UWS method is proposed and applied to large-scale bilingual corpora.

Complexity analysis shows that our method is capable of scaling to large-scale corpora. This was verified by experiments on a corpus of 1-million sentence pairs on which traditional MCMC approaches would struggle [].

The proposed method does not require any annotated data, but the SMT system with it can achieve comparable performance compared to state-of-the-art supervised word segmenters trained on precious annotated data. Moreover, the proposed method yields 0.96 BLEU improvement relative to supervised word segmenters on an out-of-domain corpus. Thus, we believe that the proposed method would benefit SMT related to low-resource languages where annotated data are scare, and would also find application in domains that differ too greatly from the domains on which supervised word segmenters were trained.

In future research, we plan to improve the bilingual UWS through applying VB and integrating more accurate alignment models such as HMM models and IBM model 4.