Polynomial Time Joint Structural Inference for Sentence Compression

Xian Qian    Yang Liu
The University of Texas at Dallas
800 W. Campbell Rd., Richardson, TX, USA
{qx,yangl}@hlt.utdallas.edu
Abstract

We propose two polynomial time inference algorithms to compress sentences under bigram and dependency-factored objectives. The first algorithm is exact and requires $O(n^{6})$ running time. It extends Eisner’s cubic time parsing algorithm by using virtual dependency arcs to link deleted words. Two signatures are added to each span, indicating the number of deleted words and the rightmost kept word within the span. The second algorithm is a fast approximation of the first one. It relaxes the compression ratio constraint using Lagrangian relaxation, and thereby requires $O(n^{4})$ running time. Experimental results on the popular sentence compression corpus demonstrate the effectiveness and efficiency of our proposed approach.

1 Introduction

Sentence compression aims to shorten a sentence by removing uninformative words to reduce reading time. It has been widely used in compressive summarization [9, 8, 10, 3, 15]. To make the compressed sentence readable, some techniques consider the n-gram language models of the compressed sentence [4, 12]. Recent studies used a subtree deletion model for compression [1, 13, 15], which deletes a word only if its modifier in the parse tree is deleted. Despite its empirical success, such a model fails to generate compressions that are not subject to the subtree constraint (see Figure 1). In fact, we parsed the Edinburgh sentence compression corpus using the MSTparser11http://sourceforge.net/projects/mstparser/, and found that $2561$ of $5379$ sentences ($47.6\%$) do not satisfy the subtree deletion model.

Figure 1: The compressed sentence is not a subtree of the original sentence. Words in gray are removed.

Methods beyond the subtree model are also explored. Trevor et al. proposed synchronous tree substitution grammar [5], which allows local distortion of the tree topology and can thus naturally capture structural mismatches. [7, 17] proposed the joint compression model, which simultaneously considers the n-gram model and dependency parse tree of the compressed sentence. However, the time complexity greatly increases since the parse tree dynamically depends on the compression. They used Integer Linear Programming (ILP) for inference which requires exponential running time in the worst case.

In this paper, we propose a new exact decoding algorithm for the joint model using dynamic programming. Our method extends Eisner’s cubic time parsing algorithm by adding signatures to each span, which indicate the number of deleted words and the rightmost kept word within the span, resulting in $O(n^{6})$ time complexity and $O(n^{4})$ space complexity. We further propose a faster approximate algorithm based on Lagrangian relaxation, which has $TO(n^{4})$ running time and $O(n^{3})$ space complexity ($T$ is the iteration number in the subgradient decent algorithm). Experiments on the popular Edinburgh dataset show that the proposed approach is 10 times faster than a high-performance commercial ILP solver.

Figure 2: Graph illustration for the objective function. In this example, words $x_{2}$, $x_{i},x_{i+1}$, $x_{j}$ are kept, others are deleted. The value of the objective function is $w^{\text{tok}}_{2}+w^{\text{tok}}_{i}+w^{\text{tok}}_{i+1}+w^{\text{tok}}_{j}$ + $w^{\text{dep}}_{0i}+w^{\text{dep}}_{i2}+w^{\text{dep}}_{ii+1}+w^{\text{dep}}_{% ij}+w^{\text{bgr}}_{2i}+w^{\text{bgr}}_{ii+1}+w^{\text{bgr}}_{i+1j}$.

We define the sentence compression task as: given a sentence composed of $n$ words, $\mathbf{x}=x_{1},\dots,x_{n}$, and a length $L\leq n$, we need to remove $(n-L)$ words from $\mathbf{x}$, so that the sum of the weights of the dependency tree and word bigrams of the remaining part is maximized. Formally, we solve the following optimization problem:

 $\displaystyle\max_{\mathbf{z},\mathbf{y}}$ $\displaystyle\sum_{i}w^{\text{tok}}_{i}z_{i}+\sum_{i,j}w^{\text{dep}}_{ij}z_{i% }z_{j}y_{ij}$ (1) $\displaystyle+\sum_{i s.t. $\displaystyle\mathbf{z}\text{\ is\ binary\ },\sum_{i}z_{i}=L$ $\displaystyle\mathbf{y}\text{\ is\ a\ projective\ parse\ tree\ over\ the}$ $\displaystyle\quad\text{subgraph:\ }\{x_{i}|z_{i}=1\}$

where $\mathbf{z}$ is a binary vector, $z_{i}$ indicates $x_{i}$ is kept or not. $\mathbf{y}$ is a square matrix denoting the projective dependency parse tree over the remaining words, $y_{ij}$ indicates if $x_{i}$ is the head of $x_{j}$ (note that each word has exactly one head). $w^{\text{tok}}_{i}$ is the informativeness of $x_{i}$, $w^{\text{bgr}}_{ij}$ is the score of bigram $x_{i}x_{j}$ in an n-gram model, $w^{\text{dep}}$ is the score of dependency arc $x_{i}\rightarrow x_{j}$ in an arc-factored dependency parsing model. Hence, the first part of the objective function is the total score of the kept words, the second and third parts are the scores of the parse tree and bigrams of the compressed sentence, $z_{i}z_{j}\prod_{i indicates both $x_{i}$ and $x_{j}$ are kept, and are adjacent after compression. A graph illustration of the objective function is shown in Figure 2.

3 Proposed Method

3.1 Eisner’s Cubic Time Parsing Algorithm

Throughout the paper, we assume that all the parse trees are projective. Our method is a generalization of Eisner’s dynamic programming algorithm [6], where two types of structures are used in each iteration, incomplete spans and complete spans. A span is a subtree over a number of consecutive words, with the leftmost or the rightmost word as its root. An incomplete span denoted as $I^{i}_{j}$ is a subtree inside a single arc $x_{i}\rightarrow x_{j}$, with root $x_{i}$. A complete span is denoted as $C^{i}_{j}$, where $x_{i}$ is the root of the subtree, and $x_{j}$ is the furthest descendant of $x_{i}$.

Eisner’s algorithm searches the optimal tree in a bottom up order. In each step, it merges two adjacent spans into a larger one. There are two rules for merging spans: one merges two complete spans into an incomplete span, the other merges an incomplete span and a complete span into a large complete span.

3.2 Exact $O(n^{6})$ Time Algorithm

Figure 3: Connect deleted words using virtual arcs.

First we consider an easy case, where the bigram scores $w^{\text{bgr}}_{ij}$ in the objective function are ignored.

The scores of unigrams $w_{i}^{\text{tok}}$ can be transfered to the dependency arcs, so that we can remove all linear terms $w^{\text{tok}}_{i}z_{i}$ from the objective function. That is:

 $\displaystyle\sum_{i}w^{\text{tok}}_{i}z_{i}+\sum_{i,j}w^{\text{dep}}_{ij}z_{i% }z_{j}y_{ij}$ $\displaystyle=$ $\displaystyle\sum_{i,j}(w^{\text{dep}}_{ij}+w^{\text{tok}}_{j})z_{i}z_{j}y_{ij}$

This can be easily verifed. If $z_{j}=0$, then in both equations, all terms having $z_{j}$ are zero; If $z_{j}=1$, i.e., $x_{j}$ is kept, since it has exactly one head word $x_{k}$ in the compressed sentence, the sum of the terms having $z_{j}$ is $w_{j}^{\text{tok}}+w^{\text{dep}}_{kj}$ for both equations.

Therefore, we only need to consider the scores of arcs. For any compressed sentence, we could augment its dependency tree by adding a virtual arc $i-1\rightarrow i$ for each deleted word $x_{i}$. If the first word $x_{1}$ is deleted, we connect it to the root of the parse tree $x_{0}$, as shown in Figure 3. In this way, we derive a full parse tree of the original sentence. This is a one-to-one mapping. We can reversely get the the compressed parse tree by removing all virtual arcs from the full parse tree. We restrict the score of all the virtual arcs to be zero, so that scores of the two parse trees are equivalent.

Now the problem is to search the optimal full parse tree with $n-L$ virtual arcs.

Figure 4: Merging rules for dependency-factored sentence compression. Incomplete spans and complete spans are represented by trapezoids and triangles respectively.

We modify Eisner’s algorithm by adding a signature to each span indicating the number of virtual arcs within the span. Let $I^{i}_{j}(k)$ and $C^{i}_{j}(k)$ denote the incomplete and complete spans with $k$ virtual arcs respectively. When merging two spans, there are 4 cases, as shown in Figure 4.

• Case 1 Link two complete spans by a virtual arc : $I^{i}_{i+1}(1)=C^{i}_{i}(0)+C^{i+1}_{i+1}(0)$.

The two complete spans must be single words, as the length of the virtual arc is 1.

• Case 2 Link two complete spans by a non-virtual arc: $I^{i}_{j}(k)=C^{i}_{r}(k^{\prime})+C^{j}_{r+1}(k^{\prime\prime}),k^{\prime}+k^% {\prime\prime}=k$.

• Case 3 Merge an incomplete span and a complete span. The incomplete span is covered by a virtual arc: $I^{i}_{j}(j-i)=I^{i}_{i+1}(1)+C^{i+1}_{j}(j-i-1)$. The number of the virtual arcs within $C^{i+1}_{j}$ must be $j-i-1$, since the descendants of the modifier of a virtual arc $x_{j}$ must be removed.

• Case 4 Merge an incomplete span and a complete span. The incomplete span is covered by a non-virtual arc: $C^{i}_{j}(k)=I^{i}_{r}(k^{\prime})+C^{r}_{j}(k^{\prime\prime}),k^{\prime}+k^{% \prime\prime}=k$.

The score of the new span is the sum of the two spans. For case 2, the weight of the dependency arc $i\rightarrow j$, $w^{\text{dep}}_{ij}$ is also added to the final score. The root node is allowed to have two modifiers: one is the modifier in the compressed sentence, the other is the first word if it is removed.

For each combination, the algorithm enumerates the number of virtual arcs in the left and right spans, and the split position (e.g., $k^{\prime},k^{\prime\prime},r$ in case 2), thus it takes $O(n^{3})$ running time. The overall time complexity is $O(n^{5})$ and the space complexity is $O(n^{3})$.

Next, we consider the bigram scores. The following proposition is obvious.

Proposition 1.

For any right-headed span $I^{i}_{j}$ or $C^{i}_{j}$, $i>j$, words $x_{i},x_{j}$ must be kept.

Proof.

Suppose $x_{j}$ is removed, there must be a virtual arc $j-1\rightarrow j$ which is a conflict with the fact that $x_{j}$ is the leftmost word. As $x_{j}$ is a descendant of $x_{i}$, $x_{i}$ must be kept. ∎

When merging two spans, a new bigram is created, which connects the rightmost kept words in the left span and the leftmost kept word in the right span. According to the proposition above, if the right span is right-headed, its leftmost word is kept. If the right span is left-headed, there are two cases: its leftmost word is kept, or no word in the span is kept. In any case, we only need to consider the leftmost word in the right span.

Let $I^{i}_{j}(k,p)$ and $C^{i}_{j}(k,p)$ denote the single and complete span with $k$ virtual arcs and the rightmost kept word $x_{p}$. According to the proposition above, we have, for any right-headed span $p=i$.

We slightly modify the two merging rules above, and obtain:

• Case 2’ Link two complete spans by a non-virtual arc: $I^{i}_{j}(k,j)=C^{i}_{r}(k^{\prime},p)+C^{j}_{r+1}(k^{\prime\prime},j),k^{% \prime}+k^{\prime\prime}=k$. The score of the new span is the sum of the two spans plus $w^{\text{dep}}_{ij}+w^{\text{bgr}}_{p,r+1}$.

• Case 4’ Merge an incomplete span and a complete span. The incomplete span is covered by a non-virtual arc. For left-headed spans, the rule is $C^{i}_{j}(k,q)=I^{i}_{r}(k^{\prime},p)+C^{r}_{j}(k^{\prime\prime},q),k^{\prime% }+k^{\prime\prime}=k$, and the score of the new span is the sum of the two spans plus $w^{\text{bgr}}_{pr}$; for right-headed spans, the rule is $C^{i}_{j}(k,i)=I^{i}_{r}(k^{\prime},i)+C^{r}_{j}(k^{\prime\prime},r)$, and the score of the new span is the sum of the two spans.

The modified algorithm requires $O(n^{6})$ running time and $O(n^{4})$ space complexity.

3.3 Approximate $O(n^{4})$ Time Algorithm

In this section, we propose an approximate algorithm where the length constraint $\sum_{i}z_{i}=L$ is relaxed by Lagrangian Relaxation. The relaxed version of Problem (1) is

 $\displaystyle\min_{\lambda}\max_{\mathbf{z},\mathbf{y}}$ $\displaystyle\sum_{i}w^{\text{tok}}_{i}z_{i}+\sum_{i,j}w^{\text{dep}}_{ij}z_{i% }z_{j}y_{ij}$ (2) $\displaystyle+\sum_{i $\displaystyle+\lambda(\sum_{i}z_{i}-L)$ s.t. $\displaystyle\mathbf{z}\text{\ is\ binary\ }$ $\displaystyle\mathbf{y}\text{\ is\ a\ projective\ parse\ tree\ over\ the}$ $\displaystyle\quad\text{subgraph:\ }\{x_{i}|z_{i}=1\}$

Fixing $\lambda$, the optimal $\mathbf{z},\mathbf{y}$ can be found using a simpler version of the algorithm above. We drop the signature of the virtual arc number from each span, and thus obtain an $O(n^{4})$ time algorithm. Space complexity is $O(n^{3})$. Fixing $\mathbf{z},\mathbf{y}$, the dual variable is updated by

 $\displaystyle\lambda=\lambda+\alpha(L-\sum_{i}z_{i})$

where $\alpha>0$ is the learning rate. In this paper, our choice of $\alpha$ is the same as [16].

4 Experiments

4.1 Data and Settings

We evaluate our method on the data set from [4]. It includes 82 newswire articles with manually produced compression for each sentence. We use the same partitions as [10], i.e., 1,188 sentences for training and 441 for testing.

Our model is discriminative – the scores of the unigrams, bigrams and dependency arcs are the linear functions of features, that is, $w_{i}^{\text{tok}}=\mathbf{v}^{T}\mathbf{f}(x_{i})$, where $\mathbf{f}$ is the feature vector of $x_{i}$, and $\mathbf{v}$ is the weight vector of features. The learning task is to estimate the feature weight vector based on the manually compressed sentences.

We run a second order dependency parser trained on the English Penn Treebank corpus to generate the parse trees of the compressed sentences. Then we augment these parse trees by adding virtual arcs and get the full parse trees of their corresponding original sentences. In this way, the annoation is transformed into a set of sentences with their augmented parse trees. The learning task is similar to training a parser. We run a CRF based POS tagger to generate POS related features.

We adopt the compression evaluation metric as used in [10] that measures the macro F-measure for the retained unigrams ($F_{ugr}$), and the one used in [4] that calculates the F1 score of the grammatical relations labeled by RASP [2].

We compare our method with other 4 state-of-the-art systems. The first is linear chain CRFs, where the compression task is casted as a binary sequence labeling problem. It usually achieves high unigram F1 score but low grammatical relation F1 score since it only considers the local interdependence between adjacent words. The second is the subtree deletion model [1] which is solved by integer linear programming (ILP)22We use Gurobi as the ILP solver in the paper. http://www.gurobi.com/. The third one is the bigram model proposed by McDonald [12] which adopts dynamic programming for efficient inference. The last one jointly infers tree structures alongside bigrams using ILP [17]. For fair comparison, systems were restricted to produce compressions that matched their average gold compression rate if possible.

 Features for unigram $x_{i}$ $w_{i-2}$, $w_{i-1}$, $w_{i}$, $w_{i+1}$, $w_{i+2}$ $t_{i-2}$, $t_{i-1}$, $t_{i}$, $t_{i+1}$, $t_{i+2}$ $w_{i}t_{i}$ $w_{i-1}w_{i}$, $w_{i}w_{i+1}$ $t_{i-2}t_{i-1}$, $t_{i-1}t_{i}$, $t_{i}t_{i+1}$, $t_{i+1}t_{i+2}$ $t_{i-2}t_{i-1}t_{i}$, $t_{i-1}t_{i}t_{i+1}$, $t_{i}t_{i+1}t_{i+2}$ whether $w_{i}$ is a stopword Features for selected bigram $x_{i}x_{j}$ distance between the two words: $j-i$ $w_{i}w_{j}$, $w_{i-1}w_{j}$, $w_{i+1}w_{j}$, $w_{i}w_{j-1}$, $w_{i}w_{j+1}$ $t_{i}t_{j}$, $t_{i-1}t_{j}$, $t_{i+1}t_{j}$, $t_{i}t_{j-1}$, $t_{i}t_{j+1}$ Concatenation of the templates above $\{t_{i}t_{k}t_{j}|i Dependency Features for arc $x_{h}\rightarrow x_{m}$ distance between the head and modifier $h-m$ dependency type direction of the dependency arc (left/right) $w_{h}w_{m}$, $w_{h-1}w_{m}$, $w_{h+1}w_{m}$, $w_{h}w_{m-1}$, $w_{h}w_{m+1}$ $t_{h}t_{m}$, $t_{h-1}t_{m}$, $t_{h+1}t_{m}$, $t_{h}t_{m-1}$, $t_{h}t_{m+1}$ $t_{h-1}t_{h}t_{m-1}t_{m}$, $t_{h}t_{h+1}t_{m-1}t_{m}$ $t_{h-1}t_{h}t_{m}t_{m+1}$, $t_{h}t_{h+1}t_{m}t_{m+1}$ Concatenation of the templates above $\{t_{h}t_{k}t_{m}|x_{k}$ lies between $x_{h}$ and $x_{m}\}$
Table 1: Feature templates. $w_{i}$ denotes the word form of token $x_{i}$ and $t_{i}$ denotes the POS tag of $x_{i}$.

4.2 Features

Three types of features are used to learn our model: unigram features, bigram features and dependency features, as shown in Table 1. We also use the in-between features proposed by [11], which were shown to be very effective for dependency parsing.

System C Rate $F_{uni}$ RASP Sec.
Ours(Approx) $0.68$ 0.802 0.598 0.056
Ours(Exact) $0.68$ $0.805$ $0.599$ $0.610$
Subtree $0.68$ $0.761$ $0.575$ $0.022$
TM13 $0.68$ $0.804$ $0.599$ $0.592$
McDonald06 $0.71$ $0.776$ $0.561$ $0.010$
CRFs $0.73$ $0.790$ $0.501$ $0.002$
Table 2: Comparison results under various quality metrics, including unigram F1 score ($F_{uni}$), syntactic F1 score (RASP), and compression speed (seconds per sentence). C Rate is the compression ratio of the system generated output. For fair comparison, systems were restricted to produce compressions that matched their average gold compression rate if possible.

4.3 Results

We show the comparison results in Table 2. As expected, the joint models (ours and TM13) consistently outperform the subtree deletion model, since the joint models do not suffer from the subtree restriction. They also outperform McDonald’s, demonstrating the effectiveness of considering the grammar structure for compression. It is not surprising that CRFs achieve high unigram F scores but low syntactic F scores as they do not consider the fluency of the compressed sentence.

Compared with TM13’s system, our model with exact decoding is not significantly faster due to the high order of the time complexity. On the other hand, our approximate approach is much more efficient, about 10 times faster than TM13’ system, and achieves competitive accuracy with the exact approach. Note that it is worth pointing out that the exact approach can output compressed sentences of all lengths, whereas the approximate method can only output one sentence at a specific compression rate.

5 Conclusion

In this paper, we proposed two polynomial time decoding algorithms using joint inference for sentence compression. The first one is an exact dynamic programming algorithm, and requires $O(n^{6})$ running time. This one does not show significant advantage in speed over ILP. The second one is an approximation of the first algorithm. It adopts Lagrangian relaxation to eliminate the compression ratio constraint, yielding lower time complexity $TO(n^{4})$. In practice it achieves nearly the same accuracy as the exact one, but is much faster.33Our code is available at http://code.google.com/p/sent-compress/

The main assumption of our method is that the dependency parse tree is projective, which is not true for some other languages. In that case, our method is invalid, but [17] still works. In the future, we will study the non-projective cases based on the recent parsing techniques for 1-endpoint-crossing trees [14].

Acknowledgments

We thank three anonymous reviewers for their valuable comments. This work is partly supported by NSF award IIS-0845484 and DARPA under Contract No. FA8750-13-2-0041. Any opinions expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

References

• [1] T. Berg-Kirkpatrick, D. Gillick and D. Klein(2011-06) Jointly learning to extract and compress. pp. 481–490. Cited by: 1, 4.1.
• [2] T. Briscoe and J. Carroll(2002) Robust accurate statistical annotation of general text. External Links: Link Cited by: 4.1.
• [3] Y. Chali and S. A. Hasan(2012) On the effectiveness of using sentence compression models for query-focused multi-document summarization. See , pp. 457–474. Cited by: 1.
• [4] J. Clarke and M. Lapata(2008) Global inference for sentence compression: an integer linear programming approach. J. Artif. Intell. Res. (JAIR) 31, pp. 399–429. Cited by: 1, 4.1, 4.1.
• [5] T. Cohn and M. Lapata(2009-04) Sentence compression as tree transduction. J. Artif. Int. Res. 34 (1), pp. 637–674. External Links: ISSN 1076-9757, Link Cited by: 1.
• [6] J. M. Eisner(1996) Three new probabilistic models for dependency parsing: an exploration. COLING ’96, Stroudsburg, PA, USA, pp. 340–345. External Links: Cited by: 3.1.
• [7] P. Genest and G. Lapalme(2012) Fully abstractive approach to guided summarization. pp. 354–358. Cited by: 1.
• [8] C. Li, F. Liu, F. Weng and Y. Liu(2013-10) Document summarization via guided sentence compression. Cited by: 1.
• [9] F. Liu and Y. Liu(2009-08) From extractive to abstractive meeting summaries: can it be done by sentence compression?. pp. 261–264. Cited by: 1.
• [10] A. F. T. Martins and N. A. Smith(2009) Summarization with a joint model for sentence extraction and compression. pp. 1–9. Cited by: 1, 4.1, 4.1.
• [11] R. McDonald, K. Crammer and F. Pereira(2005) Online large-margin training of dependency parsers. Cited by: 4.2.
• [12] R. McDonald(2006-04) Discriminative Sentence Compression with Soft Syntactic Constraints. Cited by: 1, 4.1.
• [13] H. Morita, R. Sasano, H. Takamura and M. Okumura(2013-08) Subtree extractive summarization via submodular maximization. pp. 1023–1032. Cited by: 1.
• [14] E. Pitler, S. Kannan and M. Marcus(2013) Finding optimal 1-endpoint-crossing trees. Cited by: 5.
• [15] X. Qian and Y. Liu(2013-10) Fast joint compression and summarization via graph cuts. pp. 1492–1502. Cited by: 1.
• [16] A. M. Rush, D. Sontag, M. Collins and T. Jaakkola(2010-10) On dual decomposition and linear programming relaxations for natural language processing. pp. 1–11. Cited by: 3.3.
• [17] K. Thadani and K. McKeown(2013-08) Sentence compression with joint structural inference. Cited by: 1, 4.1, 5.