# Single Document Summarization based on Nested Tree Structure

Yuta Kikuchi${}^{\dagger}$ &Tsutomu Hirao${}^{\ddagger}$ &Hiroya Takamura${}^{\dagger}$
${}^{\dagger}$Tokyo Institute of technology
4295, Nagatsuta, Midori-ku, Yokohama, 226-8503, Japan
{kikuchi,takamura,oku}@lr.pi.titech.ac.jp
${}^{\ddagger}$NTT Communication Science Laboratories, NTT Corporation
2-4, Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0237, Japan
{hirao.tsutomu,nagata.masaaki}@lab.ntt.co.jp &Manabu Okumura${}^{\dagger}$ &Masaaki Nagata${}^{\ddagger}$
###### Abstract

Many methods of text summarization combining sentence selection and sentence compression have recently been proposed. Although the dependency between words has been used in most of these methods, the dependency between sentences, i.e., rhetorical structures, has not been exploited in such joint methods. We used both dependency between words and dependency between sentences by constructing a nested tree, in which nodes in the document tree representing dependency between sentences were replaced by a sentence tree representing dependency between words. We formulated a summarization task as a combinatorial optimization problem, in which the nested tree was trimmed without losing important content in the source document. The results from an empirical evaluation revealed that our method based on the trimming of the nested tree significantly improved the summarization of texts.

## 1 Introduction

Extractive summarization is one well-known approach to text summarization and extractive methods represent a document (or a set of documents) as a set of some textual units (e.g., sentences, clauses, and words) and select their subset as a summary. Formulating extractive summarization as a combinational optimization problem greatly improves the quality of summarization [16, 8, 20]. There has recently been increasing attention focused on approaches that jointly optimize sentence extraction and sentence compression [21, 19, 17, 10, 2, 3]. We can only extract important content by trimming redundant parts from sentences.

However, as these methods did not include the discourse structures of documents, the generated summaries lacked coherence. It is important for generated summaries to have a discourse structure that is similar to that of the source document. Rhetorical Structure Theory (RST) [14] is one way of introducing the discourse structure of a document to a summarization task [15, 6, 11]. Hirao et al. recently transformed RST trees into dependency trees and used them for single document summarization [11]. They formulated the summarization problem as a tree knapsack problem with constraints represented by the dependency trees.

Figure 1: Overview of our method. The source document is represented as a nested tree. Our method simultaneously selects a rooted document subtree and sentence subtree from each node.

We propose a method of summarizing a single document that utilizes dependency between sentences obtained from rhetorical structures and dependency between words obtained from a dependency parser. We have explained our method with an example in Figure 1. First, we represent a document as a nested tree, which is composed of two types of tree structures: a document tree and a sentence tree. The document tree is a tree that has sentences as nodes and head modifier relationships between sentences obtained by RST as edges. The sentence tree is a tree that has words as nodes and head modifier relationships between words obtained by the dependency parser as edges. We can build the nested tree by regarding each node of the document tree as a sentence tree. Finally, we formulate the problem of single document summarization as that of combinatorial optimization, which is based on the trimming of the nested tree. Our method jointly utilizes relations between sentences and relations between words, and extracts a rooted document subtree from a document tree whose nodes are arbitrary subtrees of the sentence tree.

Elementary Discourse Units (EDUs) in RST are defined as the minimal building blocks of discourse. EDUs roughly correspond to clauses. Most methods of summarization based on RST use EDUs as extraction textual units. We converted the rhetorical relations between EDUs to the relations between sentences to build the nested tree structure. We could thus take into account both relations between sentences and relations between words.

## 2 Related work

Extracting a subtree from the dependency tree of words is one approach to sentence compression [21, 19, 17, 10]. However, these studies have only extracted rooted subtrees from sentences. We allowed our model to extract a subtree that did not include the root word (See the sentence with an asterisk $*$ in Figure 1). The method of Filippova and Strube [9] allows the model to extract non-rooted subtrees in sentence compression tasks that compress a single sentence with a given compression ratio. However, it is not trivial to apply their method to text summarization because no compression ratio is given to sentences. None of these methods use the discourse structures of documents.

Daumé III and Marcu [6] proposed a noisy-channel model that used RST. Although their method generated a well-organized summary, no optimality of information coverage was guaranteed and their method could not accept large texts because of the high computational cost. In addition, their method required large sets of data to calculate the accurate probability. There have been some studies that have used discourse structures locally to optimize the order of selected sentences [18, 5].

 - The scare over Alar, a growth regulator - that makes apples redder and crunchier - but may be carcinogenic, - made consumers shy away from the Delicious, - though they were less affected than the McIntosh.
Figure 2: Example of one sentence. Each line corresponds to one EDU.

## 3 Generating summary from nested tree

### 3.1 Building Nested Tree with RST

A document in RST is segmented into EDUs and adjacent EDUs are linked with rhetorical relations to build an RST-Discourse Tree (RST-DT) that has a hierarchical structure of the relations. There are 78 types of rhetorical relations between two spans, and each span has one of two aspects of a nucleus and a satellite. The nucleus is more salient to the discourse structure, while the other span, the satellite, represents supporting information. RST-DT is a tree whose terminal nodes correspond to EDUs and whose nonterminal nodes indicate the relations. Hirao et al. converted RST-DTs into dependency-based discourse trees (DEP-DTs) whose nodes corresponded to EDUs and whose edges corresponded to the head modifier relationships of EDUs. See Hirao et al. for details [11].

Our model requires sentence-level dependency. Fortunately we can simply convert DEP-DTs to obtain dependency trees between sentences. We specifically merge EDUs that belong to the same sentence. Each sentence has only one root EDU that is the parent of all the other EDUs in the sentence. Each root EDU in a sentence has the parent EDU in another sentence. Hence, we can determine the parent-child relations between sentences. As a result, we obtain a tree that represents the parent-child relations of sentences, and we can use it as a document tree. After the document tree is obtained, we use a dependency parser to obtain the syntactic dependency trees of sentences. Finally, we obtain a nested tree.

### 3.2 ILP formulation

 max. $\displaystyle\displaystyle\sum_{i}^{n}\sum_{j}^{m_{i}}w_{ij}z_{ij}$ s.t. $\displaystyle\sum_{i}^{n}\sum_{j}^{m_{i}}z_{ij}\leq L;$ (1) $\displaystyle x_{\text{parent}(i)}\geq x_{i};$ $\displaystyle\forall i$ (2) $\displaystyle z_{\text{parent}(i,j)}-z_{ij}+r_{ij}\geq 0;$ $\displaystyle\forall i,j$ (3) $\displaystyle x_{i}\geq z_{ij};$ $\displaystyle\forall i,j$ (4) $\displaystyle\sum_{j}^{m_{i}}z_{ij}\geq\small{\text{min}(\theta,\text{len}(i))% }x_{i};$ $\displaystyle\forall i$ (5) $\displaystyle\sum_{j}^{m_{i}}r_{ij}=x_{i};$ $\displaystyle\forall i$ (6) $\displaystyle\sum_{j\notin R_{c}(i)}r_{ij}=0;$ $\displaystyle\forall i$ (7) $\displaystyle r_{ij}\leq z_{ij};$ $\displaystyle\forall i,j$ (8) $\displaystyle r_{ij}+z_{\text{parent}(i,j)}\leq 1;$ $\displaystyle\forall i,j$ (9) $\displaystyle r_{i\text{root}(i)}=z_{i\text{root}(i)};$ $\displaystyle\forall i$ (10) $\displaystyle\sum_{j\in\text{sub}(i)}z_{ij}\geq x_{i};$ $\displaystyle\forall i$ (11) $\displaystyle\sum_{j\in\text{obj}(i)}z_{ij}\geq x_{i};$ $\displaystyle\forall i$ (12)
Figure 3: ILP formulation ($x_{i},z_{ij},r_{ij}\in\{0,1\}$)

Our method generates a summary by trimming a nested tree. In particular, we extract a rooted document subtree from the document tree, and sentence subtrees from sentence trees in the document tree. We formulate our problem of optimization in this section as that of integer linear programming. Our model is shown in Figure 3.

Let us denote by $w_{ij}$ the term weight of word $ij$ (word $j$ in sentence $i$). $x_{i}$ is a variable that is one if sentence $i$ is selected as part of a summary, and $z_{ij}$ is a variable that is one if word $ij$ is selected as part of a summary. According to the objective function, the score for the resulting summary is the sum of the term weights $w_{ij}$ that are included in the summary. We denote by $r_{ij}$ the variable that is one if word $ij$ is selected as a root of an extracting sentence subtree. Constraint (1) guarantees that the summary length will be less than or equal to limit $L$. Constraints (2) and (3) are tree constraints for a document tree and sentence trees. $r_{ij}$ in Constraint (3) allows the system to extract non-rooted sentence subtrees, as we previously mentioned. Function $\text{parent}(i)$ returns the parent of sentence $i$ and function $\text{parent}(i,j)$ returns the parent of word $ij$. Constraint (4) guarantees that words are only selected from a selected sentence. Constraint (5) guarantees that each selected sentence subtree has at least $\theta$ words. Function $\text{len}(i)$ returns the number of words in sentence $i$. Constraints (6)-(10) allow the model to extract subtrees that have an arbitrary root node. Constraint (6) guarantees that there is only one root per selected sentence. We can set the candidate for the root node of the subtree by using constraint (7). The $R_{c}(i)$ returns a set of the nodes that are the candidates of the root nodes in sentence $i$. It returned the parser’s root node and the verb nodes in this study. Constraint (8) maintains consistency between $z_{ij}$ and $r_{ij}$. Constraint (9) prevents the system from selecting the parent node of the root node. Constraint (10) guarantees that the parser’s root node will only be selected when the system extracts a rooted sentence subtree. The $\text{root}(i)$ returns the word index of the parser’s root. Constraints (11) and (12) guarantee that the selected sentence subtree has at least one subject and one object if it has any. The $sub(i)$ and $obj(i)$ return the word indices whose dependency tag is “SUB” and “OBJ”.

### 3.3 Additional constraint for grammaticality

We added two types of constraints to our model to extract a grammatical sentence subtree from a dependency tree:

 $\displaystyle z_{ik}$ $\displaystyle=$ $\displaystyle z_{il},$ (13) $\displaystyle\sum_{k\in s(i,j)}z_{ik}$ $\displaystyle=$ $\displaystyle|s(i,j)|x_{i}.$ (14)

Equation (13) means that words $z_{ik}$ and $z_{il}$ have to be selected together, i.e., a word whose dependency tag is PMOD or VC and its parent word, a negation and its parent word, a word whose dependency tag is SUB or OBJ and its parent verb, a comparative (JJR) or superlative (JJS) adjective and its parent word, an article (a/the) and its parent word, and the word “to” and its parent word. Equation (14) means that the sequence of words has to be selected together, i.e., a proper noun sequence whose POS tag is PRP\$, WP%, or POS and a possessive word and its parent word and the words between them. The $s(i,j)$ returns the set of word indices that are selected together with word $ij$.

Table 1: ROUGE score of each model. Note that the top two rows are both our proposals.
ROUGE-1
Sentence subtree 0.354
Rooted sentence subtree 0.352
Sentence selection 0.254
EDU selection [11] 0.321
$\text{LEAD}_{\text{EDU}}$ 0.240
$\text{LEAD}_{\text{snt}}$ 0.157

## 4 Experiment

### 4.1 Experimental Settings

We experimentally evaluated the test collection for single document summarization contained in the RST Discourse Treebank (RST-DTB) [4] distributed by the Linguistic Data Consortium (LDC) 11http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp? catalogId=LDC2002T07. The RST-DTB Corpus includes 385 Wall Street Journal articles with RST annotations, and 30 of these documents also have one manually prepared reference summary. We set the length constraint, $L$, as the number of words in each reference summary. The average length of the reference summaries corresponded to approximately 10% of the length of the source document. This dataset was first used by Marcu et al. for evaluating a text summarization system [15]. We used ROUGE [13] as an evaluation criterion.

We compared our method (sentence subtree) with that of EDU selection [11]. We examined two other methods, i.e., rooted sentence subtree and sentence selection. These two are different from our method in the way that they select a sentence subtree. Rooted sentence subtree only selects rooted sentence subtrees 22We achieved this by making $R_{c}(i)$ only return the parser’s root node in Figure 7.. Sentence selection does not trim sentence trees. It simply selects full sentences from a document tree33We achieved this by setting $\theta$ to a very large number.. We built all document trees from the RST-DTs that were annotated in the corpus.

We set the term weight, $w_{ij}$, for our model as:

 $\displaystyle\displaystyle w_{ij}=\frac{log(1+tf_{ij})}{depth(i)^{2}},$ (15)

where $tf_{ij}$ is the term frequency of word $ij$ in a document and $depth(i)$ is the depth of sentence $i$ within the sentence-level DEP-DT that we described in Section 3.1. For Constraint (5), we set $\theta$ to eight.

### 4.2 Results and Discussion

#### 4.2.1 Comparing ROUGE scores

Original sentence : John Kriz, a Moody’s vice president, {said} Boston Safe Deposit’s performance has been hurt this year by a mismatch in the maturities of its assets and liabilities.
Rooted subtree selection : John Kriz a Moody’s vice president [{said}] Boston Safe Deposit’s performance has been hurt this year
Subtree selection : Boston Safe Deposit’s performance has [been] hurt this year
Original sentence : Recent surveys by Leo J. Shapiro & Associates, a market research firm in Chicago, {suggest} that Sears is having a tough time attracting shoppers because it hasn’t yet done enough to improve service or its selection of merchandise.
Rooted subtree selection : surveys [{suggest}] that Sears is having a time
Subtree selection : Sears [is] having a tough time attracting shoppers
Figure 4: Example sentences and subtrees selected by each method.

We have summarized the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores for each method in Table 1. The score for sentence selection is low (0.254). However, introducing sentence compression to the system greatly improved the ROUGE score (0.354). The score is also higher than that with EDU selection, which is a state-of-the-art method. We applied a multiple test by using Holm’s method and found that our method significantly outperformed EDU selection and sentence selection. The difference between the sentence subtree and the rooted sentence subtree methods was fairly small. We therefore qualitatively analyzed some actual examples that will be discussed in Section 4.2.2. We also examined the ROUGE scores of two LEAD44LEAD methods simply take the first $K$ textual units from a source document until the summary length reaches $L$. methods with different textual units: EDUs ($\text{LEAD}_{\text{EDU}}$) and sentences ($\text{LEAD}_{\text{SNT}}$). Although LEAD works well and often obtains high ROUGE scores for news articles, the scores for $\text{LEAD}_{\text{EDU}}$ and $\text{LEAD}_{\text{SNT}}$ were very low.

#### 4.2.2 Qualitative Evaluation of Sentence Subtree Selection

This subsection compares the methods of subtree selection and rooted subtree selection. Figure 4 has two example sentences for which both methods selected a subtree as part of a summary. The $\{\cdot\}$ indicates the parser’s root word. The $[\cdot]$ indicates the word that the system selected as the root of the subtree. Subtree selection selected a root in both examples that differed from the parser’s root. As we can see, subtree selection only selected important subtrees that did not include the parser’s root, e.g., purpose-clauses and that-clauses. This capability is very effective because we have to contain important content in summaries within given length limits, especially when the compression ratio is high (i.e., the method has to generate much shorter summaries than the source documents).

#### 4.2.3 Fragmentation of Information

Many studies that have utilized RST have simply adopted EDUs as textual units [14, 6, 11, 12]. While EDUs are textual units for RST, they are too fine grained as textual units for methods of extractive summarization. Therefore, the models have tended to select small fragments from many sentences to maximize objective functions and have led to fragmented summaries being generated. Figure 2 has an example of EDUs. A fragmented summary is generated when small fragments are selected from many sentences. Hence, the number of sentences in the source document included in the resulting summary can be an indicator to measure the fragmentation of information. We counted the number of sentences in the source document that each method used to generate a summary55Note that the number for the EDU method is not equal to selected textual units because a sentence in the source document may contain multiple EDUs.. The average for our method was 4.73 and its median was four sentences. In contrast, methods of EDU selection had an average of 5.77 and a median of five sentences. This meant that our method generated a summary with a significantly smaller number of sentences66We used the Wilcoxon signed-rank test ($p<0.05$).. In other words, our method relaxed fragmentation without decreasing the ROUGE score. There are boxplots of the numbers of selected sentences in Figure 5. Table 2 lists the number of words in each textual unit extracted by each method. It indicates that EDUs are shorter than the other textual units. Hence, the number of sentences tends to be large.

Figure 5: Number of sentences that each method selected.
Table 2: Average number of words that individual extracted textual units contained.
Subtree Sentence EDU 15.29 18.96 9.98

## 5 Conclusion

We proposed a method of summarizing a single document that included relations between sentences and relations between words. We built a nested tree and formulated the problem of summarization as that of integer linear programming. Our method significantly improved the ROUGE score with significantly fewer sentences than the method of EDU selection. The results suggest that our method relaxed the fragmentation of information. We also discussed the effectiveness of sentence subtree selection that did not restrict rooted subtrees. Although ROUGE scores are widely used as evaluation metrics for text summarization systems, they cannot take into consideration linguistic qualities such as human readability. Hence, we plan to conduct evaluations with people77For example, the quality question metric from the Document Understanding Conference (DUC)..

We only used the rhetorical structures between sentences in this study. However, there were also rhetorical structures between EDUs inside individual sentences. Hence, utilizing these for sentence compression has been left for future work. In addition, we used rhetorical structures that were manually annotated. There have been related studies on building RST parsers [7, 1] and by using such parsers, we should be able to apply our model to other corpora or to multi-document settings.

## References

• [1] Cited by: 5.
• [2] M. Almeida and A. Martins(2013-08) Fast and robust compressive summarization with dual decomposition and multi-task learning. pp. 196–206. Cited by: 1.
• [3] T. Berg-Kirkpatrick, D. Gillick and D. Klein(2011-06) Jointly learning to extract and compress. Portland, Oregon, USA, pp. 481–490. External Links: Link Cited by: 1.
• [4] L. Carlson, D. Marcu and M. E. Okurowski(2001) Building a discourse-tagged corpus in the framework of rhetorical structure theory. pp. 1–10. Cited by: 4.1.
• [5] J. Christensen, Mausam, S. Soderland and O. Etzioni(2013) Towards coherent multi-document summarization. pp. 1163–1173. Cited by: 2.
• [6] H. Daumé III and D. Marcu(2002) A noisy-channel model for document compression. ACL, pp. 449–456. Cited by: 1, 2, 4.2.3.
• [7] D. duVerle and H. Prendinger(2009) A novel discourse parser based on support vector machine classification. pp. 665–673. Cited by: 5.
• [8] E. Filatova and V. Hatzivassiloglou(2004) A formal model for information selection in multi-sentence text extraction. Cited by: 1.
• [9] K. Filippova and M. Strube(2008) Dependency tree based sentence compression. pp. 25–32. Cited by: 2.
• [10] D. Gillick and B. Favre(2009) A scalable global model for summarization. pp. 10–18. Cited by: 1, 2.
• [11] T. Hirao, Y. Yoshida, M. Nishino, N. Yasuda and M. Nagata(2013) Single-document summarization as a tree knapsack problem. pp. 1515–1520. Cited by: 1, 3.1, 1, 4.1, 4.2.3.
• [12] K. Knight and D. Marcu(2000) Statistics-based summarization - step one: sentence compression. pp. 703–710. Cited by: 4.2.3.
• [13] C. Lin(2004) ROUGE: a package for automatic evaluation of summaries. pp. 74–81. Cited by: 4.1.
• [14] W. C. Mann and S. A. Thompson(1988) Rhetorical structure theory: toward a functional theory of text organization. Text, pp. 243–281. Cited by: 1, 4.2.3.
• [15] D. Marcu(1998) Improving summarization through rhetorical parsing tuning. pp. 206–215. Cited by: 1, 4.1.
• [16] R. T. McDonald(2007) A study of global inference algorithms in multi-document summarization.. pp. 557–564. Cited by: 1.
• [17] H. Morita, R. Sasano, H. Takamura and M. Okumura(2013) Subtree extractive summarization via submodular maximization. pp. 1023–1032. Cited by: 1, 2.
• [18] H. Nishikawa, T. Hasegawa, Y. Matsuo and G. Kikui(2010) Opinion summarization with integer linear programming formulation for sentence extraction and ordering. pp. 910–918. Cited by: 2.
• [19] X. Qian and Y. Liu(2013) Fast joint compression and summarization via graph cuts. pp. 1492–1502. Cited by: 1, 2.
• [20] H. Takamura and M. Okumura(2009) Text summarization model based on the budgeted median problem. pp. 1589–1592. Cited by: 1.
• [21] K. Tomita, H. Takamura and M. Okumura(2009) A new approach of extractive summarization combining sentence selection and compression. IPSJ SIG Notes, pp. 13–20. Cited by: 1, 2.