Automatic Summarisation

From SAM

Automatic Summarisation (also known as Automatic Text Summarisation) is a research area within Natural Language Processing whose goal is to process, synthesise and present information to users, sparing them the arduous task of having to read everything and guiding them towards what is important in texts.


The task of automatic summarisation is especially challenging because the text to be summarised must be understood in order to distinguish between relevant and irrelevant information. Moreover, it becomes even more challenging when the information is taken from different documents, where it is essential to identify similar and/or repeated information so that it is included in the summary only once.

Any summary has to be created following a summarisation process that transforms the source document or documents into a summary. This process comprises three fundamental stages[1]:

  • Topic identification. This consists of determining the particular subject the document is about. It is usually approached by assigning each unit (word, sentence, phrase, etc.) a score indicative of its importance; the top-scoring units, up to a desired length, are then extracted.
  • Interpretation or topic fusion. During this stage, the topics identified as important are fused, represented in new terms, and expressed using a new formulation, which includes concepts or words not found in the original text. This stage is what distinguishes extractive from abstractive summarisation.
  • Summary generation. This stage only makes sense if abstractive summaries are generated. In these cases, natural language generation techniques (text planning, sentence planning, and sentence realisation) are needed to produce the final text of the summary.
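The topic-identification stage, combined with direct extraction (i.e. skipping the interpretation and generation stages), can be sketched in a few lines. The snippet below is a minimal illustration, not a production summariser; the stop-word list is an arbitrary sample:

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (a real system would use a full one)
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "is", "are", "and", "it", "for"}

def summarise(text, max_sentences=2):
    """Naive extractive summariser: score each sentence by the average
    frequency of its non-stop words, then extract the top-scoring ones."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    freq = Counter(words)  # topic identification: frequent words hint at main topics

    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z']+", sentence.lower())
                  if w not in STOP_WORDS]
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    top = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
    # preserve the original sentence order in the extract
    return " ".join(s for s in sentences if s in top)
```

Because the output is a verbatim selection of source sentences, this sketch produces extracts only; abstracts would additionally require the interpretation and generation stages described above.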

Relevance to SAM

Novel automatic summarisation techniques will be analysed and tested to manage the vast amount of available information more efficiently, as well as to provide summaries of the most relevant Assets. Summarising comments and recommendations from private communities within a contextual scope is a novelty that will require exploring new strategies and approaches in order to obtain coherent summarisation results.

In the context of SAM, Automatic Summarisation will be present in T6.4 Business Intelligence and Social Mining, where NLP and Graph Analysis techniques such as Sentiment Analysis, Automatic Summarisation, Semantic Analysis, influence diffusion models, etc. will be implemented in the Business Intelligence environment to provide companies with the appropriate information to support further decision making.

State of the Art Analysis

Automatic Summarisation approaches can be characterised according to many features (for instance, the type of input/output, the purpose of the summary, or the type of reader). Although research has traditionally focused on text, the input to the summarisation process can also be multimedia information, such as images, video or audio, as well as on-line information or hypertexts. Furthermore, we can distinguish between summarising a single document (single-document summarisation) and multiple ones (multi-document summarisation).

Regarding the output, a summary may be an extract (i.e. a selection of “significant” sentences from a document), an abstract (a summary that can serve as a substitute for the original document and in which new vocabulary is added), or even a headline (or title). It is also possible to distinguish between generic summaries and query-focused summaries (also known as user-focused or topic-focused). The former can serve as a surrogate of the original text, as they may try to represent all its relevant facts. In the latter, the content of the summary is biased towards a user need, query or topic.

Concerning the style of the output, a broad distinction is normally made between two types of summaries. Indicative summaries indicate what topics are addressed in the source text and can thus give a brief idea of what the original text is about. Informative summaries, in contrast, are intended to cover the topics in the source text and provide more detailed information. Apart from these two types, a third can also be considered: critical or evaluative abstracts. This kind of summary focuses on expressing the author’s point of view on a specific topic or subject; it includes reviews, opinions, feedback, recommendations, etc., with a strong dependence on cultural interpretation. This is the reason why such summaries are so difficult to produce automatically, and why most systems only attempt to generate either indicative or informative summaries, by just summarising what appears in the source document.

In recent years, new types of summaries have appeared. For instance, the birth of the Web 2.0 has encouraged the emergence of new textual genres containing a high degree of subjectivity, thus allowing the generation of sentiment-based summaries. Update summaries are another example of a new type of summary: they assume that users already have background knowledge and only need the most recent information about a specific topic.

Finally, concerning the language of the summary, a distinction can be made between mono-lingual, multi-lingual and cross-lingual summaries, depending on the number of languages dealt with. When the input and output languages are the same, the summary is mono-lingual. If different languages are involved, the approach is multi-lingual or cross-lingual. For example, a system producing a Spanish summary from one or more documents in Spanish is mono-lingual. If it is able to deal with several languages, such as Spanish, English or German, and produces summaries in the same language as the input documents, it is a multi-lingual summarisation system. Beyond these approaches, if the summary is in Spanish but the original documents are in English, the summariser deals with cross-linguality, since the input and output languages are different.

Currently, the application of summarisation to other fields is gaining more and more importance. For instance, the TAC 2014 Biomedical Summarization track has been proposed this year in the context of the Text Analysis Conference, where biomedical literature has to be concisely summarised.


Although a wide number of different techniques and approaches have been proposed to generate automatic summaries, they can be classified into five main groups, depending on the nature of the techniques employed. These are:

  • Statistical-based: this type of approach was the first used for producing automatic summaries. It consists of using statistical counts to determine the relevance of the units in the texts, so that the most relevant ones are selected to form part of the final summary. For instance, a relevant sentence can be identified by assuming that the most frequent words in a document are indicative of its main topics [2][3]. However, not all words have to be taken into consideration. For example, stop words, i.e. words that do not carry any semantic information, such as “a” or “the”, are not used when computing term frequencies.
  • Topic-based: this technique consists of determining the relevance of a sentence by means of the phrases or words it contains [4][5][6]. For instance, sentences containing phrases like “in conclusion” or “the aim of this paper” may introduce new topics, and also be good indicators of relevant information.
  • Graph-based: the use of graph-based ranking algorithms has also been shown to be effective in summarisation. Basically, the nodes of the graph represent text elements (normally words or sentences), whereas edges are previously defined links between those text elements (for instance, semantic relations such as synonymy) [7][8][9]. On the basis of this representation of the text as a graph, the idea is that the topology of the graph will reveal interesting things about the salient elements of the text, for example concerning the connectivity of the different elements.
  • Discourse-based: it is also possible to generate summaries exploiting discourse relations [10][11][12]. Rhetorical Structure Theory is one of the linguistic theories most employed for producing automatic summaries. Given the different types of rhetorical structures a text can contain, the sentences in a document are classified according to these structures and then ranked based on their importance, discarding satellite spans and preserving nucleus ones. Other linguistic approaches rely on lexical chains, with the aim of maintaining coherence in the final summary [13][14].
  • Machine learning-based: this type of approach applies one or more computational algorithms, configured with relevant features (e.g. the presence of keywords, or sentence position), to a large collection of data in order to determine which sentences are good candidates to belong to the summary [15][16]. The advantage of using machine learning for automatic summarisation is that it allows the performance of a high number of features to be tested easily. However, these approaches also need a large training corpus in order to obtain conclusive results. In this case, the corpus usually consists of a set of human-written summaries, or of annotated source documents indicating which sentences are important for the summary and which are not.
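As an illustration of the graph-based family above, the following sketch ranks sentences with a PageRank-style iteration over a word-overlap similarity graph, in the spirit of LexRank/TextRank [7]. It is a toy implementation under simplifying assumptions (bag-of-words overlap as the edge weight), not a faithful reproduction of either algorithm:

```python
import re
from itertools import combinations

def sentence_graph_rank(sentences, iterations=30, damping=0.85):
    """Rank sentences via a PageRank-style random walk over a graph
    whose edges are weighted by word-overlap similarity."""
    bags = [set(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    n = len(sentences)

    # Build the symmetric similarity matrix (the graph's weighted edges)
    sim = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        overlap = len(bags[i] & bags[j])
        if overlap:
            sim[i][j] = sim[j][i] = overlap / (len(bags[i]) + len(bags[j]))

    # Iterative score propagation: well-connected sentences accumulate mass
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                sim[j][i] * scores[j] / (sum(sim[j]) or 1)
                for j in range(n) if sim[j][i]
            )
            for i in range(n)
        ]
    return scores
```

Sentences that share vocabulary with many others end up with higher scores, while isolated sentences sink towards the baseline (1 − damping)/n, which matches the intuition that connectivity reveals salience.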


Automatic evaluation of summaries is also a very challenging task. This is reflected in the advances achieved in this particular area, which are slower than in other tasks based on Natural Language Processing. Whereas many researchers focus their attention on finding the best approaches for identifying the most relevant sentences in a document, methods accounting for the quality of a summary are far less explored.

Methods for evaluating summarisation approaches can be broadly classified into two categories: intrinsic and extrinsic. The former evaluates the summary itself, for example according to its information content, whereas the latter tests the effectiveness of a summary within another Natural Language Processing application (e.g. question answering). In this case, the summary would be beneficial for other applications, despite not being appropriate as a surrogate of a document for direct use by humans.

As far as intrinsic evaluation is concerned, there are different criteria that can be taken into account to evaluate a summary. The most widespread intrinsic methodologies focus on evaluating the informativeness of a summary by comparing its content to a human-written one, also known as a model summary or gold standard. However, due to the inherent subjectivity associated with summaries, it is very difficult to build a fair model summary. In contrast, other evaluation approaches are more concerned with qualitative evaluation, which aims at assessing the quality of a summary with respect to different criteria, such as grammaticality or coherence.
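The content comparison against a gold standard can be illustrated with a simplified ROUGE-N-style recall score. This is a sketch of the idea only; the official ROUGE toolkit adds stemming, multiple references and further variants:

```python
from collections import Counter

def ngram_recall(candidate, reference, n=1):
    """Simplified ROUGE-N-style recall: the fraction of the reference's
    n-grams that also appear in the candidate summary (with clipping)."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clip each n-gram's count so repeated matches are not over-rewarded
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

Recall against the model summary rewards content coverage; it says nothing about grammaticality or coherence, which is precisely why the qualitative criteria mentioned above complement it.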

With respect to the extrinsic evaluation methods, several scenarios inspired by different disciplines have been proposed for summarisation evaluation. Examples of these scenarios are: the Shannon Game, which aims at quantifying information content by guessing tokens so that the original document can be recreated; the Question Game, which tests readers’ understanding of the summary and their ability to convey the main concepts; the Classification Game, which consists of determining the category of either original documents or summaries, by measuring the correspondence between them; and Keyword Association, in which a list of keywords is provided and the goal is to check whether summaries contain such words or not. Other extrinsic evaluation methods include relevance assessment, in which subjects are asked to determine the relevance of a topic, either in a summary or in a source document; or reading comprehension, which involves answering multiple-choice tests after reading the summary or the whole document. Automatic summaries have also been evaluated extrinsically in the context of wider Natural Language Processing tasks, such as Information Retrieval, Text Classification or Sentiment Analysis.
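Of the scenarios above, Keyword Association is the most straightforward to mechanise. A minimal sketch (the scoring is a plain substring check, a simplification of how such a test would be administered in practice):

```python
def keyword_association(summary, keywords):
    """Toy Keyword Association check: report which of the given keywords
    appear in the summary and what fraction of them are covered."""
    text = summary.lower()
    hits = [k for k in keywords if k.lower() in text]
    coverage = len(hits) / len(keywords) if keywords else 0.0
    return hits, coverage
```

A higher coverage suggests the summary has retained the topical content the keywords stand for, which is the intuition behind the original game.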


There is a considerable number of applications in which automatic summarisation is extremely beneficial, providing competitive advantages in the current information society. For instance, in the scientific context, automatic summarisation can be used to generate abstracts of research papers, sparing authors the difficult task of having to synthesise in a short paragraph the main topics addressed in their research. These abstracts are very important not only for providing readers with an overview of the article, but also for being used by automatic systems for indexing, searching and retrieving information without having to process the whole document. Moreover, another application would be the automatic generation of newsletters containing information of interest to a particular group of experts.

Another scenario where automatic summaries could be very beneficial is producing a summary of the information retrieved by a search engine about a specific topic. A user would not be able to read and process all the retrieved documents, and would probably only look at the first results to check whether they contain the desired information. Generating an automatic summary for the search performed could provide the user with basic information about the topic of interest.

In the context of the Web 2.0, and with the explosion of the sentiment analysis research field, summarisation is also combined with sentiment analysis systems to produce short fragments of text with the most relevant opinions about a product, service, place, etc.

Related Projects

There have been a number of projects specifically dealing with automatic summarisation. In the following link you can see a comprehensive description of some of them:

A good recent book offering an up-to-date overview of the most cutting-edge approaches and evaluation methods is Innovative Document Summarization Techniques: Revolutionizing Knowledge Understanding. Through its chapters, the reader will gain an up-to-date and useful background on Automatic Summarisation.

Tools, Frameworks and Services

The DUC and TAC competitions are the most relevant evaluation frameworks in this research field. The tasks involved in the DUC conferences changed over the editions to adapt to new user requirements and scenarios, starting with generic single-document summarisation and continuing with query-focused multi-document summarisation. Since 2008, the DUC conferences are no longer organised, as they have become part of TAC, which includes a summarisation track. In TAC 2008, two different tasks related to summarisation were proposed. The first followed the same idea as the update summarisation task of DUC 2007, which consisted of building summaries containing updated information with respect to a given set of news documents, whereas the second was a pilot task whose aim was to generate opinion summaries from blogs. In later TAC editions a new task concerning the automatic evaluation of summaries was included.

There are a number of automatic summarisers available that can be used online or via an application. Among them, we find:

SAM Approach

Summarisation functionalities are included in the Social Mining subcomponent. This component is part of the Analytic component, which includes the Business Intelligence subcomponent.

Architecture and Dependencies

The Social Mining subcomponent, presented in the following image, is responsible for processing User Generated Content (UGC) by using NLP technologies. To carry out its responsibilities, this subcomponent interacts with the Cloud Storage component to retrieve UGC. After that, the Social Mining Controller performs different actions to provide the Business Intelligence subcomponent with the sentiment analysis, content characterisation and summarisation functionalities provided by the Semantic Services. More specifically, this subcomponent will make use of semantic and sentiment analysis features extracted from user comments to provide advanced reports by means of the Business Intelligence subcomponent. The figure beneath shows the different Social Mining subcomponents, the logical connections established between them, and their relationships with other components and actors in the SAM platform.


Implementation and Technologies

After extended analysis and comparison, the most appropriate technologies for the backend have been selected.

Frontend Technologies (User Interface)

The Social Mining subcomponent does not include any user interface.

Backend Technologies (Web Services)

The Social Mining prototype uses JAX-RS technologies, and more specifically the Jersey framework[17], to implement the RESTful Web Services for this component. In addition, the Social Mining API uses the Swagger framework[18] to generate interactive documentation. This should considerably ease the implementation, deployment and testing of the Social Mining environment.


A summary of the tasks carried out for the Social Mining subcomponent during the first and second versions of the prototype is shown in the following table. The tasks marked as [mock-up] have not been developed for this prototype, but their expected behaviour has been documented by mock-up interfaces including input and output schemas.

Subcomponent Task
Social Mining Controller Queries Asset’s UGC for a given Asset to process the following tasks [mock-up]:
  • Invokes and controls (by means of the Sentiment Analysis Controller) sentiment analysis functionalities from the Semantic Services component to extract sentiment and emotion features from the UGC
  • Invokes and controls (by means of the Characterisation Controller) content characterisation functionalities from the Semantic Services component to retrieve related Assets from the UGC [mock-up]
  • Invokes and controls (by means of the Sentiment Analysis Controller) summarisation functionalities from the Semantic Services component for summarising large amounts of UGC, keeping the most representative information from them [mock-up]

Functionality and UI Elements

This section explains how to use the Social Mining subcomponent with the available interfaces for performing automatic summarisation on large amounts of Assets’ UGC from the Cloud Storage. Currently, this functionality is provided as a mock-up, since it depends on other SAM components currently under development. The following section describes the details for accessing and using these interfaces.

Summarising Social Data

The Social Mining subcomponent provides, through the Social Mining Controller, a RESTful interface for text summarisation (see figure beneath). This interface recovers the UGC related to a specific Asset (by querying the Cloud Storage) given its identifier. It summarises large amounts of UGC and keeps the most representative items by means of the functionalities provided by the Semantic Services component. For the first prototype, this subcomponent has been implemented as a mock-up.

The figure above includes all the necessary request and response parameters and their expected types and values. The input of this operation is a JSON object with the following attributes:

  • assetIDs: List of Asset identifiers in order to obtain their UGC
  • startDate: Initial date (mmddyyyy) to consider when retrieving assets (inclusive)
  • endDate: Final date (mmddyyyy) to consider when retrieving assets (exclusive)
  • depth: Since the SAM platform can deal with complex Assets (i.e. an Asset can be linked to another Asset), the depth attribute establishes a numeric value that determines whether UGC from linked Assets should also be retrieved and analysed.
  • compressionRatio: Numeric value to indicate the reduction percentage of the UGC to keep only the most representative comments in the output.
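A hypothetical request body matching the attribute list above might look as follows. All identifiers and values here are illustrative, not taken from the SAM figure:

```python
import json

# Illustrative request body for the summarisation interface;
# the attribute names follow the list above, the values are made up.
payload = {
    "assetIDs": ["asset-001", "asset-002"],  # Assets whose UGC will be summarised
    "startDate": "01012015",                 # mmddyyyy, inclusive
    "endDate": "06302015",                   # mmddyyyy, exclusive
    "depth": 1,                              # also retrieve UGC from directly linked Assets
    "compressionRatio": 20,                  # keep roughly 20% of the most representative comments
}
body = json.dumps(payload)
```

The serialised `body` would then be sent to the Social Mining Controller's REST endpoint.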

Below is an example result of this functionality, using the input shown in the figure above. This interface provides a JSON object with an attribute named summary that contains the most representative comments from the input UGC:

{  "summary": "Casino Royale is a great movie." }

Latest Developments

The improvements to this functionality have basically focused on reducing memory usage and on resource concurrency control, since it makes use of extra resources such as the WordNet [19] dictionary. As part of the dissemination results, this technology has been reported at scientific conferences, as Vicente et al. [20] describe, and its results can be found in detail in the table.

Research results

Technology: Entity Linking (SAM Assets)
Corpus: 30 documents of the MultiLing 2015 dataset [21]
Precision: 45% (state-of-the-art is 54%)
Recall: 45% (state-of-the-art is 51%)


  1. Hovy, Eduard. 2005. The Oxford Handbook of Computational Linguistics. Oxford University Press. Chap. Text Summarization, pages 583–598.
  2. McCargar, Victoria. 2005. Statistical Approaches to Automatic Text Summarization. Bulletin of the American Society for Information Science and Technology, 30(4), 21–25.
  3. Lloret, Elena, & Palomar, Manuel. 2009. A Gradual Combination of Features for Building Automatic Summarisation Systems. Pages 16–23 of: Proceedings of the 12th International Conference on Text, Speech and Dialogue.
  4. Edmundson, H. P. 1969. New Methods in Automatic Extracting. Pages 23–42 of: Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press
  5. Harabagiu, Sanda, & Lacatusu, Finley. 2005. Topic Themes for Multi-document Summarization. Pages 202–209 of: Proceedings of the 28th annual international ACM SIGIR conference on Research and Development in Information Retrieval
  6. Teng, Zhi, Liu, Ye, Ren, Fuji, Tsuchiya, Seiji, & Ren, Fuji. 2008. Single Document Summarization Based on Local Topic Identification and Word Frequency. Pages 37–41 of: Proceedings of the Seventh Mexican International Conference on Artificial Intelligence
  7. Erkan, Günes, & Radev, Dragomir R. 2004. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research (JAIR), 22, 457–479.
  8. Wan, Xiaojun, Yang, Jianwu, & Xiao, Jianguo. 2007. Towards a Unified Approach Based on Affinity Graph to Various Multi-document Summarizations. Pages 297–308 of: Proceedings of the 11th European Conference.
  9. Plaza, Laura, Díaz, Alberto, & Gervás, Pablo. 2008. Concept-Graph Based Biomedical Automatic Summarization Using Ontologies. Pages 53–56 of: Proceedings of the 3rd Textgraphs workshop on Graph-based Algorithms for Natural Language Processing.
  10. Marcu, Daniel. 1999. Discourse Trees Are Good Indicators of Importance in Text. Pages 123–136 of: Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press.
  11. Cristea, Dan, Postolache, Oana, & Pistol, Ionut. 2005. Summarisation Through Discourse Structure. Pages 632–644 of: Proceedings of the Computational Linguistics and Intelligent Text Processing, 6th International Conference.
  12. Khan, Afnan Ullah, Khan, Shahzad, & Mahmood, Waqar. 2005. MRST: A New Technique For Information Summarization
  13. Ercan, Gonenc, & Cicekli, Ilyas. 2008. Lexical Cohesion Based Topic Modeling for Summarization. Pages 582–592 of: Proceedings of the 9th International Conference in Computational Linguistics and Intelligent Text Processing.
  14. Barzilay, Regina, & Elhadad, Michael. 1999. Using Lexical Chains for Text Summarization. Pages 111–122 of: Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press.
  15. Kupiec, Julian, Pedersen, Jan, & Chen, Francine. 1995. A Trainable Document Summarizer. Pages 68–73 of: Proceedings of the 18th annual international ACM SIGIR Conference on Research and Development in Information Retrieval.
  16. Svore, Krysta M., Vanderwende, Lucy, & Burges, Christopher J.C. 2007. Enhancing Single-Document Summarization by Combining RankNet and Third-Party Sources. Pages 448–457 of: Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
  20. Vicente, M., Alcón, O., & Lloret, E. 2015. The University of Alicante at MultiLing 2015: Approach, Results and Further Insights. Pages 250–259 of: Proceedings of the SIGDIAL 2015 Conference. Association for Computational Linguistics.