Natural Language Processing

From SAM
Revision as of 11:42, 1 October 2015 by Admin (Add link to Semantic Analysis wiki page)


Natural Language Processing (NLP) (also known as Computational Linguistics) is the scientific and engineering discipline concerned with the computational treatment of human language [1].

It is an interdisciplinary field related to Linguistics, Computer Science, Cognitive Science and Artificial Intelligence, providing formal and computational models of linguistic phenomena. These models can be developed following two remarkably different approaches: "knowledge-based" and "corpus-based"[2]. In a knowledge-based approach, a human expert manually defines a set of rules to build the model (e.g., a rule stating that every time the term "viagra" appears in the body of an e-mail, the content must be classified as spam). Corpus-based approaches, on the other hand, rely on large collections of text and on machine learning algorithms that automatically learn from these texts to solve a particular task (e.g., provide a large set of spam and non-spam e-mails as the input to a Naïve Bayes classifier, and let the algorithm learn to differentiate between the two).
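The corpus-based spam example can be sketched with a small Naïve Bayes classifier. This is a toy, standard-library-only illustration with invented training e-mails; a real system would train on thousands of messages and use more careful tokenisation.

```python
import math
from collections import Counter

def train_naive_bayes(docs):
    """docs: list of (text, label) pairs. Returns per-label word counts and label counts."""
    word_counts = {}          # label -> Counter of words seen under that label
    label_counts = Counter()  # label -> number of training documents
    for text, label in docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Pick the label maximising log P(label) + sum of log P(word|label),
    with add-one (Laplace) smoothing for unseen words."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(label_counts[label] / total_docs)  # prior
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented miniature training corpus, for illustration only
training = [
    ("buy cheap viagra now", "spam"),
    ("cheap pills limited offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
wc, lc = train_naive_bayes(training)
print(classify("cheap viagra offer", wc, lc))   # classified as spam
print(classify("team lunch on monday", wc, lc)) # classified as ham
```

Note that no spam-detection rule was written by hand: the word statistics gathered from the labelled examples drive the decision, which is the essence of the corpus-based approach.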


NLP techniques are applied to different levels of human language analysis:

  • Morphological analysis and Part-of-Speech (POS) tagging: this level focuses on the computational analysis of word structure (root, affixes, lemma, etc.), and on the assignment of part-of-speech tags (with morphological information such as gender, number, tense, etc.) to each word of a given text, based on both its definition and its context.
  • Syntactic analysis (also known as "parsing"). This level focuses on the computational analysis of sentence structure, i.e., the formal analysis by a computer of a sentence or other string of words into its constituents, resulting in a parse tree showing their syntactic relation to each other[3].
  • Semantic analysis. It consists of assigning the correct meaning to each word (lexical semantics) or sentence (compositional semantics) of a text in a specific context. Different techniques are used to represent linguistic meaning: predicate logic, semantic networks, conceptual graphs, and semantic vector spaces. Semantic analysis includes meaning representation and automated reasoning.
  • Pragmatic analysis. It refers to all linguistic phenomena that exceed the boundaries of the sentence, from anaphora resolution to irony detection.
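As a toy illustration of the morphological level, a tagger can combine a small closed-class lexicon with suffix rules. The lexicon and rules below are invented for illustration; real taggers learn such morphological cues statistically from large annotated corpora and also exploit context.

```python
# Hand-made lexicon for closed-class words (invented, illustrative only)
LEXICON = {"the": "DET", "a": "DET", "is": "VERB", "dog": "NOUN"}

# A few English suffix heuristics, checked in order
SUFFIX_RULES = [("ly", "ADV"), ("ing", "VERB"), ("ed", "VERB"), ("s", "NOUN")]

def tag(word):
    """Assign a coarse POS tag using the lexicon, then suffix cues."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    for suffix, pos in SUFFIX_RULES:
        if word.endswith(suffix):
            return pos
    return "NOUN"  # default guess: unknown words are most often nouns

print([(w, tag(w)) for w in "the dog barked loudly".split()])
# [('the', 'DET'), ('dog', 'NOUN'), ('barked', 'VERB'), ('loudly', 'ADV')]
```

The suffix rules show why morphology matters for tagging: "-ed" and "-ly" carry tense and word-class information even for words the system has never seen.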

One of the main issues that every task in the field of NLP must face is the phenomenon of language ambiguity. The variability of natural language allows the same message to be uttered in many different ways: there are semantically similar sentences that are completely different from a lexical point of view (e.g., "I crave a hamburger" vs. "I long for a patty"). On the other hand, the problem of semantic ambiguity arises when one word has multiple different meanings, as in the case of "bank" (does it refer to a long pile or heap, or to an institution for safeguarding money?).
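Lexical ambiguity of the kind illustrated by "bank" is typically tackled with word sense disambiguation. A minimal sketch of the simplified Lesk algorithm, which picks the sense whose dictionary gloss shares the most words with the surrounding context, might look like the following (the glosses and stopword list are invented for illustration, not taken from WordNet):

```python
# Function words ignored when comparing context and glosses (illustrative list)
STOPWORDS = {"a", "an", "and", "as", "at", "for", "i", "in", "of", "or", "the"}

# Invented glosses for the two senses of "bank" discussed above
SENSES = {
    "bank/institution": "an institution for safeguarding and lending money",
    "bank/mound": "a long pile or heap of earth, as along the side of a river",
}

def content_words(text):
    """Lowercase, strip basic punctuation, and drop stopwords."""
    return {w.strip(".,") for w in text.lower().split()} - STOPWORDS

def lesk(context, senses):
    """Simplified Lesk: choose the sense whose gloss overlaps the context most."""
    ctx = content_words(context)
    return max(senses, key=lambda s: len(ctx & content_words(senses[s])))

print(lesk("I deposited money at the bank", SENSES))   # bank/institution
print(lesk("the bank of the river was muddy", SENSES)) # bank/mound
```

The context word "money" overlaps the financial gloss, while "river" overlaps the mound gloss, so the same surface word is resolved differently in each sentence.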

Relevance to SAM

Different NLP techniques will be applied in SAM. Some of them have already been described in different sections of this Wiki, such as Automatic Summarisation, Semantic Analysis and Sentiment Analysis. Besides that, other NLP techniques will be applied in tasks such as "T4.3 Data Characterisation Services", where tools and interfaces to access the ontologies used to annotate and classify Assets are defined. In this task, different NLP techniques will be explored in order to provide suitable data characterisation and Asset suggestions.

"WP6 Context Analysis & Dynamic Creation of Social Communities" will also benefit from NLP techniques. The main goal of this work package is to develop the social capabilities of SAM, which provides an innovative approach for the implementation and consumption of media-related data in context-centric social communities. In order to achieve this context-centric approach, innovative strategies will be provided based on NLP technologies, content characterisation, and social mining. By combining these pieces together, SAM will provide context analysis mechanisms for the detection of context changes and the creation of dynamic communities.

State of the Art Analysis

Below is a list of the most common tasks in the NLP area. Some of these tasks, called Final Tasks, have direct applications in the real world, while others are commonly employed as Intermediate Tasks used to aid in solving the final ones.

Final Tasks

Intermediate Tasks

Tools, Frameworks and Services

There are different NLP resources available as a result of scientific initiatives. To name but a few:

Semantic Resources:

  • MultiWordNet (MWN) aligns the Italian lexical database with the English (Princeton) WordNet, annotating synsets with Domain labels
  • EuroWordNet (EWN) was developed to align Basque, Catalan, English, Italian and Spanish lexical databases
  • Multilingual Central Repository (MCR) integrates into the EWN framework an upgraded version of the EWN Top Concept ontology and the MWN Domains
  • Suggested Upper Merged Ontology (SUMO) is a formal upper ontology, mapped to the WordNet lexicon, defining thousands of concepts, relations and axioms
  • Integration of Semantic Resources based in WN (ISR-WN)[4]
  • BabelNet is a multilingual encyclopedic dictionary covering terms in fifty languages. It is structured as a semantic network connecting named entities and concepts through a large set of semantic relations; more than 9 million entries make up this network.

Part-Of-Speech Taggers:

  • FreeLing is a library providing language analysis services (such as morphological analysis, date/time recognition, PoS tagging, etc.)
  • The Stanford Parser is a set of probabilistic natural language parsers, including highly optimized lexicalized dependency parsers

Semantic Similarity Tools:

  • WordNet::Similarity implements different semantic similarity and relatedness measures based on the WordNet lexical database. It supports the following semantic measures: Patwardhan-Pedersen, Resnik, Jiang-Conrath, Lin, Leacock-Chodorow, Wu-Palmer, Banerjee-Pedersen, and Hirst-St.Onge.
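As an illustration of such measures, the Wu-Palmer similarity can be computed over a hand-made toy is-a taxonomy. WordNet::Similarity applies the same formula, sim(a, b) = 2·depth(lcs) / (depth(a) + depth(b)), over the real WordNet hypernym hierarchy; the tiny taxonomy below is invented for the example.

```python
# Toy is-a taxonomy: each node points to its parent (None marks the root)
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "artifact": "entity",
    "car": "artifact",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    return len(path_to_root(node))  # the root has depth 1

def wu_palmer(a, b):
    """2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the
    lowest common subsumer (deepest shared ancestor) of a and b."""
    ancestors_a = set(path_to_root(a))
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wu_palmer("dog", "cat"))  # share "animal": 2*2/(3+3) ~ 0.67
print(wu_palmer("dog", "car"))  # share only "entity": 2*1/(3+3) ~ 0.33
```

"dog" and "cat" score higher than "dog" and "car" because their lowest common subsumer ("animal") sits deeper in the taxonomy, which is exactly the intuition the measure encodes.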


  1. Stanford Encyclopedia of Philosophy
  2. What is Computational Linguistics? ACLWeb