The objective of a Content Gateway is to implement strategies, tools and techniques to allow an easy integration of heterogeneous content sources into a system. As a part of this methodology, several techniques and implementations of mechanisms for data extraction from 3rd party systems (such as CMS, web sites and DBs) are used.
- 1 Introduction
- 2 Relevance to SAM
- 3 State of the Art Analysis
- 3.1 Data Extraction
- 3.2 Data Extraction Techniques
- 3.3 Data Transformation
- 3.4 Data Characterisation
- 3.5 Related Projects
- 3.6 Tools, Frameworks and Services
- 4 SAM Approach
- 5 Articles
- 6 References
With the advanced techniques implemented by a Content Gateway, it is possible to facilitate the acquisition and further characterisation of data. For instance, when a content provider produces metadata about its main information, a content gateway can automatically prepare and characterise this metadata for enhancement and further configuration. The main areas to cover in a Content Gateway are: Data Extraction, Data Transformation and Data Characterisation.
Relevance to SAM
Companies sometimes experience inefficiencies in their data exchange mechanisms. It is therefore necessary to improve the performance of their Content Extraction mechanisms. To reach this goal, it is possible to design approaches based on existing projects and on alternative information capture techniques for non-structured data (e.g. web scraping, web content mining, text mining, etc.).
State of the Art Analysis
A gateway comprises standard components and custom components, the latter with functionality developed for connecting to a specific type or instance of external system. A gateway's mission is to communicate with a specific system, meaning that a significant part of a gateway implementation is tailored to a specific technology (e.g. SAP ERP) or communication/interface protocol.
Gateways are only concerned with connecting 3rd party systems and are agnostic about the content exchanged and its format. The information gathered through the gateways is checked and processed at the destination components (for example, the Process Execution component). Gateways take no decisions themselves. However, gateways can call the Transformation Service directly to translate from the external format to predefined (common) formats.
In addition to communication, data extraction and synchronisation, and data transformation, a further step can be included in which the formats are analysed and the data is semantically characterised.
Data Extraction Techniques
Different techniques can be explored to extract data, depending on the data origin, which can be divided into: Unstructured, Structured and Semi-structured.
Unstructured Data Sources
This section explains different techniques used to obtain properly structured content from non-structured sources:
- Web scraping. Normally, web browsers such as Chrome, Internet Explorer or Firefox are used to view the information published on the Web, and the functionality provided by each website is the means of interacting with that information (search, add comments). Sometimes this information could be useful for applications, or it could be interesting to store it locally. Web scraping is a technique for extracting this information from websites and storing it somewhere, such as a database or a local machine, for later use.
- Web content mining is generally the second step in Web data mining. The objective is to classify the information based on search parameters. To determine the relevance of the content in a web page, its different information entities (text, graphs and images) are scanned and mined.
- Text mining is the analysis of data contained in natural language text. It can help to obtain valuable business insights from text-based content such as Word documents, email and postings on social media streams like Facebook, Twitter and LinkedIn. However, mining unstructured data with natural language processing (NLP), statistical modelling and machine learning techniques can be challenging because natural language text is often inconsistent. It contains ambiguities caused by inconsistent syntax and semantics, including slang, language specific to vertical industries and age groups, double entendres and sarcasm.
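The web scraping technique described above can be illustrated with a minimal sketch that uses only the Python standard library. The page content, the `headline` class name and the `scrape_headlines` helper are assumptions made for illustration; a real gateway would first download the page and would probably rely on a dedicated scraping tool.

```python
from html.parser import HTMLParser


class HeadlineScraper(HTMLParser):
    """Collect the text of every <h2 class="headline"> element."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "h2" and dict(attrs).get("class") == "headline":
            self._inside = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._inside = False

    def handle_data(self, data):
        # Text of nested tags (e.g. <em>) is appended to the current headline.
        if self._inside:
            self.headlines[-1] += data


def scrape_headlines(html):
    parser = HeadlineScraper()
    parser.feed(html)
    parser.close()
    return [headline.strip() for headline in parser.headlines]


# Hypothetical page; a real gateway would download it first.
SAMPLE_PAGE = """
<html><body>
  <h2 class="headline">First story</h2>
  <p>Some text</p>
  <h2 class="headline">Second <em>story</em></h2>
  <h2 class="other">Not a headline</h2>
</body></html>
"""

headlines = scrape_headlines(SAMPLE_PAGE)
print(headlines)
```

The extracted list could then be stored in a database or a local file, which is the "storing" half of the technique.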
Structured and Semi-structured data sources
Techniques for extracting information from structured or semi-structured data sources (databases, XML, Excel, etc.) aim to minimise the effort needed to map the source semantically onto a specific structure. This process of interrelating information from diverse sources is known as semantic integration.
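As a rough illustration of semantic integration, the sketch below reads one CSV source and one XML source and brings both into a single common structure. The field names, the two field maps and the sample data are hypothetical; the point is only that each source declares how its own names relate to the shared ones.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical common structure: every record ends up with "title" and "creator".
# Each source declares how its own field names map onto those keys.
CSV_FIELD_MAP = {"Title": "title", "Author Name": "creator"}
XML_FIELD_MAP = {"name": "title", "writer": "creator"}

CSV_SOURCE = "Title,Author Name\nDune,Frank Herbert\n"
XML_SOURCE = (
    "<books><book><name>Emma</name><writer>Jane Austen</writer></book></books>"
)


def records_from_csv(text, field_map):
    reader = csv.DictReader(io.StringIO(text))
    return [
        {field_map[key]: value for key, value in row.items() if key in field_map}
        for row in reader
    ]


def records_from_xml(text, field_map, record_tag):
    root = ET.fromstring(text)
    records = []
    for element in root.iter(record_tag):
        records.append(
            {
                field_map[child.tag]: (child.text or "").strip()
                for child in element
                if child.tag in field_map
            }
        )
    return records


# Records from both sources now share one vocabulary.
integrated = records_from_csv(CSV_SOURCE, CSV_FIELD_MAP) + records_from_xml(
    XML_SOURCE, XML_FIELD_MAP, "book"
)
print(integrated)
```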
Transformation is a feature with a specific vocabulary as follows:
- Transformation is the sum of the actions necessary to convert syntax or content A(s) into syntax or content B(s) (the ‘s’ is used since there can be multiple inputs and outputs).
- Mapping is the design time act of creating the commands/instructions/code to make this conversion. Typically a mapping file will contain the source schema, the output schema and the linkage between the two. The mapping file format is specific to the Translation engine.
- Translation is the run time act of morphing instance data from As to Bs according to the Mapping.
- The Translation Engine is the software engine that performs the translation; typically this may be either:
- A generic Translation engine, which can run any map written for that engine and is provided at runtime with the map (or an index to the map) as well as input files and output locations
- A specific Translation engine, which can run, or is created to run, just that map at runtime and is provided with input files and output locations
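The vocabulary above can be sketched in a few lines of Python. The mapping file layout (a JSON document with `source_schema`, `output_schema` and `links`), the field names and the `translate` function are assumptions made for illustration and do not reflect the format of any particular Translation engine.

```python
import json
from functools import partial

# Design time (Mapping): the map is plain data holding the source schema,
# the output schema and the links between them. Layout is hypothetical.
MAPPING_FILE = json.dumps(
    {
        "source_schema": ["ItemNo", "Desc", "Price"],
        "output_schema": ["sku", "description", "price"],
        "links": {"ItemNo": "sku", "Desc": "description", "Price": "price"},
    }
)


def translate(mapping_text, instance):
    """Run time (Translation): a generic engine that applies any map in this format."""
    mapping = json.loads(mapping_text)
    missing = [field for field in mapping["source_schema"] if field not in instance]
    if missing:
        raise ValueError(f"instance lacks source fields: {missing}")
    return {target: instance[source] for source, target in mapping["links"].items()}


# A specific engine is the generic one bound to a single map.
run_order_map = partial(translate, MAPPING_FILE)

order_line = {"ItemNo": "A-17", "Desc": "Oak chair", "Price": "129.00"}
translated = run_order_map(order_line)
print(translated)
```

Keeping the map as data is what allows it to be stored and re-used independently of the engine that runs it.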
In this area, ontology exploitation is a relevant technology. Semantic data characterisation could be carried out to identify Ontology concepts when importing content assets using a Content Gateway. This alignment between the structure and content of an asset and the Ontology concepts could facilitate further enrichment and exploitation of these assets. The process should be carried out in a semi-supervised way, providing suggestions for automatic alignment that a human actor confirms afterwards. As a result, semantic inference could be carried out on these semantically characterised Assets. Novel techniques in artificial intelligence could be employed to allow access to semantic repositories, facilitating the retrieval and inference of knowledge from these sources of information.
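A minimal sketch of such a semi-supervised alignment is shown below. It uses plain string similarity from the Python standard library as a stand-in for real semantic matching, and the concept list, the threshold and the `approve` callback (which plays the role of the human actor) are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical ontology concepts; a real system would read them from an ontology.
ONTOLOGY_CONCEPTS = ["Person", "Organisation", "Location", "PublicationDate"]


def suggest_alignments(field_names, concepts, threshold=0.5):
    """Propose the most similar concept for each field, with a similarity score."""
    suggestions = {}
    for field in field_names:
        scored = [
            (SequenceMatcher(None, field.lower(), concept.lower()).ratio(), concept)
            for concept in concepts
        ]
        score, best = max(scored)
        if score >= threshold:
            suggestions[field] = (best, round(score, 2))
    return suggestions


def confirm(suggestions, approve):
    """Keep only the suggestions that the reviewer approves."""
    return {
        field: concept
        for field, (concept, score) in suggestions.items()
        if approve(field, concept, score)
    }


suggestions = suggest_alignments(
    ["person", "publication_date", "xyz"], ONTOLOGY_CONCEPTS
)
# Stand-in for the human actor: accept only near-exact matches.
confirmed = confirm(suggestions, lambda field, concept, score: score >= 0.99)
print(suggestions)
print(confirmed)
```

String similarity only catches near-identical names; the suggestion-then-confirmation flow is the part that carries over to a semantic matcher.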
EU STASIS is a Research and Development project sponsored under the European Commission’s 6th Framework Programme as well as by its project members. Its objective is the research, development and validation of open, web-services-based, distributed semantic services for SME empowerment within the automotive, furniture and other sectors. It commenced on September 1st 2006 and ran for approximately 3 years, until September 2009, with a total budget of €4M. 12 partners were involved, including commercial companies (TIE, iSoft), academic institutions (Universities of Sunderland, Oldenburg, Modena & Reggio Emilia, Tsinghua) and user organisations (AIDIMA, Mariner, Shanghai Sunline, Foton, TANET, ZF Friedrichshafen AG), led by the managing partner TIE. Partners were spread across Europe and China.
The STASIS project, in which TIE and ASCORA personnel were partners, provides concepts, mechanisms and tools that support data schema mapping using semantics. Particular focus is put on the linkage of electronic business data to enable information exchange between companies. The approach is transparent, so that users are relieved of the need to define semantic descriptions manually and do not need to invest in expensive technical personnel.
Eurostars OPDM aims at developing technology capable of providing the partners within a supply chain (e.g. manufacturers, content providers, online shops, mail order companies, consumers) with the right means to overcome some of the most prevalent challenges in e-commerce: efficient product data management, efficient data processing workflows and the provision of structured, useful and complete product information. OPDM has set itself the goal of tackling these issues using Semantic Web technologies, intelligent data interpretation and mapping, and sophisticated transformation algorithms. OPDM uses meta-models to describe the features of products and services. These models are based on the GoodRelations ontology, a standard for product data representation and interchange on the Web. Using the OPDM application, domain experts and ordinary users may easily create, edit and extend varied product domain ontologies (standardized vocabularies for product features) in a (self-regulating) collaborative manner.
Tools, Frameworks and Services
TSI - TIE Smart Integrator
TSI makes use of semantic-based format characterisation, mapping and transformation approaches to map to and from different data sources such as XML, XSD, RDF, Flat Files (FF), CSV and RDBMS. The aim of TSI is to provide intelligent, automatic mapping suggestions based on smart semantic algorithms and vocabularies, and then to transform the data from one format to another. It uses crowd-sourced capture of concepts to facilitate the mapping between two different schemas. TSI creates and manages semantic relationships between entities based on a Logical Data Model (LDM). Each schema format (XSD, RDB, FF, RDF) imported into the tool is transformed to an internal neutral format based on the LDM meta-model.
TSI supports features such as:
- One standard workbench (UI) to map and transform data
- Supported data source formats (input and output): XML, RDF/OWL, RDB, CSV, Excel
- Map suggestions between schema elements are generated automatically by semantic algorithms; maps are persisted and can be re-used and shared
- Extra functions (string, numeric, conditional, custom) are supported during mapping to manipulate the data transformation. These functions are called Methlets in TSI. Methlets are intended to modify the semantic links between the different data models by adding easy-to-use, common functionality that allows more accurate mapping.
- Semantic data and maps are stored in a triple data store and connected via SPARQL endpoints
- Support on the UI to query and search the repository using String or SPARQL query
- The generated map file is in Java archive (.jar) format and can be re-used where applicable
- Data transformation can be done by running the mapped .jar file from the workbench, from the command line or via web services
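The idea of attaching extra functions to a mapping link, as Methlets do, can be sketched as follows. This is only an illustration in Python: the function registry, the link layout and all names are hypothetical and are not TSI's actual API or file format.

```python
# Hypothetical registry of named functions, in the spirit of the Methlets
# described above; names and signatures are illustrative, not TSI's API.
FUNCTIONS = {
    "trim": str.strip,
    "upper": str.upper,
    "to_cents": lambda value: int(round(float(value) * 100)),
}

# Each link names a source field, a target field and the functions to apply in order.
LINKS = [
    {"source": "name", "target": "NAME", "apply": ["trim", "upper"]},
    {"source": "price", "target": "price_cents", "apply": ["to_cents"]},
]


def transform(record, links, functions):
    output = {}
    for link in links:
        value = record[link["source"]]
        # Apply the link's functions one after another to the source value.
        for function_name in link["apply"]:
            value = functions[function_name](value)
        output[link["target"]] = value
    return output


result = transform({"name": "  oak chair ", "price": "129.90"}, LINKS, FUNCTIONS)
print(result)
```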
Web Scraping Tools
- Data Toolbar
Web Content Mining Tools
Text Mining Tools
The Content Gateways component is in charge of gathering data from external data sources, including 3rd party systems. The objective is to implement strategies, tools and techniques to allow easy integration of heterogeneous content sources into the SAM Platform. The most important goals are the following:
- Define the mapping between the data source structures and the data destination structures and store it in the Mapping Repository to make it available for re-use.
- Import or link the necessary information into the SAM Platform from Content Providers’ 3rd party systems.
- Extract data from external resources, such as Social Media services or Wikipedia, by implementing internal mechanisms such as API wrappers, web crawlers or scraping techniques.
Architecture and Dependencies
Similar to SAM, the STASIS project relies on semantics for data annotation, storage and transformation. SAM intends to use the STASIS approach of semantic mapping through TIE TSI (TIE Semantic Integrator), the product that TIE has built on the basis of the STASIS project and that will be the base for the development of the SAM gateways. TSI is a tool that offers a user-friendly interface with a wide range of features for defining the mapping between the data source structures and the data destination structures. It supports different kinds of transformations and different data formats based on the most common data file structures, such as XML, CSV or RDF. TSI will also suggest mappings based on the semantic entities inferred from the definitions of the data source and data destination formats (database structure, XML schema, etc.). The mapping definition is stored in the Cloud Storage so that it can be used for further import or data access operations.
The Content Gateways component will have a web interface for managing and controlling the imports provided by Content Providers. Using the Interconnection Bus component (TIE SmartBridge), the Content Gateways component is able to perform complex data import workflows. An import can be started manually or by a scheduled task. Finally, new screen-scraping techniques should be tested in order to accelerate the development of content extraction mechanisms.
Implementation and Technologies
SAM intends to use the STASIS approach of semantic mapping through TIE TSI (TIE Semantic Integrator), the product that TIE has built on the basis of the STASIS project and that will be the base for the development of the SAM gateways (see DOW Section B 22.214.171.124). A comparison of the technologies in the previous section shows that they are very similar in almost all parameters, with the exception of TIE Semantic Integrator (TSI). The key difference is that TSI provides semantic mapping services. This feature allows a mapping between different formats to be created in a semi-automatic way: based on the semantic entities of the source and destination maps, it provides automatic weighted link suggestions that the user can confirm or discard. In addition, TSI provides a user interface with better usability than the other tools compared. TSI is fully integrated with TSB (TIE Smart Bridge), the technology selected to implement the Interconnection Bus component, so that together they offer the SAM Platform a complete interoperability framework covering both semantic data and operational (communications) interoperability.
Web Scraping Tool
Several scraping tools have been compared and, based on the comparison table (see Figure 78), the best candidate is Scrapy because it combines crawling and scraping techniques. However, more experimentation with the different tools is necessary in order to verify which one provides the best results. To improve the efficiency of the development, it may prove beneficial to combine crawling and scraping techniques in order to extract information from semi-structured and non-structured sources.
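The combination of crawling (following links to discover pages) and scraping (extracting data from each page) can be sketched without any external dependency. The example below does not use Scrapy; it is a standard-library illustration over a hypothetical three-page site held in memory, and the `fetch` callable stands in for a real HTTP download.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical three-page site held in memory so the sketch runs offline.
SITE = {
    "/": '<h1>Home</h1><a href="/news">News</a><a href="/about">About</a>',
    "/news": '<h1>News</h1><a href="/">Home</a><a href="/missing">Gone</a>',
    "/about": "<h1>About</h1>",
}


class PageParser(HTMLParser):
    """Scrape the <h1> text and collect the links of one page."""

    def __init__(self):
        super().__init__()
        self.heading = ""
        self.links = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_heading = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.heading += data


def crawl_and_scrape(fetch, start):
    """Breadth-first crawl from start; returns {url: scraped heading}."""
    seen, queue, headings = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page not found: nothing to scrape or follow
            continue
        parser = PageParser()
        parser.feed(html)
        headings[url] = parser.heading.strip()
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return headings


headings = crawl_and_scrape(SITE.get, "/")
print(headings)
```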
A summary of the tasks carried out for each subcomponent of the first version of the prototype is shown in the following table.
|Gateway Control||This subcomponent is in charge of orchestrating the different internal operations in the Content Gateway component|
|Semantic Integrator Editor||The Semantic Integrator Editor is a user interface used by the Media Broadcasters and Information Brokers in order to define the mapping between the data source structures and the data destination structures. It will also suggest mappings based on the semantic entities inferred from the definitions of the data source and data destination formats (database structure, XML schema, etc.). For instance, in order to import an XML format into a database, the Semantic Integrator Editor will present the stakeholder with a mapping suggestion, which the stakeholder can approve or edit. Once the map is developed and approved, it is stored in the Mapping Repository so that it can be used for further import or data access operations|
|Mapping Repository and Mapping Repository UI||This subcomponent will store the transformation maps and their definitions in the Cloud Storage so that they can be reused. It will have a cache memory to accelerate the process. Through this subcomponent, information such as the purpose of a mapping and the descriptions of its input and output data formats is available in the system. The Media Broadcasters and Information Brokers will be able to manage the transformation information through the Mapping Repository User Interface|
|Web Data Extraction||This subcomponent implements functionality to extract data from external resources, such as social media services or Wikipedia, taking into account that the extracted information could be unstructured, structured or semi-structured. It will implement internal mechanisms such as API wrappers, web crawlers or scraping techniques to describe and gather the data|
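The behaviour described for the Mapping Repository (persistent storage of maps plus a cache to speed up repeated access) could look roughly like the sketch below. The class, its method names and the dictionary used in place of the Cloud Storage are assumptions for illustration, not the SAM implementation.

```python
class MappingRepository:
    """Store transformation maps in a backing store, with an in-memory cache."""

    def __init__(self, storage):
        self._storage = storage  # dict-like stand-in for the Cloud Storage
        self._cache = {}
        self.storage_reads = 0  # counts reads that went to the backing store

    def save(self, name, mapping):
        self._storage[name] = dict(mapping)
        self._cache.pop(name, None)  # drop any stale cached copy

    def load(self, name):
        if name not in self._cache:
            self.storage_reads += 1
            self._cache[name] = dict(self._storage[name])
        return self._cache[name]


repo = MappingRepository(storage={})
repo.save(
    "xml-to-db",
    {"purpose": "import programme guide", "input": "XML", "output": "RDB"},
)

first = repo.load("xml-to-db")   # read from the backing store
second = repo.load("xml-to-db")  # served from the cache
print(first, repo.storage_reads)
```

Invalidating the cached entry on every save keeps the cache from serving an outdated map after it has been edited.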
Functionality and UI Elements
Semantic Integrator Editor
- Web Usage Mining
- Web Content Mining, Screen Scraping
- Text Mining: The Next Data Frontier
- Semantic integration: Loosely coupling the meaning of data
- web scraping http://en.wikipedia.org/wiki/Web_scraping
- web content mining http://en.wikipedia.org/wiki/Web_content_mining#Web_content_mining
- text mining http://en.wikipedia.org/wiki/Text_mining#Text_mining_and_text_analytics
- EU STASIS http://tiekinetix.com/innovation/projects/stasis
- Eurostars OPDM http://www.opdm-project.org/