Overview

Many different artificial intelligence techniques can be used to explore and exploit large document corpora that are available inside organizations and on the Web. While natural language is symbolic in nature and the first approaches in the field were based on symbolic and rule-based methods, like ontologies, semantic networks and knowledge bases, many of the most widely used methods are currently based on statistical approaches. Each of these two main schools of thought in natural language processing, knowledge-based and statistical, has its own strengths and limitations, and there is an increasing trend that seeks to combine them in complementary ways to get the best of both worlds.

This tutorial will cover the foundations and modern practical applications of knowledge-based and statistical methods, techniques and models, as well as their combination, for exploiting large document corpora. The tutorial will first focus on the foundations that can be used to this purpose, including knowledge graphs and word embeddings, and will then show how these techniques can be effectively combined in NLP tasks, also involving data modalities other than text, drawn from research and commercial projects in which the instructors currently participate.

Motivation

For several decades, semantic systems were predominantly developed around knowledge graphs with different degrees of expressivity. Through the explicit representation of knowledge in well-formed, logically sound ways, knowledge graphs provide knowledge-based text analytics with rich, expressive and actionable descriptions of the domain of interest and support logical explanations of reasoning outcomes. On the downside, knowledge graphs can be costly to produce, since they require a considerable amount of human effort to manually encode knowledge in the required formats. Additionally, such knowledge representations can sometimes be excessively rigid and brittle in the face of different natural language processing applications, such as classification, named entity recognition, sentiment analysis and question answering.
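
As a minimal illustration of this kind of explicit, queryable representation, the sketch below uses the rdflib Python library; the vocabulary and triples are invented for the example and are not part of the tutorial materials.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")

# Each fact is an explicit subject-predicate-object triple.
g = Graph()
g.add((EX.NamedEntityRecognition, RDF.type, EX.NLPTask))
g.add((EX.SentimentAnalysis, RDF.type, EX.NLPTask))
g.add((EX.NLPTask, RDFS.subClassOf, EX.Task))

# Query results can be traced back to the asserted triples,
# which is what makes knowledge-based reasoning explainable.
query = "SELECT ?task WHERE { ?task rdf:type ex:NLPTask }"
for row in g.query(query, initNs={"rdf": RDF, "ex": EX}):
    print(row.task)
```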

In parallel, the last decade has witnessed a shift towards statistical approaches to text understanding, driven by the increasing availability of raw data and cheaper computing power. Such methods have proved powerful and convenient in many linguistic tasks. In particular, recent results in the field of distributional semantics have shown promising ways to capture the meaning of each word in a text corpus as a vector in a dense, low-dimensional space. Among their applications, word embeddings have proved useful for term similarity, analogy and relatedness, as well as many downstream tasks in natural language processing.
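
As a concrete illustration of these dense vector representations, the sketch below uses gensim and a pretrained GloVe model from the gensim-data catalogue (downloaded on first use) to compute the similarity, relatedness and analogy operations mentioned above; the model name is a standard catalogue entry, not part of the tutorial materials.

```python
import gensim.downloader as api

# 50-dimensional dense word vectors trained on Wikipedia + Gigaword.
wv = api.load("glove-wiki-gigaword-50")

# Term similarity: cosine similarity between two word vectors.
print(wv.similarity("car", "truck"))    # relatively high
print(wv.similarity("car", "banana"))   # relatively low

# Relatedness: nearest neighbours in the embedding space.
print(wv.most_similar("semantics", topn=3))

# Analogy: vector('king') - vector('man') + vector('woman') is close to 'queen'.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```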

Aimed at Semantic Web researchers and practitioners, this tutorial shows how it is possible to bridge the gap between knowledge-based and statistical approaches to further knowledge-based natural language processing. Following a practical, hands-on approach, the tutorial addresses a number of fundamental questions towards this goal, including:

  • How can machine learning extend previously captured knowledge, explicitly represented as knowledge graphs, in cost-efficient and practical ways?
  • What are the main building blocks and techniques enabling such a hybrid approach to natural language processing?
  • How can structured and statistical knowledge representations be seamlessly integrated?
  • How can the quality of the resulting hybrid representations be inspected and evaluated?
  • How can all of this improve the overall quality and coverage of our knowledge graphs?

In addition, we explore how the approaches introduced in the tutorial can be applied to the analysis of cross-lingual and cross-modal document corpora, involving not only text but also images, diagrams and figures.

Description of the tutorial

The intended length of the tutorial is half a day, with plenty of practical content and examples. We plan an interactive session in which both instructors and participants can engage in rich discussion of the topic, and we will close with some time for open questions. Some familiarity with the subject matter is expected, but its absence should not prevent potential attendees from coming.

The agenda consists of two main blocks (fundamentals and applications), and addresses the following main points:

Fundamentals

  • Capturing meaning from text as word embeddings.
  • Knowledge graph embeddings.
  • Building a vecsigrafo: generating hybrid knowledge representations from text corpora and knowledge graphs (a minimal sketch follows this list).
  • Evaluating vecsigrafos beyond visual inspection and intrinsic methods.
  • Vecsigrafos for curating and interlinking knowledge graphs.
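
To make the vecsigrafo idea above more concrete, here is a deliberately simplified sketch of a hybrid embedding space in which words and knowledge-graph concepts are trained together. It is an illustration under stated assumptions, not the instructors' actual pipeline: the toy lexicon and the kg: concept identifiers stand in for a real word-sense disambiguator and knowledge graph, and gensim's word2vec stands in for the embedding algorithm used in practice.

```python
from gensim.models import Word2Vec

# Hypothetical lexicon mapping surface forms to knowledge-graph concept IDs;
# a real system would use word-sense disambiguation against an actual graph.
lexicon = {"bank": "kg:FinancialInstitution", "money": "kg:Money",
           "river": "kg:River", "water": "kg:Water"}

corpus = [
    ["the", "bank", "holds", "my", "money"],
    ["money", "sits", "in", "the", "bank"],
    ["the", "river", "carries", "water"],
]

# Interleave each word with its concept ID so that words and concepts
# share contexts and therefore end up in the same vector space.
annotated = []
for sentence in corpus:
    tokens = []
    for word in sentence:
        tokens.append(word)
        if word in lexicon:
            tokens.append(lexicon[word])
    annotated.append(tokens)

model = Word2Vec(sentences=annotated, vector_size=25, window=4,
                 min_count=1, epochs=200, seed=1)

# Words and concepts can now be compared directly across the two vocabularies.
print(model.wv.similarity("money", "kg:FinancialInstitution"))
```

Because lexical and graph-based representations share one space, similarities between terms and concepts can be computed directly; this is also what enables evaluation against term-similarity benchmarks and the knowledge-graph curation and interlinking applications listed above.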

Applications

  • Applications in cross-lingual natural language processing.
  • Beyond text understanding: Cross-modal extensions.

Target application domains

  • Classic literature
  • Fake news detection
  • Scientific information management

Audience

This tutorial seeks to be of special value to members of the Semantic Web community, although it is also useful for related communities, e.g. Machine Learning and Computational Linguistics. We welcome researchers and practitioners from both industry and academia, as well as other participants with an interest in hybrid approaches to knowledge-based natural language processing.

Instructors

The tutorial is offered by the following members of the Cogito Research Lab at Expert System.

Jose Manuel Gomez-Perez (jmgomez@expertsystem.com) works at the intersection of several areas of Artificial Intelligence, including Natural Language Processing, Knowledge Graphs and Machine Learning. His vision is to enable machines to understand multilingual text and correlate it with other information modalities, like images and diagrams, in a way similar to how humans read, building a partnership between human and machine. At Expert System, Jose Manuel leads the Research Lab in Madrid, where he focuses on the combination of structured knowledge graphs and neural representations to extend COGITO's capabilities. Before Expert System, he worked at iSOCO, one of the first European companies to deliver semantic and natural language processing solutions on the Web. He consults for organizations like HAVAS, ING and the European Space Agency and is the co-founder of ROHub, the platform for scientific information management based on research objects. An ACM member and Marie Curie fellow, Jose Manuel holds a Ph.D. in Computer Science and Artificial Intelligence from UPM. He regularly publishes in top scientific conferences and journals of the field and his views have appeared in magazines like Nature and Scientific American.

Ronald Denaux (rdenaux@expertsystem.com) is a senior researcher at Expert System. Ronald obtained his MSc in Computer Science from the Technical University Eindhoven, The Netherlands. After a couple of years working in industry as a software developer for a large IT company in The Netherlands, Ronald decided to go back to academia. He obtained a PhD, again in Computer Science, from the University of Leeds, UK. Ronald's research interests have revolved around making Semantic Web technologies more usable for end users, which has required research into, and resulted in various publications across, the areas of Ontology Authoring and Reasoning, Natural Language Interfaces, Dialogue Systems, Intelligent User Interfaces and User Modelling. Besides research, Ronald participates in knowledge transfer and product development.