Motivation

For several decades, semantic systems were predominantly built around knowledge graphs of varying degrees of expressivity. By representing knowledge explicitly in well-formed, logically sound ways, knowledge graphs provide knowledge-based text analytics with rich, expressive and actionable descriptions of the domain of interest and support logical explanations of reasoning outcomes. On the downside, knowledge graphs can be costly to produce, since encoding knowledge manually in the required formats demands considerable human effort. Additionally, such knowledge representations can sometimes be excessively rigid and brittle when applied to different natural language processing tasks, such as classification, named entity recognition, sentiment analysis and question answering.

In parallel, the last decade has witnessed a shift towards statistical methods for text understanding, driven by the increasing availability of raw data and cheaper computing power. Such methods have proved powerful and convenient in many linguistic tasks. In particular, recent results in distributional semantics have shown promising ways to capture the meaning of each word in a text corpus as a vector in a dense, low-dimensional space. Word embeddings have proved useful for measuring term similarity, analogy and relatedness, as well as in many downstream natural language processing tasks.

Aimed at Semantic Web researchers and practitioners, this tutorial shows how to bridge the gap between knowledge-based and statistical approaches in order to advance knowledge-based natural language processing. Following a practical, hands-on approach, the tutorial addresses a number of fundamental questions towards this goal, including:

  • How can machine learning extend knowledge already captured explicitly in knowledge graphs in cost-efficient and practical ways?
  • What are the main building blocks and techniques that enable such a hybrid approach to natural language processing?
  • How can structured and statistical knowledge representations be seamlessly integrated?
  • How can the quality of the resulting hybrid representations be inspected and evaluated?
  • How can all this improve the overall quality and coverage of our knowledge graphs?

In addition, we explore how the approaches introduced in the tutorial can be used to analyze cross-lingual, cross-modal document corpora that involve not only text but also, for example, images, diagrams and figures.