Many different Artificial Intelligence techniques can be used to explore and exploit the large document corpora available inside organizations and on the Web. Although natural language is symbolic in nature and the first approaches in the field were based on symbolic and rule-based methods (e.g. ontologies, semantic networks and knowledge bases), the most widely used methods today are based on statistical approaches. Each of these two main schools of thought in natural language processing, knowledge-based and statistical, has its own strengths and limitations, and there is an increasing trend that seeks to combine them in complementary ways to get the best of both worlds.

This tutorial covers the foundations and modern practical applications of knowledge-based and statistical methods, techniques and models, and their combination, for exploiting large document corpora. The tutorial will first focus on the foundations of many of the techniques that can be used for this purpose, including knowledge graphs, word embeddings and neural networks, and will then show how these techniques are being effectively combined in practical applications, including commercial projects in which the instructors currently participate.


For several decades, semantic systems were predominantly developed around knowledge graphs with different degrees of expressivity. Through the explicit representation of knowledge in well-formed, logically sound ways, knowledge graphs provide knowledge-based text analytics with rich, expressive and actionable descriptions of the domain of interest and support logical explanations of reasoning outcomes. On the downside, knowledge graphs can be costly to produce since they require a considerable amount of human effort to manually encode knowledge in the required formats. Additionally, such knowledge representations can sometimes be excessively rigid and brittle in the face of different natural language processing applications, such as question answering.
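To make the idea of an explicit, logically sound representation concrete, the sketch below models a tiny, hypothetical knowledge graph as subject-predicate-object triples and runs a simple transitive reasoner over `subclass_of` links. All entity and relation names are illustrative toy data, not part of any real ontology.

```python
# A minimal, hypothetical knowledge graph as (subject, predicate, object)
# triples, with a simple transitive reasoner over "subclass_of" links.
triples = {
    ("dog", "subclass_of", "mammal"),
    ("mammal", "subclass_of", "animal"),
    ("rex", "instance_of", "dog"),
}

def superclasses(cls, kg):
    """Return all direct and transitive superclasses of a class."""
    result = set()
    frontier = {cls}
    while frontier:
        nxt = {o for (s, p, o) in kg if p == "subclass_of" and s in frontier}
        frontier = nxt - result
        result |= nxt
    return result

def is_a(entity, cls, kg):
    """Check whether entity is an instance of cls, using subclass reasoning."""
    direct = {o for (s, p, o) in kg if p == "instance_of" and s == entity}
    return cls in direct or any(cls in superclasses(d, kg) for d in direct)
```

The point of the example is that the conclusion `is_a("rex", "animal")` is not stated anywhere in the graph; it follows logically from the encoded triples, which is exactly the kind of explainable inference knowledge graphs enable.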

In parallel, the last decade has witnessed a shift towards statistical methods for text understanding due to the increasing availability of raw data and cheaper computing power. Such methods have proved to be powerful and convenient in many linguistic tasks. In particular, recent results in the field of distributional semantics have shown promising ways to learn language models from text, encoding the meaning of each word in the corpus as a vector in a dense, low-dimensional space. Among their applications, word embeddings have proved to be useful in term similarity, analogy and relatedness, as well as in many downstream natural language processing tasks.
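The distributional idea, that words appearing in similar contexts have similar meanings, can be illustrated without any neural machinery. The sketch below, on a hypothetical toy corpus, represents each word by its co-occurrence counts in a small window and compares words by cosine similarity; real embedding models such as word2vec learn dense vectors rather than raw counts, but the intuition is the same.

```python
from collections import Counter
from math import sqrt

# Toy distributional semantics: represent each word by its co-occurrence
# counts within a +/- 2 word window, then compare words by cosine similarity.
corpus = "the cat sat on the mat the dog sat on the rug".split()

def cooccurrence_vectors(tokens, window=2):
    vectors = {}
    for i, w in enumerate(tokens):
        ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        vectors.setdefault(w, Counter()).update(ctx)
    return vectors

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vecs = cooccurrence_vectors(corpus)
```

Even on this tiny corpus, "cat" and "dog" come out more similar to each other than to a function word like "on", because they occur in near-identical contexts.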

Aimed towards Semantic Web researchers and practitioners, this tutorial shows how it is possible to bridge the gap between knowledge-based and statistical approaches to further knowledge-based natural language processing. Following a practical and hands-on approach, the tutorial tries to address a number of fundamental questions to achieve this goal, including:

  • How can Machine Learning techniques be used to complement the knowledge already captured explicitly in knowledge graphs, extending and curating them in cost-efficient and practical ways?
  • What are the main building blocks and techniques enabling such a hybrid approach to natural language processing?
  • How can structured and statistical knowledge representations be seamlessly integrated?
  • How can the quality of the resulting hybrid representations be inspected and evaluated?
  • How can this improve the quality and coverage of our knowledge graphs?
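One well-known way to integrate structured and statistical representations is retrofitting (Faruqui et al., 2015), which nudges pretrained word vectors towards the vectors of their neighbours in a knowledge graph or lexicon. The sketch below is a simplified version of that idea on hypothetical toy vectors and neighbour lists, not the exact algorithm used in any particular system.

```python
# Simplified retrofitting sketch: iteratively move each word vector towards
# the average of its knowledge-graph neighbours, while staying anchored to
# its original (distributional) vector. All data below is hypothetical.

def retrofit(vectors, neighbours, iterations=10, alpha=1.0, beta=1.0):
    new = {w: list(v) for w, v in vectors.items()}
    for _ in range(iterations):
        for w, nbrs in neighbours.items():
            nbrs = [n for n in nbrs if n in new]
            if not nbrs:
                continue
            for d in range(len(new[w])):
                nbr_sum = sum(new[n][d] for n in nbrs)
                # weighted average of graph neighbours and the original vector
                new[w][d] = (beta * nbr_sum + alpha * vectors[w][d]) / (
                    beta * len(nbrs) + alpha)
    return new

vectors = {"car": [1.0, 0.0], "automobile": [0.0, 1.0], "banana": [0.9, 0.1]}
neighbours = {"car": ["automobile"], "automobile": ["car"]}
retro = retrofit(vectors, neighbours)
```

After retrofitting, synonyms linked in the graph ("car", "automobile") end up closer together, while words with no graph neighbours ("banana") are left untouched: the structured knowledge complements, rather than replaces, what was learned from text.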

Description of the tutorial

The tutorial is half a day long, with plenty of practical content and examples. We plan an interactive session where both instructors and participants can engage in rich discussion of the topic. Some familiarity with the subject matter is expected, but its absence should not prevent potential attendees from coming. The agenda will address the following main points.

  • Creating a language model through word embeddings.
  • Extending word embeddings with structured knowledge.
  • Creating knowledge graph embeddings.
  • Building a vecsigrafo – bringing knowledge from text into knowledge graphs.
  • Evaluating vecsigrafos beyond visual inspection and intrinsic methods.
  • Applications in cross-lingual natural language processing.
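As a baseline for the evaluation step, a common intrinsic method (which the tutorial moves beyond) scores a model's similarity judgements against human ratings on word pairs, typically with Spearman rank correlation. The sketch below implements this from scratch on hypothetical scores; the word-pair dataset and numbers are illustrative only.

```python
from math import sqrt

# Intrinsic evaluation sketch: compare model similarity scores against human
# judgements (as in WordSim-style datasets) via Spearman rank correlation.

def ranks(xs):
    """Assign 1-based ranks by ascending value (no tie handling)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank + 1)
    return r

def spearman(xs, ys):
    """Pearson correlation of the rank sequences of xs and ys."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

human = [9.5, 8.0, 3.1, 1.0]      # human similarity ratings per word pair
model = [0.91, 0.75, 0.40, 0.05]  # hypothetical model cosine similarities
```

Because rank correlation only looks at orderings, a model that ranks the word pairs exactly as humans do scores 1.0 regardless of the absolute similarity values.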


This tutorial seeks to be of special value for members of the Semantic Web community although it is also useful for related communities, e.g. Machine Learning and Computational Linguistics. We welcome researchers and practitioners both from industry and academia, as well as other participants with an interest in hybrid approaches to knowledge-based natural language processing.


The tutorial is offered by the following members of the Cogito Research Lab at Expert System and Recogn.ai.

Jose Manuel Gomez-Perez (jmgomez@expertsystem.com) works at the intersection of several areas of Artificial Intelligence, including Natural Language Processing, Knowledge Graphs and Machine Learning. His vision is to enable machines to understand multilingual text and correlate it with other information modalities like images and diagrams in a way similar to how humans read, building a partnership between both. At Expert System, Jose Manuel leads the Research Lab in Madrid, where he focuses on the combination of structured knowledge graphs and neural representations to extend COGITO's capabilities. Before Expert System, he worked at iSOCO, one of the first European companies to deliver semantic and natural language processing solutions on the Web. He consults for organizations like HAVAS, ING and the European Space Agency and is the co-founder of ROHub, the platform for scientific information management based on research objects. An ACM member and Marie Curie fellow, Jose Manuel holds a Ph.D. in Computer Science and Artificial Intelligence from UPM. He regularly publishes in top scientific conferences and journals of the field and his views have appeared in magazines like Nature and Scientific American.

Ronald Denaux (rdenaux@expertsystem.com) is a senior researcher at Expert System. Ronald obtained his MSc in Computer Science from the Technical University Eindhoven, The Netherlands. After a couple of years working in industry as a software developer for a large IT company in The Netherlands, Ronald decided to go back to academia. He obtained a PhD, again in Computer Science, from the University of Leeds, UK. Ronald’s research interests have revolved around making semantic web technologies more usable for end users, which has required research into (and resulted in various research publications in) the areas of Ontology Authoring and Reasoning, Natural Language Interfaces, Dialogue Systems, Intelligent User Interfaces and User Modelling. Besides research, Ronald also participates in knowledge transfer and product development.

Daniel Vila (daniel@recogn.ai) is co-founder of recogn.ai, a Madrid-based startup and spin-off from UPM, building next-generation solutions for text analytics and content management using AI methods. Daniel holds a PhD in Artificial Intelligence from Universidad Politécnica de Madrid (2016), where he worked at the Ontology Engineering Group and developed the solution supporting a large knowledge graph combining NLP and semantic technologies: the datos.bne.es data service of the National Library of Spain.