From Relation Extraction to Knowledge Graphs

M.Sc. thesis, research project in the areas of machine learning and natural language processing.
Python, TensorFlow, Stanford CoreNLP

This master's thesis tackles the problem of building a Knowledge Graph of concepts using Relation Extraction from text. Concepts are short phrases made of adjectives and nouns. The first part of the work develops several models (CNN, RNN, Bi-RCNN) that classify the semantic relationship between two concepts (Relation Classification). We build on convolutional and recurrent neural networks and improve them significantly by using counterexamples (negative sampling as well as data augmentation).
The second part of this work focuses on building a dataset containing the types of relations Iprova is interested in, training our best model on it, and applying it to concepts and sentences extracted from different corpora in order to build representative Knowledge Graphs. Furthermore, to increase the rate of true positives, we tune a confidence threshold for each relation so as to minimize false positives, keeping precision high while maximizing recall.
The relations predicted by our models are compared with state-of-the-art systems using the F1-score on the SemEval-2010 Task 8 dataset, and our models outperform all other models in the literature. However, Knowledge Graphs built with the concept extraction system and these models cannot be used as they are by an automatic system. One reason is that a true-positive rate of around 80% is not enough for a system to be considered high-precision. Moreover, we can face mixed semantics (e.g., homographs) as well as hypothetical relations that may be true only in specific cases (e.g., does a lung contain metastases?).
Finally, to our knowledge, this kind of Knowledge Graph does not currently exist (at least not publicly). We contribute a tool for modeling domains of interest that provides related concepts and the relations among them, as well as a state-of-the-art model for the Relation Classification task of SemEval-2010 Task 8. Furthermore, our proposed systems can readily be improved by using a pairwise ranking loss function to strengthen our models, and by inferring new relations from prior knowledge in Knowledge Graphs in order to increase the number of relations.
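
As an illustration of the evaluation metric mentioned above, here is a minimal sketch of a macro-averaged F1-score that leaves out the "Other" class. It is not the official SemEval-2010 Task 8 scorer: it treats every directed label as its own class, which is a simplification, and the function name `macro_f1` and the toy labels are assumptions made for this example.

```python
from collections import Counter


def macro_f1(gold, pred, ignore=("Other",)):
    """Macro-averaged F1 over every label seen in gold/pred, except `ignore`."""
    labels = sorted((set(gold) | set(pred)) - set(ignore))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (illustrative only, not taken from the thesis).
gold = ["Cause-Effect(e1,e2)", "Cause-Effect(e1,e2)", "Component-Whole(e1,e2)", "Other"]
pred = ["Cause-Effect(e1,e2)", "Other", "Component-Whole(e1,e2)", "Component-Whole(e1,e2)"]
score = macro_f1(gold, pred)
```

Excluding "Other" from the average means that a model is rewarded only for the relations of interest, while wrongly predicting one of them for an "Other" sentence still costs precision.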


This thesis is confidential; unfortunately, the non-disclosure agreement and the confidentiality contract do not allow me to discuss everything.

Bi-Recurrent Convolutional Neural Network

We implement advanced neural networks for the task of Relation Classification, combining deep convolutional and recurrent neural networks. Even without data augmentation or negative sampling, we already achieve state-of-the-art results, and with these techniques we clearly outperform all other models in the literature.

Shortest dependency path in the dependency parse tree, word embeddings, part-of-speech tags, named entity categories and WordNet hypernyms

As the main feature, we use the shortest dependency path between the two entities in the dependency parse tree, because it represents a trimmed version of the sentence that retains the most important information. Moreover, the relationship between the two entities is directed, which is important for the Relation Classification task. We use special embeddings to improve performance, and we also use additional features such as part-of-speech tags, named entity categories and WordNet hypernyms.
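
To make the shortest dependency path concrete, the sketch below runs a breadth-first search over a dependency tree given as (head, dependent, label) triples. The parse is written by hand for a toy sentence and is an assumption, not output from Stanford CoreNLP; the function name `shortest_dependency_path` is also hypothetical. Tokens are identified by their surface form, which only works here because no word repeats; a real pipeline would use token indices.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Return the path as (token, label, direction, token) steps, or None."""
    graph = {}
    for head, dep, label in edges:
        graph.setdefault(head, []).append((dep, label, "down"))  # head -> dependent
        graph.setdefault(dep, []).append((head, label, "up"))    # dependent -> head
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbour, label, direction in graph.get(node, []):
            if neighbour not in parent:
                parent[neighbour] = (node, label, direction)
                queue.append(neighbour)
    if target not in parent:
        return None
    # Walk back from the target to the source, then reverse.
    path, node = [], target
    while parent[node] is not None:
        prev, label, direction = parent[node]
        path.append((prev, label, direction, node))
        node = prev
    return list(reversed(path))


# Hand-written parse of "The burst has been caused by water hammer pressure".
edges = [
    ("caused", "burst", "nsubjpass"), ("burst", "The", "det"),
    ("caused", "has", "aux"), ("caused", "been", "auxpass"),
    ("caused", "pressure", "nmod"), ("pressure", "by", "case"),
    ("pressure", "water", "compound"), ("pressure", "hammer", "compound"),
]
path = shortest_dependency_path(edges, "burst", "pressure")
# path == [("burst", "nsubjpass", "up", "caused"),
#          ("caused", "nmod", "down", "pressure")]
```

The path keeps only "burst", "caused" and "pressure", and the "up"/"down" markers preserve the direction of each dependency edge, which is the information that distinguishes the two possible orderings of the entities.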

Building representative Knowledge Graphs from different corpora

Once our Relation Classification model achieves state-of-the-art results, we apply it to different corpora in order to extract interesting relations among concepts. To increase the rate of true positives (assessed by manual human evaluation), we optimize the confidence threshold for each relation so as to minimize false positives, keeping precision high while maximizing recall.
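
One possible way to tune such thresholds is sketched below: for each relation, sweep the predictions from most to least confident and keep the lowest threshold whose precision still meets a target, which maximizes recall under that constraint. This is an assumption about how the tuning could be done, not the procedure used in the thesis; the function names, the `min_precision` parameter, the relation names and the validation data are all illustrative.

```python
def tune_threshold(scored, min_precision=0.9):
    """scored: list of (confidence, is_correct) pairs for one relation.

    Returns (threshold, precision, recall) for the lowest threshold whose
    precision is at least `min_precision`, or None if no threshold qualifies.
    """
    total_correct = sum(1 for _, ok in scored if ok)
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best = None
    tp = fp = 0
    for i, (confidence, ok) in enumerate(ranked):
        if ok:
            tp += 1
        else:
            fp += 1
        # Evaluate only once all predictions sharing this confidence are counted.
        if i + 1 < len(ranked) and ranked[i + 1][0] == confidence:
            continue
        precision = tp / (tp + fp)
        recall = tp / total_correct if total_correct else 0.0
        if precision >= min_precision:
            best = (confidence, precision, recall)
    return best


def tune_all(validation, min_precision=0.9):
    """Tune one threshold per relation from manually judged predictions."""
    return {rel: tune_threshold(scored, min_precision)
            for rel, scored in validation.items()}


# Illustrative validation data: (model confidence, judged correct?).
validation = {
    "Component-Whole": [(0.95, True), (0.90, True), (0.80, False),
                        (0.70, True), (0.60, False)],
    "Cause-Effect": [(0.90, False), (0.50, True)],
}
thresholds = tune_all(validation, min_precision=0.75)
# thresholds["Component-Whole"] == (0.70, 0.75, 1.0)
# thresholds["Cause-Effect"] is None
```

A relation whose result is None never reaches the target precision on the validation data, so its predictions would be discarded rather than added to the Knowledge Graph.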


Type M.Sc. thesis
Degree M.Sc. EPFL at Iprova, last semester
Course -
Duration ~1'000 hours
EPFL Supervisor Dr. Jean-Cédric Chappelier
Iprova Supervisor Bernard Maccari