Research:Wikipedia Knowledge Graph with DeepDive
Only very little amount of information in Wikipedia is structured. Most of the information is embedded in text and extracting it is a non-trivial challenge. In this project, we try to populate specific areas of Wikidata, a structured component of Wikipedia, using DeepDive tool to extract relations embedded in the text. We extract more than 140,000 relations with more than 90% average precision.
This report is structured as follows: first we introduce Wikidata and disambiguate it from Wikipedia. We present DeepDive and the data that we use for this project. Second, we clarify the relations we focused on so far and explain the implementation and pipeline, including our model, features and extractors. Finally, we detail our results with a thorough precision and recall analysis and introduce potential next steps for this project.
Our project aims at enriching Wikidata and more generally the Wikipedia knowledge graph using DeepDive and Wikipedia articles composed of raw text. In this section, we introduce Wikidata to disambiguate it from Wikipedia and present DeepDive and its functioning principles.
Wikidata is a collaboratively edited knowledge base operated by the Wikimedia Foundation. It is a free, collaborative, multilingual, secondary database, collecting structured data to provide support for Wikipedia as well as other projects within the Wikimedia foundation and beyond. While Wikipedia is mainly a collection of text and unstructured information, Wikidata project aims at building a comprehensive database that would contain all the information in a structured manner. One of the potential sources for building such a database is Wikipedia under the condition that we are able to extract this kind of information reliably from natural language. This is where DeepDive comes into play.
DeepDive is a project led by Christopher Ré at Stanford University. It is a new type of data analysis system that enables developers to extract structured relations from raw text and can achieve very high quality beating human volunteers in highly specific tasks. Since the end goal of the current project is to be able to push contributions directly to Wikidata, quality was our main concern and DeepDive our natural choice. Before starting to build a DeepDive application, one needs to understand what it does and what our tasks as developers are.
DeepDive provides a full framework to extract knowledge graph relations from raw text. The user needs to define extractors for mentions, candidates and features, and use distant supervision to extract positive and negative training examples. Based on inference rules provided by the user, DeepDive constructs a factor graph model, learns the weights, and performs inference to find the expectation for each relation.
A factor graph, the model used by DeepDive, is a probabilistic graphical model that can be used to represent complex probabilistic functions as a product of simpler functions. Factor graphs consist of two types of nodes: variables and factors. Each factor is a function acting on a subset of the variables and can have weights associated with it that can be either learned or determined beforehand. DeepDive constructs the factor graph based on user-defined inference rules, learns the weights, and performs marginal inference to infer the expectation of the variables taking a particular value. However to learn weights for this factor graph, the user needs to define the relationship between variables and their features as well as provide training examples.
DeepDive also provides an error analysis tool called MindTagger, that allows quick data annotation through a graphical interface. This tool allows users to interact directly with the data, find correct and incorrect predictions as well as inspect the features involved. Finally, through manual labeling, we can estimate our precision and recall.
To extract the knowledge graph relations at a large scale, we used a dump of English Wikipedia of February 2015. This very rich and large data set of 100Gb was already parsed using the Stanford NLP parser and it contains NLP information such as Name Entity Recognition (NER) and Part Of Speech (POS) tags as well as dependency paths. As we detail in the implementation section, we preprocessed this data for efficiency and correctness reasons.
Populating a knowledge graph of the size of Wikidata is a tremendous task. Therefore, we tried to focus on relations we believe users are usually looking for and that are immutable over time. Starting with relation (company,founder) that involved getting started with DeepDive and its ecosystem, we then moved to the ambitious and large scale project of family tree including spouses, parents, children and siblings. In this section, we detail these applications one by one focusing on pipelines and tasks we performed. A high level view of the pipeline is presented in figure 1.
The first application we focused on aimed at extracting company-founder relations directly from text. The choice of this relation was based on two factors: first, it is a static relation that does not change over time. Second, it was a consequent problem on Wikidata: our investigation revealed that there are 44301 English Wikipedia company articles that do not have founder information in their Wikidata item and only 2078 that do.
The first step in the task of extracting company-founder relations was to filter our initial data and keep only relevant pages. This step was performed to overcome the scaling challenge and is discussed further in Section "Implementation and challenges -- Scale". The second step was to extract candidates for possible relationships. To do so, we find people and company mentions within the same sentence using mainly NER tags and post-process them to complete the names or eliminate the possible errors of the tags.
The third step was to obtain a good training set for the application. While obtaining positive examples is somehow trivial by querying Freebase API for company-founder examples, obtaining relevant negative examples requires more creativity. Every name candidate relation is considered negative if it contradicts one of our positive examples. For instance, in this actual sentence from Wikipedia: ”Eric Emerson Schmidt is an American software engineer, businessman, and the executive chairman of Google.”, Eric Schmidt and Google appeared in the same sentence. They would then be considered as candidates for company-founder relation. However, in our positive examples from Freebase, we know that Google's founders are Larry Page and Sergey Brin. Therefore, Google - Eric Schmidt will be immediately considered as negative example. This method is called distant supervision and allowed us to obtain hundreds of thousands of negative examples that we later sampled down to keep a fair distribution between positive and negative examples. The rest of the candidates are tagged Null and are considered for inference.
Once the candidates were tagged, we start the learning process by extracting features. DeepDive implements a large library of generic features called ddlib that extracts common NLP features like the dependency paths or N-Grams.
Using ddlib, we achieved a fair result that we later used as a baseline to engineer more features as we detail in this report. Using the extracted features and tagged relations, we learn the weights of the features in the factor graph. In this application, the factor graph is a simple logistic regression model. Using these weights, we predict for each Null candidate relation the probability that it is a actual company-founder relation. Following DeepDive convention, we set a threshold at 0.9, which means that we declare a relation to be true if the probability given by the factor graph is over 0.9. Finally, using MindTagger, we detect patterns of errors and implement features to enhance the model.
As our second task, we attempted to construct a family tree of the people in Wikipedia. We focused on four specific relations: parent, sibling, child and spouse. Each of the four relations are with respect to the person that the corresponding Wikipedia page is about. For example, if the page is about Tom Cruise, using the sentences in that page we only extract the children, parents, siblings and spouses of Tom Cruise.
The pipeline for the family tree application is very similar to the pipeline we used for the company-founder application, and as previously explained, it consists of mention, candidate, and feature extraction, distant supervision, building the factor graph, learning and inference. However, in the family tree application, we wanted to extract each of the relations (spouse, sibling, parent and child) simultaneously and build the interdependencies of the relations into the model. Therefore, the factor graph model used in the family tree application is more expressive and requires further explanation.
As mentioned before, DeepDive constructs a factor graph model to infer relations between entities. In the family tree application, there are four relations we attempted to extract: spouse, child, sibling, and parent. These relations correspond to respective variables in the factor graph model and the expectation for each of the variables is inferred through marginal inference.
Each candidate, i.e. sentence containing two people mentions, gets assigned four variables (sibling, spouse, parent, child), and these variables are each connected to multiple IsTrue factors, one for each feature. The IsTrue factor weights depend on the corresponding features and are learned in the learning phase by DeepDive.
In addition to the IsTrue factors, each pair of variables are connected to a MutualExclusion factor. This factor is an or construct, Or(!A, !B), which evaluates to false only if both A and B evaluate to true. This encodes the assumption that if two people have a family relation of type A, they can 't have a family relation of type B. For example child and parent can't be siblings. The weights for these factors are set beforehand to be very large to encode the domain knowledge in the model. A visual representation of the factor graph can be seen in Figure 2.
The positive and negative training examples for the family tree application were collected in a similar manner as for the company-founder application. The positive examples were collected from Freebase and each candidate was labeled positive if it matched a corresponding Freebase relation. As in the company-founder application, every name candidate relation is considered negative if it contradicts one of our positive examples. For example, the sentence ”At Homestead, Jobs became friends with Bill Fernandez, a neighbor who shared the same interests in electronics” would contain one negative example for each of the spouse, sibling, parent, and child relations: Steve Jobs - Bill Fernandez. This example is labeled negative since we know the parents, siblings, spouse and children of Jobs and we know that Bill Fernandez is not contained in any of these sets.
Implementation and challenges
The first challenge we encountered during the implementation was the scale of our initial data. The overall Wikipedia dump was 100 Gb and most of it appeared to be quite useless for our work. Therefore, we started by filtering the data, keeping only relevant pages. Figure 3 shows the workflow to filter the data.
First, we leverage Wikidata's API to get all relevant item ids, then we retrieve their full name. Finally using this full name, we access Wikipedia tools system to extract the related pages in Wikipedia. Once we have the ids, we deploy a regular expression-based hash filter to keep only relevant pages. For the family tree application, we subsampled this data for faster development and deployed at full scale only at the end on a large remote machine as we will detail in the implementation section.
Parsing issues are omnipresent in NLP. In order to identify the patterns and solve them, we used MindTagger, the official DeepDive tool for error analysis. MindTagger, as presented in the preliminaries, allows us to directly see sentences where we succeed or fail to detect relations. This allowed to see for instance that parenthesis in the text hinder our ability to extract full names by adding ”Null” NER tags between ”Person” tags. Solving the problem was rather trivial: ignoring parenthesis after a ”Person” tag. Also, at the beginning of a Wikipedia page, the subject name is usually repeated twice, once in English and once in the original language. We also solved this by detecting pattern of duplication in the name detection. Solving these minor errors boosted our performance and allowed us to reach our precision objectives.
By closely looking at our data, we realized that Wikipedia contributors tend to use only the first or last name when discussing personal life of the person. For instance, on Bill Gates page, we can read Gates married Melinda French in January 1st, 1994. Since the end goal of the project is to push back results to Wikidata, it is important to extract relations between full entities, not partial names. However, we noticed that most of these relations can be extracted by simple rules and heuristics. Although coreferencing systems are widely used and some of the state-of-the-art perform well, they are not perfect and are still very costly to implement. Therefore, we deployed a simplistic, yet efficient, coreferencing system that allowed us to link these partial names to their full entity. We used a heuristic that tries to match up partial names to the full name of the page when studying a person. Then, if we identify a relation, say parents, we complete the name of the second person as well if needed. This fulfilled the requirement of retrieving full entities.
Another issue we wanted to deal with was coreferencing sentences like ”He married Melinda French”, and link the pronouns to actual entities as well. For that, we used heuristics that links the pronoun to a noun in the previous sentence (when there is only one name in the previous sentence and no other pronoun for instance) and then link all the pronouns ”he” in the following sentences. Similar heuristics allow us to extract almost an additional 1 million people mentions.
Choosing, extracting and analyzing features for the factor graph was one of the most time consuming parts of the project that also had a large impact in our end results. In both the founder and the family project, we began by using the generic feature library (ddlib) provided by DeepDive that enabled us to extract, for example, dependency path features, n-gram features and part-of-speech tag features. In both of our projects, using all of the features provided by ddlib resulted in a model with relatively high-recall but quite low precision. The model tended to overfit and many seemingly irrelevant features were given a high weight due to their prevalence in the text.
The next step in the feature engineering process was to reduce the feature space from all of the features provided by the generic feature library to a highly relevant subset. This was done through error analysis using an error analysis tool Mindtagger and through manually observing the features that had high learned weights. The set of relevant features was dependent on the application and the relation, but we noticed that unsurprisingly, dependency path features provided the highest precision but required a large training set.
After narrowing the features into a subset of the generic features, we analyzed the precision and recall errors we made and proceeded to implement features more specific to the application. These features included bucketing existing features in order to compensate for a small training set, cross-products of existing features (dependency path and specific n-grams), and numbers of people, words or organizations inside the relation. Table 1 presents high performing features of the child relation in the family tree application as an example.
|No dependency path detected between mentions||-2.04|
|(Inverse dependency path: prep_of → nsubj , 1-gram: father)||0.95|
|Number of people between > 2||-0.89|
|(Inverse dependency path: poss, 1-gram son)||0.80|
Table 1: high performing features of the child relation
In order to improve the quality of our model, we applied some heuristics to the relations extracted. Most of the low quality data comes from the extraction of only a first name from the text. Therefore, in this case, we added the last name of the corresponding spouse, sibling or parent. We add some other similar heuristics (for instance, in a child relation, when only ”Jr” is extracted, we add the name of the parent) to improve the quality of the data pushed to Wikidata.
Finally, we also removed grammatically correct but obviously wrong relations, such as the ”sons of God” and ”brothers and sisters of Jesus Christ”.
For faster development, we used DeepDive and built the applications on our local machine using a Postgres database. For the company founders application it appeared to be sufficient. However, for the family tree application, local machine were no longer sufficient and we deployed it on one of the Raiders machines of the Infolab. This large machine with over 90 CPUs running at 2.4 GHz and 1TB of RAM was set up and the application deployed on GreenPlum, a highly efficient parallel database.
The implementations of these applications are open-source. Composed mainly of Python script to extract and define relations in the text, the implementation contains also some Shell scripts that helped us automating the work and precision estimation: every time a manual tagging is done, it is added to a shared database table and synchronized between the team members. A reasonable estimate of precision and recall can therefore be computed after each run of DeepDive, this estimate being more and more precise as we manually label more relations (this tool is pushed to DeepDive team).
Finally, the implementation contains our hash filters discussed in the previous sections that were developed in C++ for efficiency reasons.
Before discussing the results of our applications, it is important to understand the output of DeepDive. Once the system is trained, and candidates extracted as described in the pipeline, every prediction happens with a certain confidence or probability and it is up to the developer to fix a threshold for the relation to be evaluated as true. In our case, following DeepDive convention, we fixed this threshold to 0.9 and evaluated our precision and recall based on that. Figure 4 shows the resulting precision for the extraction of the spouse relationship versus the confidence on the test set in the left graph and the number of predictions for each bucket on the right graph. First thing to note about the number of predictions in the buckets is the U-shape. This shows that the factor graph learned is quite confident about its predictions, whether it is true or false, which is a first sign that precision is high for this relation.
The results obtained for these applications are summarized in table 2. As we can see from the table, the recall of our extraction is not very high. The reason for that lies on the tradeoff between recall and precision. Since the end goal of our project is to push data back to Wikidata, we tailored our models to have very high precision with a reasonable recall.
|Relation||Precision||Recall||Number of relations extracted|
Table 2: precision, recall and number of relations extracted
Discussion and next steps
To conclude, we managed to produce an efficient and very precise extractor of these relations from the Wikipedia corpus, through building relevant training examples, extractors, features and factor graph models while dealing with scaling and parsing issues.
We are currently pushing these relations to Wikidata and writing a meta page for this project. Working with the DeepDive team, we are implementing tools to get feedback from the Wikidata community about the data and potential next leads.
This project being done, we still have our minds full of ideas. If we had 10 more weeks, we would like to:
• Improve the text chunking: build a whole DeepDive process for a good pronoun-entity linking. In particular find a significant number of negative training examples and relevant features. That would drastically improve our recall.
• Go through a deeper recall analysis to figure out what is really missing to our model and how to extract more relations from Wikipedia
• Work closely with the Wikidata team to determine the best implementation of these relations and how to make the most of a tree of life
• Expand to more relations (in particular to interdependent relations for which our current results would help us build a relevant factor graph).
We would like to thank Christopher Ré for his generous mentorship throughout the project and Leila Zia for the project idea and advice and for making our project impactful within the Wikimedia Foundation. We would also like to thank Jaeho Shin, Raphael Hoffmann, Ce Zhang and Zifei Shan of the DeepDive team for their help and advice regarding DeepDive.
 Frank R Kschischang, Brendan J Frey, and Hans-Andrea Loeliger. Factor graphs and the sum-product algorithm. Information Theory, IEEE Transactions on, 47(2):498–519, 2001.
 Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 1003–1011. Association for Computational Linguistics, 2009.
 Denny Vrandečić. Wikidata: A new platform for collaborative data collection. In Proceedings of the 21st international conference companion on World Wide Web, pages 1063–1064. ACM, 2012.
 Sen Wu, Ce Zhang, Feiran Wang, and Christopher R ́e. Incremental knowledge base construction using deepdive. arXiv preprint arXiv:1502.00731, 2015.
 Ce Zhang, Christopher Ré, Amir Abbas Sadeghian, Zifei Shan, Jaeho Shin, Feiran Wang, and Sen Wu. Feature engineering for knowledge base construction. arXiv preprint arXiv:1407.6439, 2014.