Research:Discovering content inconsistencies between Wikidata and Wikipedia

Tracked in Phabricator: Task T243256
Created: 20:19, 4 February 2020 (UTC)
Collaborators: Meeyoung Cha, Cheng-Te Li, Yi-Ju
Topic: Disinformation
This page documents a completed research project.


Summary

Wikidata is currently the most edited project in the Wikimedia sphere [1]. While there are some efforts to use Wikidata information to populate Wikipedia pages [2], there is little research on the consistency between the information already existing on Wikipedia and the content on Wikidata. Ongoing research compares Wikipedia infoboxes (the structured information on articles) with the content on Wikidata [3]; however, most of the information on Wikipedia is unstructured (i.e., the text of the articles), and there is currently no solution for comparing such content with the information on Wikidata. For example, the article about Chile in the English Wikipedia says: “... It borders Peru to the north, Bolivia to the northeast, Argentina to the east, and the Drake Passage in the far south”. We want to extract that information and compare it with the values of the property “shares border with” (P47) on the Wikidata item about Chile (Q298), to determine whether they are consistent with the text of the English Wikipedia.

Being able to make such comparisons will have important and positive effects on the quality and availability of information in both Wikidata and Wikipedia, helping to detect inconsistencies, to find missing content (in both projects), and to share references. Moreover, applying this technique to several languages will allow Wikidata to serve as a bridge that improves the flow of information across languages, helping to address the knowledge gaps across projects [4]. This alignment of content will also help patrollers in under-resourced communities [5] to detect suspicious content early (e.g., information introduced in the Wikipedia of language X that is not consistent with the information in Wikidata or in other wikis), supporting the fight against disinformation campaigns [6].
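
As a minimal sketch of the Chile example above, the snippet below retrieves the “shares border with” (P47) claims of Chile (Q298) through the public Wikidata API and checks whether each neighbour's English label appears in the quoted Wikipedia sentence. The substring matching is a deliberate simplification standing in for the actual extraction pipeline, which is what this project investigates.

    import requests

    API = "https://www.wikidata.org/w/api.php"

    def claim_values(entity_id, property_id):
        """Item IDs asserted for one property of a Wikidata item."""
        params = {"action": "wbgetclaims", "entity": entity_id,
                  "property": property_id, "format": "json"}
        claims = requests.get(API, params=params).json().get("claims", {})
        return [c["mainsnak"]["datavalue"]["value"]["id"]
                for c in claims.get(property_id, [])
                if "datavalue" in c["mainsnak"]]  # skip novalue/somevalue snaks

    def labels(item_ids, lang="en"):
        """Labels for a batch of items, falling back to the item ID."""
        params = {"action": "wbgetentities", "ids": "|".join(item_ids),
                  "props": "labels", "languages": lang, "format": "json"}
        entities = requests.get(API, params=params).json()["entities"]
        return {qid: e.get("labels", {}).get(lang, {}).get("value", qid)
                for qid, e in entities.items()}

    # Sentence quoted from the English Wikipedia article about Chile.
    sentence = ("It borders Peru to the north, Bolivia to the northeast, "
                "Argentina to the east, and the Drake Passage in the far south.")

    # Neighbours of Chile (Q298) according to "shares border with" (P47).
    for qid, label in labels(claim_values("Q298", "P47")).items():
        status = "mentioned" if label in sentence else "not found in text"
        print(f"{qid} ({label}): {status}")

A real comparison would also have to handle aliases, redirects, and mentions that do not match item labels verbatim; that is precisely where the NLP techniques described below come in.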

Goals

  • Create a mapping between Wikidata and Wikipedia content.
  • Measure the consistency of (factual) information in Wikidata compared with Wikipedia(s).
  • Detect suspicious factual content added to Wikipedia.

Approach

We will use NLP techniques to create a mapping between Wikidata and Wikipedia content in more than one language. Details about the machine learning models and the evaluation methodology will be decided during phase 0.
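
To make the idea of such a mapping concrete, here is a toy illustration (the real models are deliberately left to phase 0): a Wikidata claim is verbalized with a hand-written template, and article sentences are ranked by how many of the claim's tokens they contain. Token coverage is purely an assumption for illustration; entity linking, relation extraction, or multilingual sentence encoders would take its place in practice.

    import re

    def tokens(text):
        """Lowercased word tokens with punctuation stripped."""
        return set(re.findall(r"\w+", text.lower()))

    def claim_coverage(claim, sentence):
        """Fraction of the claim's tokens that occur in the sentence."""
        c = tokens(claim)
        return len(c & tokens(sentence)) / len(c) if c else 0.0

    def best_match(claim, sentences):
        """Article sentence most likely to express the claim."""
        return max(sentences, key=lambda s: claim_coverage(claim, s))

    # Claim Q298 P47 Q414, verbalized with a hand-written template for P47.
    claim = "Chile borders Argentina"

    article = [
        "Chile is a country in South America.",
        "It borders Peru to the north, Bolivia to the northeast, "
        "Argentina to the east, and the Drake Passage in the far south.",
        "Santiago is its capital and largest city.",
    ]

    print(best_match(claim, article))  # selects the bordering sentence

Note that the selected sentence refers to Chile only as “It”; resolving such coreference, and doing so across languages, is exactly the kind of problem the NLP components chosen in phase 0 will need to address.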

Results

Find the results here.

References

  1. https://stats.wikimedia.org/v2
  2. Kaffee, Lucie-Aimée, et al. “Mind the (language) gap: generation of multilingual Wikipedia summaries from Wikidata for ArticlePlaceholders.” In European Semantic Web Conference, pp. 319–334. Springer, Cham, 2018.
  3. https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE
  4. https://research.wikimedia.org/knowledge-gaps.html
  5. https://meta.wikimedia.org/wiki/Research:Patrolling_on_Wikipedia
  6. Saez-Trumper, Diego. “Online Disinformation and the Role of Wikipedia.” arXiv preprint arXiv:1910.12596, 2019.