Research:Identification of Unsourced Statements

From Meta, a Wikimedia project coordination wiki
23:06, 18 October 2017 (UTC)
Duration:  2017-October — 2017-
Keywords: citation needed, machine learning

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.

To guarantee reliability, Wikipedia's Verifiability policy requires inline citations for any material challenged or likely to be challenged, and for all quotations, anywhere in the article space. While around 300K statements [4] have already been identified as unsourced, we might be missing many more!

This project aims to help the community discover potentially unsourced statements, i.e. statements that need an inline citation to a reliable source, using a machine-assisted framework.


We will flag statements that might need the [citation needed] tag. This recommendation will be based on a classifier that can identify whether a statement needs a reference or not. The classifier will encode general rules on referencing and verifiability, which will come from existing guidelines [1][2] or new observational studies we will put in place.

More specifically, we propose to design a supervised learning classifier that, given examples of statements with citations (or needing citations), and examples of statements where citations are not required, learns how to flag statements with the [citation needed] Template.
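As a minimal sketch of such a classifier, the pipeline below feeds TF-IDF features into a logistic regression using scikit-learn. The toy sentences, labels, and feature choices are invented for illustration only and are not the project's final design.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1 = needs a citation, 0 = no inline citation needed (toy examples)
sentences = [
    "The senator was reportedly involved in the scandal.",
    "Critics say the film was a commercial failure.",
    "Some experts believe the policy caused the decline.",
    "Paris is the capital of France.",
    "The album was released in 2004.",
    "Water boils at 100 degrees Celsius at sea level.",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF unigrams/bigrams feeding a logistic regression
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(sentences, labels)

# score a new statement: probability that it needs a citation
proba = model.predict_proba(["Observers claim the merger was illegal."])[0][1]
print(f"citation-needed probability: {proba:.2f}")
```

In practice, the training data would come from the collection process described below, and the feature set would encode the referencing guidelines rather than raw n-grams alone.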

Data Collection

Platforms and Methods to collect training data

What would we like as training data:

  • Positives: statements requiring an inline citation.
  • Negatives: statements where an inline citation is not needed. Although all material must be verifiable, not every statement needs an inline citation; we should avoid citation overkill.
A mock interface for the unsourced statement labeling task.

Collecting Positive Examples

Positive examples (statements needing a citation) are easy to discover automatically: they are either already referenced or already flagged with the [citation needed] tag.
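As an illustration, positives could be harvested from raw wikitext with simple pattern matching. The template patterns and the sentence splitter below are deliberately simplified assumptions; real wikis use many localized template variants and would need a proper sentence tokenizer.

```python
import re

# a sentence counts as a positive if it carries a {{citation needed}}/{{cn}}
# template or ends with a <ref> tag
CN_TEMPLATE = re.compile(r"\{\{\s*(citation needed|cn)[^}]*\}\}", re.IGNORECASE)
REF_TAG = re.compile(r"<ref[^>]*>.*?</ref>|<ref[^>]*/>", re.DOTALL)
# split after sentence punctuation, a closing template, or a closing ref tag
SENT_SPLIT = re.compile(r"(?<=[.!?])\s+|(?<=\}\})\s+|(?<=</ref>)\s+")

def extract_positives(wikitext):
    positives = []
    for sentence in SENT_SPLIT.split(wikitext):
        if CN_TEMPLATE.search(sentence) or REF_TAG.search(sentence):
            # strip the markup so only the plain statement remains
            clean = REF_TAG.sub("", CN_TEMPLATE.sub("", sentence)).strip()
            if clean:
                positives.append(clean)
    return positives

text = ("The mayor resigned in 2015.{{citation needed}} "
        "She lives in Rome.<ref>Smith 2010</ref> "
        "This sentence has no marker.")
print(extract_positives(text))
# ['The mayor resigned in 2015.', 'She lives in Rome.']
```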

Collecting Negative Examples

Negative examples (statements not requiring a citation) are much harder to find. One automatic approach is to collect statements from which the [citation needed] tag has been removed. Among the manual annotation solutions we considered, one possibility would be the following:

We could prepare a set of candidate statements, and then ask WikiLabel/hypothesis editors to look at each statement and tag it as needing or not needing a citation. For this purpose, we sketched a mock interface for the ideal task, where:

  • a scrollable frame visualizes an article;
  • the article is anchored on a block of text with one sentence highlighted;
  • the highlighted sentence is the statement to be annotated;
  • editors have to decide whether the statement needs a citation or not, following pre-defined guidelines.

Users also have a 'reason' dropdown menu and an optional 'comment' free-text box for additional input. One way to populate the 'reason' dropdown menu would be to run a small-scale experiment: we could provide the Wikipedians with a free-form text field, and use their answers to find out whether there are any missing reasons or preferred language they'd like to use (see pilot proposal). Once we have a good taxonomy of reasons, we could use it for a larger run.
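The automatic route mentioned earlier in this section, harvesting statements from which the [citation needed] tag was later removed, could be sketched as below. The revision texts, the helper name, and the crude whole-article check for new <ref> tags are all illustrative assumptions; a real implementation would diff actual revisions fetched from the API.

```python
import re

CN = re.compile(r"\{\{\s*citation needed[^}]*\}\}", re.IGNORECASE)
REF = re.compile(r"<ref[^>]*>")

def removed_tag_candidates(old_rev, new_rev):
    """Sentences tagged in old_rev that survive untagged (and unreferenced) in new_rev."""
    candidates = []
    # split after a closing template or after sentence punctuation
    for sentence in re.split(r"(?<=\}\})\s+|(?<=[.!?])\s+", old_rev):
        if CN.search(sentence):
            plain = CN.sub("", sentence).strip()
            # crude check: the tag was dropped and no <ref> was added anywhere
            if plain and plain in new_rev and not REF.search(new_rev):
                candidates.append(plain)
    return candidates

old = "The sky is blue.{{citation needed}} It rains often."
new = "The sky is blue. It rains often."
print(removed_tag_candidates(old, new))  # ['The sky is blue.']
```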


Guidelines and templates to watch for data collection (and modeling)

Refining the Scope

The space of possibilities for this project might be too wide. We want to refine the scope of this project so that we address issues that are important across wikis.


The project has to tackle a category of articles:

  1. which is sensitive: a lot of attention is given to the quality of these articles;
  2. whose editing rules are shared by multiple language communities.

Proposed Solution

One of the main categories of articles fulfilling the two requirements is the Biographies of Living People (BLP) category. Not only is this category present in 100+ languages (with 850K pages in English, 150K in Spanish, 175K in Italian, 80K in Portuguese, and 80K in Chinese), but it is also considered a sensitive category by all these projects. There is a Wikimedia Foundation resolution giving directions on how to write biographies of living people. This resolution is available in many languages, and many language-specific guidelines point to it. It says:

The Wikimedia Foundation Board of Trustees urges the global Wikimedia community to uphold and strengthen our commitment to high-quality, accurate information, by:
* Ensuring that projects in all languages that describe living people have policies in place calling for special attention to the principles of neutrality and verifiability in those articles;
* Taking human dignity and respect for personal privacy into account when adding or removing information, especially in articles of ephemeral or marginal interest;
* Investigating new technical mechanisms to assess edits, particularly when they affect living people, and to better enable readers to report problems;
* Treating any person who has a complaint about how they are described in our projects with patience, kindness, and respect, and encouraging others to do the same.

Supporting Material for the Project

Here are some pointers to interesting places where we can find supporting material/data for our project. A general page to watch is the WikiProject on BLP. This page contains pointers to guidelines, data, contests, and users we might want to contact for further information.

Feature Extraction and Guidelines for Annotation

  1. A set of Words to Watch, namely words indicating ambiguous sentences or assumptions/rumors, are available in many languages.
  2. There exist style guidelines in many languages. These can be used as criteria to evaluate the presence of unsourced statements in an article, and implemented through NLP features.
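For example, the words-to-watch lists could be turned into simple NLP features. The tiny English word list below is an invented sample for illustration only; the real per-language lists come from the style guidelines.

```python
# Illustrative feature extractor based on "words to watch": counts of
# weasel words and vague attribution phrases in a sentence.
WEASEL_WORDS = {"some", "many", "reportedly", "allegedly", "critics",
                "experts", "believed"}

def watch_word_features(sentence):
    tokens = sentence.lower().replace(".", "").split()
    hits = [t for t in tokens if t in WEASEL_WORDS]
    return {
        "n_watch_words": len(hits),
        "has_watch_word": bool(hits),
        "sentence_length": len(tokens),
    }

print(watch_word_features("Some experts believe the law failed."))
# {'n_watch_words': 2, 'has_watch_word': True, 'sentence_length': 6}
```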

Data Collection

  1. When citations are completely missing, biographies of living people are marked as Unreferenced. When citations are partially missing, the articles can be found in the BLP_articles_lacking_sources category. This might be a good set to focus on: we can mark the individual sentences in these articles that actually need a source. Some of these unreferenced BLPs were actually 'rescued' by volunteers. We can learn something from this rescuing process, and extract positive/negative candidates from rescued BLPs.
  2. Some BLP articles were marked as 'A class' by members of the WikiProject BLP. We might want to learn from (and possibly extract negatives from) these articles. The guidelines of this initiative can help with this as well.
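Articles in such maintenance categories can be listed through the standard MediaWiki API (action=query, list=categorymembers). The sketch below only builds the request URL for one page of results; actually fetching and paging through the response with the cmcontinue token is left out.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def category_query_url(category, limit=50, cont=None):
    """Build the API URL for one page of category members."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": limit,
        "format": "json",
    }
    if cont:  # continuation token from the previous response
        params["cmcontinue"] = cont
    return API + "?" + urlencode(params)

url = category_query_url("BLP articles lacking sources", limit=10)
print(url)
```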

Potential Applications

  • Smart Citation Hunt: an enhanced version of the Citation Hunt framework, where sentences to be sourced are automatically extracted using our classifier. An additional button helps correct machine errors by letting users indicate that the displayed sentence does not need a citation.
  • Smart Editing: A real-time citation needed recommender that classifies sentences as needing citation or not. The classifier detects the end of a sentence while the editor is typing and classifies the new statement on-the-fly.
  • Citation Needed Hunt: an API (stand-alone tool) taking as input a sentence and giving as output a citation needed recommendation, together with a confidence score.
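The Smart Editing idea above could work roughly as follows: buffer the editor's keystrokes and, whenever a sentence terminator appears, hand the completed sentence to the classifier. Both the LiveChecker class and the stub classify() function below are invented placeholders for this sketch; the real tool would call the trained model.

```python
def classify(sentence):
    # stand-in for the trained classifier: flag one weasel word with a
    # hard-coded confidence
    return {"needs_citation": "reportedly" in sentence.lower(), "confidence": 0.9}

class LiveChecker:
    TERMINATORS = ".!?"

    def __init__(self, classifier=classify):
        self.buffer = ""
        self.classifier = classifier
        self.results = []

    def feed(self, text):
        """Accept newly typed text; classify each completed sentence on-the-fly."""
        for ch in text:
            self.buffer += ch
            if ch in self.TERMINATORS:
                sentence = self.buffer.strip()
                self.results.append((sentence, self.classifier(sentence)))
                self.buffer = ""  # start accumulating the next sentence
        return self.results

checker = LiveChecker()
checker.feed("He was reportedly arrested in May. The trial ")
checker.feed("is ongoing.")
for sentence, verdict in checker.results:
    print(sentence, verdict)
```

Text can arrive in arbitrary chunks, as in the two feed() calls above; a sentence is only classified once its terminator has been typed.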


We aim to work in close contact with the Citation Hunt developers and the Wikipedia Library communities. We will pilot a set of recommendations, powered by the new citation context dataset, to evaluate whether our classifiers can help support community efforts to address the problem of unsourced statements.

