Research:Identification of Unsourced Statements

23:06, 18 October 2017 (UTC)
Duration:  2017-October — 2017-
citation needed, machine learning

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

To guarantee reliability, Wikipedia's Verifiability policy requires inline citations for any material challenged or likely to be challenged, and for all quotations, anywhere in the article space. While around 300K statements [4] have already been identified as unsourced, we might be missing many more!

This project aims to help the community discover potentially unsourced statements, i.e. statements that need an inline citation to a reliable source, using a machine-assisted framework.


We will flag statements that might need the [citation needed] tag. This recommendation will be based on a classifier that can identify whether a statement needs a reference or not. The classifier will encode general rules on referencing and verifiability, which will come from existing guidelines [1][2] or new observational studies we will put in place.

More specifically, we propose to design a supervised learning classifier that, given examples of statements with citations (or needing citations), and examples of statements where citations are not required, learns how to flag statements with the [citation needed] Template.

Refining the Scope[edit]

The space of possibilities for this project might be too wide. We want to refine its scope so that we address issues which are important across wikis.


The project has to tackle a category of articles:

  1. Which are sensitive: a lot of attention is given to the quality of these articles.
  2. Whose editing rules are shared by multiple language communities.

Proposed Solution[edit]

One of the main categories of articles fulfilling these two requirements is the Biographies of Living People category. Not only is this category present in 100+ languages (with 850K pages in English, 150K in Spanish, 175K in Italian, 80K in Portuguese, 80K in Chinese), but it is also considered a sensitive category by all these projects. There is a resolution of the Foundation giving directions on how to write biographies of living people. This resolution is written in many languages, and many language-specific guidelines point to it. It says:

The Wikimedia Foundation Board of Trustees urges the global Wikimedia community to uphold and strengthen our commitment to high-quality, accurate information, by:
* Ensuring that projects in all languages that describe living people have policies in place calling for special attention to the principles of neutrality and verifiability in those articles;
* Taking human dignity and respect for personal privacy into account when adding or removing information, especially in articles of ephemeral or marginal interest;
* Investigating new technical mechanisms to assess edits, particularly when they affect living people, and to better enable readers to report problems;
* Treating any person who has a complaint about how they are described in our projects with patience, kindness, and respect, and encouraging others to do the same.

Supporting Material for the Project[edit]

Here are some pointers to interesting places where we can find supporting material/data for our project. A general page to watch is the WikiProject on BLP. This page contains pointers to guidelines, data, contests, and users we might want to contact for further information.

Collecting Statements Data[edit]

As discussed in our feasibility study, the main subject of the articles will be Biographies of Living People. Dedicated wiki projects for this category are available in different languages. To collect statements that are very likely to be already well cited, we resort to the category of Wikipedia's Featured Articles. To detect living people, we use Wikidata: we retrieve featured articles linked to Wikidata entries.

Data repository link

Data Format[edit]

We split each article into paragraphs and sentences, then format a file for each language as follows:

<Wikidata ID> <Article Title> <Sec Title> <Start> <Offset> <Sentence> <Paragraph> <Reference|N\A (if not cited)>

Sample line:

 7330495	Richie Farmer	Post-playing career	34048	203	He was one of 88 inaugural members of the University of Kentucky Athletics Hall of Fame in 2005 and one of 16 inaugural members of the Kentucky High School Basketball Hall of Fame in 2012.{{207}}{{208}} 	Farmer was inducted into the KHSAA Hall of Fame in 1998 and the Kentucky Athletic Hall of Fame in 2002.\{\{205}}{{206}} He was one of 88 inaugural members of  the University of Kentucky Athletics Hall of Fame in 2005 and one of 16 inaugural members of the Kentucky High School Basketball Hall of Fame in 2012.{{207}}{{208}} \"The Unforgettables\" were honored on a limited edition collector's bottle of [[Maker's Mark]] [[Bourbon whiskey|bourbon]] in 2007.{{209}}	{newspaper=lexington herald-leader, date=march 31, 2007, page=b3</ref>, title='unforgettables' made their mark, type=news, cite news}	{newspaper=lexington herald-leader, date=july 11, 2012, page=b5</ref>, title=first class for high school hall of fame – 16 ky. stars to be inducted in elizabethtown on saturday, type=news, <ref name=kyhsbbhof>cite news} 
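The fields above are tab-separated. A minimal parsing sketch is shown below; the field names and the convention that an uncited sentence carries `N\A` in the reference field follow the format description, while the function and variable names are our own.

```python
# Parse one tab-separated line of the statements file into a record.
# Field layout follows the format description above; extra reference
# columns beyond the first are ignored in this sketch.
FIELDS = ["wikidata_id", "title", "sec_title", "start", "offset",
          "sentence", "paragraph", "reference"]

def parse_line(line):
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(FIELDS, parts))
    record["start"] = int(record["start"])
    record["offset"] = int(record["offset"])
    # A sentence counts as cited when its reference field is not "N\A".
    record["cited"] = record.get("reference", "N\\A") != "N\\A"
    return record
```

This also gives the automatic labels used later: `cited` sentences are positives, uncited ones candidate negatives.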

Language sources[edit]

Since most of the Wikipedia language editions agree on the resolution for sourcing and editing Biographies of Living People, we could potentially annotate sentences and build models on as many languages as we like. Due to large data availability, for this pilot we might want to focus on the major languages:

Data Analysis[edit]

Compared to the average featured article, the percentage of statements with citations is much higher in biographies!

Per-language breakdown of statements with/without citations (featured biographies)
Per-language breakdown of statements with/without citations (featured articles)

Collecting annotations: WikiLabels[edit]

What would we like as training data:

  • Positives: statements requiring an inline citation.
  • Negatives: statements where an inline citation is not needed. Although everything should be verifiable, we should avoid citation overkill.

Data discovery: manual or automatic?[edit]

Positive examples (statements needing citation) are potentially easy to discover - they are already referenced or flagged as unsourced with the [citation needed] tag. Negative examples (statements not requiring citation) are much harder to find. One can do it automatically by finding statements where the [citation needed] tag has been removed. However, these are very few. We could also consider as negatives statements that do not have an inline citation. But can we rely on this data? We need to resort to manual annotation.

Manual Annotation[edit]

Among the manual annotation solutions we considered, we adopt the WikiLabels platform. Given a set of candidate statements, we ask WikiLabels editors to look at each statement and tag it as needing or not needing a citation. For this purpose, we sketched a mock interface for the ideal task, where: a scrollable frame visualizes an article; the article is anchored on a block of text with a sentence highlighted; the highlighted sentence is the statement to be annotated; editors have to decide whether the statement needs a citation or not, following pre-defined guidelines. Users also have a 'reason' dropdown menu and an optional 'comment' free-text box for additional input. One way to populate the 'reason' dropdown menu would be to run a small-scale experiment: we could provide the Wikipedians with a free-form text field and then use that to find out if there are any missing reasons or preferred language they'd like to use (see pilot proposal). Once we have a good taxonomy of reasons, we could use it for a larger run.

Two different WikiLabels pilots will therefore be conducted:

Interface Example[edit]

  • A scrollable frame visualizes an article without citations [reasoning: it simulates the worst-case, most difficult scenario for editors]
  • The article is anchored on <Sec Title>;
  • The block of text between <Start> and <Offset> is highlighted. The highlighted sentence is the statement to be annotated.
  • Editors are invited to make a decision on whether the statement needs citation or not [TODO: exact text].
  • [Pilot 1] Through a free-text form, editors are also invited to provide a reason for their choice.
  • [Pilot 2] Through a dropdown menu, editors are also invited to provide a reason for their choice from a pre-defined set.
  • Both choice and reason are recorded by WikiLabels
Pilot 1: A free text box allows annotators to specify the reason why the highlighted sentence needs a citation or doesn't need a citation.
Pilot 2: A drop-down menu allows annotators to specify the reason why the highlighted sentence needs a citation or doesn't need a citation.

Guidelines and templates to watch for data collection (and modeling)[edit]

  1. Some BLP articles were marked as 'A CLASS' by the members of the WikiProject BLP. We might want to learn (and possibly extract negatives) from these high-quality articles. The guidelines of this initiative can help with this as well.
  2. When completely missing citations, biographies of living people are marked as Unreferenced. When partially missing citations, they can be found in the BLP_articles_lacking_sources category. This might be a good set to focus on: we can mark the individual sentences in these articles that actually need a source. Some of these unreferenced BLPs were actually 'rescued' by volunteers. We can learn something from this rescuing process, and extract positive/negative candidates from rescued BLPs.

Citation Need Modeling[edit]

Feature Extraction[edit]

Following the guidelines for inline citation need, we implement a set of features that can help model relevant aspects of the sentences. Features are based on both Natural Language Processing (multilingual) and structural information. The feature list and motivations can be found here.

Main Section Feature[edit]

This is a boolean feature equal to 1 if the sentence lies in the article's main section.
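A sketch of this feature, assuming (as our format suggests, though this is an assumption of the sketch) that sentences in the lead section carry an empty section title:

```python
def main_section_feature(sec_title):
    # 1 if the sentence lies in the article's main (lead) section,
    # which we assume is encoded as an empty <Sec Title> field.
    return 1 if sec_title.strip() == "" else 0
```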

Multilingual Word Vectors[edit]

To get an idea of both the overall content and the style, we compute 300-dimensional language-specific fasttext vectors, taking the dictionaries from this publicly available repository. We then align each non-English language vector to English using alignment matrices. This allows us to have a feature vector from the same space for any language. See full code here.
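The alignment step itself is a single matrix multiplication. A minimal numpy sketch, assuming the 300-d language-specific vectors and the 300x300 alignment matrix have already been loaded from the repositories linked above (function names are ours):

```python
import numpy as np

def align_to_english(vectors, alignment_matrix):
    # Map language-specific fastText vectors (n x d) into the shared
    # English space via the d x d alignment matrix, so features from
    # any language live in one common vector space.
    return vectors @ alignment_matrix

def sentence_vector(word_vectors):
    # Represent a sentence as the average of its (aligned) word vectors.
    return np.mean(word_vectors, axis=0)
```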

Words to Watch[edit]

Among the features implemented, we designed a specific feature to measure the distance from the sentence to Wikipedia's Words to Watch, i.e. words indicating ambiguous statements or assumptions/rumors; the list is available in many languages. To do so, we proceed as follows:

  • We identify from the Words to Watch a set of 52 words to watch (see Research:Identification_of_Unsourced_Statements/Word_to_Watch for more details).
  • We translate them into other languages by taking the nearest neighbor based on multilingual fasttext vectors (62% of matches on Italian translations - see this experiment)
  • We compute, for each sentence, the average distance to each word to watch, using fasttext vectors, and store the resulting distance into a 52-d feature vector, code here
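The steps above can be sketched as follows, with cosine distance over the fasttext vectors (the helper names are ours; the linked code is the authoritative implementation):

```python
import numpy as np

def cosine_distance(u, v):
    # Standard cosine distance between two word vectors.
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def words_to_watch_features(sentence_vecs, watch_vecs):
    # For each watch word, average its cosine distance to every word
    # in the sentence; with the 52-word list this yields the 52-d
    # feature vector described above.
    return np.array([
        np.mean([cosine_distance(w, s) for s in sentence_vecs])
        for w in watch_vecs
    ])
```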

Dictionary-based features[edit]

We design features based on lexical dictionaries, consisting mostly of verbs, constructed for specific tasks:

  • report verbs are used in sentences when a citation is attributed; they also convey the stance of the writer w.r.t. the cited information (Recasens et al., ACL 2013).
  • assertive verbs can weaken or strengthen the believability of a statement (Hooper, Syntax and Semantics, 1975).
  • factive verbs, when used, provide assurances regarding the truth of a statement (Hooper, Syntax and Semantics, 1975).
  • hedges are words used to weaken the tone of a statement; the word itself can belong to different parts of speech, e.g. adverb or adjective (Hyland, Continuum, 2005).
  • implicative verbs, when used in sentences, imply the truth of a given action (Karttunen, Language, 1971).

To construct features based on the above dictionaries, we use the same approach as for the Words to Watch features.

Supervised Machine Learning[edit]

We use the above features as input to a Random Forest classifier. We tune the maximum depth and the number of trees using grid search with cross-validation.
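A minimal training sketch with scikit-learn (the library choice and the grid values are illustrative; the document does not fix either):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def train_citation_need_classifier(X, y):
    # Grid-search maximum depth and number of trees with
    # cross-validation, as described above (grid values illustrative).
    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"max_depth": [4, 8, None],
                    "n_estimators": [50, 100]},
        cv=3,
    )
    grid.fit(X, y)
    return grid.best_estimator_
```

`X` would be the concatenation of the feature groups above (main-section flag, aligned word vectors, Words to Watch distances, dictionary features), and `y` the citation-need labels.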


We conducted several experiments to predict the citation need for a sentence.

Feasibility analysis: Automatically labeled data[edit]

We first test the feasibility of the framework, i.e. the separability of sentences with and without citations in the feature space, using the raw automatically labeled data. We use the sentences from featured biographies as training data, considering sentences with an inline citation as positives and sentences without a citation as negatives.

Featured Biography Article Data[edit]

We created a training set with all 7,692 negatives and an equal number of positives. Below are the cross-validation results for the sentences in this data.

All Featured Article Data[edit]

To assess the generalisability of the previous methodology, we also test with data from all featured articles (73,280 negatives and an equal number of positives).

Accuracy of citation need detection on automatically labeled data

Manually labeled data[edit]


Potential Applications[edit]

  • Smart Citation Hunt: an enhanced version of the Citation Hunt framework, where sentences to be sourced are automatically extracted using our classifier. An additional button helps correct machine errors, letting users flag that the visualized sentence does not need a citation.
  • Smart Editing: A real-time citation needed recommender that classifies sentences as needing citation or not. The classifier detects the end of a sentence while the editor is typing and classifies the new statement on-the-fly.
  • Citation Needed Hunt: an API (stand-alone tool) taking as input a sentence and giving as output a citation needed recommendation, together with a confidence score.

Online Evaluation[edit]

We aim to work in close contact with the Citation Hunt developers and the Wikipedia Library communities. We will pilot a set of recommendations, powered by the new citation context dataset, to evaluate if our classifiers can help support community efforts to address the problem of unsourced statements.


This project started as a pilot in October 2017 and will continue until we have an assessment of the performance and suitability of the proposed modeling strategy to support volunteer contributors.



Github repository with data and code:

Pilot experiments with WikiLabels:[edit]

Guidelines and templates to watch for data collection (and modeling)[edit]


See also[edit]

Subpages of this page[edit]

Pages with the prefix 'Identification of Unsourced Statements' in the 'Research' and 'Research talk' namespaces: