Research:Identification of Unsourced Statements/Labeling Pilot

From Meta, a Wikimedia project coordination wiki


Check out our labeling interface.

As part of our current work on verifiability, the Wikimedia Foundation’s Research team is studying ways to use machine learning to flag unsourced statements that need a citation. If successful, this project will allow us to identify areas where finding high-quality citations is particularly urgent or important.

To help with this project, we need to collect high-quality labeled data about individual sentences: whether they need citations, and why. Following the success of the first pilot, we would like to continue collecting labels on why sentences need citations. We used your input from the last experiment to generate a predefined taxonomy of reasons why editors add citations. With this taxonomy embedded in the interface, the annotation experience will be much faster and more fun.

Reason Taxonomy


Reasons for adding a citation

  • Quotation: The statement appears to be a direct quotation or close paraphrase of a source
  • Statistics: The statement contains statistics or data
  • Controversial: The statement contains surprising or potentially controversial claims - e.g. a conspiracy theory (see Wikipedia:List_of_controversial_issues for examples)
  • Opinion: The statement contains claims about a person's subjective opinion or idea about something
  • Private Life: The statement contains claims about a person's private life - e.g. date of birth, relationship status.
  • Scientific: The statement contains technical or scientific claims
  • Historical: The statement contains claims about general or historical facts that are not common knowledge
  • Other: The statement requires a citation for reasons not listed above (please describe your reason in a sentence or two)

Reasons for not adding a citation

  • Common Knowledge: The statement only contains common knowledge - e.g. established historical or observable facts
  • Main Section: The statement is in the lead section and its content is referenced elsewhere in the article
  • Plot: The statement is about a plot or character of a book/movie that is the main subject of the article
  • Already Cited: The statement only contains claims that have been referenced elsewhere in the paragraph or article
  • Other: The statement does not require a citation for reasons not listed above (please describe your reason in a sentence or two)

I can't decide whether this statement needs a citation


This is a general 'other' category intended to discourage random guessing.
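
For anyone who wants to process the collected labels programmatically, the taxonomy above can be sketched as a small data structure. This is only an illustration under stated assumptions: the reason names are copied from the lists above, but the dictionary keys (`needs_citation`, `no_citation`, `unsure`), the `Label` record, and the `validate` helper are hypothetical and do not reflect the labeling tool's actual data format.

```python
from dataclasses import dataclass
from typing import Optional

# Reason names are taken from the taxonomy above.
# The decision keys and overall layout are assumptions made for illustration.
REASONS = {
    "needs_citation": [
        "Quotation", "Statistics", "Controversial", "Opinion",
        "Private Life", "Scientific", "Historical", "Other",
    ],
    "no_citation": [
        "Common Knowledge", "Main Section", "Plot", "Already Cited", "Other",
    ],
    # "I can't decide whether this statement needs a citation": no reason list.
    "unsure": [],
}


@dataclass
class Label:
    """Hypothetical record for one annotated sentence."""
    sentence: str
    decision: str                 # one of the keys of REASONS
    reason: Optional[str] = None  # one of the reasons listed for that decision
    comment: str = ""             # free text, expected when reason is "Other"


def validate(label: Label) -> bool:
    """Return True if the label is consistent with the taxonomy."""
    if label.decision not in REASONS:
        return False
    allowed = REASONS[label.decision]
    if not allowed:
        # An "unsure" label carries no reason.
        return label.reason is None
    if label.reason not in allowed:
        return False
    if label.reason == "Other":
        # "Other" asks the annotator to describe the reason in a sentence or two.
        return bool(label.comment.strip())
    return True
```

A label such as `Label("The city has 2 million inhabitants.", "needs_citation", "Statistics")` would pass this check, while a `Plot` reason attached to a `needs_citation` decision would not.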

How to participate


If you are interested in participating, please proceed as follows:

  • Sign up by adding your name to the sign-up page (this step is optional).
  • Go to your language campaign (English Wikipedia, French Wikipedia, Italian Wikipedia), log in, and from 'Labeling Unsourced Statements II', request one or more worksets. Each workset contains 5 tasks and takes at most 5 minutes to complete. There is no minimum number of worksets, but the more labels you provide, the better.
  • For each task in a workset, the tool will show you an unsourced sentence in an article and ask you to annotate it. You can then label the sentence as needing an inline citation or not, and specify a reason for your choice from a pre-defined set of reasons in a drop-down menu.
  • If you can't respond, please select 'skip'. If you can respond but you are not 100% sure about your choice, please select 'Unsure'.

If you have any questions or comments, please let us know by sending an email to miriam@wikimedia.org or leaving a message on the project's talk page. We can adapt the tool relatively easily if something needs to be changed.



See Collecting Statements Data