Research:Citoid support for Wikimedia references

From Meta, a Wikimedia project coordination wiki

Due to their open nature, Wikipedia and its sister sites rely heavily on reliable external references to source facts and statements. Wikipedians have created an intricate system of templates and utilities to format references consistently. This complexity, however, makes it difficult for inexperienced users, or users without deep technical expertise, to add references. Without references, content isn't verifiable and is more likely to be removed by other contributors.

VisualEditor provides a visual interface to edit articles and hides most of the complexity of the underlying wikitext. VisualEditor offers a template editor that makes it easier to add, remove and edit template parameters. It also offers a citation editor, essentially a simpler version of the template editor dedicated to citations. These tools facilitate inserting and editing references, but they still require manual work with a steep learning curve.

Citoid is a tool created by Marielle Volz that extracts bibliographic information from web pages, and structures it into various formats. It's used by MediaWiki to prefill citations in VisualEditor, using the associated Citoid extension. With Citoid, users can add fully-formatted templated references in VisualEditor, by simply providing a URL, DOI, PMID or other supported identifier.
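As an illustration of the "simply providing a URL, DOI, PMID" step, the sketch below builds the request URL that a client could send to a Citoid service. The base URL and the "mediawiki" format name are assumptions based on the public Wikimedia REST API and are not given in this page; the function name build_citoid_url is hypothetical. The sketch only constructs the URL and does not perform any network request.

```python
from urllib.parse import quote

# Assumption: public Wikimedia REST endpoint exposing Citoid (not stated in this page).
CITOID_BASE = "https://en.wikipedia.org/api/rest_v1/data/citation"


def build_citoid_url(identifier: str, fmt: str = "mediawiki") -> str:
    """Return the request URL for a URL, DOI, PMID or other identifier.

    `build_citoid_url` is a hypothetical helper used for illustration only.
    """
    # The identifier travels as a single path segment, so every reserved
    # character (including "/" and ":") has to be percent-encoded.
    return f"{CITOID_BASE}/{fmt}/{quote(identifier, safe='')}"


if __name__ == "__main__":
    # DOI taken from the reference list of this page.
    print(build_citoid_url("10.1145/2491055.2491064"))
```

Encoding the whole identifier matters most for URLs, whose slashes would otherwise be read as extra path segments.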

Citoid relies on Zotero, a third-party program, to identify and extract ("translate") most of the metadata needed to fill out citations. Zotero was initially created as a Firefox add-on to manage academic bibliographic data, like articles from scientific journals. It was later expanded to run as a stand-alone program, and to handle a wide array of non-academic references, like newspaper articles, using dedicated sub-programs ("translators"). This makes it particularly useful in the context of Wikipedia and its sister sites, which rely on a wide variety of sources.

Still, Zotero maintains a bias in favor of English-language, academic sources. Ad-hoc exploratory testing has shown that it's weaker when it comes to non-English sources, especially outside the academic sphere. In many of those cases, it fails to recognize the required metadata, in whole or in part.[1]

Citoid and Zotero developers may add support for additional sources, to improve the coverage of sources commonly used in references on Wikimedia sites. Because developer resources are finite, identifying the websites most commonly used in references, and determining if they're properly supported by Citoid, will help prioritize development efforts.

Research questions

  1. What are the most used domains in references on a given Wikimedia site?
  2. Can Citoid currently extract meaningful and complete information from URLs from those domains to pre-fill citations?

Related work

Wikimedians and researchers have independently investigated references and external links over the past few years. Identifying and ranking external links is relatively easy, since they're recorded in a dedicated table of the database[2], and provided as part of regular database backups[3].

References, however, are only stored in the plain wikitext of each page, so extracting them requires parsing that wikitext. Here, a "reference" is defined as any content placed inside <ref> tags.
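The definition above can be illustrated with a minimal sketch that pulls the content of <ref> tags out of a wikitext string. This is a simplified regular-expression approach written for illustration, not the method used by any of the tools named on this page; the name extract_refs is hypothetical, and edge cases such as references nested inside templates or comments are ignored.

```python
import re

# Matches <ref>...</ref> and <ref name="x">...</ref>.
# The (?<!/) lookbehind rejects self-closing tags such as <ref name="x" />,
# which reuse an earlier reference and carry no content of their own.
REF_RE = re.compile(
    r"<ref(?:\s[^>]*)?(?<!/)>(.*?)</ref\s*>",
    re.IGNORECASE | re.DOTALL,
)


def extract_refs(wikitext: str) -> list[str]:
    """Return the content of every non-empty <ref> element, in page order."""
    return [content.strip() for content in REF_RE.findall(wikitext)]


if __name__ == "__main__":
    sample = (
        "Fact A.<ref>{{cite web|url=http://example.org/a}}</ref> "
        'Fact B.<ref name="x" /> '
        'Fact C.<ref name="y">[http://example.com/b Title]</ref>'
        "<references />"
    )
    for ref in extract_refs(sample):
        print(ref)
```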

Subsets of references

In 2007, Finn Årup Nielsen analyzed the content of scientific citations on the English-language Wikipedia, using the "cite journal" template as starting point[4]. In 2010, he undertook a similar initiative but focused on news citations, using the "cite news" template[5]. In 2013, Heather Ford et al. hand-coded and analyzed a random sample of 500 references from articles of the English-language Wikipedia to determine which were the most common[6].

Supersets of references

In 2010, Ed Summers analyzed all external links from encyclopedic articles ("main namespace") of the English-language Wikipedia, and ranked the top hosts[7]. More recently, Leonard Vertighel wrote a tool listing the most frequent domains found on the Italian-language Wikipedia[8], and Incola wrote a similar query limited to the main namespace[9].

Methods and results

Analyzing subsets or supersets of references provides valuable insight and trends, but our goal was to get a more complete picture.

We used a copy ("database dump") of the content of all pages in the main namespace of the English Wikipedia from 2015-03-04. From this dump, Aaron Halfaker extracted the content of all references using a custom Python utility, mwrefs. The resulting data set, ref_diffs.20150304.tsv.bz2, was made available from Wikimedia's public data sets repository.

Guillaume Paumier wrote another Python utility (refsdomains) to process the data set, by extracting the domain of the URLs contained in references, and by tallying those domains. A preliminary list of domains was posted. Later, data sets of references were extracted and processed for Wikipedia in English, Spanish, Italian, French, Polish, Romanian, Swedish, Chinese, and Russian. This answered the first research question for that set of wikis; for any other given wiki, the same utility can be used. The lists of most cited domains give editors a concrete basis for prioritizing which URLs to test with Citoid.
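The domain extraction and tallying step can be sketched as follows. This is an illustrative approximation, not the actual refsdomains code: the URL pattern, the stripping of a leading "www." and the choice to count each domain at most once per reference are all assumptions, and the names domain_of and tally_domains are hypothetical. "Domain" here means the host name of the URL, not the registrable domain.

```python
import re
from collections import Counter
from typing import Iterable, Optional
from urllib.parse import urlsplit

# Assumption: a URL ends at whitespace or at common wikitext delimiters.
URL_RE = re.compile(r'https?://[^\s|\]<>{}"]+', re.IGNORECASE)


def domain_of(url: str) -> Optional[str]:
    """Return the lower-cased host of `url`, without a leading "www."."""
    try:
        host = urlsplit(url).hostname
    except ValueError:  # malformed URL, e.g. a broken IPv6 literal
        return None
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


def tally_domains(refs: Iterable[str]) -> Counter:
    """Count in how many references each domain appears."""
    counts: Counter = Counter()
    for ref in refs:
        # A set keeps a domain from being counted twice within one reference.
        domains = {d for d in map(domain_of, URL_RE.findall(ref)) if d}
        counts.update(domains)
    return counts


if __name__ == "__main__":
    refs = [
        "{{cite web|url=http://www.example.org/a|title=T}}",
        "[https://news.example.com/x Title]",
        "http://example.org/b and http://EXAMPLE.org/c",
    ]
    for domain, n in tally_domains(refs).most_common():
        print(domain, n)
```

Counting per reference rather than per URL avoids inflating a domain when a single citation includes both an original and an archived link to the same site; counting per URL would be an equally defensible choice.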

Guillaume tested the top domains with Citoid for two wikis, and shared the results with the respective WikiProjects. On the English Wikipedia, the top 15 domains were all at least partially supported. On the French Wikipedia, two of the top 15 domains were not supported at all; the others were at least partially supported.
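One way to make the distinction between "partially supported" and "not supported at all" reproducible is to check which metadata fields come back for a test URL. The sketch below is a hypothetical rubric, not the criteria actually used in these tests: the field list in EXPECTED_FIELDS, the three labels and the function name support_level are all assumptions, and the input is assumed to be a single citation already parsed into a dictionary.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical choice of fields that a usable citation should contain.
EXPECTED_FIELDS = ("title", "author", "date")


def support_level(
    citation: Optional[Mapping[str, object]],
    expected: Sequence[str] = EXPECTED_FIELDS,
) -> str:
    """Classify one extracted citation as "full", "partial" or "none"."""
    if not citation:
        return "none"  # nothing was extracted at all
    # A field counts only if it is present and non-empty.
    found = [field for field in expected if citation.get(field)]
    if not found:
        return "none"
    return "full" if len(found) == len(expected) else "partial"


if __name__ == "__main__":
    print(support_level({"title": "Example", "url": "http://example.org/"}))
```

Applying the same rubric to several URLs per domain, rather than to one, would give a more robust picture, since extraction quality can vary between pages of the same site.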

References


  1. Following the activation of Citoid on the French-, Italian- and English-language Wikipedias, several users started testing it with various sources. On the French-language Wikipedia, this was done on Projet:Sources/Sources les plus utilisées/Liste manuelle. On the English Wikipedia, testing was coordinated around sources available through the "Wikipedia library" program, at Wikipedia:TWL/Citoid, and by members of WikiProject Medicine. Other tests were done by Erica Litrenta and Rachel diCerbo. Other users compared Citoid with other tools used to automatically retrieve and format source metadata: LuisVilla compared Citoid and reFill, and Atlasowa and Ark25 compared Citoid, Zotero and Ark25's "RefScript" tool.
  2. Manual:Externallinks table on
  3. Wikimedia Downloads.
  4. Scientific citations in Wikipedia. Finn Årup Nielsen. First Monday, volume 12, number 8 (August 2007).
  5. Top news cites referenced from Wikipedia. Finn Årup Nielsen. 2010-08-25.
  6. Getting to the source: where does Wikipedia get its information from? Heather Ford et al. Proceedings of the 9th International Symposium on Open Collaboration (WikiSym '13). 2013. doi:10.1145/2491055.2491064
  7. top hosts referenced in wikipedia (part 2). Ed Summers. 2010-08-25.
  8. Collegamenti esterni presenti con maggiore frequenza. Leonard Vertighel. Retrieved on 2015-05-11.
  9. Quarry: Most frequent domains. Incola. Retrieved 2015-05-11.