Web2Cit/Research/report

Web2Cit: collaborative automatic citations for web sources

Abstract[edit]

References are one of the main pillars on which Wikipedia is collaboratively constructed. To aid Wikipedia editors with inserting and formatting references, Wikipedia’s visual editor provides an automatic citation generator that produces a formatted citation given a URL, DOI, or other identifier of the cited source. However, this automatic tool does not always succeed in extracting citation metadata from web sources, mostly because these sources fail to embed the metadata appropriately. Until now, the only ways to fix this problem demanded either time or programming skills, ranging from manually fixing the errors to changing the underlying software code. Web2Cit is a tool that promises to lower the barriers to participation by providing a relatively simple way to collaboratively define extraction procedures. But what is the actual performance of the current automatic tool, and how good is Web2Cit at doing what it promises? In this research project we extracted citations from featured articles in the Spanish (ES), English (EN), French (FR), and Portuguese (PT) Wikipedias and compared them against automatically generated citations to estimate this performance. We found that the automatic generator returned the expected results, on average, for 60% of citation fields. In addition, we made available a script that will let us repeat this analysis in the future, once the Web2Cit tool has been more widely adopted by the Wikipedia community.

Web2Cit project in a nutshell[edit]

References play an essential role in Wikipedia’s reliability. They can be entered either by hand or using a template. Wikipedia’s visual editor uses the Citoid extension to create citations by resolving URLs, DOIs, QIDs, etc., into a citation template. To do so, the Citoid service relies (in part) on Zotero web translators to get citation metadata from a website.

Websites that embed metadata appropriately are understood by Zotero’s generic translators, and Citoid processes them accurately. This is often the case for academic journals. However, the results are often inaccurate or incomplete for non-academic sites such as newspapers. These inaccurate and incomplete results are what we refer to in this project as Citoid’s coverage gap.

Site-specific translators are sometimes needed, but most of them rely on web scraping techniques. Moreover, although even popular web sources cited in English Wikipedia sometimes do not expose metadata properly, most of these site-specific translators were developed for English-language sites (see problems related to languages and representation in Phabricator tasks T94170 and T160273, or the Citoid support for Wikimedia references research). Contributions to Zotero's translators repository are open, but they require programming skills. In addition, as explained in the grant’s narrative, some translators may pose cultural and language challenges.

Lack of Zotero web translator coverage forces editors to fall back on manually transcribing citation metadata, a process that may deter them from adding references to their contributions, bias references toward those whose sites expose metadata appropriately, or leave broken citations.

Web2Cit is a set of tools that enables non-technical users to collaboratively define and edit procedures to extract citation metadata from web sources. In addition, it provides a web service that the Citoid extension (among others) can use as a source (alongside official Zotero web translators, Crossref, and Worldcat) to resolve URLs provided by Wikipedia editors, using these community-created translators. More about the Web2Cit project here.

Web2Cit Research Group: Goals[edit]

Main Goals[edit]

The main goals of the Web2Cit Research Group (RG) were (1) to determine and analyze the width and nature of the above-mentioned coverage gap between automatic citations created with Citoid and manually curated citations, and (2) to do so in an automatic way, such that the analysis could be repeated in the future to calculate the impact and accuracy of Web2Cit.

Team members[edit]
  • Nidia Hernández: Script developer and writing
  • Romina De León: Technical staff and editor
  • Gimena del Rio Riande: Coordination and writing

Background and first steps[edit]

The Research Group (RG) started its work on September 7th, 2021. Monthly meetings were scheduled with the Web2Cit project manager, Diego de la Hera.

In order to become familiar with the project’s resources and tools, the RG first reviewed Citoid’s features and its API, and analyzed Wikipedia’s API and Wikipedia’s citation templates in Spanish (ES) and English (EN). We set up a table containing Wikipedia pages (ES and EN) with a description of their basic features and the citation templates they were using.

After that, the RG examined the methods and goals of Andrew Lih’s and Robert Fernández’s project. Lih and Fernández had previously studied Citoid’s performance for news article citations in English. Though their work was performed manually and had a qualitative approach, it provided context and background for our own work.

We also opened a workspace for the RG in Phabricator and a repository in GitHub.

Methodological decisions and resources[edit]

Nidia Hernández started working on the script in a Jupyter notebook in a Python environment that was later moved to Wikimedia’s PAWS (paws.wmcloud.org) (see Script section). Some important decisions were made in order to improve our methodology. First, Gimena del Rio and Romina De León prepared a spreadsheet with the relevant citation templates and parameters that the script would look for in the English, Spanish, Portuguese, and French Wikipedias. We focused only on citation templates that have a URL field. This spreadsheet was first shared with Web2Cit’s advisory board, a group of engaged volunteers who have helped build sustainability and community involvement since the beginning of the Web2Cit project. The spreadsheet was then made public and shared with the community.

English Wikipedia has a vast list of over 100 source-specific templates, while the generic templates for Spanish Wikipedia barely surpass 50. The names of the templates also vary from one language to another, for instance: cite news (EN), cita noticia (ES), article (FR), and citar jornal (PT); or publisher (EN), periódico (ES), périodique (FR), and jornal (PT). For this reason, our identification of citation templates was based on the manually curated list.

The spreadsheet includes:

  • 26 citation templates in EN
  • 18 citation templates in ES
  • 24 citation templates in PT
  • 12 citation templates in FR

In addition, citation templates contain a set of parameters that describe the source: author(s), title, URL, publisher, website, journal, etc. The names of these parameters may vary according to the type of source (web, book, news, journal, thesis, etc.) and the language. We focused on parameters mapping to one of the basic Web2Cit citation fields (URL, title, author_last, author_first, pub_date, pub_source), including one or more parameters (comma separated) for each of these fields. We allowed regular expressions for all fields (e.g., author\d* meaning author1, author2, etc.), as illustrated below.
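
For instance, a minimal Python sketch of how one of these regular expressions is expanded against a template's parameter names (the parameter list below is hypothetical):

    import re

    # "author\d*" in the spreadsheet matches author, author1, author2, etc.
    pattern = re.compile(r"author\d*")

    # Hypothetical parameter names found in a citation template:
    params = ["author1", "author2", "editor", "title"]
    print([p for p in params if pattern.fullmatch(p)])  # ['author1', 'author2']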

In relation to the methodology used for preparing this spreadsheet, we first scrutinized Wikipedia:Citation_templates in EN and looked for analogues in other languages. The Citoid template map was used to assist us in this task, but we also complemented this approach with a more manual and qualitative one, investigating the code of citations in different articles from different Wikipedias.

From this work, it is worth noting that there are substantial differences in certain citation templates and their parameters. This is because different language Wikipedias are independent. For example:

  • comic strip reference (only used in the EN Wikipedia, though in the FR Wikipedia we find it as bande dessinée)
  • cite court (differs between EN, ES and FR)
  • cite newsgroup (not in FR)
  • cite patent (not in ES or FR)
  • cite press release (not in ES or FR)

It is worth mentioning that we are aware that citation templates might not have, and do not need to have, direct analogues (like the above-mentioned comic strip / bande dessinée pair) and that each community defines its own scope. Also, we did not include in our spreadsheet all the citation templates that have a URL field. Some of these unused citation templates are listed here.

Second, our approach is based on the assumption that citation metadata extracted from Wikipedia articles has been curated by the community and is correct. We acknowledge this may not always be the case. Hence, to further strengthen this assumption, we decided to work with featured articles. The EN Wikipedia makes clear that, to be included in this category, articles must pass an editorial review process that evaluates, among other criteria, the appropriateness and formatting of citations[1]. Corpus accuracy and curation issues were previously discussed with the advisory board in meetings and on the mailing list, as well as in our RG meetings.

In summary, our corpus consists of 10.5k featured articles retrieved with Wikipedia’s action API. The selection covers Wikipedias in four different languages: English (~6k featured articles), French (~2k featured articles), Portuguese (~1.3k featured articles) and Spanish (~1.2k featured articles). The wikicode of these featured articles was parsed using the spreadsheet mentioned above to identify the citation templates and parameters to be extracted.

We will delve into the findings inside this corpus in the Results section.

Script[edit]

As mentioned, Nidia Hernández developed a script that can be run in a PAWS notebook. The notebook Understand Citoid coverage for Web2Cit describes how to retrieve the source code and the citation templates of the references of a sample of Wikipedia articles to compare them with the results of the Citoid API for the same references.

Fetching article content[edit]

The script uses each Wikipedia's action API to first retrieve the list of all featured articles for each language. Then it queries the action API again to get the content of each featured article (wikicode) and the ID of its latest revision (revid).
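
As an illustration, a minimal sketch (not the actual script) of this step using the requests library; the category name shown applies to the EN Wikipedia and differs in other languages:

    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def featured_article_titles(limit=50):
        """List featured article titles via the category members API."""
        params = {
            "action": "query", "format": "json", "list": "categorymembers",
            "cmtitle": "Category:Featured articles", "cmlimit": limit,
        }
        members = requests.get(API, params=params).json()["query"]["categorymembers"]
        return [m["title"] for m in members]

    def latest_revision(title):
        """Return (revid, wikicode) for the latest revision of a page."""
        params = {
            "action": "query", "format": "json", "prop": "revisions",
            "rvprop": "ids|content", "rvslots": "main", "titles": title,
        }
        pages = requests.get(API, params=params).json()["query"]["pages"]
        rev = next(iter(pages.values()))["revisions"][0]
        return rev["revid"], rev["slots"]["main"]["*"]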

Citation metadata extraction[edit]

The content of each article is parsed in order to retrieve the references that were introduced using a citation template. To simplify the extraction of metadata, we did not include manually entered references (i.e., those that do not use citation templates), because in these cases the metadata is not consistently structured.

We are interested in extracting the following metadata for each reference:

  1. The source type (journal, book, website, etc.)
  2. The URL of the source
  3. The author(s)
  4. The title
  5. The publishing date
  6. The publishing source (publisher, location, etc.)

The name of each field may vary between templates. For instance, the publishing source appears under "periodical" in the news template and under "publisher" in the books template, and the publishing date might be called "date" or "year" in the maps template.

The script uses the spreadsheet explained above to map the name of each parameter in the citation templates to the metadata fields of interest (see Table 1).

Table 1. An extract of our citation template spreadsheet.

Using this information, we extract the citation metadata from the ~10.5k featured articles obtained in the previous step. To do this, we inspect the content of each article and parse the wikicode using mwparserfromhell. For each template (see the sketch after this list):

  1. we verify if it appears in our spreadsheet
  2. we verify if it contains a URL[2]
  3. we extract the values for the following Web2Cit basic citation fields: URL, title, author_last, author_first, pub_date, pub_source. If multiple parameters map to one of the basic citation fields, we return an array of values.
  4. we store the information in a dataframe containing a list of manual citations, each indicating the article from which it was extracted and the extracted metadata.
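
A simplified sketch of this extraction step using mwparserfromhell; the field-to-parameter mapping shown here is a hypothetical excerpt of our spreadsheet:

    import re
    import mwparserfromhell

    # Hypothetical spreadsheet excerpt:
    # template name -> {Web2Cit field -> parameter-name regexes}
    FIELD_PATTERNS = {
        "cite news": {
            "url": [r"url"],
            "title": [r"title"],
            "author_last": [r"last\d*"],
            "author_first": [r"first\d*"],
            "pub_date": [r"date"],
            "pub_source": [r"newspaper", r"work"],
        },
    }

    def extract_citations(wikicode_text):
        """Return a list of {field: [values]} dicts, one per recognized template."""
        citations = []
        for template in mwparserfromhell.parse(wikicode_text).filter_templates():
            patterns = FIELD_PATTERNS.get(str(template.name).strip().lower())
            if patterns is None or not template.has("url"):
                continue  # not in our spreadsheet, or no URL to send to Citoid
            citation = {}
            for field, regexes in patterns.items():
                values = [
                    str(param.value).strip()
                    for param in template.params
                    for rx in regexes
                    if re.fullmatch(rx, str(param.name).strip())
                ]
                if values:
                    citation[field] = values  # array when several params match
            citations.append(citation)
        return citations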

This processing yielded over 460k manual citations.

Citation metadata retrieval from Citoid[edit]

The script then calls the Citoid API to find the citation metadata for the list of URLs obtained in the previous step. To avoid unnecessary load on the Citoid service, the URLs are filtered: only well-formed, public, http- or https-scheme URLs that do not point to PDFs (which Citoid does not support) are processed.

Additionally, duplicate URLs are eliminated to avoid requesting the same information twice. A cache of Citoid’s responses is also kept for this purpose. These filters, plus the possibility of receiving “Not found” answers from Citoid, reduced the number of automatic citations to 288k. For the purposes of this work, we selected the 91k citations where the requested URL matched the URL returned in Citoid’s response[3].
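
A minimal sketch of the filtering and querying step, using Citoid's public REST endpoint; the "public" check is simplified here (the actual script applies stricter filters):

    from urllib.parse import quote, urlparse
    import requests

    CITOID = "https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/{}"

    def usable(url):
        """Keep well-formed http(s) URLs that do not point to PDFs."""
        parsed = urlparse(url)
        return (parsed.scheme in ("http", "https")
                and bool(parsed.netloc)
                and not parsed.path.lower().endswith(".pdf"))

    cache = {}  # avoid requesting the same URL twice

    def citoid_citation(url):
        """Query Citoid for one URL, caching the response."""
        if url not in cache:
            resp = requests.get(CITOID.format(quote(url, safe="")))
            cache[url] = resp.json()[0] if resp.ok else None  # None ~ "Not found"
        return cache[url]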

It is worth mentioning that we are only interested in the information corresponding to type, title, author, publication date, container title (e.g., the name of the newspaper or journal), and publisher. To identify and extract these metadata from Citoid’s responses, a name mapping is performed via another spreadsheet[4] that maps Zotero (= Citoid) field names to Web2Cit field names. This mapping also allows us to compare the metadata from Citoid’s responses against the manual metadata collected from featured articles.

Data normalization[edit]

Before comparing Citoid's response to the manual citations data, some normalization is necessary.

Regarding the source type information (website, book, etc.), the vocabulary used by Citoid differs from the vocabulary used in citation templates. To address this problem, a new column is added to our dataframe of manual citations, containing the Citoid itemType equivalent of each citation template name (see Table 2).

The mapping between these vocabularies is defined in a Wikipedia-specific Citoid-to-template configuration file (e.g., see here for the EN Wikipedia). However, the mapping is not a one-to-one correspondence. For instance, source types from manual citations are usually more general than those from Citoid: Citoid distinguishes between "blogPost" and "webpage", while both are mapped to the "Cite web" template. In other cases, citation templates are source-tailored (for instance, Census 2006 AUS or Circulaire UAI, among many others) and cannot be mapped to Citoid item types using the aforementioned configuration file. The possible consequences of this information loss are discussed in the Results section.

Table 2. Citation template names (left) mapped to Citoid’s fieldnames (right).
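
In code, this normalization amounts to an inverse lookup of the configuration file; the mapping below is an illustrative fragment, not the full file:

    # Illustrative fragment: one template may correspond to several itemTypes.
    TEMPLATE_TO_ITEMTYPES = {
        "cite web": ["webpage", "blogPost"],
        "cite book": ["book"],
        "cite news": ["newspaperArticle"],
    }

    def expected_item_types(template_name):
        """Citoid itemTypes compatible with a template; [] means unmapped."""
        return TEMPLATE_TO_ITEMTYPES.get(template_name.strip().lower(), [])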

Additionally, other minor normalizations are performed: removal of empty strings (from the manual data), lowercasing (manual and automatic data), and transliteration of non-ASCII characters to their ASCII equivalents (manual and automatic data).

In the case of the manual metadata, dates in natural language are transformed to YYYY-MM-DD format using dateparser. Date values without year information are discarded. In addition, a dedicated function was written to extract the rendered text of external links and wikilinks, also in the manual data.
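
A minimal sketch of the date normalization, assuming dateparser's REQUIRE_PARTS setting is used to discard year-less values:

    import dateparser

    def normalize_date(raw):
        """Natural-language date -> YYYY-MM-DD; values without a year -> None."""
        parsed = dateparser.parse(raw, settings={"REQUIRE_PARTS": ["year"]})
        return parsed.strftime("%Y-%m-%d") if parsed else None

    normalize_date("29 August 1988")  # '1988-08-29'
    normalize_date("March 18")        # None: no year information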

On the side of the automatic data, the author’s first and last names are split into two separate fields, as in the manual data (author first and author last). However, this procedure may lead us to underestimate Citoid’s performance, so it would be desirable to merge these fields in future reassessments.

Metadata comparison[edit]

The evaluation of automatic citations was performed by comparing them against manual citations. Scores go from 1 (Citoid's response is a complete match of the data from the manual citation) to 0 (no match).

We use one of two methods based on the nature of the data to be compared:

  • in one of them (exact match comparison), we check whether the data from Citoid coincides with the manual data. This is the case for fields containing more categorical information (source type and publishing date)
  • in the other approach (edit distance comparison), we measure the similarity between the response from Citoid and the manual data by calculating the edit distance (see below). This is the case for fields containing less structured data (title, publishing source, author first, and author last).

Exact match comparison[edit]

Regarding the coincidence approach for the source type comparison: as explained in the previous section (Data normalization), there is a single value on Citoid's side and a list drawn from a limited set of values on the manual citations' side. Consequently, the comparison of the source type is a simple boolean check: if the response from Citoid is contained in the manual data for the source type, the score is 1; if it is absent, the score is 0.

In the case of the date information, we check whether the full date matches. This means that 2014 vs. 2014-11-28 is considered a wrong answer. In future reassessments this type of case could be counted as a partial match.
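
Both checks reduce to a membership test against the manual list (a sketch):

    def exact_match_score(citoid_value, manual_values):
        """1.0 if Citoid's single value appears in the manual list, else 0.0."""
        return 1.0 if citoid_value in manual_values else 0.0

    exact_match_score("2014-11-28", ["2014"])          # 0.0: partial date is wrong
    exact_match_score("book", ["book", "manuscript"])  # 1.0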

Edit distance comparison[edit]

To evaluate Citoid's responses for unstructured fields (title, publishing source, author first, and author last), we calculate the edit distance (a measure of the number of edits needed to transform one string into another) between Citoid's data and the manual data. The resulting score (the edit distance divided by the length of the longer string, subtracted from 1) ranges continuously from 1 (full match) to 0 (complete difference).
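
A sketch of the normalized score, with a plain dynamic-programming edit distance (the actual script may use a library implementation):

    def levenshtein(a, b):
        """Minimum number of edits (insert, delete, substitute) from a to b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def similarity(a, b):
        """1.0 for identical strings, down to 0.0 for completely different ones."""
        if not a and not b:
            return 1.0
        return 1.0 - levenshtein(a, b) / max(len(a), len(b))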

General considerations[edit]

For all metadata fields we consider the type of data being compared: (1) a single value on Citoid’s side and an array of values on the manual side, or (2) arrays of values on both sides.

For source type, title, publishing source, and publishing date, the data on Citoid's side is a single value and the data on the manual side is a list of one or more elements. When there are multiple values in the manual data, we keep the score of the best match, as sketched below.
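
Reusing the similarity function above (the exact-match check works analogously):

    def best_match_score(citoid_value, manual_values, compare=similarity):
        """Compare Citoid's single value to every manual value; keep the best."""
        return max((compare(citoid_value, m) for m in manual_values), default=0.0)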

Author_first and author_last are the only fields where we have lists on both sides of the comparison, because a citation can have several authors. Citoid's data and the manual data may be in different orders, or one of the lists may have missing information. The comparison method for the author fields handles these difficulties using a permutation strategy: we create permutations of the manual data, first padding the shorter list with empty elements when the lists have different lengths. Then we compare each element of Citoid's data (measuring the edit distance) to the corresponding element in each permuted list. Each permutation thus produces several element-wise scores, so we first average them per permutation and finally keep the best permutation score.

The number of permutations grows rapidly with the length of the list: a list of 3 elements has 6 permutations, while a list of 8 has 40,320. Handling this many permutations becomes too demanding in terms of processing memory, which is why, when citations have more than 7 items for author first or author last, we fall back to comparing elements in their original order, without permutation.
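
A sketch of the author comparison under these rules, reusing similarity from above:

    from itertools import permutations
    from statistics import mean

    def author_score(citoid_authors, manual_authors, max_permuted=7):
        """Best average similarity over permutations of the manual author list."""
        longest = max(len(citoid_authors), len(manual_authors))
        citoid = citoid_authors + [""] * (longest - len(citoid_authors))
        manual = manual_authors + [""] * (longest - len(manual_authors))
        if longest > max_permuted:
            # e.g. 8! = 40320 orderings: fall back to the original order
            return mean(similarity(c, m) for c, m in zip(citoid, manual))
        return max(
            mean(similarity(c, m) for c, m in zip(citoid, perm))
            for perm in permutations(manual)
        )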

Results[edit]

In this section we offer some results related to the width and nature of the coverage gap between automatic citations created with Citoid and manually curated citations.

In the following paragraphs, we first present an analysis of Citoid’s performance by top websites, and then the scores are examined by source type. In the future, results may be analyzed by language as well.

For a more granular approach to the data please consult our interactive visualization of results in results-viz.ipynb.

Overall performance[edit]

In our corpus of manual citations gathered from featured articles, some URLs appear several times, cited in different Wikipedias or in different articles of the same Wikipedia (see Figure 1). This is why we first group the scores by URL and calculate the average, and proceed similarly for web domains.

Figure 1. Number of citations of the evaluated URLs, defining a Zipf curve. A small number of URLs are cited many times (8% of the URLs), while a long tail of URLs are cited only once (80% of the URLs).

Citoid's success is calculated as the number of presumably correct answers (i.e., matching the corresponding citation from Wikipedia, as described in Metadata comparison) among the 6 evaluated metadata fields (author first, author last, title, source type, publishing source, publishing date). For instance, if only the source type and the publishing date are correct, Citoid's success is 2. A response from Citoid is considered correct if the comparison with the manual data returns a score of 0.75 or more. Figure 2 shows that, on average, domains get between 3 and 4 correct citation metadata fields (out of the 6 considered in this study) from Citoid:

Figure 2. Citoid’s performance by domain.
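
The per-citation success count behind Figures 2 and 3 can be sketched as follows (the field names in the score dictionary are assumptions):

    THRESHOLD = 0.75
    FIELDS = ["source_type", "title", "author_first",
              "author_last", "pub_date", "pub_source"]

    def citoid_success(field_scores):
        """Number of fields (0-6) whose comparison score reaches the threshold."""
        return sum(field_scores.get(f, 0.0) >= THRESHOLD for f in FIELDS)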

Top-cited domains[edit]

Figure 3. Citoid’s performance for the most cited web domains.

Regarding the overall success for the most cited domains represented in Figure 3, the best performance is for cricket.archive.com, www.nytimes.com and news.bbc.co.uk. The performance for books.google.com, the most cited domain, shows ~4 correct fields (60% success)[5].

Figure 4. Citoid’s performance by fieldname for the most cited web domains in all languages.

According to Figure 4, the best performing citation field across the 10 most cited domains in all Wikipedias seems to be title (orange dots) whereas the worst performing was publishing source (purple dots). For a visualization of Citoid’s performance by domain for each field in each Wikipedia (EN, ES, FR, PT), please consult our interactive notebook.

Table 3. Sample of Citoid data vs. manual data for books.google.com.
article_url url citoid_success source_type_citoid citation_template title_citoid title_manual author_first_citoid author_first_manual author_last_citoid author_last_manual pub_date_citoid pub_date_manual pub_source_citoid pub_source_manual
14004 https://en.wikipedia.org/wiki/Michael%20Jackson https://books.google.com/books?id=L70DAAAAMBAJ&pg=PA58 2 book cite magazine jet [michael jackson turns 30!] [johnson publishing] [] [company] [] 1988-08-29 [1988-08-29] johnson publishing company [jet, johnson publishing company]
46601 https://pt.wikipedia.org/wiki/Marshall_Applewhite https://books.google.com/books?id=M2QEAAAAMBAJ&pg=PA35 1 book citar periódico the advocate [heaven's scapegoat] [here] [mubarak] [publishing] [dahir] 1997-05-13 [1997-05-13] here publishing [the advocate]
10416 https://en.wikipedia.org/wiki/Operation%20Crossroads https://books.google.com/books?id=V04EAAAAMBAJ&pg=PA74 0 book citation life [what science learned at bikini] [time] [] [inc] [] 1947-08-11 [] time inc [life]
68239 https://en.wikipedia.org/wiki/Drowning%20Girl https://books.google.com/books?id=sPGdBxzaWj0C&pg=RA2-PA158 4 book cite book the grove encyclopedia of american art [the grove encyclopedia of american art] [joan m.] [] [marter] [] 2011 [2011] oxford university press [oxford university press]
90315 https://en.wikipedia.org/wiki/George%20S.%20Patton https://books.google.com/books?id=6_KcBQAAQBAJ&pg=PP3 3 book cite book great leaders: george patton [great leaders: george patton] [willard sterne, nancy] [] [randall, nahra] [willard sterne randall, nancy nahra] 2014-11-28 [2014] new word city [new word city]
27793 https://en.wikipedia.org/wiki/Paul%20McCartney https://books.google.com/books?id=ZhDx0XKwzPkC&pg=PA187 4 book cite book the complete how to kazoo [the complete how to kazoo] [barbara] [barbara] [stewart] [stewart] 2006-01-01 [2006] workman publishing []
23432 https://en.wikipedia.org/wiki/Hurricane%20Juan%20%281985%29 https://books.google.com/books?id=4Ait9MeAg8UC&pg=PA5&lpg=PP1&ie=ISO-8859-1&output=html 0 book lien web 'pamela' in the marketplace: literary controve... [''pamela in the marketplace''] [thomas, elmore fellow and tutor in english la... [] [keymer, keymer, sabor, sabor] [keymer, sabor, , ] 2005-12-15 [] cambridge university press []
51954 https://en.wikipedia.org/wiki/Black%20stork https://books.google.com/books?id=Sf7UBAAAQBAJ&pg=PA90 5 book cite book birds in scotland [birds in scotland] [valerie m.] [valerie m.] [thom] [thom] 2010-11-30 [2010] bloomsbury publishing [london, bloomsbury publishing]
58144 https://en.wikipedia.org/wiki/Japan https://books.google.com/books?id=njnRAgAAQBAJ&pg=PT26 5 book cite book traditional japanese architecture: an explorat... [traditional japanese architecture: an explora... [mira] [mira] [locher] [locher] 2012-04-17 [2012] tuttle publishing [tuttle publishing]
Table 4. Sample of Citoid data vs. manual data for www.nytimes.com.
article_url url citoid_success source_type_citoid citation_template title_citoid title_manual author_first_citoid author_first_manual author_last_citoid author_last_manual pub_date_citoid pub_date_manual pub_source_citoid pub_source_manual
482 https://en.wikipedia.org/wiki/Kim%20Clijsters https://www.nytimes.com/2001/03/18/sports/tennis-serena-williams-wins-as-the-boos-pour-down.html 5 newspaperArticle cite web tennis; serena williams wins as the boos pour ... [tennis; serena williams wins as the boos pour... [selena] [selena] [roberts] [roberts] 2001-03-18 [2001-03-18] the new york times [the new york times]
26287 https://en.wikipedia.org/wiki/Jake%20Gyllenhaal https://www.nytimes.com/2016/07/15/theater/sunday-in-the-park-with-george-with-jake-gyllenhaal-adds-2-performances.html 6 newspaperArticle cite news 'sunday in the park with george,' with jake gy... [''sunday in the park with george'', with jake... [michael] [michael] [paulson] [paulson] 2016-07-14 [2016-07-14] the new york times [the new york times]
56926 https://en.wikipedia.org/wiki/Antonin%20Scalia https://www.nytimes.com/2015/12/03/opinion/justice-scalias-majoritarian-theocracy.html 2 newspaperArticle citation opinion | justice scalia's majoritarian theocracy [justice scalia's majoritarian theocracy] [richard a., eric j.] [] [posner, segall] [] 2015-12-02 [2015-12-02] the new york times []
71410 https://en.wikipedia.org/wiki/Richard%20Cordray https://www.nytimes.com/2012/01/05/us/politics/richard-cordray-named-consumer-chief-in-recess-appointment.html 3 newspaperArticle cite web bucking senate, obama appoints consumer chief [bucking senate, obama appoints consumer chief] [helene, jennifer] [helene, ] [cooper, steinhauer] [cooper, steinhauer, jennifer] 2012-01-04 [2012-01-04] the new york times []
12828 https://en.wikipedia.org/wiki/Orel%20Hershiser%27s%20scoreless%20innings%20streak https://www.nytimes.com/1988/10/07/sports/the-playoffs-troubled-cone-stops-the-press.html 2 newspaperArticle cite web the playoffs; troubled cone stops the press [the playoffs; troubled cone stops the press] [joseph] [] [durso] [durso, joseph] 1988-10-07 [1988-10-07] the new york times []
3436 https://en.wikipedia.org/wiki/Huey%20Long https://www.nytimes.com/1982/07/11/books/american-demagogues.html 6 newspaperArticle cite news american demagogues [american demagogues] [robert] [robert] [sherrill] [sherrill] 1982-07-11 [1982-07-11] the new york times [the new york times]
71421 https://en.wikipedia.org/wiki/Oklahoma%20City%20bombing https://www.nytimes.com/1995/04/20/us/terror-oklahoma-city-investigation-least-31-are-dead-scores-are-missing-after.html 3 newspaperArticle cite news terror in oklahoma city: the investigation; at... [at least 31 are dead, scores are missing afte... [david] [] [johnston] [johnson, david] 1995-04-20 [1995-04-20] the new york times [the new york times]
15337 https://es.wikipedia.org/wiki/The%20Powerpuff%20Girls https://www.nytimes.com/2002/07/03/movies/film-review-they-have-a-tantrum-then-save-the-world.html 5 newspaperArticle Cita web film review; they have a tantrum, then save th... [film review; they have a tantrum, then save t... [stephen] [stephen] [holden] [holden] 2002-07-03 [2002-07-03] the new york times [the new york times]
3769 https://en.wikipedia.org/wiki/Moon https://www.nytimes.com/2014/09/09/science/revisiting-the-moon.html 5 newspaperArticle cite news the moon comes around again [the moon comes around again] [natalie] [natalie] [angier] [angier] 2014-09-08 [2014-09-07] the new york times [the new york times]
47030 https://en.wikipedia.org/wiki/Mars https://www.nytimes.com/2021/05/14/science/china-mars.html 6 newspaperArticle cite news china's mars rover mission lands on the red pl... [china's mars rover mission lands on the red p... [steven lee, kenneth] [steven lee, kenneth] [myers, chang] [myers, chang] 2021-05-14 [2021-05-14] the new york times [the new york times]
Table 5. Sample of Citoid data vs. manual data for www.theguardian.com
article_url url citoid_success source_type_citoid citation_template title_citoid title_manual author_first_citoid author_first_manual author_last_citoid author_last_manual pub_date_citoid pub_date_manual pub_source_citoid pub_source_manual
40406 https://pt.wikipedia.org/wiki/Nintendo https://www.theguardian.com/technology/gallery/2017/may/16/the-10-most-influential-video-games-of-all-time 3 newspaperArticle citar web the 10 most influential video games of all tim... [the 10 most influential video games of all ti... [keith] [] [stuart] [stuart, keith] 2017-05-16 [2017-05-16] the guardian [the guardian]
68408 https://en.wikipedia.org/wiki/Radiohead http://www.theguardian.com/music/2021/may/23/live-at-worthy-farm-review-glastonburys-dodgy-pyramid-scheme-has-stunning-music 4 webpage Cite web live at worthy farm review - beautiful music m... [live at worthy farm review - beautiful music ... None [alexis] None [petridis] 2021-05-23 [2021-05-23] the guardian [the guardian]
65494 https://en.wikipedia.org/wiki/Marilyn%20Monroe https://www.theguardian.com/film/features/featurepages/0 3 webpage cite web marilyn monroe: feminist icon? | features | gu... [happy birthday, marilyn] None [] None [] None [2001-05-29] www.theguardian.com []
2114 https://pt.wikipedia.org/wiki/Bradley%20Cooper http://www.theguardian.com/film/2013/jan/14/golden-globes-2013-winners-list 4 webpage Citar web golden globes 2013: full list of winners [golden globes 2013: full list of winners] [guardian] [] [staff] [] 2013-01-14 [2013-01-14] the guardian [the guardian]
80344 https://pt.wikipedia.org/wiki/Russell%20T%20Davies http://www.theguardian.com/media/2006/jul/17/mediaguardiantop100200614 3 webpage citar jornal 28. russell t davies [28. russell t davies] [guardian] [] [staff] [] 2006-07-17 [2006-07-17] the guardian [the guardian]
75882 https://en.wikipedia.org/wiki/2002%20World%20Snooker%20Championship http://www.theguardian.com/sport/2006/jan/19/snooker.gdnsport31 3 webpage Cite news snooker: world championship finds new sponsors [snooker: world championship finds new sponsors] [mike] [] [anstead] [] 2006-01-19 [2006-01-19] the guardian [the guardian]
52526 https://es.wikipedia.org/wiki/Radiohead http://www.theguardian.com/music/2009/nov/29/radiohead-interview 3 webpage cita web radiohead: band of the decade [radiohead: band of the decade] None [] None [gareth grundy] 2009-11-29 [] the guardian [[http://www.theguardian.com/uk]
82433 https://pt.wikipedia.org/wiki/Daniel%20Radcliffe https://www.theguardian.com/stage/gallery/2017/mar/06/daniel-radcliffe-in-rosencrantz-and-guildenstern-are-dead-old-vic-in-pictures 4 newspaperArticle citar web daniel radcliffe in rosencrantz and guildenste... [daniel radcliffe in rosencrantz and guildenst... None [] None [] 2017-03-06 [2017-03-06] the guardian []
76007 https://pt.wikipedia.org/wiki/Russell%20T%20Davies http://www.theguardian.com/media/2008/jul/14/mediatop100200827 2 webpage citar jornal 31. russell t davies [31. russell t davies] [guardian] [] [staff] [] 2008-07-13 [2008-07-14] the guardian [the guardian]
66215 https://en.wikipedia.org/wiki/Taylor%20Swift http://www.theguardian.com/music/2014/dec/03/taylor-swift-poptimism 4 webpage Cite news taylor swift leads poptimism's rebirth [taylor swift leads poptimism's rebirth] [ian] [ian] [gormely] [gormely] 2014-12-03 [2014-12-03] the guardian []

As we can observe in Table 3, a quick inspection of the pages referred to by the URLs shows that in many cases (for instance, items 14004, 46601, 10416) the errors are a consequence of part-vs-whole identification of the source: while manual citations focus on a specific article of a periodical publication, Citoid returns the metadata for the whole magazine/newspaper and, for this reason, the values of the remaining fields can also differ (title of the article vs. title of the magazine, publishing source as the magazine name vs. publishing source as the publisher, etc.).

In the case of items 27793 and 23432, the automatic metadata for the publishing source is correct, but the manual metadata shows no information, hence Citoid's answer is penalized. In some cases, the missing manual metadata may be due to extraction problems. This could be addressed by adding more parameter names to the citation templates spreadsheet (see Citation metadata extraction). In other cases, Wikipedia’s editors simply did not provide the information, highlighting the main weakness of our automatic approach, which relies on Wikipedia-curated citations as the ultimate source of truth for citation metadata.

Table 4 confirms the results observed in Figure 3 for www.nytimes.com: Citoid shows more than 70% success for this domain, making it the top-cited domain best interpreted by Citoid, in line with previous research (Lih, Fernández, 2017). The automatic citations show consistent metadata for all fields, while the manual citations sometimes present wrong or missing information. In contrast, Table 5 shows a less systematic performance for www.theguardian.com on the automatic side, challenging previous findings for this domain (Lih, Fernández, 2017): the source type is often identified as webpage instead of newspaper article, and “guardian” is sometimes identified as the author, underlining that Citoid’s performance may vary within domains. Finally, the examination of these examples suggests that title, publishing date, and publishing source are the fields with the most correct answers, while source type and author seem to be the most problematic.

Source type identification and variability[edit]

The evaluation of the source type needs some further explanation. Citoid's answers for source types present high scores (see Figure 5): more than 90% correct identification of web sources (i.e., templates mapping to webpage, videoRecording, podcast, blogPost, etc.) and book sources (templates mapping to book or manuscript). The performance decreases to ~40% for journalistic periodical publications (magazineArticle or newspaperArticle).

Figure 5. Average of correct answers for source type identification by Citoid. Source types are understood as the mapping of the citation template name in the manual data with Citoid’s vocabulary for itemTypes.

Two important observations must be made here regarding the source type evaluation. On the one hand, we might be overestimating Citoid's success when identifying source types: whereas fixing a wrong field returned by Citoid is relatively trivial, fixing the citation template itself is quite cumbersome (it involves changing the whole template). As a result, editors might not change it, either while they work or later on. On the other hand, we might be underestimating Citoid's performance for source types because of our source type mapping (see Data normalization): specific citation templates are not included in our mapping (“unmapped” in Figure 5), leaving empty values in our manual data (i.e., they are not mapped to any item type as returned by Citoid). In those cases, Citoid's response for the source type is interpreted as a mismatch with the manual data and penalized as incorrect. However, it must be noted that the unmapped source types represent only 3% of the evaluated citations.

Figure 6. Citoid’s performance by source type for the top 10 most used source types. Source types are understood as the mapping of the citation template name in the manual data with Citoid’s vocabulary for itemTypes.

Finally, it is worth noting that the number of correct metadata fields returned by Citoid may depend on the source type, as suggested by Figure 6.

Notes[edit]

  1. More about featured article criteria at: Wikipedia:Featured article criteria 
  2. It's worth noting that for some citation templates the URL may appear in parameters other than the URL parameters. In these cases we should have considered those parameters as well. This task is pending, as mentioned in: "T301516: Consider alternative "url" parameters". phabricator.wikimedia.org.
  3. The URL returned by Citoid differs from the requested URL in cases of redirection or the presence of a canonical attribute. See Phabricator task T320877 for further information.
  4. See: Zotero to Web2Cit field map spreadsheet
  5. For more information about the most cited domains, see our presentation at Wikiworkshop 2022, “Insights on the references of Wikipedia’s featured articles in English, French, Portuguese and Spanish”: https://wikiworkshop.org/2022/papers/WikiWorkshop2022_paper_18.pdf.