Research:Characterizing Wikipedia Citation Usage/First Round of Analysis


This page summarizes the findings from the analysis of the first round of data collection. We analyzed how often readers click on references that link to an external source by joining the data collected through our instrumentation with the page views recorded in the webrequest table in the Data Lake.

Data

Reference click data collection

We collected 10 days of data, from Jun 28th to Jul 9th 2018. The schema for the collected data is Schema:CitationUsage. The data comes from non-logged-in users only. More information on the data collection is available on the main project page.

The schema recorded around 3M events per day, for a total of 32 million events over the course of 10 days. We detected 4 types of events:

  • `extClick` — click on external URLs;
  • `upClick` — click that takes the user from the reference at the bottom back to the anchor (e.g., “[1]”) in the main text (e.g., on “^”);
  • `fnClick` — clicks on page-internal links (e.g., “[1]”) that take the user to the reference section at the bottom;
  • `fnHover` — hover of at least 1000 ms over a reference marker (e.g., “[1]”) in the main text of an article.

| Date | Number of Events | upClick | extClick | fnClick | fnHover |
|---|---|---|---|---|---|
| 29 June 2018 | 2912911 | 15827 | 1380607 | 605643 | 910834 |
| 30 June | 2509565 | 13259 | 1175389 | 601582 | 719335 |
| 01 July | 2912911 | 14336 | 1292807 | 670076 | 781537 |
| 02 July | 3218551 | 17072 | 1506356 | 679826 | 1015297 |
| 03 July | 3160465 | 15639 | 1478172 | 658227 | 1008406 |
| 04 July | 3015490 | 19191 | 1396499 | 660644 | 939093 |
| 05 July | 3170142 | 21594 | 1473209 | 663434 | 1011812 |
| 06 July | 2980773 | 15261 | 1380396 | 643607 | 941431 |
| 07 July | 2603657 | 13514 | 1216515 | 641242 | 732296 |
| 08 July | 3324488 | 15003 | 1341844 | 692435 | 810974 |
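
To get a feel for how the four event types compare, the daily counts can be turned into shares. The following is a minimal sketch and not part of the original analysis: the numbers are copied from the table, the function name `event_shares` is an illustrative choice, and the shares are normalised by the sum of the four per-type columns rather than by the "Number of Events" column.

```python
# Daily counts copied from the table above.
# Tuple layout: (number of events, upClick, extClick, fnClick, fnHover).
DAILY_COUNTS = {
    "29 June": (2912911, 15827, 1380607, 605643, 910834),
    "30 June": (2509565, 13259, 1175389, 601582, 719335),
    "01 July": (2912911, 14336, 1292807, 670076, 781537),
    "02 July": (3218551, 17072, 1506356, 679826, 1015297),
    "03 July": (3160465, 15639, 1478172, 658227, 1008406),
    "04 July": (3015490, 19191, 1396499, 660644, 939093),
    "05 July": (3170142, 21594, 1473209, 663434, 1011812),
    "06 July": (2980773, 15261, 1380396, 643607, 941431),
    "07 July": (2603657, 13514, 1216515, 641242, 732296),
    "08 July": (3324488, 15003, 1341844, 692435, 810974),
}
EVENT_TYPES = ("upClick", "extClick", "fnClick", "fnHover")


def event_shares(row):
    """Return each event type's share of the four per-type counts in a row."""
    typed = row[1:]                 # drop the "Number of Events" column
    typed_total = sum(typed)        # normalise by the per-type columns only
    return {name: count / typed_total for name, count in zip(EVENT_TYPES, typed)}


for day, row in DAILY_COUNTS.items():
    shares = event_shares(row)
    print(day, {name: round(share, 3) for name, share in shares.items()})
```

On every day in the table, `extClick` is the most frequent event type and `upClick` the least frequent.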

Reference text collection

We also parse the XML dumps to collect information about the text and templates used to reference external sources in English Wikipedia. Below is a plot of the most popular citation templates in English Wikipedia. Each bar represents how many times a given template appears in the references of all English Wikipedia articles.

Each bar represents how many times a given template appears in the references of all English Wikipedia articles


Page requests data collection

We counted the page views relevant to our analysis by using the table wmf.webrequest. We limited the selection to the English version of Wikipedia, namespace 0, requests generated from desktop or mobile web (not the app), and users who are not logged in. Additionally, we detect requests potentially generated by bots through a regex match on the user-agent string and discard them, since automated requests are not relevant for our analysis. We grouped the page views by four variables to allow different stratified analyses: page_id, continent, country_code, access_method. This dataset can answer a question like: "How many times was article A loaded by non-logged-in users from mobile devices in the UK?"
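
The selection and grouping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the query used in the analysis: the record layout (`project`, `namespace_id`, `logged_in`, `user_agent`), the `BOT_UA` pattern, and the helper names are hypothetical; only the four grouping variables come from the text.

```python
import re
from collections import Counter

# Hypothetical bot pattern; the regex used in the actual analysis is not given here.
BOT_UA = re.compile(r"bot|crawler|spider", re.IGNORECASE)

# The four grouping variables named in the text.
GROUP_KEYS = ("page_id", "continent", "country_code", "access_method")


def is_relevant(req):
    """English Wikipedia, namespace 0, desktop or mobile web, logged out, not a bot."""
    return (
        req["project"] == "en.wikipedia"
        and req["namespace_id"] == 0
        and req["access_method"] in ("desktop", "mobile web")
        and not req["logged_in"]
        and not BOT_UA.search(req["user_agent"])
    )


def count_pageviews(requests):
    """Count relevant requests per (page_id, continent, country_code, access_method)."""
    counts = Counter()
    for req in requests:
        if is_relevant(req):
            counts[tuple(req[key] for key in GROUP_KEYS)] += 1
    return counts


def views(counts, page_id, country_code, access_method):
    """Sum the grouped counts for one page, country and access method."""
    return sum(
        n for (pid, _continent, cc, am), n in counts.items()
        if pid == page_id and cc == country_code and am == access_method
    )


def make_request(**overrides):
    """Build a made-up request record for the example below."""
    req = {
        "project": "en.wikipedia", "namespace_id": 0, "page_id": 42,
        "continent": "Europe", "country_code": "GB",
        "access_method": "mobile web", "logged_in": False,
        "user_agent": "Mozilla/5.0 (iPhone)",
    }
    req.update(overrides)
    return req


SAMPLE = [
    make_request(),
    make_request(),
    make_request(user_agent="Googlebot/2.1"),    # dropped: matches the bot pattern
    make_request(logged_in=True),                # dropped: logged-in user
    make_request(access_method="mobile app"),    # dropped: app traffic
    make_request(country_code="US", continent="North America"),
]

print(views(count_pageviews(SAMPLE), 42, "GB", "mobile web"))
```

The last line answers the example question for a made-up page 42: how many times it was loaded by non-logged-in users on mobile web from the UK.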

Dimensions of Analysis

We analyze the frequency of external clicks according to 4 dimensions.

  • Topic: we extracted the topic of the ~2 million pages where we recorded events, by using the draft topic prediction model from the Scoring Platform team.
  • Country: we infer the country where the event was generated from the geocoded_data field available on both the webrequest and our citationusage tables.
  • Domain: we segmented the clicks on external references according to the domain of the external link (e.g., "www.theguardian.com" or "www.imdb.com"). Below is a plot of the most popular domains in English Wikipedia references, that is, the domains that appear most often across all articles in English Wikipedia. The most cited domains are book and newspaper sites.
  • Number of References in Page: we parse all pages to get the number of references with an external link. Here is a plot of the distribution of pages over number of references: the majority of pages have 1 to 5 external links. Around 1M pages have 0 external links.
How many pages have 0 external links in their references? How many have 1-5? This plot shows the distribution of number of pages vs number of references
Number of references linking to external domains, breakdown by domain.
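
The Domain and Number of References dimensions above both reduce to simple per-link and per-page computations. The sketch below shows one way to do them with the Python standard library. It is an assumption-laden illustration: the function names are hypothetical, the URLs are made up, and the "6+" bucket is an arbitrary choice for everything beyond the 0 and 1-5 ranges mentioned in the text.

```python
from collections import Counter
from urllib.parse import urlsplit


def link_domain(url):
    """Return the lower-cased host of an external link, or None if it cannot be parsed."""
    try:
        return urlsplit(url).hostname   # hostname is already lower-cased
    except ValueError:                  # raised for some malformed URLs
        return None


def domain_counts(urls):
    """Count how many links point to each domain, skipping unparseable links."""
    domains = (link_domain(url) for url in urls)
    return Counter(domain for domain in domains if domain)


def reference_bucket(n_refs):
    """Bucket a page by its number of references with an external link."""
    if n_refs == 0:
        return "0"
    if n_refs <= 5:
        return "1-5"
    return "6+"                         # assumed catch-all bucket


# Made-up reference URLs for illustration.
urls = [
    "https://www.theguardian.com/world/example-article",
    "https://www.theguardian.com/sport/another-example",
    "http://www.imdb.com/title/example/",
    "not a url",
]
print(domain_counts(urls).most_common())
print([reference_bucket(n) for n in (0, 3, 12)])
```

Note that this keeps the full host, so "www.theguardian.com" and a hypothetical "amp.theguardian.com" would be counted separately; grouping by registered domain would need an extra step.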


Next, we compute the ratio of external clicks to page views along these 4 dimensions.

Results

Most Visited References

We looked at the most popular references among readers during our data collection period. We found that the most clicked external references are strongly influenced by the events happening during the week of data collection. Among the most clicked links we found, for example, news about movies released that week, and links to websites related to the football World Cup and other popular sports events taking place that week. To even out the influence of these localized events on these statistics, we might need to collect the second round of data over a longer period.

Breakdown by topic

We compute the total number of sessions with at least one external click as captured by our schema, aggregate this value by page topic, and divide it by the total number of web requests in each topic. We find that the topics where external references are clicked most often are Mathematics and Engineering. Note that, since we aggregate data at the session level (and not at the user level), some of these patterns might be biased by the presence of superusers (e.g., a reader interested in mathematics who clicks on external references in every session).

Ratio between sessions with one click on an external reference and all sessions on pages with at least one external link, aggregated by topic
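
The ratio described in this section can be sketched as follows. This is a minimal illustration and not the code used for the analysis: the event fields `action`, `session_id` and `page_id`, the `topic_of` mapping and the function name are assumptions, and a session that clicks on pages from several topics is counted once per topic.

```python
from collections import defaultdict


def click_through_by_topic(events, pageviews, topic_of):
    """Sessions with at least one extClick per topic, divided by page requests per topic.

    events:    iterable of dicts with (assumed) keys "action", "session_id", "page_id"
    pageviews: dict mapping page_id to its number of page requests
    topic_of:  dict mapping page_id to its predicted topic
    """
    # Distinct sessions with at least one external click, per topic.
    clicked_sessions = defaultdict(set)
    for event in events:
        if event["action"] != "extClick":
            continue
        topic = topic_of.get(event["page_id"])
        if topic is not None:
            clicked_sessions[topic].add(event["session_id"])

    # Total page requests, per topic.
    requests = defaultdict(int)
    for page_id, n in pageviews.items():
        topic = topic_of.get(page_id)
        if topic is not None:
            requests[topic] += n

    return {
        topic: len(clicked_sessions[topic]) / n
        for topic, n in requests.items()
        if n > 0
    }


# Made-up data for illustration.
topic_of = {1: "Mathematics", 2: "Mathematics", 3: "History"}
pageviews = {1: 10, 2: 10, 3: 20}
events = [
    {"action": "extClick", "session_id": "s1", "page_id": 1},
    {"action": "extClick", "session_id": "s1", "page_id": 1},  # same session, counted once
    {"action": "extClick", "session_id": "s2", "page_id": 2},
    {"action": "fnClick", "session_id": "s3", "page_id": 3},   # not an external click
]
print(click_through_by_topic(events, pageviews, topic_of))
```

The same function works for any other grouping, such as country, by passing a mapping from the grouping key of each event and page request instead of `topic_of`.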

Breakdown by country

We compute the total number of sessions with at least one external click as captured by our schema, aggregate this value by the country of origin of the event, and divide it by the total number of web requests in each country. We find that around 6% of the sessions coming from the US or the UK result in a click on an external reference. We also find that Iran and some Pacific islands are among the countries with the lowest click-through rates for external citations.


Ratio between sessions with one click on an external reference and all sessions on pages with at least one external link, aggregated by country
Ratio between sessions with one click on an external reference and all sessions on pages with at least one external link, aggregated by country (bottom 20)

Breakdown by domain

Finally, we look at the breakdown of the number of clicks per domain. Below is a plot of the domains in English Wikipedia references that readers click most often. Although Google Books is the most popular domain in English Wikipedia references, we find that the most clicked domain is the Internet Archive's Wayback Machine; Google Books is the second most visited domain, followed by a number of newspapers.

Total clicks on external links, breakdown by link domain