Research:Characterizing Wikipedia Citation Usage/First Round of Analysis

This page summarizes the findings from the first round of data collection. We analyzed how often readers click on references that link to an external source, by joining the data collected through our instrumentation with the page views recorded in the webrequest table in the Data Lake.

Data

Reference click data collection

We collected 10 days of data, from Jun 28th to Jul 9th 2018. The schema for the collected data is Schema:CitationUsage. The data comes from non-logged-in users only. More information on the data collection is available on the main project page.

The instrumentation recorded around 3M events per day, for a total of 32 million events over the course of the 10 days. We recorded 4 types of events:

`extClick`: click on an external URL;
`upClick`: click (e.g., on “^”) that takes the user from a reference at the bottom of the page back to its anchor (e.g., “[1]”) in the main text;
`fnClick`: click on a page-internal link (e.g., “[1]”) that takes the user to the reference section at the bottom of the page;
`fnHover`: hover of at least 1000 ms over a reference anchor (e.g., “[1]”) in the main text of an article.

Date	Number of Events	upClick	extClick	fnClick	fnHover
29 June 2018	2912911	15827	1380607	605643	910834
30 June	2509565	13259	1175389	601582	719335
01 July	2912911	14336	1292807	670076	781537
02 July	3218551	17072	1506356	679826	1015297
03 July	3160465	15639	1478172	658227	1008406
04 July	3015490	19191	1396499	660644	939093
05 July	3170142	21594	1473209	663434	1011812
06 July	2980773	15261	1380396	643607	941431
07 July	2603657	13514	1216515	641242	732296
08 July	3324488	15003	1341844	692435	810974
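As a quick sanity check on the table above, the sketch below compares each day's reported number of events with the sum of its four event-type columns, and adds up the reported totals. The values are copied from the table; the function names are illustrative and not part of the original analysis.

```python
# Daily counts copied from the table above:
# (date, reported number of events, upClick, extClick, fnClick, fnHover)
ROWS = [
    ("29 June", 2912911, 15827, 1380607, 605643, 910834),
    ("30 June", 2509565, 13259, 1175389, 601582, 719335),
    ("01 July", 2912911, 14336, 1292807, 670076, 781537),
    ("02 July", 3218551, 17072, 1506356, 679826, 1015297),
    ("03 July", 3160465, 15639, 1478172, 658227, 1008406),
    ("04 July", 3015490, 19191, 1396499, 660644, 939093),
    ("05 July", 3170142, 21594, 1473209, 663434, 1011812),
    ("06 July", 2980773, 15261, 1380396, 643607, 941431),
    ("07 July", 2603657, 13514, 1216515, 641242, 732296),
    ("08 July", 3324488, 15003, 1341844, 692435, 810974),
]


def residuals(rows):
    """Map each date to (reported total - sum of the four event-type columns)."""
    return {date: total - sum(counts) for date, total, *counts in rows}


def grand_total(rows):
    """Sum of the reported daily totals."""
    return sum(total for _, total, *_ in rows)


for date, diff in residuals(ROWS).items():
    print(f"{date}: reported total exceeds column sum by {diff}")
print("sum of reported daily totals:", grand_total(ROWS))
```

Days with a large difference may be worth rechecking against the source data before the totals are used in further analysis.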

Reference text collection

We also parsed the XML dumps to collect information about the text and templates used to reference external sources in English Wikipedia. The plot below shows the most popular citation templates in English Wikipedia: each bar represents how many times a given template appears in the references of all English Wikipedia articles.

Page requests data collection

We counted the page views relevant to our analysis using the wmf.webrequest table. We restricted the selection to requests for English Wikipedia pages in namespace 0 that came from desktop or mobile web (not the app) and from users who were not logged in. We also detected requests potentially generated by bots by matching a regex against the user-agent string and discarded them, since automated requests are not relevant to our analysis. We grouped the page views by four variables to allow different stratified analyses: page_id, continent, country_code, and access_method. This dataset can answer questions such as: "How many times was article A loaded by non-logged-in users on mobile devices in the UK?"
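The filtering and grouping described above can be sketched in plain Python. This is only an illustration under stated assumptions: the record field names (`project`, `namespace_id`, `logged_in`, `user_agent`), the access-method labels, and the bot regex are hypothetical stand-ins, since the actual query and pattern are not given on this page.

```python
import re
from collections import Counter

# Hypothetical bot pattern; the regex used in the real analysis is not specified here.
BOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def is_relevant(req):
    """Keep English Wikipedia article views from anonymous, non-app, non-bot clients."""
    return (
        req["project"] == "en.wikipedia"
        and req["namespace_id"] == 0
        and req["access_method"] in ("desktop", "mobile web")
        and not req["logged_in"]
        and not BOT_PATTERN.search(req["user_agent"])
    )


def count_pageviews(requests):
    """Count relevant requests per (page_id, continent, country_code, access_method)."""
    counts = Counter()
    for req in requests:
        if is_relevant(req):
            key = (req["page_id"], req["continent"],
                   req["country_code"], req["access_method"])
            counts[key] += 1
    return counts


# Toy records (made up for illustration only).
BASE = {"project": "en.wikipedia", "namespace_id": 0, "access_method": "mobile web",
        "logged_in": False, "user_agent": "Mozilla/5.0", "page_id": 42,
        "continent": "Europe", "country_code": "GB"}
SAMPLE = [
    dict(BASE),
    dict(BASE),
    dict(BASE, user_agent="ExampleBot/1.0"),   # dropped: matches the bot pattern
    dict(BASE, logged_in=True),                # dropped: logged-in user
    dict(BASE, access_method="mobile app"),    # dropped: app traffic
]

counts = count_pageviews(SAMPLE)
print(counts[(42, "Europe", "GB", "mobile web")])
```

With counts keyed this way, the example question (article A, mobile, UK) becomes a single lookup, and coarser breakdowns (for instance per country) are sums over the matching keys.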

Dimensions of Analysis

We analyze the frequency of external clicks according to 4 dimensions.

Topic: we extracted the topic of the ~2 million pages where we recorded events, using the draft topic prediction model from the Scoring Platform team.
Country: we infer the country where the event was generated from the geocoded_data field available in both the webrequest and our citationusage tables.
Domain: we segmented the clicks on external references according to the domain of the external link (e.g., "www.theguardian.com" or "www.imdb.com"). Below is a plot of the most popular domains in English Wikipedia references, that is, the domains that appear most often across all articles in English Wikipedia. The top cited domains are books and newspapers.
Number of References in Page: we parsed all pages to get the number of references with an external link. Here is a plot of the distribution of pages over the number of references: the majority of pages have 1 to 5 external links. Around 1M pages have 0 external links.
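For the Domain dimension, one possible way to map a reference URL to its domain and to count how often each domain is cited is sketched below with Python's standard library. The URLs are made-up examples, and this is not necessarily how the domains were extracted in the original analysis.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url):
    """Return the host name of a URL (lower-cased by urlsplit), or None if absent."""
    return urlsplit(url).hostname


def count_domains(urls):
    """Count how many URLs point to each domain, skipping URLs without a host."""
    domains = (domain_of(u) for u in urls)
    return Counter(d for d in domains if d)


# Made-up reference URLs, for illustration only.
urls = [
    "https://www.theguardian.com/world/example-article",
    "https://www.imdb.com/title/tt0000001/",
    "https://WWW.IMDB.com/name/nm0000001/",
    "not a url",
]
print(count_domains(urls).most_common())
```

Note that this keeps the full host name (including "www."), matching the examples given above; collapsing subdomains would be a separate design choice.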

Next, we compute the ratio between page views and external clicks on these 4 dimensions.

Results

Most Visited References

We looked at the most popular references among readers during our data collection period. We found that the most clicked external references are strongly influenced by the events happening during the week of data collection. Among the most clicked links we found, for example, news about movies released during that week, and links to websites related to the football World Cup and other popular sports events taking place that week. To even out the influence of such short-lived events on these statistics, we might need to collect the second round of data over a longer period.

Breakdown by topic

We computed the total number of sessions with at least one external click as captured by our schema, aggregated this value by page topic, and divided it by the total number of webrequests in each topic. We found that the topics whose external references are clicked most often are Mathematics and Engineering. Note that, since we aggregate data at the session level (and not per user), some of these patterns might be biased by the presence of superusers (e.g., a reader interested in mathematics who clicks on external references in every session).

Ratio between sessions with at least one click on an external reference and all sessions on pages with at least one external link, aggregated by topic
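The per-topic ratio can be illustrated with a minimal sketch: count distinct sessions with at least one extClick per topic, then divide by the number of requests for that topic. The input shapes (pairs of session id and topic, and a dict of request counts) and all values are assumptions made for this example, not the actual table layout.

```python
from collections import defaultdict


def click_rate_by_topic(click_events, requests_by_topic):
    """
    click_events: iterable of (session_id, topic) pairs, one per extClick event.
    requests_by_topic: dict mapping topic -> number of webrequests for that topic.
    Returns a dict mapping topic -> sessions with >= 1 external click / requests.
    """
    sessions = defaultdict(set)
    for session_id, topic in click_events:
        # A set ensures a session with several clicks is counted only once.
        sessions[topic].add(session_id)
    return {
        topic: len(sessions.get(topic, ())) / n
        for topic, n in requests_by_topic.items()
        if n > 0  # skip topics without requests to avoid division by zero
    }


# Toy data, for illustration only.
clicks = [("s1", "Mathematics"), ("s1", "Mathematics"),
          ("s2", "Mathematics"), ("s3", "Sports")]
requests = {"Mathematics": 40, "Sports": 100, "History": 50}

for topic, rate in click_rate_by_topic(clicks, requests).items():
    print(f"{topic}: {rate:.2%}")
```

The same function applies to the country breakdown if the second element of each pair is a country code instead of a topic. Because sessions rather than users are counted, a single very active reader still contributes one session per visit, which is the superuser bias mentioned above.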

Breakdown by country

We computed the total number of sessions with at least one external click as captured by our schema, aggregated this value by the country of origin of the event, and divided it by the total number of webrequests in each country. We found that around 6% of the sessions coming from the US or the UK include a click on an external reference. We also found that Iran and some Pacific islands are among the countries with lower click-through rates for external citations.

Breakdown by domain

Finally, we looked at the breakdown of the number of clicks per domain. The plot below shows the domains in English Wikipedia references that readers click most often. Although Google Books is the most popular domain in English Wikipedia references, we found that the most clicked domain is the Internet Archive's Wayback Machine; Google Books is the second most clicked domain, followed by a number of newspapers.