Research:Characterizing Wikipedia Citation Usage/First Round of Analysis
This page summarizes the findings from the first round of data collection. We analyzed how often readers click on references that link to an external source, by joining the data collected through our instrumentation with the pageviews recorded in the webrequest table in the Data Lake.
Reference click data collection
We collected 10 days of data, from Jun 28th to Jul 9th 2018. The schema for the data we collected is here: Schema:CitationUsage. The data comes from non-logged-in users only. More information on the data collection is available on the main project page.
The schema recorded around 3M events per day, for a total of 32 million events over the course of 10 days. We detected 4 types of events:
- `extClick` — click on external URLs;
- `upClick` — click that takes the user from the reference at the bottom back to the anchor (e.g., “[1]”) in the main text (e.g., a click on “^”);
- `fnClick` — click on a page-internal link (e.g., “[1]”) that takes the user to the reference section at the bottom;
- `fnHover` — event fired when the user hovers over a reference (e.g., “[1]”) for at least 1000 ms in main page articles.
|Date||Number of Events||upClick||extClick||fnClick||fnHover|
|29 June 2018||2912911||15827||1380607||605643||910834|
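As a quick consistency check on the row above, the four per-type counts should add up to the daily total, and the share of each event type follows directly from them. A minimal sketch in Python, using only the figures in the table (the variable names are illustrative):

```python
# Event counts for 29 June 2018, copied from the table row above.
events = {
    "upClick": 15827,
    "extClick": 1380607,
    "fnClick": 605643,
    "fnHover": 910834,
}
total = 2912911  # "Number of Events" column

# The four event types should account for every recorded event.
assert sum(events.values()) == total

# Share of each event type, as a percentage of the daily total.
shares = {name: 100 * count / total for name, count in events.items()}
for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {pct:.1f}%")
```

On this single day, external clicks make up a little under half of the recorded events, while `upClick` events are well under 1%.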
Reference text collection
We also parse the XML dumps to collect information about the text and templates used to reference external sources in English Wikipedia. Below is a plot of the most popular citation templates in English Wikipedia. Each bar represents how many times a given template appears in the references of all English Wikipedia articles.
Page requests data collection
We counted the pageviews relevant to our analysis using the table wmf.webrequest. We limited the selection to requests for the English version of Wikipedia, in namespace 0, that were generated from desktop or mobile web (not the app) by users who were not logged in. Additionally, we detect requests potentially generated by bots through a regex match on the user-agent string; since automated requests are not relevant to our analysis, we discard them. We grouped the pageviews by four variables to allow different stratified analyses: page_id, continent, country_code, access_method. This dataset can answer a question like: "How many times was article A loaded by non-logged-in users from mobile devices in the UK?"
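The selection and grouping described above can be sketched in a few lines of Python. This is only an illustration over in-memory records, not the query that was run against wmf.webrequest: the record fields (`project`, `namespace_id`, `logged_in`, `user_agent`), the access-method labels, and the bot pattern are all assumptions, since the text does not give the actual column values or regex.

```python
import re
from collections import Counter

# Illustrative bot pattern; the regex used in the analysis is not given in the text.
BOT_UA = re.compile(r"bot|crawler|spider", re.IGNORECASE)

def is_relevant(req):
    """Apply the selection criteria described above to one request record."""
    return (
        req["project"] == "en.wikipedia"
        and req["namespace_id"] == 0
        and req["access_method"] in {"desktop", "mobile web"}  # no app traffic
        and not req["logged_in"]
        and not BOT_UA.search(req["user_agent"])
    )

def count_pageviews(requests):
    """Count pageviews per (page_id, continent, country_code, access_method)."""
    counts = Counter()
    for req in requests:
        if is_relevant(req):
            key = (req["page_id"], req["continent"],
                   req["country_code"], req["access_method"])
            counts[key] += 1
    return counts

def make_request(**overrides):
    """Build a toy request record (invented values, for illustration only)."""
    base = {
        "project": "en.wikipedia", "namespace_id": 0,
        "access_method": "mobile web", "logged_in": False,
        "user_agent": "Mozilla/5.0 (iPhone)", "page_id": 42,
        "continent": "Europe", "country_code": "GB",
    }
    base.update(overrides)
    return base

sample = [
    make_request(),
    make_request(),
    make_request(access_method="desktop"),
    make_request(user_agent="ExampleBot/1.0"),  # dropped: matches bot pattern
    make_request(logged_in=True),               # dropped: logged-in user
]
counts = count_pageviews(sample)

# "How many times was page 42 loaded by non-logged-in users on mobile in the UK?"
uk_mobile = counts[(42, "Europe", "GB", "mobile web")]
print(uk_mobile)
```

Keeping all four variables in the key means any coarser breakdown (for example, by country alone) can be obtained later by summing over the other components.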
Dimensions of Analysis
We analyze the frequency of external clicks according to 4 dimensions.
- Topic: we extracted the topic of the ~2 million pages where we recorded events, using the draft topic prediction model from the Scoring Platform team.
- Country: we infer the country where the event was generated from the geocoded_data field available on both the webrequest and our citationusage tables.
- Domain: we segmented the clicks on external references according to the domain of the external link (e.g. "www.theguardian.com" or "www.imdb.com"). Below is a plot of the most popular domains in English Wikipedia references, that is, the domains that appear most often across all articles in English Wikipedia. The top cited domains are book and newspaper sites.
- Number of References in Page: we parse all pages to get the number of references with an external link. Here is a plot of the distribution of pages over number of references: the majority of pages have 1 to 5 external links. Around 1M pages have 0 external links.
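For the Domain dimension, segmenting links by domain amounts to extracting the host from each external URL and counting occurrences. A minimal sketch using Python's standard library follows; the URLs are invented for illustration, and the analysis may have normalized domains differently (for example, by stripping "www.").

```python
from collections import Counter
from urllib.parse import urlsplit

def domain_of(url):
    """Return the lowercased host of a URL, or None if it has no host."""
    return urlsplit(url).hostname

# Toy reference URLs, invented purely for illustration.
reference_urls = [
    "https://www.theguardian.com/world/example-article",
    "https://www.imdb.com/title/example/",
    "https://WWW.TheGuardian.com/film/another-example",
    "not a url",
]

# Count how often each domain appears, skipping entries without a host.
domain_counts = Counter(
    d for d in (domain_of(u) for u in reference_urls) if d is not None
)
for domain, n in domain_counts.most_common():
    print(domain, n)
```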
Next, we compute the ratio between pageviews and external clicks on these 4 dimensions.
Most Visited References
We looked at the most popular references among readers during our data collection period. We found that the most clicked external references are strongly influenced by events happening during the data collection period. Among the most clicked links we found, for example, news about movies released during that period, and links to websites related to the football World Cup and other popular sports events taking place at the time. To even out the influence of these localized events on the statistics, we might need to collect the second round of data over a longer period.
Breakdown by topic
We compute the total number of sessions with at least one external click as captured by our schema, aggregate this value by page topic, and divide it by the total number of webrequests in each topic. We find that the topics where external references tend to be clicked most are Mathematics and Engineering. Note that, since we aggregate data at the session level (and not at the per-user level), some of these patterns might be biased by the presence of super users (e.g. a reader interested in mathematics who clicks on external references in every session).
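The ratio described above can be sketched as follows, assuming the click events have already been reduced to (session, topic) pairs and the pageviews to per-topic totals. The function name, data layout, and numbers are hypothetical; only the computation (distinct sessions with at least one external click, divided by webrequests) follows the text.

```python
from collections import defaultdict

def click_rate_by_group(click_events, pageviews):
    """Distinct sessions with >= 1 external click, divided by pageviews, per group.

    click_events: iterable of (session_id, group) pairs, one per extClick event.
    pageviews:    dict mapping group -> number of webrequests.
    """
    sessions = defaultdict(set)
    for session_id, group in click_events:
        sessions[group].add(session_id)  # a session counts once per group
    return {
        group: len(sessions[group]) / views
        for group, views in pageviews.items()
        if views > 0
    }

# Toy data, invented for illustration.
click_events = [
    ("s1", "Mathematics"),
    ("s1", "Mathematics"),  # second click in the same session: not double-counted
    ("s2", "Mathematics"),
    ("s3", "History"),
]
pageviews = {"Mathematics": 20, "History": 100, "Geography": 50}

rates = click_rate_by_group(click_events, pageviews)
print(rates)
```

Because the grouping key is arbitrary, the same function applies unchanged when the group is a country code rather than a topic. Counting distinct sessions rather than users is also what makes the result sensitive to the super-user effect noted above.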
Breakdown by country
We compute the total number of sessions with at least one external click as captured by our schema, aggregate this value by the country of origin of the event, and divide it by the total number of webrequests in each country. We find that around 6% of the sessions coming from the US or the UK convert into a click on an external reference. We also find that Iran and some Pacific islands are among the countries with lower click-through rates for external citations.
Breakdown by domain
Finally, we look at the breakdown of the number of clicks per domain. Below is a plot of the domains in English Wikipedia whose links readers click most often. Although Google Books is the most popular domain in English Wikipedia references, we find that the top-clicked domain is the Internet Archive's Wayback Machine, while Google Books is the second most visited domain, followed by a number of newspapers.