Research:Characterizing Wikipedia Citation Usage/Second Round of Analysis


This page summarises the findings of the analysis of the second round of data collection, conducted in 2018. We first analysed characteristics of the pages users interacted with during data collection, then computed statistics about frequency of clicks on references linking to an external source.

Data Collection

We collected data on users' pageviews and interaction with citations using the Event Logging platform. We also collected more info on pages and references using dumps and tools like ORES.

Percentage of events of each type recorded by the CitationUsage schema during the second round of data collection

We define a Reference as any link cited in the body of the article. External links that appear without context in the article text (e.g. in this article) are not included. Here are some examples of pages that we would consider as "without references":

Reference Click Data Collection

We collected 30 days of data, from September 24th, 2018 to October 25th, 2018. The schemas we used to collect the data are Schema:CitationUsage and Schema:CitationUsagePageLoad. With the CitationUsage schema, we collected all users' interactions with references and footnotes, amounting to around 50 events per second. With the CitationUsagePageLoad schema, we collected readers' pageviews, sampled at 33.3%, resulting in around 700 events per second. All data comes from non-logged-in users only. More information on the data collection is available on the main project page.

We joined the CitationUsage and CitationUsagePageLoad tables on the session_token field, our proxy for identifying users. This gives a full overview of users' reading sessions: some users' pageviews convert into external clicks, others don't. The PageLoad schema detected around 50M events per day, for a total of 1.5 billion events; users recorded in the PageLoad schema generated a total of 30M CitationUsage events during the month of data collection, around 1M a day. As in the first round, the CitationUsage schema detected 4 types of events:

  • `extClick`: a click on an external URL;
  • `upClick`: a click that takes the user from the reference at the bottom back to the anchor (e.g., “[1]”) in the main text (e.g., a click on “^”);
  • `fnClick`: a click on a page-internal link (e.g., “[1]”) that takes the user to the reference section at the bottom;
  • `fnHover`: an event recorded when the user hovers (for at least 1000 ms) over a reference (e.g., “[1]”) in main page articles.
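
The join on session_token described above can be sketched as follows. This is a minimal illustration in plain Python, not the pipeline that was actually used. Only the `session_token` field and the event names come from the schemas; the record layout (dicts with `page_id` and `action` keys) and the sample data are assumptions made for the example.

```python
from collections import defaultdict

def join_sessions(page_loads, citation_events):
    """Group page loads and citation events by session_token (a left join on page loads)."""
    sessions = defaultdict(lambda: {"pageviews": [], "events": []})
    for load in page_loads:
        sessions[load["session_token"]]["pageviews"].append(load["page_id"])
    for event in citation_events:
        token = event["session_token"]
        # Keep only events from sessions that also appear in the sampled page loads.
        if token in sessions:
            sessions[token]["events"].append(event["action"])
    return dict(sessions)

# Hypothetical records, for illustration only.
page_loads = [
    {"session_token": "a", "page_id": 1},
    {"session_token": "a", "page_id": 2},
    {"session_token": "b", "page_id": 1},
]
citation_events = [
    {"session_token": "a", "page_id": 2, "action": "extClick"},
    {"session_token": "c", "page_id": 3, "action": "fnHover"},  # session not in the sample
]

sessions = join_sessions(page_loads, citation_events)
# Sessions whose pageviews converted into at least one external click.
converted = [token for token, s in sessions.items() if "extClick" in s["events"]]
```

Sessions with an empty `events` list are the ones whose pageviews did not convert into any citation interaction.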

The figure shows the distribution of events by type as recorded in the second round of data collection. The majority of the events are clicks or hovers on an external link.
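
A distribution of events by type like the one in the figure could be computed along these lines. The function name and the sample list of actions are hypothetical; the real percentages come from the collected data and are not reproduced here.

```python
from collections import Counter

def event_type_shares(actions):
    """Return the percentage of events of each type."""
    counts = Counter(actions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {action: 100.0 * n / total for action, n in counts.items()}

# Made-up sample, for illustration only.
shares = event_type_shares(["fnHover", "fnHover", "extClick", "fnClick"])
```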


Reference per Page Data Collection

As in the first round, we parsed the HTML versions of the Wikipedia pages to get the exact number of references in each page. We analysed 5.4 million pages of English Wikipedia. We find that 24.5% of the pages have 0 references, and only 0.6% have more than 100 references. Below is the plot of the distribution of English Wikipedia pages by number of references, for pages with one or more references. Each bar represents how many pages have a given number of references. We can see that the majority of the pages have fewer than 10 citations.

Each bar represents the number of pages having 1, 2, ..., n citations
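
Counting references in a page's HTML might look like the sketch below. It is not the parser used for this analysis. It assumes that each reference is rendered as an `<li>` element whose `id` starts with `cite_note` and that external links carry the CSS class `external`; the class name `ReferenceCounter` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

class ReferenceCounter(HTMLParser):
    """Count reference list items and those containing an external link."""

    def __init__(self):
        super().__init__()
        self.in_reference = False
        self.current_has_external = False
        self.references = 0
        self.references_with_external_link = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and (attrs.get("id") or "").startswith("cite_note"):
            # Assumption: one <li id="cite_note-..."> per reference.
            self.in_reference = True
            self.current_has_external = False
            self.references += 1
        elif tag == "a" and self.in_reference:
            if "external" in (attrs.get("class") or "").split():
                self.current_has_external = True

    def handle_endtag(self, tag):
        # Simplification: nested lists inside a reference are not handled.
        if tag == "li" and self.in_reference:
            if self.current_has_external:
                self.references_with_external_link += 1
            self.in_reference = False

# Made-up page fragment: two references, plus an external link outside any reference.
html = (
    '<ol class="references">'
    '<li id="cite_note-1"><a class="external text" href="https://example.org/a">A</a></li>'
    '<li id="cite_note-2">Smith 2010, p. 4.</li>'
    '</ol>'
    '<ul><li><a class="external text" href="https://example.org/b">External link</a></li></ul>'
)
counter = ReferenceCounter()
counter.feed(html)
```

In this fragment the link in the final `<ul>` is not counted, which matches the definition above: external links without a citing context are not references.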


Topic per Page Data Collection

For each page that was captured by our joint schema, we extracted Topic information, by using the draft topic prediction model from the Scoring Platform team. The plot below shows the top-10 most popular topics in our dataset. We find that most pages are about Geography, or Literature (including Biographies). Note that each article can be assigned to multiple topics.

Each bar represents the number of pages belonging to each topic

Dimensions of Analysis

We analyse the frequency of external clicks according to five dimensions.

  • Topic: we extracted the topic of the ~3.5 million pages where we recorded events, as above.
  • Domain: we segmented the clicks on external references according to the domain of the external link (e.g. "www.theguardian.com" or "www.imdb.com").
  • Number of References in Page: we parse all pages to get the number of references with an external link.
  • Page Popularity: we count how many pageviews each page gets during the month of data collection.
  • Citation Template: we look at which citation templates readers tend to click on more often.

Next, we compute the ratio of external clicks to pageviews along these dimensions.

Results

We first looked at the general citation clickthrough rate statistics. We count how many times a user visiting a page clicks at least once on an external link. We find that only 0.9% of pageviews convert into a citation click.
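
The overall clickthrough rate can be illustrated with the sketch below. It assumes that a pageview is identified by a (session_token, page_id) pair, so repeated views of the same page within a session are counted once; this is a simplification for the example, and the function name and sample data are hypothetical.

```python
def citation_clickthrough_rate(pageviews, ext_clicks):
    """Fraction of pageviews with at least one extClick.

    Both arguments are iterables of (session_token, page_id) pairs.
    """
    viewed = set(pageviews)
    # Several clicks during the same pageview count once; clicks without
    # a matching sampled pageview are ignored.
    clicked = set(ext_clicks) & viewed
    return len(clicked) / len(viewed) if viewed else 0.0

# Made-up sample, for illustration only.
pageviews = [("a", 1), ("a", 2), ("b", 1), ("c", 3)]
ext_clicks = [("a", 2), ("a", 2), ("d", 9)]
rate = citation_clickthrough_rate(pageviews, ext_clicks)
```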

Breakdown by topic

We compute the total number of sessions with at least one external click as captured by our schema, aggregate this value by page topic, and divide it by the total number of pageviews for pages in each topic. We find that, for technical topics such as Information Science, Technology, Business and Physics, the citation clickthrough rate is higher than average.

Ratio between sessions with one click on an external reference and all sessions on pages with at least one external link, aggregated by topic
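
The per-topic aggregation can be sketched as follows. Because an article can be assigned to multiple topics, its pageviews and clicks are added to every topic it belongs to; whether the original analysis handled multi-topic pages in exactly this way is an assumption. The input layout and the numbers are made up for illustration.

```python
from collections import defaultdict

def clickthrough_by_topic(page_stats, page_topics):
    """Aggregate clickthrough rate by topic.

    page_stats:  {page_id: (pageviews, sessions_with_ext_click)}
    page_topics: {page_id: [topic, ...]}
    """
    views = defaultdict(int)
    clicks = defaultdict(int)
    for page_id, (n_views, n_clicks) in page_stats.items():
        # A page contributes to each of its topics.
        for topic in page_topics.get(page_id, []):
            views[topic] += n_views
            clicks[topic] += n_clicks
    return {topic: clicks[topic] / views[topic] for topic in views if views[topic] > 0}

# Hypothetical pages and topic labels.
page_stats = {1: (1000, 5), 2: (500, 10), 3: (200, 0)}
page_topics = {1: ["Geography"], 2: ["Technology", "Business"], 3: ["Geography"]}
topic_ctr = clickthrough_by_topic(page_stats, page_topics)
```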

Breakdown by domain: what are the top clicked domains?

We then look at the breakdown of the number of citation clicks per domain. Below is a plot of the domains in English Wikipedia that receive reader clicks most often. Results confirm the insights from the first round of analysis. Although Google Books is the most popular domain in English Wikipedia references, we find that the top-clicked domain is the Internet Archive's Wayback Machine, while Google Books is the second most visited domain, followed by scholarly publications and a number of newspapers.

Total clicks on external links, broken down by link domain, after extracting the original domain from the Archive links.

To better understand the end point of the external links, we extracted the domains embedded in the web.archive.org links. We find similar patterns. Readers mostly click on Google Books external links, followed by scholarly publications and mainly liberal newspapers.

Total clicks on external links, broken down by link domain, after extracting the original domain from the Archive links.
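
Extracting the domain embedded in a web.archive.org link can be sketched with the standard library as below. The sketch assumes archive links of the form `https://web.archive.org/web/<timestamp>/<original URL>` and only handles these simple cases; the function name `link_domain` is hypothetical, and the extraction actually used for the plots may differ.

```python
import re
from urllib.parse import urlparse

# Path of an archive link: /web/<digits><optional flags>/<original URL>
WAYBACK_PATH = re.compile(r"^/web/\d+[a-z_]*/(.+)$")

def link_domain(url):
    """Return the domain of a link, unwrapping Wayback Machine links."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "web.archive.org":
        match = WAYBACK_PATH.match(parsed.path)
        if match:
            original = match.group(1)
            # Some archived URLs omit the scheme; add one so urlparse finds the host.
            if "://" not in original:
                original = "http://" + original
            return urlparse(original).hostname or host
    return host

archived = link_domain(
    "https://web.archive.org/web/20180924120000/http://www.theguardian.com/world/story"
)
direct = link_domain("https://www.imdb.com/title/tt0111161/")
```

Links that point at the archive without an embedded URL fall back to `web.archive.org` itself.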

Breakdown by citation template: what are the top clicked templates?

We study the external clickthrough rate (number of clicks divided by number of link impressions) as a function of the template used for rendering the citation. This gives us an idea of the types of citations that tend to attract more clicks. We see that the templates attracting the most clicks relate to academic papers or tech reports. Below is the plot of the top 20 templates by citation clickthrough rate.

Breakdown by popularity: do readers click more on citations from pages with more pageviews?

We analyse the distribution of citation clickthrough by pageviews: we want to understand the role of page popularity in citation usage. To do so, we plot, for each article, its pageviews vs its clickthrough rate, i.e. the likelihood that a user will click on at least one citation when visiting the page. We find an inverse relation between these two dimensions: readers of pages with more views tend to click less on citations.

Citation Clickthrough Rate by Page Popularity 2nd Round

To further corroborate this finding and reduce the effect of outliers, we bin the pages into popularity buckets of 10 pageviews each. The trend is the same, with pages having 30 pageviews or fewer achieving a higher-than-average citation clickthrough rate.

Citation Clickthrough Aggregated by Page Popularity 2nd Round
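
The bucketing step can be sketched as follows. The sketch assumes that the rate of a bucket is the total number of clicking sessions divided by the total number of pageviews of the pages in that bucket; averaging per-page rates instead would be another reasonable reading. The function name and the sample pages are hypothetical.

```python
from collections import defaultdict

def clickthrough_by_popularity_bucket(page_stats, bucket_size=10):
    """Aggregate clickthrough rate over buckets of pages with similar pageviews.

    page_stats: iterable of (pageviews, sessions_with_ext_click), one pair per page.
    Returns {bucket_lower_bound: clickthrough_rate}, ordered by bucket.
    """
    views = defaultdict(int)
    clicks = defaultdict(int)
    for n_views, n_clicks in page_stats:
        if n_views <= 0:
            continue
        # Pages with 1-10 pageviews go to bucket 1, 11-20 to bucket 11, and so on.
        bucket = ((n_views - 1) // bucket_size) * bucket_size + 1
        views[bucket] += n_views
        clicks[bucket] += n_clicks
    return {bucket: clicks[bucket] / views[bucket] for bucket in sorted(views)}

# Made-up pages, for illustration only.
buckets = clickthrough_by_popularity_bucket([(5, 1), (10, 0), (15, 1), (25, 0), (1000, 2)])
```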

Future Directions: Analysing Hovers

In recent experiments, we considered using "hovers" (fnHover) as a measure of readers' interest in citations. Hovering triggers the rendering of the reference tooltip, which contains a summary of the source characteristics. The hovering activity might express a genuine interest in knowing more about the source used to support the content of a sentence. So far, we have found that readers tend to hover more often at the top of the page, i.e. hover events are more frequent for inline citations that appear at the beginning of the article (see plot below mapping the density of hovering events over citation position in the page and page length).