Research:Wikimedia referrer policy

From Meta, a Wikimedia project coordination wiki
This page in a nutshell: Since we switched all Wikimedia traffic to HTTPS, our sites stopped advertising themselves as sources of referred traffic to (most) external sites. While this is a literal implication of HTTPS, it means that Wikimedia's impact on traffic directed to other sites is becoming largely invisible: is Wikimedia turning into a large source of dark traffic? I review a use case (traffic directed to CrossRef) and discuss how other top web properties deal with this issue by adopting a so-called "Referrer Policy".
Duration:  2015-01 — ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

In February 2016, following earlier HTTPS work, the Wikimedia Foundation set a referrer policy of origin-when-cross-origin. Research on the effects of this change is pending.

Inbound vs outbound traffic

Over the last few months, the Wikimedia Foundation's main focus has been on understanding overall trends in inbound traffic and readership. While these trends are critical for the sustainability of the project, there is another aspect that is seldom discussed or analyzed: outbound traffic originating from Wikimedia sites. Besides being a top 10 web property by popularity, Wikimedia sites, and Wikipedia in particular, are also one of the Web's top sources of authority[1] and arguably one of the largest referrers of internet traffic.

Case in point: CrossRef, one of the official registration agencies for Digital Object Identifiers (think of it as the ICANN of science) and the maintainer of the most popular DOI lookup service, recently published statistics[2] that rank Wikipedia as the 8th global source of traffic for scientific publications[3][4][5], or 3rd or 4th for non-traditional scholarship[6]. The following table shows Wikipedia's ranking among the top 10 DOI lookup referrers, based on a snapshot of data generated from CrossRef for 2012-2014:

Top sources of DOI resolutions

Rank  Domain                 Requests
-     none                   173,196,400
1.    nih.gov                15,425,110
2.    webofknowledge.com     3,202,582
3.    crossref.org           2,809,280
4.    doi.org                2,536,993
5.    serialssolutions.com   2,469,642
6.    sciencedirect.com      1,743,396
7.    scopus.com             1,623,347
8.    wikipedia.org          1,547,341
9.    exlibrisgroup.com      1,158,559
10.   google.com             992,665

Source: Crossref

A closer look at the data indicates that Wikipedia is actually the 6th referrer by traffic when internal referrers such as doi.org or crossref.org are excluded. When domains that belong to publishers (i.e. 'traditional' web citation) are also removed, it jumps to 3rd or 4th place in the monthly rankings of 'non-traditional' web citation referrers over 2012-2014. However, the main takeaway from this data is the abnormally large number of requests with no referrer (173M), which is two orders of magnitude larger than most of the top referrers. Given the nature of the DOI resolver, it is very unlikely that this represents direct traffic, so it is probably safe to assume that a very large fraction of "direct" requests come from other sources where the referrer is not available, for whatever reason.
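The arithmetic behind these two observations can be checked with a short Python sketch. The request counts are copied from the table above; the set of "internal" domains (doi.org and crossref.org) and the function names are assumptions made for illustration only.

```python
import math

# Request counts copied from the "Top sources of DOI resolutions" table.
REQUESTS = {
    "nih.gov": 15_425_110,
    "webofknowledge.com": 3_202_582,
    "crossref.org": 2_809_280,
    "doi.org": 2_536_993,
    "serialssolutions.com": 2_469_642,
    "sciencedirect.com": 1_743_396,
    "scopus.com": 1_623_347,
    "wikipedia.org": 1_547_341,
    "exlibrisgroup.com": 1_158_559,
    "google.com": 992_665,
}
NO_REFERRER = 173_196_400

# Assumption: only these two domains are treated as internal referrers.
INTERNAL = {"crossref.org", "doi.org"}


def rank_excluding(domain, excluded=frozenset()):
    """Rank of `domain` by request count once `excluded` domains are dropped."""
    ranked = sorted(
        (d for d in REQUESTS if d not in excluded),
        key=REQUESTS.get,
        reverse=True,
    )
    return ranked.index(domain) + 1


def orders_of_magnitude(domain):
    """log10 of the ratio between no-referrer requests and `domain` requests."""
    return math.log10(NO_REFERRER / REQUESTS[domain])


print(rank_excluding("wikipedia.org"))            # rank in the full table
print(rank_excluding("wikipedia.org", INTERNAL))  # rank without internal referrers
print(round(orders_of_magnitude("wikipedia.org"), 2))
```

With the table's figures, Wikipedia moves from 8th to 6th once the two internal domains are dropped, and the no-referrer bucket is roughly 112 times larger than Wikipedia's count, i.e. about two orders of magnitude.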

Wikimedia as a source of dark traffic

Wikimedia switched to SSL as the default for logged-in users in 2013[7] and started serving pages over SSL to a potentially large proportion of referred users[8] (with the exception of visitors from a few countries where HTTPS access is blocked). In June 2015, all traffic was moved to HTTPS.[9] Since Wikimedia pages served over HTTPS do not advertise themselves as a referrer to non-secure sites, a large fraction of outbound traffic originating from Wikimedia sites is likely to be counted as direct or uncategorized traffic as a result of these changes. Alexis Madrigal recently coined[10] the term "dark traffic" for "social traffic that is essentially invisible to most analytics programs" and "shows up variously in programs as direct or typed/bookmarked traffic, which implies to many site owners that you actually have a bookmark or typed in www.theatlantic.com into your browser". He notes that the most frequent sources of such traffic are links followed from email programs, instant messages and mobile apps, and navigation from a secure to a non-secure site.
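To make the receiving side concrete, here is a minimal Python sketch of how an analytics report might bucket incoming requests by their Referer header. It is not how any particular analytics product works: the function names are hypothetical, and the domain grouping is a naive "last two labels" heuristic rather than a real public-suffix lookup.

```python
from collections import Counter
from urllib.parse import urlsplit


def referrer_bucket(referer_header):
    """Return the bucket a request is filed under, given its Referer value."""
    if not referer_header:
        # Missing or empty header: indistinguishable from typed/bookmarked visits.
        return "none"
    host = urlsplit(referer_header).hostname
    if not host:
        return "none"
    # Assumption: group by the last two host labels (e.g. en.wikipedia.org
    # becomes wikipedia.org). A real report would use a public-suffix list.
    return ".".join(host.split(".")[-2:])


def tally(referer_headers):
    """Count requests per referrer bucket."""
    return Counter(referrer_bucket(h) for h in referer_headers)


hits = [
    "https://en.wikipedia.org/wiki/DNA",  # referrer was sent
    None,                                 # referrer stripped or never set
    "",                                   # empty header
    "http://www.ncbi.nlm.nih.gov/pubmed/",
]
print(tally(hits))
```

In this sample, the two requests without a usable Referer header land in the same "none" bucket, whatever their real origin was.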

Back to the CrossRef example: Wikimedia is the only site in the top ten referrer list that consistently links to non-secure sites (including the DOI resolver) from secure pages (when users browse Wikimedia sites over HTTPS). As a result, Wikipedia is likely to be a large source of dark traffic to CrossRef, at least since the HTTPS switchover became effective. Globally, Wikimedia sites are likely to be a very large source of dark traffic to any site. Other content publishers have addressed this issue by adopting a so-called "referrer policy".[11]

Sites with a referrer policy

Transparently advertising the source of traffic is an important piece of strategy for many web companies. When HTTPS turns off referrers, it undermines the ability of these companies to signal where they are funneling traffic and the ability of target sites to understand where traffic is coming from. While this behavior is a literal application of the HTTP/1.1 specification[12], this rule "doesn't transition well into a future where almost every website uses HTTPS".[13]

A new HTML5 tag called "meta referrer" has been designed to allow content publishers to specify, at the document level, the behavior of the HTTP Referer header, regardless of whether HTTP or HTTPS is being used. According to the specifications,[11] "authors can set a policy for documents they create" and determine the behavior of "the referer HTTP header for outgoing requests and navigations".

The meta referrer tag can take four possible values:

never
    always send an empty Referer header.
default
    use the default policy, which implies that the Referer header is empty if the original page is encrypted.
origin
    only send the "Origin", not the full URL. This is sent from HTTPS to HTTP and includes the hostname, not the page visited or URL parameters.
always
    always send the full header, even from HTTPS to HTTP.
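The four values listed above can be sketched as a small Python function. This is a simplified model of the behavior described on this page, not browser code: the function names are hypothetical, the DOI is used only as a sample target, and "default" is modeled as dropping the header only when a secure page links to a non-secure target.

```python
from urllib.parse import urldefrag, urlsplit

POLICIES = ("never", "default", "origin", "always")


def meta_referrer_tag(policy):
    """Render the meta tag a page would include to declare `policy`."""
    if policy not in POLICIES:
        raise ValueError(f"unknown referrer policy: {policy!r}")
    return f'<meta name="referrer" content="{policy}">'


def referer_header(policy, source_url, target_url):
    """Return the Referer value sent to `target_url`, or None if none is sent."""
    if policy not in POLICIES:
        raise ValueError(f"unknown referrer policy: {policy!r}")
    src = urlsplit(source_url)
    dst = urlsplit(target_url)
    # The full referrer never carries the fragment part of the source URL.
    full = urldefrag(source_url).url
    # Assumption: "encrypted page" is read as an HTTPS to non-HTTPS downgrade.
    downgrade = src.scheme == "https" and dst.scheme != "https"

    if policy == "never":
        return None
    if policy == "default":
        return None if downgrade else full
    if policy == "origin":
        # Scheme and host only: no path, no query parameters.
        return f"{src.scheme}://{src.netloc}/"
    return full  # "always"


page = "https://en.wikipedia.org/wiki/DNA?action=history"
doi = "http://dx.doi.org/10.1000/182"
for policy in POLICIES:
    print(policy, "->", referer_header(policy, page, doi))
```

For a secure Wikipedia page linking to the non-secure DOI resolver, "never" and "default" send nothing, "origin" sends only `https://en.wikipedia.org/`, and "always" sends the full page URL including its query string.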

Support for the meta referrer tag is available or under development in most major browsers.[13] Popular websites have adopted different referrer policies:

Referrer policy

Site          Meta Referrer
Facebook      Yes: origin-when-crossorigin
Google        Yes: Origin
Reddit        Yes: Always by default, origin on sensitive pages
Hacker News   Yes: Origin

A referrer policy for Wikimedia sites

Adopting an Origin policy for Wikimedia sites will allow traffic to be qualified as originating from Wikimedia without disclosing the specific URL or parameters associated with the request.

Lacking a global referrer policy, for specific websites that support HTTPS, the referrer can be preserved by ensuring that all outgoing links from Wikimedia sites point to the secure version of the target site. For DOIs in particular, this could be achieved by updating the corresponding cite templates (such as {{Cite journal}}) to point to the secure version of the DOI resolver (https://dx.doi.org/).
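As a rough illustration of the link rewrite described above, the following Python sketch upgrades non-secure DOI resolver links in a piece of text. It is not the actual template change: the function name is hypothetical, and matching both dx.doi.org and doi.org is an assumption (the text above only names https://dx.doi.org/).

```python
import re

# Assumption: non-secure DOI links look like http://dx.doi.org/... or
# http://doi.org/...; anything else is left untouched.
_INSECURE_DOI = re.compile(r"\bhttp://((?:dx\.)?doi\.org/)", re.IGNORECASE)


def secure_doi_links(text):
    """Rewrite non-secure DOI resolver links in `text` to use HTTPS."""
    return _INSECURE_DOI.sub(r"https://\1", text)


sample = "Handbook: http://dx.doi.org/10.1000/182 (see also http://example.org/)"
print(secure_doi_links(sample))
```

Links that already use HTTPS, and links to other hosts, pass through unchanged, so the rewrite can be applied repeatedly without side effects.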

Impact

Should this proposal be implemented, we will work with external partners that receive significant traffic from Wikimedia to monitor changes in referred traffic at the time of the deployment. So far, we have identified CrossRef and the BBC as interested parties.

See also

References

  1. How did Wikipedia manage to get such a high Google PageRank?, Quora
  2. Wass, J. (2015) Introducing CrossRef Labs DOI Chronograph, CrossTech
  3. DOI lookup statistics only capture requests that use DOIs to link to scholarly papers. Requests that directly target a publisher version or an author preprint of a paper (such as those referred from Google Scholar) are not included in these statistics.
  4. Bilder, G. (2014) Many Metrics. Such Data. Wow., CrossTech
  5. DOI Chronograph: wikipedia.org, CrossRef Labs
  6. Wass, J. (2015) Introducing the CrossRef Labs DOI Chronograph, CrossTech
  7. "The future of HTTPS on Wikimedia projects « Wikimedia blog". Retrieved 2014-12-10. 
  8. A breakdown of browser behavior when serving page requests referred from Google. Note that authenticated Chrome users will always land on Wikimedia properties over SSL
  9. http://blog.wikimedia.org/2015/06/12/securing-wikimedia-sites-with-https/
  10. "Dark Social: We Have the Whole History of the Web Wrong - The Atlantic". www.theatlantic.com. Retrieved 2014-12-09. 
  11. a b "Referrer Policy". w3c.github.io. Retrieved 2014-12-10. 
  12. "Clients SHOULD NOT include a Referer header field in a (non-secure) HTTP request if the referring page was transferred with a secure protocol." See: RFC 2616 (HTTP 1.1), Security considerations
  13. a b "Smerity.com: Where did all the HTTP referrers go?". smerity.com. Retrieved 2014-12-09.