Research talk:Wikimedia referrer policy

From Meta, a Wikimedia project coordination wiki

Feedback on this proposal is welcome on this page.

great

This is a timely and sensible proposal. It is great that Dario has discovered this issue and proposed how to address it. Pundit (talk) 16:58, 20 January 2015 (UTC)

I agree an "Origin" policy sounds sane, but we should clarify the #Target and ensure it wouldn't make fingerprinting significantly easier (for this I suggest asking e.g. Zack Weinberg [1]). --Nemo 17:16, 20 January 2015 (UTC)
I'm not certain I understand the proposal myself. This is what I got out of it: Pages served off *.wikipedia.org are themselves (by default) HTTPS, but outbound links to cleartext HTTP pages are very common. Browsers don't send a Referer [sic] header when traversing a link from an HTTPS page to an HTTP page. Therefore, lots of people going from Wikipedia to other sites show up as "dark traffic" on those other sites.
The proposal, then, is to make use of a new HTML5 feature to restore a partial Referer, which would reveal to those sites that the origin of these inbounds is Wikipedia, but not the specific page.
If that's all correct, then, off the top of my head, I observe that the origin in HTTP/HTML terms includes the full hostname. So you're not just indicating Wikipedia, you're indicating which language's encyclopedia, metawiki, etc. And it's not difficult for a site that really wants to know, to crawl (a particular language's) Wikipedia and discover exactly which pages link to it. Probably, even when there are a lot of outbounds from WP to a particular site (say, a news archive) any given encyclopedia article is only going to link to a few dozen pages on that site at most (all the references for that article). So setting this knob to "Origin" instead of "Always" is only a speedbump.
Whether revealing this information is acceptable may vary on a per-language-community and per-user basis. I'd suggest that, at a minimum, logged-in users should be able to turn it off.
--Zwol (talk) (aka the Zack Weinberg referenced above) 20:14, 20 January 2015 (UTC)
Note that the purpose of the "no referrer when downgrading from HTTPS to HTTP" policy is not to protect your privacy from the site you are going to visit (how is it any better when an HTTPS site learns which Wikipedia page you visited than when an HTTP site does?), but to protect it against someone sniffing your connection. When you use HTTPS to browse Wikipedia, someone listening in on your connection (your ISP, the guy sitting next to you in the cafe where you connect via wifi, the malware on your router, etc.) will only learn that you are visiting the domain en.wikipedia.org; all other details, including full URLs (and thus page names), are encrypted. If there weren't any referrer blocking, when you leave Wikipedia, the page you were on would be sent as the Referer header of an unencrypted HTTP request to the new site, and the eavesdropper would see it. That would be bad, so browsers prevent the sending of full URLs on an HTTPS -> HTTP transition. This proposal would not jeopardize that: the eavesdropper would still not learn what page you have been reading. (They know which domain you have been visiting anyway; HTTPS does not protect against that.) --Tgr (WMF) (talk) 20:36, 20 January 2015 (UTC)
On top of what Tgr (WMF) said, you're correct: the policy would disclose the *full hostname*, revealing the language (e.g. fr), access method (e.g. fr.m.) and project (e.g. wikisource). --Dario (WMF) (talk) 23:14, 20 January 2015 (UTC)

+1 for supporting this; we need to de-darken ASAP for this data. --Piotrus (talk) 11:03, 22 January 2015 (UTC)

+1 just jumping on the pile. This is a solid proposal. Big thanks to Dario for identifying the problem and proposing a good strategy for doing something about it. --Halfak (WMF) (talk) 21:23, 27 January 2015 (UTC)

Just wanted to add my +1 as well. I think it would be a pity if we hid from, e.g., GLAM institutions how many readers are coming through Wikipedia, and thus hid our value to them. --denny (talk) 21:35, 12 February 2016 (UTC)

Target

The text of the page is unclear and the scope of the proposal is therefore vague: «default policy, which implies that the Referer header is empty if the original page is encrypted» vs. «Lacking a global referrer policy, for specific websites that support HTTPS, the referrer can be preserved». Does this mean that the "dark traffic" in question is only about links to http URLs?

If so, from my perspective the main issue here is the linking of http resources from our https sites: if such traffic is significant, we need a "server-side HTTPS Everywhere" initiative. --Nemo 17:16, 20 January 2015 (UTC)

The interwiki map was updated for dx.doi.org, which only affects the links which use the interwiki prefix, which – I discovered – are not that many because of the infamous example [[doi:10.1002/(SICI)1096-9063(199911)55:11<1043::AID-PS60>3.0.CO;2-L]] which is not a proper link for MediaWiki (invalid Title due to <> etc.). But the interwiki map is a global policy, so I already got all the templates [2] updated (except ar, ru), as well as [3] and [4] etc. --Nemo 07:19, 24 January 2015 (UTC)[reply]
I grepped the 2014-12 dumps for all Wikimedia projects and found 38052 lines containing http://dx.doi.org links, of which 1343 are probably in templates (they match '{{{'). A mass replacement is needed, but it is not too hard. --Nemo 15:28, 24 January 2015 (UTC)
We're now down from 868609 http://dx.doi.org links in en.wiki on the 23rd to 467616 right now. --Nemo 01:45, 28 January 2015 (UTC)
And now 21727. Will cook up a bot soon. --Nemo 10:25, 7 February 2015 (UTC)
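For readers wondering what the counting and mass replacement discussed in this thread might look like, here is a minimal sketch in Python. It is not the bot or the grep command actually used; the function names are hypothetical, and it assumes the only pattern of interest is the literal prefix http://dx.doi.org/ in wikitext.

```python
import re

# Literal insecure resolver prefix discussed in this thread.
INSECURE_DOI = re.compile(r"http://dx\.doi\.org/")


def count_insecure_doi_lines(lines):
    """Count lines containing at least one http://dx.doi.org/ link (like a grep count)."""
    return sum(1 for line in lines if INSECURE_DOI.search(line))


def secure_doi_links(text):
    """Rewrite every http://dx.doi.org/ prefix to https://dx.doi.org/ (hypothetical helper)."""
    return INSECURE_DOI.sub("https://dx.doi.org/", text)


# Made-up sample wikitext, for illustration only.
sample = [
    "* [http://dx.doi.org/10.1000/182 Some reference]",
    "A line without any link.",
    "{{{doi|http://dx.doi.org/}}}",
]
remaining = count_insecure_doi_lines(sample)
fixed = [secure_doi_links(line) for line in sample]
```

A real bot would also have to deal with templates that build the URL from parameters, which a plain text replacement like this does not see.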

HTTPS Everywhere not so simple

I know DOIs aren't the whole story, but they do provide a very interesting motivating case, and one that's very close to Wikipedia's heart. DOIs are persistent identifiers for scholarly content: URLs designed never to break. They can be accessed over both HTTP and HTTPS, but virtually all DOIs in existence (including those that will be used in citations in the future) are HTTP. Of course it would be great if everyone had "server-side HTTPS everywhere". However, it's not only a question of server implementation (that problem is already solved for DOIs), but also of dealing with URLs that already exist and are in circulation, of which there are, in this case, 70 million. --Joe Wass (talk) 17:44, 20 January 2015 (UTC)

I'm not sure what you are trying to say, but I'll try to interpret. By «not only a question of server implementation (that problem is already solved for DOIs)», I guess you mean that dx.doi.org already resolves to https URLs whenever possible: correct? By «URLs that already exist and are in currency», do you mean links to dx.doi.org? Their number is irrelevant: be they 10, 10 million or 10 billion, replacing all links to a single domain in all pages of all Wikimedia projects is trivial. --Nemo 18:06, 20 January 2015 (UTC) P.S.: https://dx.doi.org doesn't work for me. This is something they must fix; then we can update the Interwiki map and see what's left to update locally.
Sorry if I didn't make myself clear. Yes, I'm talking about DOIs and only DOIs. I'm trying to say that lots of DOIs exist and are in use, and because they are persistent identifiers, people copy and paste them verbatim. If you're able to run a bot to update all the links to HTTPS, and to somehow automatically convert all new edits in the future, then that's great. I would love to see this policy on all Wikimedia projects (at least 160 Wikipedia subdomains and 25 Wikimedia subdomains refer to DOIs). Not wishing to get entirely off-topic, but there is work under way to identify DOIs in new edits (sponsored by CrossRef), so it could in theory be extended to do that. As a side note, by coincidence the IDF are having problems with DNS today; the fixes should be propagating. That's probably why your visit to https://dx.doi.org failed (please try again!). Joe Wass (talk) 20:01, 20 January 2015 (UTC)
Edit filters could be used to warn editors when they are adding insecure DOI references. Avoiding insecure outlinks from secure pages is desirable for privacy reasons as well, so that would be a better use of our resources, IMO. --Tgr (WMF) (talk) 20:23, 20 January 2015 (UTC)
Slowing down editors is not something we should do for "errors" which a bot can easily fix. https://dx.doi.org still doesn't resolve for me; I'll look into a mass replacement tomorrow. --Nemo 21:00, 20 January 2015 (UTC)
Works fine for me, e.g. [5]. There is a significant speed hit (600ms for HTTPS vs. 200ms for plain) though. --Tgr (WMF) (talk) 21:38, 20 January 2015 (UTC)
There were problems with the DNS yesterday, and the fixes are taking a while to propagate. Very unfortunate timing with this discussion! More information in this blog post. Joe Wass (talk)

origin vs. origin-when-cross-origin

The section starting with "The meta referrer tag can take 4 possible values" does not show much similarity with the actual spec, even though it links to it. The Working Draft from this summer specifies five values: directive-value = "none" / "none-when-downgrade" / "origin" / "origin-when-cross-origin" / "unsafe-url" (the latest editor's draft, which is linked from the proposal, uses the same five options, but the keywords are slightly different). origin means never sending the full URL, even for internal navigation, which would make it impossible for Wikimedia to track user pathways within the site, for example. origin-when-cross-origin is a strictly less bad choice, as it only removes the path part of the URL when the target is at a different domain (i.e. not Wikimedia). --Tgr (WMF) (talk) 19:49, 20 January 2015 (UTC)

It seems the values in the proposal are from CSP 1.1. Note that there are three different ways of specifying a referrer policy:

  • a Content-Security-Policy: referrer origin; header (CSP 1.1 and upwards)
  • a <meta http-equiv="content-security-policy" content="referrer origin" /> tag (CSP 1.1 and upwards)
  • a <meta name="referrer" content="origin" /> tag (RP)

Browser support is not necessarily identical for those options (CSP 1.1 has been around longer; it does not have origin-when-cross-origin or anything equivalent, though). --Tgr (WMF) (talk) 20:20, 20 January 2015 (UTC)

Now there's also strict-origin-when-cross-origin, which might be more appropriate for us (so that we still encourage websites to adopt HTTPS). --Nemo 22:20, 10 June 2017 (UTC)
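To make the differences between the policy values named in this section concrete, here is a rough sketch in Python of what a browser would send under each one. It is a simplified model, not the spec's algorithm: the function names are hypothetical, the keywords are the ones quoted above plus strict-origin-when-cross-origin, and details such as stripping fragments or credentials and normalizing default ports are deliberately ignored.

```python
from urllib.parse import urlsplit


def origin_of(url):
    """Return scheme://host[:port]/ for a URL (simplified: no port normalization)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


def referrer_for(policy, source, target):
    """Return the Referer value sent when navigating source -> target, or None if omitted."""
    src, dst = urlsplit(source), urlsplit(target)
    # A downgrade is a navigation from an HTTPS page to a non-HTTPS target.
    downgrade = src.scheme == "https" and dst.scheme != "https"
    # Same origin requires the scheme and the exact host (and port) to match.
    same_origin = (src.scheme, src.netloc) == (dst.scheme, dst.netloc)

    if policy == "none":
        return None
    if policy == "none-when-downgrade":
        return None if downgrade else source
    if policy == "origin":
        return origin_of(source)
    if policy == "origin-when-cross-origin":
        return source if same_origin else origin_of(source)
    if policy == "strict-origin-when-cross-origin":
        if downgrade:
            return None
        return source if same_origin else origin_of(source)
    if policy == "unsafe-url":
        return source
    raise ValueError(f"unknown policy: {policy}")
```

For example, with the page https://en.wikipedia.org/wiki/Example and an http:// target, this model returns nothing under none-when-downgrade and strict-origin-when-cross-origin, and https://en.wikipedia.org/ under origin and origin-when-cross-origin.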

Effect on HTTP referrers

The default referrer policy is No Referrer When Downgrade, which means that the full URL is sent to secure targets and no referrer at all to insecure targets. By contrast, Origin Only or Origin When Cross-Origin means that secure targets will only receive the origin, so e.g. content authors will be unable to find out which Wikipedia article references their content. (There is no "Origin When Downgrade" option, which seems to me like a major oversight in the spec. Since the spec is still very much in flux, with the last editor's draft dating from three months ago, it might be worth raising this point on the webappsec list or the issue tracker.)

This might be wanted from a privacy point of view, but if the motivation is purely to support the analytics capabilities of third parties, this change is entirely misguided, as it will reward sites that do not use HTTPS while punishing sites that do. That is not responsible behavior if we care about the security of the web, IMO. --Tgr (WMF) (talk) 20:02, 20 January 2015 (UTC)

Good point. +1 on asking them about "Origin When Downgrade", which seems to be our issue here. --Nemo 08:07, 21 January 2015 (UTC)

Isn't the referral leaked when using origin?

I think this will be great for data science, but I cannot see that it will be great for readers and writers. Wouldn't it, even with origin, be possible in many cases to see which Wikipedia page the web surfer came from, because the number of distinct pages on which a URL appears is limited? For example, DOI:10.2967/jnumed.107.045518 appears in two articles. -- Finn Årup Nielsen (fnielsen) (talk) 10:03, 21 January 2015 (UTC)

Indeed. It was said above as well: «And it's not difficult for a site that really wants to know, to crawl (a particular language's) Wikipedia and discover exactly which pages link to it». --Nemo 11:20, 22 January 2015 (UTC)
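The concern raised in this section can be sketched as a reverse lookup: once a site has crawled the wiki named in the origin, it can map each of its own URLs back to the few pages that link to it. The Python below is only an illustration under that assumption; the page titles and the crawl result are made up, and the function names are hypothetical.

```python
from collections import defaultdict


def build_link_index(crawled_pages):
    """Map each outbound URL to the set of wiki page titles that link to it."""
    index = defaultdict(set)
    for page_title, outbound_links in crawled_pages.items():
        for url in outbound_links:
            index[url].add(page_title)
    return index


def candidate_source_pages(index, requested_url):
    """Return the wiki pages a visitor arriving at requested_url could have come from."""
    return sorted(index.get(requested_url, set()))


# Hypothetical crawl of one wiki: page title -> external links found on that page.
crawl = {
    "Article A": ["http://dx.doi.org/10.2967/jnumed.107.045518"],
    "Article B": ["http://dx.doi.org/10.2967/jnumed.107.045518",
                  "http://example.org/other"],
    "Article C": ["http://example.org/other"],
}
index = build_link_index(crawl)
candidates = candidate_source_pages(index, "http://dx.doi.org/10.2967/jnumed.107.045518")
```

When a URL appears on only one or two pages, as in the DOI example above, the origin plus the requested URL narrows the source page down to those few candidates.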

Spammers

Referral traffic seems valuable. Would this make us a (bigger) target for (new) spammers?

How does this interact with our Nofollow policy?

--Kim Bruning (talk) 21:20, 4 March 2015 (UTC)

I agree this is a problem. Spammers would love feedback about how well their carefully placed links on wikipedia.org are working, and if they got any traffic at all they would redouble their efforts to spread their links with blatant spam and more subtle fake references. Johnuniq (talk) 01:17, 8 April 2016 (UTC)

Comparison updates

Since the original page was published, it looks like Facebook is now using 'origin-when-cross-origin', and Hacker News is using 'origin'.

I also reviewed a number of old browsers, and in general, it looks like all browsers that implement the header implement it consistently. So I don't think there's a danger of 'origin' being interpreted as something other than what http://www.w3.org/TR/referrer-policy/ specifies. CSteipp (WMF) (talk) 16:41, 23 October 2015 (UTC)

Origin referrer has been code-reviewed

This is to notify that T87276 has been code reviewed and is ready for deployment. Configuration has been reviewed for security but is not set on any wikis yet.

The team wanted to touch base to make sure that there are not any ongoing concerns about this change, as much of the conversation occurred several months ago. It seems to be generally supported by participants of this page, but there have been some questions as well, and the team wants to check on unresolved questions or concerns prior to deployment. There's also the consideration of timing: this move does give us additional data, but WMF teams generally do not make drastic changes during the month of December. It's possible this would not deploy until January.

There's also discussion on how to announce this to communities overall: there are community channels within projects, but a blog post might also be helpful. If there are any thoughts on that, please post here. -Rdicerb (WMF) (talk) 22:52, 16 November 2015 (UTC)

#Effect on HTTP referrers is still valid: it's a pity that no solution was found that would still encourage websites to adopt HTTPS. Was upstream even asked? Nemo 08:06, 15 February 2016 (UTC)

Also used for xwiki?

@Dario (WMF): Is this the process that we will be using for internal xwiki traffic? Or does that follow a different path when we are using interwiki mapping? -- billinghurst sDrewth 01:21, 9 February 2016 (UTC) (please ping when replying)

@Billinghurst: the proposed change is to adopt an Origin When Cross-Origin policy, so any time a user navigates from domain A to a different domain B, if that navigation downgrades from HTTPS to HTTP, the referring domain A would be recovered (reference). Internal cross-wiki traffic (say *.wikipedia.org -> *.wikipedia.org), being over HTTPS (no downgrade) and same-origin, should already expose the resulting request as referred from wikipedia.org. Copying CSteipp (WMF) to confirm that this is actually the case. --Dario (WMF) (talk) 21:06, 10 February 2016 (UTC)
To clarify, cross wiki traffic from *.wikipedia.org -> *.wikipedia.org is not same origin. The hostname must match exactly. Only aa.wikipedia.org/x -> aa.wikipedia.org/y are the same origin. This can be seen by looking (with web developer tools) at the Referer: request header when clicking on an interwiki link on a Wikipedia pointing to another Wikipedia.
Due to this configuration change, any time a user clicks on a link on a Wikimedia wiki, the webserver of the target link is informed which Wikimedia website the link was clicked on. It doesn't inform the target webserver which wiki page the link was clicked on. The referring wiki page name is only sent to the target webserver when the clicked link goes to the exact same wiki.
However, as each external link typically only appears on a few pages on each wiki, it is very likely that the external webserver can work out which page you were reading if it is a public wiki.
John Vandenberg (talk) 07:08, 2 March 2016 (UTC)
@John Vandenberg: that is correct, it would be theoretically possible for an external website to reconstruct the page visited from a link present on a public wiki. --Dario (WMF) (talk) 00:02, 9 March 2016 (UTC)
Because of this very possibility, and the extension that an eavesdropper would be able to use the same attack, I have raised this issue at en:WP:Village pump (policy). Rich Farmbrough 22:35 31 March 2016 (GMT).
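The same-origin point made in this thread (the hostname must match exactly) can be checked with a few lines of Python. This is a minimal sketch, not browser code: the helper names are hypothetical, and it assumes that an origin is the (scheme, host, port) triple with the usual default ports for http and https.

```python
from urllib.parse import urlsplit

# Default ports assumed when a URL does not state one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that identifies a URL's origin."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def same_origin(a, b):
    """True only if scheme, exact hostname and port all match."""
    return origin(a) == origin(b)
```

Under this model, aa.wikipedia.org/x and aa.wikipedia.org/y share an origin, while a link from aa.wikipedia.org to any other *.wikipedia.org host is cross-origin, so an Origin When Cross-Origin policy would send only the origin for it.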

Proposal: be a silent referrer

Proposed: we should reveal as little as possible to the outside world about what pages our users read or what links they follow, or even whether they have accessed a Wikimedia site at all. As far as possible we should be a silent referrer and a source of "dark traffic". We should also do what we can to make links from Wikipedia pages as valueless as possible for search engine optimization (SEO).

Doing this might include:

  • Putting <meta name="referrer" content="none"> in the <head> of every page.
  • Adding rel="nofollow noreferrer" to every link.
  • Not sending Server Name Indication by becoming our own certifying authority.
  • Encouraging local language Wikipedias to modify unique URLs used for tracking into generic URLs as far as possible.

The technical details are, of course, open to discussion and revision. This proposal is about what we should do, not how best to do it once we have decided.

Rationale:

If the owner of a website knows that a visitor came from Wikipedia, or even that the visitor visited a Wikimedia site and then later visited his site, this gives him valuable information as to what links on Wikipedia are effective, and thus encourages link-spamming. Any traffic is valuable, even if it does not contribute to a higher search engine ranking.

Every small bit of information, even if seemingly useless by itself, helps a clever adversary with lots of resources to figure out what pages our readers are looking at. See en:Wikipedia:Village pump (policy)#Privacy and dark traffic. Here is the reason why, whenever possible, we should reveal as little as possible about what pages our users read or what links they follow:[6][7][8] --Guy Macon (talk) 01:57, 8 April 2016 (UTC)

Support/Oppose/Comment

  • I didn't want to make the proposal longer (the more words in a proposal the fewer people bother to read it), but if we do this we should, as far as possible, not be a silent referrer and a source of "dark traffic" to other parts of Wikipedia or to related, Wikimedia Foundation-controlled projects. So if a GLAM partner is legally bound by the Wikimedia privacy policy, they should be, as far as possible, provided with referrer information. If not, they shouldn't see anything that any other site isn't allowed to see. --Guy Macon (talk) 07:22, 28 May 2017 (UTC)
  • Support consistent with " So if a ... partner is legally bound by the Wikimedia privacy policy, they should be, as far as possible, provided with referrer information. If not, they shouldn't see anything that any other site isn't allowed to see." I would refine this by saying "credibly bound": for example, I personally would not trust Facebook to comply even if theoretically bound by this policy, based on their indisputable history of gaming compliance with other partners, e.g., Apple AppStore TOS. --Joe Decker (talk) 18:03, 22 June 2017 (UTC)
  • oppose no evidence this is a problem. micromanaging the WMF is not a good idea; flexible response is a better idea. Slowking4 (talk) 11:59, 26 June 2017 (UTC)
  • Support: I believe there is substantial evidence that there is a problem. The world will likely never know the full extent of what the Chinese government did to persecute Liu Xiaobo and his followers nor what is currently being done in Turkey nor even the US. We know that James Clapper kept his job after perjuring himself before the US Senate, and Edward Snowden is in exile and would be in prison for exposing Clapper's perjury. We also know that when then-President Obama asked the CIA for examples of success with their recommendations, "they couldn't come up with much." [Mazzetti, Mark (October 14, 2014), "C.I.A. Study of Covert Aid Fueled Skepticism About Helping Syrian Rebels", New York Times, retrieved 2017-03-09]
Recent comments on the Trump administration's efforts to overturn net neutrality filed by the Electronic Frontier Foundation on behalf of Internet engineers noted that, "a number of ISPs ... were rerouting their customers’ search queries to a third-party company, Paxfire, which in some cases then sent users to websites they did not request [apparently to monetize the] users' searches." (p. 37)
Whether it's governments or commercial concerns or organizations like Cambridge Analytica, organizations around the world are finding ways of using that information in ways that harm consumers. DavidMCEddy (talk) 02:12, 31 July 2017 (UTC)

Ongoing?

The project is still marked as ongoing: what is left to do? Some data analysis on Wikimedia's end? Or on some target domain's end? Nemo logged out --131.175.28.130 08:56, 27 March 2017 (UTC)

Effect on BEIC

I recently got access to the Google Analytics for BEIC domains linked by Wikimedia wikis and the effect of this change has been clear: "referral" traffic for gutenberg.beic.it was 23 % in January 2017 and jumped to 73 % in March 2017 (while the absolute totals remained rather stable). With HTTPS, referral traffic had dropped from 87 % in May 2015 to 23 % in July 2015. Referrals from it.m.wikipedia.org relative to referrals from it.wikipedia.org were 33 % in May 2015, 12 % in February-March 2016, 17 % in April 2016 and 27 % in February 2017. Considering that mobile usage increased in the meantime, we probably still have some 15 % "lost referrers" from Wikimedia wikis due to outdated clients (largely on mobile, cf. phabricator:T148780#2891117) and lack of HTTPS on beic.it. Cc Marco Chemello and Chiara Consonni. --Federico Leva (BEIC) (talk) 09:55, 27 March 2017 (UTC)

RfC Announce: Wikimedia referrer policy

The WMF is not bound by RfCs on the English Wikipedia, but we can use an advisory-only RfC to decide what information, if any, we want to send to websites we link to and then put in a request to the WMF. I have posted such an advisory-only RfC, which may be found here:

en:Wikipedia:Village pump (policy)/RfC: Wikimedia referrer policy

Please comment so that we can determine the consensus of the Wikipedia community on this matter. --Guy Macon (talk) 21:17, 10 June 2017 (UTC)