Talk:Labs TOU Consultation Round 1 (2016)

From Meta, a Wikimedia project coordination wiki

Discuss Use of Third Party Resources

  • What about the use of third-party APIs like that of Google? I used the Google API to check the channel id of a YouTube account based on someone entering the user id or channel name in the Wikidata field for YouTube channel ID. And what about the change I made by fetching the page of a YouTube account and gathering the channel id from the meta tags? Mbch331 (talk) 19:32, 20 May 2016 (UTC)
Calling out to a 3rd-party service from your backend code is generally fine. You should not share personal information about your users (e.g. on-wiki username, IP address, etc.) with that 3rd party unless you have warned them beforehand and received their consent.
What we really want to stop is embedding links to 3rd-party resources (javascript, images, css, iframes) directly in the application pages. The problematic aspect of embedded 3rd-party resources is that they compel the visitor's browser to interact directly with the 3rd party, which exposes their IP address and allows cookies to be attached to their browsing session. This sort of behavior is "normal" on the modern Internet, but it is in reality a breach of privacy that we can and should avoid. A tool hosted in Labs/Tool Labs should, in my opinion, be just as respectful of an individual's privacy rights as the production Wikipedia and sister projects are. BDavis (WMF) (talk) 21:37, 20 May 2016 (UTC)
ZZhou (WMF), could you clarify here and/or in the main document that the concerns about "use or integrate resources hosted on third-party servers" are specifically scoped to uses that expose sensitive data to the third party and have not been explicitly agreed to by the end user? The ticket I filed at T129936, which in part led to this discussion, was specifically about forced browser interactions. I think an overbroad interpretation disallowing all third-party interactions from the backend, or disallowing consensual browser interactions with third-party servers, would be harmful rather than helpful. BDavis (WMF) (talk) 18:37, 21 May 2016 (UTC)
You are right that perhaps we should be more lenient towards third-party interaction on the back-end, or allow users to opt in to third-party tracking. I have revised the discussion section. --ZZhou (WMF) (talk) 22:40, 23 May 2016 (UTC)
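As a hedged illustration of the backend pattern discussed above (the tool's server fetches the third-party page and extracts the channel id, so the visitor's browser never contacts the third party directly), here is a minimal sketch; the meta tag attribute names are assumptions for illustration, not YouTube's documented markup:

```python
# Sketch of the backend pattern described above: the tool's server fetches
# the third-party page and pulls the channel id out of its meta tags, so
# the visitor's browser never contacts the third party directly.
# The meta tag attributes ("itemprop"/"channelId") are assumptions.
from html.parser import HTMLParser

class ChannelIdParser(HTMLParser):
    """Collects the content of a <meta itemprop="channelId" ...> tag."""

    def __init__(self):
        super().__init__()
        self.channel_id = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr_map = dict(attrs)
            if attr_map.get("itemprop") == "channelId":
                self.channel_id = attr_map.get("content")

def extract_channel_id(page_html):
    """Return the channel id found in the page's meta tags, or None."""
    parser = ChannelIdParser()
    parser.feed(page_html)
    return parser.channel_id
```

The page HTML itself would be fetched server-side (and any sensitive request data kept on the server), keeping the end user's IP address and cookies out of the exchange.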
  • Hosting or mirroring on Labs has the disadvantage of possible duplicate downloading and duplicate caching on the client side. --Purodha Blissenbach (talk) 07:43, 21 May 2016 (UTC)
  • If a CDN is used, and the user has visited a site that used the same CDN for the same file (e.g. MaxCDN with Bootstrap), then it will be cached in the user's browser and not downloaded again. This has potential speed benefits, and uses less storage on Labs because maintainers do not each have to copy the files into their individual tools. Also, other services such as reCAPTCHA are useful to prevent spam, because locally hosted services are often less effective. Tom29739 (talk) 18:26, 21 May 2016 (UTC)
Tool Labs itself has the cdnjs and static tools designed to host widely used resources. I personally don't believe that any particular third-party CDN is so commonly used that there can be a guarantee of a high rate of local browser cache hits. Even if that can be proven false, trading an IP address, cookies, and HTTP referrer information for a small data download is not a decision that I feel Wikimedia should be forcing on its visitors.
Google's reCAPTCHA service is one of the types of forced browser interactions I would most like to see avoided on the wmflabs.org domain. reCAPTCHA comes with Google's blanket privacy policy and has been documented as performing device fingerprinting and setting long-lived cookies. As stated elsewhere, I understand that these practices are common on the Internet of 2016, but I personally believe that Wikimedia can be better than common. BDavis (WMF) (talk) 19:03, 21 May 2016 (UTC)
reCAPTCHA vastly outperforms open-source captcha implementations both in terms of difficulty to crack and ease of use. Given the choice, many users would willingly share private data to avoid the poor usability and waste of time that come with the usual "try to beat a computer at recognizing horribly distorted letters" captchas. Also, how does reCAPTCHA compare with locally hosted alternatives in terms of difficulty of setup? I think efforts should be focused instead on creating Labs' own captcha service that any Labs developer can use just by plugging in a few lines of javascript, and this proposal could be resumed afterwards. --Tgr (talk) 10:11, 22 May 2016 (UTC)
I'm creating such an API. The trouble with such a service is that it will be a traditional, unreadable-text one. reCAPTCHA has a risk detection engine which makes it much more effective. The ConfirmEdit extension page on mediawiki.org says that unreadable-text captchas are ineffective and reCAPTCHA is very effective. We shouldn't have to reinvent the wheel and go back to the dark ages of captchas. Most users would probably thank us for letting them use an easy checkbox rather than unreadable text. By disallowing reCAPTCHA, you are effectively giving spammers free access to tools. Tom29739 (talk) 21:44, 22 May 2016 (UTC)
  • Replacing an externally hosted JS file with a locally hosted one is not too much effort and can reasonably be expected from a Labs developer. Replacing browser-initiated calls to third-party services with ones that are proxied through the Labs machine *is* much effort, and might not be possible at all. Tools that pass data to third-party sites should require a clear warning and allow users to leave before any data transfer happens; I'm not sure enforcing anything more than that via the TOU is a good idea. (Tools used on WMF sites should of course follow the WMF privacy policy. Arguably, tools linked from WMF sites should as well, although that sounds unfeasible in practice.) --Tgr (talk) 10:11, 22 May 2016 (UTC)
+1 (if I understand you correctly). Labs is for experiments, and those may (initially) require integrating third-party services. If a user has previously agreed to that, I don't see the harm, and I also think that in the long term, if a tool/application is useful to a wider audience, it should be cleaned up to not depend on any third-party services (and turned into an extension/gadget). But that should not be a prerequisite for starting to develop something. --Tim Landscheidt (talk) 23:30, 23 May 2016 (UTC) P. S.: Proxying requests is often just as bad if the third party can correlate requests to their service with actions on Wikipedia.

I completely agree. I think the essential point here is consent. As long as the user explicitly consents ('Yes, I agree that some personal information will be shared with [host X, host Y]'), I think it is reasonable to allow this -- not unlike the current 'By using this project, you agree that any private information you give to this project may be made publicly available and not be treated as confidential.' message requirement. Valhallasw (talk) 08:20, 27 May 2016 (UTC)

  • Currently, I'm using Google Analytics for one of my tools, but this change will no longer allow me to use it. Hence, is there any service from WMF that can be used to collect some analytics on tools (e.g. number of visitors to pages over a time period)? Kenrick95 (talk) 09:02, 29 May 2016 (UTC)
    Hits can probably be measured manually, can't they? Not sure if it's possible to count hosts. But I'm just putting in a placeholder answer in the expectation that smarter guys will elaborate :) --Base (talk) 08:31, 31 May 2016 (UTC)
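As a hedged sketch of the "measure hits manually" idea above, a tool could scan its own webserver access log instead of embedding a third-party tracker. This assumes the common Apache/nginx combined-format timestamps; real Tool Labs log layouts may differ:

```python
# Hedged sketch: count daily hits by scanning a webserver access log,
# as a locally hosted stand-in for third-party analytics. Assumes the
# common combined log format with "[dd/Mon/yyyy:..." timestamps.
import re
from collections import Counter

DATE_RE = re.compile(r'\[(\d{2}/[A-Za-z]{3}/\d{4})')

def daily_hits(log_lines):
    """Return a Counter mapping 'dd/Mon/yyyy' strings to request counts."""
    counts = Counter()
    for line in log_lines:
        match = DATE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Counting distinct hosts is harder on Tool Labs, since the proxy strips client IP addresses before requests reach the tool.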

Discuss Privacy Policy of Tool Labs


Can you summarize what requiring the WMF privacy policy would actually mean for Labs tool owners? Not using third-party resources + keeping private data secret? --Tgr (talk) 09:53, 22 May 2016 (UTC)

  • The WMF Privacy Policy is so vague that most Tool Labs applications already adhere to it, except for "Identifying user-agent information of site visitors" in the Data retention guidelines (and it is unclear to me, as a non-lawyer, whether those guidelines are binding or not). Who will counsel application developers on the details? Who will review applications for actual compliance? I think it is better not to raise the expectation that Labs applications by random strangers will be "good"; instead, (as said above) applications that are useful to a wider audience should be moved to extensions, gadgets, or applications in a separate Labs with tighter oversight (no code not reviewed for privacy, etc.). --Tim Landscheidt (talk) 23:53, 23 May 2016 (UTC)
@Tgr: Some requirements from the Privacy Policy that might be applicable to a Tool Labs project include the following:
1) Private information must be secured.
2) Private information can only be retained for a short period of time.
3) Private information can generally not be shared with third parties except with user consent.
Note that some of the syntax and language in the current WMF Privacy Policy, where it references the main WMF sites, might need to be adjusted in any version we provide for Tool Labs developers and end-users.
@Tim.landscheidt: Our understanding is that Tool Labs projects currently already adhere to the Privacy Policy, as enforced by Labs administrators. So in that respect, this would just be documentation of existing policies. You are right, though, that by documenting this we might be making a stronger representation to our end-users, so we will have to figure out a better way to make this work for developers if we decide on this route. As for the data retention guidelines, we are open to hearing whether or not the terms of the guidelines would be workable for developers. --ZZhou (WMF) (talk) 07:26, 24 May 2016 (UTC)

Discuss Privacy Disclaimers

  • I think that making users write their own disclaimers will be a disaster, as they're not lawyers and many of them aren't even native English speakers. At most, we could recommend a link to Labs' overall privacy policy. Max Semenik (talk) 21:20, 20 May 2016 (UTC)
    Must the disclaimers be in English? Where is this limitation coming from? --Base (talk) 07:21, 31 May 2016 (UTC)
    @Base: The Labs Terms of Use appears to indicate the specific disclaimer language (which is in English) must be published. If WMF were to provide the disclaimer language for developers, how do you propose we address the issue of other languages? --ZZhou (WMF) (talk) 20:56, 1 June 2016 (UTC)
    I don't see anything there about language. I think people can write the disclaimer in any language, though multilingualism is of course preferable. --Base (talk) 22:01, 1 June 2016 (UTC)
    • @Base: You are right - it does not specify the disclaimer's language (although the disclaimer as written on that page is itself in English). We do not want to force developers to write disclaimers in languages they do not understand. At the same time, the purpose of disclaimers is to inform end-users, and that purpose is defeated if a disclaimer is published in a language they do not understand either. Thus, the disclaimer provided in the Terms gives developers who do not speak English a way to still publish a disclaimer to their end-users. Are you aware of many translated disclaimers (based on the one in the Terms) being published in other languages on Labs today? If so, is there still a way we can better ensure end-users will understand the language of any disclaimers they come across on Labs? --ZZhou (WMF) (talk) 17:23, 7 June 2016 (UTC)
  • This is useful for users, but may also 'scare' people off, because the disclaimers may make users think they are giving away loads of private info by using a tool. Tom29739 (talk) 18:26, 21 May 2016 (UTC)
  • Instead of lots of boilerplate text no one is ever going to read, maybe the WMF could create some sort of easily recognisable visual identity (like privacy icons). --Tgr (talk) 10:14, 22 May 2016 (UTC)

Discuss Privacy Statements

  • Technically speaking, every tool or instance that's web-accessible collects user information, at least in the form of webserver logs. I would advise creating a "standard" set of what's typically collected, and recommending/requiring a separate disclosure if information beyond that is collected. Max Semenik (talk) 21:16, 20 May 2016 (UTC)
Applications hosted on Tool Labs do not have access to the end user's IP address. This information is stripped from the request by the proxy server that also terminates the HTTPS connection and routes to the appropriate backend web server. The proxy server for other Labs projects relays the original IP address in an X-Forwarded-For header. User-Agent is available to both, as is some level of information about the on-wiki user if OAuth is used by the application.
The XFF header is truly only needed by a very small number of applications in Labs, and it would be nice to change the proxy so that this is an explicit grant that must be asked for rather than a default privilege. For Labs hosts which have a public IP address, there is no intervening proxy that can anonymize access. Projects requiring this direct access to their clients should, in my opinion, also be required both to justify their need and to be subject to some disclosure and retention policy for the data they do collect and retain. BDavis (WMF) (talk) 21:55, 20 May 2016 (UTC)
Pretty much anything wanting to expose a network service to the world that isn't HTTP/HTTPS needs its own public IP. What sort of disclosure and retention policy do you have in mind? --Krenair (talkcontribs) 22:10, 20 May 2016 (UTC)
A listing of the data collected that could be considered sensitive (IP addresses, usernames, etc.) and the duration for which that collected data is archived by the service. There may be some common classes of services that could be covered by a shared policy to make things easier for the people running the service. One example of a common class is IRC bots which log content for the channels they join. BDavis (WMF) (talk) 22:18, 20 May 2016 (UTC)
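To illustrate the XFF mechanics discussed above, here is a minimal hedged sketch of how an app behind a relaying proxy might read the forwarded client address; the header semantics follow common convention, and as noted, Tool Labs strips the client IP entirely, so this applies only to Labs projects whose proxy forwards the header:

```python
# Hedged sketch of reading the client address an intervening proxy relays
# in the X-Forwarded-For header. Tool Labs strips the end user's IP
# entirely; only some Labs project proxies relay it.

def client_ip(headers):
    """Return the left-most X-Forwarded-For address, or None if absent.

    By convention the left-most entry is the original client; any later
    entries are proxies the request passed through.
    """
    forwarded = headers.get("X-Forwarded-For")
    if not forwarded:
        return None
    return forwarded.split(",")[0].strip()
```

Making the header an explicit, per-project grant (as suggested above) would mean `client_ip` simply returns None for projects that have not justified the need.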

Discuss Source Code Publication

  • Strongly endorse in principle, and believe it should be required for any tool accessible/usable by non-Labs users outside a Labs terminal, such as web-accessible services. John Vandenberg (talk) 20:57, 20 May 2016 (UTC)
  • Endorse in principle; however, it will take a looong migration period and might piss off a few users. Tact is required here. Max Semenik (talk) 21:09, 20 May 2016 (UTC)
    • People tend to be reluctant when asked to publish half-done stuff or work-in-progress. Sometimes, debugging code or similar should not, or must not, be made public. So, we might need a two-layer approach. Doable but complicated. Needs broad acceptance. — The preceding unsigned comment was added by Purodha (talk) 2016-05-21T07:56:35 (UTC)
  • Endorse. Any code exempted as 'security sensitive code' should be extremely well justified and overseen by project or Labs administrators. In fact I'm not actually sure if there is a good reason to allow anything to be exempted like this. --Krenair (talkcontribs) 22:08, 20 May 2016 (UTC)Reply
I read 'security sensitive code' as a reference to passwords and tokens needed to access services. Like Krenair, I also can't think of any other code product that should be on Tool Labs or Labs generally that would be sensitive. BDavis (WMF) (talk) 22:21, 20 May 2016 (UTC)Reply
If it's like we do in prod where the private config containing passwords is kept separately and not available outside of the servers, I think that's fine. --Krenair (talkcontribs)
Cough cough countervandalism bots cough. Max Semenik (talk) 23:18, 20 May 2016 (UTC)Reply
I'm a bit concerned about the hurdles it imposes for what would otherwise be quick one-off tools, if they need to have a repository approved and created for them (which might be overkill anyway). Indeed, just the fact that "it will be published" may discourage quickies, no matter that probably no one will be interested in them and that there would have been no problem providing them on request. Platonides (talk) 20:45, 22 May 2016 (UTC)
Well, I hope that we're not talking about requiring the code to be in a Wikimedia-hosted source code repository, which is not user/hack friendly and would exclude a lot of current users and usages. Even hacks can easily be thrown into a repo hosted on gitlab/github/bitbucket/etc. (either personal dumping repos or forks of maintained libraries/toolkits), but hopefully people would collaborate in shared dumping repos, and even migrate their mature hacks into maintained libraries and toolkits. But I do agree we can and should have some sensible limits on this 'source code publication' rule so it doesn't apply to 'one-liners', e.g. criteria like: web accessible, performs API writes, is on a bot-flagged account, etc. John Vandenberg (talk) 21:26, 22 May 2016 (UTC)
I'm sure using github or bitbucket would be fine. I agree it would be good to have some sort of threshold, but how would the threshold be defined? Scripts for one time use? Scripts below a certain number of lines? Kaldari (talk) 18:11, 23 May 2016 (UTC)Reply
  • Endorse in principle. There should probably be some kind of exception for small one-off scripts, though. Also, we may want to have a grandfathering provision for old tools that aren't being actively maintained (but are still being used). Kaldari (talk) 18:11, 23 May 2016 (UTC)
  • The problem I see with this is the exemptions others request above. What is a one-off script? We had examples in the past where users (at least in my perception) wanted to assert that they have a legal right to access a tool's source code. I wouldn't want to give those users the power to harass developers who might not be good or consistent at publishing their source code. The problem of abandoned tools is different (and covers the question of publication): if a developer abandons a tool and there were a process for other developers to take over (T87730), those new developers could take over the tool (and publish the source code). On the other hand, if there were only a requirement to publish the code, new developers would have to fork it and figure out how to set it up on their own. So IMHO T87730 should be fixed, with no requirement to publish code. --Tim Landscheidt (talk) 00:12, 24 May 2016 (UTC)
  • I agree with Tim that there are essentially multiple questions here: 1) should all code on tool labs be open source, 2) does that mean others have the right to demand the source code, 3) can a tool be taken over once the owner disappears? I think the answer to 3) should be yes, and the TOU should be adapted to make that possible legally. I think it is in line with our mission to require 1) and possibly to require 2). Valhallasw (talk) 10:25, 27 May 2016 (UTC)Reply
  • Endorse. I'm pro source code publication with the obviously necessary exclusion of configuration files which contain credentials (database passwords, OAuth secrets, etc). My reasoning for this is that the computing resources provided by Labs and Tool Labs are funded by Wikimedia donations. Developers who choose to make use of those resources in my mind are obligated to contribute to the goals of the Wikimedia movement and to respect its values to justify use of the resources. Publication and libre licensing of source code is necessary for the value of freedom and the right to fork. BDavis (WMF) (talk) 16:06, 27 May 2016 (UTC)Reply
  • This makes me a bit uneasy. Of course project/tool creators should be encouraged to publish their source code, but making it a requirement has a number of concerns, as outlined by others. How about a "Freedom of Information" style of system, where users may request that the administrator(s) of a particular tool publish the source code of that tool, and the administrator(s) are required to publish all code to a public repository, except code that is subject to specific exemptions (such as private passwords, secret counter-vandalism algorithms, CAPTCHA logic, and tools with significant security/privacy implications like UTRS). In the case of abandoned projects, Labs or Tool Labs admins could fulfil the role of the project/tool administrator. This, that and the other (talk) 04:31, 5 June 2016 (UTC)Reply
    A problem with the FOIA style plan is that without a mandatory source code escrow system to back it up there may be no way to recover source for abandoned/neglected projects. This is not a theoretical situation. We have tools today on Tool Labs that have no source code on the Tool Labs server. This is possible for any compiled language including Java, C, and C++. It is also possible that the source is present, but unlicensed which is functionally the same as having no source code due to the default copyright status of software.
    I find the "software secrets" argument to be less than compelling due to the shared hosting nature of the Tool Labs project. Many, many users have shell access to the servers that power Tool Labs, and there are innumerable known and unknown ways to gain a local privilege escalation on a Linux host that would allow you to read files owned by another user. The expectation of file content privacy on such a shared host can be no more strict than the expectation that the window of your car will not be smashed in and the contents of your locked vehicle revealed. That is to say, the only barrier to such a loss is societal convention. If there are truly secrets worth having on a Tool Labs server, it should be assumed that they are already compromised. --BDavis (WMF) (talk) 05:43, 5 June 2016 (UTC)
  • I'm late to the party (too late?), but to add my two cents, echoing others: I feel closed-source software that directly affects a Wikimedia project is contrary to our mission. If you write bad code and don't want anyone to see it, requiring it to be open source just means others can help improve it. If you are unwilling to work with others and want all the credit for your work, you probably shouldn't be participating in what is supposed to be a collaborative project. As for spammers/vandals, those folks are always going to find a way, but I suppose there could be some extreme exceptions where closed-source code might be permitted. In that case we should require there to be multiple maintainers in case the service goes down and the sole maintainer is unreachable. If there's no legitimate reason for the source to be closed, I don't think a "freedom of information" system is going to work. Case in point: what we are seeing with Merlbot on dewiki, which is just appalling. Meanwhile phab:T87730 has been open for over a year. In short, our on-wiki dedication to openness and transparency should be mirrored on off-wiki projects that are clearly and directly related — MusikAnimal talk 17:59, 10 June 2016 (UTC)
I don't think those two spheres can be compared, due to the technical differences between them. If I edit a wiki, I don't have to spend a single thought on openness and transparency; MediaWiki makes all the magic happen, and even better, instantaneously. The only (!) case in which this happens in Labs is if a project's software is in operations/puppet, and the project's administrators never test patches locally first. Otherwise, there will always be significant effort needed to document and update the code actually running. Looking at how often (paid) WMF developers have set up "temporary" software without immediately documenting and publishing its code, I don't think it is reasonable to hold volunteer developers to a higher standard. (NB: IMHO all code should be published; I only want to avoid an atmosphere of fear.) --Tim Landscheidt (talk) 07:04, 11 June 2016 (UTC)

Open Discussion

  • A core concept that should be defined is external usage. Tools that can only be accessed within the Labs environment do not have the same problems regarding end-user privacy. Some tools only pull and push data to the wikis, which means they don't have external usage, but they may still need additional TOU, especially if they perform critical functions. John Vandenberg (talk) 21:09, 20 May 2016 (UTC)
John Vandenberg, can you give some concrete examples of internal services or bots that may need extra Terms of Use documentation? I do not understand exactly what you are concerned about, but I'm interested to learn more. The reference to "critical functions" puts me more in mind of Terms of Service and support guarantees than TOU. BDavis (WMF) (talk) 22:01, 20 May 2016 (UTC)
@BDavis (WMF): I consider w:Wikipedia:Bots/Requests_for_approval/FacebookBot to be a critical service, as its disappearance would be very bad (removing all WP pages from Facebook wouldn't be bad, IMO, but leaving old/bad/etc. WP pages on FB would be). If it were running on Labs, even though it doesn't interface directly with users, we should still require that its code be open, so it can be maintained properly and its algorithms re-used for similar purposes (which avoids partnership lock-in). And IMO, even if it isn't running on Labs, it still should be open source for the same reasons, and the TOU could enforce that by way of additional terms that kick in for "high volume API usage". John Vandenberg (talk) 01:56, 21 May 2016 (UTC)
Thanks for the clarification John Vandenberg. I personally tend to agree that all projects hosted in Labs and Tool Labs should be published under an OSI approved license whether they are end-user facing or not. We should probably take that aspect of the discussion to the Discuss Source Code Publication section of this consultation.
The idea of the enwiki or the general Wikimedia community requiring licensing and source code publication for bots that are granted certain on-wiki rights regardless of where the bot operates from is interesting. I think that however is a different discussion than the Labs TOU clarifications. BDavis (WMF) (talk) 18:29, 21 May 2016 (UTC)Reply
Sure; my intention here was not to discuss specific requirements, but to look at definitions that can be used to group tools into cohorts which have specific requirements. Specifically, 'external usage' / 'end-user facing' would trigger a bunch of additional requirements. John Vandenberg (talk) 18:40, 21 May 2016 (UTC)
  • One major issue that has come up repeatedly is the use of Labs to aggregate, analyze, and display public information about users' interactions with the site. Some specific examples:
    • Showing which hours during a day a user typically edits the site.
    • Showing the top number of edits that a user has made to pages within a MediaWiki namespace.
  • Does aggregating, analyzing, and displaying this type of public information require user consent? To be clear, this is not working with any private data. Yet users have sometimes expressed dismay at having their public contribution activities ingested and redisplayed in certain ways.
  • It would probably be helpful to dig up some of the past discussions related to this. --MZMcBride (talk) 22:23, 20 May 2016 (UTC)Reply
@MZMcBride: Good point, I will look for these. Do you have any idea where they might be (on our mailing lists (wikitech, labs-l) or somewhere else)? --ZZhou (WMF) (talk) 22:45, 23 May 2016 (UTC)Reply
Hi ZZhou (WMF). Requests for comment/X!'s Edit Counter and Wikimedia Blog/Drafts/Handling our user data - an appeal are probably decent starting points for the types of discussions I'm talking about. There have been many other discussions, but finding them is annoying. Maybe someone else will help out.
In short, before Wikimedia Labs, there was the Wikimedia Toolserver, hosted by Wikimedia Deutschland. Some of the restrictions on aggregating data came from stricter German privacy laws and practices. When tools were moved from the Toolserver to Wikimedia Labs, and consequently were hosted more directly by Wikimedia Foundation Inc., the issues surrounding "profiling" editors re-arose. --MZMcBride (talk) 01:20, 24 May 2016 (UTC)Reply
Emufarmers reminded me of Talk:Privacy policy/Call for input (2013)#Generation of editor profiles. --MZMcBride (talk) 01:56, 24 May 2016 (UTC)Reply
@MZMcBride: Thanks! --ZZhou (WMF) (talk) 23:16, 24 May 2016 (UTC)Reply

Hi ZZhou. When I go to <https://tools.wmflabs.org/dispenser/view/Checklinks>, I get a warning about being "Leaving Wikimedia" and I'm required to press a "Proceed" button. Do you know anything about this? --MZMcBride (talk) 15:40, 29 May 2016 (UTC)Reply