Research talk:Usage of correct diacritics in readers of Romanian Wikipedia

From Meta, a Wikimedia project coordination wiki

Legal clearance?

Hey there. This looks like a great project to do.

One question – on the "Policy, Ethics and Human Subjects Research" section, it says [my emphasis]:

The study involves running a JavaScript code on all ro.wp users' machines without their explicit consent. However, the personal data gathered is the same as the data gathered by the WMF in the normal operation of the site. Also, we will not register the IP addresses, usernames or any other means to associate wiki contributions with the result.

Is this true? Though I know that Reading records general (I think country-level) geo-location and brief extracted browser information, I believe it isn't the full data (city-level and UA string). Also, how does the script by Cristian Adam (which presumably is freely-licensed for use like this?) work? From what I can see, it appears to work out automatically (without user interaction) whether the glyphs are correctly displayed, and we'd then be recording this, which is definitely not something we're regularly (or indeed ever) recording. Do you think this means we'll need to get Legal and Security review of this code before going ahead?

How can I help with this?

Jdforrester (WMF) (talk) 21:14, 10 March 2016 (UTC)

Hi James, thanks for the feedback. geoiplookup.wikimedia.org returns the city as well as the country. Since MediaWiki definitely records IPs, which are the basis for this information, I don't consider it additional information. It's just a way to avoid recording the IPs while keeping sub-country granularity. Perhaps we can publish the results on a less granular basis (such as county), if this is really an issue. With regard to the UA string, this is also something that WMF records, at least in the cu_changes table. However, this is not strictly needed as long as I have a reliable way to extract the browser and OS versions from it (I haven't studied the problem, but I suspect the Analytics team could help here).
The external script is under the MIT license and the part I want to use only displays 3 divs and measures the difference in width between them. I don't see how this could be considered personally identifiable information. In conclusion, I don't believe a legal review is needed. A security review is never a bad idea; if you can help secure the resources for that, I'd appreciate it. Perhaps we can discuss at the Hackathon a bit more about what that means exactly? This page contains some pointers, but is concentrated on MW code.--Strainu (talk) 22:17, 10 March 2016 (UTC)
@Strainu: Yes, the (not necessarily great) data is available to client-side scripts but I don't believe it's generally used. Also, "MediaWiki definitely records IPs" is not true for readers, which is what I was talking about. :-) The cu_changes table does indeed record the entire UA string right now, but it is very special, heavily monitored, routinely scrubbed, and much more locked down than pretty much anything else we have, and even then we don't record readers' data there.
Presumably, recording something like { country: Romania, browser_family: Chrome, browser_version: 43, platform: Windows, platform_version: 10, pass_status: true } (and then, after aggregation, { country: Romania, browser_family: Chrome, browser_version: 43, platform: Windows, platform_version: 10, pass_count: 913, fail_count: 23 }) would serve your needs without holding more than the bare minimum of data, right? It'd be great to find a way to be minimally-invasive of users' privacy whilst still getting the data you need to improve the wiki.
I've created T129584 for the security review. Definitely happy to talk more at the Hackathon. See you there.
Jdforrester (WMF) (talk) 23:17, 10 March 2016 (UTC)
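For illustration, the minimal record and its aggregated form suggested in the comment above could be sketched roughly as follows. This is only a sketch under assumptions: the field names are copied from the example record, while the function name `aggregate` and the in-memory list of dicts are hypothetical and are not part of the proposed study or of any WMF logging pipeline.

```python
# Field names taken from the example record in the discussion; everything
# else here (function name, data layout) is a hypothetical illustration.
KEY_FIELDS = (
    "country",
    "browser_family",
    "browser_version",
    "platform",
    "platform_version",
)


def aggregate(records):
    """Collapse per-reader records into pass/fail counts per configuration.

    Each input record is a dict holding the KEY_FIELDS plus a boolean
    "pass_status". The output keeps only the grouping fields and two
    counters, so no individual measurement survives aggregation.
    """
    counts = {}
    for record in records:
        key = tuple(record[field] for field in KEY_FIELDS)
        bucket = counts.setdefault(key, {"pass_count": 0, "fail_count": 0})
        if record["pass_status"]:
            bucket["pass_count"] += 1
        else:
            bucket["fail_count"] += 1
    # Rebuild one dict per configuration, in first-seen order.
    return [dict(zip(KEY_FIELDS, key), **bucket) for key, bucket in counts.items()]
```

Grouping on the five coarse fields means that only the counters need to be kept once aggregation is done; the individual per-reader rows could then be discarded.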

Post to mobile-l

After asking around on IRC, ABaso suggested that you post about this project on the mobile-l mailing list since it aligns well with the work they are doing around emerging communities. I thought I'd relay the suggestion. :) --EpochFail (talk) 22:33, 10 March 2016 (UTC)