Research talk:Geo-aggregation of Wikipedia pageviews

Discussion of the proposal is welcome on this talk page.

The geographic tree needs to be language neutral and thus best coded with ISO 3166

May I suggest that the geographic tree be described and stored using ISO 3166 codes (or an equivalent standard)? This would save storage space (codes rather than location names) and keep the data language neutral, making it easier for others to translate it into different languages.--Hanteng (talk) 07:40, 13 January 2015 (UTC)
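To illustrate the suggestion above, here is a minimal sketch of a code-keyed geographic tree. It assumes ISO 3166-1 alpha-2 country codes and ISO 3166-2 subdivision codes as node identifiers; the counts, the `LABELS` table, and the function names are hypothetical and not part of the proposal.

```python
from collections import defaultdict

# Hypothetical pageview counts keyed by ISO 3166-2 subdivision code.
LEAF_COUNTS = {"US-NM": 120, "US-CA": 300, "DE-BY": 80}

# Display names live in a separate per-language table (illustrative entries only),
# so the stored tree itself never contains a localized place name.
LABELS = {
    "en": {"US": "United States", "DE": "Germany", "DE-BY": "Bavaria"},
    "de": {"US": "Vereinigte Staaten", "DE": "Deutschland", "DE-BY": "Bayern"},
}


def parent(code):
    """Return the parent code ('US-NM' -> 'US'), or None for a country code."""
    head, sep, _ = code.rpartition("-")
    return head if sep else None


def roll_up(leaf_counts):
    """Sum leaf counts into every ancestor node of the tree."""
    totals = defaultdict(int)
    for code, n in leaf_counts.items():
        node = code
        while node is not None:
            totals[node] += n
            node = parent(node)
    return dict(totals)


def label(code, lang):
    """Look up a display name, falling back to the raw code if none is known."""
    return LABELS.get(lang, {}).get(code, code)
```

Because the hierarchy is implied by the code itself, translating the dataset only requires a new entry in the label table, not a change to the stored data.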

Is it possible to add Accept-Language locale setting?

Each log entry should include the HTTP "Accept-Language" header, which tells us something about viewers' preferred language-territory choice (e.g. en-GB, zh-CN). It would be great if an additional aggregation by "Accept-Language" were added to every leaf node of the tree. --Hanteng (talk) 07:45, 13 January 2015 (UTC)
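A rough sketch of what such an aggregation could look like, assuming each log record is available as a `(geo_code, accept_language_header)` pair. That record shape and the function names are assumptions for illustration, and only the highest-weighted language tag of each request is counted.

```python
from collections import Counter


def primary_language(header):
    """Return the highest-weighted tag from an Accept-Language header value."""
    best_tag, best_q = None, -1.0
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0]
        if not tag:
            continue
        q = 1.0  # a tag without an explicit q-value has weight 1
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat a malformed weight as lowest priority
        if q > best_q:
            best_tag, best_q = tag.lower(), q
    return best_tag or "unknown"


def aggregate(records):
    """Count primary languages per leaf of the geographic tree."""
    by_leaf = {}
    for geo_code, header in records:
        by_leaf.setdefault(geo_code, Counter())[primary_language(header)] += 1
    return by_leaf
```

Splitting each leaf by language produces smaller bins, so any minimum-bin-size rule applied to the dataset would presumably need to cover these split counts as well.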

General idea

I've not read the proposal yet, but thanks for working on it: such data, once completely liberated from privacy concerns, will be extremely valuable for research. And it's nice to get external resources for this work.

As for privacy, this is probably dealt with in the request, but of course it's better if no data transfer happens at all; failing that, any transfer should comply with EU-standard privacy regulations, e.g. by reusing the definitions of the Safe Harbor framework (even though those are not legally binding in this case). This is especially important as Los Alamos is a branch of the US federal government. --Nemo 09:30, 13 January 2015 (UTC)

Nemo, no data will be transferred to LANL. We're exploring the possibility of an NDA for Reid to work with our Analytics engineers on the request geocoding and publication pipeline (e.g. phab:T77683) on Wikimedia's servers. The privacy implications of the proposal are being assessed by the Legal teams of the two organizations. --Dario (WMF) (talk) 23:00, 15 January 2015 (UTC)

Reading activity of logged-out contributors

I'm wondering what this means for logged-out users who both edit and read. I agree that, because their IP addresses and edits are already published, it would be fine location-wise. However, I'm not so sure once that is combined with their reading activity. As a simplistic example, suppose an IP user repeatedly visited article A and edited article B, producing k pageviews to each from different IPs that are visible in the edit history of B and map to the same geolocation, and suppose there were no other pageviews to A or B. Could that user's reading activity then be inferred from the published dataset with high probability? whym (talk) 12:18, 15 January 2015 (UTC) / Sorry, I have realized that I mixed up k users and k pageviews. The above comment is now modified accordingly. I think my (underlying) point still stands. whym (talk) 09:56, 10 February 2015 (UTC)

You are absolutely right. Read this post by John Vandenberg, who made similar points. This WMF shift to data hoarding about users is really terrifying. Read also Research:Improving link coverage#Data Collection, or "What are readers looking for? Wikipedia search data now available" (Research:User queries). --Atlasowa (talk) 16:01, 10 February 2015 (UTC)

FWIW, personally I'm rather happy to see the overall trend toward letting researchers access more data, and I appreciate this initiative on geo-aggregation. As the section header implies, my concern here is very specific to the potential exposure of the reading activity of logged-out contributors. If it is solved (or shown to be my misunderstanding), I'd support the creation and publication of the dataset. One way to mitigate the risk might be to treat all IP users as one user and ensure that every bin has k or more users. Another might be limited publication under an agreement whose terms disallow attempts at re-identification. whym (talk) 10:13, 11 February 2015 (UTC)
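The first mitigation could be sketched roughly as follows. This assumes each pageview can be represented as a `(bin_key, user_id, logged_in)` tuple, which the discussion does not specify; the tuple shape, the `ANON` placeholder, and the function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical placeholder that stands in for every logged-out (IP) user.
ANON = "__all_ip_users__"


def publishable_bins(views, k):
    """Return pageview counts only for bins seen by at least k distinct users.

    All logged-out users are collapsed into one pseudo-user, so a bin visited
    only from IP addresses counts as having a single user, however many
    addresses were involved.
    """
    users = defaultdict(set)
    counts = defaultdict(int)
    for bin_key, user_id, logged_in in views:
        users[bin_key].add(user_id if logged_in else ANON)
        counts[bin_key] += 1
    return {b: counts[b] for b in counts if len(users[b]) >= k}
```

For example, with k = 3 a bin viewed by two logged-in users and any number of IP addresses would be published, while a bin viewed from three IP addresses alone would be suppressed.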