Research talk:Characterizing Wikipedia Reader Behaviour/Code

@Isaac (WMF): Thanks for starting the page and the work on it so far. We will finally have end-to-end documentation by the end of this process, which is great for future research in this space. I will gradually add comments here, and they will evolve as we learn more.

Let's spell out the bullet points in more detail.

  • Decide the start and end time of the survey. Specify those in UTC.
  • Specify the Wikimedia projects involved.
  • Specify the platform: mobile, desktop, or app.
  • Specify the sampling rate.
  • Finalize the survey questions and answers.
  • Decide on the survey service provider. For this research, we used Google Forms. One requirement for the choice of provider is that their service must be able to talk to EventLogging.
  • Create unique IDs that will be shared between EventLogging and the Google spreadsheet that captures the responses. (Baha can link to code and steps; see the sketch after this list.)
  • Review the schema to make sure it captures everything you need.
  • If you will be working with different language communities, make sure you have a Point of Contact in each language who can work with you throughout the experiment and afterwards.
  • Reach out to Legal to request a privacy statement (link to process page). You will need to display it as part of the QuickSurveys window shown to the user.
  • Add the survey information to: https://meta.wikimedia.org/wiki/Community_Engagement/Calendar
  • Once everything is ready, give a 72-hour heads-up on the corresponding village pumps. Monitor the conversations.
  • Make sure the QuickSurveys widget is translated into the languages you will run the survey in. Check https://translatewiki.net/wiki/Special:Translate/ext-quicksurveys?group=ext-quicksurveys&language=en&filter=%21translated&action=translate
  • Prepare the messages that will be displayed in the survey on the corresponding Wikipedia projects. (Baha can give an example.)
  • Test the data collection. (Baha can help with expanding what needs to be done here on his part. Please also check with Fabian whether he did anything specific here; most likely not, as it's standard now, and as long as the survey form talks to EventLogging and the schema is working, we're good.)
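For the unique-ID step, here is a minimal sketch of one way to wire the two systems together; it is not necessarily what Baha implemented. It assumes the Google Form has a (hidden) question whose pre-filled-answer URL parameter is known (Google Forms generates these entry.<field-id> parameters for pre-filled links); the form URL and field ID below are placeholders:

```python
import secrets
from urllib.parse import urlencode

# Placeholders -- use the real form URL and the entry.* parameter that
# Google Forms assigns to the question meant to hold the token.
FORM_URL = "https://docs.google.com/forms/d/e/FORM_ID/viewform"
TOKEN_FIELD = "entry.123456789"

def make_survey_link():
    """Generate a random token and a form link that carries it."""
    token = secrets.token_hex(10)  # unique per survey impression
    link = FORM_URL + "?" + urlencode({TOKEN_FIELD: token})
    return token, link

token, link = make_survey_link()
# Log the same token in the EventLogging event for this impression
# (e.g. a token field in the QuickSurveys response schema). Responses in
# the Google spreadsheet can then be joined back to the EL records on it.
```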

I think I'm missing steps, and I acknowledge that some of these are not exactly code but rather how we should go about running the whole experiment (contacting the community, for example). You may want to group these items into topically relevant categories: Communications, Privacy, Questionnaire, Infrastructure, or some other labels. Let me know if this kind of bullet list is helpful for you, or let me know how else I can help. I'll wait to hear from you before going through the next ones. Thanks! --LZia (WMF) (talk) 22:01, 3 December 2018 (UTC)

I forgot to mention: at the step where the dates are specified, the deployment schedule should be consulted to make sure QuickSurveys can actually go live on those days/times: https://wikitech.wikimedia.org/wiki/Deployments#Near-term --LZia (WMF) (talk) 22:07, 3 December 2018 (UTC)
Thanks for all of this. I added in everything, with a few exceptions:
  • I'll need a link for the process page for reaching out to Legal. I couldn't find one.
  • I'll follow up with Baha regarding what you raised above: creating unique IDs to be shared between Google Forms + EventLogging, messages to be displayed on survey, and testing data collection.
--Isaac (WMF) (talk) 16:04, 6 December 2018 (UTC)

Florian's input is very much welcome here. In the meantime:

  • I know Florian runs some tests when he builds the unique device IDs to make sure he captures as many of the unique devices/sessions as possible. If that step is part of the code, it would be good to highlight it and add a couple of sentences on the factors, if any, that affect our ability to build unique device IDs reliably.
  • When we take a random sample of readers in general, after building unique device IDs and their sets of sessions, we should drop all sessions associated with a unique device whose uri_query matches '%action=edit%' or '%action=submit%'. The idea is to avoid, as much as possible, having editor data in our analysis. (This may not currently be part of the code, in which case we will add it for future iterations; see the sketch after this list.)
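A minimal pandas sketch of what the two points above could look like in code; this is not the current implementation, and the column names (client_ip, user_agent, uri_query) follow the wmf.webrequest table:

```python
import hashlib

import pandas as pd

def add_device_id(df: pd.DataFrame) -> pd.DataFrame:
    """Approximate a unique device as hash(client_ip + user_agent).

    This is the UA+IP heuristic: it merges devices behind shared IPs with
    identical browsers and splits devices whose IP rotates, which is why
    the reliability tests mentioned above matter.
    """
    df = df.copy()
    df["device_id"] = (df["client_ip"] + df["user_agent"]).map(
        lambda s: hashlib.sha256(s.encode("utf-8")).hexdigest()
    )
    return df

def drop_editor_devices(df: pd.DataFrame) -> pd.DataFrame:
    """Drop ALL sessions of any device that ever issued an edit/submit request."""
    is_edit = df["uri_query"].str.contains("action=edit|action=submit", na=False)
    editor_devices = set(df.loc[is_edit, "device_id"])
    return df[~df["device_id"].isin(editor_devices)]

# Example: requests = add_device_id(raw_webrequests)
#          reader_requests = drop_editor_devices(requests)
```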

--LZia (WMF) (talk) 22:23, 3 December 2018 (UTC)

Excellent points. I added a reference for background info on the limitations of the UA+IP model. I'm not sure how we would test this approach in our scenario, but I'll check with Florian. Regarding editors: based on uriQueryUnwantedActions (https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L112), the is_pageview definition in Hive (which we filter on) already excludes webrequests with uri_query matching %action=edit% or %action=submit%, so we get that for free. I'll try to clarify that (sketch below). --Isaac (WMF) (talk) 15:58, 6 December 2018 (UTC)
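A hedged example of what "for free" means in practice: any Hive query over wmf.webrequest that is restricted to is_pageview = TRUE already drops the edit/submit requests, because PageviewDefinition rejects those uri_query actions. The partition values below are placeholders:

```python
# Query sketch against the wmf.webrequest table on the analytics cluster.
PAGEVIEWS_QUERY = """
SELECT client_ip, user_agent, uri_host, uri_path, ts
FROM wmf.webrequest
WHERE is_pageview = TRUE          -- excludes action=edit / action=submit for free
  AND agent_type = 'user'         -- drop requests tagged as bots/spiders
  AND year = 2018 AND month = 12 AND day = 3   -- placeholder partition
"""
# Run via beeline/Hive or spark.sql(PAGEVIEWS_QUERY); filtering on the
# partition columns (year/month/day) avoids a full-table scan.
```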