Research talk:Characterizing Wikipedia Reader Behaviour/Demographics and Wikipedia use cases/Work log/2019-06-25

Tuesday, June 25, 2019

This work log is intended to document how we are approaching the balance between a) having a large enough reader population in a Wikipedia language community to get enough survey responses, and, b) not completely ignoring smaller language communities and regions of the world that have not been represented well in past reader research.

How many page views are enough?

For a baseline, in the 2017 surveys, we included Bengali Wikipedia and had to sample all readers for a week to get barely enough data. To compare page view volumes against that baseline, see the siteviews tool: https://tools.wmflabs.org/siteviews/?platform=all-access&source=pageviews&agent=user&start=2018-06&end=2019-05&sites=bn.wikipedia.org. While that amount of data was sufficient for reporting top-level proportions with confidence -- e.g., what proportion of readers are familiar with the article they are reading? -- there are several drawbacks:

Many of the more interesting analyses that we can do with the data involve stratified statistics -- e.g., motivation of readers divided up between age groups or country -- which require even more data to have certainty in the results.
To minimize potential disruption to a community, we strongly prefer not to sample every reader and instead to sample approximately half of the community or less.
While it has not been necessary thus far, there are certain research questions where it is useful to have a control group of readers who did not see a survey.

For these reasons, language communities such as Basque or Swahili likely do not have sufficient page views to reach enough survey responses to report confident results.
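
To make the trade-off above more concrete, here is a rough back-of-the-envelope sketch in Python. It is not the calculation we actually used; it relies on the textbook normal-approximation formula for estimating a single proportion, and it assumes (optimistically) that strata are equal in size and that each sampled page view is an independent chance at a response. The function names, the 5% margin of error, and the traffic and response-rate numbers in the usage lines are all hypothetical placeholders, not measured values.

```python
import math


def required_responses(margin_of_error, z=1.96, p=0.5):
    """Responses needed to estimate one proportion within +/- margin_of_error.

    Uses n = z^2 * p * (1 - p) / e^2 with the worst case p = 0.5 by default;
    z = 1.96 corresponds to a 95% confidence level.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


def required_stratified_responses(margin_of_error, n_strata):
    """Total responses if every stratum needs the same margin of error.

    Assumption: strata are equally sized, so the total is simply the
    per-stratum requirement multiplied by the number of strata. Uneven
    strata (the usual case) need more responses than this.
    """
    return required_responses(margin_of_error) * n_strata


def required_sampling_rate(responses_needed, weekly_page_views,
                           response_rate, weeks=1):
    """Fraction of page views that would need to be shown the survey.

    A result above 1.0 means the target cannot be reached in the given
    number of weeks even when sampling every page view.
    """
    return responses_needed / (weekly_page_views * weeks * response_rate)


# Hypothetical placeholder inputs, for illustration only.
top_level = required_responses(0.05)
by_age_group = required_stratified_responses(0.05, n_strata=5)
rate = required_sampling_rate(by_age_group, weekly_page_views=100_000,
                              response_rate=0.01)
print(top_level, by_age_group, round(rate, 3))
```

With these placeholder inputs the stratified target would require a sampling rate above 1.0, which is the situation described above: even sampling every reader for a week would not yield enough responses, so the survey would have to run longer or in a higher-traffic edition.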

Approaches to reaching low-traffic communities

To get around this challenge of surveying smaller reader communities, there are a few options:

Just do it and see what we get: translating and launching the surveys requires a significant amount of work and so we do not add languages lightly at this point, but if there is enough motivation/reason, we can launch a survey and see if we get enough responses.
Run surveys in languages with substantial overlap: this is the approach that we are taking in this round of surveys. We are doing this in two ways:
- Country-level targeting: the QuickSurveys tool now allows us to not just randomly sample a language community but only include readers from certain countries. For instance, if we are interested in surveying readers in Nigeria, Igbo Wikipedia or other local language editions do not have enough page views to provide the necessary responses. Most page views from Nigeria go to English Wikipedia, so we can instead run the survey in English but only for readers in Nigeria and in this way collect enough responses while not collecting tens of thousands of responses from countries like the United States that we do not need. This is not ideal for a variety of reasons but it does offer a compromise.
- Related languages: we have begun to analyze language-switching behavior on Wikipedia to see what high-level patterns are associated with people switching languages while reading Wikipedia articles. This allows us to see which larger language editions we should sample if we want to reach readers from a smaller language community -- e.g., Spanish for Catalan / Galician / Basque / Asturian; Indonesian for Javanese / Sundanese; Russian for Ukrainian / Armenian / Kazakh / Kyrgyz; Hindi for Bihari / Marathi / Gujarati.

2019 Reader Surveys

In these 2019 reader surveys, we are doing country-level sampling for two languages / regions:

English Wikipedia and all countries in Africa (phab:T226273#5279261)
French Wikipedia and all countries in Africa (phab:T226273#5279647)