Research:Characterizing Wikipedia Reader Behaviour/Demographics and Wikipedia use cases

This is an archived version of this page, as edited by Isaac (WMF) (talk | contribs) at 17:30, 16 December 2019 (added blurb about october surveys). It may differ significantly from the current version.

NOTE: Check back later for information on future research.

The overall goal of this iteration of research is to better understand motivation and behavior in terms of different subpopulations of readers. We are focusing on demographic groups that have been associated with differences in either behavior or awareness of Wikipedia. There will be two core components to this research: a survey on the demographics and motivations of Wikipedia readers on different projects, and an analysis of reader behavior. Combining these two approaches will allow us to better understand how the experiences of different subpopulations of readers overlap/diverge and where we can focus our efforts to improve this experience.

Reader Surveys

We are developing a survey to understand how reader motivation varies across different demographic groups. Many of the questions that we plan on asking are similar to those asked by the Global Reach team through phone surveys. These past surveys have been very informative regarding what populations are not reaching Wikipedia. Our reader surveys will complement this past work by helping us understand the needs of the readers who do reach Wikipedia. We are planning on asking about the following attributes:

English-language demographics survey questions

Are you at least 18 years of age?
¤ Yes
¤ No
<three motivation questions from previous surveys>
Tell us about yourself

What is your age?
¤ 18-24 years
¤ 25-29 years
¤ 30-39 years
¤ 40-49 years
¤ 50-59 years
¤ 60 years and older
¤ Prefer not to say

What is your gender?
¤ Woman
¤ Man
¤ Prefer not to say
¤ Other... <open-text>

How many years (full-time equivalent) have you been in formal education? Include all primary and secondary schooling, university and other post-secondary education, and full-time vocational training, but do not include repeated years. If you are currently in education, count the number of years you have completed so far.
¤ I have no formal schooling
¤ 1-6 years
¤ 7 years
¤ 8 years
¤ 9 years
¤ 10 years
¤ 11 years
¤ 12 years
¤ 13 years
¤ 14 years
¤ 15 years
¤ 16 years
¤ 17 years
¤ 18 years
¤ >18 years
¤ Prefer not to say

Would you describe the place where you live as....
¤ A farm or home in the country
¤ A country village
¤ A small city or town
¤ The suburbs or outskirts of a big city
¤ A big city
¤ Prefer not to say

What is your native language?
<list of Wikipedia languages in their native script>

What is your second native language?
¤ I do not have a second native language
¤ Other... <open-text>

Pilot Results

From March 4 - 5, 2019, a small-scale pilot of the survey was run on English Wikipedia. It resulted in 771 responses, of which 626 were complete and not under the age of 18. The pilot (and start of the survey translation process) identified a number of issues, described below, that were worked through before expanding the survey to more languages / respondents.

QuickSurveys Sampling

Sampling for inclusion in a given survey is done by browser. The first time a user navigates to a Wikipedia article with an active survey, a token is stored in their browser's local storage that is associated with that survey's name and indicates in a deterministic way whether the survey will be displayed on that browser. Given that a survey is active for at least several days, readers who at least occasionally visit Wikipedia are just as likely to be sampled as frequent readers. More frequent readers who are included in the survey are more likely to respond to the survey though. In the pilot, respondents viewed an average of 6.9 pages and 52% only viewed a single page while individuals who did not respond viewed an average of 4.7 pages and 61% only viewed a single page. Additionally, selection bias or issues with translations / text of the questions could differentially affect response rates.
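
The browser-level sampling described above can be sketched roughly as follows. This is a minimal illustration and not the QuickSurveys implementation: the function names `get_or_create_token` and `is_in_sample`, the dict standing in for local storage, the key format, the use of SHA-256, and the coverage value are all assumptions made for the example.

```python
import hashlib
import secrets


def get_or_create_token(storage: dict, survey_name: str) -> str:
    """Return the token stored for this survey, creating it on first visit.

    `storage` is a plain dict standing in for the browser's local storage
    (hypothetical; the real key format is not described in the text).
    """
    key = f"survey-token-{survey_name}"
    if key not in storage:
        storage[key] = secrets.token_hex(8)
    return storage[key]


def is_in_sample(token: str, survey_name: str, coverage: float) -> bool:
    """Deterministically map (token, survey name) to a bucket in [0, 1)."""
    digest = hashlib.sha256(f"{survey_name}:{token}".encode("utf-8")).digest()
    # Use the first 4 bytes so the bucket is exactly representable and < 1.
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < coverage


# Hypothetical survey name and coverage, for illustration only.
storage = {}
token = get_or_create_token(storage, "reader-demographics")
first = is_in_sample(token, "reader-demographics", coverage=0.1)
second = is_in_sample(
    get_or_create_token(storage, "reader-demographics"),
    "reader-demographics",
    coverage=0.1,
)
```

Because the decision depends only on the stored token and the survey name, a browser gets the same answer on every page view while the survey is active, which is why visit frequency does not change the chance of being sampled.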

A small minority of survey respondents did not have associated EventLogging data, which limits our ability to understand the relationship between reader demographics / motivations and the types of pages that they are reading. The different causes and respective magnitude are provided below:

  • People we completely miss (~3-5%): there are some platforms for which EventLogging and QuickSurveys do not work because these platforms do not support JavaScript. This mainly would be older IE platforms (any IE version before 11) but also would include "lite" browsers (e.g., Opera Mini) that are optimized for low data or privacy. We cannot do much about this. It is not a huge proportion of the internet-connected world, but it is more likely to exclude older users and people from regions with poor internet connectivity, so we should be aware of that. See this for more details.
  • People who can see QuickSurveys but don't have EventLogging (~10%): It is possible that slower browsers fail to load the EventLogging code and thus can see and respond to surveys but are not logged appropriately. See this phabricator task for more details. There is a chance that some of this is fixable (phab:T218243 and phab:T220627#5107667), but we cannot recover data in any real way for these respondents, so any analysis that relies on EventLogging data will miss them. There were no strong demographic patterns related to who was missing EventLogging data, though they tended to be below 40 and male.
  • People who right-click and open in a new tab to take external surveys (~5%): We get QuickSurveyInitiation EventLogging but not QuickSurveysResponses EventLogging for this group. This happens almost exclusively on desktop and should only be a problem for external surveys (no reason to right-click on internal surveys). For this group, it's harder to get the contextual information but not impossible based on approximate methods. The main feature we lose is the editCountBucket.

Age / Gender Skew

The survey respondents skewed heavily young and male. Including those who were under the age of 18, 70% of respondents were under the age of 30. Of those who completed the survey, 76% identified as men. There were no clear interactions with other variables -- that is, the gender balance was consistent across age groups. This held true for country as well with the exception that the United States was slightly more balanced gender-wise (only 67% men). The United Kingdom and India, the other two most well-represented countries, had a gender balance of 75% and 83% men respectively.

This was a surprising level of skew for the reader population, which led to the question: is the readership truly skewed that far to men or is the skew resulting from different rates at which individuals of different gender identities self-select into the survey? We looked at past surveys and found the following data points regarding gender and frequency of Wikipedia reading:

  • Based on a survey of 1000 AMT workers from US: "Second, men use Wikipedia more often — they are twice as likely than women to use Wikipedia daily"[2]
  • While younger respondents were consistently more likely to read Wikipedia frequently, the Global Insights phone surveys gave mixed evidence on gender:
    • India: women more likely to be frequent readers of Wikipedia
    • Mexico: men more likely to be frequent readers of Wikipedia
    • Nigeria: men slightly more likely to be frequent readers of Wikipedia
    • Iraq: ~equal likelihood by gender of being frequent readers of Wikipedia

Urban / Rural Question

See locale analysis.

June 2019 Results

The survey was run in 13 languages from 26 June 2019 - 1 July 2019 (see task T212444 for technical details). See this worklog for a description of how we select languages. After cleaning and removing responses from individuals under the age of 18, the surveys ended up with the following response counts:

Survey                      Response Count  Countries with at least 500 responses
Arabic (ar)                 7741            Saudi Arabia, Egypt, Iraq
German (de)                 4144            Germany
English (en -- Worldwide)   6181            United States, India
English (en -- Africa)      8043            South Africa, Nigeria, Kenya, Egypt
Spanish (es)                11897           Spain, Mexico, Argentina, Colombia, Peru, Chile
Persian (fa)                7036            Iran
French (fr -- Worldwide)    4401            France
French (fr -- Africa)       3122            Morocco, Algeria
Hebrew (he)                 586             Israel
Hungarian (hu)              1216            Hungary
Norwegian (no)              737             Norway
Romanian (ro)               1336            Romania
Russian (ru)                4565            Russia, Ukraine
Ukrainian (uk)              1148            Ukraine
Chinese (zh)                2190            Taiwan

Check back later for complete results (there are still checks to do to make sure we are confident in the debiasing before releasing official results). For intermediate results, see the Wikimania presentation (17 August 2019).

Debiasing / Analysis Features

Along with responses to the survey questions, we connect the survey responses with the following data:

  • Request (contextual) features: country, continent, day of week, time of day
  • Article demand: average page views, average number of sitelinks (languages in which the article appears)
  • Article topic: proportion of reader's page views that went to biographies of men, biographies of women, articles with coordinates (geolocated), articles with a point-in-time, whether the reader is in the same country as the coordinates of the article, article's instance-of property (aggregated to one of several superclasses)
  • Article quality (based on ORES): article length, infonoise (ratio of parsed text to wikitext), number of level-two headings, number of level-three or greater headings, number of templates, number of ref tags, number of wikilinks, number of external links
  • Session: session length (time), session length (number of page views), average time between page views, initial referer class for session (external, internal, unknown), where in the session the survey was taken, number of unique Wikipedia languages visited, whether the reader was signed in during the session, whether the reader viewed a Main Page

See this worklog for an analysis of the effectiveness of our approach for reconstructing reader sessions. For most features that are an average across a reading session, we also compute the entropy of that value (i.e. a measure of how uniform the session is). For example: did an individual read articles that had consistent numbers of page views, or some articles that had a lot of page views and some that were more niche and had many fewer page views? Note that many of the article features rely on Wikidata -- without these interlanguage links and structured properties, we would be unable to do many of these inherently multilingual analyses.
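
The text does not specify how the entropy of a session feature is computed. One plausible reading, sketched below under that assumption, treats each article's value as a share of the session total and takes the Shannon entropy of those shares, normalized so that 1.0 means a perfectly uniform session. The function name `session_entropy` and the example numbers are hypothetical.

```python
import math


def session_entropy(values):
    """Normalized Shannon entropy of non-negative per-page values in a session.

    Returns a value in [0, 1]: close to 1 when all pages have similar values,
    close to 0 when one page dominates. Sessions with fewer than two pages
    (or a zero total) are assigned 0.0 by convention in this sketch.
    """
    total = sum(values)
    if len(values) < 2 or total <= 0:
        return 0.0
    shares = [v / total for v in values if v > 0]
    entropy = -sum(p * math.log(p) for p in shares)
    return entropy / math.log(len(values))


# Hypothetical average daily page views for the articles in two sessions.
uniform_session = session_entropy([100, 100, 100, 100])
mixed_session = session_entropy([1_000_000, 10, 10])
```

Under this definition the first session scores at the top of the range, while the second, which mixes one very popular article with two niche ones, scores near zero.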

Results

The bar charts below show the debiased results for all questions. We initially withheld the results for gender while we ran a series of monthlong surveys to determine if this changed the results at all -- the hypothesis being that women tend to be less frequent readers of Wikipedia, so providing more time to see and respond to the survey might change the estimated balance of readers. The initial analysis of those surveys, though, suggests that the results did not change. These results represent the June surveys with the addition of Polish Wikipedia, which was surveyed in October 2019 over the course of one month.

Reader Motivation

The results below largely match those from the 2017 surveys where there was language overlap between the two surveys. Further detail will be added at a later point -- e.g., cross-tabulation with some demographics or article topics.

Prior knowledge of Wikipedia readers across 13 languages from June 2019 survey

We see substantial variation by language around prior knowledge before reading an article. At the extremes, about 80% of Hungarian Wikipedia readers indicate that they are already familiar with the topic that they are reading, while less than half of Chinese Wikipedia readers indicate that they are familiar.

Information needs of Wikipedia readers across 13 languages from June 2019 survey

We again see substantial variation by language with the highest proportion of respondents for each category being 58% of readers for an "overview" in Hebrew Wikipedia, 46% of readers for an "in-depth" read in Persian Wikipedia, and 44% of readers for a "fact" in Norwegian Wikipedia.

Motivation of Wikipedia readers across 13 languages from June 2019 survey

Intrinsic learning remains the primary motivation of readers. In a few languages (English, German, Norwegian), media is also a primary motivation. Because school was in session for some countries at the end of June (notably more so in the Southern Hemisphere) but not for others, differences in work/school between languages should be interpreted with care.

Reader Demographics

The results for the reader demographics questions are provided below along with 99% confidence intervals. Brief takeaways are provided for each question and more detailed analyses will follow.
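
For readers unfamiliar with how a 99% interval relates to sample size, the sketch below computes a simple normal-approximation (Wald) interval for an unweighted proportion. This is only an illustration: the published intervals are for debiased estimates and the method used to compute them is not described here. The function name `proportion_ci_99` and the example counts are hypothetical.

```python
import math

Z_99 = 2.5758  # two-sided 99% critical value of the standard normal


def proportion_ci_99(successes: int, n: int) -> tuple:
    """Normal-approximation 99% confidence interval for a proportion.

    Assumes a simple random sample with no weighting, which is a
    simplification relative to debiased survey estimates.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    half_width = Z_99 * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical counts: the same proportion at two different sample sizes.
small_low, small_high = proportion_ci_99(250, 500)
large_low, large_high = proportion_ci_99(5000, 10000)
```

The interval narrows with the square root of the sample size, so surveys with fewer responses (for example Hebrew or Norwegian) would be expected to show wider intervals than larger ones such as Spanish.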

Age of Wikipedia readers across 13 languages from June 2019 survey

Readers under the age of 25 are the most prevalent population. The notable exceptions to this are for Norwegian and German, where the age distribution is much more uniform. Keep in mind that whether school was in session might affect age of readers -- this is especially pertinent when considering the results for Hebrew and Spanish Wikipedias.

Gender of Wikipedia readers across 13 languages from June 2019 survey

Readers across all languages skew towards identifying as men. There is substantial variation though, with readers in language communities like Romanian being quite close to gender parity while much larger gaps are seen in languages like Persian or Norwegian.

Education (number of years completed) of Wikipedia readers across 13 languages from June 2019 survey

Readers (over the age of 18) most often have between 13 and 16 years of education, which generally would be interpreted as some amount of college. Hebrew is an exception here, which is likely explained by compulsory military service for most individuals over the age of 18 in Israel.

Locale (urban/rural) of Wikipedia readers across 13 languages from June 2019 survey

Most readers are from urban areas, though again German and Norwegian are slightly more balanced in this regard.

Native language(s) of Wikipedia readers across 13 languages from June 2019 survey

For many Wikipedias, the vast majority of readers -- e.g., over 95% -- include that language as one of their native languages. The exceptions are English and French, which are non-native languages for many readers in Africa but still the main Wikipedia language edition to which they turn. Anecdotally, there were many readers from Africa who listed native languages that do not even have a Wikipedia edition. Care should also be taken in the interpretation of Chinese Wikipedia, which has some interesting language adaptations.

October 2019 Results

From 26 September 2019 to 30 October 2019, surveys were deployed in Russian, Polish, and English Wikipedia. The goal was to determine whether a monthlong survey reached a different reader population -- namely less-frequent readers. It also provides data at a different time of year, which allows us to test how seasonality appears to affect the results. See task T232525 for greater context. After cleaning and removing responses from individuals under the age of 18, we achieved the following response counts:

Survey        Response Count  Countries with at least 500 responses
English (en)  1704            United States
Polish (pl)   688             Poland
Russian (ru)  1130            Russia

For most metrics, no significant differences were seen between the week-long June and month-long October results for Russian and English. In particular, the gender results were nearly identical. The exceptions related to students now being in school in much of the Russian- and English-speaking world: namely, an increase in individuals under the age of 18 (from 16% to 22% in English and from 22% to 27% in Russian) and a corresponding uptick in work/school as a motivation, as well as in in-depth information needs and intrinsic learning as a motivation in English. The results for Polish Wikipedia are included above with the June results because no major differences were seen.

Language Switching

In order to prioritize content gaps across languages, it is useful to understand how people "jump" across different languages seeking given content. As a first approach to characterize this behavior, we quantified three elements:

  • People reading Wikipedia in more than one language: We found that fewer than 20% of people switch between languages within the same session when they read Wikipedia.
  • Share of the most popular project per country: Most countries have a clear dominant project, but there are exceptions in multilingual countries. However, we also found that in those countries, readers of each language form separate communities (people generally do not switch between languages), corresponding to smaller administrative divisions.
Share of the most popular Wikipedia per country: The vast majority of people in a country read in the same language.
  • Ratio of English Wikipedia Readers per country: In non-English-speaking countries, the number of people visiting English Wikipedia is marginal.
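
Two of the quantities listed above can be sketched as follows, assuming sessions have already been reconstructed into lists of Wikipedia language codes. The function names `multilingual_share` and `top_project_share` and the toy data are hypothetical and do not reflect the actual analysis pipeline.

```python
from collections import Counter


def multilingual_share(sessions):
    """Fraction of sessions that include page views in more than one language.

    `sessions` is an iterable of lists of language codes, one code per page view.
    """
    sessions = list(sessions)
    if not sessions:
        return 0.0
    switched = sum(1 for langs in sessions if len(set(langs)) > 1)
    return switched / len(sessions)


def top_project_share(page_view_langs):
    """Most viewed language edition among a country's page views and its share."""
    counts = Counter(page_view_langs)
    lang, views = counts.most_common(1)[0]
    return lang, views / sum(counts.values())


# Toy data for illustration only.
example_sessions = [["en", "en"], ["de", "en"], ["fr"], ["es", "es", "es"]]
switch_rate = multilingual_share(example_sessions)
dominant = top_project_share(["de", "de", "en", "de"])
```

The share of English Wikipedia readers in a country could be computed the same way by counting `"en"` page views against the country total.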

References

  1. Hale, Scott A. (2014). "Multilinguals and Wikipedia Editing". Proceedings of the 2014 ACM conference on Web science - WebSci '14: 99–108. doi:10.1145/2615569.2615684. 
  2. Hinnosaar, Marit (26 April 2019). "Gender Inequality in New Media: Evidence from Wikipedia". Social Science Research Network.