Research:Characterizing Wikipedia Reader Behaviour/Demographics and Wikipedia use cases
This page is currently a draft. More information pertaining to this may be available on the talk page.
NOTE: Check back later for information on future research.
The overall goal of this iteration of research is to better understand motivation and behavior across different subpopulations of readers. We are focusing on demographic groups that have been associated with differences in either behavior or awareness of Wikipedia. There will be two core components to this research: a survey on the demographics and motivations of Wikipedia readers on different projects, and an analysis of reader behavior. Combining these two approaches will allow us to better understand how the experiences of different subpopulations of readers overlap or diverge, and where we can focus our efforts to improve this experience.
We are developing a survey to understand how reader motivation varies across different demographic groups. Many of the questions that we plan on asking are similar to those asked by the Global Reach team through phone surveys. These past surveys have been very informative regarding what populations are not reaching Wikipedia. Our reader surveys will complement this past work by helping us understand the needs of the readers who do reach Wikipedia. We are planning on asking about the following attributes:
- Age: which category an individual falls into -- for example, age 18-24. Age has been shown to be linked to internet skills, with older adults being less likely to use Wikipedia.
- Gender: man, woman, open-ended, or prefer not to say. There are well-known gender disparities in content, as well as readership in certain regions.
- Education: how many years of education have you completed. The aim is to understand how education, which correlates with internet skills, affects behavior.
- Geographic Region: spectrum between rural and urban. Rural regions tend to have much lower-quality content and lower readership.
- Native Language: what is your native language(s). Individuals fill different roles and have access to different content across language editions.
- Motivation: same questions as prior surveys.
English-language demographics survey questions
Are you at least 18 years of age? ¤ Yes ¤ No
<three motivation questions from previous surveys>
Tell us about yourself
What is your age?
¤ 18-24 years ¤ 25-29 years ¤ 30-39 years ¤ 40-49 years ¤ 50-59 years ¤ 60 years and older ¤ Prefer not to say
What is your gender?
¤ Woman ¤ Man ¤ Prefer not to say ¤ Other... <open-text>
How many years (full-time equivalent) have you been in formal education? Include all primary and secondary schooling, university and other post-secondary education, and full-time vocational training, but do not include repeated years. If you are currently in education, count the number of years you have completed so far.
¤ I have no formal schooling ¤ 1-6 years ¤ 7 years ¤ 8 years ¤ 9 years ¤ 10 years ¤ 11 years ¤ 12 years ¤ 13 years ¤ 14 years ¤ 15 years ¤ 16 years ¤ 17 years ¤ 18 years ¤ >18 years ¤ Prefer not to say
Would you describe the place where you live as....
¤ A farm or home in the country ¤ A country village ¤ A small city or town ¤ The suburbs or outskirts of a big city ¤ A big city ¤ Prefer not to say
What is your native language?
<list of Wikipedia languages in their native script>
What is your second native language?
¤ I do not have a second native language ¤ Other... <open-text>
From March 4 - 5, 2019, a small-scale pilot of the survey was run on English Wikipedia. It resulted in 771 responses, of which 626 were complete and not under the age of 18. The pilot (and start of the survey translation process) identified a number of issues, described below, that were worked through before expanding the survey to more languages / respondents.
Sampling for inclusion in a given survey is done by browser. The first time a user navigates to a Wikipedia article with an active survey, a token is stored in their browser's local storage that is associated with that survey's name and indicates in a deterministic way whether the survey will be displayed on that browser. Given that a survey is active for at least several days, readers who visit Wikipedia at least occasionally are just as likely to be sampled as frequent readers. Among readers who are sampled, however, more frequent readers are more likely to respond. In the pilot, respondents viewed an average of 6.9 pages and 52% viewed only a single page, while individuals who did not respond viewed an average of 4.7 pages and 61% viewed only a single page. Additionally, selection bias or issues with the translations or wording of the questions could differentially affect response rates.
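The per-browser sampling described above can be illustrated with a minimal sketch. This is not the QuickSurveys implementation: the storage key format, the use of SHA-256, and the `coverage` parameter are all assumptions made for illustration, and a plain dictionary stands in for the browser's local storage.

```python
import hashlib
import secrets


def get_or_create_token(storage: dict, survey_name: str) -> str:
    """Return the token stored for this survey, creating it on first visit."""
    key = f"survey-token-{survey_name}"  # hypothetical key format
    if key not in storage:
        storage[key] = secrets.token_hex(16)  # random, created once per browser
    return storage[key]


def is_in_sample(token: str, coverage: float) -> bool:
    """Map the token deterministically to [0, 1) and compare with coverage."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < coverage


def should_show_survey(storage: dict, survey_name: str, coverage: float) -> bool:
    """Decide whether this browser sees the survey; stable across page views."""
    return is_in_sample(get_or_create_token(storage, survey_name), coverage)
```

Because the decision depends only on the stored token, a browser gets the same answer on every page view while the survey is active, which is why visit frequency does not change the chance of being sampled once a reader has visited at least once.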
A small minority of survey respondents did not have associated EventLogging data, which limits our ability to understand the relationship between reader demographics / motivations and the types of pages that they are reading. The different causes and their respective magnitudes are provided below:
- People who can see QuickSurveys but don't have EventLogging (~10%): It is possible that slower browsers are failing to load the EventLogging code and thus are able to see and respond to surveys but are not logged appropriately. See this Phabricator task for more details. There is a chance that some of this is fixable (phab:T218243 and phab:T220627#5107667), but we cannot recover data in any real way for these respondents, so any analysis that relies on EventLogging data will miss them. There were no strong demographic patterns related to who was missing EventLogging data, though these respondents tended to be below 40 and male.
- People who right-click and open in a new tab to take external surveys (~5%): We get QuickSurveyInitiation EventLogging but not QuickSurveysResponses EventLogging for this group. This happens almost exclusively on desktop and should only be a problem for external surveys (no reason to right-click on internal surveys). For this group, it's harder to get the contextual information but not impossible based on approximate methods. The main feature we lose is the
Age / Gender Skew
The survey respondents skewed heavily young and male. Including those who were under the age of 18, 70% of respondents were under the age of 30. Of those who completed the survey, 76% identified as men. There were no clear interactions with other variables -- that is, the gender balance was consistent across age groups. This held true for country as well with the exception that the United States was slightly more balanced gender-wise (only 67% men). The United Kingdom and India, the other two most well-represented countries, had a gender balance of 75% and 83% men respectively.
This was a surprising level of skew for the reader population, which led to the question: is the readership truly skewed that far to men or is the skew resulting from different rates at which individuals of different gender identities self-select into the survey? We looked at past surveys and found the following data points regarding gender and frequency of Wikipedia reading:
- Based on a survey of 1000 AMT workers from US: "Second, men use Wikipedia more often — they are twice as likely than women to use Wikipedia daily"
- Global Insights phone surveys: younger respondents were consistently more likely to read Wikipedia frequently, but the evidence on gender was mixed:
  - India: women more likely to be frequent readers of Wikipedia
  - Mexico: men more likely to be frequent readers of Wikipedia
  - Nigeria: men slightly more likely to be frequent readers of Wikipedia
  - Iraq: ~equal likelihood by gender of being frequent readers of Wikipedia
Urban / Rural Question
See locale analysis.
June 2019 Results
The survey was run in 13 languages from 26 June 2019 - 1 July 2019 (see task T212444 for technical details). See this worklog for a description of how we select languages. After cleaning and removing responses from individuals under the age of 18, the surveys ended up with the following response counts:
|Survey||Response Count||Countries with at least 500 responses|
|Arabic (ar)||7741||Saudi Arabia, Egypt, Iraq|
|English (en -- Worldwide)||6181||United States, India|
|English (en -- Africa)||8043||South Africa, Nigeria, Kenya, Egypt|
|Spanish (es)||11897||Spain, Mexico, Argentina, Colombia, Peru, Chile|
|French (fr -- Worldwide)||4401||France|
|French (fr -- Africa)||3122||Morocco, Algeria|
|Russian (ru)||4565||Russia, Ukraine|
Check back later for complete results (there are still checks to do to make sure we are confident in the debiasing before releasing official results). For intermediate results, see the Wikimania presentation (17 August 2019).
Debiasing / Analysis Features
Along with responses to the survey questions, we connect the survey responses with the following data:
- Request (contextual) features: country, continent, day of week, time of day
- Article demand: average page views, average number of sitelinks (languages in which the article appears)
- Article topic: proportion of the reader's page views that went to biographies of men, biographies of women, articles with coordinates (geolocated), articles with a point-in-time, whether the reader is in the same country as the coordinates of the article, article's instance-of property (aggregated to one of several superclasses)
- Article quality (based on ORES): article length, infonoise (ratio of parsed text to wikitext), number of level-two headings, number of level-three or greater headings, number of templates, number of ref tags, number of wikilinks, number of external links
- Session: session length (time), session length (number of page views), average time between page views, initial referer class for session (external, internal, unknown), where in the session the survey was taken, number of unique Wikipedia languages visited, whether the reader was signed in during the session, whether the reader viewed a Main Page
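As a rough illustration of the session features listed above, the sketch below derives session length (time), session length (number of page views), and average time between page views from a list of page view timestamps. The function name, the input format, and the returned field names are hypothetical and are not taken from the analysis pipeline.

```python
from datetime import datetime
from statistics import mean


def session_features(timestamps: list[datetime]) -> dict:
    """Compute simple per-session features from page view timestamps."""
    ordered = sorted(timestamps)
    # Gaps between consecutive page views, in seconds.
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    return {
        "n_page_views": len(ordered),
        "duration_seconds": (ordered[-1] - ordered[0]).total_seconds() if ordered else 0.0,
        # Undefined for sessions with a single page view.
        "mean_gap_seconds": mean(gaps) if gaps else None,
    }
```

Single-page sessions, which were the majority in the pilot, have no gap between page views, so the sketch returns `None` for that feature rather than a misleading zero.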
See this worklog for an analysis of the effectiveness of our approach for reconstructing reader sessions. For most features that are an average across a reading session, we also compute the entropy of that value (i.e. a measure of how uniform the session is). For example: did an individual read articles that had consistent numbers of page views or some articles that had a lot of page views and some articles that were more niche and had many fewer page views. Note that many of the article features rely on Wikidata -- without these interlanguage links and structured properties, we would be unable to do much of these inherently multilingual analyses.
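One way to make the entropy feature concrete is sketched below. It assumes that the per-article values in a session (for example, average page views of each article read) are treated as shares of their total and that Shannon entropy is computed over those shares; the exact formulation used in the analysis is not stated here, so both functions are illustrative only.

```python
import math


def shannon_entropy(values: list[float]) -> float:
    """Shannon entropy (in bits) of positive values treated as shares of their sum."""
    total = sum(v for v in values if v > 0)
    if total <= 0:
        return 0.0
    probs = [v / total for v in values if v > 0]
    return -sum(p * math.log2(p) for p in probs)


def normalized_entropy(values: list[float]) -> float:
    """Entropy scaled to [0, 1]; 1 means all articles have equal values."""
    n = sum(1 for v in values if v > 0)
    if n <= 1:
        return 0.0
    return shannon_entropy(values) / math.log2(n)
```

Under this assumption, a session mixing one very popular article with one niche article scores close to 0, while a session of equally popular articles scores 1.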
In order to prioritize content gaps across languages, it is useful to understand how people "jump" across different languages seeking given content. As a first approach to characterize this behavior, we quantified three elements:
- People reading Wikipedia in more than one language: We found that fewer than 20% of people switch between languages within the same session when they read Wikipedia.
- Share of the most popular project per country: Most countries have a clear dominant project, but there are exceptions in multilingual countries. Even in those countries, however, readers of each language form separate communities (people generally do not switch between languages), corresponding to smaller administrative divisions.
- Ratio of English Wikipedia readers per country: In non-English-speaking countries, the number of people visiting English Wikipedia is marginal.
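The first two quantities above can be sketched as simple aggregations. The input shapes are assumptions made for illustration (a mapping from a session identifier to the language codes of the pages viewed, and a mapping from project to page view counts for one country), and the function names are hypothetical.

```python
def share_multilingual_sessions(sessions: dict[str, list[str]]) -> float:
    """Fraction of sessions in which more than one language edition was read."""
    if not sessions:
        return 0.0
    multi = sum(1 for langs in sessions.values() if len(set(langs)) > 1)
    return multi / len(sessions)


def dominant_project_share(views_by_project: dict[str, int]) -> tuple[str, float]:
    """Most viewed project in a country and its share of that country's views."""
    total = sum(views_by_project.values())
    project, views = max(views_by_project.items(), key=lambda item: item[1])
    return project, views / total
```

For example, with five sessions of which one includes both `es` and `en` page views, `share_multilingual_sessions` returns 0.2.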
- Hale, Scott A. (2014). "Multilinguals and Wikipedia Editing". Proceedings of the 2014 ACM conference on Web science - WebSci '14: 99–108. doi:10.1145/2615569.2615684.
- Hinnosaar, Marit (26 April 2019). "Gender Inequality in New Media: Evidence from Wikipedia". Social Science Research Network.