Research:Characterizing Wikipedia Reader Behaviour/Demographics and Wikipedia use cases
NOTE: Check back later for information on future research.
The overall goal of this iteration of research is to better understand motivation and behavior in terms of different subpopulations of readers. We are focusing on demographic groups that have been associated with either different behavior or different awareness of Wikipedia. There will be two core components to this research: a survey on the demographics and motivations of Wikipedia readers on different projects, and an analysis of reader behavior. Combining these two approaches will allow us to better understand how the experiences of different subpopulations of readers overlap or diverge, and where we can focus our efforts to improve this experience.
We are developing a survey to understand how reader motivation varies across different demographic groups. Many of the questions that we plan on asking are similar to those asked by the Global Reach team through phone surveys. These past surveys have been very informative regarding what populations are not reaching Wikipedia. Our reader surveys will complement this past work by helping us understand the needs of the readers who do reach Wikipedia. We are planning on asking about the following attributes:
- Age: which category an individual falls into -- for example, age 18-24. Age has been shown to be linked to internet skills, with older adults being less likely to use Wikipedia.
- Gender: man, woman, open-ended, or prefer not to say. There are well-known gender disparities in content, as well as readership in certain regions.
- Education: how many years of education have you completed. The aim is to understand how education, which correlates with internet skills, affects behavior.
- Geographic Region: spectrum between rural and urban. Rural regions tend to have much lower-quality content and lower readership.
- Native Language: what is your native language(s). Individuals fill different roles and have access to different content across language editions.
- Motivation: same questions as prior surveys.
English-language demographics survey questions
Are you at least 18 years of age? ¤ Yes ¤ No
<three motivation questions from: previous surveys>
Tell us about yourself
What is your age? ¤ 18-24 years ¤ 25-29 years ¤ 30-39 years ¤ 40-49 years ¤ 50-59 years ¤ 60 years and older ¤ Prefer not to say
What is your gender? ¤ Woman ¤ Man ¤ Prefer not to say ¤ Other... <open-text>
How many years (full-time equivalent) have you been in formal education? Include all primary and secondary schooling, university and other post-secondary education, and full-time vocational training, but do not include repeated years. If you are currently in education, count the number of years you have completed so far. ¤ I have no formal schooling ¤ 1-6 years ¤ 7 years ¤ 8 years ¤ 9 years ¤ 10 years ¤ 11 years ¤ 12 years ¤ 13 years ¤ 14 years ¤ 15 years ¤ 16 years ¤ 17 years ¤ 18 years ¤ >18 years ¤ Prefer not to say
Would you describe the place where you live as.... ¤ A farm or home in the country ¤ A country village ¤ A small city or town ¤ The suburbs or outskirts of a big city ¤ A big city ¤ Prefer not to say
What is your native language? <list of Wikipedia languages in their native script>
What is your second native language? ¤ I do not have a second native language ¤ Other... <open-text>
From March 4 - 5, 2019, a small-scale pilot of the survey was run on English Wikipedia. It resulted in 771 responses, of which 626 were complete and not under the age of 18. The pilot (and start of the survey translation process) identified a number of issues, described below, that were worked through before expanding the survey to more languages / respondents.
Sampling for inclusion in a given survey is done by browser. The first time a user navigates to a Wikipedia article with an active survey, a token is stored in their browser's local storage that is associated with that survey's name and indicates in a deterministic way whether the survey will be displayed on that browser. Given that a survey is active for at least several days, readers who at least occasionally visit Wikipedia are just as likely to be sampled as frequent readers. Among the readers who are sampled, however, more frequent readers are more likely to respond. In the pilot, respondents viewed an average of 6.9 pages and 52% viewed only a single page, while individuals who did not respond viewed an average of 4.7 pages and 61% viewed only a single page. Additionally, selection bias or issues with the translation or wording of the questions could differentially affect response rates.
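The per-browser sampling described above can be illustrated with a minimal Python sketch. This is not the QuickSurveys implementation: the function name `is_sampled`, the use of SHA-256, and the `sample_rate` parameter are assumptions made for illustration only. The sketch shows just the key property, namely that the same token and survey name always yield the same decision.

```python
import hashlib


def is_sampled(browser_token: str, survey_name: str, sample_rate: float) -> bool:
    """Deterministically decide whether a browser is in a survey's sample.

    Hypothetical helper: hashing the survey name together with the browser
    token means the decision is stable across page views for one survey but
    independent between surveys.
    """
    key = f"{survey_name}:{browser_token}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# The decision does not change no matter how often the reader returns.
first_visit = is_sampled("token-123", "reader-demographics-en", 0.1)
later_visit = is_sampled("token-123", "reader-demographics-en", 0.1)
print(first_visit == later_visit)
```

Because the decision depends only on the token and the survey name, a reader who visits once during the survey window has the same chance of inclusion as a reader who visits daily, which matches the property described above.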
A small minority of survey respondents did not have associated EventLogging data, which limits our ability to understand the relationship between reader demographics / motivations and the types of pages that they are reading. The different causes and respective magnitude are provided below:
- People who can see QuickSurveys but don't have EventLogging (~10%): It is possible that slower browsers fail to load the EventLogging code and thus can see and respond to surveys without being logged appropriately. See this phabricator task for more details. There is a chance that some of this is fixable (phab:T218243 and phab:T220627#5107667), but we cannot recover data in any real way for these respondents, so any analysis that relies on EventLogging data will miss them. There were no strong demographic patterns in who was missing EventLogging data, though these respondents tended to be below 40 and male.
- People who right-click and open in a new tab to take external surveys (~5%): We get QuickSurveyInitiation EventLogging but not QuickSurveysResponses EventLogging for this group. This happens almost exclusively on desktop and should only be a problem for external surveys (no reason to right-click on internal surveys). For this group, it's harder to get the contextual information but not impossible based on approximate methods. The main feature we lose is the
Age / Gender Skew
The survey respondents skewed heavily young and male. Including those who were under the age of 18, 70% of respondents were under the age of 30. Of those who completed the survey, 76% identified as men. There were no clear interactions with other variables -- that is, the gender balance was consistent across age groups. This held true for country as well with the exception that the United States was slightly more balanced gender-wise (only 67% men). The United Kingdom and India, the other two most well-represented countries, had a gender balance of 75% and 83% men respectively.
This was a surprising level of skew for the reader population, which led to the question: is the readership truly skewed that far to men or is the skew resulting from different rates at which individuals of different gender identities self-select into the survey? We looked at past surveys and found the following data points regarding gender and frequency of Wikipedia reading:
- Based on a survey of 1000 AMT workers from US: "Second, men use Wikipedia more often — they are twice as likely than women to use Wikipedia daily"
- Global Insights phone surveys: younger respondents were consistently more likely to read Wikipedia frequently, but the evidence on gender was mixed:
  - India: women more likely to be frequent readers of Wikipedia
  - Mexico: men more likely to be frequent readers of Wikipedia
  - Nigeria: men slightly more likely to be frequent readers of Wikipedia
  - Iraq: ~equal likelihood by gender of being frequent readers of Wikipedia
Urban / Rural Question
See locale analysis.
June 2019 Results
The survey was run in 13 languages from 26 June 2019 - 1 July 2019 (see task T212444 for technical details). See this worklog for a description of how we select languages. After cleaning and removing responses from individuals under the age of 18, the surveys ended up with the following response counts:
|Survey||Response Count||Countries with at least 500 responses|
|Arabic (ar)||7741||Saudi Arabia, Egypt, Iraq|
|English (en -- Worldwide)||6181||United States, India|
|English (en -- Africa)||8043||South Africa, Nigeria, Kenya, Egypt|
|Spanish (es)||11897||Spain, Mexico, Argentina, Colombia, Peru, Chile|
|French (fr -- Worldwide)||4401||France|
|French (fr -- Africa)||3122||Morocco, Algeria|
|Russian (ru)||4565||Russia, Ukraine|
Check back later for complete results. For additional results, see the Wikimania presentation (17 August 2019).
Debiasing / Analysis Features
Along with responses to the survey questions, we connect the survey responses with the following data:
- Request (contextual) features: country, continent, day of week, time of day
- Article demand: average page views, average number of sitelinks (languages in which the article appears)
- Article topic: proportion of reader's page views that went to biographies of men, biographies of women, articles with coordinates (geolocated), articles with a point-in-time, whether the reader is in the same country as the coordinates of the article, article's instance-of property (aggregated to one of several superclasses)
- Article quality (based on ORES): article length, infonoise (ratio of parsed text to wikitext), number of level-two headings, number of level-three or greater headings, number of templates, number of ref tags, number of wikilinks, number of external links
- Session: session length (time), session length (number of page views), average time between page views, initial referer class for session (external, internal, unknown), where in the session the survey was taken, number of unique Wikipedia languages visited, whether the reader was signed in during the session, whether the reader viewed a Main Page
See this worklog for an analysis of the effectiveness of our approach for reconstructing reader sessions. For most features that are an average across a reading session, we also compute the entropy of that value (i.e. a measure of how uniform the session is). For example: did an individual read articles that had consistent numbers of page views or some articles that had a lot of page views and some articles that were more niche and had many fewer page views. Note that many of the article features rely on Wikidata -- without these interlanguage links and structured properties, we would be unable to do much of these inherently multilingual analyses.
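As an illustration of the entropy feature, the sketch below computes a normalized Shannon entropy over the average page views of the articles in one session. The exact formulation used in the analysis is not specified here, so the function name `normalized_entropy`, the treatment of values as shares of their total, and the scaling to the range 0 to 1 are all assumptions.

```python
import math


def normalized_entropy(values):
    """Shannon entropy of values treated as shares of their total.

    Returns a number between 0 and 1: close to 1 when all values are similar
    (a uniform session), close to 0 when one value dominates.
    """
    total = sum(values)
    if len(values) < 2 or total <= 0:
        return 0.0
    shares = [v / total for v in values if v > 0]
    entropy = -sum(p * math.log(p) for p in shares)
    # Divide by the maximum possible entropy for this many values.
    return entropy / math.log(len(values))


# Illustrative numbers only: average page views of articles read in a session.
uniform_session = [1200, 1100, 1300]    # consistently popular articles
mixed_session = [1_000_000, 10, 10]     # one very popular, two niche articles
print(normalized_entropy(uniform_session) > normalized_entropy(mixed_session))
```

Normalizing by the logarithm of the number of articles makes sessions of different lengths comparable, which is a design choice of this sketch rather than something stated in the analysis.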
The bar charts below show the debiased results for all questions. We initially withheld the results for gender while we ran a series of monthlong surveys to determine whether this changed the results at all -- the hypothesis being that women tend to be less frequent readers of Wikipedia, so providing more time to see and respond to the survey might change the estimated balance of readers. The initial analysis of those surveys, though, suggests that the results did not change. These results represent the June surveys with the addition of Polish Wikipedia, which was surveyed in October 2019 over the course of one month.
The results below largely match those from the 2017 surveys where there was language overlap between the two surveys. Further detail will be added at a later point -- e.g., cross-tabulation with some demographics or article topics.
We see substantial variation by language around prior knowledge before reading an article. At the extremes, about 80% of Hungarian Wikipedia readers indicate that they are already familiar with the topic that they are reading, while less than half of Chinese Wikipedia readers indicate that they are familiar.
We again see substantial variation by language with the highest proportion of respondents for each category being 58% of readers for an "overview" in Hebrew Wikipedia, 46% of readers for an "in-depth" read in Persian Wikipedia, and 44% of readers for a "fact" in Norwegian Wikipedia.
Intrinsic learning remains the primary motivation of readers. In a few languages (English, German, Norwegian), media is also a primary motivation. Note that school was in session in some countries at the end of June (notably more so in the Southern Hemisphere) but not in others, so differences in work/school between languages should be interpreted with care.
The results for the reader demographics questions are provided below along with 99% confidence intervals. Brief takeaways are provided for each question and more detailed analyses will follow.
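For readers unfamiliar with how such intervals are formed, the sketch below computes a 99% confidence interval for a single proportion using the normal (Wald) approximation. This is an assumption for illustration: the reported results are debiased, so the actual intervals may have been computed differently (for example, with weights). The function name `proportion_ci` and the input numbers are hypothetical.

```python
import math
from statistics import NormalDist


def proportion_ci(share, n, confidence=0.99):
    """Normal-approximation (Wald) confidence interval for a proportion.

    share: observed proportion between 0 and 1
    n: number of responses the proportion is based on
    """
    # Two-sided critical value, roughly 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(share * (1 - share) / n)
    return max(0.0, share - margin), min(1.0, share + margin)


# Illustrative numbers only: a 76% share observed among 626 responses.
low, high = proportion_ci(0.76, 626)
print(f"{low:.3f} to {high:.3f}")
```

The interval narrows as the number of responses grows, which is why languages with fewer responses show wider intervals.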
Readers under the age of 25 are the most prevalent population. The notable exceptions to this are for Norwegian and German, where the age distribution is much more uniform. Keep in mind that whether school was in session might affect age of readers -- this is especially pertinent when considering the results for Hebrew and Spanish Wikipedias.
Readers across all languages skew towards identifying as men. There is substantial variation though, with readers in language communities like Romanian being quite close to gender parity while much larger gaps are seen in languages like Persian or Norwegian.
Readers (over the age of 18) most often have between 13-16 years of education, which generally would be interpreted as some amount of college. Hebrew is an exception here, which is likely explained by compulsory military service for most individuals over the age of 18 in Israel.
Most readers are from urban areas, though again German and Norwegian are slightly more balanced in this regard.
For many Wikipedias, the vast majority of readers -- e.g., over 95% -- include that language as one of their native languages. The exceptions are English and French, which are non-native languages for many readers in Africa but still the main Wikipedia language edition to which they turn. Anecdotally, there were many readers from Africa who listed native languages that do not even have a Wikipedia edition. Care should also be taken in the interpretation of Chinese Wikipedia, which has some interesting language adaptations.
October 2019 Results
From 26 September 2019 to 30 October 2019, surveys were deployed in Russian, Polish, and English Wikipedia. The goal was to determine whether a monthlong survey reached a different reader population -- namely less-frequent readers. It also provides data at a different time of year, which allows us to test how seasonality appears to affect the results. See task T232525 for greater context. After cleaning and removing responses from individuals under the age of 18, we achieved the following response counts:
|Survey||Response Count||Countries with at least 500 responses|
|English (en)||1704||United States|
For most metrics, no significant differences were seen between the week-long June and month-long October results for Russian and English. In particular, the gender results were nearly identical. The exceptions related to students now being in school in much of the Russian- and English-speaking world: namely an increase in individuals under the age of 18 (from 16% to 22% in English and from 22% to 27% in Russian) and a corresponding uptick in work/school as a motivation as well as in-depth information and intrinsic learning for motivation in English. The results for Polish Wikipedia are included above with the June results because no major differences were seen.
In order to prioritize content gaps across languages, it is useful to understand how people "jump" across different languages seeking given content. As a first approach to characterize this behavior, we quantified three elements:
- People reading Wikipedia in more than one language: We found that fewer than 20% of people switch between languages in the same session when they read Wikipedia.
- Share of the most popular project per country: Most countries have a clear dominant project, but there are exceptions in multilingual countries. However, we also found that in those countries, the readers of each language form separate communities (people generally do not switch between languages), corresponding to smaller administrative divisions.
- Ratio of English Wikipedia readers per country: In non-English-speaking countries, the number of people visiting English Wikipedia is marginal.
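The first of these measures can be sketched as follows. This is a minimal illustration, assuming page views are available as pairs of a session identifier and a Wikipedia language code; the function name `multilingual_share` and the sample data are hypothetical and do not come from the analysis itself.

```python
from collections import defaultdict


def multilingual_share(page_views):
    """Fraction of sessions that include page views in more than one language.

    page_views: iterable of (session_id, language_code) pairs.
    """
    languages_by_session = defaultdict(set)
    for session_id, language in page_views:
        languages_by_session[session_id].add(language)
    if not languages_by_session:
        return 0.0
    multilingual = sum(
        1 for languages in languages_by_session.values() if len(languages) > 1
    )
    return multilingual / len(languages_by_session)


# Illustrative data only: five sessions, one of which switches languages.
views = [
    ("a", "en"), ("a", "es"),
    ("b", "en"), ("b", "en"),
    ("c", "fr"),
    ("d", "ru"),
    ("e", "pl"),
]
print(multilingual_share(views))
```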
- Hale, Scott A. (2014). "Multilinguals and Wikipedia Editing". Proceedings of the 2014 ACM conference on Web science - WebSci '14: 99–108. doi:10.1145/2615569.2615684.
- Hinnosaar, Marit (26 April 2019). "Gender Inequality in New Media: Evidence from Wikipedia". Social Science Research Network.