Research:Characterizing Wikipedia Reader Behaviour/Demographics and Wikipedia use cases
The overall goal of this research project is to better understand the motivations and behavior of different subpopulations of readers. We focus on demographic groups that have been associated with differences in either behavior or awareness of Wikipedia. There are two core components to this research: a survey on the demographics and motivations of Wikipedia readers on different projects, and an analysis of reader behavior. Combining these two approaches allows us to better understand how the experiences of different subpopulations of readers overlap or diverge and where we can focus our efforts to improve this experience.
We developed a survey to understand how reader motivation varies across different demographic groups. Many of the questions that we ask are similar to those asked by the Global Reach team through phone surveys. These past surveys have been very informative regarding what populations are not reaching Wikipedia. Our reader surveys complement this past work by helping us understand the needs of the readers who do reach Wikipedia. We asked about the following attributes:
- Age: which category an individual falls into -- for example, age 18-24. Age has been shown to be linked to internet skills, with older adults being less likely to use Wikipedia.
- Gender: man, woman, open-ended, or prefer not to say. There are well-known gender disparities in content, as well as readership in certain regions.
- Education: how many years of education have you completed. The aim is to understand how education, which correlates with internet skills, affects behavior.
- Geographic Region: spectrum between rural and urban. Rural regions tend to have much lower-quality content and lower readership.
- Native Language: what is your native language(s). Individuals fill different roles and have access to different content across language editions.
- Motivation: same questions as prior surveys.
English-language demographics survey questions
Are you at least 18 years of age? ¤ Yes ¤ No
<three motivation questions from previous surveys>
Tell us about yourself
What is your age? ¤ 18-24 years ¤ 25-29 years ¤ 30-39 years ¤ 40-49 years ¤ 50-59 years ¤ 60 years and older ¤ Prefer not to say
What is your gender? ¤ Woman ¤ Man ¤ Prefer not to say ¤ Other... <open-text>
How many years (full-time equivalent) have you been in formal education? Include all primary and secondary schooling, university and other post-secondary education, and full-time vocational training, but do not include repeated years. If you are currently in education, count the number of years you have completed so far. ¤ I have no formal schooling ¤ 1-6 years ¤ 7 years ¤ 8 years ¤ 9 years ¤ 10 years ¤ 11 years ¤ 12 years ¤ 13 years ¤ 14 years ¤ 15 years ¤ 16 years ¤ 17 years ¤ 18 years ¤ >18 years ¤ Prefer not to say
Would you describe the place where you live as.... ¤ A farm or home in the country ¤ A country village ¤ A small city or town ¤ The suburbs or outskirts of a big city ¤ A big city ¤ Prefer not to say
What is your native language? <list of Wikipedia languages in their native script>
What is your second native language? ¤ I do not have a second native language ¤ Other... <open-text>
The survey was run in 13 languages from 26 June 2019 to 1 July 2019 (see task T212444 for technical details). See this worklog for a description of how we selected languages. After cleaning and removing responses from individuals under the age of 18, the surveys ended up with the following response counts (Polish is included in this table, but see October 2019 Results for details):
|Survey||Response Count||Countries with at least 500 responses|
|Arabic (ar)||7741||Saudi Arabia, Egypt, Iraq|
|English (en -- Worldwide)||6181||United States, India|
|English (en -- Africa)||8043||South Africa, Nigeria, Kenya, Egypt|
|Spanish (es)||11897||Spain, Mexico, Argentina, Colombia, Peru, Chile|
|French (fr -- Worldwide)||4401||France|
|French (fr -- Africa)||3122||Morocco, Algeria|
|Russian (ru)||4565||Russia, Ukraine|
Debiasing / Analysis Features
Along with responses to the survey questions, we connect the survey responses with the following data:
- Request (contextual) features: country, continent, day of week, time of day
- Article demand: average page views, average number of sitelinks (languages in which the article appears)
- Article topic: proportion of reader's page views that went to biographies of men, biographies of women, articles with coordinates (geolocated), articles with a point-in-time, whether the reader is in the same country as the coordinates of the article, article's instance-of property (aggregated to one of several superclasses)
- Article quality (based on ORES): article length, infonoise (ratio of parsed text to wikitext), number of level-two headings, number of level-three or greater headings, number of templates, number of ref tags, number of wikilinks, number of external links
- Session: session length (time), session length (number of page views), average time between page views, initial referer class for session (external, internal, unknown), where in the session the survey was taken, number of unique Wikipedia languages visited, whether the reader was signed in during the session, whether the reader viewed a Main Page
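The session features above depend on first splitting a reader's page views into sessions. The reference list notes that reading sessions are delineated by 1 hour of inactivity. The following is a minimal sketch of that rule, not the project's actual pipeline: the function names `split_sessions` and `session_features`, the example timestamps, and the choice to treat a gap of exactly one hour as the same session are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Inactivity threshold taken from the text; boundary handling is an assumption.
SESSION_GAP = timedelta(hours=1)


def split_sessions(timestamps, gap=SESSION_GAP):
    """Group page-view timestamps into sessions split by more than `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])    # inactivity exceeded: start a new session
    return sessions


def session_features(session):
    """Compute a few of the session features listed above for one session."""
    n = len(session)
    duration = (session[-1] - session[0]).total_seconds()
    avg_gap = duration / (n - 1) if n > 1 else 0.0
    return {"page_views": n, "duration_s": duration, "avg_gap_s": avg_gap}


# Hypothetical page views for one reader on the first survey day.
views = [
    datetime(2019, 6, 26, 9, 0),
    datetime(2019, 6, 26, 9, 10),
    datetime(2019, 6, 26, 9, 40),
    datetime(2019, 6, 26, 11, 0),   # 80 minutes after the previous view
    datetime(2019, 6, 26, 11, 5),
]
sessions = split_sessions(views)
features = [session_features(s) for s in sessions]
```

Here the 80-minute pause separates the five views into one session of three page views and one of two.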
See this worklog for an analysis of the effectiveness of our approach for reconstructing reader sessions. For most features that are averaged across a reading session, we also compute the entropy of that value (i.e. a measure of how uniform the session is). For example: did an individual read articles with consistent numbers of page views, or a mix of articles with many page views and more niche articles with far fewer? Note that many of the article features rely on Wikidata -- without its interlanguage links and structured properties, we would be unable to do many of these inherently multilingual analyses.
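The text does not say exactly how the entropy of a session feature is computed. One plausible reading, sketched below as an assumption, treats each article's page views as a share of the session total and takes the Shannon entropy of those shares, normalized by the maximum possible entropy so that 1.0 means perfectly uniform. The function name `normalized_entropy` and the example numbers are hypothetical.

```python
import math


def normalized_entropy(values):
    """Shannon entropy of the shares values[i] / sum(values), scaled to [0, 1].

    Returns 0.0 for sessions with fewer than two articles, where uniformity
    is not meaningful (a convention chosen for this sketch).
    """
    positive = [v for v in values if v > 0]
    if len(positive) < 2:
        return 0.0
    total = sum(positive)
    entropy = -sum((v / total) * math.log2(v / total) for v in positive)
    return entropy / math.log2(len(positive))


# Hypothetical sessions: page views of each article the reader visited.
consistent_session = [1000, 1000, 1000, 1000]  # similar popularity throughout
mixed_session = [100000, 50, 20, 10]           # one very popular, three niche

uniform_score = normalized_entropy(consistent_session)
mixed_score = normalized_entropy(mixed_session)
```

Under this definition the consistent session scores 1.0, while the session that mixes one very popular article with niche ones scores close to 0.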
The bar charts below show the debiased results for all questions along with 99% confidence intervals. The results for the information-need-related questions below largely match those from the 2017 surveys where there was language overlap between the two surveys.
We see substantial variation by language around prior knowledge before reading an article. At the extremes, about 80% of Hungarian Wikipedia readers indicate that they are already familiar with the topic that they are reading, while less than half of Chinese Wikipedia readers indicate that they are familiar.
We again see substantial variation by language with the highest proportion of respondents for each category being 58% of readers for an "overview" in Hebrew Wikipedia, 46% of readers for an "in-depth" read in Persian Wikipedia, and 44% of readers for a "fact" in Norwegian Wikipedia.
Intrinsic learning remains the primary motivation of readers. In a few languages (English, German, Norwegian), media is also a primary motivation. Note that school was in session for some countries at the end of June (notably more so in the Southern Hemisphere) but not for others, so differences in work/school between languages should be interpreted with care.
Readers under the age of 25 are the most prevalent population. The notable exceptions are Norwegian and German, where the age distribution is much more uniform. Keep in mind that whether school was in session might affect the age of readers -- this is especially pertinent when considering the results for Hebrew and Spanish Wikipedias.
Readers across all languages skew towards identifying as men. There is substantial variation though, with readers in language communities like Romanian being quite close to gender parity while much larger gaps are seen in languages like Persian or Norwegian.
Readers (over the age of 18) most often have between 13-16 years of education, which generally would be interpreted as some amount of college. Hebrew is an exception here, which is likely explained by compulsory military service for most individuals over the age of 18 in Israel.
Most readers are from urban areas, though again German and Norwegian are slightly more balanced in this regard.
For many Wikipedias, the vast majority of readers -- e.g., over 95% -- include that language as one of their native languages. The exceptions are English and French, which are non-native languages for many readers in Africa but still the main Wikipedia language edition to which they turn. Anecdotally, there were many readers from Africa who listed native languages that do not even have a Wikipedia edition. Care should also be taken in the interpretation of Chinese Wikipedia, which has some interesting language adaptations.
Reader Behavior Analyses
We also correlated the responses from the information need and demographic questions above with various metrics related to reader behavior (see Debiasing / Analysis Features). From this, we note a few consistent trends across most of the languages surveyed:
- Men generate more pageviews per reading session than women:
|Survey||Average # pageviews per session (Men)||Average # pageviews per session (Women)|
|Arabic (ar)||2.465 [2.350-2.614]||1.862 [1.753-2.007]|
|German (de)||3.935 [3.128-5.513]||2.127 [1.915-2.392]|
|English (en -- Worldwide)||2.853 [2.706-3.046]||2.355 [2.166-2.598]|
|English (en -- Africa)||2.424 [2.304-2.544]||2.122 [1.997-2.337]|
|Spanish (es)||2.791 [2.533-3.256]||2.181 [1.964-2.668]|
|Persian (fa)||2.705 [2.575-2.884]||2.188 [2.029-2.398]|
|French (fr -- Worldwide)||2.831 [2.600-3.068]||2.068 [1.887-2.354]|
|French (fr -- Africa)||2.064 [1.945-2.204]||1.897 [1.774-2.061]|
|Hebrew (he)||2.234 [1.928-2.543]||1.595 [1.405-1.867]|
|Hungarian (hu)||2.357 [2.125-2.710]||1.836 [1.604-2.160]|
|Norwegian (no)||2.431 [2.071-3.287]||1.851 [1.574-2.203]|
|Polish (pl)||2.294 [2.067-2.589]||2.021 [1.734-2.359]|
|Romanian (ro)||2.300 [2.012-2.803]||1.783 [1.636-1.972]|
|Russian (ru)||2.651 [2.503-2.825]||2.050 [1.938-2.191]|
|Ukrainian (uk)||2.766 [2.410-3.292]||1.862 [1.666-2.106]|
|Chinese (zh)||3.068 [2.798-3.360]||2.406 [2.178-2.808]|
- For most topics, men and women show equal interest (i.e. are equally likely to read an article about the topic). There are topics that skew more heavily towards readers who are men (Sports, Military History) or women (Medicine, Entertainment). The table shows the main article topics and for how many surveys we found a significantly higher likelihood of a man viewing the topic, woman viewing the topic, or no significant difference between men and women:
|Topic||Skews Men||Skews Women||Balanced|
|History and Society—Transportation||11||0||5|
|History and Society—Military and warfare||10||0||6|
|History and Society—Business and economics||9||0||7|
|Culture—Games and toys||5||0||11|
|History and Society—History and society||1||4||11|
|History and Society—Politics and government||3||0||13|
|Culture—Food and drink||1||1||14|
|Culture—Philosophy and religion||0||0||16|
|History and Society—Education||0||0||16|
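The pageviews-per-session table above reports each average with a bracketed range, and the text does not say how those ranges were computed. As a purely illustrative sketch, the code below assumes a percentile bootstrap, which is one common way to get an interval around a mean for skewed data such as session lengths. The function name `bootstrap_mean_ci`, the 99% level, the number of resamples, and the sample data are all assumptions, not the project's method.

```python
import random
from statistics import mean


def bootstrap_mean_ci(values, level=0.99, n_resamples=2000, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lower = resampled_means[int(alpha * n_resamples)]
    upper = resampled_means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return mean(values), lower, upper


# Hypothetical pageviews-per-session values for one group of respondents.
pageviews = [1, 1, 2, 1, 3, 2, 1, 5, 2, 1, 4, 1, 2, 8, 1, 2, 3, 1, 1, 2]
point, lower, upper = bootstrap_mean_ci(pageviews)
print(f"{point:.3f} [{lower:.3f}-{upper:.3f}]")
```

Because a few long sessions pull the mean upward, intervals built this way are often asymmetric around the point estimate.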
October 2019 Validation Check
From 26 September 2019 to 30 October 2019, surveys were deployed in Russian, Polish, and English Wikipedia. The goal was to determine whether a month-long survey reached a different reader population -- namely less-frequent readers. It also provides data at a different time of year, which allows us to test how seasonality appears to affect the results. See task T232525 for greater context. After cleaning and removing responses from individuals under the age of 18, we achieved the following response counts:
|Survey||Response Count||Countries with at least 500 responses|
|English (en)||1704||United States|
For most metrics, no significant differences were seen between the week-long June and month-long October results for Russian and English. In particular, the gender results were nearly identical. The exceptions related to students now being in school in much of the Russian- and English-speaking world: namely an increase in individuals under the age of 18 (from 16% to 22% in English and from 22% to 27% in Russian) and a corresponding uptick in work/school as a motivation, as well as upticks in in-depth information needs and intrinsic learning as a motivation in English. The results for Polish Wikipedia are included above with the June results because no major differences were seen based on when the survey was deployed.
Despite the (many) gaps that we see, there is an incredible diversity of backgrounds that readers bring to Wikipedia. On one hand, we see language editions like French or English where, due to colonialism, there are many countries for which English or French is a second language and therefore many readers are non-native speakers. On the other hand, we see for languages like Polish or Norwegian that the readership is largely focused in a single country and with readers who are native speakers.
Pageviews as a proxy for demand
We found that pageviews tend to be even more imbalanced in who generates them than the underlying reader population -- i.e. we estimate that 67% of Wikipedia readers on any given day are men but that 72% of pageviews on Wikipedia are generated by men. While this is still more balanced than the editor population, it should make us cautious about using pageviews as a pure proxy for reader demand -- i.e. they reflect demand from the existing population of readers, not necessarily from all the people that we hope will read Wikipedia. Looking at some of the more popular articles from the week in which the survey was run (as measured by aggregate views to articles with the same Wikidata ID), it's clear that some articles are popular because they appeal broadly across readers, while others are popular across many languages but largely just among men:
- Chernobyl disaster and Chernobyl (miniseries) were the 1st- and 13th-most-viewed articles by the survey respondents and were generally popular in every language surveyed and had about 69% of their pageviews coming from men. Other articles with pageviews that had similar breakdowns were Billie Eilish, Elizabeth II, and the Solar eclipse of July 2, 2019.
- 2019 Africa Cup of Nations, 2019 Copa América, and 2019 FIFA Women's World Cup were the 2nd-, 4th-, and 7th-most-viewed articles but about 85% of their pageviews came from men and they were still popular in pretty much every language surveyed. The same was true about G20 as the 3rd-most-viewed article, but obviously it has a very different topic and potential impact than the articles about soccer tournaments. Many other articles in the top-50-most-viewed have over 80% male readership.
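The gap between the share of readers who are men (67%) and the share of pageviews generated by men (72%) follows from men generating more pageviews per session. The arithmetic can be sketched as below. This is a rough illustration, not the estimate used in the analysis: it assumes one session per reader per day, uses only the English (en -- Worldwide) row of the pageviews-per-session table, and the function name `pageview_share` is hypothetical.

```python
def pageview_share(reader_share, views_group, views_other):
    """Share of all pageviews generated by a group that makes up
    `reader_share` of readers and averages `views_group` pageviews each,
    while all other readers average `views_other`."""
    group_views = reader_share * views_group
    other_views = (1.0 - reader_share) * views_other
    return group_views / (group_views + other_views)


# 67% of readers are men (from the text); 2.853 and 2.355 are the average
# pageviews per session for men and women in the English (Worldwide) survey.
# Treating one session as one reader-day is an assumption of this sketch.
share_men = pageview_share(0.67, 2.853, 2.355)
print(f"{share_men:.1%}")
```

With these inputs the share comes out at roughly 71%, close to the 72% reported, which shows how a modest difference in session length widens the imbalance in pageviews.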
We see a clear self-focus bias amongst readers -- i.e. people read about content that is related to their identity or context. For instance, women are more likely than men to read biographies of women (and vice versa). People read articles about places near to them. Younger readers are more likely to read about younger people. While we cannot establish any causal pathways from these surveys, these findings do indicate that a Wikipedia with more diverse content will support a more diverse readership. That is, on principle we should be working to reduce knowledge gaps in content on Wikipedia (e.g., gender gaps), but doing so very possibly also has positive long-term effects in bringing in a more diverse population of readers (and by extension, hopefully editors).
Pipeline of participation inequality
Shaw and Hargittai propose a pipeline of participation inequality on Wikipedia, where inequalities such as the gender gap amongst editors on Wikipedia actually arise at various stages that can be viewed as prerequisites to editing -- e.g., access to internet, awareness of Wikipedia, reader of Wikipedia, knowing that Wikipedia can be edited, and only then in editing Wikipedia. This survey provided strong empirical evidence supporting this model and showing in which regions addressing the gender gap amongst editors actually requires focusing on readers first. As part of understanding the apparent discrepancies between our results and past surveys, these surveys also highlight the importance of not grouping frequent readers in with infrequent readers of Wikipedia. For instance, data from various surveys and studies has demonstrated that in many places there is no gender gap when you ask "Do you read Wikipedia?" but one does appear when you ask "Did you read Wikipedia yesterday?". The readership data from these surveys appears to be reflective of people who read Wikipedia on any given day -- i.e. more like "Did you read Wikipedia yesterday?" than "Do you read Wikipedia?".
- Wikimania presentation (August 2019)
- Wikimedia Research Showcase presentation (November 2019)
- Paper on gender results (July 2020)
- Hale, Scott A. (2014). "Multilinguals and Wikipedia Editing". Proceedings of the 2014 ACM conference on Web science - WebSci '14: 99–108. doi:10.1145/2615569.2615684.
- Reading sessions are delineated by 1 hour of inactivity per: Geiger, R.S.; Halfaker, A. (2013). "Using Edit Sessions to Measure Participation in Wikipedia" (PDF). Proceedings of the 2013 ACM Conference on Computer Supported Cooperative Work (ACM).
- Shaw, Aaron; Hargittai, Eszter (1 February 2018). "The Pipeline of Online Participation Inequalities: The Case of Wikipedia Editing". Journal of Communication 68 (1): 143–168. ISSN 0021-9916. doi:10.1093/joc/jqx003.
- Hinnosaar, Marit (26 April 2019). "Gender Inequality in New Media: Evidence from Wikipedia". Journal of Economic Behavior and Organization (Social Science Research Network) 163: 262–276. Retrieved 5 August 2020.
- Zickuhr, Kathryn; Rainie, Lee (13 January 2011). "Wikipedia, past and present". Pew Research Center: Internet, Science & Tech. Retrieved 5 August 2020.