Research:Characterizing Wikipedia Reader Behaviour/Demographics and Wikipedia use cases

The overall goal of this research project is to better understand the motivations and behavior of different subpopulations of readers. We are focusing on demographic groups that have been associated with differences in either behavior or awareness of Wikipedia. There are two core components to this research: a survey on the demographics and motivations of Wikipedia readers on different projects, and an analysis of reader behavior. Combining these two approaches allows us to better understand how the experiences of different subpopulations of readers overlap or diverge and where we can focus our efforts to improve this experience.

Reader Surveys

We developed a survey to understand how reader motivation varies across different demographic groups. Many of the questions that we ask are similar to those asked by the Global Reach team through phone surveys. These past surveys have been very informative regarding what populations are not reaching Wikipedia. Our reader surveys complement this past work by helping us understand the needs of the readers who do reach Wikipedia. We asked about the following attributes:

Age: which category an individual falls into -- for example, age 18-24. Age has been shown to be linked to internet skills, with older adults being less likely to use Wikipedia.
Gender: man, woman, open-ended, or prefer not to say. There are well-known gender disparities in content, as well as readership in certain regions.
Education: how many years of education have you completed. The aim is to understand how education, which correlates with internet skills, affects behavior.
Geographic Region: spectrum between rural and urban. Rural regions tend to have much lower-quality content and lower readership.
Native Language: what is your native language(s). Individuals fill different roles and have access to different content across language editions.^[1]
Motivation: same questions as prior surveys.

English-language demographics survey questions

Are you at least 18 years of age?
¤ Yes
¤ No

<three motivation questions from previous surveys>

Tell us about yourself

What is your age?
¤ 18-24 years
¤ 25-29 years
¤ 30-39 years
¤ 40-49 years
¤ 50-59 years
¤ 60 years and older
¤ Prefer not to say

What is your gender?
¤ Woman
¤ Man
¤ Prefer not to say
¤ Other... <open-text>

How many years (full-time equivalent) have you been in formal education? Include all primary and secondary schooling, university and other post-secondary education, and full-time vocational training, but do not include repeated years. If you are currently in education, count the number of years you have completed so far.
¤ I have no formal schooling
¤ 1-6 years
¤ 7 years
¤ 8 years
¤ 9 years
¤ 10 years
¤ 11 years
¤ 12 years
¤ 13 years
¤ 14 years
¤ 15 years
¤ 16 years
¤ 17 years
¤ 18 years
¤ >18 years
¤ Prefer not to say

Would you describe the place where you live as....
¤ A farm or home in the country
¤ A country village
¤ A small city or town
¤ The suburbs or outskirts of a big city
¤ A big city
¤ Prefer not to say

What is your native language?
<list of Wikipedia languages in their native script>

What is your second native language?
¤ I do not have a second native language
¤ Other... <open-text>

Pilot Results

Main article: Pilot

Survey Results

The survey was run in 13 languages from 26 June 2019 to 1 July 2019 (see task T212444 for technical details). See this worklog for a description of how we selected languages. After cleaning and removing responses from individuals under the age of 18, the surveys had the following response counts (Polish is included in this table, but see October 2019 Validation Check for details):

Survey	Response Count	Countries with at least 500 responses
Arabic (ar)	7741	Saudi Arabia, Egypt, Iraq
German (de)	4144	Germany
English (en -- Worldwide)	6181	United States, India
English (en -- Africa)	8043	South Africa, Nigeria, Kenya, Egypt
Spanish (es)	11897	Spain, Mexico, Argentina, Colombia, Peru, Chile
Persian (fa)	7036	Iran
French (fr -- Worldwide)	4401	France
French (fr -- Africa)	3122	Morocco, Algeria
Hebrew (he)	586	Israel
Hungarian (hu)	1216	Hungary
Norwegian (no)	737	Norway
Polish (pl)	688	Poland
Romanian (ro)	1336	Romania
Russian (ru)	4565	Russia, Ukraine
Ukrainian (uk)	1148	Ukraine
Chinese (zh)	2190	Taiwan

Debiasing / Analysis Features

Along with responses to the survey questions, we connect the survey responses with the following data:

Request (contextual) features: country, continent, day of week, time of day
Article demand: average page views, average number of sitelinks (languages in which the article appears)
Article topic: proportion of reader's page views that went to biographies of men, biographies of women, articles with coordinates (geolocated), articles with a point-in-time, whether the reader is in the same country as the coordinates of the article, article's instance-of property (aggregated to one of several superclasses)
Article quality (based on ORES): article length, infonoise (ratio of parsed text to wikitext), number of level-two headings, number of level-three or greater headings, number of templates, number of ref tags, number of wikilinks, number of external links
Session: session length (time), session length (number of page views), average time between page views, initial referer class for session (external, internal, unknown), where in the session the survey was taken, number of unique Wikipedia languages visited, whether the reader was signed in during the session, whether the reader viewed a Main Page

See this worklog for an analysis of the effectiveness of our approach for reconstructing reader sessions. For most features that are an average across a reading session, we also compute the entropy of that value (i.e. a measure of how uniform the session is). For example: did an individual read articles that had consistent numbers of page views, or some articles that had a lot of page views and some articles that were more niche and had many fewer page views? Note that many of the article features rely on Wikidata -- without these interlanguage links and structured properties, we would be unable to do many of these inherently multilingual analyses.
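As a rough illustration of the entropy feature described above, the following sketch treats the page-view counts of the articles in one session as weights, normalises them into a distribution, and computes the Shannon entropy. This is only one plausible reading: the exact definition used in the analysis is not given here, and the function names and example counts are hypothetical.

```python
import math


def shannon_entropy(weights):
    """Shannon entropy (in bits) of non-negative weights normalised to sum to 1."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)


def normalized_entropy(weights):
    """Entropy divided by its maximum, so 1.0 means a perfectly uniform session."""
    n = sum(1 for w in weights if w > 0)
    if n <= 1:
        return 0.0
    return shannon_entropy(weights) / math.log2(n)


# Hypothetical sessions: average page views of each article the reader visited.
uniform_session = [1000, 1000, 1000, 1000]   # consistently popular articles
mixed_session = [100000, 50, 20, 10]         # one very popular article, three niche ones

print(normalized_entropy(uniform_session))   # uniform session, at the maximum
print(normalized_entropy(mixed_session))     # mixed session, close to zero
```

Under this reading, a session of similarly popular articles scores near 1 and a session that mixes one blockbuster article with niche ones scores near 0.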

Results

The bar charts below show the debiased results for all questions along with 99% confidence intervals. The results for the information-need-related questions below largely match those from the 2017 surveys where there was language overlap between the two surveys.
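The page does not spell out how the debiasing weights or the 99% confidence intervals were computed. As a minimal sketch under stated assumptions, suppose each response carries a debiasing weight and the interval is obtained with a percentile bootstrap. The weights, answers, and function names below are hypothetical, and the bootstrap is a stand-in technique rather than the method used in the analysis.

```python
import random


def weighted_share(answers, weights, category):
    """Weighted proportion of responses that chose `category`."""
    total = sum(weights)
    hits = sum(w for a, w in zip(answers, weights) if a == category)
    return hits / total


def bootstrap_ci(answers, weights, category, n_boot=2000, level=0.99, seed=0):
    """Percentile bootstrap interval for the weighted share (assumed method)."""
    rng = random.Random(seed)
    n = len(answers)
    estimates = []
    for _ in range(n_boot):
        # Resample respondents with replacement, keeping each answer with its weight.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(
            weighted_share([answers[i] for i in idx], [weights[i] for i in idx], category)
        )
    estimates.sort()
    alpha = (1.0 - level) / 2.0
    lo = estimates[int(alpha * (n_boot - 1))]
    hi = estimates[int((1.0 - alpha) * (n_boot - 1))]
    return lo, hi


# Hypothetical toy data: 100 responses to the information-depth question.
answers = ["overview"] * 60 + ["fact"] * 25 + ["in-depth"] * 15
# Hypothetical weights: "fact" respondents are assumed under-sampled and weighted up.
weights = [1.0] * 60 + [2.0] * 25 + [1.0] * 15

point = weighted_share(answers, weights, "overview")
low, high = bootstrap_ci(answers, weights, "overview")
print(f"overview: {point:.2f} [{low:.2f}-{high:.2f}]")
```

The unweighted share of "overview" in this toy data is 0.60, while the weighted share is 0.48, which shows how reweighting can shift an estimate.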

Prior Knowledge

We see substantial variation by language around prior knowledge before reading an article. At the extremes, about 80% of Hungarian Wikipedia readers indicate that they are already familiar with the topic that they are reading, while less than half of Chinese Wikipedia readers indicate that they are familiar.

Information Depth

We again see substantial variation by language with the highest proportion of respondents for each category being 58% of readers for an "overview" in Hebrew Wikipedia, 46% of readers for an "in-depth" read in Persian Wikipedia, and 44% of readers for a "fact" in Norwegian Wikipedia.

Motivation

Intrinsic learning remains the primary motivation of readers. In a few languages (English, German, Norwegian), media is also a primary motivation. Note that school was in session in some countries at the end of June (notably more so in the Southern Hemisphere) but not in others, so differences in work/school motivation between languages should be interpreted with care.

Age

Readers under the age of 25 are the most prevalent population. The notable exceptions to this are for Norwegian and German, where the age distribution is much more uniform. Keep in mind that whether school was in session might affect age of readers -- this is especially pertinent when considering the results for Hebrew and Spanish Wikipedias.

Gender

Readers across all languages skew towards identifying as men. There is substantial variation though, with readers in language communities like Romanian being quite close to gender parity while much larger gaps are seen in languages like Persian or Norwegian.

Education

Readers (over the age of 18) most often have between 13 and 16 years of education, which generally would be interpreted as some amount of college. Hebrew is an exception here, which is likely explained by compulsory military service for most individuals over the age of 18 in Israel.

Locale

Most readers are from urban areas, though again German and Norwegian are slightly more balanced in this regard.

Native Language

For many Wikipedias, the vast majority of readers -- e.g., over 95% -- include that language as one of their native languages. The exceptions are English and French, which are non-native languages for many readers in Africa but still the main Wikipedia language edition to which they turn. Anecdotally, there were many readers from Africa who listed native languages that do not even have a Wikipedia edition. Care should also be taken in the interpretation of Chinese Wikipedia, which has some interesting language adaptations.

Reader Behavior Analyses

We also correlated the responses from the information need and demographic questions above with various metrics related to reader behavior (see Debiasing / Analysis Features). From this, we note a few consistent trends across most of the languages surveyed:

Men generate more pageviews per reading session than women:^[2]

Survey	Average # pageviews per session (Men)	Average # pageviews per session (Women)
Arabic (ar)	2.465 [2.350-2.614]	1.862 [1.753-2.007]
German (de)	3.935 [3.128-5.513]	2.127 [1.915-2.392]
English (en -- Worldwide)	2.853 [2.706-3.046]	2.355 [2.166-2.598]
English (en -- Africa)	2.424 [2.304-2.544]	2.122 [1.997-2.337]
Spanish (es)	2.791 [2.533-3.256]	2.181 [1.964-2.668]
Persian (fa)	2.705 [2.575-2.884]	2.188 [2.029-2.398]
French (fr -- Worldwide)	2.831 [2.600-3.068]	2.068 [1.887-2.354]
French (fr -- Africa)	2.064 [1.945-2.204]	1.897 [1.774-2.061]
Hebrew (he)	2.234 [1.928-2.543]	1.595 [1.405-1.867]
Hungarian (hu)	2.357 [2.125-2.710]	1.836 [1.604-2.160]
Norwegian (no)	2.431 [2.071-3.287]	1.851 [1.574-2.203]
Polish (pl)	2.294 [2.067-2.589]	2.021 [1.734-2.359]
Romanian (ro)	2.300 [2.012-2.803]	1.783 [1.636-1.972]
Russian (ru)	2.651 [2.503-2.825]	2.050 [1.938-2.191]
Ukrainian (uk)	2.766 [2.410-3.292]	1.862 [1.666-2.106]
Chinese (zh)	3.068 [2.798-3.360]	2.406 [2.178-2.808]
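Per footnote 2, reading sessions are delineated by one hour of inactivity. A minimal sketch of that rule, and of the pageviews-per-session metric reported in the table, might look as follows. The function names and timestamps are hypothetical, and treating a gap of exactly one hour as part of the same session is an assumption.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(hours=1)  # inactivity cutoff stated in footnote 2


def split_sessions(timestamps, gap=SESSION_GAP):
    """Group page-view timestamps into sessions separated by more than `gap`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # inactivity exceeded the cutoff: new session
    return sessions


def mean_pageviews_per_session(timestamps, gap=SESSION_GAP):
    """Average number of page views per reconstructed session."""
    sessions = split_sessions(timestamps, gap)
    if not sessions:
        return 0.0
    return len(timestamps) / len(sessions)


# Hypothetical page views for a single reader on one day.
views = [
    datetime(2019, 6, 26, 9, 0),
    datetime(2019, 6, 26, 9, 10),
    datetime(2019, 6, 26, 9, 50),
    datetime(2019, 6, 26, 11, 30),  # 1h40 after the previous view: new session
    datetime(2019, 6, 26, 11, 35),
]

print(len(split_sessions(views)))          # 2
print(mean_pageviews_per_session(views))   # 2.5
```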

For most topics, men and women show equal interest (i.e. are equally likely to read an article about the topic). There are topics that skew more heavily towards readers who are men (Sports, Military History) or women (Medicine, Entertainment). The table shows the main article topics and for how many surveys we found a significantly higher likelihood of a man viewing the topic, woman viewing the topic, or no significant difference between men and women:

Topic	Skews Men	Skews Women	Balanced
Culture—Sports	12	0	4
STEM—Technology	12	0	4
History and Society—Transportation	11	0	5
History and Society—Military and warfare	10	0	6
Culture—Broadcasting	0	10	6
STEM—Biology	0	9	7
STEM—Medicine	0	9	7
History and Society—Business and economics	9	0	7
Culture—Entertainment	0	5	11
Culture—Games and toys	5	0	11
History and Society—History and society	1	4	11
STEM—Physics	4	0	12
STEM—Space	4	0	12
Culture—Music	0	3	13
Culture—Performing arts	0	3	13
Geography—Africa	3	0	13
Geography—Americas	2	1	13
Geography—Asia	1	2	13
History and Society—Politics and government	3	0	13
Culture—Food and drink	1	1	14
Culture—Biography	1	1	14
Geography—Europe	2	0	14
STEM—Geosciences	2	0	14
STEM—Science	2	0	14
Culture—Plastic arts	0	1	15
Geography—Oceania	0	1	15
STEM—Chemistry	0	1	15
Culture—Internet culture	0	0	16
Culture—Philosophy and religion	0	0	16
Culture—Visual arts	0	0	16
History and Society—Education	0	0	16
STEM—Mathematics	0	0	16

October 2019 Validation Check

From 26 September 2019 to 30 October 2019, surveys were deployed in Russian, Polish, and English Wikipedia. The goal was to determine whether a month-long survey reached a different reader population -- namely less-frequent readers. It also provides data at a different time of year, which allows us to test how seasonality affects the results. See task T232525 for greater context. After cleaning and removing responses from individuals under the age of 18, we achieved the following response counts:

Survey	Response Count	Countries with at least 500 responses
English (en)	1704	United States
Polish (pl)	688	Poland
Russian (ru)	1130	Russia

For most metrics, no significant differences were seen between the week-long June and month-long October results for Russian and English. In particular, the gender results were nearly identical. The exceptions relate to students being back in school in much of the Russian- and English-speaking world: an increase in individuals under the age of 18 (from 16% to 22% in English and from 22% to 27% in Russian) and a corresponding uptick in work/school as a motivation, as well as an uptick in in-depth information and in intrinsic learning as a motivation in English. The results for Polish Wikipedia are included above with the June results because no major differences were seen based on when the survey was deployed.

Takeaways

Reader diversity

Despite the (many) gaps that we see, there is an incredible diversity of backgrounds that readers bring to Wikipedia. On one hand, we see language editions like French or English where, due to colonialism, there are many countries for which English or French is a second language and therefore many readers are non-native speakers. On the other hand, we see for languages like Polish or Norwegian that the readership is largely focused in a single country and with readers who are native speakers.

Pageviews as a proxy for demand

We found that pageviews tend to be even more imbalanced in who generates them than the underlying reader population -- i.e. we estimate that 67% of readers of Wikipedia are men on any given day but 72% of pageviews on Wikipedia are generated by men. While this is still more balanced than the editor population, it should make us cautious about using pageviews as a pure proxy for reader demand -- i.e. they reflect reader demand from the existing population of readers, not necessarily all the people that we hope to read Wikipedia. Looking at some of the more popular articles from the week in which the survey was run (as measured by aggregate views to articles with the same Wikidata ID), it's clear that some articles are popular across the whole readership, while others are popular in nearly every language but largely just among men:

Chernobyl disaster and Chernobyl (miniseries) were the 1st- and 13th-most-viewed articles by the survey respondents and were generally popular in every language surveyed and had about 69% of their pageviews coming from men. Other articles with pageviews that had similar breakdowns were Billie Eilish, Elizabeth II, and the Solar eclipse of July 2, 2019.
2019 Africa Cup of Nations, 2019 Copa América, and 2019 FIFA Women's World Cup were the 2nd-, 4th-, and 7th-most-viewed articles but about 85% of their pageviews came from men and they were still popular in pretty much every language surveyed. The same was true about G20 as the 3rd-most-viewed article, but obviously it has a very different topic and potential impact than the articles about soccer tournaments. Many other articles in the top-50-most-viewed have over 80% male readership.
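The gap between the share of readers and the share of pageviews follows from men generating more pageviews per session. As a back-of-the-envelope illustration only, the sketch below combines the 67% reader share with the English (Worldwide) pageviews-per-session averages from the table above. It assumes one session per reader per day and only two groups, which is a simplification and not the estimation procedure used for the 72% figure.

```python
def pageview_share(reader_share, views_per_session_group, views_per_session_other):
    """Share of pageviews generated by a group, given its share of readers
    and the average pageviews per session of the group and of everyone else."""
    group_views = reader_share * views_per_session_group
    other_views = (1.0 - reader_share) * views_per_session_other
    return group_views / (group_views + other_views)


# 67% of readers are men; 2.853 and 2.355 are the English (Worldwide)
# averages for men and women from the pageviews-per-session table.
share = pageview_share(0.67, 2.853, 2.355)
print(f"{share:.3f}")  # 0.711
```

Even this simplified calculation moves the share from 67% of readers to roughly 71% of pageviews, in line with the direction of the reported estimate.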

Representation matters

We see a clear self-focus bias amongst readers -- i.e. people read about content that is related to their identity or context. For instance, women are more likely than men to read biographies of women (and vice versa). People read articles about places near to them. Younger readers are more likely to read about younger people. While we cannot establish any causal pathways from these surveys, these findings do indicate that a Wikipedia with more diverse content will support a more diverse readership. That is, on principle we should be working to reduce knowledge gaps in content on Wikipedia (e.g., gender gaps), but it very possibly (in the long-term) also has positive effects in bringing in a more diverse population of readers (and by extension, hopefully editors).

Pipeline of participation inequality

Shaw and Hargittai^[3] propose a pipeline of participation inequality on Wikipedia, where inequalities such as the gender gap amongst editors on Wikipedia actually arise at various stages that can be viewed as prerequisites to editing -- e.g., having access to the internet, being aware of Wikipedia, reading Wikipedia, knowing that Wikipedia can be edited, and only then editing Wikipedia. This survey provided strong empirical evidence supporting this model and showing in which regions addressing the gender gap amongst editors actually requires focusing on readers first. As part of understanding the apparent discrepancies between our results and past surveys, these surveys also highlight the importance of not grouping frequent readers in with infrequent readers of Wikipedia. For instance, data from various surveys and studies^[4]^[5] has demonstrated that in many places there is no gender gap when you ask "Do you read Wikipedia?" but one does appear when you ask "Did you read Wikipedia yesterday?". The readership data from these surveys appears to be reflective of people who read Wikipedia on any given day -- i.e. more like "Did you read Wikipedia yesterday?" than "Do you read Wikipedia?".

References

[1] Hale, Scott A. (2014). "Multilinguals and Wikipedia Editing". Proceedings of the 2014 ACM conference on Web science - WebSci '14: 99–108. doi:10.1145/2615569.2615684.
[2] Reading sessions are delineated by 1 hour of inactivity per: Geiger, R.S.; Halfaker, A. (2013). "Using Edit Sessions to Measure Participation in Wikipedia" (PDF). Proceedings of the 2013 ACM Conference on Computer Supported Cooperative Work (ACM).
[3] Shaw, Aaron; Hargittai, Eszter (1 February 2018). "The Pipeline of Online Participation Inequalities: The Case of Wikipedia Editing". Journal of Communication 68 (1): 143–168. ISSN 0021-9916. doi:10.1093/joc/jqx003.
[4] Hinnosaar, Marit (26 April 2019). "Gender Inequality in New Media: Evidence from Wikipedia". Journal of Economic Behavior and Organization (Social Science Research Network) 163: 262–276. Retrieved 5 August 2020.
[5] Zickuhr, Kathryn; Rainie, Lee (13 January 2011). "Wikipedia, past and present". Pew Research Center: Internet, Science & Tech. Retrieved 5 August 2020.
