Research talk:Characterizing Wikipedia Reader Behaviour/Demographics and Wikipedia use cases/Work log/2019-05-21

From Meta, a Wikimedia project coordination wiki

Tuesday, May 21, 2019


The survey contains a locale question (listed below) intended to capture whether respondents live in urban or rural areas. The question is included because locale has been found to be related to awareness and also gives a sense of what services might be available to respondents. During translation piloting for the survey, several issues with the question and its answer options were identified:

  • Mismatch: both descriptions (e.g., metropolitan area) and numeric guides (e.g., over 1,000,000 people) were given and the labels did not match the numbers in certain countries.
  • Ambiguity: for someone in a suburb, it was not clear whether they should answer as if they were part of the larger metropolitan area or just their smaller area.
  • Difficulty of estimating: many people do not know the population of their region and would have a hard time categorizing it exactly.
  • Difficulty of translation: no translations existed for this question, while related questions from GESIS did have existing translations in many languages.
Original question:

Which of the following best describes the area in which you live in?
¤ A rural area (fewer than 3,000 people)
¤ A small town (3,000 to 15,000 people)
¤ A town (15,000 to 100,000 people)
¤ A city (100,000 to 1,000,000 people)
¤ A metropolitan area (over 1,000,000 people)
¤ Prefer not to say

As a result, we undertook a small analysis (code) to determine whether the responses from the pilot experiment could be accurately predicted through IP address geolocation information provided by MaxMind. The basic process is as follows:

  1. Match each survey respondent's answer to the locale question to their location based on EventLogging.
  2. Load in Geonames' database of place names, locations, and populations from allCountries.txt
  3. Map each IP-based place name to a Geonames place, requiring that the Geonames place lie within some threshold (e.g., 25 km) of the IP coordinates; if no place name matches, fall back to the coordinates alone.
  4. Convert the population numbers to the labels (e.g., metropolitan area) provided in the survey using the ranges provided in the question.
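
Steps 3 and 4 above can be sketched roughly as follows. This is a minimal illustration, not the analysis code itself: the record shape (`lat`, `lon`, `population` keys), the function names, and the handling of boundary values are all assumptions. In particular, the question's ranges overlap at their endpoints (e.g., 3,000 appears in two options), so the sketch assigns each boundary value to the larger category, and it treats a missing or zero population as "No data". The coordinates-only fallback from step 3 is omitted.

```python
import math

MAX_DISTANCE_KM = 25.0  # example threshold from step 3

# (exclusive upper bound, label); boundary handling is an assumption,
# since the survey ranges overlap at their endpoints.
LOCALE_BINS = [
    (3_000, "A rural area"),
    (15_000, "A small town"),
    (100_000, "A town"),
    (1_000_000, "A city"),
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points (degrees)."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def population_to_label(population):
    """Convert a population count to one of the survey's locale labels."""
    if population is None or population <= 0:
        return "No data"  # assumption: zero or missing means unknown
    for upper_bound, label in LOCALE_BINS:
        if population < upper_bound:
            return label
    return "A metropolitan area"


def label_for_ip(ip_lat, ip_lon, place):
    """Label an IP location using a matched place record (hypothetical dict shape)."""
    if place is None:
        return "No data"
    distance = haversine_km(ip_lat, ip_lon, place["lat"], place["lon"])
    if distance > MAX_DISTANCE_KM:
        return "No data"  # matched place is too far from the IP coordinates
    return population_to_label(place["population"])
```

For example, an IP located in Minneapolis and matched to a place record with a population of 400,000 would be labeled "A city" under these bins.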


Recall of IP -> Population Process

The Geonames-based process had high recall. Of the approximately 800K devices that potentially saw the survey:

  • 90% were successfully matched to a place with population data
  • 5% could not be matched to a place (mostly because MaxMind failed to match the IP address to a location)
  • 4% were matched to a place that lacked population data
  • 1% were matched to a place, but the lat-lon point associated with the IP address was more than 25 km from the location of the city

Alignment between IP estimates and self-reports

In the table below, each row is a self-reported locale and each column is the locale inferred from IP geolocation. Each cell gives the proportion of respondents in that row who were geolocated to that column: a cell value of 0.36 means that 36% of the people who self-reported that row's value were geolocated to that column's value. For example, a user who self-reported "A metropolitan area" and was IP-geolocated to "A city" is counted in the second cell of the first row (0.16). The diagonal is where self-report matches IP-based geolocation. For people who self-report a metropolitan area, 51% are also located to a metropolitan area based on their IP address (and another 16% are geolocated to cities). For people who self-report a rural area, however, 27% would be identified as being in metropolitan areas and only 18% would be correctly identified as rural.
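
The row-normalized proportions described above could be computed from paired labels along these lines. This is only a sketch: the function name and the input shape (a list of `(self_report, ip_label)` pairs) are assumptions, not the structure of the actual analysis code.

```python
from collections import Counter, defaultdict


def row_normalized(pairs):
    """Return {self_report: {ip_label: proportion}} with each row summing to 1."""
    counts = defaultdict(Counter)
    for self_report, ip_label in pairs:
        counts[self_report][ip_label] += 1

    table = {}
    for self_report, row in counts.items():
        row_total = sum(row.values())
        # Divide by the row total so cells read as "share of this self-report group".
        table[self_report] = {ip_label: n / row_total for ip_label, n in row.items()}
    return table
```

Because each row is divided by its own total, the values along a row sum to 1 (up to rounding), while the values down a column do not.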

Overall, while there is general alignment, the small number of respondents who self-report smaller regions are overwhelmed by false positives in the IP-based process. Despite this noise, the overall proportions of respondents in each category are fairly accurate. Some speculation as to why self-report and IP-based geolocation do not match:

  • IP errors: a non-trivial percentage of devices are likely geolocated incorrectly (see MaxMind's documentation).
  • Varying interpretations of locale: people who live in suburbs (and are correctly geolocated as such) might consider themselves to be part of the larger metropolitan area. For instance, if you live in Minneapolis, the city technically has a population of 400K and therefore is a city by our definition (100K - 1M), but the Twin Cities area has a population over 3 million and feels much more like a metropolitan area.
  • People responding to survey at work: if you live in a suburb or rural area but work in a city (or vice versa), you could report the correct information for your home and the IP geolocation could work perfectly, but the answers still would not be aligned.
Self-report / IP     A metropolitan area  A city  A town  A small town  A rural area  No data  Total Count  Total Proportion
A metropolitan area  0.51                 0.16    0.08    0.03          0.06          0.16     228          0.36
A city               0.32                 0.41    0.09    0.02          0.03          0.13     164          0.26
A town               0.21                 0.19    0.36    0.05          0.04          0.15     111          0.18
A small town         0.18                 0.18    0.14    0.26          0.05          0.20     66           0.11
A rural area         0.27                 0.16    0.16    0.07          0.18          0.16     44           0.07
Prefer not to say    0.23                 0.08    0.31    0.08          0.00          0.31     13           0.02
Total Count          211                  150     94      36            33            102      626
Total Proportion     0.34                 0.24    0.15    0.06          0.05          0.16
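
To illustrate the false-positive issue with the figures in the table above, the sketch below rebuilds approximate cell counts (row proportion times row total) and estimates, for each IP-based label, what share of respondents geolocated to it also self-reported it. The names are hypothetical, and because the published proportions are rounded, the results are only approximate and were not part of the original analysis.

```python
# Column order for the proportions in each row below.
COLUMNS = ["A metropolitan area", "A city", "A town",
           "A small town", "A rural area", "No data"]

# Self-reported label -> (row proportions, number of respondents), copied from the table.
TABLE = {
    "A metropolitan area": ([0.51, 0.16, 0.08, 0.03, 0.06, 0.16], 228),
    "A city":              ([0.32, 0.41, 0.09, 0.02, 0.03, 0.13], 164),
    "A town":              ([0.21, 0.19, 0.36, 0.05, 0.04, 0.15], 111),
    "A small town":        ([0.18, 0.18, 0.14, 0.26, 0.05, 0.20], 66),
    "A rural area":        ([0.27, 0.16, 0.16, 0.07, 0.18, 0.16], 44),
    "Prefer not to say":   ([0.23, 0.08, 0.31, 0.08, 0.00, 0.31], 13),
}


def approx_precision(label):
    """Approximate share of respondents geolocated to `label` who also self-reported it."""
    col = COLUMNS.index(label)
    # Approximate number of respondents geolocated to this label, across all rows.
    located = sum(props[col] * total for props, total in TABLE.values())
    props, total = TABLE[label]
    return props[col] * total / located


for label in COLUMNS[:-1]:  # skip "No data"
    print(f"{label}: {approx_precision(label):.2f}")
```

Under these approximations, roughly a quarter of respondents geolocated to a rural area also self-reported a rural area, compared with a bit over half for metropolitan areas, which is consistent with the smaller categories being dominated by respondents from the much larger urban groups.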


Given the issues with the existing question and the lack of precision in the IP-based method, it was decided to keep a locale question but to use one that is simpler to translate and more perception-oriented (reflecting the subjective nature of the question). The new question is below. The IP-based analysis will still be used as a second data point.

New Question:

Would you describe the place where you live as....
¤ A farm or home in the country
¤ A country village
¤ A small city or town
¤ The suburbs or outskirts of a big city
¤ A big city
¤ Prefer not to say