Global Reach/India Survey Documentation

From Meta, a Wikimedia project coordination wiki


This documentation is a supplement to our India Phone Survey results. It provides an overview of the context and methodologies through which the phone survey data were gathered and organized. For those interested in further investigation, we also provide recommendations on how to use the raw data for meaningful analysis and exploration.

India phone survey

The survey has 19 questions in total, covering the following categories:

  • Internet use
  • Mobile phone use (smartphones & basic voice/SMS phones)
  • Awareness and use of Wikipedia
  • General demographics

Phone surveys were conducted between June and October 2016 by Votomobile. The survey is a composite of 7 individual regional surveys. This approach reduced the number of languages offered to an individual caller from 12 to just the 3-6 commonly used in that region. We also varied the number of responses collected per region to approximately reflect that region's share of the total population of India. We chose a large sample size to minimize the margin of error and to provide enough data for useful analysis of different regions of India.

The survey was designed to answer the main questions listed below. Analyzing the full data set allows more in-depth exploration and further insights around these questions:

  • What is the actual number of people who use the internet? (This is difficult to measure from industry reports, since people might have access to the internet through schools, friends, internet cafés, public Wi-Fi, etc.)
  • For internet users: What do people mostly use the internet for?
  • For non-internet users: Why not use the internet?
  • How many people use smartphones?
  • Do people with smartphones use the internet only over Wi-Fi? Or only over cellular service?
  • How many people think that they don’t use the internet, but still use Facebook or WhatsApp?
  • How many people have heard of Wikipedia? What do they use it for? How often?
  • If they have heard of Wikipedia, but aren’t using it, why not?

Selection of the 7 individual surveyed regions

  • Regions chosen had to be derived from the geographical areas of ‘calling circles / area codes’ of India’s mobile phone system.
  • From the calling circle coverage, areas of similar geography and language use were combined into 7 distinct regions for separate surveys.
  • Each regional survey size was influenced by its population relative to the population of India.
  • Including all of India’s languages and geographies would have been cost-prohibitive. However, the languages and regions chosen are expected to cover more than 95% of India’s population.

Table of Regions

Calling Group | Areas Included | Languages Included
A | Tamil Nadu, Kerala | Hindi, Malayalam, Punjabi, Tamil, Telugu, English
B | Uttar Pradesh, Bihar, Rajasthan, Jharkhand, Uttarakhand, Delhi, North Eastern states | Hindi, Punjabi, English
C | Maharashtra, Gujarat | Gujarati, Hindi, Marathi, English
D | Madhya Pradesh, Odisha, Chhattisgarh | Hindi, Marathi, Odia, Punjabi, English
E | Punjab, Haryana, Himachal Pradesh | Hindi, Punjabi, English
F | West Bengal, Assam | Assamese, Bengali, English, Hindi, Punjabi
G | Andhra Pradesh, Karnataka, Telangana | Kannada, Tamil, Telugu, English

Where to get the data

Flow diagram of survey questions
  • The full data set can be found at:
Dan Foy (2016). India phone survey 2016. figshare. doi:10.6084/m9.figshare.5404834
This is the canonical version which contains a CSV including every answer from each of the 9235 responses.
  • The full text of the questions can be found here.

Using the data effectively for analysis

Looking at India as a whole

For an overview of the Indian population, turn on the “India Representation Subset” filter to obtain a subset of 2700 responses, in which each region's contribution is proportional to its share of the population.

Looking at regional subsets

Studying the data set at a regional level should provide additional insights. India consists of regions that are drastically different from each other, and conclusions drawn from all regions combined may not always reflect any single region well. For regional analysis, make sure the “India Representation Subset” filter is turned off before filtering to the region of interest.

Important to note: The regional and India representation filters should not be used in combination, because together they can reduce the available regional data significantly.

Impact of combining regional and country filtering:

  • For instance, let's focus on Calling Group A (Tamil Nadu, Kerala).
  • When only the regional filter is on, 865 full responses are available for analysis.
  • When both the regional filter and the “India Representation Subset” (country proportionality) filter are on, only 247 of the 865 full responses are available, which widens the margin of error of any regional analysis.

Individual survey responses

Within the CSV file, each row represents one survey taken, and each column contains the response to the associated question. In certain cases, some questions that should have been asked were not, and these entries are marked as “Missing”.

  • When analyzing results from questions Q9A-Q9D, Q12 and Q13, filter to “Full Responses” only.
  • When analyzing results from any other questions, you can include non-full responses to increase the sample size with fully valid data.
  • When filtering to the “India Representation Subset”, all responses are already from the full response set and no special treatment is needed.
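The filtering guidance above can be sketched in Python. This is only an illustration under stated assumptions: the rows are a tiny synthetic sample, not real survey data, and the column headers "Calling Group", "Full Response", "Q1" and "Q12" are hypothetical stand-ins that should be checked against the actual CSV. The helper name select_rows is also illustrative.

```python
import csv
import io

# Tiny synthetic sample standing in for the real CSV. All rows are made up,
# and the column headers are assumptions to be checked against the data set.
SAMPLE_CSV = """Calling Group,Full Response,India Representation Subset,Q1,Q12
A,TRUE,TRUE,Yes,Daily
A,TRUE,FALSE,No,Weekly
A,FALSE,FALSE,Yes,Missing
B,TRUE,TRUE,No,Monthly
B,FALSE,FALSE,Yes,Missing
"""

# Questions that were occasionally skipped and need full responses only.
FULL_ONLY_QUESTIONS = {"Q9A", "Q9B", "Q9C", "Q9D", "Q12", "Q13"}


def select_rows(rows, question, region=None, india_subset=False):
    """Return the rows that are valid for analyzing one question."""
    if region is not None and india_subset:
        # The two filters should not be combined (see the note above).
        raise ValueError("use either the regional filter or the India subset")
    selected = []
    for row in rows:
        if india_subset and row["India Representation Subset"] != "TRUE":
            continue
        if region is not None and row["Calling Group"] != region:
            continue
        if question in FULL_ONLY_QUESTIONS and row["Full Response"] != "TRUE":
            continue
        selected.append(row)
    return selected


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
region_a_q1 = select_rows(rows, "Q1", region="A")        # non-full rows allowed
region_a_q12 = select_rows(rows, "Q12", region="A")      # full responses only
india_q12 = select_rows(rows, "Q12", india_subset=True)  # country-level view
print(len(region_a_q1), len(region_a_q12), len(india_q12))
```

Raising an error when both filters are requested is a design choice made for this sketch; a spreadsheet filter would simply return the smaller subset.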

Facebook / WhatsApp questions

The questions asking whether respondents use Facebook or WhatsApp were only asked of those who had previously said that they do not use the internet. This is by design: we wanted to use these questions to gauge how many people did not realize that Facebook and WhatsApp are part of the internet. The responses to these two questions were not intended to measure the full use of Facebook or WhatsApp.

Non-linear progression & Margin of Error

It is important to note that this survey is non-linear. Depending on how a question is answered, the flow of the rest of the survey may change. For example, if a respondent says that he or she does not have a smartphone, we skip the smartphone-related questions. You can review the flow diagram to see how the survey progresses. The sample is large enough that, for questions asked of all respondents, the results are accurate to within a 2% margin of error at a 95% confidence level.
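As a rough check on the margin-of-error statement, the sketch below applies the standard formula for a proportion, z * sqrt(p * (1 - p) / n), to sample sizes quoted in this documentation. It assumes simple random sampling and the worst case p = 0.5, neither of which is stated in the source; the function name margin_of_error is illustrative.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion at ~95% confidence.

    Assumes simple random sampling; p = 0.5 gives the widest interval.
    """
    return z * math.sqrt(p * (1 - p) / n)


# Sample sizes quoted elsewhere in this documentation.
SAMPLE_SIZES = {
    "All responses": 9235,
    "India Representation Subset": 2700,
    "Calling Group A, full responses": 865,
    "Calling Group A within India subset": 247,
}

for label, n in SAMPLE_SIZES.items():
    print(f"{label}: n={n}, margin of error about {margin_of_error(n):.2%}")
```

Under these assumptions the 2700-response subset comes out just under 2%, while smaller subsets, such as those for questions asked only of some respondents, have wider margins.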


Addressing Biases

One issue with phone surveys is the tendency for some respondents to favor the first response to a question. To address this problem, most of the survey questions presented the responses in a random order for each call. This distributes any bias evenly among the responses instead of accumulating it all on one response. Note that questions with a 'none of these' or 'other' response always kept this option as the last one presented. A couple of survey questions, however, have responses with a strong order dependency and would be confusing if presented in a completely random order. For instance, when we ask how often respondents use Wikipedia, a non-sequential order would not make sense (e.g. “once a week”, “once a month”, “once a day”). For these questions, we randomly presented the responses in one of two orders: either lowest to highest, or highest to lowest.
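The ordering rules described above could be sketched as follows. This is not the survey vendor's implementation: the function name present_options, the PINNED_LAST set and the example option labels other than the frequency answers are hypothetical.

```python
import random

# Options that always stay at the end, per the description above.
PINNED_LAST = {"none of these", "other"}


def present_options(options, ordered=False, rng=random):
    """Return the response options in the order they would be read out.

    ordered=False: shuffle everything except the pinned options.
    ordered=True:  keep the sequence, but flip it on roughly half the calls.
    """
    pinned = [o for o in options if o.lower() in PINNED_LAST]
    movable = [o for o in options if o.lower() not in PINNED_LAST]
    if ordered:
        if rng.random() < 0.5:
            movable.reverse()
    else:
        rng.shuffle(movable)
    return movable + pinned


# Hypothetical unordered question: any order, "None of these" stays last.
print(present_options(["News", "Email", "Social media", "None of these"]))

# Order-dependent question: either ascending or descending, never mixed.
print(present_options(["once a day", "once a week", "once a month"], ordered=True))
```

Passing a seeded random.Random instance as rng makes the ordering reproducible, which can help when testing.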

Calculation of Proportionality

To achieve a full India representation, we introduced proportionality to determine the number of responses we select per region for analysis:

  • We determined the population of each region by referencing “List of states and union territories of India by population”.
  • We summed the populations of all regions represented in the survey, giving a total of 1,151,284,905.
  • We calculated the percentage of that total represented by each calling circle. For instance, calling circle A (Tamil Nadu and Kerala) had a total population of 105,526,635, which is about 9% of the total.
  • We set the overall sample size to 2700 and calculated the number of responses to take from each region in proportion to its share of the population.
  • We ordered the raw data chronologically and selected complete responses from each region according to the calculated proportions. We added a column “India Representation Subset” and marked each selected response as “TRUE”. To obtain data for a full India representation, simply select “TRUE”.

Calling Group | Zone | Population | % of Total Population | Proportionalized Response Size per Group (2700 responses)
Group A | Tamil Nadu | 72,138,958 | 6.27% |
Group A | Kerala | 33,387,677 | 2.90% | 247.59
Group B | Uttar Pradesh | 199,281,477 | 17.31% |
Group B | Bihar | 103,804,637 | 9.02% |
Group B | Rajasthan | 68,621,012 | 5.96% |
Group B | Jharkhand | 32,966,238 | 2.86% | 949.05
Group C | Maharashtra | 112,372,972 | 9.76% |
Group C | Gujarat | 60,383,628 | 5.24% | 405.00
Group D | Madhya Pradesh | 72,597,565 | 6.31% |
Group D | Odisha | 41,947,358 | 3.64% |
Group D | Chhattisgarh | 25,540,196 | 2.22% | 328.59
Group E | Punjab | 27,704,236 | 2.41% |
Group E | Haryana | 25,353,081 | 2.20% |
Group E | Himachal Pradesh | 6,864,602 | 0.60% | 140.67
Group F | West Bengal | 91,347,736 | 7.93% |
Group F | Assam | 31,169,272 | 2.71% | 287.28
Group G | Andhra Pradesh | 49,386,799 | 4.29% |
Group G | Karnataka | 61,130,704 | 5.31% |
Group G | Telangana | 35,286,757 | 3.06% | 341.82
Total | | 1,151,284,905 | 100.00% | 2700
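
The proportional allocation can be reproduced with a short Python sketch using the zone populations from the table. The names ZONE_POPULATIONS, TARGET_SAMPLE and proportional_allocation are illustrative. The sketch works from unrounded shares, so its results can differ from the table in the first decimal place; the table values appear to be derived from percentages rounded to two decimals.

```python
# Zone populations as listed in the table above, grouped by calling group.
ZONE_POPULATIONS = {
    "A": {"Tamil Nadu": 72_138_958, "Kerala": 33_387_677},
    "B": {"Uttar Pradesh": 199_281_477, "Bihar": 103_804_637,
          "Rajasthan": 68_621_012, "Jharkhand": 32_966_238},
    "C": {"Maharashtra": 112_372_972, "Gujarat": 60_383_628},
    "D": {"Madhya Pradesh": 72_597_565, "Odisha": 41_947_358,
          "Chhattisgarh": 25_540_196},
    "E": {"Punjab": 27_704_236, "Haryana": 25_353_081,
          "Himachal Pradesh": 6_864_602},
    "F": {"West Bengal": 91_347_736, "Assam": 31_169_272},
    "G": {"Andhra Pradesh": 49_386_799, "Karnataka": 61_130_704,
          "Telangana": 35_286_757},
}
TARGET_SAMPLE = 2700  # size of the India Representation Subset


def proportional_allocation(zone_populations, target):
    """Split `target` responses across groups by population share."""
    group_totals = {g: sum(z.values()) for g, z in zone_populations.items()}
    grand_total = sum(group_totals.values())
    return {g: target * pop / grand_total for g, pop in group_totals.items()}


allocation = proportional_allocation(ZONE_POPULATIONS, TARGET_SAMPLE)
for group, n in allocation.items():
    print(f"Group {group}: {n:.2f} responses")
```

Because responses are whole rows, the fractional allocations have to be rounded in practice; the source does not say which rounding rule was used.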

Skipped questions / Full responses

Votomobile experienced a logic flow problem with some of the responses, which led to a small set of questions occasionally being skipped (this affected only Q9A-Q9D, Q12 and Q13). When one of those questions was incorrectly skipped, that particular response in the spreadsheet is set to “Missing”, and the entry in the “Full Response” column is set to FALSE for filtering purposes.

To address this issue, Votomobile conducted extra full surveys to make up for the incomplete responses. In the current spreadsheet, the original responses (with “Missing” marked where needed) and the additional responses are combined for analysis. For our initial analysis of the data set, we used only responses marked as “Full Response”.
