Research:Geo-aggregation of Wikipedia edits

From Meta, a Wikimedia project coordination wiki

This page documents a planned research project.
Information may be incomplete and change before the project starts.


This proposal seeks to create a privacy-preserving dataset containing information about the geography of edits to Wikipedia. While closely related to Priedhorsky et al.'s proposal to create a geographic dataset of page views, anonymizing editor geography requires more care because an editor's history is publicly available, tied to their Wikipedia handle or IP address.

Background

This proposal only considers edits to geospatial articles in Wikipedia tagged with latitude and longitude coordinates (e.g. articles about cities, states, landmarks, etc.). Future proposals can explore other types of edits.

The geography of edits to Wikipedia captures deep insights into the successes and challenges of Wikipedia's goal to capture the sum of human knowledge. The geographic places Wikipedians make their edits from (e.g. their home countries) reflect the extent to which Wikipedia engages a global community of volunteers. The geographic places Wikipedians edit articles about (e.g. cities, parks, countries, etc.) provide insight into the amount of information in Wikipedia about geographic areas. Taken together, these two geographies can provide insight into which geographic areas are (under)represented in Wikipedia, and what geographic perspectives the articles about those places might convey.

Prior research has found that extensive inequalities exist in both coverage and editorship. For example, countries with lower socio-economic indicators tend to have both fewer editors per capita and less information about them.[1] In addition, articles about economically impoverished places tend to have fewer local edits than articles about wealthy places.[2]

However, the unavailability of geographic editor datasets has served as a significant barrier to research on editor geographies. Researchers studying the geography of edits to Wikipedia have historically needed to geocode anonymous editors' IP addresses (which ignores the huge share of work performed by logged-in editors) or have attempted to predict an editor's geographic location based on their User page and editing history (an error-prone and dangerously self-referential approach). This project seeks to provide a geographic dataset containing information about Wikipedians' edits that supports more accurate and in-depth research about the geography of information on Wikipedia while preserving the anonymity of Wikipedians themselves.

Is this possible?

A prudent question the Wikipedia community may ask itself about this research is: Is it possible to sufficiently anonymize editors' geographic data? To show that it is, consider the following two statistics.

  • For all geospatial articles located within the United Kingdom, research has found that 76.7% of edits come from the U.K., 5.3% of edits come from the U.S., and the remaining 18% come from editors in "other" countries.
  • X% of edits to the article about the city of Songnim in North Korea came locally from within the city itself.

Intuitively, because the first statistic captures thousands of editors' activity, it is unlikely to compromise a Wikipedian's anonymity. On the other hand, the release of the second statistic is much more likely to violate a Wikipedian's expectations for privacy. Therefore, the question in this research is not whether anonymization is possible (the U.K. example shows it is), but what levels of aggregation balance the community's need for anonymity with researchers' need for accurate data.

Privacy concerns

The overarching privacy concerns in this project reflect many of the concerns outlined in the proposal to geographically aggregate pageviews. Much of the content in this section is adapted from that proposal.

Like the pageview proposal, this proposal concerns three identifiers: 1) a Wikipedian's real name or other real-world identifiers, 2) a Wikipedian's username, and 3) a Wikipedian's IP address. In addition, we must protect a Wikipedian's geographic location, which is a sensitive attribute. Finally, an additional source of external data that affects this proposal is a user's editing history.

As with the pageview proposal, we seek to prevent two classes of disclosure:

  • Pseudonym resolution: The discovery that two or more previously disconnected accounts or IP addresses represent the same editor.
  • Location disclosure: The disclosure of an editor's geographic location.

As described in Priedhorsky et al.'s pageview proposal, research has established three metrics that capture the degree of anonymity in a dataset (the bullets below are a direct quotation from their proposal). An excellent summary of these and other methods for privacy preservation can be found in Aggarwal and Yu [3].

  • k-anonymity: Any given individual resides in an equivalence class of k − 1 others. For example, a given editor can be linked to no fewer than k candidate reading histories. (Note that this example interpretation of k may differ from others in this proposal.)
  • l-diversity: Any given individual resides in an equivalence class with at least l “well represented” values of each sensitive attribute. The notion of well-representedness is complex and often ill-defined. A plausible though imprecise example in our case is that any given editor can be linked to no fewer than l locations.
  • t-closeness: Any given individual resides in an equivalence class whose distribution of sensitive attribute values is within t of the global distribution. For example, the probability distribution of an editor’s location differs from the location distribution of all Wikipedia editors by no more than t.
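The three metrics can be illustrated on toy data. The sketch below is not part of either proposal: it assumes each equivalence class is simply a list of editor locations, takes "well represented" to mean "distinct" for l-diversity, and measures t-closeness with total variation distance (the original t-closeness formulation uses Earth Mover's Distance). The function names and sample data are hypothetical.

```python
from collections import Counter

def k_anonymity(classes):
    """Size of the smallest equivalence class."""
    return min(len(members) for members in classes.values())

def l_diversity(classes):
    """Fewest distinct sensitive values (locations) found in any class.
    Assumption: 'well represented' is simplified to 'distinct'."""
    return min(len(set(members)) for members in classes.values())

def t_closeness(classes):
    """Largest distance between a class's location distribution and the
    global one. Assumption: total variation distance, not Earth Mover's."""
    all_values = [v for members in classes.values() for v in members]
    global_counts = Counter(all_values)
    n = len(all_values)
    worst = 0.0
    for members in classes.values():
        local = Counter(members)
        dist = 0.5 * sum(
            abs(local[v] / len(members) - global_counts[v] / n)
            for v in global_counts
        )
        worst = max(worst, dist)
    return worst

# Hypothetical toy data: equivalence class -> one location per editor.
classes = {
    "articles about UK": ["UK", "UK", "US", "FR"],
    "articles about FR": ["FR", "FR", "UK", "US"],
}

k = k_anonymity(classes)   # 4
l = l_diversity(classes)   # 3
t = t_closeness(classes)   # 0.125
```

In this toy example every editor shares a class with three others (k = 4), could be in any of three locations (l = 3), and each class's location mix differs from the overall mix by a distance of 0.125.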

Methods

To achieve proper anonymization, we will incorporate two levels of aggregation and two additional levels of filtering. Each record in the anonymized dataset will consist of a five-tuple containing:

  • The Wikipedia project (e.g. en).
  • A date range (e.g. July, 2014).
  • A geospatial level of aggregation for articles associated with the record (e.g. articles about the U.K.).
  • A geospatial level of aggregation for editor locations associated with the record (e.g. editors from the U.S.).
  • A count indicating the number of edits that matched the previous four attributes.
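As a rough illustration of the record layout, the five-tuple could be represented as follows. This is a minimal sketch, not a proposed schema: the class name `EditRecord`, its field names, the date format, and the `aggregate` helper are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class EditRecord:
    project: str         # Wikipedia project, e.g. "en"
    date_range: str      # e.g. "2014-07" for July, 2014 (format is an assumption)
    article_region: str  # aggregation level for the articles, e.g. "UK"
    editor_region: str   # aggregation level for the editors, e.g. "US"
    edit_count: int      # edits matching the four attributes above

def aggregate(edits):
    """Collapse (project, date_range, article_region, editor_region)
    tuples into one counted record per distinct combination."""
    counts = Counter(edits)
    return [EditRecord(*key, n) for key, n in sorted(counts.items())]

# Hypothetical input: one tuple per edit, already mapped to regions.
records = aggregate([
    ("en", "2014-07", "UK", "US"),
    ("en", "2014-07", "UK", "UK"),
    ("en", "2014-07", "UK", "US"),
])
```

Here `records` holds two entries: one edit from the UK and two from the US, both to articles about the UK.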

We will apply an approach similar to that described in Priedhorsky et al.'s proposal, where we "walk" article and editor geographic aggregation up through increasing scales (e.g. city, state, country, continent, etc.) until we achieve a sufficient level of l-diversity. We will engage the Wikimedia Foundation and Wikipedia community to determine an appropriate threshold for l-diversity. No aggregation will be performed below the project level (i.e. not sub-projects).

Below is a possible algorithmic sketch for anonymization. We anticipate that it will be iteratively refined as we analyze the anonymized dataset and receive feedback from the Wikipedia community. To aggregate the dataset, we will repeat the following steps until l-diversity is sufficient for all editors.

  1. Consider each candidate Wikipedian w (logged in or anonymous) who has edited a page in the date range.
  2. For each of w's edits, compute the geographic set intersection of editor geography across all five-tuples corresponding to the articles they edited.
  3. While l-diversity is below the desired threshold, increase the geographic scale of either article or editor aggregation.
  4. If sufficient l-diversity cannot be achieved at continent-level aggregation, we will not include the record in the anonymized dataset.
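The "walk up" idea in steps 3 and 4 can be sketched in code. The example below is a deliberate simplification under stated assumptions, not the algorithm we will ship: it coarsens only article geography, measures l-diversity as the number of distinct editor locations in a group, and omits the per-editor intersection in step 2. The `PARENT` hierarchy, the `anonymize` function, and the sample edits are hypothetical.

```python
from collections import defaultdict

# Hypothetical place hierarchy: child -> parent (city -> country -> continent).
PARENT = {
    "London": "UK", "Leeds": "UK", "Paris": "France",
    "UK": "Europe", "France": "Europe",
}

def anonymize(edits, l_threshold):
    """edits: list of (article_place, editor_location) pairs.
    Returns {article_region: edit_count}. A group is released at the finest
    scale where it has at least l_threshold distinct editor locations."""
    groups = defaultdict(list)
    for article_place, editor_loc in edits:
        groups[article_place].append(editor_loc)

    released = defaultdict(int)
    while groups:
        coarser = defaultdict(list)
        for region, locs in groups.items():
            if len(set(locs)) >= l_threshold:
                released[region] += len(locs)            # diverse enough: release
            elif region in PARENT:
                coarser[PARENT[region]].extend(locs)     # walk up one scale
            # else: coarsest scale reached and still too uniform, so drop it
        groups = coarser
    return dict(released)

EDITS = [
    ("London", "UK"), ("London", "US"), ("London", "France"),
    ("Leeds", "UK"), ("Leeds", "UK"),
    ("Paris", "France"), ("Paris", "US"),
]

result = anonymize(EDITS, l_threshold=3)
# {'London': 3, 'Europe': 4}
```

With a threshold of 3, the London edits can be released at city scale, while the Leeds and Paris edits only become diverse enough once pooled at continent scale. Raising the threshold to 4 leaves nothing releasable in this toy data, which corresponds to step 4.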

Selecting a time interval

As we increase the timespan of data collected, we will have more data about any particular geographic region. This suggests that longer timespans would allow us to provide data at finer levels of geographic scale. Thus, the time range necessary to produce data at a given level of geographic scale (e.g. country vs. state) depends on the l-diversity threshold deemed appropriate.

As a reference point, our CHI paper used three months of editor data. In this data, 3000 (article country, editor country) pairs had 10 or more edits, which accounts for 7.5% of all possible pairs. One way to limit the amount of data that must be actively analyzed is to use smaller time ranges for heavily edited language editions (e.g. "en-wiki") and larger time ranges for less edited language editions (e.g. "sw-wiki").
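The figures above imply the size of the country grid. The short calculation below is only a sanity check on that arithmetic; the variable names are ours, and it assumes the same set of countries is used for articles and for editors.

```python
import math

pairs_with_10_plus_edits = 3000
share_of_all_pairs = 0.075

# 3000 pairs are 7.5% of the grid, so the full grid has 3000 / 0.075 pairs.
possible_pairs = round(pairs_with_10_plus_edits / share_of_all_pairs)

# Assumption: the grid is countries x countries, so its side is the square root.
implied_countries = math.isqrt(possible_pairs)
```

This gives 40,000 possible pairs, or roughly a 200 by 200 grid of countries, of which 92.5% had fewer than 10 edits in three months.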

Including Non-Geographic Articles

Non-geographic articles can be included by extending the geographic tree to include a root that differentiates between geographic and non-geographic articles. In this scenario, all non-geographic articles would be grouped together. Tree structures could also be used to split non-geographic articles (similar to geographic articles); however, care must be taken to assign every article to exactly one leaf in the tree. Otherwise, the above anonymization algorithm may be less robust.
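The "exactly one leaf" requirement is easy to check mechanically. The sketch below assumes the tree is stored as a child-to-parent map with an added root named "ALL"; the node names, the sample assignments, and the `check_partition` helper are hypothetical.

```python
from collections import Counter

# Hypothetical tree: node -> parent. "ALL" is the added root that separates
# geographic from non-geographic articles.
PARENT = {
    "geographic": "ALL", "non-geographic": "ALL",
    "Europe": "geographic", "UK": "Europe",
}

def leaves(parent):
    """Nodes that are not the parent of any other node."""
    return set(parent) - set(parent.values())

def check_partition(assignments, parent):
    """assignments: list of (article, node) pairs.
    Returns the articles that are assigned more than once or assigned
    to a node that is not a leaf."""
    leaf_set = leaves(parent)
    per_article = Counter(article for article, _ in assignments)
    bad = {article for article, n in per_article.items() if n != 1}
    bad |= {article for article, node in assignments if node not in leaf_set}
    return bad

good = [("London", "UK"), ("Photosynthesis", "non-geographic")]
flawed = good + [("London", "non-geographic"), ("Alps", "Europe")]

ok_result = check_partition(good, PARENT)       # set()
bad_result = check_partition(flawed, PARENT)    # {'London', 'Alps'}
```

Articles that were never assigned at all do not appear in `assignments`, so detecting them would require comparing against the full article list.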

Opting out

As detailed in Priedhorsky et al.'s proposal, a variety of methods could be used to allow editors to opt out of the anonymized dataset. These include:

  • Active opt-out: A banner, check-box, etc. on edit pages that allows editors to not be included in the dataset.
  • Active opt-in: A banner, check-box, etc. on edit pages that allows editors to be included in the dataset.
  • Blanket opt-out: Exclude all logged-in editors.
  • Do not track: Exclude editors who have the "do-not-track" header turned on.

We will work with the Wikimedia Foundation and Wikipedia community to determine an approach that maintains Wikipedians' expectations of privacy while still being feasible from an engineering perspective. We note that while the "blanket opt-out" strategy is sufficient for Priedhorsky et al.'s proposal, it would negate the value of the dataset described in this proposal: anonymous editors' IP addresses, which can be geocoded, are already available in the editing history, and this proposal attempts to enhance that data by reflecting the geography of logged-in editors.

Discussion

Feedback is welcome on the project's talk page.

References

  1. Graham, M., and Hogan, B. 2014. Uneven Openness: Barriers to MENA Representation on Wikipedia. Oxford Internet Institute Report, Oxford, UK.
  2. Sen, S., Ford, H., Musicant, D., Graham, M., Keyes, O., and Hecht, B. 2015. Barriers to the Localness of Volunteered Geographic Information. Proceedings of CHI 2015. New York: ACM Press.
  3. Aggarwal and Yu, Privacy-Preserving Data Mining – Models and Algorithms, Springer, 2008