Research:Mapping Wikipedia in the Middle East and North Africa

Contact

Mark Graham

Oxford Internet Institute

This page documents a completed research project.

Key personnel

Mark Graham, Oxford Internet Institute
Bernie Hogan, Oxford Internet Institute
Ilhem Allagui, American University of Sharjah
Ali Frihida, University Tunis El Manar
Gavin Baily, Tracemedia
Kalina Bontcheva, University of Sheffield
David Palfrey, Oxford Internet Institute
Ahmed Medhat Mohammed, Oxford Internet Institute

Former personnel

Richard Farmbrough, Oxford Internet Institute

Project summary

We are investigating Wikipedia's representations of the Middle East, North Africa and East Africa. In particular, we are interested in who is being represented, with what frequency, and who is doing the representing. To this end, we are using data collected from dumps.wikimedia.org/backup-index.html to analyse patterns throughout Wikipedia's history.

We are capturing location-oriented data in two ways, and seek assistance for a third.

Geocoded data. We have parsed the current versions of seven language editions of interest, including English, French, Swahili, Arabic, Egyptian Arabic and Persian. For each edition, we have identified articles about locations by parsing the geocodes within the articles (in body text as well as infoboxes).
NLP-based location grammars. We are using GATE to assess, where possible, the self-identified locations of individuals, as well as cultural heritage based on Babel infoboxes.
We would like to assess the locations of logged-in users while respecting their privacy. The Methods section below describes this more fully. We are keen to ensure that our data is open access; to that end, we are interested in APIs to private data that output anonymized data to us, rather than in downloading private data directly.
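As an illustration of the geocode parsing described above, the following is a minimal sketch rather than the project's actual parser. It assumes geocodes appear as `{{coord|...}}` templates in either decimal or degrees/minutes/seconds form; the function names `parse_coord` and `_to_decimal` are hypothetical, and nested templates and other geocoding templates are not handled.

```python
import re

# Body of a {{coord|...}} template; nested templates are deliberately not handled.
COORD_RE = re.compile(r"\{\{\s*coord\s*\|([^{}]*)\}\}", re.IGNORECASE)


def _to_decimal(parts):
    """Convert [degrees, minutes, seconds] strings (1 to 3 items) to decimal degrees."""
    return sum(float(p) / (60 ** i) for i, p in enumerate(parts))


def parse_coord(wikitext):
    """Return (lat, lon) for the first parsable {{coord}} template, else None."""
    for match in COORD_RE.finditer(wikitext):
        numbers = []
        lat = lon = None
        for field in (f.strip() for f in match.group(1).split("|")):
            upper = field.upper()
            if upper in ("N", "S") and numbers and lat is None:
                lat = _to_decimal(numbers) * (1 if upper == "N" else -1)
                numbers = []
            elif upper in ("E", "W") and numbers and lat is not None:
                lon = _to_decimal(numbers) * (1 if upper == "E" else -1)
                break
            else:
                try:
                    float(field)
                except ValueError:
                    break  # e.g. a named parameter such as display=title
                else:
                    numbers.append(field)
        # Decimal form without hemisphere letters: {{coord|lat|lon|...}}
        if lat is None and len(numbers) == 2:
            lat, lon = float(numbers[0]), float(numbers[1])
        if lat is not None and lon is not None:
            return (lat, lon)
    return None
```

For example, `parse_coord("{{Coord|30|2|40|N|31|14|9|E|display=title}}")` yields roughly (30.044, 31.236), and text without a recognisable template yields `None`.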

Alongside these location-oriented tasks is an interest in the following research topics:

Patterns of conflict and collaboration: To what extent do geographical markers of identity play a role in the collective task of editing and maintaining articles?
Geographic patterns of second-level metrics: At present, there are numerous maps indicating vast geographic and linguistic inequalities on Wikipedia, but these tend not to address differences in quality, contentiousness and scope as determined through extensive processing of database dumps.
Location-based clustering of authors: To what extent do authors from a country edit together? Is this more than one would expect based on shared interests?
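One way to frame the clustering question is to compare the observed share of same-country co-editing pairs with a shuffled baseline. The sketch below is only an illustration under stated assumptions: the input structures (`article_editors`, `country_of`) and function names are hypothetical, and the baseline shuffles country labels across all editors, so it does not control for shared interests. Doing that would need, for instance, shuffling within topic groups.

```python
import random
from itertools import combinations


def same_country_share(article_editors, country_of):
    """Fraction of co-editing pairs (two editors of one article) that share a country."""
    same = total = 0
    for editors in article_editors.values():
        for a, b in combinations(sorted(set(editors)), 2):
            total += 1
            same += country_of[a] == country_of[b]
    return same / total if total else 0.0


def permutation_baseline(article_editors, country_of, rounds=1000, seed=0):
    """Mean same-country share after randomly reassigning country labels to editors."""
    rng = random.Random(seed)
    editors = sorted(country_of)
    labels = [country_of[e] for e in editors]
    shares = []
    for _ in range(rounds):
        rng.shuffle(labels)
        shuffled = dict(zip(editors, labels))
        shares.append(same_country_share(article_editors, shuffled))
    return sum(shares) / len(shares)


# Toy data, invented purely for illustration.
article_editors = {"Cairo": ["u1", "u2", "u3"], "Nairobi": ["u3", "u4"]}
country_of = {"u1": "EG", "u2": "EG", "u3": "KE", "u4": "KE"}
observed = same_country_share(article_editors, country_of)
expected = permutation_baseline(article_editors, country_of)
```

An observed share well above the shuffled baseline would suggest that co-editing clusters by country, although it cannot by itself separate location from topical interest.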

Finally, one of the key goals of our project is education and increased content creation. As such, we are holding four meetings across the Middle East and North Africa in the coming years in order to both disseminate our results and to further engage individuals from the Arab world on Wikipedia.

Context

Significant effort has been, and continues to be, invested in understanding the self-identified location and origin of editors. This is likely to remain an incomplete measure: many editors do not, for various reasons, provide any such information, and even when they do, it may not be structured in an easily recoverable way. Nonetheless, it is a key component of understanding the reality of representation on WMF projects and, by extension, on the Internet as a whole. Allied with this is the geolocation of IP edits, which again provides an incomplete picture (and not necessarily a fully accurate one, owing to the use of proxies, VPNs, etc.) of those editors who choose not to log in to a user account. The biggest lacuna is therefore the geolocation of logged-in users, which is where anonymised, aggregated data from the Foundation's database will close a gap.
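To make the idea of anonymised, aggregated data concrete, here is a small sketch of per-country counts with small-cell suppression. It is an assumption about what such an output could look like, not a description of any Foundation API; the function name, the `min_count` threshold and the "other" bucket are all hypothetical.

```python
from collections import Counter


def aggregate_by_country(records, min_count=10):
    """Count users per country, hiding countries with fewer than min_count users.

    records: iterable of (user_id, country_code) pairs, one pair per user.
    Suppressed countries are pooled into "other", which is itself published
    only if the pooled total reaches min_count.
    """
    counts = Counter(country for _user, country in records)
    published = {c: n for c, n in counts.items() if n >= min_count}
    pooled = sum(n for n in counts.values() if n < min_count)
    if pooled >= min_count:
        published["other"] = pooled
    return published
```

With `min_count=3`, twelve users in one country, two in a second and one in a third would be reported as one country with 12 users plus an "other" bucket of 3, so no individual user IDs or small cells leave the private database.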

Methods

Figure (File:Ar level1 area.png): Typical visualisation produced by the project.

For more details on this topic, see /geography.

Analysis of WMF dumps
- Static analysis
- Data parsing
- History analysis
Analysis of user pages
- NLP analysis to determine user self-identification
Analysis of data aggregated by WMF
- Statistical analysis of aggregated geolocation information
Comparison of intermediate statistics gathered above
Editor surveys/interviews
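The structured part of the user-page analysis can be sketched with a small parser for Babel boxes. This is only an illustration, not the GATE pipeline used by the project: it assumes the common `{{Babel|ar|en-3}}` and `{{#babel:ar|en-3}}` forms, treats a bare language code as native level, and uses the hypothetical function name `parse_babel`.

```python
import re

# Matches {{Babel|...}} and {{#babel:...}}; nested templates are not handled.
BABEL_RE = re.compile(r"\{\{\s*(?:#babel:|babel\s*\|)([^{}]*)\}\}", re.IGNORECASE)
# A field such as "en-3" or "fr-N": language code, hyphen, proficiency level.
LEVEL_RE = re.compile(r"(.+?)-([0-5N])", re.IGNORECASE)


def parse_babel(wikitext):
    """Return {language_code: level}; level is '0'-'5' or 'N' (native)."""
    languages = {}
    for match in BABEL_RE.finditer(wikitext):
        for field in match.group(1).split("|"):
            field = field.strip()
            if not field or "=" in field:
                continue  # skip empty fields and named parameters
            with_level = LEVEL_RE.fullmatch(field)
            if with_level:
                code, level = with_level.group(1), with_level.group(2).upper()
            else:
                code, level = field, "N"  # bare code: assumed native speaker
            languages[code.lower()] = level
    return languages
```

For instance, `parse_babel("{{Babel|ar|en-3|fr-1}}")` gives `{"ar": "N", "en": "3", "fr": "1"}`. Language self-identification is at best a rough proxy for location or heritage, so results like this would need to be combined with the other sources listed above.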

Dissemination

Findings will be shared in several ways:

Workshops with those engaged in outreach for WMF or other open projects in the region of interest
Publication of project material and data via the OII's website
Publication of academic papers via the most appropriate journals and other outlets
Production of tools to assist in enhancing articles

Wikimedia Policies, Ethics, and Human Subjects Protection

We propose to interview users as part of the data-gathering exercise.

Approval for user interviews has been obtained from the Oxford University Social Sciences and Humanities Inter-Divisional Research Ethics Committee (reference SSD/CUREC1A/11-253, 5 September 2011).

The project is committed to scrupulous ethical practices, and any additional requirements needing ethical approval will be subject to the Oxford University Social Sciences and Humanities Inter-Divisional Research Ethics Committee process.

Benefits for the Wikimedia community

Our project has three main outputs, all of which are likely to be of interest to the Wikipedia community. They are scholarly output, dynamic web resources and dissemination workshops with translated materials. In the end, all three are oriented towards increased content creation as well as a clear articulation of mechanisms that will help bring new members into Wikipedia.

Scholarly outputs.
Specific research questions have been posed, and further questions are being asked and answered as the project proceeds. Results will be made available through traditional scholarly output, more contemporary channels such as the project web page, blog posts, mailing lists and wiki pages, and mainstream media interviews and articles.
Dynamic web resources.
A multilingual tool to provide assistance in improving article quality is part of the overall project.
Dissemination workshops including translated materials - both the scholarly output and workshop material.
See the next sub-section for more detail.

Current workshops

Our first workshop will be held in Cairo on 21 and 22 October 2012. Please contact us if you are interested in attending. We have funding for travel and accommodation for Wikipedians from the MENA region.

Upcoming workshops

Our next workshop will be held in Amman, Jordan, on 26 and 27 January 2013.

Timeline

Date	Project goals
April 2011	Project initiation
September 2011	Draft article location results
November 2011	NLP analysis on user pages complete
December 2011	Extraction of data relating to user
April 2012	Workshop on initial results and fact-finding
April 2013	Workshop on web resources and scholarly outputs

Funding

International Development Research Centre grant # 106228
John Fell Fund project 101/549

References

Zerogeography preliminary results

External links

Contacts

Mark Graham - immedium@gmail.com +44(0) 1865 287203
Bernie Hogan - bernie.hogan@oii.ox.ac.uk +44(0) 1865 287198