Learning patterns/Collecting data on requests for adminship

From Meta, a Wikimedia project coordination wiki
A learning pattern for online engagement
Collecting data on requests for adminship
problem: Requests for adminship are an important part of the Wikipedia community. Several scientific studies try to explain and predict voting behaviour within these requests. They tend to reuse the same dataset because the election data is difficult to obtain.
solution: A systematic approach and knowledge of web scraping are necessary to collect data on requests for adminship.
creator: ASociologist
created on: 12:44, 1 June 2022 (UTC)
status: DRAFT

What problem does this solve?

Requests for adminship are an important part of the Wikipedia community. Several research studies try to explain and predict voting behaviour within these requests, including Leskovec et al. (2010a[1], 2010b[2]), Turek et al. (2011)[3], Picot-Clemente et al. (2015)[4], Burke and Kraut (2008)[5], Kordzadeh and Kreider (2016)[6], Oppong-Tawiah et al. (2016)[7], Cabunducan et al. (2011)[8] and Lee et al. (2012)[9].

Such studies must either collect their own data on requests for adminship or rely on previously shared data, which many do. This learning pattern describes how the election data can best be collected and cleaned, and discusses problems that can arise in the process.

What is the solution?

This learning pattern focuses on requests for adminship and uses the German Wikipedia as its running example.

Step 1: Finding the data

Requests for adminship are generally well documented and easy to find. An archive of previous elections can be reached from the page on requests for adminship (see the different language versions). How these requests are archived depends on the language version and on the period in question. The voting process itself may also have changed over the years, and it is important to be aware of such changes when collecting and analysing election data. For example, in the early days of the German Wikipedia, administrators were mostly suggested by a single user and accepted the nomination, while in 2020, 300 people cast their votes for or against candidates.
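One possible way to enumerate election pages programmatically is the MediaWiki API's `list=allpages` query, which lists pages by title prefix. The following is a minimal sketch that only builds the request URL. The endpoint, the title prefix `Adminkandidaturen/` and the helper name `build_allpages_query` are assumptions made for illustration and should be checked against the wiki in question.

```python
from urllib.parse import urlencode

# Assumptions (verify against the live wiki before use):
# - the German Wikipedia API endpoint
# - election pages live in the project namespace under this title prefix
API_URL = "https://de.wikipedia.org/w/api.php"
RFA_PREFIX = "Adminkandidaturen/"   # hypothetical prefix
PROJECT_NAMESPACE = 4               # MediaWiki's project ("Wikipedia:") namespace


def build_allpages_query(prefix=RFA_PREFIX, limit=500, continue_from=None):
    """Return a URL that asks the API for pages whose title starts with `prefix`."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": PROJECT_NAMESPACE,
        "apprefix": prefix,
        "aplimit": limit,
        "format": "json",
    }
    # The API returns a continuation token when more results are available.
    if continue_from is not None:
        params["apcontinue"] = continue_from
    return f"{API_URL}?{urlencode(params)}"


print(build_allpages_query(limit=50))
```

Because archiving conventions changed over the years, a single prefix may not cover every election, so the resulting list should be compared with the archive pages by hand.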

Step 2: Scraping the data

Even though the election data is written by users rather than generated by a process, it generally follows a relatively consistent, pre-defined structure, so it can be collected with an automated web scraper. In the German Wikipedia, each election generally has its own page, although archiving standards have changed slightly over the years. Such a page presents an introduction of the candidate, the start and end dates of the election, and designated support, oppose and neutral sections for the voters. Voters make an edit in the section corresponding to their opinion and sign with their name.

To collect data on the German requests for adminship, I wrote a web scraper in R (using the packages rvest and httr) that goes through every election year and extracts the election information. The web scraper visits each election page and tries to collect the candidate's name, the date the election closed, all voters taking part in the election, and the direction of each vote. Aborted elections can also be collected.
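The fields listed above suggest a simple record per election. The sketch below shows one way to hold them in Python. It is not the author's R implementation, and the class names `Vote` and `Election`, the field names and the sample usernames are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Vote:
    voter: str
    direction: str  # "support", "oppose" or "neutral"


@dataclass
class Election:
    candidate: str
    closed_on: Optional[date]      # None if the closing date could not be parsed
    aborted: bool = False          # True for elections that were stopped early
    votes: List[Vote] = field(default_factory=list)

    def tally(self) -> Dict[str, int]:
        """Count the votes per direction."""
        counts = {"support": 0, "oppose": 0, "neutral": 0}
        for vote in self.votes:
            counts[vote.direction] = counts.get(vote.direction, 0) + 1
        return counts


# Made-up example data, for illustration only.
election = Election(candidate="ExampleCandidate", closed_on=date(2020, 3, 1))
election.votes.append(Vote("Alice", "support"))
election.votes.append(Vote("Bob", "oppose"))
election.votes.append(Vote("Carol", "support"))
print(election.tally())  # {'support': 2, 'oppose': 1, 'neutral': 0}
```

Keeping one row per vote, rather than only the totals, preserves the voter-candidate pairs that network-style analyses of elections usually need.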

To collect the voters and their opinions, the HTML page can be split into separate parts at the section headings. The web scraper then records which user signed in which part, thereby identifying how each user voted. When a username appeared in more than one part, I classified it as a problematic vote and checked it manually. This happens when users change their opinion and cross out one of their votes, or when they vote in one part but comment on someone else's vote in another part. It is advisable to improve this technique to allow for better automated classification (e.g. by skipping votes that were crossed out and by skipping comments on previously cast votes).
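The splitting and flagging logic described above can be sketched with Python's standard library. This is an illustration under stated assumptions, not the author's R code. It assumes that section headings are `<h2>` or `<h3>` elements whose text is exactly "Pro", "Contra" or "Enthaltung", that signatures are links starting with `/wiki/Benutzer:`, and that crossed-out votes are wrapped in `<s>`, `<del>` or `<strike>`. The sample HTML is made up. Crossed-out votes are skipped automatically, while comments on other users' votes are still flagged for manual checking.

```python
from collections import defaultdict
from html.parser import HTMLParser

# All of the following are assumptions about the page markup.
USER_PREFIX = "/wiki/Benutzer:"
SECTIONS = {"Pro": "support", "Contra": "oppose", "Enthaltung": "neutral"}
STRUCK_TAGS = {"s", "del", "strike"}
HEADING_TAGS = {"h2", "h3"}


class VoteParser(HTMLParser):
    """Collect, per vote section, the usernames that signed there."""

    def __init__(self):
        super().__init__()
        self.section = None               # direction of the current section
        self.in_heading = False
        self.heading_text = ""
        self.struck_depth = 0             # > 0 while inside crossed-out markup
        self.signers = defaultdict(set)   # direction -> set of usernames

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.in_heading = True
            self.heading_text = ""
        elif tag in STRUCK_TAGS:
            self.struck_depth += 1
        elif tag == "a" and self.section and self.struck_depth == 0:
            href = dict(attrs).get("href") or ""
            if href.startswith(USER_PREFIX):
                self.signers[self.section].add(href[len(USER_PREFIX):])

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self.in_heading = False
            # Headings that are not vote sections reset the section to None.
            self.section = SECTIONS.get(self.heading_text.strip())
        elif tag in STRUCK_TAGS and self.struck_depth > 0:
            self.struck_depth -= 1

    def handle_data(self, data):
        if self.in_heading:
            self.heading_text += data


def classify_votes(html):
    """Return (votes, problematic): clear votes and users needing manual review."""
    parser = VoteParser()
    parser.feed(html)
    seen = defaultdict(set)               # username -> directions signed under
    for direction, users in parser.signers.items():
        for user in users:
            seen[user].add(direction)
    votes = {u: next(iter(d)) for u, d in seen.items() if len(d) == 1}
    problematic = sorted(u for u, d in seen.items() if len(d) > 1)
    return votes, problematic


SAMPLE = """
<h2>Pro</h2>
<ol><li>Good work. <a href="/wiki/Benutzer:Alice">Alice</a></li>
<li><s>Agree. <a href="/wiki/Benutzer:Bob">Bob</a></s></li></ol>
<h2>Contra</h2>
<ol><li>Too early. <a href="/wiki/Benutzer:Bob">Bob</a></li>
<li>No. <a href="/wiki/Benutzer:Carol">Carol</a>
<dl><dd>Why? <a href="/wiki/Benutzer:Alice">Alice</a></dd></dl></li></ol>
<h2>Enthaltung</h2>
<ol><li>Unsure. <a href="/wiki/Benutzer:Dave">Dave</a></li></ol>
"""

votes, problematic = classify_votes(SAMPLE)
print(votes)
print(problematic)  # ['Alice']
```

In the sample, Bob's crossed-out support vote is ignored, so only his oppose vote counts, while Alice is flagged because she signed both a support vote and a comment in the oppose section.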

My web scraper can be found in my GitHub repository. Feel free to adapt it, improve it and use it for your own needs.

Step 3: Cleaning the data

After collection, the election data needs to be cleaned.

Potential issues that come up are the following:

  • Multiple votes cast by the same user.
  • Votes cast by users not eligible to vote (depends on the eligibility criteria).

How inconsistencies in the data should be resolved depends on the research question. In some cases, it might be most reasonable to exclude these voters.
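As an illustration of the exclusion strategy, the following sketch drops every vote by a user who voted more than once, as well as votes by users outside an eligibility set. The function name `clean_votes`, the tuple layout and the sample data are hypothetical. How the set of eligible users is derived depends on the eligibility criteria of the wiki and is not shown here.

```python
from collections import Counter


def clean_votes(votes, eligible_users):
    """Split votes into kept and dropped ones.

    votes: list of (voter, direction) tuples for a single election.
    eligible_users: set of usernames allowed to vote (assumed to be given).
    """
    counts = Counter(voter for voter, _ in votes)
    kept, dropped = [], []
    for voter, direction in votes:
        if counts[voter] > 1:
            dropped.append((voter, direction, "multiple votes"))
        elif voter not in eligible_users:
            dropped.append((voter, direction, "not eligible"))
        else:
            kept.append((voter, direction))
    return kept, dropped


# Made-up example data, for illustration only.
raw = [("Alice", "support"), ("Bob", "support"), ("Bob", "oppose"), ("Eve", "oppose")]
kept, dropped = clean_votes(raw, eligible_users={"Alice", "Bob"})
print(kept)     # [('Alice', 'support')]
print(dropped)
```

Returning the dropped votes together with a reason makes it easy to report how many observations were excluded, or to resolve them differently for another research question.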

Is the problem already solved?

Before you dive into data collection, check whether the data has already been collected by another busy bee.

Please add yourself to the following list if you have collected election data:

Already collected Wikipedia request for adminship data

User / Person | Wikimedia project | Time period covered | Data access
ASociologist | German Wikipedia | 2001 - March 2020 | Not yet publicly available, please get in touch if you would like to use the data.
Robert West, Hristo S. Paskov, Jure Leskovec, and Christopher Potts | English Wikipedia | 2003-2013 | https://snap.stanford.edu/data/wiki-RfA.html
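If you reuse the English Wikipedia dataset listed above, it needs to be parsed before analysis. The sketch below assumes the plain-text layout I understand the dataset page to describe: one `KEY:value` line per field (such as `SRC`, `TGT`, `VOT`, `RES`, `YEA`, `DAT`, `TXT`) and a blank line between records. Please verify this against the dataset documentation before relying on it. The function name `parse_rfa_records` and the sample records are made up.

```python
def parse_rfa_records(text):
    """Parse blank-line-separated blocks of KEY:value lines into dictionaries."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            if current:
                records.append(current)
                current = {}
            continue
        # Split at the first colon only, since values such as dates contain colons.
        key, _, value = line.partition(":")
        current[key] = value
    if current:
        records.append(current)
    return records


# Invented sample in the assumed layout, not taken from the real dataset.
SAMPLE_RECORDS = """SRC:ExampleVoter
TGT:ExampleCandidate
VOT:1
RES:1
YEA:2013
DAT:23:13, 19 April 2013
TXT:Support as co-nominator.

SRC:AnotherVoter
TGT:ExampleCandidate
VOT:-1
RES:1
YEA:2013
DAT:08:02, 20 April 2013
TXT:Oppose for now.
"""

records = parse_rfa_records(SAMPLE_RECORDS)
print(len(records))        # 2
print(records[0]["DAT"])   # 23:13, 19 April 2013
```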

When to use

  • Use this pattern when you want to collect data and run analyses on requests for adminship (see the reference list for previous literature using this data).
  • I have used this approach in my project.

Endorsements

See also

Related patterns

External links

References

  1. Leskovec, Jure; Huttenlocher, Daniel; Kleinberg, Jon (2010). "Governance in Social Media: A Case Study of the Wikipedia Promotion Process". Fourth International AAAI Conference on Weblogs and Social Media. 
  2. Leskovec, Jure; Huttenlocher, Daniel; Kleinberg, Jon (2010). "Predicting Positive and Negative Links in Online Social Networks". Proceedings of the 19th International Conference on World Wide Web. ACM Press. 
  3. Turek, Piotr; Spychała, Justyna; Wierzbicki, Adam; Gackowski, Piotr (2011). "Social Mechanism of Granting Trust Basing on Polish Wikipedia Requests for Adminship". Lecture Notes in Computer Science. Springer.
  4. Picot-Clemente, Romain; Bothorel, Cecile; Jullien, Nicolas (2015). "Contribution, Social Networking, and the Request for Adminship Process in Wikipedia". Proceedings of the 11th International Symposium on Open Collaboration. ACM Press. 
  5. Burke, Moira; Kraut, Robert (2008). "Mopping Up". Proceedings of the ACM 2008 Conference on Computer Supported Cooperative Work. ACM Press. 
  6. Kordzadeh, Nima; Kreider, Christopher (2016). "Revisiting Request for Adminship (RfA) within Wikipedia: How Do User Contributions Instill Community Trust?". Journal of the Southern Association for Information Systems 4 (1). doi:10.3998/jsais.11880084.0004.102. 
  7. Oppong-Tawiah, Divinus; Bassellier, Genevieve; Ramaprasad, Jui (2016). "Social Connectedness and Leadership in Online Communities". Social Media and Digital Collaboration Conference. ACM Press.
  8. Cabunducan, Gerard; Castillo, Ralph; Lee, John (2011). "Voting Behavior Analysis in the Election of Wikipedia Admins". 2011 International Conference on Advances in Social Networks Analysis and Mining. IEEE. 
  9. Lee, John; Cabunducan, Gerard; Malinao, Jasmine; Cabarle, Francis; Castillo, Ralph (2012). "Uncovering the Social Dynamics of Online Elections". Journal of Universal Computer Science 18 (4). doi:10.3217/JUCS-018-04-0487.