Grants:Project/Eurecat/Community Health Metrics: Understanding Editor Drop-off/Midpoint

This project is funded by a Project Grant

Report accepted

This midpoint report for a Project Grant approved in FY 2019-20 has been reviewed and accepted by the Wikimedia Foundation.

To read the approved grant submission describing the plan for this project, please visit Grants:Project/Eurecat/Community Health Metrics: Understanding Editor Drop-off.
You may still review or add to the discussion about this report on its talk page.
You are welcome to email projectgrants@wikimedia.org at any time if you have questions or concerns about this report.

Welcome to this project's midpoint report! This report shares progress and learning from the first half of the grant period.

Summary

In the first part of the project, we have defined the framework for the analysis, rooted in the state of the art both from an academic and a community perspective.

We have collected data from different sources to obtain a composite set of metrics of activity and interaction over time, accounting for different aspects including edit activity, reverts, discussion patterns, emotional expression, user warnings, editor lifecycle, and community dynamics.

In this way, we have set the basis for an analysis of drop-off dynamics for different language communities and for specific collectives. We have also created a scalable and extensible infrastructure for developing indicators and dashboards for community health.

We have started to disseminate our work, presenting the approach and collecting feedback at a community event (ItWikiCon) and an academic workshop (the WikiWorkshop).

Methods and activities

Below is an overview of the most relevant methods and activities carried out during the first half of the project, up to March 2021:

Analysis of the state of the art on drop-off: we reviewed definitions of and motivations for drop-off, based on both academic literature and community documentation
Definition of the analysis framework: we defined the structure of the datasets:
- Information about the users, in order to be able to compare different groups based on admin status, gender, native language, etc.
- monthly metrics of activity and interaction patterns from different sources, by page and by user
- We decided to aggregate namespaces into six groups, according to (Welser et al, 2011): Content (0,6), Content Talk (1,7), User (2), User Talk (3), Wikipedia (4,5), Infrastructure (all the other namespaces)
- We decided to aggregate user status into four levels, building on (Arazy et al, 2015): Anonymous editors (level 0), Registered editors (level 1), Expert editors (level 2), Admins (level 3 and above).
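The two aggregations above can be sketched as simple lookups. This is a minimal illustration, not the project's code: the function names `namespace_group` and `status_label` are hypothetical, and how a numeric level is assigned to each editor is not described in this report, so the sketch only maps an already assigned level to its label.

```python
# Namespace groups as listed in this report (after Welser et al, 2011).
# Any namespace id not listed here falls into "Infrastructure".
NAMESPACE_GROUPS = {
    0: "Content", 6: "Content",
    1: "Content Talk", 7: "Content Talk",
    2: "User",
    3: "User Talk",
    4: "Wikipedia", 5: "Wikipedia",
}

# Labels for the user status levels listed in this report.
STATUS_LABELS = {
    0: "Anonymous editors",
    1: "Registered editors",
    2: "Expert editors",
}


def namespace_group(namespace_id: int) -> str:
    """Return the namespace group for a MediaWiki namespace id (hypothetical helper)."""
    return NAMESPACE_GROUPS.get(namespace_id, "Infrastructure")


def status_label(level: int) -> str:
    """Return the status label for a level; level 3 and above are Admins."""
    if level < 0:
        raise ValueError("status level must be non-negative")
    return STATUS_LABELS.get(level, "Admins")
```

Keeping the groups in plain dictionaries makes it easy to adjust the mapping for a language edition that uses additional namespaces.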
Preliminary analysis of Wikipedia History dumps: for 4 language editions (ca, en, es, it):
- we created a database with basic indicators of user activity over months, such as edits in different groups of namespaces, entropy of the distribution of activity across articles (https://github.com/WikiCommunityHealth/wikimedia-user-metrics )
- we started computing metrics of conflict based on reverts (Yasseri et al, 2014) (https://github.com/WikiCommunityHealth/wikimedia-revert )
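As an illustration of the entropy indicator mentioned above, the following sketch computes the Shannon entropy (in bits) of one editor's edits across pages within a period. The report does not state the exact formula used in the repository, so the choice of Shannon entropy, the base-2 logarithm, and the function name `activity_entropy` are assumptions.

```python
import math
from collections import Counter
from typing import Iterable


def activity_entropy(edited_pages: Iterable[str]) -> float:
    """Shannon entropy (bits) of an editor's edits across pages.

    `edited_pages` holds one page title per edit, e.g. all edits made by
    one editor in one month. 0.0 means all edits went to a single page;
    higher values mean activity is spread over more pages.
    """
    counts = Counter(edited_pages)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no edits in the period
    entropy = 0.0
    for count in counts.values():
        share = count / total
        entropy += share * math.log2(1 / share)
    return entropy
```

For example, an editor whose edits are split evenly between two pages has an entropy of 1 bit, while an editor who only edits one page has an entropy of 0.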
Preliminary analysis of Wikipedia Talk pages: for 4 language editions (ca, en, es, it). We leveraged the WikiConv dataset to compute several metrics regarding Wikipedia talk pages:
- we implemented a library to efficiently process the WikiConv dumps and extract, filter, and sort editor interactions (wikiconv-crunch)
- we started computing structural metrics of discussions over time, such as the length of discussion chains, h-index, and mutual replies (Laniado et al, 2011; Kaltenbrunner & Laniado, 2012) (https://github.com/WikiCommunityHealth/wikiconv-structural-patterns )
- we started computing sentiment analysis metrics to characterize emotional expression of editors in talk pages over time using the NRC lexicon (https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm)
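Two of the structural metrics named above, the length of the longest discussion chain and the h-index of a discussion, can be sketched on a reply tree as follows. This assumes one common formulation of the discussion h-index (the largest depth h that contains at least h comments) and a hypothetical input format, a dictionary mapping each comment id to its parent id; it is not taken from the wikiconv-structural-patterns repository, whose definitions may differ.

```python
from collections import Counter
from typing import Dict, Optional


def comment_depths(parents: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Depth of every comment; top-level comments (parent None) have depth 1.

    Assumes the replies form a forest: every parent id is itself a key
    and there are no cycles.
    """
    depths: Dict[str, int] = {}
    for start in parents:
        chain = []
        node: Optional[str] = start
        # Walk up until we reach a top-level comment or a known depth.
        while node is not None and node not in depths:
            chain.append(node)
            node = parents[node]
        depth = depths[node] if node is not None else 0
        # Assign depths back down the chain, ancestors first.
        for comment in reversed(chain):
            depth += 1
            depths[comment] = depth
    return depths


def max_chain_length(parents: Dict[str, Optional[str]]) -> int:
    """Length of the longest reply chain (maximum depth), 0 if empty."""
    return max(comment_depths(parents).values(), default=0)


def discussion_h_index(parents: Dict[str, Optional[str]]) -> int:
    """Largest depth h such that at least h comments sit at depth h."""
    per_level = Counter(comment_depths(parents).values())
    return max((lvl for lvl, n in per_level.items() if n >= lvl), default=0)


# Hypothetical thread: a and b are top-level, c and d reply to a, e replies to c.
thread = {"a": None, "b": None, "c": "a", "d": "a", "e": "c"}
chain_length = max_chain_length(thread)
h_index = discussion_h_index(thread)
```

Computing both values per talk page and per month gives a time series that can be compared across language editions.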
Extraction of self-disclosed characteristics of the editors and information from templates:
- We extracted from user templates information about the languages spoken by editors (https://github.com/WikiCommunityHealth/wiki-users-gender):
  - we adapted an existing library (wikidump) (Consonni et al, 2019) to extract Babel templates from the Wikipedia XML dumps
- We extracted gender information about the editors (https://github.com/WikiCommunityHealth/wiki-users-gender):
  - We extracted gender information about the editors from the Wikipedia API
  - We analyzed user-boxes disclosing user gender for 4 language editions (ca, en, es, it)
- We extracted user warnings, wikibreaks and retirement templates
  - we adapted an existing library (wikidump) (Consonni et al, 2019) to extract different kinds of transcluded templates from the Wikipedia history dumps (Consonni et al, 2020), tracking when they were added and removed
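The idea of tracking when templates are added and removed can be illustrated by comparing the set of templates in consecutive revisions of a page. The sketch below does not use the wikidump library or its API: the regular expression is a deliberate simplification of wikitext parsing, and the names `template_names` and `template_events` as well as the `(timestamp, wikitext)` input format are assumptions made for the example.

```python
import re
from typing import Iterable, List, Set, Tuple

# Simplified matcher for "{{Name}}" or "{{Name|...}}"; real wikitext has
# more cases (parser functions, comments, nowiki) that are ignored here.
TEMPLATE_RE = re.compile(r"\{\{\s*([^{}|]+?)\s*(?:\||\}\})")


def template_names(wikitext: str) -> Set[str]:
    """Lower-cased names of the templates transcluded in a revision."""
    return {m.group(1).strip().lower() for m in TEMPLATE_RE.finditer(wikitext)}


def template_events(
    revisions: Iterable[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Return (timestamp, "added" | "removed", template) events.

    `revisions` must be (timestamp, wikitext) pairs in chronological order.
    """
    events: List[Tuple[str, str, str]] = []
    previous: Set[str] = set()
    for timestamp, wikitext in revisions:
        current = template_names(wikitext)
        for name in sorted(current - previous):
            events.append((timestamp, "added", name))
        for name in sorted(previous - current):
            events.append((timestamp, "removed", name))
        previous = current
    return events


# Hypothetical user page history with a wikibreak notice added, then removed.
history = [
    ("2020-01-01", "Hello"),
    ("2020-02-01", "{{Wikibreak|back=soon}} Hello"),
    ("2020-03-01", "Hello again"),
]
events = template_events(history)
```

Filtering the resulting events by template name (for example user warnings, wikibreak, or retirement templates) yields the intervals during which each template was present on a page.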
Dissemination activities:
- We presented the project and collected feedback at the ItWikiCon 2020 conference https://2020.itwikicon.org/
- We presented a preliminary description of the framework of analysis at the WikiWorkshop https://wikiworkshop.org/2021/ (Miquel-Ribé et al, 2021)

References

Arazy, O., Ortega, F., Nov, O., Yeo, L., & Balila, A. (2015). Functional roles and career paths in Wikipedia. In Proceedings of the 18th ACM conference on computer supported cooperative work & social computing (pp. 1092-1105).

Consonni, C., Laniado, D., & Montresor, A. (2019). WikiLinkGraphs: A complete, longitudinal and multi-language dataset of the Wikipedia link networks. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 13, pp. 598-607).

Iosub, D., Laniado, D., Castillo, C., Morell, M. F., & Kaltenbrunner, A. (2014). Emotions under discussion: Gender, status and communication in online collaboration. PloS one, 9(8).

Kaltenbrunner, A., & Laniado, D. (2012). There is no deadline: time evolution of Wikipedia discussions. In Proceedings of the Eighth Annual International Symposium on Wikis and Open Collaboration (pp. 1-10).

Laniado, D., Tasso, R., Volkovich, Y., & Kaltenbrunner, A. (2011). When the Wikipedians talk: Network and tree structure of Wikipedia discussion pages. In Fifth International AAAI Conference on Weblogs and Social Media.

Miquel-Ribé, M., Consonni, C., & Laniado, D. (2021). Wikipedia Editor Drop-Off: A Framework to Characterize Editors’ Inactivity. Wiki Workshop 2021, part of The Web Conference 2021.

Yasseri, T., Spoerri, A., Graham, M., & Kertész, J. (2014). The most controversial topics in Wikipedia. Global Wikipedia: International and cross-cultural issues in online collaboration, 25.

Welser, H. T., Cosley, D., Kossinets, G., Lin, A., Dokshin, F., Gay, G., & Smith, M. (2011). Finding social roles in Wikipedia. In Proceedings of the 2011 iConference (pp. 122-129).

Midpoint outcomes

These are the main outcomes of the project so far:

Source code organized in 18 GitHub repositories; see the GitHub organization https://github.com/WikiCommunityHealth
Datasets:
- editor-level metrics of activity over time
- page-level controversy metrics over time, based on the edit history and talk pages
Publication at Wiki Workshop:
- Marc Miquel-Ribé, Cristian Consonni and David Laniado. Wikipedia Editor Drop-Off: A Framework to Characterize Editors' Inactivity (PDF). Wiki Workshop 2021.
Presentation at ItWikiCon 2020 (slides).
We have started setting up a website to host the dashboards produced with our analyses.

Finances

Our finances are on track with the proposal; see Community Health Metrics: Understanding Editor Drop-off/Finances.

Learning

What are the challenges

Team capacity: one of our team members, Pablo Aragón (elaragon), stepped down to join the Wikimedia Research team at WMF (as Pablo (WMF)), and has become an advisor to the project. We brought Marc Miquel (marcmiquel), who previously had the role of advisor, into the team. Marc will be actively working on the project until its completion. He has strong experience in the context of editor engagement, specifically from his PhD dissertation.

Coordination work: we were able to add several dedicated and talented students to the team. This meant additional administrative work, both at our institution and on the university side, as well as extra coordination effort. We are convinced, however, that this extra effort is a worthwhile investment for the project.

Data: we have collected data from different sources and in different formats, and it is not trivial to put it all together in a comprehensive framework; we are working in this direction in order to be able to combine data and metrics. Furthermore, dealing with data from the projects is not always easy, as in many cases the data are unstructured or change over time. For example, we found two bugs, which we reported on Phabricator (T276119, T276120).

Limitations of quantitative methods: drop-off is often due to reasons external to the community, which makes it harder to advance and test hypotheses relying only on on-wiki editor interaction data. To get insights on external versus internal reasons for drop-off, we have extracted templates for wikibreaks and retirement, and we plan to run interviews with members of different language communities.

What is working well

We didn't set out to use specific learning patterns; however, we could recognize several in what we have done. We also want to share some of our experience about what is working well. Patterns in bold were created by the team involved in this project.

Collaborate with students for their thesis project

We involved several students in this project; we believe that it is a good opportunity for a first experience with research.

Git repositories for software

We set up a GitHub organization (WikiCommunityHealth) to work as an umbrella for all code repositories that we created for the project.

Using open licenses

All the code we are publishing is released under a free license.

A playful logo builds identity and invites interaction

Since we started publishing code on GitHub, we realized we needed a logo to communicate the goal and scope of the project. We have created this logo.

Next steps and opportunities

We have worked on extracting and processing data from different sources to describe different aspects of the editors’ activity and interactions; now we need to make sense of these data taken together, modelling their temporal evolution, identifying patterns and correlations.

Further, we need to select from all these data some relevant metrics and indicators to be shown in dashboards aimed at helping the communities.

We aim to run interviews with members from different communities to collect their input in order to complement our computational analyses and guide the development of the tools and interfaces for the communities.

Grantee reflection

As members of the community and, at the same time, researchers studying community dynamics, we are grateful for this opportunity to put our effort directly at the service of the community. We acknowledge the availability and prompt assistance of the Wikimedia Cloud Services team that provided help with the project’s server, and the support and precious feedback of our team of advisors.