Grants:Project/Eurecat/Community Health Metrics: Understanding Editor Drop-off/Timeline


Monthly updates

Please prepare a brief project update each month, in a format of your choice, to share progress and learnings with the community along the way. Submit the link below as you complete each update.

October 2020

  • Participation in ItWikiCon 2020 (online) with:
  • Recruited 2 students from the University of Trento, Alessio and Francesco, who will start working on the project as part of their thesis (B.Sc. in Computer Science)
  • Coding the first version of the script to retrieve metrics from MediaWiki History dumps.
  • Debate with some Catalan Wikipedians (Amical Wikimedia) to understand the reasons for editor retention and drop-off.
  • Created some tentative visualizations with a non-web tool (Tableau) to understand community engagement dynamics and overall composition. They were shared with the Catalan Wikipedia community in their Village pump [pdf].

November 2020

  • Requested a cloud VPS server on Wikimedia Labs: phabricator:T267162.
  • Created a GitHub organization and posted the script we are working on to retrieve and calculate metrics based on the Wikipedia history of revisions (in progress).
  • Started a first version of the project Meta page to disseminate the results and (more importantly) engage participants to share their insight and, if they wish, join the project.
  • Analysis of the state of the art on drop-off: revised definitions and motivations for drop-off, based on both academic literature and community documentation.

December 2020

  • Started collaboration with two other bachelor students from the University of Trento for their thesis, Nicola and Eugenio. Nicola will work on analyzing the emotional expressions contained in discussions in Wikipedia talk pages. Eugenio will analyze the life-cycle of users based on their editing activity levels.
  • Preliminary analysis of Wikipedia talk pages for 4 language editions (ca, en, es, it). We leveraged the WikiConv dataset (figshare) to compute several metrics regarding Wikipedia talk pages:
    • We implemented a library to efficiently process the WikiConv dumps and extract, filter, and sort editor interactions (wikiconv-crunch)
    • We started computing structural metrics of discussions over time, such as the length of discussion chains, the h-index, and mutual replies.
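
The structural metrics named above can be sketched as follows. This is a minimal illustration, not the project's code: the input layout (a `parents` mapping from comment id to parent id and an `authors` mapping from comment id to user name) and all function names are assumptions, and the h-index follows one common definition for discussion threads (the deepest level h such that every level i up to h has at least i comments).

```python
from collections import Counter


def comment_depths(parents):
    """Return {comment_id: depth}; top-level comments (parent None) have depth 1."""
    depths = {}
    for cid in parents:
        chain = []
        cur = cid
        # Walk up until we reach a top-level comment or an already-resolved ancestor.
        while cur is not None and cur not in depths:
            chain.append(cur)
            cur = parents[cur]
        base = 0 if cur is None else depths[cur]
        for offset, node in enumerate(reversed(chain), start=1):
            depths[node] = base + offset
    return depths


def thread_h_index(parents):
    """Largest h such that each level i in 1..h holds at least i comments (assumed definition)."""
    per_level = Counter(comment_depths(parents).values())
    h = 0
    while per_level.get(h + 1, 0) >= h + 1:
        h += 1
    return h


def mutual_reply_pairs(parents, authors):
    """Unordered author pairs where each has replied to the other at least once."""
    replied = set()
    for cid, parent in parents.items():
        if parent is not None and authors[cid] != authors[parent]:
            replied.add((authors[cid], authors[parent]))
    return {tuple(sorted(pair)) for pair in replied if (pair[1], pair[0]) in replied}


# Hypothetical talk page section with two threads.
parents = {1: None, 2: 1, 3: 2, 4: 1, 5: None, 6: 5}
authors = {1: "A", 2: "B", 3: "A", 4: "C", 5: "B", 6: "C"}

longest_chain = max(comment_depths(parents).values())
print(longest_chain, thread_h_index(parents), mutual_reply_pairs(parents, authors))
```

Computing these per page and per month would give the "over time" view mentioned above; the grouping step is left out of the sketch.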

January 2021

  • Definition of the analysis framework: we defined the structure of the datasets:
    • Information about the users, in order to be able to compare different groups based on admin status, gender, native language, etc.
    • Monthly metrics of activity and interaction patterns from different sources, by page and by user
    • Namespaces aggregated into six groups, according to (Welser et al, 2011), and user status in 3 levels, building on (Arazy et al, 2015).
  • Preliminary analysis of Wikipedia History dumps for 4 language editions (ca, en, es, it):
    • created a database with basic indicators of user activity over months, such as edits in different groups of namespaces
    • started computing metrics of conflict based on reverts.
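
As an illustration of revert-based conflict metrics, the sketch below detects identity reverts by matching content checksums within a page history. It is a commonly used technique and not necessarily the one implemented in the project; the input layout (a chronological list of `(rev_id, sha1)` tuples for one page), the function name, and the `window` parameter are assumptions.

```python
def find_identity_reverts(revisions, window=15):
    """Detect identity reverts in one page history.

    revisions: chronological list of (rev_id, sha1) tuples.
    Returns a list of (reverting_rev_id, restored_rev_id, reverted_rev_ids).
    """
    reverts = []
    for i, (rev_id, sha1) in enumerate(revisions):
        start = max(0, i - window)
        # Look backwards for the most recent earlier revision with identical content.
        for j in range(i - 1, start - 1, -1):
            if revisions[j][1] == sha1:
                reverted = [r for r, _ in revisions[j + 1:i]]
                if reverted:  # skip consecutive duplicates that undo nothing
                    reverts.append((rev_id, revisions[j][0], reverted))
                break
    return reverts


# Hypothetical history: revision 3 restores revision 1, revision 6 restores revision 4.
history = [(1, "a"), (2, "b"), (3, "a"), (4, "c"), (5, "d"), (6, "c")]
print(find_identity_reverts(history))
```

Counting the reverted revisions per user and per month would then give a simple conflict indicator that can be joined with the activity table.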

February 2021

  • Started to compute sentiment analysis metrics to characterize the emotional expression of editors in talk pages over time using the NRC lexicon.
  • Extraction of self-disclosed characteristics of the editors and information from templates:
    • We readapted an existing library (wikidump) (Consonni et al, 2019) to extract Babel templates from the Wikipedia XML dumps
    • We extracted gender information about the editors (private repo):
      • We extracted gender information about the editors from the Wikipedia API
      • We analyzed user-boxes disclosing user gender for 4 language editions (ca, en, es, it)
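
A lexicon-based emotion profile of the kind mentioned above can be sketched as follows. The tiny lexicon is made up for illustration and does not reproduce entries of the NRC lexicon; the tokenizer and the normalization by token count are also assumptions.

```python
import re
from collections import Counter

# Toy word-to-emotions mapping, invented for this example (not real NRC entries).
TOY_LEXICON = {
    "thanks": {"joy", "trust"},
    "vandalism": {"anger", "disgust"},
    "welcome": {"joy"},
}


def emotion_profile(text, lexicon):
    """Share of tokens associated with each emotion in a comment."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for emotion in lexicon.get(token, ()):
            counts[emotion] += 1
    if not tokens:
        return {}
    return {emotion: n / len(tokens) for emotion, n in counts.items()}


profile = emotion_profile("Thanks for reverting the vandalism", TOY_LEXICON)
print(profile)
```

Averaging such profiles over all comments of a month, page, or user group would give the time series of emotional expression.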

March 2021

  • Started collaboration with another bachelor student, Samuele, from the University of Trento for his thesis. Samuele will work on analyzing the effect of wikibreaks templates and user warnings on editor drop-off.
  • Submitted a paper to WikiWorkshop on our framework to characterize editor inactivity
  • We readapted an existing library (wikidump) (Consonni et al, 2019) to extract information about the languages spoken by editors from user Babel templates (wikidump > languages.py)
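
To give an idea of what extracting languages from Babel templates involves, here is a minimal sketch that parses the two usual forms, `{{Babel|ca|en-3}}` and `{{#babel:ca|en-3}}`, from a user page's wikitext. It is not the wikidump implementation: the regular expression, the function name, and the choice to treat a code without a level as native ("N") are assumptions, and localized template names or nested templates are not handled.

```python
import re

# Matches {{Babel|...}} and {{#babel:...}} and captures the parameter list.
BABEL_RE = re.compile(r"\{\{\s*(?:#babel:|babel\s*\|)([^{}]*)\}\}", re.IGNORECASE)
LEVELS = {"0", "1", "2", "3", "4", "5", "N"}


def parse_babel(wikitext):
    """Return {language_code: proficiency_level} found in Babel templates."""
    languages = {}
    for match in BABEL_RE.finditer(wikitext):
        for param in match.group(1).split("|"):
            param = param.strip()
            if not param or "=" in param:
                continue  # skip empty and named parameters
            code, sep, level = param.rpartition("-")
            if sep and level.upper() in LEVELS:
                languages[code.lower()] = level.upper()
            else:
                # Assumption: a bare code means native proficiency.
                languages[param.lower()] = "N"
    return languages


print(parse_babel("Hello! {{Babel|ca|en-3|es-2}}"))
print(parse_babel("{{#babel:it|en-2}}"))
```

Running this on every revision of a user page, rather than only the latest one, would show when a language was added or removed.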

April 2021

  • We extracted wikibreak and retirement templates in all languages where they were available (wikidump > wikibreaks.py) to analyze the usage of these templates and their influence on editors’ activity levels.
  • Computed statistics about structural discussion patterns by page from the WikiConv dataset
  • Presented paper at WikiWorkshop 2021 (slides)
  • Implemented a dashboard webapp prototype

May 2021

  • Statistics on user life-cycle and drop-off in relation to user attributes like gender, registration date and administrator status, for different Wikipedia language editions
  • Extracted user warnings for ca, it, es, en (wikidump > user_warnings.py) to study the relationship between receiving a user warning and an editor's level of activity on Wikipedia. This required extracting different kinds of transcluded and substituted templates from the Wikipedia XML dumps and tracking when they were added and removed.
  • Analysis of user warnings and wikibreaks over time with the aim of identifying patterns and inconsistencies between user templates and their actions.

June 2021

  • Statistics about revert patterns over time in different language editions.
  • We started designing the questions we would use in the interviews based on the state of the art of Editor Drop-off and prepared a list of valuable Wikipedians (some of them part of the UCoC drafting committee) to request their input.
  • Computed emotional expression over months by article and by user group
  • Knowledge transfer across team members, quality (control) evaluation, and documentation of the code of the different approaches.

July 2021

  • Meeting with Wikimedia Poland in order to understand their community engagement activities. We were asked to investigate the active community (e.g., valuable experienced editors) to assess the potential risk of a decrease in the number of valuable editors (notes).
  • We started rolling out the questionnaire to community members from very different Wikipedia language editions, including underrepresented communities.
  • Analysis of user warnings in relation to user activity patterns and drop-off.
  • We developed the Meta page of the project and added all the necessary sections to disseminate our work to the different target readers (e.g., community leaders, technical contributors, Wikimedia researchers, etc.).

August 2021

  • We organized a session along with people from the Wikimedia Research Team: Wikimania presentation and open discussion “Indicators for the Wikimedia Projects”. Slides (PDF, video recording)
  • We deployed a new VPS on Wikimedia Cloud and migrated the datasets and configurations: phabricator:T284687.
  • We interviewed some expert editors from underrepresented groups in the Movement and finished collecting the qualitative data.

September 2021

  • We collected all the data and completed the analysis of the Drop-Off Questionnaire. There were a total of 32 participants (40% women) from 27 Wikipedia language editions (13 from underrepresented languages).
  • We finished the conceptualization of the “Vital Signs” indicators to understand the state of community growth and renewal.

October 2021

  • We participated in the WikiArabia conference and did an analysis on the Vital Signs focused on Arabic languages. 15/10/2021 | WikiArabia 2021 talk (program page) | Session: Measuring Arabic Wikipedia Community Health: Are We “Open” to Community Growth and Renewal? (slides and notes PDF | video recording).
  • We completed a report for Wikimedia Poland and released it on Meta-wiki.
  • We continued coding some metrics that were suggested to us during conversations with Wikipedians.
  • We released the Drop-off Questionnaire Report (preliminary version PDF) and we will iterate on it.

November 2021

  • We participated in WikiIndaba and did an analysis on the Vital Signs focused on African languages. 5/11/2021 | WikiIndaba 2021 talk (program page) | Session: African language Wikipedias - indicators for development, growth and renewal. (slides and notes PDF | video recording).
  • We participated in the Wikimedia CEE conference and did an analysis on the Vital Signs focused on Central and Eastern European languages. 7/11/2021 | Wikimedia CEE Meeting 2021 talk (program page) | Session: Measuring Central and Eastern Europe Wikipedias Growth and Renewal. (slides and notes PDF | video recording)
  • We participated in la Viquitrobada (Catalan Wikipedia gathering) and did an analysis on the Vital Signs focused on Catalan Wikipedia. 20/11/2021 | Viquitrobada (Catalan Wikipedia Annual Gathering) 2021 talk (program page) | Session: Measuring Catalan Wikipedia Community Health: Are We “Open” to Community Growth and Renewal? (slides and notes PDF (català)).
  • We held various calls with Wikimedia Poland and the Volunteer Support Network (an affiliate group) in order to collect feedback on the Vital Signs.
  • We started developing the website for the Vital Signs using Plotly and Dash technologies.

December 2021

  • We completed a report (PDF) for Wikimedia Italia as they asked for the values for their Community Vital Signs.
  • Data retrieval and processing to generate the Vital Signs metrics.

January 2022

  • We developed an initial prototype of the Vital Signs website and dashboards.
  • Data analysis: time series clustering of the evolution of the number of active editors in each Wikipedia language edition.
  • We reviewed the state of the art on community growth and renewal for an upcoming academic publication.
  • We developed a peak detection method to identify relevant changes in the timelines representing different aspects of editor interactions.
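
The source does not describe the peak detection method itself. As a rough illustration of the idea, the sketch below flags months whose value rises well above a trailing baseline. The function name, the 6-month window, and the threshold of 2 standard deviations are all assumptions, and only upward changes are flagged.

```python
from statistics import mean, stdev


def detect_peaks(series, window=6, threshold=2.0):
    """Return indices whose value exceeds the trailing mean by more than
    `threshold` standard deviations of the previous `window` points."""
    peaks = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (series[i] - mu) / sigma > threshold:
            peaks.append(i)
    return peaks


# Hypothetical monthly counts of some interaction metric, with a spike in month 6.
monthly_counts = [10, 12, 11, 9, 10, 11, 30, 12, 10, 11]
print(detect_peaks(monthly_counts))
```

A trailing window keeps the baseline free of future values, so the same function could be applied as new monthly data arrive.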

February 2022

March 2022

  • We submitted a revised version of the “Sustainability” journal article.
  • We automated the data generation for the Vital Signs website.
  • We submitted a short paper on patterns of growth, stagnation and decline to WikiWorkshop 2022.
  • We have published online a first preliminary version of the Vital Signs dashboards: vitalsigns.wmcloud.org.

April 2022

Is your final report due but you need more time?



Extension request

New end date

November 30th, 2021

Rationale

We are requesting an extension of 3 months for this project, with the new end date being November 30th, 2021 (final report due December 31st).

During this project, we have managed the transition of a team member - Pablo (WMF) who joined the Wikimedia Research team - and the onboarding of a new one, marcmiquel.

Our progress has been steady, as can be seen on the project’s timeline and in the Mid-point report. We have created a framework for studying editor drop-off and community health, and we have processed and analyzed data from different sources, obtaining some valuable results. We need the requested additional time to combine the different datasets and results in order to reach further actionable knowledge and to publish dashboards that disseminate the results to the Wikimedia and scientific communities.

We are available for questions and comments, and we would like to thank you in advance for your support.

--CristianCantoro (talk) 09:22, 1 September 2021 (UTC)

on behalf of the team

Approved

New end date of November 30, 2021 is approved.

Marti (WMF) (talk) 15:03, 10 February 2022 (UTC)