Research:Wikipedia Edit Types/Edit Summaries

Tracked in Phabricator: Task T348333
Created: 20:56, 9 October 2023 (UTC)
Contact: Isaac (WMF), Wikimedia Foundation
Collaborators: Marija Sakota (EPFL), Guosheng Feng (EPFL), Robert West (EPFL)
Duration: 2023-05 – ??
Open access: via arXiv
Open data: via [1]


This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

The edit types library is oriented towards providing a machine-readable, consistent manner of summarizing what actions were taken in a given edit on Wikipedia. This complements the goals of an edit summary, which is the human-readable summary of what was changed by an edit. While edit summaries can be very useful to moderators on Wikipedia who are seeking to understand a given edit and choose how to respond to it, they unfortunately are often incomplete. This iteration of the project examines how edit types might be used to help generate edit summary recommendations in an effort to help improve the quality of edit summaries on Wikipedia.

From edit types to edit summaries

Start with a simple case: on Wikidata, each edit is composed of a single, specific action that can be automatically described in the summary, such as "Created claim: Penguin Random House author ID (P9802): 114", which indicates that the edit created a claim and that the claim was for P9802 with a value of 114.^[1] Thus, what one might consider to be the basic edit types for Wikidata also stand in very nicely for an automatically generated, human-readable edit summary.
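As a minimal sketch of why single-action edits are easy to summarize, the snippet below renders a summary from a structured action with a plain string template. The `ClaimAction` class, its fields, and the template strings are hypothetical stand-ins for illustration and are not Wikidata's actual implementation; only the example output mirrors the summary quoted above.

```python
from dataclasses import dataclass

# Hypothetical templates keyed by action type; Wikidata's real message
# templates are not reproduced here.
TEMPLATES = {
    "create_claim": "Created claim: {label} ({pid}): {value}",
    "remove_claim": "Removed claim: {label} ({pid}): {value}",
}

@dataclass
class ClaimAction:
    kind: str    # which action was taken, e.g. "create_claim"
    label: str   # human-readable property label
    pid: str     # property identifier
    value: str   # value of the claim

def summarize(action: ClaimAction) -> str:
    """Render a human-readable summary for one structured action."""
    return TEMPLATES[action.kind].format(
        label=action.label, pid=action.pid, value=action.value
    )

action = ClaimAction("create_claim", "Penguin Random House author ID", "P9802", "114")
print(summarize(action))
# prints: Created claim: Penguin Random House author ID (P9802): 114
```

Because the action type and its arguments fully determine the text, no language understanding is needed in this setting.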

Editing textual content on Wikipedia^[2] is often more complex, though, and benefits from a higher-level description of what was changed that balances conciseness with completeness. Generating useful edit summaries for this subset of edits is likely more complicated than a set of heuristics or a multiclass classification task can handle. It is a task for which language models are well-suited. It should be possible to train a model to take as input some manner of diff, understand the nuances of language well enough to parse whether a fact is being updated, a typo fixed, a weasel word removed, etc., and then generate a description that mimics how a Wikipedian would have chosen to describe the change.
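One way to picture "some manner of diff" as model input is a line-based unified diff built with Python's standard `difflib`. This is only a sketch under assumptions: the page does not specify the input representation the project uses, and the example revisions and the prompt wording below are invented for illustration.

```python
import difflib

def text_diff(old: str, new: str) -> str:
    """Return a unified diff of two revisions, compared line by line."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile="before",
        tofile="after",
        lineterm="",
    )
    return "\n".join(lines)

# Invented revisions: one fact update and one weasel word removed.
old = "The city has a population of 10,000.\nIt is arguably the best town."
new = "The city has a population of 12,500.\nIt is a town."

diff = text_diff(old, new)

# Hypothetical prompt format; a trained model would receive something
# like this and be asked to produce the summary text.
prompt = "Summarize this Wikipedia edit in one short sentence:\n" + diff
print(prompt)
```

A line-level diff is a coarse choice; a word-level or sentence-level diff may expose small changes such as typo fixes more directly.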

Data

This project explores the possibility of training a language model to generate edit summaries (that could then be recommended to editors) for the subset of edits that affects textual content on English Wikipedia. Training data can be found in this folder: https://analytics.wikimedia.org/published/datasets/one-off/isaacj/edit_summaries/

Results

Qualitative Analysis

There is frightfully little research describing the current usage of edit summaries, so we undertook a basic content analysis of existing edit summaries on English Wikipedia. We analyzed a stratified sample of 100 random edits made to English Wikipedia in August 2023, taking 25 edits from each of the following editor groups to capture a diverse range of editor experience: IP editors (also called anonymous editors), newcomers (editors with fewer than 10 edits), mid-experienced editors (10 - 1000 edits), and experienced editors (1000+ edits). The goal was to better understand good-faith, individual contributions to English Wikipedia with an attempted summary, so we excluded edits by bots, edits with blank or automatic edit summaries, and revert-related edits. Notably, blank edit summaries otherwise comprise 46% of the edits from August 2023, though that proportion varies greatly by editor type: 74% of edits for IP editors, 46% for newcomers, 58% for mid-experienced editors, and 38% for experienced editors.
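The sampling procedure described above can be sketched as follows. The field names (`is_ip`, `edit_count`, `is_bot`, `is_revert`, `auto_summary`, `summary`) are hypothetical, the edits are synthetic, and treating an editor with exactly 1000 edits as experienced is an assumption, since the stated ranges overlap at that boundary.

```python
import random
from collections import defaultdict

def editor_group(is_ip: bool, edit_count: int) -> str:
    """Assign an edit's author to one of the four editor groups."""
    if is_ip:
        return "ip"
    if edit_count < 10:
        return "newcomer"
    if edit_count < 1000:  # assumption: exactly 1000 edits counts as experienced
        return "mid-experienced"
    return "experienced"

def stratified_sample(edits, per_group=25, seed=0):
    """Drop excluded edits, then draw up to per_group edits from each group."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for edit in edits:
        excluded = (
            edit["is_bot"]
            or edit["is_revert"]
            or edit["auto_summary"]
            or not edit["summary"].strip()  # blank edit summary
        )
        if not excluded:
            group = editor_group(edit["is_ip"], edit["edit_count"])
            by_group[group].append(edit)
    return {
        group: rng.sample(items, min(per_group, len(items)))
        for group, items in by_group.items()
    }

# Synthetic edits for illustration only: 50 per editor group.
edits = [
    {
        "is_ip": i % 4 == 0,
        "edit_count": [0, 5, 50, 5000][i % 4],
        "is_bot": False,
        "is_revert": False,
        "auto_summary": False,
        "summary": "fixed typo",
    }
    for i in range(200)
]
sample = stratified_sample(edits)
print({group: len(items) for group, items in sample.items()})
```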

Two of the authors each coded all 100 summaries, and the results are in the table below. Each edit summary was coded for several criteria that are described by the English Wikipedia community in their guidance for how to write an edit summary:

Summary (what): Does the edit summary attempt to describe what the edit did? For example, "added links".
Summary (why): Does the edit summary attempt to describe why the edit was made? For example, "Edited for brevity and easier reading".
Misleading: Is the edit summary overly vague or misleading per English Wikipedia guidance? For example, "updated" without explaining what was updated is too vague.
Inappropriate: Could the edit summary be perceived as inappropriate or uncivil per English Wikipedia guidance?
Generate-able (what): Could a language model feasibly describe the "what" of this edit based solely on the edit diff?
Generate-able (why): Could a language model feasibly describe the "why" of this edit based solely on the edit diff?

Statistics on agreement for qualitative coding for each facet and the proportion of edit summaries that met each criterion. Ranges are a lower bound (both coders marked an edit) and an upper bound (at least one coder marked an edit); they are not standard confidence intervals, and no comparison tests are conducted between groups.
Metric	Summary (what)	Explain (why)	Misleading	Inappropriate	Generate-able (what)	Generate-able (why)
% Agreement	0.89	0.80	0.77	0.98	0.97	0.80
Cohen's Kappa	0.65	0.57	0.50	-0.01	0.39	0.32
Overall (n=100)	0.75 - 0.86	0.26 - 0.46	0.23 - 0.46	0.00 - 0.02	0.96 - 0.99	0.08 - 0.28
IP editors (n=25)	0.76 - 0.88	0.20 - 0.44	0.40 - 0.64	0.00 - 0.08	0.92 - 0.96	0.04 - 0.16
Newcomers (n=25)	0.76 - 0.84	0.36 - 0.48	0.24 - 0.52	0.00 - 0.00	0.92 - 1.00	0.12 - 0.20
Mid-experienced (n=25)	0.76 - 0.88	0.28 - 0.52	0.16 - 0.36	0.00 - 0.00	1.00 - 1.00	0.08 - 0.28
Experienced (n=25)	0.72 - 0.84	0.20 - 0.40	0.12 - 0.32	0.00 - 0.00	1.00 - 1.00	0.08 - 0.48
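For readers unfamiliar with the agreement statistics, the sketch below computes percent agreement, Cohen's kappa, and the share of edits flagged by both coders and by at least one coder, which bracket each reported range. The labels are not the project's coding data: they are a hypothetical configuration (two coders who each flag one different edit out of 100) that happens to reproduce the values in the Inappropriate column, and it shows how kappa can be slightly negative despite 98% agreement when a label is very rare.

```python
def agreement_stats(coder_a, coder_b):
    """Compare two coders' boolean labels for the same list of edits."""
    n = len(coder_a)
    pairs = list(zip(coder_a, coder_b))
    both = sum(a and b for a, b in pairs)    # flagged by both coders
    either = sum(a or b for a, b in pairs)   # flagged by at least one coder
    observed = sum(a == b for a, b in pairs) / n
    # Agreement expected by chance from each coder's own flagging rate.
    p_a, p_b = sum(coder_a) / n, sum(coder_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return {
        "agreement": observed,
        "kappa": kappa,
        "lower": both / n,
        "upper": either / n,
    }

# Hypothetical labels: each coder flags a single, different edit.
coder_a = [True, False] + [False] * 98
coder_b = [False, True] + [False] * 98

stats = agreement_stats(coder_a, coder_b)
print({name: round(value, 2) for name, value in stats.items()})
# prints: {'agreement': 0.98, 'kappa': -0.01, 'lower': 0.0, 'upper': 0.02}
```

With only two labels, the gap between the upper and lower bound equals the share of edits on which the coders disagreed, so it always equals one minus the percent agreement.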

Overall, we see relatively high agreement about how to apply the codes though the Cohen's kappa can be lower for some categories, reflecting that these judgments can be difficult and highly subjective -- i.e. what should be in an edit summary is not always straightforward to identify. The vast majority (80%) of current edit summaries are oriented towards the "what" and only to a much lesser degree (30-40%) the "why" of the edit. This aligns with the judgment of which aspect is feasible to generate via a language model: the "what" is almost always possible (>95%)^[3] but the "why" is far less obvious (20%) from the edit diff alone. Accurately describing the "why" would require outside context that a model would not have access to such as information about a source being added or events happening in the world.

A sizeable minority (35%) of edit summaries were labeled as "misleading". This was generally due to overly vague edit summaries or edit summaries that only mentioned part of the edit, e.g., saying "added citations" when a large amount of new content was also added. This points to the challenge of training a model to produce a high-quality edit summary based on existing data, given these frequent omissions and vagueness.

Almost no edit summaries are inappropriate, likely because highly inappropriate edit summaries would be deleted by administrators and therefore not appear in our dataset. This suggests that it is (thankfully) unlikely for a model trained on edit summaries to learn to suggest inappropriate summaries.

Model Development

Forthcoming.

References

[1] https://www.wikidata.org/w/index.php?title=Q42&diff=prev&oldid=1951763454
[2] Many changes in Wikipedia that do not directly change the core text of a page can likely be described in a similar way to Wikidata. This is typified by the many tools/bots that make standard, simple changes to articles such as updating URLs to archived versions or enforcing basic manual of style norms around links. In most cases, these tools can and do auto-generate useful summaries.
[3] Exceptions were mainly very complicated table edits that required knowledge of other content to judge their impact.
[4] Yang, Diyi; Halfaker, Aaron; Kraut, Robert; Hovy, Eduard (2017). "Identifying Semantic Edit Intentions from Revisions in Wikipedia" (PDF). aclweb.org: 2000-2010. doi:10.18653/v1/D17-1213. Retrieved 15 October 2021.
[5] Asthana, Sumit; Tobar Thommel, Sabrina; Halfaker, Aaron Lee; Banovic, Nikola (2021-10-18). "Automatically Labeling Low Quality Content on Wikipedia By Leveraging Patterns in Editing Behaviors". Proceedings of the ACM on Human-Computer Interaction 5 (CSCW2): 359:1–359:23. doi:10.1145/3479503.

