Research:Wikipedia Edit Types/Python package
This sub-project focuses on the technical implementation of the initial edit-types Python package, which identifies what information a revision changes on Wikipedia. While the package will continue to evolve as needs evolve, the implementation as of February 2026 covers generating diffs from wikitext and HTML for edits to main namespace (0) articles on Wikipedia and identifying the specific nodes that changed -- e.g., Templates, References, Words. Exploratory code exists for further semantic groupings of nodes, such as Content Generation vs. Content Maintenance, but these are not officially part of the package.
Edit Types
There are two main approaches to building a taxonomy of edit types:
- Edit actions -- i.e. the "what" or all of the types of changes you might make to an article. These are easier to define / detect, more "basic" / "atomic", and can be thought of as a straightforward way of turning an edit diff into a set of structured features. They are very useful for making edit recommendations / building ML models, but it can be hard to interpret the "why" behind edit actions for analysis purposes. This is typified by the Structured Tasks.
- Edit intentions / semantics -- i.e. the "why" or all the goals you might have in making an edit. These are more amorphous, often composed of different edit actions, and tell the story of what a given editor is seeking to do with an edit. They are less useful for recommenders / modeling but more useful for summary analysis / computational social science. This is typified by Yang et al.[1]
This work will begin with the edit actions component. We have mainly defined edit actions based on the various wikitext syntax available and HTML specs though have extended these where feasible to cover other common and largely-standardized cases -- e.g., references in wikitext via <ref> tags; infoboxes in HTML via infobox CSS class. Depending on which input content format is used, the diffing depends on the capabilities in mwparserfromhell (an existing, well-maintained Python-based parser for wikitext) or mwparserfromhtml (a new Python-based parser for HTML developed to largely mirror mwparserfromhell and support workflows like generating diffs). While the edit actions do not tell the story, for many "intentions", it's reasonable to define simple rules about what collection of actions define an "intention" -- e.g., adding a new sentence + citation -> content generation.
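As a minimal sketch of what such a rule could look like (this is not part of the package; the function name, the dictionary layout, and the label strings are assumptions made for illustration, and the second rule is purely hypothetical):

```python
def infer_intentions(actions):
    """Map edit-action counts to coarse intention labels.

    `actions` is assumed to look like
    {"Sentence": {"insert": 1}, "Reference": {"insert": 1}}.
    """
    def count(node_type, action):
        return actions.get(node_type, {}).get(action, 0)

    intentions = set()

    # Rule from the text: a new sentence plus a new citation
    # is treated as content generation.
    if count("Sentence", "insert") > 0 and count("Reference", "insert") > 0:
        intentions.add("content generation")

    # Hypothetical rule: an edit that only touches whitespace or
    # punctuation is treated as content maintenance.
    touched = {t for t, per_action in actions.items() if any(per_action.values())}
    if touched and touched <= {"Whitespace", "Punctuation"}:
        intentions.add("content maintenance")

    return intentions
```

Returning a set rather than a single label reflects that one edit can plausibly serve several intentions at once.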
Edit Types Taxonomy
While effort is made to maintain parity between wikitext- and HTML-based diffing, not all elements within the HTML can be easily defined by just looking at an article's wikitext. Below are current edit types (presented in a hierarchy that provides some insight into how they are detected):
- Elements/Nodes:
- Links
- Categories (documentation)
- External links (documentation)
- Wikilinks (documentation)
- Sources
- Citations (documentation)
- References (documentation)
- Media
- Images (documentation)
- Audio (documentation)
- Video (documentation)
- Miscellaneous
- Comments (documentation)
- Math elements (documentation)
- Text Formatting (documentation)
- Templates: (documentation) this is a special case -- within wikitext, they are a first-order element; within HTML, they are not but any element can be labeled as transcluded or not.
- Structure
- Sections (documentation)
- Headings (documentation)
- Content containers
- Infoboxes (documentation)
- Lists (documentation)
- Tables (documentation)
- Annotations
- Message boxes (documentation)
- Inline clean-up tags (documentation)
- Navigation boxes (documentation)
- Notes (documentation)
- Links
- Text (support for the below comes largely via the mwtokenizer library but both wikitext and HTML have logic for extracting plaintext):
- Paragraphs
- Sentences
- Words
- Punctuation
- Whitespace
Each edit type then has four associated potential actions: insert, remove, change, move. So a given edit can be broken down into pieces that will include an element-type (e.g., image), an action (e.g., change), and possibly further details about what actually happened (e.g., caption was changed from X -> Y).
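The breakdown above can be pictured as a small record type. The sketch below is illustrative only: the class name, field names, and the summary layout are assumptions and are not taken from the package's actual data model.

```python
from dataclasses import dataclass

# The four actions named in the text.
ACTIONS = ("insert", "remove", "change", "move")


@dataclass(frozen=True)
class EditComponent:
    """One piece of an edit: an element type, an action, optional details."""
    element_type: str      # e.g. "Image"
    action: str            # one of ACTIONS
    details: tuple = ()    # e.g. (("caption", "X", "Y"),)

    def __post_init__(self):
        if self.action not in ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")


def summarize(components):
    """Count how often each (element type, action) pair occurs."""
    summary = {}
    for component in components:
        per_type = summary.setdefault(component.element_type, {})
        per_type[component.action] = per_type.get(component.action, 0) + 1
    return summary
```

For example, an edit that changes one image caption and adds two words would be a list of three `EditComponent` objects, and `summarize` would reduce it to per-type action counts.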
When it comes to detecting these components in the HTML, the elements fall into a few categories:
- Things with direct HTML equivalents -- e.g., <section> for sections; <h#> for headings; <img> for images.
- Things with direct Parsoid annotations -- e.g., rel="mw:WikiLink" for wikilinks; rel="mw:PageProp/Category" for categories.
- Things that are generated by templates and have relatively coherent norms thanks to most wikis following English's example that allow us to identify them pretty consistently across languages -- e.g., infobox CSS class + <div> or <table> for infoboxes; <sup> superscript tag w/ a wikilink to a non-article namespace for inline cleanup tags like [citation needed].
- Plaintext (paragraphs; sentences; words; whitespace; punctuation), which is generated based on a series of rules about whether to extract the text or not from a given HTML element based on what type it is -- e.g., include text from wikilinks but not from categories.
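The first three categories can be sketched with Python's standard-library HTML parser. This is a simplified illustration and not how mwparserfromhtml is implemented; the function and class names are assumptions, and the inline clean-up tag and plaintext rules are left out.

```python
import re
from html.parser import HTMLParser

HEADING = re.compile(r"h[1-6]$")


def classify(tag, attrs):
    """Return an element type for one start tag, or None if unrecognized."""
    attrs = dict(attrs)
    rel = (attrs.get("rel") or "").split()
    classes = (attrs.get("class") or "").split()
    # Parsoid annotations
    if "mw:WikiLink" in rel:
        return "Wikilink"
    if "mw:PageProp/Category" in rel:
        return "Category"
    # Template-generated content identified by a shared convention
    if tag in ("div", "table") and "infobox" in classes:
        return "Infobox"
    # Direct HTML equivalents
    if tag == "section":
        return "Section"
    if HEADING.match(tag):
        return "Heading"
    if tag == "img":
        return "Image"
    return None


class ElementCounter(HTMLParser):
    """Count recognized element types while parsing."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        label = classify(tag, attrs)
        if label:
            self.counts[label] = self.counts.get(label, 0) + 1


def count_elements(html):
    parser = ElementCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

Checking the Parsoid `rel` values before the tag name matters here: a wikilink and an external link are both `<a>` elements, so the tag alone cannot distinguish them.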
Edit Diffs and Detectors
Computing textual diffs is a long-standing challenge and a central feature of the wikis -- i.e. wikitext diffs (and more recently visual diffs) underlie the ability of editors to efficiently patrol edits to articles. We do not directly reuse these technologies for two reasons: 1) our goal is to support large-scale analyses, which generally means that our implementation must exist in Python (which can easily be applied to the Data Lake via PySpark UDFs), and, 2) the goal of on-wiki diffs is to provide a visually-coherent, human-interpretable explanation of changes whereas our goal is to provide a structured, machine-interpretable description of what has changed. That said, our work most closely matches (and draws much inspiration and code from) the Visual Editor diffs, which also contain some semantic explanations of the changes occurring in a diff.
The diffing and detection process can be split into two stages:
- Tree diffing: this is the high-level determination of what changed and where in an article -- e.g., a wikilink was changed. It's the first stage in the diffing process and is particularly helpful for detecting moves and bringing more structure to the diff. The outputs are then passed on to the node differ (explained below) to further process.
- Node diffing: this is the specific determination of what happened -- e.g., the title for that wikilink was changed. This stage also is where we do some more fine-grained disambiguation of what was changed -- e.g., which namespace the wikilink references.
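The two stages can be illustrated with a toy example. This sketch is not the package's algorithm: it treats an article as a flat list of (node type, wikitext) tuples rather than a tree, uses the standard-library difflib in place of a real tree differ, and pairs changed nodes with a crude same-type heuristic. All function names are assumptions.

```python
from difflib import SequenceMatcher


def tree_diff(prev, curr):
    """Stage 1: decide which nodes were inserted, removed, changed or moved."""
    removed, inserted = [], []
    matcher = SequenceMatcher(None, prev, curr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed.extend(prev[i1:i2])
            inserted.extend(curr[j1:j2])

    results = []
    # An identical node that disappears in one place and appears in another: move.
    for node in list(removed):
        if node in inserted:
            removed.remove(node)
            inserted.remove(node)
            results.append(("move", node, node))
    # A removed and an inserted node of the same type: treated as a change.
    for old in list(removed):
        new = next((n for n in inserted if n[0] == old[0]), None)
        if new is not None:
            removed.remove(old)
            inserted.remove(new)
            results.append(("change", old, new))
    results.extend(("remove", node, None) for node in removed)
    results.extend(("insert", None, node) for node in inserted)
    return results


def node_diff(node_type, old, new):
    """Stage 2: describe what changed inside one node."""
    if node_type == "Wikilink":
        old_title, _, old_label = old.strip("[]").partition("|")
        new_title, _, new_label = new.strip("[]").partition("|")
        fields = {}
        if old_title != new_title:
            fields["title"] = (old_title, new_title)
        if old_label != new_label:
            fields["label"] = (old_label, new_label)
        return fields
    return {"content": (old, new)}


def diff(prev, curr):
    """Run both stages and return (node type, action, details) tuples."""
    output = []
    for action, old, new in tree_diff(prev, curr):
        node_type = (old or new)[0]
        details = node_diff(node_type, old[1], new[1]) if action == "change" else {}
        output.append((node_type, action, details))
    return output
```

Keeping the stages separate means the first pass only needs node identity and type, while type-specific logic (such as splitting a wikilink into title and label) stays in the second pass.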
Results
The library can be viewed here: https://pypi.org/project/mwedittypes/
The current state of the detectors can be explored through this interface: https://wiki-topic.toolforge.org/diff-tagging?lang=en
See Also
- Halfaker and Taraborelli. Automated classification of edit types
- Asthana et al. Automatically Labeling Low Quality Content on Wikipedia by Leveraging Patterns in Editing Behaviors
References
- [1] Yang, Diyi; Halfaker, Aaron; Kraut, Robert; Hovy, Eduard (2017). "Identifying Semantic Edit Intentions from Revisions in Wikipedia" (PDF). aclweb.org: 2000-2010. doi:10.18653/v1/D17-1213. Retrieved 15 October 2021.