Research talk:Automated classification of edit types

From Meta, a Wikimedia project coordination wiki

Work log

Picking up the torch

Looks like I am finally going to get to work on this project a bit. I've been working with some researchers at CMU (Diyi Yang & Bob Kraut) to replicate some past work exploring edit classification modeling, e.g. Daxenberger & Gurevych (2012). Regrettably, they built their model from a very small and contrived sample. I think we'll want to build ours from a purely random sample of articles so that it can be applied to the history page of any article. However, Daxenberger and Gurevych have done some preliminary work that I think we can take advantage of. E.g., they provide a feature set (that I've started to incorporate into revscoring's dependency injector) and they propose a set of edit categories. I'll start up a separate thread about those. --EpochFail (talk) 22:32, 26 June 2015 (UTC)

Edit classification schemes

So, I think the classification schemes will be a point of contention. We're going to come up with schemes that get it wrong and argue about it. This is fine. But we must start somewhere. So I'm pulling the classification scheme used by the following study:

Daxenberger, J., & Gurevych, I. (2012, December). A Corpus-Based Study of Edit Categories in Featured and Non-Featured Wikipedia Articles. In COLING (pp. 711-726).

They chose to combine Insert/Modify/Delete actions with a class of wiki artifacts:

  • File -- Edits affecting files (media content)
  • Information -- Textual edits affecting information content
  • Markup -- Edits affecting markup segments
  • Reference -- Edits affecting links, inter-wiki/language links or bibliographical references and citations
  • Template -- Edits affecting templates

For some reason, they conflate links, inter-wikilinks and citations in a single category and argue "[...] these edits refer to the same action in the sense of referencing something." I respectfully disagree and find that interwikilinks, internal links and references mean very different things.

They also include three special types of changes that, for some reason, do not fall into information-modification:

  • Paraphrase -- Textual edits paraphrasing words or sentences
  • Relocation -- Edits moving entire lines
  • Spelling/Grammar -- Edits correcting spelling or grammatical errors

Daxenberger & Gurevych note that there was a high level of disagreement between labelers about which edits were Information-Modification, Paraphrase and Spelling/Grammar. It's totally unclear to me what they mean by paraphrase, but the line between Information-Modification and Spelling/Grammar can be drawn at whether the edit changes the meaning of the text.

Finally they include two revision-level types that are exclusive with the rest:

  • Revert -- Edits restoring a previous state of a page
  • Vandalism -- Edits deliberately compromising Wikipedia’s integrity

These I take great issue with. They simply break down the work of the model. A revert, by their definition, does not need a predictive model; it can be detected by identity matching. Vandalism detection is much better understood in related work. And surely you can both modify information *AND* vandalize! So, I'll think about this for a bit and propose an extended scheme. --EpochFail (talk) 21:48, 29 June 2015 (UTC)
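
To illustrate the point about identity matching, here is a minimal sketch of revert detection that needs no predictive model. It is an illustration only, not the method of any existing tool: the function name, the choice of SHA-1 digests, and the `radius` of 15 revisions are all assumptions made for this example.

```python
import hashlib


def detect_identity_reverts(revision_texts, radius=15):
    """Return (reverting_index, restored_index) pairs for a page history.

    A revision counts as a revert when its text is identical to one of the
    preceding `radius` revisions and it differs from its immediate parent.
    `radius` is an assumed cutoff chosen for this sketch.
    """
    digests = [hashlib.sha1(text.encode("utf-8")).hexdigest()
               for text in revision_texts]
    reverts = []
    for i, digest in enumerate(digests):
        if i > 0 and digests[i - 1] == digest:
            continue  # identical to its parent: a null edit, not a revert
        # Search backwards, nearest revision first, skipping the parent.
        for j in range(i - 2, max(i - radius, 0) - 1, -1):
            if digests[j] == digest:
                reverts.append((i, j))
                break
    return reverts


history = ["A", "A B", "A B vandal", "A B", "A B C"]
print(detect_identity_reverts(history))  # [(3, 1)]
```

Revision 3 restores the exact text of revision 1, so it is flagged without looking at what the intermediate edit did. Whether the reverted edit was vandalism is a separate question.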

OK. Here's what I've got. I think it is pretty good. Also, I suspect that we can detect many of these near-exactly (e.g. File insertion & Template removal). I think we'll struggle more with rephrase vs. spelling/grammar.

  • File -- "[[File:Hat.jpg]]"
  • Category -- "[[Category:Headware]]"
  • Internal link -- "[[Biology]]" or "[[:File:Hat.jpg]]"
  • Interwiki link -- "[[:en:Biology]]"
  • External link -- "[//]"
  • Reference -- "<ref>Aldrich et al., Hats are coming. HATS'15.</ref>"
  • Data -- Dates, numerical values, categorical values. Not within natural language. E.g., "May 26th, 1914" "Episode 35" "male"
  • Template -- A transclusion of some other page. E.g. "{{citation needed}}"
  • Formatting -- Markup that relates to text formatting. E.g. <div>, <span>, "''", "'''", etc.
  • Markup -- Wikitext markup that isn't formatting, internal/external link, reference or file. E.g. "{|" or "{{{variable|}}}"
Text-specific. This applies to natural language content. E.g. "I have a lovely bunch of coconuts."
  • Insert/Modify/Delete information -- Adds/Changes/Removes meaning to/of/from the text.
  • Rephrase -- Changes the wording without changing the meaning.
  • Spelling/grammar -- Fixes spelling and/or grammar issues without substantially affecting word order.

--EpochFail (talk) 22:31, 29 June 2015 (UTC)
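
As a rough sketch of what "near-exact" detection of the wikitext-specific classes above could look like, the following matches the start of a changed fragment against ordered patterns. The patterns, the `classify_artifact` name, and the fallback label "Text" are assumptions for illustration; real wikitext has many edge cases (localized namespaces, nested templates) that this does not handle.

```python
import re

# Ordered (label, pattern) pairs; the first pattern that matches wins.
# These heuristics are illustrative assumptions, not a complete parser.
ARTIFACT_PATTERNS = [
    ("File", re.compile(r"^\[\[(File|Image):", re.IGNORECASE)),
    ("Category", re.compile(r"^\[\[Category:", re.IGNORECASE)),
    ("Interwiki link", re.compile(r"^\[\[:[a-z]{2,3}(-[a-z]+)?:")),
    ("Internal link", re.compile(r"^\[\[")),
    ("External link", re.compile(r"^\[(https?:)?//")),
    ("Reference", re.compile(r"^<ref[\s>/]", re.IGNORECASE)),
    ("Markup", re.compile(r"^(\{\{\{|\{\|)")),
    ("Template", re.compile(r"^\{\{")),
    ("Formatting", re.compile(r"^('{2,}|</?(div|span)\b)", re.IGNORECASE)),
]


def classify_artifact(fragment):
    """Return the artifact class of a wikitext fragment, or "Text"."""
    fragment = fragment.strip()
    for label, pattern in ARTIFACT_PATTERNS:
        if pattern.match(fragment):
            return label
    return "Text"


for example in ["[[File:Hat.jpg]]", "[[:File:Hat.jpg]]", "[[:en:Biology]]",
                "{{citation needed}}", "{|", "<ref>Aldrich et al.</ref>"]:
    print(example, "->", classify_artifact(example))
```

The order matters: "[[File:" and "[[Category:" must be tried before the generic "[[" pattern, and "{{{" before "{{". The text-specific classes (information change, rephrase, spelling/grammar) are exactly the ones this kind of matching cannot separate.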

Here's a screenshot of a naive Wiki labels form that would work but probably also greatly annoy our users.

Edit type form. A naive form for asking about edit type classes with Wiki labels is presented.

We'll need to do better than this. It might require that we think creatively about creating new OOUIjs elements. --EpochFail (talk) 23:12, 29 June 2015 (UTC)

Here's the YAML form description that generated that form. --EpochFail (talk) 23:22, 29 June 2015 (UTC)

@EpochFail: very excited about the initial scoping of the project, I want to take some time to work on this with you before Wikimania. --DarTar (talk) 21:58, 11 July 2015 (UTC)

Wikipedia edit summary legend

Check it out: en:Wikipedia:Edit_summary_legend. We should consider this as a source of desirable classes to predict. --EpochFail (talk)

A lot of different types of copy editing in here:
  • "Cleanup", "Copy edit", "Grammar", "Spelling", "Tweaks", "Capitalization"
There's also a couple types of content moving:
  • "Move"
  • "Reorg"
There's some subjective categories:
  • "Not notable"
  • "Point of view"
Some reference the thing being changed:
  • "Whitespace"
  • "External links"
  • "Internal links"
  • "Disambiguation" (a type of "internal links")
  • "Snap double redirect" (a type of "internal links")
  • "Category" (just included these in "internal links")
  • "Headers"
A few that we didn't think of before:
  • "Merge" -- Copy-paste from another article
  • "Null edit" -- No changes
  • "Punctuation"
  • "Redirect"
  • "Reply" -- as in talk page replies
One of the things that I think we can take away from this is the desire to differentiate between types of copy editing. E.g., did just one word change? Was a sentence reshuffled (without changing meaning)? Was capitalization changed?
Another thing that we can essentially get for free is a more nuanced approach to exact-matching the thing being changed. E.g., is the internal link a category? Is it changing where a link goes without changing the display phrase (disambiguation/redirect fixing)? Are headers being added or removed? Is a table being inserted or a row of a table being modified? These are things we might get by parsing and comparing entire revisions rather than looking at the diff. --EpochFail (talk) 19:22, 14 October 2015 (UTC)
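
A small sketch of the "parse and compare entire revisions" idea, applied to the disambiguation/redirect-fixing case: extract internal links from both revisions and report those whose display phrase is unchanged but whose target differs. The regular expression and both function names are assumptions for this example; a real implementation would likely use a proper wikitext parser.

```python
import re

# Matches [[target]] and [[target|display]]; nested markup is not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_links(wikitext):
    """Map display text -> target for each internal link (last one wins)."""
    links = {}
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        display = (match.group(2) or target).strip()
        links[display] = target
    return links


def retargeted_links(old_text, new_text):
    """Return {display: (old_target, new_target)} for links whose visible
    text is the same in both revisions but whose target has changed."""
    old_links = extract_links(old_text)
    new_links = extract_links(new_text)
    return {display: (old_links[display], new_links[display])
            for display in old_links.keys() & new_links.keys()
            if old_links[display] != new_links[display]}


old = "He studied [[Mercury]] and [[Biology]]."
new = "He studied [[Mercury (planet)|Mercury]] and [[Biology]]."
print(retargeted_links(old, new))  # {'Mercury': ('Mercury', 'Mercury (planet)')}
```

A plain text diff would only show that some characters were inserted inside a link; comparing the parsed links shows that the reader-visible text is unchanged and only the destination moved.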

Mockup of history page

Edit category history page mockup. A mockup of edit type category icons is presented on top of an article history page.

Hey folks. I made a really really naive mockup of edit categories appearing on the history page. I plan to post this on en:WP:VPM to do some recruitment to this project. --EpochFail (talk) 19:27, 21 October 2015 (UTC)

Analysis of Gadget-defaultsummaries.js

Reviewing Gadget-defaultsummaries.js. Here are the summaries that are listed in the code.


  • "Spelling/grammar correction"
  • "Fixing style/layout errors"
  • "Reverting vandalism or test edit"
  • "Reverting unexplained content removal"
  • "Copyedit (minor)"
  • "Expanding article"
  • "Adding/improving reference(s)"
  • "Adding/removing category/ies"
  • "Adding/removing external link(s)"
  • "Adding/removing wikilink(s)"
  • "Removing unsourced content"
  • "Removing linkspam per WP:EL"
  • "Clean up"
  • "Copyedit (major)"


--EpochFail (talk) 14:41, 23 October 2015 (UTC)

EpochFail, for talk pages, shouldn't you also include "Start"? I mean, starting a discussion is different from suggesting or commenting on something in a discussion that has already started. Some users are more involved than others in creating debates or pointing out issues on Wikipedia. Their pattern is actually quite important; they can create trends. --Alexmar983 (talk) 08:22, 26 March 2016 (UTC)
Hi Alexmar983. In this post, I was really just discussing the defaultsummaries gadget. But we've moved well beyond this WRT the taxonomy we are working with. See en:WP:Labels/Edit types/Taxonomy & it:WP:Labels/Tipologia degli edit/Tassonomia. We've completed two pilot labeling runs and we're just about to start the full labeling run. :) --EpochFail (talk) 15:16, 26 March 2016 (UTC)

Complete taxonomy

Diyi and I have started building a complete taxonomy of edit types that includes types we'd never be able to predict. Check it out: Research:Automated classification of edit types/Taxonomy. Please feel free to edit boldly to add anything you think is missing. --EpochFail (talk) 16:03, 23 October 2015 (UTC)

Posted a call to action on the English Wikipedia Village Pump

See en:Wikipedia:Village_pump_(technical)#Automatically detecting edit types -- New project. Help wanted.. --EpochFail (talk) 21:40, 29 October 2015 (UTC)