Determining reversion

From Meta, a Wikimedia project coordination wiki

This is a short article to discuss a proposed algorithm for determining the likelihood that any particular edit constitutes a "reversion" (or revert).

The algorithm works by searching the article history for the earlier revision that is as close as possible to the revision in question. This is the candidate revision, the one which may have been reverted to. The algorithm then compares the size of the diff for the edit in question (the diff against the immediately preceding revision) with the size of the diff against the candidate revision. If the revision in question is identical to the candidate revision, the edit is declared a reversion with 100% likelihood. If it is not identical, the likelihood depends on how the two diff sizes compare: if the diff for the edit in question is the same size as the diff to the candidate revision, the "likelihood" is 50%.

Description of the diff function

The contents of any two revisions of a page can be compared using the "diff" function, which is already incorporated into the MediaWiki software. The "diff" describes the differences between two versions of a page as a series of "edits" that, when applied to the original version, result in the changed version. An "edit" applies to a single line of the Wikitext, which can be an entire paragraph or just a few words. An "edit" can be an "insertion", a "deletion", or a "change", where an insertion is the insertion of a new line between two lines in the original, a deletion is the deletion of a line from the original, and a change is the substitution of one line in the original with another.

Within each "change" edit, the "diff" function highlights the portions of the line that have changed and the portions of the line that have remained the same.

Definition: the cardinality of a page is the number of characters in the Wikitext of that page.
Definition: the cardinality of a diff is the sum of:
  • the number of characters in all the deleted lines of a diff
  • the number of characters in all the inserted lines of a diff
  • the number of characters in the changed portions of the changed lines of a diff
Definition: the relative cardinality of a diff is the cardinality of the diff divided by the cardinality of the original revision of the page.
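The three definitions above can be sketched in Python. This is only an illustration: it uses the standard library's difflib as a stand-in for MediaWiki's own diff engine, so the numbers will not necessarily match what MediaWiki reports. All function names are hypothetical. Counting both the old and the new side of a changed line is an assumption, since the definition does not say which side is meant.

```python
from difflib import SequenceMatcher


def page_cardinality(wikitext: str) -> int:
    """Cardinality of a page: number of characters in its wikitext."""
    return len(wikitext)


def changed_chars(old_line: str, new_line: str) -> int:
    """Characters in the changed portions of one changed line.

    Assumption: characters on both the old and the new side are counted.
    """
    matcher = SequenceMatcher(None, old_line, new_line, autojunk=False)
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            total += (i2 - i1) + (j2 - j1)
    return total


def diff_cardinality(old: str, new: str) -> int:
    """Cardinality of the line-based diff between two revisions."""
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            total += sum(len(line) for line in old_lines[i1:i2])
        elif tag == "insert":
            total += sum(len(line) for line in new_lines[j1:j2])
        elif tag == "replace":
            # Pair lines up as "changes"; any surplus lines on either
            # side are treated as whole-line deletions or insertions.
            paired = min(i2 - i1, j2 - j1)
            for k in range(paired):
                total += changed_chars(old_lines[i1 + k], new_lines[j1 + k])
            total += sum(len(line) for line in old_lines[i1 + paired:i2])
            total += sum(len(line) for line in new_lines[j1 + paired:j2])
    return total


def relative_cardinality(old: str, new: str) -> float:
    """Diff cardinality divided by the cardinality of the original page."""
    if not old:
        raise ValueError("relative cardinality is undefined for an empty original")
    return diff_cardinality(old, new) / page_cardinality(old)
```

For example, under these assumptions `diff_cardinality("one\ntwo", "one\ntwo\nthree")` counts only the inserted line "three" and returns 5.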

What constitutes a reversion

Let H represent the revision history of a page, where H_i is the ith revision in chronological order. H_1 is the first version of the page.

Let H_n be the revision whose edit we want to test for being a reversion.

Let B represent the revisions H_1, ..., H_{n-2}, inclusive. (H_{n-1} is left out because it is the revision that the edit in question modifies.)

Let D(H_x, H_y) represent the cardinality of the diff between H_x and H_y.

Let H_i be the member of B with the smallest D(H_i, H_n).

The likelihood of reversion is:

$$L = \frac{D(H_{n-1}, H_n)}{D(H_{n-1}, H_n) + D(H_i, H_n)}$$

An edit which is identical to some previous revision (but not to H_{n-1}) thus has a likelihood of reversion of 100%. This could be considered the most restrictive definition of reversion.

The least restrictive definition of "reversion" would be any edit for which H_i is such that D(H_i, H_n) < D(H_{n-1}, H_n). The likelihood of reversion for any such edit H_n is thus greater than 50%.

Therefore, the definition of reversion can vary anywhere between a likelihood of reversion of 50% and 100%.
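A minimal sketch of the candidate search and the likelihood follows. It assumes the likelihood takes the form D(H_{n-1}, H_n) / (D(H_{n-1}, H_n) + D(H_i, H_n)), which reproduces the 100% and 50% cases described above. The function names are hypothetical, and `char_diff_size` is a crude character-level stand-in for D rather than MediaWiki's diff. How to break ties between equally close candidates, and how to score a null edit, are not specified in the text, so the choices below are assumptions.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional, Sequence, Tuple


def char_diff_size(old: str, new: str) -> int:
    """Crude stand-in for D: characters deleted plus characters inserted."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return sum(
        (i2 - i1) + (j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def reversion_likelihood(d_prev: int, d_cand: int) -> float:
    """Assumed form: d_prev / (d_prev + d_cand).

    d_prev is D(H_{n-1}, H_n); d_cand is D(H_i, H_n).
    """
    if d_prev + d_cand == 0:
        # Null edit (H_n equals both H_{n-1} and H_i). Assumption: not a revert.
        return 0.0
    return d_prev / (d_prev + d_cand)


def assess_latest(
    history: Sequence[str],
    diff_size: Callable[[str, str], int] = char_diff_size,
) -> Optional[Tuple[int, float]]:
    """Assess the last revision in `history` (oldest first).

    Returns (1-based number of the candidate revision, likelihood),
    or None when B is empty, i.e. there are fewer than three revisions.
    """
    if len(history) < 3:
        return None
    target = history[-1]      # H_n
    previous = history[-2]    # H_{n-1}
    earlier = history[:-2]    # B = H_1 ... H_{n-2}
    # min() keeps the earliest revision on ties (an assumption).
    best = min(range(len(earlier)), key=lambda i: diff_size(earlier[i], target))
    d_cand = diff_size(earlier[best], target)
    d_prev = diff_size(previous, target)
    return best + 1, reversion_likelihood(d_prev, d_cand)
```

With the history `["good text", "good text plus spam", "good text"]`, the last revision is identical to H_1, so the sketch returns `(1, 1.0)`.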


The likelihood of reversion could also be written as:

$$L = \frac{1}{1 + \dfrac{D(H_i, H_n)}{D(H_{n-1}, H_n)}}$$

If, as suggested in David's wikitech-l post, we use 90% as the threshold, then reversions will satisfy the condition:

$$D(H_{n-1}, H_n) > 9\,D(H_i, H_n)$$

In other words, the condition for declaring a reversion is simply that the diff to the previous revision is more than 9 times as large as the diff to the candidate. In general, it has to be more than

$$\frac{\alpha}{1 - \alpha}$$

times as large as the diff to the candidate, where α is the threshold. So if a user wishes to have their reversion not marked as a reversion, they have to pad it out with changes, say an HTML comment, that are at least about one ninth of the size of the content they are reverting. To revert 100 characters, they have to change at least 12 characters somewhere else (100/9 ≈ 11.1, rounded up), or 13 if the padding is itself counted in the diff to the previous revision.
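The threshold arithmetic can be checked with a short sketch. It assumes the likelihood is D(H_{n-1}, H_n) / (D(H_{n-1}, H_n) + D(H_i, H_n)) and that an edit is flagged when this strictly exceeds the threshold α. The function names are hypothetical. The padding estimate further assumes the padding is added somewhere unrelated, so it counts in both diffs. Exact fractions are used to avoid floating-point rounding at the boundary.

```python
import math
from fractions import Fraction


def required_ratio(alpha: Fraction) -> Fraction:
    """Ratio D(H_{n-1}, H_n) / D(H_i, H_n) that must be exceeded: alpha / (1 - alpha)."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha / (1 - alpha)


def is_flagged(d_prev: int, d_cand: int, alpha: Fraction) -> bool:
    """True when the assumed likelihood d_prev / (d_prev + d_cand) exceeds alpha."""
    total = d_prev + d_cand
    return total > 0 and Fraction(d_prev, total) > alpha


def min_padding_to_evade(reverted_chars: int, alpha: Fraction) -> int:
    """Smallest padding p for which the revert is no longer flagged.

    Assumption: the padding counts in both diffs, so
    d_prev = reverted_chars + p and d_cand = p. Solving
    (r + p) / (r + 2p) <= alpha gives p >= r (1 - alpha) / (2 alpha - 1).
    """
    if not Fraction(1, 2) < alpha < 1:
        raise ValueError("alpha must lie strictly between 1/2 and 1")
    return math.ceil(reverted_chars * (1 - alpha) / (2 * alpha - 1))


alpha = Fraction(9, 10)
print(required_ratio(alpha))             # 9
print(min_padding_to_evade(100, alpha))  # 13
```

Under these assumptions a 100-character revert padded with 12 characters is still flagged (112 > 9 × 12), while 13 characters of padding is enough (113 < 9 × 13).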


From the wikitech-l mailing list:

David Friedland wrote:
> I have written up a short, math-y description of an algorithmic method
> for determining whether or not a given revision constitutes reversion.
It won't work. No matter how clever and complicated your algorithm gets,
people can just study it and then make edits that *just* fall outside
the definition of a reversion.
Timwi

This assumes that users want their edits to fall outside the definition of a reversion. While MediaWiki is primarily developed for Wikipedia, where this would be a problem, I think the identification of reversions could be very useful for other users not plagued by vandalism and trolling. Even though it would not catch reversions whose authors do not want them identified, it would probably be useful on Wikipedia too, because:

The history page could show the development of a wikitext more clearly and exactly by (a) identifying exactly when the current reverted version first appeared, and (b) potentially giving a more objective figure for the percentage change between revisions (this is really an additional feature that seems to me feasible in a similar way to identifying reversion). For users who trust each other, these features may be quite useful for wikitexts with long histories. --Dittaeva 17:57, 18 Apr 2004 (UTC)

This old problem has also been discussed in wm2010:Submissions/Flagged revisions study results and wm2010:Submissions/Edit and Revert Trends. --Nemo 15:46, 27 July 2010 (UTC)