Grants:IdeaLab/Similarity measure for AbuseFilter
What is the problem you're trying to solve?
When a contribution is reverted, it is far too easy to respond with a new revert. Such revert wars can spiral out of control very rapidly.
There are thus several proposals to formulate rules against such revert wars, but there are no tools available to stop them from happening in the first place.
What is your solution?
It is possible to compute a special locality-sensitive hash digest of a contribution, which can then be compared against digests of previous contributions or reverts to measure similarity. Digests of reverts are the most interesting for revert wars, while digests of ordinary contributions can be used to detect spam. If a similar revert is detected, the contribution can be assigned a similarity index. This number can then be used in AbuseFilter to decide whether the contribution, or further reverts, should be blocked or simply tagged.
New contributions that are similar, but with a lower similarity index, should be allowed. That makes it possible to adjust a rejected contribution and then save the updated version.
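The tiered response described above can be sketched as follows. This is an illustrative sketch only, not the proposed implementation: the two thresholds and the three-way split between disallowing, tagging, and allowing are assumptions chosen for the example.

```python
# Sketch under assumptions: map a similarity index in [0, 1], computed
# against recent reverts, to an AbuseFilter-style action. Both threshold
# values below are illustrative, not part of the proposal.
BLOCK_THRESHOLD = 0.9   # near-identical re-revert: disallow the save
TAG_THRESHOLD = 0.7     # suspicious similarity: allow the save, but tag it

def decide(similarity_index):
    """Return the action for an edit with the given similarity index."""
    if similarity_index >= BLOCK_THRESHOLD:
        return "disallow"
    if similarity_index >= TAG_THRESHOLD:
        return "tag"
    return "allow"
```

An edit that merely resembles a reverted one is tagged for review, while a rejected edit that has been genuinely reworked falls below both thresholds and is allowed, which matches the re-submission behaviour described above.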
An editor should be allowed to make the opposite change of something he or she has done previously; if this were not allowed, simple copy-paste editing would be impossible. A variable such as previous_change_similarity could scan the last ten revisions of a page and report the largest change similarity (in absolute value), or the accumulated sum for this user only. If the absolute difference between the previous change similarity and the current change similarity is below a threshold, the edit is accepted as a copy edit.
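The copy-edit exception could look roughly like the sketch below. The function names, the injected pairwise similarity function, and the tolerance value are all assumptions made for illustration; only the "scan the last ten changes, take the largest similarity" logic comes from the proposal.

```python
# Sketch under assumptions: let a user undo their own earlier change
# (e.g. paste back text they themselves removed) without tripping the filter.
COPY_EDIT_THRESHOLD = 0.1  # assumed tolerance, not part of the proposal

def previous_change_similarity(current_change, recent_changes, sim):
    """Largest similarity between the current change and recent changes.

    `sim` is any pairwise similarity function returning a value in [0, 1];
    `recent_changes` holds the (at most ten) most recent changes to the
    page by this user, newest first.
    """
    return max((sim(current_change, old) for old in recent_changes[:10]),
               default=0.0)

def is_copy_edit(current_similarity, previous_similarity):
    """Accept when the two similarities nearly cancel out."""
    return abs(previous_similarity - current_similarity) < COPY_EDIT_THRESHOLD
```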
Note that a working solution must somehow take into account the access level the user holds. Otherwise a user with a low access level could revert edits done by a more established user without anyone being able to undo the revert. This is just a small part of the total problem, as a user with more access rights should be able to override a user with fewer access rights. That is, in turn, part of an even larger problem: a better solution would be temporal karma, with a base capital given by the user rights.
Text fragments can be compared by specially crafted hash digests, often formed by hashing strings of 3-5 characters. There is a large group of such algorithms, often called w:Locality-sensitive hashing, but the important point is that some of them compute a digest in O(n), where n is the length of the text. By caching previous digests we get a search in O(m), where m is the number of previously processed texts. The former is somewhat heavy (a large constant) while the latter is rather lightweight. This is still a lot better than most methods for w:edit distance; the standard solution is O(n²) in comparison. It is possible to simplify this to O(n), but then with some limitations.
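A minimal sketch of the idea, assuming MinHash as the concrete locality-sensitive scheme (the proposal does not fix a particular algorithm) and blake2b as the underlying hash. Building a digest is O(n) in the text length; comparing two cached digests is O(k) for k hash functions, independent of text length.

```python
# Sketch only: estimate text similarity with MinHash over character
# 3-grams. The digest size and hash choice are assumptions for the example.
import hashlib

NUM_HASHES = 64  # digest size; trades accuracy against cost

def shingles(text, size=3):
    """All overlapping character n-grams of the text (assumes len >= size)."""
    return {text[i:i + size] for i in range(len(text) - size + 1)}

def minhash(text):
    """Digest: for each seeded hash, keep the minimum over all shingles."""
    digest = []
    for seed in range(NUM_HASHES):
        digest.append(min(
            int.from_bytes(hashlib.blake2b(
                s.encode(), digest_size=8, salt=seed.to_bytes(8, "big")
            ).digest(), "big")
            for s in shingles(text)
        ))
    return digest

def similarity(d1, d2):
    """Fraction of matching minima; estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(d1, d2)) / NUM_HASHES
```

Two near-identical revisions share almost all shingles and thus almost all minima, so their estimated similarity is close to 1, while unrelated texts score near 0. Only the fixed-size digests need to be cached and compared, which is what gives the O(m) search over previously processed texts.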
There is a proposal, Grants:IdeaLab/1RR minimal delay, which needs this or similar functionality.
For a more in-depth description see Grants:IdeaLab/Similarity measure for AbuseFilter/Technical description.
To create a minimal yet effective measure of similarity between edits, thereby making it possible to detect (and possibly stop) edit wars and revert chains.
About the idea creator
I've been a contributor to Wikimedia projects for more than ten years, and have a cand.sci. in mathematics and computer science.
Expand your idea
Would a grant from the Wikimedia Foundation help make your idea happen? You can expand this idea into a grant proposal.
Develop the code to implement a similarity measure for AbuseFilter according to the technical description. The final outcome will be an additional variable
change_similarity that can be used in AbuseFilter filters. The variable should be documented at the page mw:Extension:AbuseFilter/Rules format.
It would make it possible to write filters that tag edit wars, or even stop them completely.
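If the proposed variable existed, a filter in the existing rules format (see mw:Extension:AbuseFilter/Rules format) might look like the following hypothetical fragment; the 0.9 threshold is purely illustrative, and whether the filter tags or disallows the edit is set in the filter's configuration, not in the rule itself:

```
action == "edit" & change_similarity > 0.9
```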
A developer for approximately half a man-month.