Research talk:Revision scoring as a service/Work log/2016-01-23

Saturday, January 23, 2016

Hey folks. Today, I'm looking at past papers about detecting vandalism with the goal of summarizing their methodological strategies.

adler10detecting

B Adler, Luca de Alfaro, and Ian Pye. Detecting wikipedia vandalism using wikitrust. Notebook papers of CLEF, 1:22–23, 2010.

90.4% ROC-AUC in realtime
- Recall: 0.828, Precision: 0.308, False pos.: 0.173
Argues that static dumps of Wikipedia that might contain vandalism are a problem
Zero-delay vs. Historical features
Introduces WikiTrust
Couldn't actually use author reputation features in realtime due to limitations of the WikiTrust system
- Instead, they had a "trust" score for a version of the article that was generated using this reputation measure (among other things). It seems that this is where WikiTrust helped the signal for their realtime classifier.
Does discuss "realtime"ness of the feature set.
Introduces exact feature set. This takes up 2.5 pages.
Reports AUC on both training set and test set. (Should probably implement that in revscoring)
Tests against the PAN dataset. The PAN dataset is old, but it is shared. We should probably test against the PAN dataset too.
They provide an ORES-like API! It even returns a JSON result. But it's English Wikipedia only. :(

--EpochFail (talk) 18:06, 23 January 2016 (UTC)
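
On the note above about reporting AUC on both the training set and the test set: here is a minimal sketch of what that could look like in plain Python. This is not revscoring's API. The names `roc_auc`, `report_auc`, and `score_fn` are hypothetical, and the AUC is computed with the standard rank-sum identity, which may differ from how the paper computed it.

```python
def roc_auc(labels, scores):
    """ROC-AUC via the rank-sum (Mann-Whitney U) identity.

    labels: iterable of 0/1 (1 = vandalism); scores: classifier scores.
    Tied scores receive their average rank.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one example of each class")
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def report_auc(score_fn, train, test):
    """Return ROC-AUC for both sets; each set is a (features, labels) pair.

    score_fn is a hypothetical stand-in for a trained model's scorer.
    """
    report = {}
    for name, (features, labels) in (("train", train), ("test", test)):
        report[name] = roc_auc(labels, [score_fn(x) for x in features])
    return report
```

A large gap between the two numbers would hint at overfitting, which is presumably why reporting both is useful.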

adler11wikipedia

B Thomas Adler, Luca De Alfaro, Santiago M Mola-Velasco, Paolo Rosso, and Andrew G West. Wikipedia vandalism detection: Combining natural language, metadata, and reputation features. In Computational linguistics and intelligent text processing, pages 277–288. Springer, 2011.

Addresses the problem of the "open model"
Discusses the scale and effort needed to patrol damage, but doesn't do a back-of-the-envelope for effort
Reports results as an optimal model (maximizing precision and recall) and at fixed precision (99% precision yields 30% recall). Doesn't seem to do a fixed-recall estimate.
Considers the computation and data gathering costs of features
This time, "immediate" vandalism problem is discussed before the "historical" vandalism problem
Reports results as PR-AUC (and notes that this is the stat used in the PAN competition)
"A brief history of vandalism detection"
- ClueBot has 100% precision but extremely low recall (but 50% is the given stat)
- Text features: See Potthast, Smets & Chin
- West incorporates temporal/spatial features
- Potthast built a classifier that mixed all past work together and it worked "better" (but how much!? in what ways?)
- Using PAN 2010 dataset as truth
Uses IP geolocation to figure out the local time of day for anons (doesn't mention registered editors)
Very small set of features:
- 3 Metadata
- 2 Text features (uppercase ratio and digit ratio)
- 2 Types of Language features -- pronouns and badwords
- 3 types of reputation features -- user-reputation, country-reputation (Anons), WikiTrust histogram
Argues that PR-AUC is a better measure as it accounts for vandalism being a rare occurrence
Provides a full page table of all features used with a brief summary and some metadata labels
Presents precision recall curves broken down by feature sets and past work

--EpochFail (talk) 18:06, 23 January 2016 (UTC)
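
The "fixed precision" style of reporting noted above (e.g. 99% precision yields 30% recall) can be sketched as follows. This is only an illustration under my own assumptions: `recall_at_precision` is a hypothetical helper, thresholds are taken at each distinct score, and nothing here is taken from the paper's code.

```python
def recall_at_precision(labels, scores, min_precision):
    """Highest recall over all score thresholds whose precision is at least
    min_precision. Returns 0.0 if no threshold qualifies.

    labels: 0/1 per edit (1 = vandalism); scores: higher = more suspicious.
    """
    total_pos = sum(1 for y in labels if y)
    if total_pos == 0:
        raise ValueError("no positive examples")
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    best = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Consume all edits tied at this score so the threshold is well defined.
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        if precision >= min_precision:
            best = max(best, tp / total_pos)
    return best
```

A fixed-recall estimate (which the paper doesn't seem to report) would be the mirror image: fix the minimum recall and report the best precision.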

harpalani11language

Manoj Harpalani, Michael Hart, Sandesh Singh, Rob Johnson, and Yejin Choi. Language of vandalism: Improving wikipedia vandalism detection via stylometric analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, pages 83–88. Association for Computational Linguistics, 2011.

Assumes that vandals share the same "genre" in language
First research to dig into using "deep" NLP to detect vandalism, in contrast to the "shallow" NLP of earlier work (Wang and McKeown (2010), Chin et al. (2010), Adler et al. (2011))
They use PCFGs (probabilistic context-free grammars), which we are currently investigating. This paper actually motivated us to give them a try.
Used the PAN corpus of annotated edits.
They have features that are almost impossible to implement in production: "How many times the article has been reverted", "Previous vandalism count of the article"...
They focused only on edits that add or change text; they ignored edits that remove text.
Since vandalism is a skewed class (3% in their corpus), they emphasized F-score over AUC.
"Adding language model features to the baseline increases the F-score slightly (53.5%), while the AUC score is almost the same (91.7%). Adding PCFG based features to the baseline (denoted as +PCFG) brings the most substantial performance improvement: it increases recall substantially while also improving precision, achieving 57.9% F-score and 92.9% AUC."

Yours sincerely Amir (talk) 20:21, 23 January 2016 (UTC)
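
For reference, the F-score emphasized above combines precision and recall into one number. A minimal sketch, assuming the paper means the usual F1 (the harmonic mean); `f_score` and its `beta` parameter are my own illustrative names:

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when the classifier finds nothing
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

As a sanity check on scale, the precision of 0.308 and recall of 0.828 listed under adler10detecting would correspond to an F1 of roughly 0.449 at that operating point.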

wang10got

William Yang Wang and Kathleen R McKeown. Got you!: automatic vandalism detection in wikipedia with web-based shallow syntactic-semantic modeling. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 1146–1154. Association for Computational Linguistics, 2010.

chin2010detecting

Chin, Si-Chi, et al. "Detecting Wikipedia vandalism with active learning and statistical language models." Proceedings of the 4th workshop on Information credibility. ACM, 2010.

The main thing to notice is that he's using upwards of 1 million features (!).

The paper was written in 2009, so many things might have changed, but Belani claims that ClueBot operates not via ML but by badword regex. Is this still true in 2015?
Belani examines words not just in the article but also in the edit comment (and the IP of the editor)
Target class is vandalism or not.
Data used: pages-meta-history.xml.bz2
Only considered anonymous edits.
Merges multiple successive edits by the same editor into one edit (but records how many consecutive edits were merged and uses this integer as a feature).
Included html/wiki markup in word list.
For each word in global word set (see below), add a feature for both difference and ratio of number of occurrences.
Neither TF-IDF nor word stemming was used.
Ancillary features:
- (Bool) Edit Empty?
- (Bool) Comment Entered?
- (Bool) Marked Minor Edit?
- (Bool) Edit was change to External Links?
- (Int) First four octets of IPv4 of anon editor.
- (Int) N-1 where N is the number of consecutive edits by same editor
- (Int) num chars and num words in previous and current revision and their differences.
Three squashing functions were considered for mapping feature values into interval [0,1].
- arctan(x) * 2/pi
- delta(x>0)
- min [ ln(x+1) / ln(max(X) + 1) , 1] (where max(X) is the maximum value of the feature on the training set)

Note that for the second function (the binary indicator), we don't need both difference and ratio features.
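
A minimal sketch of the three squashing functions listed above, assuming non-negative feature values (counts). The function names are mine, and the guard for a non-positive training maximum is my own assumption; the paper's handling of that case is not described here.

```python
import math


def squash_atan(x):
    """arctan(x) * 2/pi: maps [0, inf) into [0, 1)."""
    return math.atan(x) * 2 / math.pi


def squash_binary(x):
    """delta(x > 0): 1 if the feature fired at all, else 0."""
    return 1.0 if x > 0 else 0.0


def squash_loglin(x, train_max):
    """min[ln(x+1) / ln(max(X)+1), 1], with max(X) taken from the training set.

    Values above the training maximum are clipped to 1.
    """
    if train_max <= 0:
        return 0.0  # assumption: a feature that never fired in training maps to 0
    return min(math.log(x + 1) / math.log(train_max + 1), 1.0)
```

Only the log-lin variant needs a statistic from the training set, so it is the only one that has to be fitted before it can be applied to new edits.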

Quoting Belani: "2,000,000 cases from the years 2001-08 were extracted from articles starting with the letters A to M. 1,585,397 unique words and an average of 104 words per case were extracted. Because additions and subtractions of words were processed as separate features, the number of features in the dataset is at least twice the number of unique words, or 3,170,794. Additionally, because atan and log-lin scaled datasets also include word ratio features, the number of features in them is at least four times the number of unique words, or 6,341,588. "
Uses Logistic Regression (mentions that linear SVM could work).
Interestingly, the best performing squashing function was the binary function (lolwut).
It also occurred to me that, whichever way we decide to do this, we will have to make sense of *parametrized families* of features in our code, since we can't define thousands of features individually.

Arthur's comment regarding this paper. Moved from email Amir (talk) 05:26, 25 January 2016 (UTC)
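
To make the "parametrized families" point above concrete, here is a rough sketch of how one family of word-difference features could be generated rather than defined one by one. Everything here is an assumption for illustration: `word_delta_features` is a hypothetical name, tokenization is plain whitespace splitting (which keeps markup tokens in the word list), and only the difference family is shown, not the ratio family.

```python
from collections import Counter


def word_delta_features(prev_text, curr_text):
    """Sparse features keyed by (kind, word).

    One 'added' and one 'removed' feature per word, generated on demand,
    so the feature space can be in the millions without listing features.
    """
    prev = Counter(prev_text.split())
    curr = Counter(curr_text.split())
    features = {}
    for word in prev.keys() | curr.keys():
        diff = curr[word] - prev[word]  # Counter returns 0 for unseen words
        if diff > 0:
            features[("added", word)] = diff
        elif diff < 0:
            features[("removed", word)] = -diff
    return features
```

Keeping additions and removals as separate keys mirrors the quote above, where the feature count is at least twice the number of unique words.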

itakura2009using

Itakura, Kelly Y., and Charles LA Clarke. "Using dynamic markov compression to detect vandalism in the wikipedia." Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. ACM, 2009.

It uses a rather strange feature extraction: it compresses the data using the DMC method and then compares compression ratios. They "judged the test edit to be spam if the compression ratio with the spam training set was higher than the compression ratio with the ham training set" (ham = good edit, spam = bad edit).
The vandalism it can detect is limited to added and changed text (it can't detect text removal).
There is other research that works in a similar way but uses LZW instead of DMC.
This method is more effective at finding vandalism in added text, where it gives 13% precision and 92.8% recall. IMO it's better to use this method as a feature (alongside other features like user age) than as a classifier. But as a stand-alone classifier it could still help patrollers, given its precision and recall.

Humbly Amir (talk) 07:29, 25 January 2016 (UTC)
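
The compression idea above can be illustrated with a small sketch. Note the substitutions: this uses zlib (DEFLATE) from the Python standard library as a stand-in, not DMC, and it uses "fewer extra compressed bytes when appended to a corpus" as a proxy for "higher compression ratio with that corpus". The function names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data at maximum compression."""
    return len(zlib.compress(data, 9))


def added_cost(training: bytes, edit: bytes) -> int:
    """Extra compressed bytes needed to encode the edit after the training text."""
    return compressed_size(training + edit) - compressed_size(training)


def classify(edit: str, spam_corpus: str, ham_corpus: str) -> str:
    """Label the edit by whichever corpus it compresses better against."""
    e = edit.encode("utf-8")
    spam_cost = added_cost(spam_corpus.encode("utf-8"), e)
    ham_cost = added_cost(ham_corpus.encode("utf-8"), e)
    return "spam" if spam_cost < ham_cost else "ham"
```

To use this as a feature rather than a classifier, as suggested above, one could emit the difference between the two costs instead of the label.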

General notes

Discussions of vandalism models tend to cover fitness for both "zero-delay" and "historical" vandalism detection. It seems that realtime takes a back seat because it tends to have substantially lower fitness.
When realtime systems are discussed, the realtime computation of the feature set is not treated very carefully. In the context of actually using these classifiers to do something useful, both realtime features and realtime computation are necessary.