Research talk:Automated classification of edit quality/Work log/2018-08-13
Monday, August 13, 2018
Active learning for editquality
Active learning is a special case of semi-supervised machine learning in which a learning algorithm (learner) is able to interactively query the user (or some other information source) to obtain the desired outputs at new data points. 
The major benefit of active learning (AL) is not a dramatic increase in prediction accuracy, but rather a substantial decrease in the cost of data labeling.
The most popular framework, pool-based AL, works as follows: a model is trained on a small amount of labeled data (seed data), then an acquisition function decides which data points from an unlabeled pool would benefit the model most and asks an expert to label them; once these new data are added to the training set, the whole model is retrained. This process is repeated until accuracy stops improving. In this framework there is usually a training set (split for cross-validation), a validation set, and a test set. For research purposes, people usually do not employ human oracles. Instead, they take a fully labeled set and hide the labels of some of the data points, effectively generating the pool. Later on, the labels of the points selected from the pool by the acquisition function are revealed.
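The research-style simulation described above can be sketched in a few lines. This is only an illustration under loud assumptions: the data are synthetic 1-D points, the "learner" is a toy class-mean boundary rather than any model used in editquality, the acquisition function is plain uncertainty (closeness to 0.5), and the loop runs for a fixed number of rounds instead of stopping when accuracy plateaus. The names `make_data`, `fit`, `predict_proba`, `accuracy` and `simulate` are hypothetical, and no validation set is used.

```python
import math
import random


def make_data(n=400, seed=0):
    """Synthetic, fully labeled 1-D data: class 0 around -1, class 1 around +1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * y - 1.0, 1.0), y))
    return data


def fit(labeled):
    """Toy learner: decision boundary halfway between the two class means."""
    means = []
    for cls in (0, 1):
        xs = [x for x, y in labeled if y == cls]
        means.append(sum(xs) / len(xs))
    direction = 1.0 if means[1] >= means[0] else -1.0
    return (means[0] + means[1]) / 2.0, direction


def predict_proba(model, x):
    """Pseudo-probability of class 1 from the signed distance to the boundary."""
    boundary, direction = model
    return 1.0 / (1.0 + math.exp(-4.0 * direction * (x - boundary)))


def accuracy(model, points):
    hits = sum((predict_proba(model, x) >= 0.5) == (y == 1) for x, y in points)
    return hits / len(points)


def simulate(data, seed_per_class=2, test_size=100, batch=5, rounds=10):
    """Run a pool-based AL simulation; return [(n_labeled, test_accuracy), ...]."""
    test, rest = data[:test_size], data[test_size:]
    labeled, pool_x, hidden_y = [], [], []
    need = {0: seed_per_class, 1: seed_per_class}
    for x, y in rest:
        if need[y] > 0:            # seed data keeps its labels
            labeled.append((x, y))
            need[y] -= 1
        else:                      # pool: features visible, labels hidden
            pool_x.append(x)
            hidden_y.append(y)
    history = []
    for _ in range(rounds):
        model = fit(labeled)       # retrain from scratch on all labeled data
        history.append((len(labeled), accuracy(model, test)))
        # Acquisition: prefer points whose predicted probability is nearest 0.5.
        ranked = sorted(range(len(pool_x)),
                        key=lambda i: abs(predict_proba(model, pool_x[i]) - 0.5))
        chosen = set(ranked[:batch])
        # "Oracle" step: reveal the hidden labels of the chosen points.
        labeled += [(pool_x[i], hidden_y[i]) for i in sorted(chosen)]
        pool_x = [x for i, x in enumerate(pool_x) if i not in chosen]
        hidden_y = [y for i, y in enumerate(hidden_y) if i not in chosen]
    return history


if __name__ == "__main__":
    for n_labeled, acc in simulate(make_data()):
        print(n_labeled, round(acc, 3))
```

Swapping the sort key for a different scoring function is all it takes to try another acquisition strategy in the same loop.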
Recent research papers on AL branch into three major directions:
1) AL + deep learning (AL+DL);
2) AL + generative adversarial networks (AL+GAN), where the GAN is used to generate data on demand. Jia-Jie Zhu and Jose Bento proposed "to use GANs to synthesize informative training instances that are adapted to the current learner." "We then ask human oracles to label these instances. The labeled data is added back to the training set to update the learner. This protocol is executed iteratively until the label budget is reached."
3) AL as a reinforcement learning problem (AL+RL). Meng Fang et al. proposed a method called "Policy-based Active Learning", which, when used in the multilingual configuration, shows better performance than uncertainty sampling and random sampling. "Our algorithm does not use a fixed heuristic, but instead learns how to actively select data, formalized as a reinforcement learning (RL) problem. An intelligent agent must decide whether or not to select data for annotation in a streaming setting, where the decision policy is learned using a deep Q-network".
Since deep learning algorithms exceed the computational resources available to ORES, we won't be able to use them in Revscoring directly.
What's the use for Wikipedia?
For Wikipedia, it is important to be able to add more informative data points to the existing dataset without making volunteers label a lot of data. If we go back in time, we can find papers that describe AL methods applicable to tree-based classifiers such as the Random Forest or Gradient Boosting models that we use.
The dataset that has already been labeled can be split into a seed set and a test set. The new, unlabeled data can be drawn from the "reverted-for-damage" edits and constitute the pool. Then we need to pick an acquisition function that will strategically select the most valuable data points from the pool. Yifan Fu et al. (2012) offer a nice survey on instance selection for AL. For the classification task, they suggest two major groups of query strategies:
1. Uncertainty sampling. For example, an instance with P(badfaith) = 0.5 or P(damaging) = 0.5 would be a good candidate for labeling and inclusion in the training set. The candidate range can also be widened to 0.4-0.6 or even 0.3-0.7, depending on how confident we are in the classifier.
2. Variance reduction. "An algorithm searches the best possible instances to minimize the output variance and the total expected error" (refer to page 12 for the formula).
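As a rough illustration of the uncertainty-sampling strategy, the sketch below picks the pool items whose predicted probability falls inside a band around 0.5 and returns the most uncertain ones first. The function name `select_uncertain`, the revision ids, and the scores are all hypothetical; in practice the scores would be whatever P(damaging) or P(badfaith) the current model produces.

```python
def select_uncertain(scores, low=0.4, high=0.6, limit=None):
    """Return ids whose probability lies in [low, high], most uncertain first.

    scores: dict mapping a (hypothetical) revision id to a predicted
            probability such as P(damaging) or P(badfaith).
    """
    in_band = [(abs(p - 0.5), rev_id)
               for rev_id, p in scores.items()
               if low <= p <= high]
    in_band.sort()                       # smallest distance to 0.5 first
    picked = [rev_id for _, rev_id in in_band]
    return picked if limit is None else picked[:limit]


# Made-up scores, purely to exercise the function.
scores = {101: 0.05, 102: 0.48, 103: 0.61, 104: 0.55, 105: 0.93}
print(select_uncertain(scores))                      # [102, 104]
print(select_uncertain(scores, low=0.3, high=0.7))   # [102, 104, 103]
```

Widening the band trades labeling effort for coverage: more edits are sent to volunteers, but fewer borderline cases are missed.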
Yarin Gal et al. describe five acquisition functions appropriate for classification problems:
1. "Choose pool points that maximize the predictive entropy";
2. Bayesian Active Learning by Disagreement (winning algorithm) "Choose pool points that are expected to maximise the information gained about the model parameters, i.e. maximise the mutual information between predictions and model posterior (BALD, (Houlsby et al., 2011)) <...> Points that maximise this acquisition function are points on which the model is uncertain on average, but there exist model parameters that produce disagreeing predictions with high certainty";
3. "Maximise the Variation Ratios <...> Like Max Entropy, Variation Ratios measures lack of confidence";
4. "Maximise mean standard deviation <...> averaged over all c classes x can take. Compared to the above acquisition functions, this is more of an ad-hoc technique used in recent literature."
5. "Random acquisition (baseline)" where the acquisition function returns a draw from a uniform distribution over the interval [0, 1]. "Using this acquisition function is equivalent to choosing a point uniformly at random from the pool".
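The five functions can be written down compactly if we assume that, for each pool point, we have T class-probability vectors (one per stochastic forward pass in the paper; treating the per-tree outputs of a tree ensemble the same way is an assumption made here, not something the paper proposes). The sketch below follows the standard definitions: entropy of the averaged prediction, mutual information as "entropy of the mean minus mean of the entropies", one minus the largest averaged class probability, and the per-class standard deviation averaged over classes. All function names are hypothetical.

```python
import math
import random


def _mean_probs(samples):
    """Average the T sampled class-probability vectors of one pool point."""
    t = len(samples)
    return [sum(s[c] for s in samples) / t for c in range(len(samples[0]))]


def _entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy(samples):
    """Predictive entropy of the averaged prediction."""
    return _entropy(_mean_probs(samples))


def bald(samples):
    """Mutual information: entropy of the mean minus mean of the entropies."""
    expected_entropy = sum(_entropy(s) for s in samples) / len(samples)
    return _entropy(_mean_probs(samples)) - expected_entropy


def variation_ratio(samples):
    """One minus the probability of the most likely class."""
    return 1.0 - max(_mean_probs(samples))


def mean_std(samples):
    """Standard deviation of each class probability, averaged over classes."""
    t = len(samples)
    stds = []
    for c, m in enumerate(_mean_probs(samples)):
        second_moment = sum(s[c] ** 2 for s in samples) / t
        stds.append(math.sqrt(max(second_moment - m * m, 0.0)))
    return sum(stds) / len(stds)


def random_acquisition(samples, rng=random):
    """Baseline: a uniform draw from [0, 1), independent of the point."""
    return rng.random()


def top_k(pool, acquisition, k):
    """pool: dict of point id -> T sampled probability vectors."""
    return sorted(pool, key=lambda i: acquisition(pool[i]), reverse=True)[:k]


# Toy inputs: two samples per point, two classes.
pool = {
    "agree_uncertain": [[0.5, 0.5], [0.5, 0.5]],
    "disagree": [[1.0, 0.0], [0.0, 1.0]],
    "confident": [[0.9, 0.1], [0.9, 0.1]],
}
print(top_k(pool, bald, 1))   # ['disagree']
```

The toy pool shows the distinction drawn in the BALD quote: "agree_uncertain" and "disagree" have the same predictive entropy, but only "disagree" has high mutual information, because its samples are individually confident yet contradict each other.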
To choose the best acquisition function, we could try all of them on the labeled dataset (using the AL framework described above for research purposes). The function that gives the smallest error on the full data can then be used to select data points from our unlabeled pool.
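One possible way to turn this comparison into a selection rule is sketched below. It assumes each candidate function has already been run through the simulated AL loop and has produced a list of test errors, one per round. The rule takes the smallest error at the end of the run and breaks ties by the mean error over the whole curve; the tie-break is an assumption added here, as are the name `pick_best` and the numbers, which are made up purely to exercise the function and are not measurements.

```python
def pick_best(curves):
    """Pick the acquisition function with the lowest final test error.

    curves: dict mapping a function name to its list of test errors,
            one entry per AL round (last entry = all acquired data used).
    Ties on the final error are broken by the mean error over the curve.
    """
    def key(name):
        errors = curves[name]
        return (errors[-1], sum(errors) / len(errors))
    return min(curves, key=key)


# Made-up learning curves, for illustration only.
curves = {
    "random": [0.30, 0.25, 0.20],
    "max_entropy": [0.30, 0.20, 0.15],
    "bald": [0.30, 0.18, 0.15],
}
print(pick_best(curves))   # bald
```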
- Houlsby, Neil, Huszár, Ferenc, Ghahramani, Zoubin, and Lengyel, Máté. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.