Research:Sentiment analysis tool of new editor interaction

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


This sprint aims to develop an automatic classifier for messages found in new editor interactions. Previous studies provide a set of coded examples of these messages, categorized as "praise", "criticism", "warning", etc. Based on the coded examples, I apply supervised machine learning methods to build the classifier. If successfully built, the tool will enable us to analyse new editor interaction from different perspectives (e.g., for different categories of content) with less labor.

Topic

This sprint is primarily about building a fundamental tool to be used to answer further research questions. One such question would be finding the most important features that contribute to deciding whether a message is praise, criticism, an educational comment, etc. This sprint is largely related to RQ2.

Process

I first look at the content of the messages and see which linguistic features can be extracted and which contribute most to deciding the tone of a message. Other clues, such as the sender's edit history, might be included later.

I split the coded examples into a training set and a test set for the classifier. After applying a supervised learning method to the training set, the performance of the classifier is evaluated on the test set. The process of training and testing will be iterated over different feature sets and different hyperparameters of the classifier to find the most effective classification setup, as sketched below.
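A minimal sketch of this train-and-evaluate loop, assuming the coded examples have already been converted to feature vectors X and binary labels y (both names are placeholders); scikit-learn's LinearSVC, a wrapper around liblinear, stands in for the classifier used in the sprint:

    # Sketch of the train/test loop: hold out a quarter of the coded examples,
    # train liblinear-style linear SVMs over a grid of cost values, and report
    # test-set accuracy for each.
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score

    def evaluate(X, y, costs=(0.01, 0.1, 1.0, 10.0)):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.25, random_state=0)
        results = {}
        for c in costs:  # hyperparameter sweep over liblinear's cost parameter C
            clf = LinearSVC(C=c).fit(X_train, y_train)
            results[c] = accuracy_score(y_test, clf.predict(X_test))
        return results  # e.g. {0.01: 0.67, 0.1: 0.71, ...}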

Training procedure

  1. For each coded message, extract raw features and store them as a document in MongoDB with the following structure:
    
    {
      "entry": {
        "rev_id": 2894772,
        "title": "Yosri",
        "text": "Hi ....",
        "timestamp": "...",
        "sender": {},
        "receiver": {}
      },
      "labels": {
        "praise":    false,
        "criticism": false,
        "warning":   true,
        ...
      },
      "features": {
        "ngram":   {"type": "assoc", "values": {...}},
        "SentiWN": {"type": "assoc", "values": {...}},
        ...
      },
      "vector": {
        "1": 1,
        "2": 3.5,
        ...
      },
      ...
    }
  2. Convert the raw features into vectors, and update all entries in MongoDB. (Different feature selections and/or hash kernels may be used here; see the sketch after this list.)
     Features include:
     • SentiWordNet's sentiment polarity scores of the words used in the message
     • N-grams of wikilinks found in the message
  3. Train a classifier with the feature vectors.
  4. Output the resulting model.
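Below is a minimal sketch of the vectorization step, assuming the documents above are stored in a MongoDB collection named wikisentiment.entries (a hypothetical name) and that all feature values are numeric; scikit-learn's FeatureHasher stands in for the hash kernel:

    # Sketch of step 2: flatten each document's "features" sub-documents into
    # one dict, hash it into a fixed-width sparse vector with FeatureHasher
    # (a hash kernel), and write the nonzero entries back to MongoDB.
    from pymongo import MongoClient
    from sklearn.feature_extraction import FeatureHasher

    collection = MongoClient().wikisentiment.entries  # hypothetical names
    hasher = FeatureHasher(n_features=2 ** 16, input_type='dict')

    for doc in collection.find():
        flat = {}
        for name, group in doc['features'].items():
            for key, value in group['values'].items():
                flat['%s:%s' % (name, key)] = value  # prefix keys by feature group
        row = hasher.transform([flat])  # a 1 x 2^16 scipy sparse row
        sparse = {str(i): float(v) for i, v in zip(row.indices, row.data)}
        collection.update_one({'_id': doc['_id']}, {'$set': {'vector': sparse}})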

Results and discussion

Label          Accuracy
Criticism      75.4717% (80/106)
Teaching       66.9811% (70/106)
Warning        76.4151% (81/106)
Praise_Thanks  67.9245% (72/106)

Classifiers are trained on three quarters (3/4) of the coded examples created in a previous sprint, and tested on the rest of them.

The accuracies above may not look too bad, but they are only slightly better than the baselines. For example, 61% of the 'Teaching' examples are negative, so a trivial classifier that always predicts negative is guaranteed 61% accuracy on that label.
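For reference, this majority-class baseline can be computed directly from the coded labels; a minimal sketch (the label list passed in is a placeholder):

    # Majority-class baseline: the accuracy of always predicting the most
    # common label. With 61% negative 'Teaching' examples this returns ~0.61.
    def majority_baseline(labels):
        counts = {}
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
        return max(counts.values()) / float(len(labels))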

The features I currently give to the classifiers are (see the sketch after this list):

  • Real-valued features: sentiment scores of the words in the message, given by SentiWordNet
  • Binary features: regular expressions detecting Wikipedia-specific notations (e.g., [[Wikipedia:Articles for deletion]], [[WP:...]])
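A minimal sketch of these two extractors, using NLTK's SentiWordNet interface (whether the sprint's code uses NLTK is an assumption) and a hypothetical subset of the Wikipedia-specific regular expressions:

    # Sketch of the two feature extractors: summed SentiWordNet polarity
    # scores (real-valued) and Wikipedia-specific regex matches (binary).
    import re
    from nltk.corpus import sentiwordnet as swn  # needs the 'sentiwordnet'
                                                 # and 'wordnet' NLTK corpora

    WIKI_PATTERNS = {  # hypothetical subset of the patterns actually used
        'afd_link': re.compile(r'\[\[Wikipedia:Articles for deletion', re.I),
        'wp_shortcut': re.compile(r'\[\[WP:[^\]|]+'),
    }

    def sentiment_features(text):
        """Sum positive/negative SentiWordNet scores over the message's words."""
        pos = neg = 0.0
        for word in re.findall(r'[a-z]+', text.lower()):
            synsets = list(swn.senti_synsets(word))
            if synsets:  # use only the first (most frequent) sense
                pos += synsets[0].pos_score()
                neg += synsets[0].neg_score()
        return {'senti_pos': pos, 'senti_neg': neg}

    def pattern_features(text):
        """One binary feature per Wikipedia-specific regular expression."""
        return {name: bool(p.search(text)) for name, p in WIKI_PATTERNS.items()}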

Software

The tools developed for this sprint, including a set of preprocessing modules, feature extraction templates, supervised classification modules (built with the help of liblinear), and evaluation scripts, are available at https://github.com/whym/wikisentiment.

Future work

The accuracy needs to improve before this classifier can be put to use. I am aware of several changes that should lead to better accuracy, and am working on them:

  • Better treatment of Wikipedia diffs (currently I just pick out the relevant HTML elements <td class="diff-addedline"> from the API output [1]; see the sketch after this list)
  • Expanding Wikipedia specific patterns
  • Using word N-grams as features
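
A minimal sketch of the current diff handling, assuming the MediaWiki action=compare endpoint and its legacy JSON response shape (whether this matches the API call used in the sprint is an assumption); the use of requests/BeautifulSoup is illustrative:

    # Sketch of extracting the added lines of a diff via the MediaWiki API's
    # action=compare endpoint; revision IDs are placeholders.
    import requests
    from bs4 import BeautifulSoup

    API = 'https://en.wikipedia.org/w/api.php'

    def added_lines(fromrev, torev):
        resp = requests.get(API, params={
            'action': 'compare',
            'fromrev': fromrev,
            'torev': torev,
            'format': 'json',
        }).json()
        html = resp['compare']['*']  # HTML <tr> rows of the diff table
        soup = BeautifulSoup(html, 'html.parser')
        # Added text sits in <td class="diff-addedline"> cells.
        return [td.get_text() for td in soup.find_all('td', class_='diff-addedline')]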

References