Research:Identification of Unsourced Statements/Feasibility Analysis

From Meta, a Wikimedia project coordination wiki

To check that the spaces of positive examples (sentences with citations) and negative examples (sentences without citations) are separable, we run some preliminary feasibility tests.

Feasibility analysis: Automatically labeled data

We first test the feasibility of the framework, i.e. the separability of sentences with and without citations in the feature space, using the raw automatically labeled data. As training data, we use the sentences from featured biographies. We consider sentences with an inline citation as positives and sentences without a citation as negatives.
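The automatic labeling rule described above can be sketched as follows. This is a minimal illustration, not the project's actual extraction code: it assumes sentences are available as raw wikitext and that an inline citation appears as a `<ref>` tag. The function name `label_sentence` and the two example sentences are made up for illustration.

```python
import re

# Assumption: an inline citation shows up in the wikitext as a <ref> tag.
REF_PATTERN = re.compile(r"<ref\b", re.IGNORECASE)
# Strips self-closing <ref ... /> tags and paired <ref ...>...</ref> tags.
REF_STRIP = re.compile(
    r"<ref\b[^>]*/>|<ref\b[^>]*>.*?</ref>", re.IGNORECASE | re.DOTALL
)


def label_sentence(sentence):
    """Return (clean_text, label): 1 = has an inline citation, 0 = has none."""
    label = 1 if REF_PATTERN.search(sentence) else 0
    # Remove the citation markup so a classifier cannot simply detect the tag.
    clean = REF_STRIP.sub("", sentence).strip()
    return clean, label


# Made-up example sentences, for illustration only.
examples = [
    'She was born in Vienna in 1878.<ref name="bio">Smith 2001, p. 4.</ref>',
    "Her early work attracted little attention.",
]
labeled = [label_sentence(s) for s in examples]
```

Stripping the citation markup before feature extraction matters here: otherwise the label would leak into the input text.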

Featured Biography Article Data

We created a training set with all 7,692 negatives and an equal number of positives. The results below are from cross-validation on the existing sentences in the data.
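A balanced training set and a cross-validation split of this kind could be built roughly as shown here. This is a sketch under stated assumptions: positives are assumed to outnumber negatives, positives are sampled uniformly at random, and the number of folds (`k=5`) is a placeholder because the page does not state it. The helper names `balance` and `k_fold_indices` and the toy data are hypothetical.

```python
import random


def balance(positives, negatives, seed=0):
    """Keep every negative and sample an equal number of positives."""
    rng = random.Random(seed)
    # Assumption: there are at least as many positives as negatives.
    sampled = rng.sample(positives, len(negatives))
    data = [(s, 1) for s in sampled] + [(s, 0) for s in negatives]
    rng.shuffle(data)  # mix classes so contiguous folds contain both labels
    return data


def k_fold_indices(n, k=5):
    """Yield (train_idx, test_idx) for k contiguous, roughly equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


# Toy placeholders standing in for real sentences.
positives = [f"pos-{i}" for i in range(50)]
negatives = [f"neg-{i}" for i in range(20)]
data = balance(positives, negatives)
folds = list(k_fold_indices(len(data), k=5))
```

Each sentence lands in exactly one test fold, so every example is scored once by a model that never saw it during training.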

All Featured Article Data

To assess the generalisability of the previous methodology, we also test with data from all featured articles (73,280 negatives and an equal number of positives).

Results show that a system using word vectors and random forests can detect sentences needing citations with around 75% accuracy.

[Figure: Accuracy of citation need detection on automatically labeled data]
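For readers who want to see the general shape of a word-vector plus random-forest classifier, here is a minimal sketch using NumPy and scikit-learn. Several things in it are assumptions rather than details from this page: sentences are represented by averaging their word vectors, the embedding table is a random toy stand-in for pretrained vectors, and the forest size and fold count are arbitrary. The sentences and labels are random, so the resulting scores say nothing about the 75% figure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
DIM = 16

# Hypothetical toy embedding table; a real setup would load pretrained vectors.
vocab = ["born", "award", "claimed", "reported", "city", "famous", "said", "year"]
embeddings = {w: rng.normal(size=DIM) for w in vocab}


def sentence_vector(tokens):
    """Average the vectors of known tokens (an assumed sentence representation)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)


# Random toy sentences and alternating labels, only to exercise the pipeline.
sentences = [rng.choice(vocab, size=5).tolist() for _ in range(60)]
labels = np.array([i % 2 for i in range(60)])

X = np.vstack([sentence_vector(s) for s in sentences])
clf = RandomForestClassifier(n_estimators=50, random_state=0)
# Stratified 5-fold cross-validation, reporting accuracy per fold.
scores = cross_val_score(clf, X, labels, cv=5, scoring="accuracy")
```

With real data, `X` would hold one row per labeled sentence and `scores.mean()` would give the cross-validated accuracy.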