To check that positive examples (sentences with citations) and negative examples (sentences without citations) are separable, we run some preliminary feasibility tests.
Feasibility analysis: Automatically labeled data
We first test the feasibility of the framework, i.e., the separability of sentences with and without citations in the feature space, using the raw automatically labeled data. As training data, we use the sentences from featured biographies. We consider sentences with an inline citation as positives, and sentences without a citation as negatives.
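The automatic labeling rule above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: it assumes that sentences are available as wikitext and that an inline citation shows up as a `<ref>` tag. The function name `label_sentence` and the example sentences are hypothetical.

```python
import re

# Assumption: an inline citation appears as a <ref ...> tag in the wikitext.
# The character class after "ref" avoids matching the <references /> tag.
REF_TAG = re.compile(r"<ref[\s>/]", re.IGNORECASE)


def label_sentence(sentence: str) -> int:
    """Return 1 (positive) if the sentence has an inline citation, else 0."""
    return 1 if REF_TAG.search(sentence) else 0


# Hypothetical sentences, for illustration only.
sentences = [
    "She was born in Vienna in 1901.<ref>Smith 2004, p. 12</ref>",
    "Her early work attracted little attention.",
    'He won the prize in 1950.<ref name="prize" />',
]
labels = [label_sentence(s) for s in sentences]
```

Labels produced this way are raw: a sentence without a `<ref>` tag may still be supported by a citation elsewhere in the paragraph, which is one reason the separability test is only a first feasibility check.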
Featured Biography Article Data
We created a training set with all 7,692 negatives and an equal number of positives. The cross-validation results for existing sentences in the data are shown below.
All Featured Article Data
To assess the generalisability of the previous methodology, we also test with data from all featured articles (73,280 negatives and an equal number of positives).
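The balanced set construction and the cross-validation splits used in both experiments can be sketched as below. This is a sketch under stated assumptions: the text does not say how the positives were chosen or how many folds were used, so random sampling without replacement and `k=5` are assumptions, and the names `balanced_training_set` and `k_fold_splits` are hypothetical.

```python
import random


def balanced_training_set(positives, negatives, seed=0):
    """Keep all negatives and draw an equal number of positives.

    Assumption: positives are drawn uniformly at random without replacement.
    """
    if len(positives) < len(negatives):
        raise ValueError("need at least as many positives as negatives")
    rng = random.Random(seed)
    sampled = rng.sample(positives, len(negatives))
    data = [(s, 1) for s in sampled] + [(s, 0) for s in negatives]
    rng.shuffle(data)  # mix the classes before splitting into folds
    return data


def k_fold_splits(n_items, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


# Toy data standing in for the real sentences.
toy_positives = [f"cited sentence {i}" for i in range(10)]
toy_negatives = [f"uncited sentence {i}" for i in range(4)]
toy_data = balanced_training_set(toy_positives, toy_negatives)
toy_folds = list(k_fold_splits(len(toy_data), k=4))
```

With the real data, the same procedure would give 2 × 7,692 training sentences for featured biographies and 2 × 73,280 for all featured articles.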