Research:Language-Agnostic Topic Classification/Wikidata model productionization

From Meta, a Wikimedia project coordination wiki
Aug 8, 2020
Aaron Halfaker

This project aims to productionize a Wikidata-based topic prediction model for the ORES environment. The initial proposal and descriptions for the project can be found here.

This project was undertaken as a part of the 2020 Summer Outreachy internship. I would like to thank my amazing mentors Isaac Johnson and Aaron Halfaker for their guidance and support throughout the project.


The coding part of the project was done in two parts:

  1. Preprocess the Wikidata dump and learn word embeddings for the relevant PIDs and QIDs using Fasttext. Find my work here.
  2. Train a supervised model using a Gradient Boosting Classifier (GBC) on the article embeddings (the average of the word embeddings of all the words in that article) of some labeled Wikidata items. Find my work here.

Phase-1: Preprocessing and learning the word embeddings

The mwtext library already has a pipeline that preprocesses the Wikipedia dumps and learns embeddings for the preprocessed wikitext. We had to add a new utility to this library so that it supports Wikidata. The new utility does the following in its preprocessing step:

  1. Filter out irrelevant Wikidata items that are in the dump. We filtered out items that belong to at least one of the following categories:
    1. Wikidata items that are redirects (e.g. Q18511155)
    2. Wikidata items with no sitelinks to any Wikipedia (e.g. Q47586969)
    3. Wikidata items whose sitelinks point to Wikipedia pages that aren't articles, i.e. Wikipedia pages with a non-zero namespace (e.g. Q8207058)
  2. Extract relevant information. For each Wikidata item that isn't filtered out, the utility extracts and returns a list of the PIDs and QIDs corresponding to that item. See Topic_Classification_of_Wikidata_Items.
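The two preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the mwtext utility itself. The `is_redirect` flag and the list-shaped `sitelinks` with a `namespace` field are hypothetical simplifications introduced for this example, and the `claims` layout follows the general shape of Wikidata's JSON but is trimmed down. The sketch also assumes that an item is dropped only when none of its sitelinks points to a namespace-0 page, and that each PID is emitted once followed by its QID values; the real utility may differ on both points.

```python
def is_relevant(item):
    """Return True if the item survives the three filters (simplified)."""
    if item.get("is_redirect", False):        # filter 1: redirects
        return False
    sitelinks = item.get("sitelinks", [])     # hypothetical list of dicts
    if not sitelinks:                         # filter 2: no Wikipedia sitelinks
        return False
    # filter 3 (assumed rule): keep the item only if at least one sitelink
    # points to an article, i.e. a page in namespace 0
    return any(link.get("namespace") == 0 for link in sitelinks)


def extract_tokens(item):
    """Return the PIDs of an item, each followed by its QID values."""
    tokens = []
    for pid, statements in item.get("claims", {}).items():
        tokens.append(pid)
        for statement in statements:
            value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
            # only values that are themselves Wikidata entities carry an "id"
            if isinstance(value, dict) and "id" in value:
                tokens.append(value["id"])
    return tokens


# Hypothetical item used only for illustration
example_item = {
    "is_redirect": False,
    "sitelinks": [{"site": "enwiki", "namespace": 0}],
    "claims": {
        "P31": [{"mainsnak": {"datavalue": {"value": {"id": "Q5"}}}}],
        "P569": [{"mainsnak": {"datavalue": {"value": {"time": "+1952-03-11T00:00:00Z"}}}}],
    },
}

if is_relevant(example_item):
    print(" ".join(extract_tokens(example_item)))
```

Joining the tokens with spaces gives one "sentence" per item, which is the kind of plain-text input a word-embedding tool can consume.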

At the end of the preprocessing step, we have a set of Wikidata items, each with its corresponding list of properties and values. Treating these PIDs and QIDs as words, we use Fasttext to learn an embedding for each such ID.

Phase-2: Training a classifier

For each item in the training dataset, a list of PIDs and QIDs is extracted. The article embedding is calculated by taking the average of the word embeddings (obtained in Phase-1) for those IDs. We then train a Gradient Boosting Classifier model with the following hyperparameters:

 n_estimators = 150
 max_depth = 5
 max_features = log2
 learning_rate = 0.1 
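The article-embedding step that feeds this classifier can be sketched as a plain element-wise average. This is a minimal illustration under stated assumptions: the function name `article_embedding` is hypothetical, embeddings are represented as plain Python lists, and IDs that have no learned embedding are simply skipped (the actual pipeline may treat them differently).

```python
def article_embedding(tokens, embeddings, dim):
    """Average the embeddings of all known tokens (PIDs and QIDs)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # assumption: an item with no known IDs maps to the zero vector
        return [0.0] * dim
    # zip(*vectors) groups the i-th component of every vector together
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Hypothetical 2-dimensional embeddings, for illustration only
toy_embeddings = {"P31": [1.0, 0.0], "Q5": [0.0, 1.0]}
vector = article_embedding(["P31", "Q5", "P999"], toy_embeddings, dim=2)
print(vector)
```

The resulting fixed-length vector is what the classifier sees as the feature representation of one item, regardless of how many PIDs and QIDs the item has.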


While all the existing models in drafttopic for Wikitext use Gradient Boosting Classifier, the experimental API for Wikidata was trained using Fasttext. We conducted several experiments to compare the performance of these two classifiers and decide which one works best for Wikidata. All the results of the experiments can be found here.

Experiment-1: Evaluation using cross validation on training dataset

The first phase of the experiment evaluated the performance of models that were trained by varying different factors:

  • vocabulary size: the number of most frequent words (and their embeddings) that were retained in Phase-1. Vocab sizes of 10000, 50000, and 100000 were used.
  • classifier: Gradient Boosting vs Fasttext.
  • training samples: balanced vs imbalanced.
  • size of training dataset: ~64000 and ~256000 items were used.

The statistics are the aggregate of the results obtained from five-fold cross validation on the training dataset.
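To make the vocabulary-size factor concrete, the sketch below keeps only the k most frequent IDs across all items. It is an illustration only: the helper name `top_k_vocab` and the toy data are hypothetical, and the actual pipeline may apply the cutoff inside the embedding tooling rather than in a separate step.

```python
from collections import Counter


def top_k_vocab(token_lists, k):
    """Return the set of the k most frequent tokens across all items."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {token for token, _ in counts.most_common(k)}


# Hypothetical token lists for three items
toy_items = [["P31", "Q5"], ["P31", "Q5", "P21"], ["P31"]]
vocab = top_k_vocab(toy_items, k=2)
print(sorted(vocab))
```

IDs that fall outside the retained vocabulary have no embedding, so a smaller vocab size means that rarer properties and values contribute nothing to an item's representation.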


  1. Classes with a high population rate, e.g. Culture.Biography.Biography*, performed well with all the models and had very high precision and recall. Rare classes, e.g. STEM.Mathematics, performed poorly throughout.
  2. Increasing the vocab size from 10k to 50k improved performance somewhat (2-3% overall) for both balanced and imbalanced datasets. Further increasing the vocab size to 100k didn't show a significant performance boost for the models trained on balanced datasets.
  3. The performance of models trained on a balanced dataset, with statistics scaled by the population rates of the classes, seems poor compared to that of models trained on an imbalanced dataset. But the next phase of the experiment shows that this contrast in the statistics might not be very reliable.
  4. Fasttext models train much faster than GBC (a few seconds vs almost an hour) and seem to perform better, according to the statistics obtained from five-fold cross validation.

Experiment-2: Evaluation using a separate imbalanced testing dataset

Next, we collected ~150k Wikidata items that weren't used in the training process. We then evaluated the performance of the following four models on this dataset.

  • Gradient Boosting model trained on a balanced dataset of ~64000 items.
  • Gradient Boosting model trained on an imbalanced dataset of ~64000 items.
  • Fasttext model trained on a balanced dataset of ~64000 items.
  • Fasttext model trained on an imbalanced dataset of ~64000 items.


Each metric is reported as micro / macro.

 Classifier         Trained on           recall         precision      f1             accuracy       roc_auc        pr_auc
 Fasttext           63961, unbalanced    0.794 / 0.655  0.813 / 0.737  0.801 / 0.688  0.966 / 0.985  0.969 / 0.959  0.84 / 0.69
 Fasttext           63944, balanced      0.791 / 0.69   0.8 / 0.681    0.792 / 0.675  0.965 / 0.984  0.967 / 0.961  0.833 / 0.686
 Gradient Boosting  63961, unbalanced    0.775 / 0.614  0.83 / 0.725   0.798 / 0.66   0.968 / 0.985  0.966 / 0.951  0.83 / 0.642
 Gradient Boosting  63944, balanced      0.789 / 0.674  0.805 / 0.7    0.792 / 0.679  0.964 / 0.984  0.966 / 0.962  0.828 / 0.664

All four models performed similarly, and no single model did considerably better than the others. The other factor we weighed was training time, where Fasttext was far ahead of Gradient Boosting. However, using Fasttext would mean a lot of additions to the existing revscoring architecture. Since we didn't have a strong performance reason to prefer Fasttext over Gradient Boosting, we decided to stick with Gradient Boosting and make use of the existing training utilities instead of updating them for Fasttext.
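The micro and macro figures reported above differ in how per-label results are combined. The sketch below uses the standard definitions for recall: micro pools the counts of all labels, while macro averages the per-label recalls with equal weight. The function name and the counts are hypothetical, chosen only to show the effect, and the exact aggregation used by the evaluation tooling may differ in detail.

```python
def micro_macro_recall(counts):
    """counts maps label -> (true_positives, false_negatives)."""
    total_tp = sum(tp for tp, fn in counts.values())
    total_fn = sum(fn for tp, fn in counts.values())
    # micro: pool all labels, so frequent labels dominate
    micro = total_tp / (total_tp + total_fn)
    # macro: unweighted mean of per-label recall, so rare labels count equally
    macro = sum(tp / (tp + fn) for tp, fn in counts.values()) / len(counts)
    return micro, macro


# Hypothetical counts: one frequent label with good recall, one rare label
toy_counts = {
    "Culture.Biography.Biography*": (900, 100),
    "STEM.Mathematics": (5, 5),
}
micro, macro = micro_macro_recall(toy_counts)
print(round(micro, 3), round(macro, 3))
```

With counts like these, the macro value sits well below the micro value because the rare label pulls the unweighted mean down, which matches the general pattern in the recall, precision, f1 and pr_auc columns.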

Final Results

Gradient Boosting classifier with the following parameters was used to train the final model:

Hyperparameters:
  • n_estimators = 150
  • max_depth = 5
  • max_features = log2
  • learning_rate = 0.1
Size of training dataset: 63944, balanced samples
Vocab size: 10000
Embeddings dimension: 50

Statistics for Gradient Boosting model after five-fold cross validation on the balanced training dataset

Overall performance (scaled by population rates):

recall (micro=0.719, macro=0.621)
precision (micro=0.7, macro=0.554)
f1 (micro=0.703, macro=0.571)
accuracy (micro=0.978, macro=0.99)
roc_auc (micro=0.956, macro=0.951)
pr_auc (micro=0.721, macro=0.543)
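As a rough illustration of what scaling by population rates does to a statistic measured on a balanced sample, the sketch below reweights the confusion counts of a single label so that positives and negatives appear in their real-world proportions. This is one common way to do such a correction and is an assumption on my part, not necessarily the exact computation used to produce the numbers above. The function name and the counts are hypothetical.

```python
def scaled_precision(tp, fp, fn, tn, population_rate):
    """Precision for one label after reweighting to its population rate."""
    # share of positives in the (possibly balanced) evaluation sample
    sample_rate = (tp + fn) / (tp + fp + fn + tn)
    # weights that restore the real-world share of positives and negatives
    pos_weight = population_rate / sample_rate
    neg_weight = (1 - population_rate) / (1 - sample_rate)
    return (tp * pos_weight) / (tp * pos_weight + fp * neg_weight)


# Hypothetical balanced sample: 100 positives and 100 negatives for one label
raw = 80 / (80 + 10)
scaled = scaled_precision(tp=80, fp=10, fn=20, tn=90, population_rate=0.01)
print(round(raw, 3), round(scaled, 3))
```

For a label that is rare in the real data, every false positive in the balanced sample stands in for many real ones, so the scaled precision can fall far below the raw figure even when the classifier separates the sample well.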

Label wise statistics (scaled by population rates):

Statistics for Gradient Boosting model on an imbalanced testing dataset

Overall performance:

recall (micro=0.789, macro=0.674)
precision (micro=0.805, macro=0.7)
f1 (micro=0.792, macro=0.679)
accuracy (micro=0.964, macro=0.984)
roc_auc (micro=0.966, macro=0.962)
pr_auc (micro=0.828, macro=0.664)

Label wise statistics:

Although the statistics obtained from the balanced dataset don't look very good, the imbalanced dataset is a closer representation of the distribution of the actual data. The model's performance on the imbalanced data is considerably better, and it was taken into account when judging the model that will finally be used in production.