Research:Language-Agnostic Topic Classification/Wikidata model productionization

Created
	Aug 8, 2020
Contact
	Dibya Gautam IIITD
Mentors
	Isaac Johnson Wikimedia Foundation
	Aaron Halfaker

This project aims to productionize a Wikidata-based topic prediction model for the ORES environment. The initial proposal and descriptions for the project can be found here.

This project was undertaken as a part of the 2020 Summer Outreachy internship. I would like to thank my amazing mentors Isaac Johnson and Aaron Halfaker for their guidance and support throughout the project.

Methods

The coding part of the project was done in two parts:

Preprocess the Wikidata dump and learn word embeddings for the relevant PIDs and QIDs using Fasttext. Find my work here.
Train a supervised model using a Gradient Boosting Classifier (GBC) on article embeddings (the average of the word embeddings of all the words in that article) of some labeled Wikidata items. Find my work here.

Phase-1: Preprocessing and learning the word embeddings

The mwtext library already has a pipeline that preprocesses Wikipedia dumps and learns embeddings for the preprocessed wikitext. We had to add a new utility to this library so that it also supports Wikidata. The new utility does the following in its preprocessing step:

Filter out irrelevant Wikidata items that are in the dump. We filtered out items that belong to at least one of the following categories.
1. Wikidata items that are redirects (e.g. Q18511155)
2. Wikidata items with no sitelinks to any Wikipedia (e.g. Q47586969)
3. Wikidata items whose sitelinks point to Wikipedia pages that aren't articles, i.e. Wikipedia pages with a non-zero namespace (e.g. Q8207058)
Extract relevant information

For each Wikidata item that isn't filtered out, the utility extracts and returns a list of PIDs and QIDs corresponding to that item. See Topic_Classification_of_Wikidata_Items
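
The filtering and extraction steps can be sketched roughly as follows. This is a minimal illustration, not the mwtext utility itself: the entity layout is modeled on the Wikidata JSON format, while the `redirect` key, the `NON_WIKIPEDIA_SITES` set, the `NON_ARTICLE_PREFIXES` tuple, and the function names are assumptions made for the example. It also assumes an item is kept when at least one Wikipedia sitelink points to an article, and that each statement contributes its PID plus the QID of its value when that value is an item.

```python
# Assumed (incomplete) set of sitelink keys ending in "wiki" that are not Wikipedias.
NON_WIKIPEDIA_SITES = {"commonswiki", "specieswiki", "wikidatawiki", "metawiki"}
# Hypothetical English-only sample of non-article namespace prefixes;
# a real pipeline would consult each wiki's own namespace list.
NON_ARTICLE_PREFIXES = ("Category:", "Template:", "Wikipedia:", "Help:", "Portal:")


def wikipedia_titles(entity):
    """Return the titles of sitelinks that point to a Wikipedia edition."""
    return [
        link["title"]
        for site, link in entity.get("sitelinks", {}).items()
        if site.endswith("wiki") and site not in NON_WIKIPEDIA_SITES
    ]


def is_relevant(entity):
    """Apply the three filters: redirects, no sitelinks, non-article sitelinks."""
    if "redirect" in entity:  # assumed marker for redirect items
        return False
    titles = wikipedia_titles(entity)
    if not titles:
        return False
    # Keep the item if at least one sitelink looks like an article title.
    return any(not title.startswith(NON_ARTICLE_PREFIXES) for title in titles)


def extract_tokens(entity):
    """Return the PIDs, and the QIDs of item-valued statements, as a flat list."""
    tokens = []
    for pid, statements in entity.get("claims", {}).items():
        for statement in statements:
            tokens.append(pid)
            value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
            if isinstance(value, dict) and value.get("entity-type") == "item":
                tokens.append(value["id"])
    return tokens


# Made-up entity used only to exercise the functions.
sample = {
    "id": "Q0",
    "sitelinks": {"enwiki": {"title": "Example person"}},
    "claims": {
        "P31": [{"mainsnak": {"datavalue": {"value": {"entity-type": "item", "id": "Q5"}}}}],
        "P569": [{"mainsnak": {"datavalue": {"value": {"time": "+1950-01-01T00:00:00Z"}}}}],
    },
}

if is_relevant(sample):
    print(extract_tokens(sample))
```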

At the end of the preprocessing step, we have a set of Wikidata items with their corresponding lists of properties and values. Treating these PIDs and QIDs as words, we use Fasttext to learn an embedding for each such ID.
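
To make "treating PIDs and QIDs as words" concrete, the sketch below serializes each item as one line of space-separated IDs, which is the plain-text layout that fastText's unsupervised training reads. The function name `write_corpus` and the sample items are illustrative assumptions; the actual mwtext pipeline may organize this step differently.

```python
import io
from collections import Counter


def write_corpus(items, out):
    """Write one line per item, with its PIDs and QIDs separated by spaces."""
    written = 0
    for tokens in items:
        if tokens:  # skip items with no extracted IDs
            out.write(" ".join(tokens) + "\n")
            written += 1
    return written


# Made-up token lists standing in for preprocessed Wikidata items.
items = [["P31", "Q5", "P21", "Q6581072"], ["P31", "Q11424"], []]

buffer = io.StringIO()  # a real run would write to a file on disk
written = write_corpus(items, buffer)

# Token frequencies, i.e. how often each ID occurs as a "word" in the corpus.
vocabulary = Counter(token for tokens in items for token in tokens)
print(written, vocabulary.most_common(1))
```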

Phase-2: Training a classifier

For each item in the training dataset, a list of PIDs and QIDs is extracted. The article embedding is calculated by taking the average of the word embeddings (obtained in Phase-1) for those IDs. We then train a Gradient Boosting Classifier model with the following hyperparameters:

 n_estimators = 150
 max_depth = 5
 max_features = log2
 learning_rate = 0.1
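
The averaging step can be sketched in a few lines. This is an illustration only: `article_embedding`, the toy two-dimensional vectors, and the choice to skip IDs that have no learned embedding are assumptions, not details confirmed by the project. The hyperparameters are repeated as a dictionary whose keys match the keyword arguments of scikit-learn's `GradientBoostingClassifier`, although the project itself trains through the existing revscoring utilities.

```python
def article_embedding(tokens, vectors, dim):
    """Average the embeddings of the tokens that have a learned vector."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        # Assumption: an item with no known IDs maps to the zero vector.
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]


# Hyperparameters listed above, collected for reference.
GBC_PARAMS = {
    "n_estimators": 150,
    "max_depth": 5,
    "max_features": "log2",
    "learning_rate": 0.1,
}

# Toy embeddings; the real ones come from Phase-1.
vectors = {"P31": [1.0, 0.0], "Q5": [0.0, 1.0], "P21": [1.0, 1.0]}

# "Q999" has no vector here and is ignored in the average.
embedding = article_embedding(["P31", "Q5", "Q999"], vectors, dim=2)
print(embedding)
```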

Experiments

While all the existing models in drafttopic for Wikitext use a Gradient Boosting Classifier, the experimental API for Wikidata was trained using Fasttext. We conducted several experiments to compare the performance of these two classifiers and decide which one works best for Wikidata. All the results of the experiments can be found here.

Experiment-1: Evaluation using cross validation on training dataset

The first phase of the experiment evaluated the performance of models trained while varying the following factors:

vocabulary size: the number of most frequent words (and their embeddings) that were retained in Phase-1. Vocab sizes of 10000, 50000, and 100000 were used.
classifier: Gradient Boosting vs Fasttext.
training samples: balanced vs imbalanced.
size of training dataset: ~64000 and ~256000 items were used.

The statistics are the aggregate of the results obtained after five-fold cross validation on the training dataset.
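
For readers unfamiliar with the procedure, five-fold cross validation splits the training data into five parts and uses each part once for testing while training on the other four. The sketch below shows only the splitting; the function name `k_fold_splits`, the seed, and the round-robin fold assignment are assumptions for illustration, and how the per-fold results were aggregated in the project is not shown here.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(10, k=5))
for train, test in splits:
    print(len(train), len(test))
```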

Observations

Classes with a high population rate, e.g. Culture.Biography.Biography*, performed well with all the models and had very high precision and recall. Rare classes, e.g. STEM.Mathematics, performed poorly throughout.
Increasing the vocab size from 10k to 50k improved performance somewhat (2-3% overall) for both balanced and imbalanced datasets. Further increasing the vocab size to 100k didn't show a significant performance boost for the models trained using balanced datasets.
Models trained on a balanced dataset, with statistics scaled by the population rates of the classes, appear to perform worse than those trained on an imbalanced dataset. However, the next phase of the experiment shows that this contrast in the statistics might not be very reliable.
Fasttext models train much faster than GBC (a few seconds vs almost an hour) and seem to perform better according to the statistics obtained after five-fold cross validation.

Experiment-2: Evaluation using a separate imbalanced testing dataset

Next, we collected ~150k Wikidata items that weren't used in the training process. We then evaluated the performance of the following four models on this dataset.

Gradient Boosting model trained on a balanced dataset of ~64000 items.
Gradient Boosting model trained on an imbalanced dataset of ~64000 items.
Fasttext model trained on a balanced dataset of ~64000 items.
Fasttext model trained on an imbalanced dataset of ~64000 items.

Observations

Classifier	Trained on	recall	precision	f1	accuracy	roc_auc	pr_auc
Fasttext	63961, unbalanced dataset	(micro=0.794, macro=0.655)	(micro=0.813, macro=0.737)	(micro=0.801, macro=0.688)	(micro=0.966, macro=0.985)	(micro=0.969, macro=0.959)	(micro=0.84, macro=0.69)
Fasttext	63944, balanced dataset	(micro=0.791, macro=0.69)	(micro=0.8, macro=0.681)	(micro=0.792, macro=0.675)	(micro=0.965, macro=0.984)	(micro=0.967, macro=0.961)	(micro=0.833, macro=0.686)
Gradient Boosting	63961, unbalanced dataset	(micro=0.775, macro=0.614)	(micro=0.83, macro=0.725)	(micro=0.798, macro=0.66)	(micro=0.968, macro=0.985)	(micro=0.966, macro=0.951)	(micro=0.83, macro=0.642)
Gradient Boosting	63944, balanced dataset	(micro=0.789, macro=0.674)	(micro=0.805, macro=0.7)	(micro=0.792, macro=0.679)	(micro=0.964, macro=0.984)	(micro=0.966, macro=0.962)	(micro=0.828, macro=0.664)

All four models performed similarly, and no single model did considerably better than the others. The other factor we weighed was training time, where Fasttext was far ahead of Gradient Boosting. However, using Fasttext would have required substantial additions to the existing revscoring architecture. Since we had no strong performance-based reason to prefer Fasttext over Gradient Boosting, we decided to stick with Gradient Boosting and reuse the existing training utilities instead of updating them for Fasttext.

Final Results

Gradient Boosting classifier with the following parameters was used to train the final model:

Hyperparameters	n_estimators = 150, max_depth = 5, max_features = log2, learning_rate = 0.1
Size of training dataset	63944, balanced samples
Vocab Size	10000
Embeddings dimension	50

Statistics for Gradient Boosting model after five-fold cross validation on the balanced training dataset

Overall performance (scaled by population rates):

recall	(micro=0.719, macro=0.621)
precision	(micro=0.7, macro=0.554)
f1	(micro=0.703, macro=0.571)
accuracy	(micro=0.978, macro=0.99)
roc_auc	(micro=0.956, macro=0.951)
pr_auc	(micro=0.721, macro=0.543)

Label wise statistics (scaled by population rates):

topic	n	TP	FP	FN	TN	recall	precision	f1	pr_auc
Culture.Biography.Biography*	16670	15762	464	908	46810	0.946	0.929	0.937	0.952
Culture.Biography.Women	4110	3125	679	985	59155	0.76	0.504	0.606	0.589
Culture.Food and drink	1318	613	126	705	62500	0.465	0.371	0.413	0.352
Culture.Internet culture	2966	1948	140	1018	60838	0.657	0.517	0.578	0.549
Culture.Linguistics	1466	934	56	532	62422	0.637	0.852	0.729	0.656
Culture.Literature	5367	3996	404	1371	58173	0.745	0.619	0.676	0.726
Culture.Media.Books	1974	1560	136	414	61834	0.79	0.609	0.688	0.659
Culture.Media.Entertainment	1733	857	162	876	62049	0.495	0.429	0.459	0.433
Culture.Media.Films	2295	1896	122	399	61527	0.826	0.829	0.828	0.813
Culture.Media.Media*	14383	11572	1135	2811	48426	0.805	0.67	0.731	0.813
Culture.Media.Music	2583	2027	247	556	61114	0.785	0.807	0.795	0.818
Culture.Media.Radio	1156	857	44	299	62744	0.741	0.711	0.726	0.741
Culture.Media.Software	1750	685	307	1065	61887	0.391	0.094	0.152	0.094
Culture.Media.Television	2230	1510	176	720	61538	0.677	0.68	0.678	0.664
Culture.Media.Video games	2147	1758	54	389	61743	0.819	0.732	0.773	0.801
Culture.Performing arts	1334	741	116	593	62494	0.555	0.478	0.514	0.414
Culture.Philosophy and religion	2702	1074	285	1628	60957	0.397	0.472	0.431	0.339
Culture.Sports	5925	5186	249	739	57770	0.875	0.929	0.901	0.933
Culture.Visual arts.Architecture	2648	1867	230	781	61066	0.705	0.672	0.688	0.673
Culture.Visual arts.Comics and Anime	1508	1007	140	501	62296	0.668	0.416	0.513	0.558
Culture.Visual arts.Fashion	1199	669	98	530	62647	0.558	0.242	0.338	0.215
Culture.Visual arts.Visual arts*	6070	4131	554	1939	57320	0.681	0.566	0.618	0.666
Geography.Geographical	3464	2226	359	1238	60121	0.643	0.7	0.67	0.698
Geography.Regions.Africa.Africa*	6449	4664	414	1785	57081	0.723	0.462	0.564	0.639
Geography.Regions.Africa.Central Africa	1145	697	83	448	62716	0.609	0.244	0.349	0.321
Geography.Regions.Africa.Eastern Africa	1114	704	56	410	62774	0.632	0.262	0.371	0.253
Geography.Regions.Africa.Northern Africa	1280	774	108	506	62556	0.605	0.321	0.42	0.343
Geography.Regions.Africa.Southern Africa	1244	859	81	385	62619	0.691	0.411	0.515	0.514
Geography.Regions.Africa.Western Africa	1142	774	75	368	62727	0.678	0.297	0.413	0.277
Geography.Regions.Americas.Central America	1331	707	87	624	62526	0.531	0.569	0.55	0.494
Geography.Regions.Americas.North America	7625	5064	1169	2561	55150	0.664	0.682	0.673	0.726
Geography.Regions.Americas.South America	1532	1082	142	450	62270	0.706	0.681	0.693	0.691
Geography.Regions.Asia.Asia*	11647	8432	835	3215	51462	0.724	0.715	0.719	0.756
Geography.Regions.Asia.Central Asia	1086	671	70	415	62788	0.618	0.306	0.41	0.462
Geography.Regions.Asia.East Asia	2717	1727	241	990	60986	0.636	0.665	0.65	0.625
Geography.Regions.Asia.North Asia	2076	1336	163	740	61705	0.644	0.579	0.609	0.55
Geography.Regions.Asia.South Asia	2366	1612	135	754	61443	0.681	0.839	0.752	0.708
Geography.Regions.Asia.Southeast Asia	1721	1059	119	662	62104	0.615	0.668	0.641	0.557
Geography.Regions.Asia.West Asia	2160	1473	129	687	61655	0.682	0.794	0.734	0.662
Geography.Regions.Europe.Eastern Europe	3533	2472	234	1061	60177	0.7	0.771	0.733	0.71
Geography.Regions.Europe.Europe*	12939	9372	1810	3567	49195	0.724	0.642	0.681	0.744
Geography.Regions.Europe.Northern Europe	4221	2571	601	1650	59122	0.609	0.643	0.626	0.644
Geography.Regions.Europe.Southern Europe	2438	1565	268	873	61238	0.642	0.673	0.657	0.618
Geography.Regions.Europe.Western Europe	3076	1934	417	1142	60451	0.629	0.657	0.643	0.63
Geography.Regions.Oceania	2638	1859	138	779	61168	0.705	0.839	0.766	0.75
History and Society.Business and economics	3502	1544	569	1958	59873	0.441	0.315	0.367	0.248
History and Society.Education	2243	1113	255	1130	61446	0.496	0.489	0.493	0.4
History and Society.History	3172	1154	360	2018	60412	0.364	0.403	0.382	0.315
History and Society.Military and warfare	3238	1677	296	1561	60410	0.518	0.622	0.565	0.521
History and Society.Politics and government	4590	2406	329	2184	59025	0.524	0.731	0.611	0.603
History and Society.Society	2971	897	166	2074	60807	0.302	0.48	0.371	0.318
History and Society.Transportation	3629	2615	169	1014	60146	0.721	0.809	0.762	0.712
STEM.Biology	2916	2237	91	679	60937	0.767	0.948	0.848	0.816
STEM.Chemistry	1270	690	138	580	62536	0.543	0.294	0.382	0.27
STEM.Computing	1968	828	332	1140	61644	0.421	0.182	0.254	0.149
STEM.Earth and environment	1627	918	114	709	62203	0.564	0.594	0.579	0.522
STEM.Engineering	2195	1284	141	911	61608	0.585	0.596	0.591	0.51
STEM.Libraries & Information	1174	605	87	569	62683	0.515	0.203	0.291	0.238
STEM.Mathematics	1137	307	107	830	62700	0.27	0.068	0.109	0.125
STEM.Medicine & Health	1726	769	180	957	62038	0.446	0.499	0.471	0.398
STEM.Physics	1219	448	107	771	62618	0.368	0.168	0.23	0.126
STEM.STEM*	16449	12609	2766	3840	44729	0.767	0.477	0.588	0.768
STEM.Space	1365	932	47	433	62532	0.683	0.795	0.735	0.686
STEM.Technology	3648	1396	424	2252	59872	0.383	0.219	0.279	0.213

Statistics for Gradient Boosting model on an imbalanced testing dataset

Overall performance:

recall	(micro=0.789, macro=0.674)
precision	(micro=0.805, macro=0.7)
f1	(micro=0.792, macro=0.679)
accuracy	(micro=0.964, macro=0.984)
roc_auc	(micro=0.966, macro=0.962)
pr_auc	(micro=0.828, macro=0.664)

Label wise statistics:

topic	n	TP	FP	FN	TN	recall	precision	f1	pr_auc
Culture.Biography.Biography*	47623	46074	1566	1549	100684	0.967	0.967	0.967	0.983
Culture.Biography.Women	5861	4753	2808	1108	141204	0.811	0.629	0.708	0.73
Culture.Food and drink	925	451	255	474	148693	0.488	0.639	0.553	0.462
Culture.Internet culture	1214	853	193	361	148466	0.703	0.815	0.755	0.565
Culture.Linguistics	2638	1821	162	817	147073	0.69	0.918	0.788	0.772
Culture.Literature	4860	3618	1017	1242	143996	0.744	0.781	0.762	0.817
Culture.Media.Books	1557	1309	229	248	148087	0.841	0.851	0.846	0.817
Culture.Media.Entertainment	1373	669	336	704	148164	0.487	0.666	0.563	0.542
Culture.Media.Films	4515	4025	249	490	145109	0.891	0.942	0.916	0.895
Culture.Media.Media*	19238	16841	3654	2397	126981	0.875	0.822	0.848	0.917
Culture.Media.Music	7302	6284	1378	1018	141193	0.861	0.82	0.84	0.888
Culture.Media.Radio	801	639	122	162	148950	0.798	0.84	0.818	0.822
Culture.Media.Software	412	163	308	249	149153	0.396	0.346	0.369	0.297
Culture.Media.Television	2777	1967	423	810	146673	0.708	0.823	0.761	0.785
Culture.Media.Video games	934	786	86	148	148853	0.842	0.901	0.87	0.862
Culture.Performing arts	1041	602	427	439	148405	0.578	0.585	0.582	0.594
Culture.Philosophy and religion	3837	1826	917	2011	145119	0.476	0.666	0.555	0.518
Culture.Sports	23849	21760	1540	2089	124484	0.912	0.934	0.923	0.953
Culture.Visual arts.Architecture	4261	3223	952	1038	144660	0.756	0.772	0.764	0.729
Culture.Visual arts.Comics and Anime	801	540	403	261	148669	0.674	0.573	0.619	0.628
Culture.Visual arts.Fashion	299	181	243	118	149331	0.605	0.427	0.501	0.349
Culture.Visual arts.Visual arts*	6876	5181	2255	1695	140742	0.753	0.697	0.724	0.78
Geography.Geographical	8125	5746	1673	2379	140075	0.707	0.774	0.739	0.792
Geography.Regions.Africa.Africa*	3154	2318	1530	836	145189	0.735	0.602	0.662	0.672
Geography.Regions.Africa.Central Africa	261	176	256	85	149356	0.674	0.407	0.508	0.499
Geography.Regions.Africa.Eastern Africa	183	129	149	54	149541	0.705	0.464	0.56	0.432
Geography.Regions.Africa.Northern Africa	487	325	266	162	149120	0.667	0.55	0.603	0.531
Geography.Regions.Africa.Southern Africa	498	352	333	146	149042	0.707	0.514	0.595	0.528
Geography.Regions.Africa.Western Africa	250	180	254	70	149369	0.72	0.415	0.526	0.365
Geography.Regions.Americas.Central America	1329	799	401	530	148143	0.601	0.666	0.632	0.583
Geography.Regions.Americas.North America	23096	16939	4747	6157	122030	0.733	0.781	0.757	0.835
Geography.Regions.Americas.South America	2712	2128	807	584	146354	0.785	0.725	0.754	0.77
Geography.Regions.Asia.Asia*	20162	15600	3425	4562	126286	0.774	0.82	0.796	0.849
Geography.Regions.Asia.Central Asia	274	173	207	101	149392	0.631	0.455	0.529	0.505
Geography.Regions.Asia.East Asia	4626	3321	938	1305	144309	0.718	0.78	0.748	0.715
Geography.Regions.Asia.North Asia	2128	1496	535	632	147210	0.703	0.737	0.719	0.705
Geography.Regions.Asia.South Asia	6412	4831	599	1581	142862	0.753	0.89	0.816	0.824
Geography.Regions.Asia.Southeast Asia	2353	1523	452	830	147068	0.647	0.771	0.704	0.661
Geography.Regions.Asia.West Asia	4534	3390	610	1144	144729	0.748	0.848	0.794	0.801
Geography.Regions.Europe.Eastern Europe	7013	5654	1049	1359	141811	0.806	0.844	0.824	0.814
Geography.Regions.Europe.Europe*	31173	25152	8462	6021	110238	0.807	0.748	0.776	0.852
Geography.Regions.Europe.Northern Europe	11246	7877	2617	3369	136010	0.7	0.751	0.725	0.797
Geography.Regions.Europe.Southern Europe	5424	3935	1431	1489	143018	0.725	0.733	0.729	0.774
Geography.Regions.Europe.Western Europe	7773	5941	1958	1832	140142	0.764	0.752	0.758	0.785
Geography.Regions.Oceania	6396	4901	610	1495	142867	0.766	0.889	0.823	0.841
History and Society.Business and economics	3785	1753	1144	2032	144944	0.463	0.605	0.525	0.479
History and Society.Education	3007	1790	932	1217	145934	0.595	0.658	0.625	0.594
History and Society.History	4139	1690	1042	2449	144692	0.408	0.619	0.492	0.49
History and Society.Military and warfare	5523	3219	922	2304	143428	0.583	0.777	0.666	0.69
History and Society.Politics and government	10973	6959	1247	4014	137653	0.634	0.848	0.726	0.767
History and Society.Society	3107	1011	355	2096	146411	0.325	0.74	0.452	0.445
History and Society.Transportation	5836	4529	567	1307	143470	0.776	0.889	0.829	0.843
STEM.Biology	13125	11974	300	1151	136448	0.912	0.976	0.943	0.949
STEM.Chemistry	511	269	312	242	149050	0.526	0.463	0.493	0.457
STEM.Computing	999	439	247	560	148627	0.439	0.64	0.521	0.452
STEM.Earth and environment	1745	1175	288	570	147840	0.673	0.803	0.733	0.683
STEM.Engineering	2164	1409	383	755	147326	0.651	0.786	0.712	0.66
STEM.Libraries & Information	207	108	160	99	149506	0.522	0.403	0.455	0.337
STEM.Mathematics	140	34	143	106	149590	0.243	0.192	0.215	0.104
STEM.Medicine & Health	1988	999	542	989	147343	0.503	0.648	0.566	0.518
STEM.Physics	326	128	226	198	149321	0.393	0.362	0.376	0.253
STEM.STEM*	22642	20834	12464	1808	114767	0.92	0.626	0.745	0.924
STEM.Space	853	639	72	214	148948	0.749	0.899	0.817	0.81
STEM.Technology	1753	624	529	1129	147591	0.356	0.541	0.429	0.401
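
As a cross-check of the label-wise table above, the recall, precision, and f1 columns follow from the TP, FP, and FN counts. The sketch below recomputes them for three rows copied from that table and shows the usual definitions of micro (pooled counts) and macro (unweighted mean over labels) averaging. It is an assumption that the overall figures reported above use exactly these definitions, and the aggregates here cover only the three sample rows, not the full label set.

```python
# (TP, FP, FN) counts copied from three rows of the table above.
rows = {
    "Culture.Biography.Biography*": (46074, 1566, 1549),
    "Culture.Sports": (21760, 1540, 2089),
    "STEM.Mathematics": (34, 143, 106),
}


def prf(tp, fp, fn):
    """Return (recall, precision, f1) for one label."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


per_label = {topic: prf(*counts) for topic, counts in rows.items()}

# Macro average: every label counts equally, however rare it is.
macro_recall = sum(recall for recall, _, _ in per_label.values()) / len(per_label)

# Micro average: pool the counts first, so frequent labels dominate.
total_tp = sum(tp for tp, _, _ in rows.values())
total_fn = sum(fn for _, _, fn in rows.values())
micro_recall = total_tp / (total_tp + total_fn)

for topic, (recall, precision, f1) in per_label.items():
    print(topic, round(recall, 3), round(precision, 3), round(f1, 3))
```

Because rare labels such as STEM.Mathematics weigh as much as large ones in the macro average, macro figures sit below micro figures whenever rare labels perform poorly, which matches the pattern in the overall statistics.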

Although the statistics obtained from the balanced dataset don't look very good, the imbalanced dataset is a closer representation of the distribution of the actual data. The model performs considerably better on the imbalanced data, and this was taken into account when judging the model that will finally be used in production.