Research talk:Characterizing Wikipedia Reader Behaviour/Demographics and Wikipedia use cases/Work log/2019-09-11
Thursday, September 12, 2019
This work log documents my current progress at expanding the ORES drafttopic model, which predicts topics for a given English Wikipedia article, to other languages on Wikipedia. The goal is to map any given Wikipedia article to one or more human-interpretable labels that identify what high-level topics relate to that article. These topics can then be used to understand reader behavior by mapping page views from millions of articles to a much smaller set of topics. In particular, for the debiasing and analysis of the reader demographic surveys, I have well over one million unique article page views across more than one hundred languages that need to be evaluated.
Example
If someone were to read the article for the Storm King Art Center, an outdoor sculpture garden in the Hudson Valley, NY, USA, we would want to map this page view to topics such as Culture (it is an outdoor art museum) and Geography (it has a physical location). Furthermore, we may want more fine-grained labels within Culture, such as Art, and within Geography, such as North America. One approach to doing this is through WikiProjects -- there exist many hundreds of WikiProjects on English Wikipedia, each associated with a specific topic, and each has added its template to the articles it believes to be important to that topic. In this case, the talk page for Storm King has been tagged with templates from WikiProject Museums, WikiProject Visual arts, WikiProject Hudson Valley, and WikiProject Public Art. Based on the WikiProject Directory, a mapping can be built between these specific WikiProjects and the higher-level categories they belong to -- in this case: "Culture.Arts", "Culture.Plastic arts", "Culture.Visual arts", and "Geography.Americas".
In practice, mapping an article to the WikiProjects that have tagged it, and then to a list of topics, is not straightforward. In some cases, the Directory is not well-formed and WikiProjects can be inadvertently left out (e.g., WikiProject Europe and several others appear outside of the sections for Geography.Europe) or assigned to odd categories (e.g., a broken link for Cities of the United States can assign all WikiProjects under Geography.Americas to Geography.Cities instead). A given WikiProject may also use several different templates. For WikiProject Public Art, the template used with Storm King is actually WikiProject Wikipedia Saves Public Art rather than Template:WikiProject Public Art. And finally, while most English Wikipedia articles have at least one WikiProject tagging them, any given WikiProject is likely to have missed many articles that reasonably fit within its topic area.
Why WikiProjects?
No taxonomy of topics will be perfect, and mapping all of Wikipedia to ~50 topics is an inherently reductive task. This work uses a taxonomy of topics based on WikiProjects from English Wikipedia. This naturally raises concerns about whether these topics are appropriate for other language editions, but I have not encountered a clearly superior taxonomy for Wikipedia articles, and the WikiProjects taxonomy has the advantage of being easily derived and modifiable. Further details can be found in the initial paper on the drafttopic model[1].
Additional models based purely on Wikidata were also explored but abandoned because the Wikidata instance-of / subclass taxonomy does not map closely enough to more general topic taxonomies. Wikipedia categories are famously difficult to map to a coherent taxonomy and also do not readily scale across languages. Outside taxonomies such as DBpedia introduce additional data-processing complexities. The See Also section below contains more details for those interested in exploring these alternatives.
Modeling
While looking up the WikiProjects that have tagged an article works well for long-standing articles on English Wikipedia, some method is needed to automatically infer these topics when articles are new or outside of English Wikipedia. That is, a model must be built that can predict what topics should be applied to any given Wikipedia article.
Existing drafttopic model
The existing ORES drafttopic model takes the text of a page, represents each word via word embeddings, and makes predictions based on the average of these word embeddings. This allows it to capture the nuances of language without requiring the article to already have the structure (Wikidata items, links) that established Wikipedia articles often have. This approach was taken so that the model would be applicable to drafts of new articles. It has the drawback, however, of being difficult to scale to other languages. Approaches such as multilingual word embeddings are still largely unproven and bring many other challenges around preprocessing, loading the rather large word embeddings into memory, and needing the text of each article (which is nontrivial when analyzing over one million articles across more than 100 language editions).
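The averaging step described above can be sketched in a few lines. This is a minimal illustration, not the drafttopic implementation; the vocabulary and 4-dimensional embedding values are made up:

```python
# Toy word embeddings (illustrative 4-dimensional values, not real vectors).
embeddings = {
    "sculpture": [0.9, 0.1, 0.0, 0.2],
    "garden":    [0.7, 0.3, 0.1, 0.0],
}

def doc_vector(tokens, embeddings, dim=4):
    """Represent a text as the average of its in-vocabulary word embeddings."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    # average component-wise across all matched word vectors
    return [sum(component) / len(vecs) for component in zip(*vecs)]

doc_vector(["sculpture", "garden", "unseen-word"], embeddings)
# -> [0.8, 0.2, 0.05, 0.1] (up to floating point)
```

Out-of-vocabulary words are simply skipped, which is part of why the real model wants fairly large embedding vocabularies per language.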
Wikidata Model
For my particular context of representing page views to existing Wikipedia articles, I am not restricted to article text alone. To avoid building separate language models for each Wikipedia, I choose to represent a given article not by its text but by the statements on its associated Wikidata item. This is naturally language-independent and, intuitively, many Wikidata statements map directly to topics (e.g., an item with the occupation property and physician value should probably fall under STEM.Medicine). These Wikidata statements are treated like a bag of words.
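Concretely, each claim contributes its property token and, when the value is itself a Wikidata item, the value token too. A minimal sketch of this bag-of-words representation (the claim tuples here are illustrative examples):

```python
# Illustrative claims: (property,) or (property, item-value) tuples.
claims = [("P31", "Q5"),       # instance-of: human
          ("P106", "Q36180"),  # occupation: writer
          ("P18",)]            # image: value is a file, so only the property is kept

def claims_to_bag_of_words(claims):
    """Flatten claim tuples into an unordered list of tokens."""
    tokens = []
    for claim in claims:
        tokens.extend(claim)
    return tokens

claims_to_bag_of_words(claims)
# -> ['P31', 'Q5', 'P106', 'Q36180', 'P18']
```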
Gathering training data
I wrote a script that loops through the dump of the current version of English Wikipedia, checks pages in the article talk namespace (1), and retains any talk page that has templates whose name includes "wp" or "wikiproject".
Get English Wikipedia articles tagged with WikiProject templates |
---|
import bz2
import json
import re

import mwparserfromhell
import mwxml

def norm_wp_name(wp):
    """Lowercase, strip the namespace prefix, and collapse underscores/extra whitespace."""
    return re.sub(r"\s\s+", " ", wp.strip().lower().replace("wikipedia:", "").replace('_', ' '))

dump_fn = '/mnt/data/xmldatadumps/public/enwiki/20190701/enwiki-20190701-pages-meta-current.xml.bz2'
output_json = 'wp_templates_by_article.json'
articles_kept = 0
processed = 0
with open(output_json, 'w') as fout:
    dump = mwxml.Dump.from_file(bz2.open(dump_fn, 'rt'))
    for page in dump:
        # talk pages for existing articles
        if page.namespace == 1 and page.redirect is None:
            # get templates from most recent revision
            rev = next(page)
            wikitext = mwparserfromhell.parse(rev.text)
            templates = wikitext.filter_templates()
            # retain templates possibly related to WikiProjects
            possible_wp_tmps = []
            for t in templates:
                template_name = norm_wp_name(t.name.strip_code())
                if 'wp' in template_name or 'wikiproject' in template_name:
                    possible_wp_tmps.append(template_name)
            # output talk pages with at least one potential WikiProject template
            if possible_wp_tmps:
                page_json = {'talk_page_id': page.id, 'talk_page_title': page.title,
                             'rev_id': rev.id, 'templates': possible_wp_tmps}
                fout.write(json.dumps(page_json) + '\n')
                articles_kept += 1
        processed += 1
        if processed % 100000 == 0:
            print("{0} processed. {1} kept.".format(processed, articles_kept))
print("Finished: {0} processed. {1} kept.".format(processed, articles_kept))
|
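For reference, the normalization above is doing most of the matching work: it collapses the many ways a template name may be written into one canonical form. A quick check of its behavior (reproducing the function from the script above on a made-up input):

```python
import re

def norm_wp_name(wp):
    """Lowercase, strip the namespace prefix, and collapse underscores/extra whitespace."""
    return re.sub(r"\s\s+", " ", wp.strip().lower().replace("wikipedia:", "").replace('_', ' '))

norm_wp_name("Wikipedia:WikiProject  Visual_arts ")
# -> 'wikiproject visual arts'
```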
These templates are mapped to topics via the existing drafttopic code (with a few adjustments to clean the directory).
Map WikiProject templates to topics |
---|
import json

# NOTE: reuses norm_wp_name() defined in the script above.

def get_wp_to_mlc(mid_level_categories_json):
    """Build map of WikiProject template names to mid-level categories.

    NOTE: mid_level_categories_json generated as outmid from: https://github.com/wikimedia/drafttopic

    Parameters:
        mid_level_categories_json: JSON file with structure like:
            {
              "wikiprojects": {
                "Culture.Music": [
                  "Wikipedia:WikiProject Music",
                  "Wikipedia:WikiProject Music terminology",
                  "Wikipedia:WikiProject Music theory",
                  ...

    Returns:
        dictionary of WikiProject template names mapped to their mid-level categories. For example:
            {'wikiproject music': ['Culture.Music'],
             'wikiproject cycling': ['History_And_Society.Transportation', 'Culture.Sports'],
             ...
            }
    """
    with open(mid_level_categories_json, 'r') as fin:
        mlc_dir = json.load(fin)
    wp_to_mlc = {}
    for mlc in mlc_dir['wikiprojects']:
        # for each WikiProject name, build a standard set of template names
        for wp in mlc_dir['wikiprojects'][mlc]:  # e.g., "Wikipedia:WikiProject Trains"
            normed_name = norm_wp_name(wp)  # e.g., "wikiproject trains"
            short_name = normed_name.replace("wikiproject", "wp")  # e.g., "wp trains"
            shorter_name = short_name.replace(" ", "")  # e.g., "wptrains"
            flipped_name = normed_name.replace("wikiproject", "").replace(" ", "") + "wikiproject"  # e.g., "trainswikiproject"
            wp_to_mlc[normed_name] = wp_to_mlc.get(normed_name, []) + [mlc]
            wp_to_mlc[short_name] = wp_to_mlc.get(short_name, []) + [mlc]
            wp_to_mlc[shorter_name] = wp_to_mlc.get(shorter_name, []) + [mlc]
            wp_to_mlc[flipped_name] = wp_to_mlc.get(flipped_name, []) + [mlc]
    # common templates that do not fit the standard patterns
    one_offs = {'wpmilhist': 'wikiproject military history',
                'wikiproject elections and referendums': 'wikiproject elections and referenda',
                'wpmed': 'wikiproject medicine',
                'wikiproject mcb': 'wikiproject molecular and cell biology',
                'wikiproject palaeontology': 'wikiproject paleontology',
                'u.s. roads wikiproject': 'wikiproject u.s. roads',
                'wpjournals': 'wikiproject academic journals',
                'wpcoop': 'wikiproject cooperatives',
                'wpukgeo': 'wikiproject uk geography',
                'wikiproject finance': 'wikiproject finance & investment',
                'wpphilippines': 'wikiproject tambayan philippines',
                'wikiproject philippines': 'wikiproject tambayan philippines',
                'wptr': 'wikiproject turkey',
                'wikiproject molecular and cellular biology': 'wikiproject molecular and cell biology',
                'wp uk politics': 'wikiproject politics of the united kingdom',
                'wikiproject nrhp': 'wikiproject national register of historic places',
                'wikiproject crime': 'wikiproject crime and criminal biography',
                'wpj': 'wikiproject japan',
                'wikiproject awards': 'wikiproject awards and prizes',
                'wikiprojectsongs': 'wikiproject songs',
                'wpuk': 'wikiproject united kingdom',
                'wpbio': 'wikiproject biography'}
    for tmp_name, mapped_to in one_offs.items():
        wp_to_mlc[tmp_name] = wp_to_mlc[mapped_to]
    return wp_to_mlc
|
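With the wp_to_mlc map in hand, labeling an article is simply the union of the mid-level categories over its matching templates. A minimal sketch, where the mapping entries and template names are illustrative rather than taken from the real Directory:

```python
# Illustrative subset of a wp_to_mlc mapping.
wp_to_mlc = {"wikiproject museums": ["Culture.Arts"],
             "wikiproject visual arts": ["Culture.Visual arts"]}

def article_topics(template_names, wp_to_mlc):
    """Union of mid-level categories over an article's (normalized) template names."""
    topics = set()
    for name in template_names:
        # non-WikiProject templates simply contribute nothing
        topics.update(wp_to_mlc.get(name, []))
    return sorted(topics)

article_topics(["wikiproject museums", "wikiproject visual arts", "talk header"], wp_to_mlc)
# -> ['Culture.Arts', 'Culture.Visual arts']
```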
Another script then loops through the Wikidata JSON dump and maps each talk page and its topics to a Wikidata item (joining on the title or QID if available) and associated claims.
Get Wikidata properties for set of QIDs or articles |
---|
import bz2
import json

def get_wd_claims(qids=None, titles=None):
    """Get Wikidata properties for a set of QIDs or article titles.

    NOTE: This process takes ~24 hours. It will return all QIDs that were found,
    which may be fewer than the total number searched for.
    All properties are retained for statements. Values are also retained if they are Wikidata items.
    In practice, this means that statements like 'instance-of' are represented by both property and value
    while statements like 'coordinate location' or 'image' are represented by just their property.

    Args:
        qids: dictionary or set of QIDs to retain -- e.g., {'Q42', 'Q3107329', ...}
        titles: dictionary or set of English Wikipedia titles to retain -- e.g., {'Svetlana Alexievich', 'Kitten', ...}

    Returns:
        dictionary of QID to list of claim tuples and a corresponding dictionary of QID to title. For example:
            {'Q42': [('P31', 'Q5'), ('P18', ), ('P21', 'Q6581097'), ...],
             'Q3107329': [('P31', 'Q47461344'), ...],
             ...
            },
            {'Q42': 'Douglas Adams', 'Q3107329': "The Hitchhiker's Guide to the Galaxy (novel)", ...}
    """
    dump_fn = '/mnt/data/xmldatadumps/public/wikidatawiki/entities/20190819/wikidata-20190819-all.json.bz2'
    items_found = 0
    qid_to_claims = {}
    qid_to_title = {}
    print("Building QID->properties map from {0}".format(dump_fn))
    with bz2.open(dump_fn, 'rt') as fin:
        next(fin)  # skip the opening "[" of the JSON array
        for idx, line in enumerate(fin, start=1):
            # load line as JSON minus newline and any trailing comma
            try:
                item_json = json.loads(line[:-2])
            except Exception:
                try:
                    item_json = json.loads(line)
                except Exception:
                    print("Error:", idx, line)
                    continue
            if idx % 100000 == 0:
                print("{0} lines processed. {1} kept.".format(idx, items_found))
            qid = item_json.get('id', None)
            if not qid or (qids is not None and qid not in qids):
                continue
            en_title = item_json.get('sitelinks', {}).get('enwiki', {}).get('title', None)
            if titles is not None and en_title not in titles:
                continue
            claims = item_json.get('claims', {})
            claim_tuples = []
            # each property, such as P31 instance-of
            for prop in claims:
                included = False
                # each value under that property -- e.g., instance-of might have three different values
                for statement in claims[prop]:
                    try:
                        if statement['type'] == 'statement' and statement['mainsnak']['datatype'] == 'wikibase-item':
                            claim_tuples.append((prop, statement['mainsnak']['datavalue']['value']['id']))
                            included = True
                    except Exception:
                        continue
                if not included:
                    claim_tuples.append((prop, ))
            if not claim_tuples:
                claim_tuples = [('<NOCLAIM>', )]
            items_found += 1
            qid_to_claims[qid] = claim_tuples
            qid_to_title[qid] = en_title
    return qid_to_claims, qid_to_title
|
Building supervised model
I use fastText to build a model that predicts a given Wikidata item's topics based on its claims. A more complete description of fastText and how to build this model is contained within this PAWS notebook. Notably, there is some pre-processing to go from the JSON files output above to fastText-ready files.
JSON -> fastText format |
---|
import argparse
import os
from random import sample

import pandas as pd

def to_dataframe(data_fn):
    """Load Wikidata claims (JSON lines) and shuffle the rows for fastText processing."""
    print("Converting {0} -> fastText format.".format(data_fn))
    data = pd.read_json(data_fn, lines=True)
    data.set_index('QID', inplace=True)
    data = data.sample(frac=1, replace=False)
    return data

def wikidata_to_fasttext(data, fasttext_datafn, fasttext_readme):
    """Write xy-data and associated metadata to their respective files."""
    qid_to_metadata = {}
    potential_metadata_cols = [c for c in data.columns if c not in ('claims', 'mid_level_categories')]
    if potential_metadata_cols:
        print("Metadata columns: {0}".format(potential_metadata_cols))
        for qid, row in data.iterrows():
            qid_to_metadata[qid] = {c: row[c] for c in potential_metadata_cols}
    if 'mid_level_categories' in data.columns:
        y_corpus = data["mid_level_categories"]
    else:
        y_corpus = None
    # shuffle each item's claim tuples and flatten them into a token string
    x_corpus = data["claims"].apply(
        lambda row: " ".join([' '.join(pair) for pair in sample(row, len(row))]))
    write_fasttext(x_corpus, fasttext_datafn, fasttext_readme, y_corpus, qid_to_metadata)

def write_fasttext(x_data, data_fn, readme_fn, y_data=None, qid_to_metadata={}):
    """Write data in fastText format."""
    written = 0
    skipped = 0
    no_claims = 0
    with open(readme_fn, 'w') as readme_fout:
        with open(data_fn, 'w') as data_fout:
            for qid, claims in x_data.items():
                if not len(claims):
                    no_claims += 1
                    claims = '<NOCLAIM>'
                if y_data is not None:
                    lbls = y_data.loc[qid]
                    if len(lbls):
                        mlcs = ' '.join(['__label__{0}'.format(c.replace(" ", "_")) for c in lbls])
                        data_fout.write("{0} {1}\n".format(mlcs, claims))
                    else:
                        skipped += 1
                        continue
                else:
                    data_fout.write("{0}\n".format(claims))
                if qid_to_metadata:
                    readme_fout.write("{0}\t{1}\n".format(qid, qid_to_metadata.get(qid, {})))
                else:
                    readme_fout.write("{0}\n".format(qid))
                written += 1
    print("{0} data points written to {1} and {2}. {3} skipped and {4} w/o claims.".format(
        written, data_fn, readme_fn, skipped, no_claims))

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_fn")
    parser.add_argument("--join_approach", default="wikidata")  # label used in output filenames
    parser.add_argument("--train_prop", type=float, default=1.)
    parser.add_argument("--val_prop", type=float, default=0.)
    parser.add_argument("--test_prop", type=float, default=0.)
    args = parser.parse_args()
    if abs(args.train_prop + args.val_prop + args.test_prop - 1) > 1e-6:
        raise ValueError("Train/Val/Test proportions must sum to 1.")
    data = to_dataframe(args.data_fn)
    base_fn = os.path.splitext(args.data_fn)[0]
    if args.train_prop == 1:
        fasttext_datafn = '{0}_{1}_data.txt'.format(base_fn, args.join_approach)
        fasttext_readme = '{0}_{1}_qids.txt'.format(base_fn, args.join_approach)
        wikidata_to_fasttext(data, fasttext_datafn, fasttext_readme)
    else:
        train_idx = int(len(data) * args.train_prop)
        val_idx = train_idx + int(len(data) * args.val_prop)
        if train_idx > 0:
            train_data = data[:train_idx]
            print("{0} training datapoints.".format(len(train_data)))
            fasttext_datafn = '{0}_{1}_train_{2}_data.txt'.format(base_fn, args.join_approach, len(train_data))
            fasttext_readme = '{0}_{1}_train_{2}_qids.txt'.format(base_fn, args.join_approach, len(train_data))
            wikidata_to_fasttext(train_data, fasttext_datafn, fasttext_readme)
        if val_idx > train_idx:
            val_data = data[train_idx:val_idx]
            print("{0} validation datapoints.".format(len(val_data)))
            fasttext_datafn = '{0}_{1}_val_{2}_data.txt'.format(base_fn, args.join_approach, len(val_data))
            fasttext_readme = '{0}_{1}_val_{2}_qids.txt'.format(base_fn, args.join_approach, len(val_data))
            wikidata_to_fasttext(val_data, fasttext_datafn, fasttext_readme)
        if val_idx < len(data):
            test_data = data[val_idx:]
            print("{0} test datapoints.".format(len(test_data)))
            fasttext_datafn = '{0}_{1}_test_{2}_data.txt'.format(base_fn, args.join_approach, len(test_data))
            fasttext_readme = '{0}_{1}_test_{2}_qids.txt'.format(base_fn, args.join_approach, len(test_data))
            wikidata_to_fasttext(test_data, fasttext_datafn, fasttext_readme)

if __name__ == '__main__':
    main()
|
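For clarity, each training example the script above emits is a single line: space-separated `__label__` tokens followed by the flattened claim tokens. A minimal sketch of the per-line formatting (the label and claims shown are illustrative):

```python
def to_fasttext_line(labels, claims):
    """One fastText training line: __label__ tokens, then claim tokens."""
    lbls = " ".join("__label__" + lbl.replace(" ", "_") for lbl in labels)
    toks = " ".join(" ".join(claim) for claim in claims)
    return "{0} {1}".format(lbls, toks)

to_fasttext_line(["STEM.Medicine"], [("P31", "Q5"), ("P106", "Q39631")])
# -> '__label__STEM.Medicine P31 Q5 P106 Q39631'
```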
Performance
As a baseline, this Wikidata model is compared to the existing drafttopic model for English Wikipedia (more drafttopic statistics here). Notably, this is not an apples-to-apples comparison because the Wikidata model is trained on a much larger and different dataset that is much less balanced than the drafttopic dataset. For both models, the measured false negative rate overstates the true error for many classes due to the sparsity of WikiProject templating -- i.e., there are generally many articles that reasonably fit within a given WikiProject but are not labeled as such.
Grid search was used to determine the best choice of fastText hyperparameters. It demonstrated that model performance is largely robust to the specific choices, though higher embedding dimensionality, learning rates, and numbers of epochs led to greater overfitting.
dim | epoch | lr | minCount | ws | train micro f1 | val micro f1 | train macro f1 | val macro f1 |
---|---|---|---|---|---|---|---|---|
50 | 10 | 0.05 | 3 | 5 | 0.815 | 0.809 | 0.64 | 0.627 |
50 | 10 | 0.05 | 3 | 10 | 0.815 | 0.808 | 0.64 | 0.624 |
50 | 10 | 0.05 | 3 | 20 | 0.815 | 0.808 | 0.64 | 0.626 |
50 | 20 | 0.05 | 3 | 5 | 0.822 | 0.812 | 0.661 | 0.635 |
50 | 20 | 0.05 | 3 | 10 | 0.822 | 0.811 | 0.661 | 0.634 |
50 | 20 | 0.05 | 3 | 20 | 0.822 | 0.811 | 0.66 | 0.634 |
50 | 30 | 0.05 | 3 | 5 | 0.827 | 0.813 | 0.674 | 0.64 |
50 | 30 | 0.05 | 3 | 10 | 0.826 | 0.813 | 0.674 | 0.639 |
50 | 30 | 0.05 | 3 | 20 | 0.826 | 0.812 | 0.673 | 0.638 |
50 | 10 | 0.05 | 5 | 5 | 0.813 | 0.808 | 0.637 | 0.624 |
50 | 10 | 0.05 | 5 | 10 | 0.814 | 0.808 | 0.637 | 0.625 |
50 | 10 | 0.05 | 5 | 20 | 0.814 | 0.808 | 0.637 | 0.625 |
50 | 20 | 0.05 | 5 | 5 | 0.82 | 0.811 | 0.655 | 0.633 |
50 | 20 | 0.05 | 5 | 10 | 0.82 | 0.811 | 0.655 | 0.632 |
50 | 20 | 0.05 | 5 | 20 | 0.82 | 0.811 | 0.655 | 0.632 |
50 | 30 | 0.05 | 5 | 5 | 0.823 | 0.812 | 0.666 | 0.638 |
50 | 30 | 0.05 | 5 | 10 | 0.823 | 0.812 | 0.666 | 0.637 |
50 | 30 | 0.05 | 5 | 20 | 0.823 | 0.812 | 0.666 | 0.638 |
50 | 10 | 0.05 | 10 | 5 | 0.811 | 0.807 | 0.632 | 0.62 |
50 | 10 | 0.05 | 10 | 10 | 0.811 | 0.807 | 0.632 | 0.62 |
50 | 10 | 0.05 | 10 | 20 | 0.811 | 0.807 | 0.632 | 0.621 |
50 | 20 | 0.05 | 10 | 5 | 0.816 | 0.809 | 0.647 | 0.629 |
50 | 20 | 0.05 | 10 | 10 | 0.816 | 0.809 | 0.647 | 0.628 |
50 | 20 | 0.05 | 10 | 20 | 0.816 | 0.809 | 0.647 | 0.629 |
50 | 30 | 0.05 | 10 | 5 | 0.818 | 0.81 | 0.655 | 0.631 |
50 | 30 | 0.05 | 10 | 10 | 0.818 | 0.81 | 0.655 | 0.632 |
50 | 30 | 0.05 | 10 | 20 | 0.818 | 0.81 | 0.655 | 0.632 |
100 | 10 | 0.05 | 3 | 5 | 0.815 | 0.809 | 0.641 | 0.625 |
100 | 10 | 0.05 | 3 | 10 | 0.815 | 0.808 | 0.64 | 0.626 |
100 | 10 | 0.05 | 3 | 20 | 0.815 | 0.808 | 0.641 | 0.627 |
100 | 20 | 0.05 | 3 | 5 | 0.822 | 0.811 | 0.662 | 0.635 |
100 | 20 | 0.05 | 3 | 10 | 0.822 | 0.811 | 0.662 | 0.634 |
100 | 20 | 0.05 | 3 | 20 | 0.822 | 0.812 | 0.662 | 0.636 |
100 | 30 | 0.05 | 3 | 5 | 0.827 | 0.813 | 0.675 | 0.639 |
100 | 30 | 0.05 | 3 | 10 | 0.827 | 0.813 | 0.675 | 0.64 |
100 | 30 | 0.05 | 3 | 20 | 0.826 | 0.812 | 0.674 | 0.639 |
100 | 10 | 0.05 | 5 | 5 | 0.814 | 0.808 | 0.639 | 0.624 |
100 | 10 | 0.05 | 5 | 10 | 0.814 | 0.808 | 0.638 | 0.625 |
100 | 10 | 0.05 | 5 | 20 | 0.814 | 0.808 | 0.638 | 0.624 |
100 | 20 | 0.05 | 5 | 5 | 0.82 | 0.811 | 0.656 | 0.633 |
100 | 20 | 0.05 | 5 | 10 | 0.82 | 0.811 | 0.657 | 0.633 |
100 | 20 | 0.05 | 5 | 20 | 0.82 | 0.811 | 0.656 | 0.632 |
100 | 30 | 0.05 | 5 | 5 | 0.823 | 0.812 | 0.667 | 0.638 |
100 | 30 | 0.05 | 5 | 10 | 0.823 | 0.812 | 0.667 | 0.637 |
100 | 30 | 0.05 | 5 | 20 | 0.823 | 0.812 | 0.667 | 0.637 |
100 | 10 | 0.05 | 10 | 5 | 0.812 | 0.807 | 0.634 | 0.622 |
100 | 10 | 0.05 | 10 | 10 | 0.811 | 0.807 | 0.634 | 0.622 |
100 | 10 | 0.05 | 10 | 20 | 0.811 | 0.807 | 0.633 | 0.62 |
100 | 20 | 0.05 | 10 | 5 | 0.816 | 0.809 | 0.648 | 0.629 |
100 | 20 | 0.05 | 10 | 10 | 0.816 | 0.809 | 0.648 | 0.629 |
100 | 20 | 0.05 | 10 | 20 | 0.816 | 0.809 | 0.647 | 0.629 |
100 | 30 | 0.05 | 10 | 5 | 0.818 | 0.81 | 0.656 | 0.633 |
100 | 30 | 0.05 | 10 | 10 | 0.818 | 0.81 | 0.655 | 0.632 |
100 | 30 | 0.05 | 10 | 20 | 0.818 | 0.81 | 0.656 | 0.632 |
50 | 10 | 0.1 | 3 | 5 | 0.817 | 0.809 | 0.65 | 0.631 |
50 | 10 | 0.1 | 3 | 10 | 0.817 | 0.81 | 0.65 | 0.631 |
50 | 10 | 0.1 | 3 | 20 | 0.817 | 0.809 | 0.65 | 0.631 |
50 | 20 | 0.1 | 3 | 5 | 0.824 | 0.812 | 0.67 | 0.638 |
50 | 20 | 0.1 | 3 | 10 | 0.824 | 0.812 | 0.67 | 0.638 |
50 | 20 | 0.1 | 3 | 20 | 0.824 | 0.812 | 0.671 | 0.638 |
50 | 30 | 0.1 | 3 | 5 | 0.828 | 0.813 | 0.683 | 0.639 |
50 | 30 | 0.1 | 3 | 10 | 0.829 | 0.813 | 0.683 | 0.64 |
50 | 30 | 0.1 | 3 | 20 | 0.828 | 0.813 | 0.683 | 0.64 |
50 | 10 | 0.1 | 5 | 5 | 0.815 | 0.809 | 0.646 | 0.63 |
50 | 10 | 0.1 | 5 | 10 | 0.815 | 0.809 | 0.646 | 0.629 |
50 | 10 | 0.1 | 5 | 20 | 0.815 | 0.809 | 0.647 | 0.63 |
50 | 20 | 0.1 | 5 | 5 | 0.821 | 0.811 | 0.663 | 0.636 |
50 | 20 | 0.1 | 5 | 10 | 0.821 | 0.811 | 0.663 | 0.636 |
50 | 20 | 0.1 | 5 | 20 | 0.821 | 0.811 | 0.663 | 0.635 |
50 | 30 | 0.1 | 5 | 5 | 0.825 | 0.812 | 0.673 | 0.637 |
50 | 30 | 0.1 | 5 | 10 | 0.825 | 0.812 | 0.674 | 0.638 |
50 | 30 | 0.1 | 5 | 20 | 0.824 | 0.812 | 0.673 | 0.636 |
50 | 10 | 0.1 | 10 | 5 | 0.813 | 0.808 | 0.64 | 0.625 |
50 | 10 | 0.1 | 10 | 10 | 0.813 | 0.808 | 0.641 | 0.625 |
50 | 10 | 0.1 | 10 | 20 | 0.813 | 0.808 | 0.641 | 0.626 |
50 | 20 | 0.1 | 10 | 5 | 0.817 | 0.81 | 0.654 | 0.631 |
50 | 20 | 0.1 | 10 | 10 | 0.817 | 0.809 | 0.653 | 0.631 |
50 | 20 | 0.1 | 10 | 20 | 0.817 | 0.81 | 0.653 | 0.632 |
50 | 30 | 0.1 | 10 | 5 | 0.82 | 0.81 | 0.661 | 0.633 |
50 | 30 | 0.1 | 10 | 10 | 0.819 | 0.81 | 0.661 | 0.633 |
50 | 30 | 0.1 | 10 | 20 | 0.819 | 0.81 | 0.661 | 0.633 |
100 | 10 | 0.1 | 3 | 5 | 0.817 | 0.809 | 0.651 | 0.631 |
100 | 10 | 0.1 | 3 | 10 | 0.817 | 0.809 | 0.651 | 0.631 |
100 | 10 | 0.1 | 3 | 20 | 0.817 | 0.809 | 0.651 | 0.631 |
100 | 20 | 0.1 | 3 | 5 | 0.824 | 0.812 | 0.671 | 0.639 |
100 | 20 | 0.1 | 3 | 10 | 0.824 | 0.812 | 0.67 | 0.636 |
100 | 20 | 0.1 | 3 | 20 | 0.824 | 0.812 | 0.671 | 0.637 |
100 | 30 | 0.1 | 3 | 5 | 0.828 | 0.813 | 0.683 | 0.641 |
100 | 30 | 0.1 | 3 | 10 | 0.828 | 0.813 | 0.682 | 0.641 |
100 | 30 | 0.1 | 3 | 20 | 0.829 | 0.813 | 0.684 | 0.641 |
100 | 10 | 0.1 | 5 | 5 | 0.816 | 0.809 | 0.648 | 0.632 |
100 | 10 | 0.1 | 5 | 10 | 0.815 | 0.809 | 0.648 | 0.631 |
100 | 10 | 0.1 | 5 | 20 | 0.816 | 0.809 | 0.648 | 0.63 |
100 | 20 | 0.1 | 5 | 5 | 0.821 | 0.811 | 0.664 | 0.636 |
100 | 20 | 0.1 | 5 | 10 | 0.821 | 0.811 | 0.664 | 0.636 |
100 | 20 | 0.1 | 5 | 20 | 0.821 | 0.811 | 0.664 | 0.637 |
100 | 30 | 0.1 | 5 | 5 | 0.825 | 0.812 | 0.674 | 0.637 |
100 | 30 | 0.1 | 5 | 10 | 0.825 | 0.812 | 0.674 | 0.637 |
100 | 30 | 0.1 | 5 | 20 | 0.825 | 0.812 | 0.674 | 0.637 |
100 | 10 | 0.1 | 10 | 5 | 0.813 | 0.808 | 0.641 | 0.626 |
100 | 10 | 0.1 | 10 | 10 | 0.813 | 0.808 | 0.641 | 0.625 |
100 | 10 | 0.1 | 10 | 20 | 0.813 | 0.808 | 0.642 | 0.627 |
100 | 20 | 0.1 | 10 | 5 | 0.817 | 0.81 | 0.654 | 0.633 |
100 | 20 | 0.1 | 10 | 10 | 0.817 | 0.81 | 0.654 | 0.631 |
100 | 20 | 0.1 | 10 | 20 | 0.817 | 0.809 | 0.653 | 0.631 |
100 | 30 | 0.1 | 10 | 5 | 0.819 | 0.81 | 0.661 | 0.634 |
100 | 30 | 0.1 | 10 | 10 | 0.819 | 0.81 | 0.661 | 0.634 |
100 | 30 | 0.1 | 10 | 20 | 0.819 | 0.81 | 0.661 | 0.633 |
50 | 10 | 0.2 | 3 | 5 | 0.82 | 0.811 | 0.658 | 0.636 |
50 | 10 | 0.2 | 3 | 10 | 0.819 | 0.81 | 0.658 | 0.635 |
50 | 10 | 0.2 | 3 | 20 | 0.819 | 0.81 | 0.657 | 0.636 |
50 | 20 | 0.2 | 3 | 5 | 0.826 | 0.812 | 0.677 | 0.639 |
50 | 20 | 0.2 | 3 | 10 | 0.827 | 0.813 | 0.678 | 0.641 |
50 | 20 | 0.2 | 3 | 20 | 0.827 | 0.813 | 0.678 | 0.641 |
50 | 30 | 0.2 | 3 | 5 | 0.831 | 0.813 | 0.69 | 0.64 |
50 | 30 | 0.2 | 3 | 10 | 0.831 | 0.813 | 0.689 | 0.641 |
50 | 30 | 0.2 | 3 | 20 | 0.831 | 0.813 | 0.689 | 0.64 |
50 | 10 | 0.2 | 5 | 5 | 0.817 | 0.81 | 0.653 | 0.633 |
50 | 10 | 0.2 | 5 | 10 | 0.817 | 0.81 | 0.654 | 0.634 |
50 | 10 | 0.2 | 5 | 20 | 0.818 | 0.81 | 0.653 | 0.633 |
50 | 20 | 0.2 | 5 | 5 | 0.823 | 0.812 | 0.67 | 0.637 |
50 | 20 | 0.2 | 5 | 10 | 0.823 | 0.812 | 0.67 | 0.637 |
50 | 20 | 0.2 | 5 | 20 | 0.823 | 0.812 | 0.67 | 0.636 |
50 | 30 | 0.2 | 5 | 5 | 0.826 | 0.812 | 0.679 | 0.637 |
50 | 30 | 0.2 | 5 | 10 | 0.826 | 0.812 | 0.679 | 0.637 |
50 | 30 | 0.2 | 5 | 20 | 0.827 | 0.812 | 0.68 | 0.637 |
50 | 10 | 0.2 | 10 | 5 | 0.814 | 0.809 | 0.646 | 0.63 |
50 | 10 | 0.2 | 10 | 10 | 0.814 | 0.808 | 0.646 | 0.63 |
50 | 10 | 0.2 | 10 | 20 | 0.815 | 0.808 | 0.646 | 0.629 |
50 | 20 | 0.2 | 10 | 5 | 0.819 | 0.81 | 0.658 | 0.633 |
50 | 20 | 0.2 | 10 | 10 | 0.819 | 0.81 | 0.658 | 0.633 |
50 | 20 | 0.2 | 10 | 20 | 0.818 | 0.81 | 0.658 | 0.632 |
50 | 30 | 0.2 | 10 | 5 | 0.821 | 0.81 | 0.665 | 0.631 |
50 | 30 | 0.2 | 10 | 10 | 0.821 | 0.81 | 0.665 | 0.632 |
50 | 30 | 0.2 | 10 | 20 | 0.821 | 0.81 | 0.665 | 0.631 |
100 | 10 | 0.2 | 3 | 5 | 0.819 | 0.81 | 0.658 | 0.638 |
100 | 10 | 0.2 | 3 | 10 | 0.819 | 0.81 | 0.658 | 0.636 |
100 | 10 | 0.2 | 3 | 20 | 0.819 | 0.81 | 0.659 | 0.637 |
100 | 20 | 0.2 | 3 | 5 | 0.827 | 0.813 | 0.678 | 0.639 |
100 | 20 | 0.2 | 3 | 10 | 0.827 | 0.813 | 0.678 | 0.64 |
100 | 20 | 0.2 | 3 | 20 | 0.827 | 0.813 | 0.679 | 0.639 |
100 | 30 | 0.2 | 3 | 5 | 0.831 | 0.813 | 0.69 | 0.639 |
100 | 30 | 0.2 | 3 | 10 | 0.831 | 0.813 | 0.69 | 0.641 |
100 | 30 | 0.2 | 3 | 20 | 0.831 | 0.813 | 0.69 | 0.639 |
100 | 10 | 0.2 | 5 | 5 | 0.817 | 0.81 | 0.653 | 0.634 |
100 | 10 | 0.2 | 5 | 10 | 0.817 | 0.81 | 0.653 | 0.633 |
100 | 10 | 0.2 | 5 | 20 | 0.818 | 0.81 | 0.655 | 0.635 |
100 | 20 | 0.2 | 5 | 5 | 0.823 | 0.812 | 0.671 | 0.637 |
100 | 20 | 0.2 | 5 | 10 | 0.823 | 0.812 | 0.671 | 0.638 |
100 | 20 | 0.2 | 5 | 20 | 0.823 | 0.812 | 0.67 | 0.636 |
100 | 30 | 0.2 | 5 | 5 | 0.826 | 0.812 | 0.679 | 0.637 |
100 | 30 | 0.2 | 5 | 10 | 0.826 | 0.812 | 0.679 | 0.637 |
100 | 30 | 0.2 | 5 | 20 | 0.826 | 0.812 | 0.679 | 0.638 |
100 | 10 | 0.2 | 10 | 5 | 0.814 | 0.808 | 0.647 | 0.632 |
100 | 10 | 0.2 | 10 | 10 | 0.814 | 0.808 | 0.647 | 0.63 |
100 | 10 | 0.2 | 10 | 20 | 0.814 | 0.808 | 0.646 | 0.63 |
100 | 20 | 0.2 | 10 | 5 | 0.819 | 0.81 | 0.659 | 0.633 |
100 | 20 | 0.2 | 10 | 10 | 0.819 | 0.81 | 0.658 | 0.631 |
100 | 20 | 0.2 | 10 | 20 | 0.818 | 0.81 | 0.658 | 0.632 |
100 | 30 | 0.2 | 10 | 5 | 0.821 | 0.81 | 0.665 | 0.632 |
100 | 30 | 0.2 | 10 | 10 | 0.821 | 0.81 | 0.666 | 0.633 |
100 | 30 | 0.2 | 10 | 20 | 0.821 | 0.81 | 0.666 | 0.634 |
Based on the grid search, a model was built with the following hyperparameters and evaluated on the test set: 0.1 lr; 50 dim; 3 minCount; 30 epochs:
Model | Micro Precision | Macro Precision | Micro Recall | Macro Recall | Micro F1 | Macro F1 |
---|---|---|---|---|---|---|
drafttopic | 0.826 | 0.811 | 0.576 | 0.554 | 0.668 | 0.643 |
Wikidata | 0.881 | 0.809 | 0.762 | 0.560 | 0.811 | 0.643 |
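Note that the Wikidata model's advantage shows up almost entirely in the micro-averaged numbers, which pool counts across classes and so weight frequent classes more heavily, while macro-averaging treats every class equally. A small sketch of how the two averages are computed for multilabel predictions (an illustration of the metrics, not the actual evaluation code):

```python
def micro_macro_f1(y_true, y_pred, labels):
    """Micro- and macro-averaged F1 for multilabel predictions given as sets of labels."""
    counts = {}
    for lbl in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if lbl in t and lbl in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if lbl not in t and lbl in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if lbl in t and lbl not in p)
        counts[lbl] = (tp, fp, fn)

    def f1(tp, fp, fn):
        return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0

    # micro: pool the tp/fp/fn counts across labels, then compute F1 once
    micro = f1(*[sum(c[i] for c in counts.values()) for i in range(3)])
    # macro: compute F1 per label, then take the unweighted mean
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro
```

A rare class that the model always misses drags the macro score down by a full 1/num_labels share while barely moving the micro score, which is consistent with the identical macro F1 but higher micro F1 in the table above.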
Qualitative
To explore this Wikidata-based model, you can query it via a local API as described in this code repository: https://github.com/geohci/wikidata-topic-model
See Also
- ORES drafttopic code: https://github.com/wikimedia/drafttopic
- Wikidata Concept Monitor Taxonomy: https://wikitech.wikimedia.org/wiki/Wikidata_Concepts_Monitor#WDCM_Taxonomy
- English Wikipedia category taxonomy: https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#March_2018
- DBpedia ontology: https://wiki.dbpedia.org/services-resources/ontology
- Multilingual word embeddings trained from Wikipedia: https://fasttext.cc/docs/en/pretrained-vectors.html
References
- ↑ Asthana, Sumit; Halfaker, Aaron (November 2018). "With Few Eyes, All Hoaxes Are Deep". Proc. ACM Hum.-Comput. Interact. 2 (CSCW): 21:1–21:18. ISSN 2573-0142. doi:10.1145/3274290.