Research:Wikipedia Navigation Vectors
Although Word2vec models were developed to learn word embeddings from a corpus of sentences, they can be applied to any kind of sequential data. The learned embeddings have the property that items with similar neighbors in the training corpus have similar representations (as measured, for example, by cosine similarity). Consequently, applying Word2vec to reading sessions results in article embeddings, where articles that tend to be read in close succession have similar representations. Since people usually generate sequences of semantically related articles while reading, these embeddings also capture semantic similarity between articles.
There have been several approaches to learning vector representations of Wikipedia articles that capture semantic similarity by using the article text or the links between articles. An advantage of training Word2vec models on reading sessions is that they learn from the actions of millions of humans who are using a diverse array of signals, including the article text, links, third-party search engines, and their existing domain knowledge, to determine what to read next in order to learn about a topic.
An additional benefit of not relying on text or links is that we can learn representations for Wikidata items by simply mapping article titles within each session to Wikidata items using Wikidata sitelinks. As a result, these Wikidata vectors are jointly trained over reading sessions for all Wikipedia language editions, allowing the model to learn from people across the globe. This approach also overcomes data sparsity issues for smaller Wikipedias, since the representations for articles in smaller Wikipedias are shared across many other, potentially larger, ones. Finally, instead of needing to generate a separate embedding for each Wikipedia in each language, we have a single model that gives a vector representation for any article in any language, provided the article has been mapped to a Wikidata item.
Where to get the Data
The canonical citation and most up-to-date version of this dataset can be found at:
- Ellery Wulczyn (2016). Wikipedia Navigation Vectors. figshare. doi:10.6084/m9.figshare.3146878
Check out this IPython notebook for a tutorial on how to work with the data.
Since we don't have unique user tokens, we define a "client" as an (IP, UA, XFF) tuple, that is, the combination of IP address, user agent, and X-Forwarded-For header. To generate a list of requests per client for a given timespan:
- take all non-spider requests for all Wikipedias
- resolve redirects for the top 20 most visited Wikipedias
- filter out requests for non-main namespace articles
- filter out disambiguation pages
- filter out requests for articles that were requested by fewer than 50 clients
- filter out requests for articles that do not have a corresponding Wikidata item
- group requests per client
- remove all data from any client who made an edit
- break requests from a client into sessions whenever there is a gap of 30 minutes or more between requests
- drop sessions with a request to the Main Page
- collapse consecutive requests for the same article into a single request
- filter out sessions with fewer than 2 requests or more than 30 requests
- train a Word2vec model using the original C implementation
- save the vectors in the standard word2vec text format
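The session-building steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline that produced the released data: the function name `sessionize`, the `(timestamp, article)` input shape, and the `Main_Page` title are assumptions made for the example, and the earlier filtering steps (spiders, redirects, namespaces, editors, and so on) are assumed to have already been applied.

```python
from datetime import datetime, timedelta
from itertools import groupby

SESSION_GAP = timedelta(minutes=30)  # a gap of 30 minutes or more starts a new session
MIN_LEN, MAX_LEN = 2, 30             # keep sessions with 2 to 30 requests


def sessionize(requests, main_page="Main_Page"):
    """Split one client's requests into cleaned sessions.

    `requests` is a list of (timestamp, article) tuples for a single client
    (hypothetical input shape, chosen for this sketch).
    """
    sessions, current, last_ts = [], [], None
    for ts, article in sorted(requests):
        # Break the request stream whenever the gap is 30 minutes or more.
        if last_ts is not None and ts - last_ts >= SESSION_GAP:
            sessions.append(current)
            current = []
        current.append(article)
        last_ts = ts
    if current:
        sessions.append(current)

    cleaned = []
    for session in sessions:
        # Drop sessions that contain a request to the Main Page.
        if main_page in session:
            continue
        # Collapse consecutive requests for the same article.
        collapsed = [article for article, _ in groupby(session)]
        # Filter out sessions that are too short or too long.
        if MIN_LEN <= len(collapsed) <= MAX_LEN:
            cleaned.append(collapsed)
    return cleaned


t0 = datetime(2016, 3, 1, 12, 0)
example_requests = [
    (t0, "A"),
    (t0 + timedelta(minutes=1), "A"),
    (t0 + timedelta(minutes=5), "B"),
    (t0 + timedelta(minutes=40), "C"),
    (t0 + timedelta(minutes=41), "Main_Page"),
    (t0 + timedelta(minutes=80), "D"),
]
example_sessions = sessionize(example_requests)
```

Each cleaned session can then be written as one line of space-separated tokens, which is the input layout the word2vec tool expects for a "sentence".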
To give some sense of the scale of the training data, here are some counts for a typical week's worth of data:
Wikidata Embedding:
- # of sessions for training: 370M
- # of items across all training sessions: 1.4B
English Wikipedia Embedding:
- # of sessions for training: 170M
- # of items across all training sessions: 650M
The word2vec algorithm has several hyper-parameters. To tune the algorithm, we used one month of training data and ran a randomized search over the following parameter grid:
- size: (50, 100, 200, 300)
- window: (1,2,4,6,10)
- sample: (1e-3, 5e-4, 1e-4, 5e-5, 1e-5)
- hs : (0,)
- negative : (3,5,10,25,50)
- iter: (1,2,3,4,6,10)
- cbow: (0,1)
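As a rough sketch of what a randomized search over this grid could look like, the snippet below draws random configurations and turns each one into a command line for the C word2vec tool. The grid values are taken from the list above and the flag names match the tool's options; the helper names, the `./word2vec` binary location, and the file paths are assumptions made for illustration, and the number of configurations actually tried is not stated in this document.

```python
import random

# Parameter grid copied from the list above; keys match word2vec's command-line flags.
PARAM_GRID = {
    "size": (50, 100, 200, 300),
    "window": (1, 2, 4, 6, 10),
    "sample": (1e-3, 5e-4, 1e-4, 5e-5, 1e-5),
    "hs": (0,),
    "negative": (3, 5, 10, 25, 50),
    "iter": (1, 2, 3, 4, 6, 10),
    "cbow": (0, 1),
}


def sample_configs(grid, n, seed=0):
    """Draw n configurations by picking one value per parameter at random."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n)
    ]


def to_word2vec_args(config, train_path, output_path, binary="./word2vec"):
    """Build an argument list for the C word2vec tool (binary path is an assumption)."""
    # "-binary 0" asks the tool to write vectors in its text format.
    args = [binary, "-train", train_path, "-output", output_path, "-binary", "0"]
    for name, value in config.items():
        args += [f"-{name}", str(value)]
    return args


configs = sample_configs(PARAM_GRID, n=5)
commands = [
    to_word2vec_args(cfg, "sessions.txt", f"vectors_{i}.txt")  # hypothetical file names
    for i, cfg in enumerate(configs)
]
```

Each command could be run with `subprocess.run(command, check=True)` and the resulting vectors scored on held-out sessions to pick the best configuration.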
To evaluate an embedding, we took a random sample of 50k sessions from the week following the month during which the training data was collected. For each session in this evaluation set, we randomly selected a pair of articles and found the rank of the second article in the list of nearest neighbors of the first article. The evaluation metric is the mean reciprocal rank (MRR) of the second article over all pairs. The best model attained an MRR of 0.166.
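The metric described above can be illustrated with a small, pure-Python sketch. It assumes that neighbors are ranked by cosine similarity, that both articles in every pair have a vector, and that the two articles in a pair are distinct; the function names and toy vectors are made up for the example, and a vectorized implementation would be needed at real scale.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_reciprocal_rank(vectors, pairs):
    """Average of 1/rank of `target` among the nearest neighbors of `query`.

    `vectors` maps item -> list of floats; `pairs` is a list of (query, target).
    """
    total = 0.0
    for query, target in pairs:
        target_sim = cosine(vectors[query], vectors[target])
        # Rank = 1 + number of other items that are closer to the query than the target.
        closer = sum(
            1
            for item, vec in vectors.items()
            if item not in (query, target) and cosine(vectors[query], vec) > target_sim
        )
        total += 1.0 / (1 + closer)
    return total / len(pairs)


# Toy vectors, made up for illustration only.
toy_vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
toy_mrr = mean_reciprocal_rank(toy_vectors, [("a", "b"), ("a", "c")])
```

In the toy example, "b" is the nearest neighbor of "a" (reciprocal rank 1) and "c" is second (reciprocal rank 1/2), so the MRR is 0.75.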
On Figshare, there are currently releases for models trained on the requests from the following timespans:
- 2016-09-01 through 2016-09-30
- 2016-08-01 through 2016-08-31
- 2016-03-01 through 2016-03-07
Each release contains an embedding for English Wikipedia and an embedding for Wikidata for different embedding dimensions. The embedding file names have the following structure:
To give an example, the "2016-03-01_2016-03-07_wikidata_100" file contains a 100-dimensional Wikidata item embedding that was trained on data from the first week of March 2016.
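Because the vectors are saved in the standard word2vec text format, they can be read without any special tooling. The sketch below is one way to do it: the first line of such a file holds the vocabulary size and the dimension, and every following line holds a token and its vector components separated by spaces. The function name and the toy file contents are assumptions for illustration; the tokens in the released files may look different.

```python
import io


def load_word2vec_text(handle):
    """Read vectors from an open word2vec text-format file.

    Returns (vectors, dim), where vectors maps token -> list of floats.
    """
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocabulary size> <dimension>"
    vectors = {}
    for line in handle:
        # The C tool leaves a trailing space before the newline, so strip first.
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


# Toy file contents, made up for illustration (not real released vectors).
toy_file = io.StringIO("2 3\nQ1 0.1 0.2 0.3 \nQ2 0.4 0.5 0.6 \n")
toy_vectors, toy_dim = load_word2vec_text(toy_file)
```

For a downloaded release, the same function could be called as `load_word2vec_text(open(path, encoding="utf-8"))`, where `path` points at one of the embedding files.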
Note: Since a lot of readership on Wikipedia is driven by trending topics in the media, you can expect the embeddings for articles relating to media events to change based on these trends. For example, the nearest neighbor for Hillary Clinton may be Bernie Sanders in one month and Donald Trump in the next, depending on what is happening in the presidential race. For articles about less trendy topics, the nearest neighbors should be fairly stable across releases.
Here are some ideas for how to use these embeddings to improve Wikipedia.
We recently created a tool for recommending articles for translation between Wikipedias. Users choose a source language to translate from and a target language to translate to. Then they choose a seed article in the source language that represents their interests. We can find articles in the source language missing in the target language using Wikidata sitelinks, and then rank the missing articles according to the similarity between their vectors and the seed vector.
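The ranking step described above could look roughly like the following. This is an illustrative sketch rather than the code behind the tool: the `sitelinks` structure (a mapping from Wikidata item to the set of language codes that have an article for it), the function names, and the toy data are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def recommend_for_translation(seed, vectors, sitelinks, source, target, k=3):
    """Rank items that exist in `source` but not in `target` by similarity to `seed`.

    `sitelinks` maps item -> set of language codes (hypothetical structure).
    """
    missing = [
        item
        for item, langs in sitelinks.items()
        if source in langs and target not in langs and item != seed and item in vectors
    ]
    missing.sort(key=lambda item: cosine(vectors[seed], vectors[item]), reverse=True)
    return missing[:k]


# Toy data, made up for illustration only.
toy_vectors = {"Q1": [1.0, 0.0], "Q2": [0.9, 0.1], "Q3": [0.0, 1.0], "Q4": [0.8, 0.2]}
toy_sitelinks = {"Q1": {"en", "fr"}, "Q2": {"en"}, "Q3": {"en"}, "Q4": {"en", "fr"}}
toy_recs = recommend_for_translation("Q1", toy_vectors, toy_sitelinks, "en", "fr")
```

Here "Q2" and "Q3" are the only items missing from the target language, and "Q2" is ranked first because its vector is closer to the seed.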
The Reading Team at WMF recently introduced a "Related Pages" feature that gives readers 3 recommendations for further reading. The current recommendations are generated by the More Like This query in Elasticsearch.
Instead, we could generate recommendations for further reading by looking up, in an embedding, the nearest neighbors of the article the reader is currently viewing. The advantage of this approach is that the nearest neighbors are by definition articles that tend to be read together. Furthermore, the Wikidata embedding would allow us to use a single model to generate recommendations across all languages! Here is a demo of how this could work.
If articles are frequently read within the same session, you might be able to make Wikipedia easier to navigate by creating a link between them. For a given article, you could generate recommendations for links to add by finding the nearest neighbors that are not already linked, and adding a link if the original article contains suitable anchor text. Again, the Wikidata embedding would allow you to build a model for all languages.
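A minimal sketch of the candidate-generation part of this idea is shown below. It only finds nearest neighbors that are not yet linked; checking whether the article contains suitable anchor text is left out. The function names, the `existing_links` structure (article -> set of articles it already links to), and the toy data are assumptions for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def recommend_links(article, vectors, existing_links, k=5):
    """Return up to k nearest neighbors of `article` that it does not already link to.

    `existing_links` maps article -> set of linked articles (hypothetical structure).
    """
    already_linked = existing_links.get(article, set())
    candidates = [
        (cosine(vectors[article], vec), other)
        for other, vec in vectors.items()
        if other != article and other not in already_linked
    ]
    # Highest similarity first.
    return [other for _, other in sorted(candidates, reverse=True)[:k]]


# Toy data, made up for illustration only.
toy_vectors = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.7, 0.3], "D": [0.0, 1.0]}
toy_links = {"A": {"B"}}
toy_candidates = recommend_links("A", toy_vectors, toy_links, k=2)
```

In the toy example, "B" is the closest article to "A" but is skipped because it is already linked, so the candidates are "C" followed by "D".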
- Wikipedia Vectors, figshare