Created
Collaborators
Alberto García Durán
Robert West
Duration:  2020-June – ??

Information may be incomplete and change as the project progresses.

The main question we want to answer is how much of the complexity of reader navigation in Wikipedia can be captured by the publicly available clickstream data. To answer this question, we have performed a systematic comparison of ensembles of real navigation sequences with three different synthetically generated navigation sequences, which further allows us to quantitatively assess the differences in the observed user-behavior between real and synthetic navigation sequences.

We wrote up the findings of our project in this paper [1] (available as pre-print):

## Data

Specifically, we use 8 different language versions of Wikipedia for which clickstream data is available, namely—English (EN), Japanese (JA), German (DE), Russian (RU), French (FR), Italian (IT), Polish (PL), and Persian (FA). Additionally, this choice covers languages from different families. Please see Table 1 for detailed dataset statistics.

Dataset statistics

In addition to the observed real navigation sequences, we generate sequences by performing random walks with the following three different transition probabilities between articles in the Wikipedia hyperlink graph. Please see Table 2 for an overview of the four different types of navigation sequences analyzed in this work.

• Clickstream-Priv: weights proportional to the number of clicks between two articles obtained from the webrequest logs, which is similar to Clickstream-Pub, except that it also includes pairs of articles with 10 or fewer observations.
• Clickstream-Pub: weights proportional to the number of clicks between two articles obtained from the publicly available clickstream data.
• Graph: weights are uniform over all outgoing links from the Wikipedia hyperlink graph.
An overview of navigation sequence types analyzed in this work.

## Methodology

We quantify the difference between real and synthetic navigation sequences using techniques belonging to two broad facets:

• Empirical characterization of navigation sequences

### Empirical characterization

By construction, the synthetic sequences follow a Markov process of order 1, i.e., when generating a synthetic sequence, the probability for the next article only depends on the current article and not on any of the previously visited articles. Our aim is to quantify the degree to which previously visited articles affect the probability for the next article in real navigation sequences.

• Mixing of flows: First, we quantify the mixing of incoming and outgoing navigation flows when passing through a given article ${\displaystyle m}$ using information-theoretic measures, specifically, the adjusted mutual information (AMI). For navigation sequences passing through ${\displaystyle m}$, the mutual information (MI) measures the average amount of information (in bits) about the target-page ${\displaystyle t}$ when knowing the source-page ${\displaystyle s}$. The advantage of AMI is that it is normalized between 0 (full mixing, no information about ${\displaystyle t}$) and 1 (no mixing, full information about ${\displaystyle t}$) and takes into account non-zero values of the MI from finite-size effects.
• Diffusion in the semantic space: Next, moving beyond sequences passing through a specific article, we assess whether a similar effect holds for sequences more generally. We quantify how much faster synthetic navigation sequences diffuse in an abstracted semantic space using an article-embedding. We use a Semantic embedding that captures the semantic similarity between different articles. We then map the navigation sequence into this space and measure the cosine-distance between the starting article and the article reached after ${\displaystyle k}$ steps.

## Key Results

### Mixing of Flows

• We show the distribution of values for the AMI and provide two examples illustrating two pages with high and low mixing, respectively. Qualitatively, we see that for ${\displaystyle AMI\approx 0.1}$ sequences are almost completely mixing (no correlation between source and target when passing though ${\displaystyle m}$) while for ${\displaystyle AMI\approx 0.6}$ sequences only mix to a very small degree (strong correlation between source and target when passing though ${\displaystyle m}$).
• Overall, we observe that most pages show a high degree of mixing, that is less than ${\displaystyle 10\%}$ of pages have an ${\displaystyle AMI>0.1}$ with a fast exponential decay of the distribution such that less than ${\displaystyle 0.1}$ of pages exhibit ${\displaystyle AMI>0.5}$. We find a similar pattern for all 8 languages.
• Since by construction, the synthetic navigation sequences exhibit full mixing due to the fact that incoming and outgoing traffic are completely uncorrelated, the AMI analysis establishes that in majority of the cases real navigation is similar to synthetic navigation in this regard.

### Diffusion in semantic space

• We observe that the distribution of distances for real and synthetic sequences from clickstream are almost completely overlapping (even for larger ${\displaystyle k}$) showing that differences are present but overall small.
• In contrast, the distribution of distances from the synthetic sequences of the unweighted random walk on the hyperlink graph is substantially shifted to larger values indicating that these sequences are much less confined and, as a result, diffuse much faster.
• As a sanity-check, we show that the semantic distance between two completely randomly drawn articles is much larger, corroborating that the embedding space captures the semantic similarity between articles.
• These findings hold across different languages.

## Summarizing take-aways

• Across all 8 considered languages, our results show that real navigation sequences exhibit strong mixing characterized by the mutual information being close to zero, i.e. the source-article contains very little information about the target-article of the navigation sequence.
• However, for a small set of articles (around ${\displaystyle 0.1-1\%}$) we observed much larger values of mutual information. This provides clear evidence for cases where the real navigation sequences differ substantially from the synthetic navigation sequences.
• We also showed that, as a result, overall the diffusion in the semantic space of articles is virtually indistinguishable when comparing real navigation sequences with synthetic sequences from public clickstream. While differences in the cosine-similarity between articles separated by ${\displaystyle k}$ steps are statistically significant, the effect sizes are small.
• We showed that synthetic navigation sequences from the public clickstream dataset are effective for most practical applications for all 8 considered languages. Specifically, we compared the performance of real and synthetic navigation sequences in four different downstream tasks involving the use of navigation sequences: link prediction, next-article prediction, generating representations, and topic classification.
• Our results revealed that using clickstream data often yields performance that is within ${\displaystyle 10\%}$ (or less) in comparison to using real navigation sequences. Specifically, article embeddings generated from synthetic navigation sequences are of comparable quality to those generated from real navigation sequences.
• Furthermore, in the case of next-article prediction and semantic similarity/relatedness we found evidence that the main limitations do not originate from the restrictions to first-order Markov processes when generating the synthetic data but from additional filtering of the public data using ${\displaystyle k}$-anonymity necessary for ensuring privacy.
1. Arora, A., Gerlach, M., Piccardi, T., García-Durán, A., & West, R. (2022). Wikipedia Reader Navigation: When Synthetic Data Is Enough. In arXiv [cs.CY]. arXiv. http://arxiv.org/abs/2201.00812