Research:Investigating Semantic Navigation on Wikipedia

From Meta, a Wikimedia project coordination wiki

This page documents a planned research project.
Information may be incomplete and change before the project starts.


Key Personnel

  • Thomas Niebler
  • Alexander Dallmann
  • Martin Becker
  • Florian Lemmerich
  • Andreas Hotho

Project Summary

We aim to investigate user navigation on Wikipedia, both to extract semantic relatedness information and to enhance navigation, for example by recommending interesting pages to read or by automatically building a category system. Specifically, we want to shed light on the mutual influence of user navigation behaviour and extractable semantic relatedness.

In earlier work, we extracted semantic relatedness from navigation paths in the WikiGame[1] and from the Clickstream dataset[2], which contains heavily anonymized navigation information collected over one month on the Wikipedia servers. More recently, we investigated random walks on the Wikipedia link network and on clickstream navigation data from Wikipedia[3]. We showed that semantic relatedness information extracted from these random walks correlates very highly with a standard dataset of human intuition about semantic relatedness. However, the clickstream dataset covers only a single month of navigation data and is heavily anonymized, and we resorted to random walks only because more detailed request logs were not available. We therefore think that our method would benefit greatly from access to the actual raw request logs. This would let us extend our already promising research to real navigation data and explore different settings for both our own method[2] (which we have already ported to Spark/Hadoop) and the Word2Vec approach, which we have already applied successfully to random walks[3].
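The evaluation described above compares model relatedness scores with human judgements. The following is a minimal sketch of such a comparison, assuming Spearman rank correlation as the measure (the text does not name the measure used). The score lists are hypothetical toy values, not results from our experiments.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks.

    Assumes both inputs have the same length and are not constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical toy data: one score per word pair.
human_scores = [9.0, 7.5, 3.0, 1.0]   # human relatedness judgements
model_scores = [0.8, 0.4, 0.6, 0.1]   # relatedness extracted from navigation
rho = spearman(human_scores, model_scores)
```

A value of `rho` close to 1 would indicate that the extracted relatedness orders the word pairs much as humans do.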

We have also put considerable effort into analyzing user navigation behaviour, resulting for example in the HypTrails method[4] (best paper award at WWW'15), which measures the plausibility of navigation hypotheses on navigation datasets. We want to apply our expertise in this area to real navigation data on Wikipedia, thus offering new insights into user navigation as well as into the influence of user navigation on Wikipedia's semantic content.

In this project, we want to investigate whether navigation data from other months yields the same amount of extractable semantic information when used as a prior for generated random walks. For this, we need access to the Wikipedia log data.
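To illustrate what using navigation data as a prior for random walks could look like, here is a minimal sketch in which the next page of a walk is drawn in proportion to observed transition counts. The function name, the data layout, and the toy counts are hypothetical and chosen only for illustration; they are not taken from our implementation.

```python
import random

def weighted_random_walk(transitions, start, length, rng):
    """Generate one walk of at most `length` pages.

    `transitions` maps a page to a dict {successor: observed click count}.
    The next page is drawn with probability proportional to its count.
    """
    walk = [start]
    current = start
    for _ in range(length - 1):
        successors = transitions.get(current)
        if not successors:
            break  # dead end: no outgoing clicks were observed
        pages = list(successors)
        weights = [successors[p] for p in pages]
        current = rng.choices(pages, weights=weights, k=1)[0]
        walk.append(current)
    return walk

# Hypothetical toy click counts between pages.
transitions = {
    "Germany": {"Berlin": 80, "Europe": 20},
    "Berlin": {"Germany": 50, "Brandenburg_Gate": 50},
    "Europe": {"Germany": 100},
    "Brandenburg_Gate": {},
}

rng = random.Random(42)  # fixed seed so the sketch is reproducible
walks = [weighted_random_walk(transitions, "Germany", 5, rng) for _ in range(3)]
```

Replacing the counts with those of another month changes the prior while leaving the walk generation itself untouched.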

Methods

To extract semantic relations from navigation, we apply the Word2Vec approach presented by Mikolov et al. (as in the above-mentioned work[3]) as well as the counting methods presented by Singer[1] and Niebler[2]. To analyze user navigation behaviour, we use the HypTrails method[4].
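As a rough illustration of the counting idea, the sketch below builds co-occurrence vectors from navigation paths with a simple sliding window and compares pages by cosine similarity. This is a generic window-based scheme and not necessarily the exact counting method of [1] or [2]. The paths, the window size, and the function names are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(paths, window=2):
    """Count, for each page, the pages within `window` steps of it on a path."""
    vectors = defaultdict(Counter)
    for path in paths:
        for i, page in enumerate(path):
            lo = max(0, i - window)
            hi = min(len(path), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[page][path[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy navigation paths.
paths = [
    ["Germany", "Berlin", "Brandenburg_Gate"],
    ["France", "Paris", "Eiffel_Tower"],
    ["Germany", "Berlin", "Europe"],
    ["France", "Paris", "Europe"],
]

vectors = cooccurrence_vectors(paths, window=2)
sim_berlin_germany = cosine(vectors["Berlin"], vectors["Germany"])
sim_berlin_eiffel = cosine(vectors["Berlin"], vectors["Eiffel_Tower"])
```

In this toy data, "Berlin" and "Germany" share context pages while "Berlin" and "Eiffel_Tower" share none, so the first similarity is the larger of the two.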

Dissemination

We plan to publish our code on GitHub and to publish our research findings as full papers at major CS conferences such as WSDM, WWW, or ISWC. Additionally, we plan to publish a major study as a journal article.

Wikimedia Policies, Ethics, and Human Subjects Protection

Benefits for the Wikimedia community

The results can be used to enhance user navigation on Wikipedia by proposing semantically related pages or by automatically rebuilding the category tree.

Timeline

Funding

External links

Contacts

  • Thomas Niebler: niebler@informatik.uni-wuerzburg.de / Twitter: ThomasNiebler / Skype: Thomasniebler_uni

References