
Research talk:Expanding Wikipedia articles across languages/Work log/2018-01-03

From Meta, a Wikimedia project coordination wiki

Wednesday, January 3, 2018


We are interested in applying what we have learned so far to languages other than English. We aim to track this work here.

Cleaning up the category network


Applying the methodology we developed to clean the category network across languages is possible, but costly. Recently, Amit et al. have developed a new way of cleaning up Wikipedia's category network in 280 languages. Instead of spending time cleaning up the category network in a new language, we decided to use their cleaned category network (almost out of the box) and assess the quality of the resulting recommendations. We decided to focus on French Wikipedia for this phase, given the ongoing conversations with the Ma Commune folks and our wish to help them expand the article types for which their tool can recommend sections.

Results


Preliminary Dataset

https://drive.google.com/file/d/1dYKNBXk-l_FfVdca9Kk7uzNJNrdQVVH6/view?usp=sharing

JSON format:

One category per line: {"category": <category_name>, "recs": [{"title": <section_title>, "relevance": <relevance_score>},...]}

Example:

{"category":"Catégorie:Ville_de_Souss-Massa-Drâa","recs":[{"relevance":0.3333333333333333,"title":"Notes et références"},{"relevance":0.3333333333333333,"title":"Voir aussi"},{"relevance":0.2222222222222222,"title":"Démographie"},{"relevance":0.2222222222222222,"title":"Économie"},{"relevance":0.1111111111111111,"title":"Infrastructures"},{"relevance":0.1111111111111111,"title":"Culture"},{"relevance":0.1111111111111111,"title":"Population"},{"relevance":0.1111111111111111,"title":"Manifestations"},{"relevance":0.1111111111111111,"title":"Vue d'ensemble"},{"relevance":0.1111111111111111,"title":"Climat"}]}
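A minimal sketch of how a file in this format could be read, assuming it is UTF-8 text with one JSON object per line as described above. The function name `load_recommendations` and the resulting dictionary layout are illustrative choices, not part of the dataset.

```python
import json


def load_recommendations(lines):
    """Parse one-category-per-line JSON into {category: {section_title: relevance}}.

    `lines` can be any iterable of strings, e.g. an open file handle.
    """
    recs_by_category = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        recs_by_category[record["category"]] = {
            rec["title"]: rec["relevance"] for rec in record["recs"]
        }
    return recs_by_category


# Shortened version of the example record above.
sample = (
    '{"category":"Catégorie:Ville_de_Souss-Massa-Drâa","recs":['
    '{"relevance":0.3333333333333333,"title":"Notes et références"},'
    '{"relevance":0.2222222222222222,"title":"Démographie"}]}'
)
index = load_recommendations([sample])
```

With a downloaded copy of the dataset, the same function could be called on a file opened with `open(path, encoding="utf-8")`, where `path` is wherever the file was saved.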

Basic Usage

The simplest approach is:

  • take all the categories of the target article
  • merge the recommendations by summing the relevance scores of the shared titles
  • sort by score (desc) and show the top K
  • optional: filter the very common sections (Notes et références, Voir aussi, ...).
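The steps above can be sketched as follows. This is an illustration under stated assumptions rather than the code used to produce the dataset: the function name `recommend_sections`, the contents of `COMMON_SECTIONS`, the alphabetical tie-breaking, and the categories and scores in the toy data are all hypothetical.

```python
from collections import defaultdict

# Assumed filter list; the source only names these two as examples.
COMMON_SECTIONS = frozenset({"Notes et références", "Voir aussi"})


def recommend_sections(article_categories, recs_by_category, k=10,
                       exclude=COMMON_SECTIONS):
    """Sum relevance scores per section title over the article's categories,
    drop excluded titles, and return the top k (title, score) pairs."""
    scores = defaultdict(float)
    for category in article_categories:
        # Categories missing from the dataset contribute nothing.
        for title, relevance in recs_by_category.get(category, {}).items():
            scores[title] += relevance
    ranked = sorted(
        ((title, score) for title, score in scores.items()
         if title not in exclude),
        # Descending score; ties broken alphabetically (an arbitrary choice).
        key=lambda item: (-item[1], item[0]),
    )
    return ranked[:k]


# Toy data with hypothetical categories and scores.
toy_recs = {
    "Catégorie:A": {"Démographie": 0.2, "Économie": 0.2, "Voir aussi": 0.3},
    "Catégorie:B": {"Démographie": 0.25, "Climat": 0.1, "Voir aussi": 0.3},
}
top = recommend_sections(["Catégorie:A", "Catégorie:B"], toy_recs, k=2)
```

In the toy data, "Démographie" appears in both categories, so its summed score puts it ahead of sections that appear in only one.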

The method works better when the article has more than one category, because summing the scores promotes the sections that are relevant across several of its categories. In the future, we will apply learning-to-rank techniques to give different weights to different categories.