Research:Expanding Wikipedia stubs across languages/Ranking sections within categories

The idea of recommending sections to editors has arisen as a possible strategy for expanding Wikipedia stubs across languages. The main assumption behind this strategy is that for each type of article (e.g. biographies, countries, movies), there is a desirable structure (a list of sections).

The aim of this work is i) to study how well this assumption holds, and ii) to rank section names according to their predictive power for each article type.

Problem statement

We frame this as a classification problem, using Wikipedia categories as labels and section names as features. Our goal is to study the predictive power of section names for article category classification.
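As a minimal sketch of this framing, each article can be encoded as a binary vector over the vocabulary of section names, with its categories as labels. The toy corpus, the `normalize` step and all function names below are assumptions made for illustration; they are not part of the project's code.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy corpus: article title -> (section names, categories).
ARTICLES: Dict[str, Tuple[List[str], Set[str]]] = {
    "Article A": (["Early life", "Career", "Death", "References"], {"Biographies"}),
    "Article B": (["History", "Geography", "Economy", "References"], {"Countries"}),
    "Article C": (["Early life", "Career", "Legacy", "References"], {"Biographies"}),
}


def normalize(name: str) -> str:
    """Lowercase and trim a section name so trivial variants collapse."""
    return name.strip().lower()


def build_vocabulary(articles: Dict[str, Tuple[List[str], Set[str]]]) -> List[str]:
    """Return the sorted list of every distinct normalized section name."""
    return sorted(
        {normalize(s) for sections, _ in articles.values() for s in sections}
    )


def encode(sections: List[str], vocabulary: List[str]) -> List[int]:
    """Binary feature vector: 1 if the article has that section, else 0."""
    present = {normalize(s) for s in sections}
    return [1 if name in present else 0 for name in vocabulary]


vocabulary = build_vocabulary(ARTICLES)
# Features (one row per article) and labels (one category list per article).
X = [encode(sections, vocabulary) for sections, _ in ARTICLES.values()]
y = [sorted(categories) for _, categories in ARTICLES.values()]
```

Labels are kept as lists because an article can belong to more than one category, so the same encoding works for both single-category and multi-category prediction.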

Data

We will use high-quality articles across different languages, creating a separate classification task per language. As a proxy for article quality, we will define a heuristic based on the number of edits each article has received.
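One possible shape for such a heuristic is a simple edit-count threshold applied per language. The sketch below assumes that the edit count of each article is already available; the `Article` record, the `MIN_EDITS` value and the function name are hypothetical, since the actual heuristic is still being defined.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Article(NamedTuple):
    title: str
    language: str    # wiki language code, e.g. "en"
    edit_count: int  # number of edits (assumed to be available)


# Hypothetical threshold; the real heuristic has not been fixed yet.
MIN_EDITS = 100


def select_high_quality(
    articles: List[Article], min_edits: int = MIN_EDITS
) -> Dict[str, List[Article]]:
    """Group articles by language, keeping those with at least min_edits edits."""
    per_language: Dict[str, List[Article]] = defaultdict(list)
    for article in articles:
        if article.edit_count >= min_edits:
            per_language[article.language].append(article)
    return dict(per_language)


sample = [
    Article("Article A", "en", 250),
    Article("Article B", "en", 12),
    Article("Article C", "es", 130),
]
selected = select_high_quality(sample)
```

Each key of `selected` then corresponds to one per-language classification task.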

Methodology

We will test a set of different classification algorithms for predicting either a single Wikipedia category or a set of categories.

Limitations

There are some limitations that need to be considered for this approach:

  • Wikipedia categories are not necessarily orthogonal, so the same list of sections may apply to several categories. This does not mean that there is no desirable structure for a given category, only that the structure is not unique to it, which makes the classification task more difficult to evaluate.
  • The ranking of features (section names) will prioritize those that are most distinctive of each category, penalizing section names that are frequent across categories (e.g. Introduction or References).
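The second limitation can be illustrated with a small sketch. It scores each section name by its log-lift, log(P(section | category) / P(section)), which is one possible distinctiveness measure and not necessarily the one this project will use. The toy corpus and the function name are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List, Set, Tuple

# Hypothetical toy data: (section names, category) per article.
CORPUS: List[Tuple[Set[str], str]] = [
    ({"Early life", "Career", "References"}, "Biographies"),
    ({"Early life", "Legacy", "References"}, "Biographies"),
    ({"Geography", "Economy", "References"}, "Countries"),
    ({"Geography", "History", "References"}, "Countries"),
]


def rank_sections(
    corpus: List[Tuple[Set[str], str]], category: str
) -> List[Tuple[str, float]]:
    """Rank section names for one category by log-lift.

    A section name that appears in every article scores 0, so ubiquitous
    names sink to the bottom of the ranking.
    """
    overall: Counter = Counter()      # articles containing each section
    in_category: Counter = Counter()  # same, restricted to the category
    n_category = 0
    for sections, label in corpus:
        overall.update(sections)
        if label == category:
            in_category.update(sections)
            n_category += 1
    if n_category == 0:
        return []
    n_total = len(corpus)
    scores = {
        name: math.log((count / n_category) / (overall[name] / n_total))
        for name, count in in_category.items()
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_sections(CORPUS, "Biographies")
```

In this toy corpus "References" appears in every article, so it receives a score of 0 and ranks last for "Biographies", even though every biography has that section. This is the penalization described above: a ranking by distinctiveness is not the same as a ranking by how desirable a section is.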

Current Status

We are in the data collection phase and are defining the heuristic for selecting high-quality articles.