Research:Expanding Wikipedia articles across languages/Stub expansion

Contact: Bob West, Michele Catasta, Tiziano Piccardi

This page documents a completed research project.


Wikipedia stubs, or, more broadly, Wikipedia articles that need expansion, are abundant across Wikipedia languages. There is a need to easily identify and expose such articles, to allow potential editors to filter through them (using a variety of parameters), and/or to recommend how such articles can be expanded. The last point is especially important for onboarding newcomers, for example through edit-a-thons.

This research aims to address the above. What you will find below is the researchers' ongoing work towards that goal. Please note that this research is in its early stages, meaning that methods and approaches can and will change based on our ongoing discussions with Wikipedia editors, event organizers, and researchers. Feedback is welcome on the Discussion page.


What is a stub?

What counts as a "stub article" has no formal definition and is subject to different interpretations depending on the topic of the article and the Wikipedia sub-community. The current guidelines for the English version of Wikipedia set a minimum number of characters (1,500) to consider an article "good enough" (https://simple.wikipedia.org/wiki/Wikipedia:Stub). This definition is clearly quite vague, as this number of characters could be enough for a TV show but is certainly not sufficient to describe World War II.

Currently, articles are marked as stubs by Wikipedia editors, who are responsible for adding the corresponding {{stub}} template to the body of the page. In the English version of Wikipedia, the name of the template includes the category of the article, encoded using the format {{category_name-stub}}; in October 2016, the total number of templates with this format was 27,572. An alternative way to approach the stub problem is to use the quality classifier (https://github.com/wiki-ai/wikiclass) currently employed to generate the statistics page (https://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Statistics). Nevertheless, we decided to select the stub articles using the {{*-stub}} templates: given the active participation of the Wikipedia community, this crowdsourced labeling is more reliable for understanding what people consider a stub.

In particular, in the public dump released in October 2016 we can identify a total of 2,004,059 articles that match the stub template format. This value is obtained by filtering by namespace (namespace == 0) to keep only regular articles and by excluding the entries that are marked as redirect pages in the database dump. These articles represent ~37% of all of Wikipedia, and on average they have a length of 755 characters. Figure 1 shows the frequency distribution of stub lengths. As expected, the majority of these articles are shorter than the indicative limit of 1,500 characters: only ~12.95% (259,537) exceed this value, and 97.50% of the articles have fewer than 2,529 characters.

Figure 1. Frequency distribution of stub lengths
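
As an illustration, the selection described above can be sketched as follows. This is a minimal sketch, assuming the mwxml Python library and an uncompressed pages-articles dump; the regular expression is a simplified version of the {{*-stub}} template matching:

  import re
  import mwxml  # library for parsing MediaWiki XML dumps

  # Simplified match for English Wikipedia stub templates, e.g. {{Japan-stub}}
  STUB_RE = re.compile(r"\{\{[^{}|]*-stub[^{}]*\}\}", re.IGNORECASE)

  def iter_stubs(dump_path):
      """Yield (title, length) for main-namespace, non-redirect stub articles."""
      dump = mwxml.Dump.from_file(open(dump_path))
      for page in dump:
          if page.namespace != 0 or page.redirect is not None:
              continue  # keep only regular articles, skip redirects
          for revision in page:  # a pages-articles dump has one revision per page
              text = revision.text or ""
              if STUB_RE.search(text):
                  yield page.title, len(text)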

Unfortunately, the template format used to select the stubs does not generalize across languages, as each version of Wikipedia has developed its own formalism for marking articles. For example, in the Italian version the equivalent template has the format {{S|category1|category2}}, where the editor can specify multiple categories. In order to extend these results, an ad-hoc analysis for each language is required.

Stubs and time

Limiting the problem to the English Wikipedia, we investigated the possibility of using the removal of the template as an indicative event marking the article as good enough. This raises the problem of defining what constitutes a real "de-stub" event and what should be considered just noise in the history of the article's revisions.

To select a meaningful time window for analyzing the evolution of the stub articles, we chose as the lower bound the date this formalism was introduced. In the revisions dataset, the first article marked as a stub is "Peterborough United F.C.", where the template was added on 4 January 2003. Figure 2 shows that it is not always easy to identify when an article is actually "good enough", because a considerable portion of the dataset oscillates frequently between the two statuses. A manual inspection of the articles with the largest variations highlights that these articles are the ones about controversial topics such as politics (e.g., the top one is "George W. Bush").

Figure 2. Frequency of stub/non-stub changes during the lifespan of an article
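
As a sketch of how such oscillations can be detected, assuming an article's revisions are available as (timestamp, wikitext) pairs sorted by time (this revision format is an assumption, not the actual pipeline):

  import re

  STUB_RE = re.compile(r"\{\{[^{}|]*-stub[^{}]*\}\}", re.IGNORECASE)

  def stub_transitions(revisions):
      """List the stub/non-stub status changes of an article. Many entries
      suggest a noisy history rather than a single clean "de-stub" event."""
      changes = []
      previous = None
      for timestamp, text in revisions:
          is_stub = bool(STUB_RE.search(text or ""))
          if previous is not None and is_stub != previous:
              # True = template re-added, False = template removed (de-stub)
              changes.append((timestamp, is_stub))
          previous = is_stub
      return changes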

Simple classifier

From 15 October to 27 November 2016, the Wikipedia community organized a challenge called the "Africa Destubathon" (https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Africa/The_Africa_Destubathon) with the goal of expanding a list of selected articles about Africa. The organizers provided a list of guidelines and a total prize of US$2,380 for the most active users. Thanks to this event, more than 2,000 articles now have enough content to be considered "good articles". These articles provide a good example for exploring how the structure evolves when an article changes status (from stub to non-stub). To verify the coherence between all the articles before and after the event, we trained a very simple binary classifier. For each article, we selected as a positive example (is a stub) the last revision before the challenge, and as a negative example (not a stub) the first revision within this time window without the stub template. In total, we selected 1,710 articles and, for each of them, these two revisions at different times.

Model: Random Forest
Sample size: 1,710 stubs (before the challenge) + 1,710 non-stubs (after the challenge)
Features: "has_infobox", "links_count", "references_count", "sections_count", "text_length"
Scores (10-fold cross-validation):
  • Accuracy: 0.75 (+/- 0.05)
  • F1: 0.78
  • Precision: 0.81
  • Recall: 0.76
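
For illustration, a minimal sketch of this evaluation with scikit-learn, assuming the five features have already been extracted into a matrix; the hyperparameters shown are illustrative, not necessarily the ones used in the experiment:

  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import cross_validate

  FEATURES = ["has_infobox", "links_count", "references_count",
              "sections_count", "text_length"]

  def evaluate(X, y):
      """X: (3420, 5) matrix of the features above, one row per revision;
      y: 1 = stub (before the challenge), 0 = non-stub (after)."""
      clf = RandomForestClassifier(n_estimators=100, random_state=0)
      scores = cross_validate(clf, X, y, cv=10,
                              scoring=["accuracy", "f1", "precision", "recall"])
      for metric in ("accuracy", "f1", "precision", "recall"):
          values = scores["test_" + metric]
          print("%s: %.2f (+/- %.2f)" % (metric, values.mean(), values.std() * 2))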

Considering that we used only a few raw features without any kind of scaling or preprocessing, the classification gives encouraging results.

Supporting the editors

Previous attempts to approach the stub expansion problem are based on automatic generation of content using external sources and summarization techniques. While this approach may produce acceptable results for short article portions, it is hard to obtain content of the same quality as human-written sections. Since the growth and quality of Wikipedia articles depend on the editors, the best way to expand the coverage of the encyclopedia is to support users in their writing activity.

A common issue that may arise when an editor creates a new article (or edits a stub) is the limited support offered by the editing tool regarding the structure of the article and how it should look in order to be readable and consistent with the articles in the same category. Starting from an almost empty article may add extra workload and can discourage the user from contributing (kenophobia! :) ).

[TODO: considerations about category-structure]

In particular, the structure of an article can be decomposed into:

  • the titles of the sections
  • the presence and the content of the Infobox
  • the attached media files
  • the number of references
  • the indicative minimum length
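
As a sketch, this decomposition could be captured in a simple data structure (a hypothetical Python model, not part of any existing tool):

  from dataclasses import dataclass, field
  from typing import List, Optional

  @dataclass
  class ArticleStructure:
      """Structural skeleton of an article, following the decomposition above."""
      section_titles: List[str] = field(default_factory=list)
      infobox: Optional[str] = None        # infobox template name, if present
      media_files: List[str] = field(default_factory=list)
      references_count: int = 0
      min_length: int = 1500               # indicative minimum length (characters)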


Templates generator

One effective way to help editors expand the content of an article is the introduction of a template generator that pre-compiles the structure of the page, giving the user the possibility to quickly understand what is missing.

Recommending sections

Figure 3. Section title occurrences vs. frequency. A few titles have many occurrences, while ~1.3M titles are unique.

Currently, the section structure is defined by the editors, who are responsible for choosing the titles and the organization, using similar articles as a reference. This high degree of freedom is not necessarily a desirable feature of the editing tool. It also leads to a highly unstructured organization of the section titles, which can give an incoherent representation across articles of the same type (e.g., biographies, countries, movies).

In total, the English Wikipedia has 1,578,661 unique section titles, and about 1.3 million of them occur only once. Figure 3 shows the frequency distribution of the titles grouped by the number of occurrences across all articles. This means that currently there is a small portion of extremely frequent titles (e.g., REFERENCES appears in ~4M+ articles) and a much larger portion of very rare or unique ones. Some examples of unique titles are "DRAMATIS PERSONAE AND CHARACTER ANALYSIS", "OTHER SONGS RECORDED", and "BIOGRAPHICAL INFLUENCES".
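
As an illustration, a minimal sketch of how this frequency distribution can be computed from the wikitext of the articles; the heading regular expression is a simplification, and the iterable of article texts is an assumed input:

  import re
  from collections import Counter

  # MediaWiki section headings look like "== Title ==" (2 to 6 '=' signs)
  HEADING_RE = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$", re.MULTILINE)

  def title_frequencies(article_texts):
      """Count how often each (case-normalized) section title occurs."""
      counts = Counter()
      for text in article_texts:
          for _, title in HEADING_RE.findall(text):
              counts[title.upper()] += 1  # normalize case, e.g. REFERENCES
      return counts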

By removing the titles that appear fewer than 10 times across all the articles, we obtained about 50,000 section titles, which we used to generate association rules (https://en.wikipedia.org/wiki/Association_rule_learning) describing the relations between these terms. This approach makes it possible to support the editor in improving an article by recommending the missing sections based on its current content. It leaves open the problem of creating a template for a completely empty article, where we do not have enough titles to act as a "seed" for the template generator. In future work, we aim to extract a possible categorization for the templates and to design a smart interface that prompts the user to provide the topic of the article.

The following examples show the section titles already present in an article and the recommended missing section with its associated confidence:

  • [RELEASE AND RECEPTION,PRODUCTION,CAST] => [PLOT], 0.849624060150376
  • [COMMUNITIES,DEMOGRAPHICS,HISTORY] => [GEOGRAPHY], 0.9277360066833751
  • [ARCHERY,GYMNASTICS,ROWING,SWIMMING] => [JUDO], 0.9527272727272728

Note that this approach does not consider the order of the sections. The ordering problem will be explored in future work using a precedence graph to describe the most probable sequence.
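
For illustration, a minimal sketch of this rule mining and recommendation step, assuming the mlxtend Python library and that the section titles have already been extracted per article; the support and confidence thresholds are illustrative:

  import pandas as pd
  from mlxtend.preprocessing import TransactionEncoder
  from mlxtend.frequent_patterns import apriori, association_rules

  def mine_section_rules(articles_sections, min_support=0.001, min_confidence=0.8):
      """articles_sections: one list of section titles per article."""
      te = TransactionEncoder()
      onehot = te.fit(articles_sections).transform(articles_sections)
      df = pd.DataFrame(onehot, columns=te.columns_)
      itemsets = apriori(df, min_support=min_support, use_colnames=True)
      return association_rules(itemsets, metric="confidence",
                               min_threshold=min_confidence)

  def recommend_sections(rules, current_sections):
      """Recommend sections whose rule antecedents all appear in the article."""
      present = set(current_sections)
      hits = rules[rules["antecedents"].apply(lambda a: a <= present)]
      return hits[["consequents", "confidence"]].sort_values(
          "confidence", ascending=False)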

See the Datasets section for more examples.

Datasets

Conclusion

While working on this research, we realized that expanding stubs is a subset of the problem we are interested in, namely expanding any Wikipedia article. This broader problem also includes the stub expansion research. You can read more about this new direction at Expanding Wikipedia articles across languages.

References