Research:Automated classification of article importance

This page documents a completed research project.


Whether some articles on Wikipedia are more important than others has been broadly discussed both in the Wikipedia communities and in the research literature. For example, there is a list of articles every Wikipedia should have on Meta, which has been repeatedly criticised for being biased towards Western (European and US) culture. Similarly, the English Wikipedia has a project maintaining a list of vital articles, with different levels of importance.

In scholarly research, the issue has typically been approached by looking at the structure of links between articles. For example, we examined the articles found in the most languages[1], and also those with the greatest amount of content across all languages. Similarly, a paper at WikiSym 2012[2] studied the importance of biographies in 15 languages through network analysis of the link structure within each language, using betweenness centrality as a measure of importance.

Importance has also been widely studied in the computer science literature, where it plays a part in how to rank results from search queries. Ideally, the search results should be both relevant and authoritative. Google’s PageRank algorithm uses the link structure to define which pages are most authoritative in order to rank them higher.
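
The link-based idea behind PageRank can be illustrated with a short power-iteration sketch. This is a textbook simplification, not Google's production algorithm: the damping factor of 0.85, the iteration count, and the toy link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the same "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy link graph: three pages link to "Hub".
toy_links = {
    "Hub": ["A"],
    "A": ["Hub"],
    "B": ["Hub"],
    "C": ["Hub"],
}
scores = pagerank(toy_links)
print(max(scores, key=scores.get))
```

In this toy graph the page with the most inlinks ends up with the highest score, and the page it links to inherits much of that score, which is the sense in which link structure conveys authority.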

Proposed Project and Deliverables

We are proposing a project that will study article importance in the context of Wikipedia, with the following milestones and deliverables:

  1. A thorough literature review of relevant work in order to understand the nuances of what “importance” means and how it relates to Wikipedia. This literature review will be publicly available on Meta (a draft is available on Research:Studies of Importance).
  2. Communication of the launch of the project on-wiki, and the creation of a group of stakeholders who can provide feedback during development stages of the project.
  3. A publicly released dataset with the article importance assessments we will use in our model training and evaluation. This dataset will be accompanied by an analysis of the available approaches to gathering such data, which will also be part of the project’s pages on Meta. See Research:Automated classification of article importance/Gathering importance data for the analysis of how to gather data on article importance ratings.
  4. A classifier (machine learner) that can predict article importance for all existing articles in a given Wikipedia edition.
  5. A classifier that can predict article importance within a given WikiProject.

In addition, a user evaluation will be performed towards completion of this project. This evaluation will be led by Jonathan Morgan from Design Research. The planned goal of the evaluation is to understand how users judge the classifiers' performance, as well as how these classifiers relate to other approaches to defining article importance (e.g., PageRank and article views).

Benefits

We see several benefits from this project for both the Wikimedia Foundation and the Wikipedia community.

Overall Assessment of Wikipedia’s Success

Being able to assess the importance of all articles would allow us to measure how successful Wikipedia has been overall. This can be accomplished by comparing article importance against article quality. If important articles are not of high quality, it would suggest that there is still a lot of room for improvement.
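
One way to picture this comparison is to flag articles whose importance rating is high but whose quality rating is low. The sketch below is only an illustration: the rating scales are simplified versions of the English Wikipedia assessment classes (A-class and list classes are left out), and the thresholds, function name, and sample ratings are hypothetical.

```python
# Ordinal ranks for simplified assessment classes (higher is better).
QUALITY_RANK = {"Stub": 0, "Start": 1, "C": 2, "B": 3, "GA": 4, "FA": 5}
IMPORTANCE_RANK = {"Low": 0, "Mid": 1, "High": 2, "Top": 3}


def needs_attention(articles, min_importance="High", max_quality="C"):
    """Return titles rated at least min_importance but at most max_quality."""
    flagged = []
    for title, importance, quality in articles:
        important = IMPORTANCE_RANK[importance] >= IMPORTANCE_RANK[min_importance]
        weak = QUALITY_RANK[quality] <= QUALITY_RANK[max_quality]
        if important and weak:
            flagged.append(title)
    return flagged


# Hypothetical (title, importance, quality) ratings, for illustration only.
sample = [
    ("Article A", "Top", "FA"),
    ("Article B", "Top", "Start"),
    ("Article C", "Low", "Stub"),
    ("Article D", "High", "C"),
]
print(needs_attention(sample))
```

The share of important articles that end up flagged would then be one rough indicator of how much room for improvement remains.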

Reducing Contributor Workload

Assessing an article’s importance and applying said label to the article’s talk page is currently a completely manual process. At least on English Wikipedia, WikiProject members rate an article’s importance together with its quality.

Having a machine learner that predicts importance means we can build software tools that suggest importance ratings, both for individual articles and for batches of articles. This should make it easier to maintain existing ratings as well as enable rating of articles that do not currently have an importance rating (as of January 4, 2017, 1.2 million articles on English Wikipedia have a quality assessment rating but not a corresponding importance rating[3]). It can potentially also improve rating consistency by ensuring that articles of similar importance are rated similarly.

Directing Contributor Attention

Directing contributor attention towards important content can both reduce contributor workload and potentially increase Wikipedia’s impact on readers. This type of intervention is already used by SuggestBot on English Wikipedia, although it uses the number of article views as its measure of importance. We also propose to determine importance within specific topic areas (e.g., WikiProjects), which would allow us to match importance with contributor interest and thereby help contributors stay motivated to continue contributing. One of the Foundation’s goals is to distribute high-quality free knowledge, and matching contributor interest and importance in this fashion is one way of meeting that goal.

Feasibility Study

To determine the feasibility of using machine learning to predict importance, we have completed a small feasibility study. This study reuses a dataset gathered by Aaron Halfaker for a preliminary investigation into article importance. The dataset was gathered from the English Wikipedia, using WikiProject assessments of article importance. A known caveat with this approach, and one of the reasons why we seek to do a thorough literature review, is that WikiProject assessments are localised. Different WikiProjects can give discordant ratings of the same article’s importance, whereas we are aiming to measure global importance. Since we are only aiming to ascertain whether a machine learning approach is feasible, this issue is of minor concern in this context.

A notebook that replicates the R code used in this study is available on PAWS: Feasibility Study of Automatic Article Importance Classification

Methods

Using various machine learning and statistical models, we examine the relationship between WikiProject assessments of article importance and two variables that might help determine importance: number of views and number of links pointing to a given article. Both of these variables have been previously used in the research literature as measures of importance.

The dataset contains 3,746,600 importance labels across five categories. In decreasing order of importance, the categories are: Top, High, Mid, Low, and Unknown. The number of articles differs between categories. Because some of the machine learners work best with balanced categories, meaning roughly the same number of articles in each, we randomly select 5,000 importance labels from each category for a total dataset of 25,000 labels. While we could have used a much larger dataset, it would have greatly increased the computation time of some of our models. We have confirmed that the smaller dataset does not significantly alter our findings.
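
The balanced sampling step can be sketched as follows. This is not the study's R code; the function name, the fixed seed, and the small imbalanced label set are assumptions chosen so the example runs quickly.

```python
import random
from collections import defaultdict


def balanced_sample(labels, per_category, seed=0):
    """Draw the same number of labelled items from every category.

    labels is an iterable of (article_id, category) pairs.
    """
    by_category = defaultdict(list)
    for article_id, category in labels:
        by_category[category].append((article_id, category))

    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sample = []
    for category in sorted(by_category):
        items = by_category[category]
        if len(items) < per_category:
            raise ValueError(f"only {len(items)} labels in category {category!r}")
        # Sample without replacement within the category.
        sample.extend(rng.sample(items, per_category))
    return sample


# Hypothetical imbalanced label set; the study drew 5,000 per category.
categories = {"Top": 20, "High": 50, "Mid": 200, "Low": 700, "Unknown": 30}
labels = [
    (f"{name}-{i}", name)
    for name, count in categories.items()
    for i in range(count)
]
subset = balanced_sample(labels, per_category=10)
print(len(subset))
```

Calling the same function with `per_category=5000` on the full label set would give a 25,000-label dataset of the shape described above.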

Using this dataset, we apply several machine learning and statistical models:

  1. Least-Squares (Linear) Regression
  2. Random Forest Classifier
  3. Random Forest Regression
  4. Support Vector Machine
  5. Gradient Boost Classifier

Results

Because we have the same number of articles in each importance category, we can use overall accuracy (proportion of correct predictions) to assess classifier performance. For the linear and Random Forest regressions, we report the percentage of variance explained by the model. For the other methods, we measure the average accuracy using 10-fold cross-validation.
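
For readers unfamiliar with the evaluation procedure, 10-fold cross-validation can be sketched as below. The nearest-centroid classifier is only a stand-in for the models listed earlier, and the two-feature data is synthetic and deliberately easy to separate; none of it comes from the study's dataset.

```python
import random


def k_fold_accuracy(features, labels, train, predict, k=10, seed=0):
    """Average accuracy over k folds; each item is held out exactly once."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]

    accuracies = []
    for held_out in folds:
        held_out_set = set(held_out)
        train_idx = [i for i in indices if i not in held_out_set]
        model = train([features[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct = sum(predict(model, features[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


def train_centroids(xs, ys):
    """Stand-in classifier: store the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        total = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            total[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in total] for y, total in sums.items()}


def predict_centroid(model, x):
    """Predict the class whose centroid is closest to x."""
    def distance(y):
        return sum((a - b) ** 2 for a, b in zip(model[y], x))
    return min(model, key=distance)


# Synthetic, well-separated data purely for illustration.
rng = random.Random(1)
features, labels = [], []
for class_index, name in enumerate(["Top", "High", "Mid", "Low", "Unknown"]):
    for _ in range(40):
        features.append([class_index * 10 + rng.random(), rng.random()])
        labels.append(name)

accuracy = k_fold_accuracy(features, labels, train_centroids, predict_centroid)
print(f"{accuracy:.2%}")
```

Because every label is predicted by a model that never saw it during training, the averaged accuracy is a fairer estimate of performance on new articles than accuracy measured on the training data.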

Method                              Accuracy (%)   % Variance Explained
Linear Regression                   n/a            28.86
Random Forest Classifier            32.60          n/a
Random Forest Regression            n/a            26.15
Support Vector Machine Classifier   34.68          n/a
Gradient Boost Classifier           34.67          n/a
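
Assuming the "% Variance Explained" column is the usual coefficient of determination (R²) expressed as a percentage, it can be computed as in the sketch below. How the study encoded the importance categories as numbers for the regression models is not stated here, so the numeric values and predictions in the example are hypothetical.

```python
def variance_explained(actual, predicted):
    """Return R^2 = 1 - SS_res / SS_tot as a percentage."""
    mean = sum(actual) / len(actual)
    # Total variation of the observed values around their mean.
    ss_tot = sum((y - mean) ** 2 for y in actual)
    # Variation left unexplained by the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 100.0 * (1.0 - ss_res / ss_tot)


# Hypothetical numeric encoding of importance and hypothetical
# model predictions, for illustration only.
actual = [1, 2, 3, 4]
predicted = [1.5, 2.0, 3.0, 3.5]
print(round(variance_explained(actual, predicted), 2))
```

A model that always predicts the mean scores 0%, and a perfect model scores 100%.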

We can compare the classifiers’ accuracy against a random choice in order to judge whether the two variables we have added carry useful information. Because the five categories are equally sized, a random choice would have an average accuracy of 1/5, or 20%.
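
The comparison against the random baseline is simple arithmetic, worked through below using the accuracy figures from the results table. The variable names are our own.

```python
# Accuracy figures (%) copied from the results table.
accuracy = {
    "Random Forest Classifier": 32.60,
    "Support Vector Machine Classifier": 34.68,
    "Gradient Boost Classifier": 34.67,
}

NUM_CATEGORIES = 5
baseline = 100.0 / NUM_CATEGORIES  # uniform random guess on balanced classes

# Percentage points gained over the random baseline, per method.
improvement = {method: value - baseline for method, value in accuracy.items()}

for method, gain in improvement.items():
    ratio = accuracy[method] / baseline
    print(f"{method}: +{gain:.2f} points, {ratio:.2f}x baseline")
```

Each classifier is between 12 and 15 percentage points above chance.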

Based on the reported accuracy and the variance explained by the statistical models, article importance is clearly related to the number of article views and the number of links pointing to a given article: every classifier exceeds the 20% random baseline by more than 12 percentage points. At the same time, the results are not as strong as we would want them to be, which is why we propose further research in this area.

Project Highlights

  1. A draft literature review of studies of importance is available (see Research:Studies of Importance).
  2. We started a collaboration with the English WikiProject Medicine on the initial development of a classifier that can predict the importance of all articles in the context of that project. During this collaboration, we proposed a reassessment of the articles about breastfeeding and abortion, based on model predictions that their importance rating should be changed to Top-importance (from High- and Mid-importance, respectively). WikiProject Medicine members reviewed and implemented these changes (see the edits to Talk:Breastfeeding and Talk:Abortion).

References

  1. Warncke-Wang, M., Uduwage, A., Dong, Z., and Riedl, J. "In Search of the Ur-Wikipedia: Universality, Similarity, and Translation in the Wikipedia Inter-language Link Network". In Proceedings of WikiSym 2012.
  2. Aragón, P., Laniado, D., Kaltenbrunner, A., and Volkovich, Y. "Biographical Social Networks on Wikipedia: A cross-cultural study of links that made history". In Proceedings of WikiSym 2012.
  3. See the WP 1.0 Bot overall table