Research:Measuring article importance

Created
19:34, 16 October 2014 (UTC)

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

The importance of some encyclopedic topics above others is often discussed but poorly understood. Intuitively, it seems clear that some content in Wikipedia is more important than other content -- and that coverage of highly important content is more valuable than coverage of less important content.

In this research project, we explore two different strategies for measuring the importance of Wikipedia articles -- view rate and inlink count. We compare these strategies against human-assessed importance ratings in English Wikipedia. We conclude that the choice between the two measurements is arbitrary: both carry useful signal for approximating human assessment. However, there are some notable differences that should be considered when applying either metric (e.g. in Chemistry). We therefore recommend that others make use of both metrics and compare the differences.

Importance in Wikipedia

WP 1.0 Importance x Quality table. 

The qualitative assessment of the importance of articles has long been used by Wikipedians to organize their work. For example, English Wikipedians have gathered a list of what they felt to be the core 1000 topics that should be covered well in an encyclopedia. Many WikiProjects have also adopted an ordinal importance rating scale used by the Wikipedia v1.0 Editorial Team of "Unknown", "Low", "Mid", "High" and "Top". Bots like WP 1.0 bot are used to generate and update tables that enable Wikipedians to quickly find high importance, low quality articles to contribute to.

This manual method was later superseded for Wikipedia 1.0: Kiwix instead used a PageRank-like method to select a subset of (English and Italian Wikipedia) articles from which to build a small ZIM file for offline consumption. The resulting selection was generally very reasonable, so manual selection work was abandoned.

Quantitative measurement of the importance of articles has been used in the literature to measure value and to examine disparities between reader and editor notions of importance. Priedhorsky et al.[1] used the view rate of an article while a contribution was present as a way of approximating the value that contribution brought to readers. This strategy assumes that readers find value in the articles they read -- and therefore that contributions to highly viewed articles produce more value.

Other studies focused on disparities between what editors chose to write about and external measures of importance. Samoilenko and Yasseri[2] used citation count metrics of academics as a proxy for importance, compared them to the quality of encyclopedia articles, and found no significant correlation. Further, Lehmann et al.[3] discuss the disparity between articles that receive editor attention and articles that receive reader attention, and argue that reader demand represents a missed opportunity to direct editors towards more valuable work.

Methods

In this project, we focus on two measures of article importance: page view rate and inlink count. These two measure

Extracting importance ratings

In order to gather a sample of articles that have importance ratings, we take advantage of a common category naming schema of <rating> + "-importance " + <wikiproject> + " articles": e.g. "Top-importance Africa articles". This query gathers importance ratings using this category extraction strategy.
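The category naming schema described above can be parsed with a short pattern match. The following is a minimal sketch, not the query used in the project: the function name `parse_importance_category` is hypothetical, and it assumes that only the five rating labels mentioned on this page appear and that underscores in stored titles stand for spaces.

```python
import re

# Assumption: only these five rating labels are of interest here.
CATEGORY_RE = re.compile(
    r"^(Top|High|Mid|Low|Unknown)-importance (.+) articles$"
)

def parse_importance_category(category_title):
    """Split a category title into (rating, wikiproject), or return None.

    Underscores are treated as spaces, as in MediaWiki database titles.
    """
    match = CATEGORY_RE.match(category_title.replace("_", " "))
    if match is None:
        return None
    return match.group(1), match.group(2)

print(parse_importance_category("Top-importance Africa articles"))
```

Categories that do not follow the schema (for example, ordinary topical categories) return `None` and can simply be skipped.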

Descriptive stats from 2014-10-21
SELECT importance, COUNT(DISTINCT page_id)
FROM importance_classification
GROUP BY importance;
+------------+-------------------------+
| importance | COUNT(DISTINCT page_id) |
+------------+-------------------------+
| Top        |                   35639 |
| High       |                  144344 |
| Mid        |                  584461 |
| Low        |                 2359667 |
| Unknown    |                 1794587 |
+------------+-------------------------+
5 rows in set (6.41 sec)

Note that this dataset allows the same article to appear in multiple importance categories due to differences in scope between WikiProjects. For example, five WikiProjects have assigned the article .NET Framework importance ratings ranging from "Mid" to "Top". For our analysis, we use the mean of all ratings. This did not result in a substantial difference from selecting the maximum rating.
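Averaging ordinal ratings requires mapping them to numbers first. The sketch below illustrates one way to do that. The numeric scores (Low=1 through Top=4) and the decision to skip "Unknown" ratings are assumptions made for illustration; this page does not state which mapping the analysis used.

```python
from statistics import mean

# Assumed numeric mapping; "Unknown" is deliberately left out.
RATING_SCORE = {"Low": 1, "Mid": 2, "High": 3, "Top": 4}

def mean_importance(ratings):
    """Return the mean score of the known ratings, or None if there are none."""
    scores = [RATING_SCORE[r] for r in ratings if r in RATING_SCORE]
    return mean(scores) if scores else None

def max_importance(ratings):
    """Return the highest score among the known ratings, or None."""
    scores = [RATING_SCORE[r] for r in ratings if r in RATING_SCORE]
    return max(scores) if scores else None

# Hypothetical ratings for one article from five WikiProjects.
example = ["Mid", "High", "Top", "Mid", "High"]
print(mean_importance(example), max_importance(example))
```

Computing both aggregates side by side makes it easy to check how often the mean and the maximum disagree for a given sample.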

Page view rate

To extract a general page view rate, we downloaded sample hourly view log data for the month of September 2014 and wrote a Python script to aggregate the view counts into a monthly total. This monthly total is taken to represent a general view rate.

This strategy will be susceptible to strange results in the case of articles that have short term popularity (or unpopularity) during the month of September, but we expect a sustained month of unusual view rates to be unlikely.
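The aggregation step can be sketched as follows. This is not the project's script. It assumes each log line has the space-separated form `project page_title view_count bytes` (the layout of the hourly pagecounts files of that period), and the function name `aggregate_views` is hypothetical.

```python
from collections import Counter

def aggregate_views(lines, project="en"):
    """Sum hourly view counts per page title for one project.

    `lines` is any iterable of log lines, e.g. the lines of every
    hourly file for the month chained together.
    """
    totals = Counter()
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        # Skip malformed lines and lines for other projects.
        if len(parts) != 4 or parts[0] != project:
            continue
        try:
            count = int(parts[2])
        except ValueError:
            continue
        totals[parts[1]] += count
    return totals

sample = [
    "en Main_Page 10 2000\n",
    "en Main_Page 5 1000\n",
    "de Hauptseite 7 900\n",
    "en Chemistry 3 500\n",
]
print(aggregate_views(sample))
```

Because the function accepts any iterable, the hourly files can be streamed one after another without holding a whole month of logs in memory.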

Inlink count

To extract the count of inlinks between articles, we experimented with two strategies: naive pagelinks table and parsed organic links.

Naive pagelinks table. MediaWiki maintains a pagelinks table that records links between pages and is updated every time that a new revision is saved. Links can be counted by joining this table to the page table. On Oct. 25th, 2014 we queried this table to gather a raw count of inlinks for all namespace zero pages.
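Once link pairs have been exported from such a join, counting inlinks is a simple grouping step. The sketch below assumes the export is an iterable of `(source_title, target_title)` pairs already restricted to namespace zero. Counting each linking page once and ignoring self-links are illustrative choices, not something this page specifies.

```python
from collections import defaultdict

def inlink_counts(links):
    """Count the distinct pages linking to each target page."""
    sources = defaultdict(set)
    for source, target in links:
        if source != target:  # assumption: self-links are not counted
            sources[target].add(source)
    return {target: len(pages) for target, pages in sources.items()}

pairs = [
    ("Acid", "Chemistry"),
    ("Base", "Chemistry"),
    ("Acid", "Chemistry"),       # duplicate pair, counted once
    ("Chemistry", "Chemistry"),  # self-link, ignored
    ("Chemistry", "Acid"),
]
print(inlink_counts(pairs))
```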

Parsed organic links. After working with the pagelinks dataset, we realized that many articles use transcluded content containing many links to related content (generally referred to as a Navigation Box). Unlike links that appear in the article content, these Navigation Box links are categorically associated with every relevant page. We therefore hypothesized that organic links might carry a larger amount of signal.

To extract organic links, we used mwparserfromhell to extract links from the 20141106 article XML dump for English Wikipedia. We made use of any wikilink that appeared directly in the text without transclusion.
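The idea of keeping only links written directly in the article text can be illustrated with the standard library alone. The sketch below is a simplified stand-in for the mwparserfromhell-based extraction described above, not a reproduction of it: it drops everything inside `{{...}}` template calls by tracking brace depth, then collects `[[target|label]]` links from what remains. The namespace prefixes it filters out are an assumed, incomplete list.

```python
import re

# Captures the link target, ignoring any #section and |label parts.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")
# Assumption: a short, incomplete list of non-article prefixes.
NON_ARTICLE_PREFIXES = ("file:", "image:", "category:")

def strip_templates(wikitext):
    """Remove {{...}} spans, including nested ones, by tracking brace depth."""
    kept = []
    depth = 0
    i = 0
    while i < len(wikitext):
        if wikitext.startswith("{{", i):
            depth += 1
            i += 2
        elif depth > 0 and wikitext.startswith("}}", i):
            depth -= 1
            i += 2
        else:
            if depth == 0:
                kept.append(wikitext[i])
            i += 1
    return "".join(kept)

def organic_link_targets(wikitext):
    """Return normalized targets of links that sit outside any template call."""
    targets = []
    for match in WIKILINK.finditer(strip_templates(wikitext)):
        title = match.group(1).strip().replace(" ", "_")
        if not title or title.lower().startswith(NON_ARTICLE_PREFIXES):
            continue
        targets.append(title[0].upper() + title[1:])
    return targets

text = "The [[cat|cats]] and [[Dog]] {{Navbox|list=[[Bird]]}} live in the [[solar system#Planets]]."
print(organic_link_targets(text))
```

A real wikitext parser handles many cases this sketch does not (tables, parser functions, links nested inside image captions), which is one reason to prefer a dedicated library for the actual extraction.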

Resolving redirects

Methodologically, it's important to consider work by Hill & Shaw[4] that showed the view rate of articles is substantially affected by redirects between wiki pages, so measurements of view rates will need to resolve these redirects in order to be robust.

In order to resolve redirects, we made use of the redirect table at the time that data was extracted. This will be inaccurate for pages whose redirect status changes often, but we suspect that to be very uncommon.
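Redirect resolution amounts to folding the count of each redirect page into its target. The sketch below assumes the redirect table has been loaded as a plain mapping from redirect title to target title and follows a single hop only. Both the data layout and the function name `resolve_redirects` are assumptions for illustration.

```python
def resolve_redirects(counts, redirects):
    """Fold per-title counts of redirect pages into their target pages.

    counts:    mapping of page title -> count (views or inlinks)
    redirects: mapping of redirect title -> target title (single hop)
    """
    resolved = {}
    for title, count in counts.items():
        target = redirects.get(title, title)
        resolved[target] = resolved.get(target, 0) + count
    return resolved

views = {"USA": 5, "United_States": 10, "Chemistry": 3}
redirects = {"USA": "United_States"}
print(resolve_redirects(views, redirects))
```

The same function applies to both measures, since view counts and inlink counts are each just a mapping from title to number.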

Results

Page view rate

View rate density. The density of log(views/month) is plotted for English Wikipedia articles by the avg. WikiProject importance classification. (for view rates > 0)

After log-transforming the view rate, we plot its distribution; the result is shown in the "View rate density" figure. There is quite a bit of overlap between the classes. At the same time, importance appears to be strongly associated with view rate: higher importance means a higher view rate. We investigate this relationship using a one-way ANOVA on the log-transformed data. Using Tukey's HSD, we get the following table of differences between classes:

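+------+------+------+------+---------+
|      | High | Mid  | Low  | Unknown |
+------+------+------+------+---------+
| Top  | 0.30 | 0.81 | 1.29 |    1.26 |
| High |    x | 0.51 | 0.99 |    0.97 |
| Mid  |      |    x | 0.48 |    0.45 |
| Low  |      |      |    x |    0.02 |
+------+------+------+------+---------+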

These differences are computed from log-transformed data, so they correspond to multipliers between importance classes. Assuming base-10 logarithms, the average view rate for Top-importance articles is roughly 6 times that of Mid-importance articles and nearly 20 times that of Low- and Unknown-importance articles. Similarly, High-importance articles are on average nearly ten times as popular as Low- and Unknown-importance articles. Also note that while the adjusted P-value for every pair is much less than 0.001, likely due to the size of the dataset, the small difference between Low-importance and Unknown-importance articles suggests that there is no meaningful difference between those two classes.
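The conversion from a difference of log means to a multiplier can be checked directly. The sketch below assumes the logarithm is base 10, which is consistent with reading a difference of 0.99 as "nearly ten times" but is not stated explicitly on this page. The values are copied from the table above.

```python
def log10_diff_to_multiplier(diff):
    """Convert a difference of log10 means into a ratio between the two groups."""
    return 10 ** diff

# Pairwise differences copied from the Tukey's HSD table.
tukey_diffs = {
    ("Top", "High"): 0.30, ("Top", "Mid"): 0.81,
    ("Top", "Low"): 1.29, ("Top", "Unknown"): 1.26,
    ("High", "Mid"): 0.51, ("High", "Low"): 0.99,
    ("High", "Unknown"): 0.97, ("Mid", "Low"): 0.48,
    ("Mid", "Unknown"): 0.45, ("Low", "Unknown"): 0.02,
}

multipliers = {
    pair: round(log10_diff_to_multiplier(diff), 1)
    for pair, diff in tukey_diffs.items()
}

for (higher, lower), ratio in multipliers.items():
    print(f"{higher} vs {lower}: about {ratio}x")
```

Under this assumption, the Low versus Unknown difference of 0.02 corresponds to a ratio very close to 1, which matches the observation that the two classes are practically indistinguishable.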

Inlink count

Inlink density. The density of log(# of inlinks) is plotted for English Wikipedia articles by the avg. WikiProject importance classification. (for inlinks > 0)
Organic inlink density. The density of log(# of organic inlinks) is plotted for English Wikipedia articles by the avg. WikiProject importance classification. (for inlinks > 0)

References

  1. Priedhorsky, R., Chen, J., Lam, S. T. K., Panciera, K., Terveen, L., & Riedl, J. (2007, November). Creating, destroying, and restoring value in Wikipedia. In Proceedings of the 2007 international ACM conference on Supporting group work (pp. 259-268). ACM.
  2. Samoilenko, A., & Yasseri, T. (2013). The distorted mirror of Wikipedia: a quantitative analysis of Wikipedia coverage of academics. arXiv preprint arXiv:1310.8508.
  3. Lehmann, J., Müller-Birn, C., Laniado, D., Lalmas, M., & Kaltenbrunner, A. (2014, September). Reader preferences and behavior on Wikipedia. In Proceedings of the 25th ACM conference on Hypertext and social media (pp. 88-97). ACM.
  4. Hill, B. M., & Shaw, A. (2014, August). Consider the Redirect: A Missing Dimension of Wikipedia Research. In Proceedings of The International Symposium on Open Collaboration (p. 28). ACM.