Research talk:Automated classification of article importance/Work log/2017-03-07

Tuesday, March 7, 2017

Today I will continue the data analysis that I was unable to complete yesterday. For future reference, doing string manipulation in R is not a great idea. I'll write some Python to munge my data and aim to tackle the remaining five research questions:

  1. How is the number of ratings distributed?
  2. How many ratings are unanimous?
  3. How many are rated by more than one project and unanimously rated?
  4. What is the overlap between ratings?
  5. How many have more than two ratings?

RQ4: How is the number of ratings distributed?

I decided to insert a new RQ4 as I was interested in understanding what the rating distribution looks like. Unsurprisingly, most articles have only a few ratings, as we can see in the histogram below. The quantiles tell the same story: the median is 2, the 85th percentile is 3, and the 95th percentile is 5.

Histogram of importance ratings

The article with the highest number of ratings is African, Caribbean and Pacific Group of States with 86. Second is Women in Europe with 55.
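As a rough illustration of how such quantiles can be read off a list of per-article rating counts, here is a minimal Python sketch. The data is a made-up toy list, not our dataset, and the nearest-rank definition used here is an assumption; R's `quantile()` interpolates by default, so its results can differ slightly.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical toy data: number of ratings per article.
ratings_per_article = [1, 1, 2, 2, 2, 3, 3, 4, 5, 86]

# Median, 85th and 95th percentile, as reported in the text above.
summary = {p: percentile(ratings_per_article, p) for p in (50, 85, 95)}
```

A single heavily rated article (like the 86 in the toy list) barely moves the median, which is why the quantiles are a better summary than the maximum here.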

We also want to know how many articles have a given rating (Top, High, Mid, Low). Counting the total occurrences of those gives the following table:

Rating | N ratings
Top    | 53,104
High   | 210,665
Mid    | 913,394
Low    | 4,929,624

Because we are here counting each occurrence of a rating, the total number is much larger than the number of articles in our dataset, since as we have already seen, articles frequently have multiple ratings.
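To make the distinction between rating occurrences and articles concrete, here is a small Python sketch. The three articles and their ratings are invented for illustration only.

```python
from collections import Counter

# Hypothetical toy input: article title -> ratings, one per WikiProject.
article_ratings = {
    "A": ["Top", "High"],
    "B": ["Low"],
    "C": ["Low", "Low", "Mid"],
}

# Count every occurrence of a rating, regardless of which article it belongs to.
occurrences = Counter(r for ratings in article_ratings.values() for r in ratings)

total_occurrences = sum(occurrences.values())  # one per (article, project) rating
n_articles = len(article_ratings)              # one per article
```

With three articles and six ratings, the occurrence total is twice the article count, which is the same effect we see in the table above at a much larger scale.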

RQ5: How many ratings are unanimous?

This research question concerns articles that are rated by at least one WikiProject and where all ratings agree. An article rated by a single project therefore counts as unanimous. Note that some articles are categorized as having unknown importance, or their importance is "not available" (NA). We remove these articles from our dataset, so that we only consider articles where the categorization shows the WikiProjects in full agreement about the rating.

n_unanimous_1 = data.table(
  rating=c('Top', 'High', 'Mid', 'Low'),
  n_unanimous=c(
    length(only_articles[n_top > 0 & n_high == 0 & n_mid == 0 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
    length(only_articles[n_top == 0 & n_high > 0 & n_mid == 0 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
    length(only_articles[n_top == 0 & n_high == 0 & n_mid > 0 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
    length(only_articles[n_top == 0 & n_high == 0 & n_mid == 0 & n_low > 0 & n_unknown == 0 & n_na == 0]$talk_page_id)
  )
);
n_unanimous_1$rating = ordered(n_unanimous_1$rating,
                               c('Top', 'High', 'Mid', 'Low'));
> n_unanimous_1
   rating n_unanimous
1:    Top        7991
2:   High       42572
3:    Mid      231329
4:    Low     1795436

Rating | N articles
Top    | 7,991
High   | 42,572
Mid    | 231,329
Low    | 1,795,436
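The same filter can be written as a small Python function, which may be easier to read than the four nearly identical R conditions. This is only a sketch: the dictionary layout and the `min_projects` parameter are assumptions made for illustration (setting it to 2 corresponds to the stricter condition used in RQ6).

```python
RATINGS = ("Top", "High", "Mid", "Low")

def unanimous_rating(counts, min_projects=1):
    """Return the single rating all projects agree on, or None.

    counts maps a rating name (plus 'Unknown' and 'NA') to the number of
    projects that gave it. Articles with any Unknown/NA rating are discarded.
    """
    if counts.get("Unknown", 0) or counts.get("NA", 0):
        return None
    present = [r for r in RATINGS if counts.get(r, 0) > 0]
    if len(present) != 1:
        return None  # either unrated or rated in more than one class
    rating = present[0]
    return rating if counts[rating] >= min_projects else None
```

For example, `unanimous_rating({"Top": 1})` gives `"Top"`, while an article with both a Top and a Low rating, or with an NA rating alongside others, gives `None`.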

RQ6: How many are rated by more than one project and unanimously rated?

This RQ only concerns articles that are rated by at least two WikiProjects, all of which agree on the rating. As for RQ5, we remove articles with "unknown" or "NA" importance.

n_unanimous = data.table(
  rating=c('Top', 'High', 'Mid', 'Low'),
  n_unanimous=c(
    length(only_articles[n_top > 1 & n_high == 0 & n_mid == 0 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
    length(only_articles[n_top == 0 & n_high > 1 & n_mid == 0 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
    length(only_articles[n_top == 0 & n_high == 0 & n_mid > 1 & n_low == 0 & n_unknown == 0 & n_na == 0]$talk_page_id),
    length(only_articles[n_top == 0 & n_high == 0 & n_mid == 0 & n_low > 1 & n_unknown == 0 & n_na == 0]$talk_page_id)
  )
);
n_unanimous$rating = ordered(n_unanimous$rating,
                             c('Top', 'High', 'Mid', 'Low'));
> n_unanimous
   rating n_unanimous
1:    Top        1900
2:   High        9905
3:    Mid       65365
4:    Low      855760
Barchart of unanimous importance ratings

Rating | N articles
Top    | 1,900
High   | 9,905
Mid    | 65,365
Low    | 855,760

From the table it is evident that removing articles rated by a single project drastically lowers the number of articles. For example, over 6,000 articles (RQ5: 7,991, RQ6: 1,900) have a Top-importance rating, but only from a single project. Similarly, we remove about 32,700 High-importance articles (RQ5: 42,572, RQ6: 9,905). We see this as a clear indication that WikiProject assessments are local to a given project, which in turn suggests that using them as global indicators of importance without any kind of filtering is problematic. Counting only unanimous ratings from multiple projects reduces the number of articles, and it might be a useful way of identifying importance at a larger scale (e.g. the edition as a whole).
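The drop per rating can be worked out directly from the RQ5 and RQ6 tables. Here is that arithmetic as a short Python sketch; the variable names are mine, the numbers are the ones reported above.

```python
# Unanimous counts from the RQ5 table (at least one project) and
# the RQ6 table (at least two projects).
rq5 = {"Top": 7991, "High": 42572, "Mid": 231329, "Low": 1795436}
rq6 = {"Top": 1900, "High": 9905, "Mid": 65365, "Low": 855760}

# Articles rated by exactly one project: counted in RQ5 but not in RQ6.
single_project = {r: rq5[r] - rq6[r] for r in rq5}

# Fraction of the RQ5 articles that survive the multi-project filter.
share_kept = {r: rq6[r] / rq5[r] for r in rq5}
```

The differences are 6,091 (Top), 32,667 (High), 165,964 (Mid), and 939,676 (Low). Roughly a quarter of the Top- and High-importance articles survive the filter, compared to almost half of the Low-importance ones.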

RQ7: What is the overlap between ratings?

We investigate this by first creating a confusion matrix of the counts of pairs of ratings, and then creating a confusion matrix with triplets (the question of how many articles span all ratings will be answered in RQ8).

     | High   | Mid    | Low
Top  | 5,347  | 4,224  | 2,402
High |        | 23,904 | 21,626
Mid  |        |        | 177,397

           | Mid    | Low
Top + High | 2,698  | 1,083
High + Mid |        | 12,334

We find that pairs of ratings are not uncommon among articles that have High- or Mid-importance as their highest rating. 45,530 articles (1.37% of our entire dataset) are rated High-importance as well as one of the lower ratings, while 177,397 articles (5.3%) are rated both Mid- and Low-importance. Top-importance articles more rarely carry one of the other ratings: 11,973 articles in total (0.36%).
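One way to tally such pairs and triplets is to reduce each article to the set of distinct ratings it carries and count the combinations. The Python sketch below does this on a handful of invented articles; the dictionary layout and the function name are assumptions for illustration, not the script used for the tables above.

```python
from collections import Counter

RATINGS = ("Top", "High", "Mid", "Low")

def rating_combination(counts):
    """Distinct ratings of one article, ordered Top to Low; None if any rating is Unknown or NA."""
    if counts.get("Unknown", 0) or counts.get("NA", 0):
        return None
    return tuple(r for r in RATINGS if counts.get(r, 0) > 0)

# Hypothetical toy articles: rating -> number of projects giving that rating.
articles = [
    {"Top": 1, "High": 2},
    {"Mid": 1, "Low": 1},
    {"Mid": 3, "Low": 1},
    {"Top": 1, "High": 1, "Low": 1},
    {"Low": 2},             # unanimous, so not an overlap
    {"High": 1, "NA": 1},   # discarded because of the NA rating
]

combos = [rating_combination(c) for c in articles]
# Keep only articles that span at least two distinct ratings.
tally = Counter(c for c in combos if c is not None and len(c) >= 2)

# Number of articles by how many distinct ratings they span.
by_size = Counter()
for combo, n in tally.items():
    by_size[len(combo)] += n
```

Each key with two ratings corresponds to a cell in the pair matrix, and each key with three ratings to a cell in the triplet matrix.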

RQ8: How many articles have two or more ratings?

I rephrased this question slightly so we measure the number of articles with two, three, and four ratings. Again we discard articles with "unknown" or "NA" ratings. To find the number of pairs and triplets, we sum the numbers from the RQ7 tables. Then we grab the articles with ratings across the board from the dataset:

> n_pairs = 5347+4224+2402+23904+21626+177397;
> n_triplets = 2698+1083+12334;
> n_quads = length(only_articles[n_top > 0 & n_high > 0 & n_mid > 0 & n_low > 0 & n_unknown == 0 & n_na == 0]$talk_page_id);
> n_pairs
[1] 234900
> n_triplets
[1] 16115
> n_quads
[1] 1127

So, 234,900 articles (7.07%) have two (and only two) ratings, 16,115 (0.48%) have three, and 1,127 articles (0.03%) span all ratings. Some examples of articles spanning all four ratings are First Persian invasion of Greece and Second Persian invasion of Greece, which are both in Category:Low-importance Featured topics articles due to their promotion to being part of a featured portal. Perhaps more interesting is the fact that several US presidents show up on this list, for example Franklin Pierce, Ulysses S. Grant, Grover Cleveland, Woodrow Wilson, Dwight D. Eisenhower, Herbert Hoover, and Jimmy Carter. Taking Carter as an example, we find that his article is rated Top-importance because he was a US president, but at the same time rated Low-importance by WikiProject US governors and WikiProject US State Legislatures. As we saw for RQ6, this indicates some of the difficulty of using these importance ratings without further analysis.

Sample of articles with unanimous ratings

Based on the results from RQ6, we're interested in understanding more about the articles that have unanimous ratings from at least two WikiProjects. We therefore randomly sample a dozen articles from each of the four rating categories. Here is the sample we used:

Top | High | Mid | Low
Albania–Kosovo relations | Graphyne | Désiré Munyaneza | Tha Hall of Game
Small business | Utamaro | Disseminated superficial actinic porokeratosis | Stephanie Sheh
Massina Empire | Gopinath | Poland at the 1996 Summer Olympics | Department of State Development
Women in the Middle Ages | FedEx | Sun Jianguo | Ridhima Ghosh
Prophecy | Competitor analysis | Orleans Canal | Australian Natives' Association
Cinema of Algeria | The Mind Is a Terrible Thing to Taste | Elliptic curve point multiplication | Badminton railway station
United Kingdom | Rostec | Isilkulsky District | Indefinite detention without trial
Sejm | Distinction | Parque de la Costa | Pyeonyuk
Jean Metzinger | Protein family | Ataxia | Andrei Ivanovich Gorchakov
Political status of Kosovo | Indigenous peoples of the Philippines | AH82 | Uroballus henicurus
Gaborone | Origin of replication | Port of Jiaxing | Sabine
Fiduciary | Resident Evil | Owensboro Community and Technical College | Singikat
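For reference, drawing a dozen articles per rating can be done reproducibly with a seeded random generator. The following Python sketch shows one way to do it; the seed, the function name, and the placeholder titles are all hypothetical, and this is not necessarily how the sample above was drawn.

```python
import random

def sample_per_rating(articles_by_rating, k=12, seed=2017):
    """Draw up to k titles per rating without replacement, reproducibly."""
    rng = random.Random(seed)
    sample = {}
    for rating in sorted(articles_by_rating):
        # Sort first so that the seed alone determines the draw.
        titles = sorted(articles_by_rating[rating])
        sample[rating] = rng.sample(titles, min(k, len(titles)))
    return sample

# Hypothetical placeholder titles; the real input would be the RQ6 article lists.
toy = {
    rating: {f"{rating} article {i}" for i in range(n)}
    for rating, n in (("Top", 30), ("High", 50), ("Mid", 8), ("Low", 100))
}
picked = sample_per_rating(toy)
```

Fixing the seed means the same sample comes back on every run, which is useful when the underlying counts are recomputed and the sample needs to be checked again.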

Note that Rostec, Isilkulsky District, and Andrei Ivanovich Gorchakov reveal an issue with the category structure. Those articles are only rated by a single WikiProject (Russia), but because that project's category system also puts them in quality-based categories that are picked up by our schema, we counted each of them as being rated twice. I went back and updated the Python script we use for counting the various ratings, adding a check that detects these types of quality-based categories and removes them. All statistics up to this point have been updated to reflect the new data, and these three Russia-related articles are no longer incorrectly counted (each has only a single project rating it).
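A check of this kind could look roughly like the Python sketch below. It is not the actual script: the category names are made up, and the rule that a quality-based category can be recognised by a "-Class" token in its name is an assumption for the purpose of illustration.

```python
import re

# Assumption: quality-based categories carry a "<grade>-Class" token in their name.
QUALITY_TOKEN = re.compile(r"\b(?:FA|FL|A|GA|B|C|Start|Stub|List)-Class\b")
IMPORTANCE = re.compile(r"\b(Top|High|Mid|Low)-importance\b")

def importance_ratings(categories):
    """Return the importance ratings found in category names, skipping quality-based ones."""
    ratings = []
    for name in categories:
        if QUALITY_TOKEN.search(name):
            continue  # combined quality/importance category: would double-count the project
        match = IMPORTANCE.search(name)
        if match:
            ratings.append(match.group(1))
    return ratings

# Made-up category names for illustration only.
cats = [
    "Mid-importance Russia articles",
    "Start-Class Russia articles of Mid-importance",
]
found = importance_ratings(cats)
```

Without the quality check, both categories would match the importance pattern and the article would appear to have two Mid-importance ratings; with it, only one rating is counted.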