Research talk:Automated classification of article importance/Work log/2017-03-29

From Meta, a Wikimedia project coordination wiki

Wednesday, March 29, 2017

Today I plan to finish writing the script to process Wikidata information about Low-importance WPMED articles, and prep for tomorrow's Research Group meeting by wrapping up the documentation of the classifier results.

Categorizing WPMED articles

From our conversation with WPMED members we have learned that certain categories of articles default to Low-importance. It would therefore be useful if we could filter those out, either by giving them a specific label in the dataset, or by removing them from the dataset altogether. The question is whether we should use Wikipedia's own category structure for this, look into Wikidata, or perhaps look elsewhere. Wikipedia's category structure is rather messy, and it is not straightforward to move from a specific concept to a general one (e.g. how do you go from Alexander Fleming's category "People from East Ayrshire" to "People"?), so we first seek more general strategies.

Wikidata

Many of the articles in WPMED are likely to have Wikidata items, which in turn might have properties that we can exploit. We will therefore start there.

I wrote a Python script that grabs the Wikidata identifier for every article in a given category, and then checks whether the Wikidata item has the "instance of" property (P31 to be exact). I then used this script to process articles from all four WPMED importance categories, and can start generating some statistics.
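The script itself is not reproduced in this log, so the following is only a minimal sketch of how such a lookup could work, not the code that was actually run. It assumes the standard MediaWiki API (`generator=categorymembers` with `prop=pageprops`) and the Wikidata `wbgetentities` module; the function names, the User-Agent string, and the batch size are all hypothetical. Note also that assessment categories on English Wikipedia usually contain talk pages, so titles may need mapping to the subject page first; that step is omitted here.

```python
import json
import urllib.parse
import urllib.request

ENWIKI_API = "https://en.wikipedia.org/w/api.php"
WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def api_get(url, params):
    """Issue a GET request to a MediaWiki API endpoint and decode the JSON reply."""
    query = urllib.parse.urlencode(dict(params, format="json"))
    request = urllib.request.Request(
        url + "?" + query,
        # Hypothetical User-Agent; replace with one identifying your own tool.
        headers={"User-Agent": "wpmed-importance-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def fetch_category(category, limit=50):
    """Fetch one batch of category members together with their page props."""
    return api_get(ENWIKI_API, {
        "action": "query",
        "generator": "categorymembers",
        "gcmtitle": category,
        "gcmlimit": limit,
        "prop": "pageprops",
        "ppprop": "wikibase_item",
    })


def wikidata_ids(category_response):
    """Map page title -> Wikidata ID, or None when the page has no item."""
    pages = category_response.get("query", {}).get("pages", {})
    return {
        page["title"]: page.get("pageprops", {}).get("wikibase_item")
        for page in pages.values()
    }


def fetch_entities(qids):
    """Fetch the claims of a batch of Wikidata items (assumed at most 50 per call)."""
    reply = api_get(WIKIDATA_API, {
        "action": "wbgetentities",
        "ids": "|".join(qids),
        "props": "claims",
    })
    return reply.get("entities", {})


def instance_of(entity):
    """Return the Q-IDs listed under P31 ("instance of") for one entity."""
    qids = []
    for claim in entity.get("claims", {}).get("P31", []):
        snak = claim.get("mainsnak", {})
        # Skip "novalue"/"somevalue" snaks, which carry no datavalue.
        if snak.get("snaktype") == "value":
            qids.append(snak["datavalue"]["value"]["id"])
    return qids
```

An article whose title maps to `None` falls into the "no Wikidata item" group, and one whose `instance_of` result is empty falls into the "no instance of" group. Continuation across result batches is left out of this sketch.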

Articles without a Wikidata item

There are only 69 articles without a Wikidata item, or 0.23% of all articles in my dataset.

Articles without "instance of"

There are 15,959 articles that do not have the "instance of" property, accounting for 54.4% of our dataset. Since most WPMED articles are Low-importance, this could mean that "instance of" on its own will not solve our problem of categorizing Low-importance articles.

Most frequent instances

We split this up by rating, looking first at Top-importance articles. There are 15 different instances, of which only two (13.3%) occur more than once: 54 articles are an instance of "disease", while two articles are an instance of "physiological condition".

For High-importance articles there are 105 different instances in use, but only 28 (26.7%) of them occur more than once. There is a clear jump in the data: the sixth most frequent instance is used 20 times, while the seventh is used only 8 times. Again "disease" is the most frequent one, way ahead of everything else. In descending order the most frequent ones are: disease (240), chemical compound (47), pharmaceutical drug (45), medical specialty (23), Wikimedia list article (20), taxon (20).

For Mid-importance articles there are 321 different instances in use, of which 104 (32.4%) are used only once. We again see a sharp increase in usage for the top 15, with a particular jump for the top five (the sixth is half as common as the fifth). The top five are: disease (1,676), chemical compound (491), pharmaceutical drug (490), Wikimedia list article (180), and taxon (108).

There are 664 different instances in use for Low-importance articles, of which 390 (58.7%) are used once. We can again see a break in the data for the more frequently used instances. The top 10 are (in descending order): human (3,789), disease (1,376), organization (672), chemical compound (523), scientific journal (411), business enterprise (257), taxon (180), Wikimedia list article (164), medical school (132), pharmaceutical drug (118).
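The per-rating figures above (number of distinct instances, how many are used only once, and the most frequent ones) can be tallied with a simple counter. The sketch below is an illustration under assumptions, not the analysis code used for this log: `summarize_instances` is a hypothetical helper, and its input is assumed to be one list of "instance of" labels per article.

```python
from collections import Counter


def summarize_instances(instance_lists, top_n=10):
    """Summarize "instance of" usage for the articles of one importance rating.

    instance_lists holds one list of labels per article; an article
    without any "instance of" claim contributes an empty list.
    """
    counts = Counter(label for labels in instance_lists for label in labels)
    distinct = len(counts)
    used_once = sum(1 for n in counts.values() if n == 1)
    return {
        "distinct": distinct,
        "used_once": used_once,
        "used_once_pct": round(100 * used_once / distinct, 1) if distinct else 0.0,
        "top": counts.most_common(top_n),
    }


# Toy input for illustration only; these are not the WPMED data.
toy = [["disease"], ["disease"], ["disease", "taxon"], ["human"], []]
summary = summarize_instances(toy, top_n=1)
```

Running the helper once per importance rating would give the kind of breakdown reported in this section.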

I'll investigate the less frequently used ones as well, but based on this it seems that we can correctly label at least 5,441 articles as Low-importance. That's 18.56% of the entire WPMED dataset we've used previously. Sounds like a good start!
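The log does not state which instances make up the 5,441 articles. As a purely arithmetic observation, and an inference rather than something stated above, one combination of the Low-importance counts that sums to exactly that figure is shown below; the choice of instances in `candidate` is an assumption.

```python
# Counts for Low-importance articles, copied from the top-10 list above.
low_counts = {
    "human": 3789,
    "disease": 1376,
    "organization": 672,
    "chemical compound": 523,
    "scientific journal": 411,
    "business enterprise": 257,
    "taxon": 180,
    "Wikimedia list article": 164,
    "medical school": 132,
    "pharmaceutical drug": 118,
}

# Assumed selection: the instances that do not also dominate the
# higher-importance ratings, plus taxon. This grouping is a guess
# that happens to match the reported total.
candidate = [
    "human",
    "organization",
    "scientific journal",
    "business enterprise",
    "taxon",
    "medical school",
]

total = sum(low_counts[name] for name in candidate)
print(total)
```

Whether these six instances are the ones actually intended would need to be confirmed against the original analysis.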