Research talk:Automated classification of article importance/Work log/2017-03-30

From Meta, a Wikimedia project coordination wiki

Thursday, March 30, 2017

Today I'll continue working on the categorization of WPMED articles. First I'll continue my analysis of the most common instances, then I will start looking into whether we can somehow categorize those that are not instances of anything but have a Wikidata page.

WPMED categorization

Yesterday I gathered data on the "instance of" property of WPMED articles, and found that 54.4% of the WPMED dataset did not have this property. Digging around a bit, I found Wikidata's help page on basic membership properties, which explains the three key ones: instance of (P31), subclass of (P279), and part of (P361). I therefore rewrote my Python script so that it gathers data on all three.
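A minimal sketch of such a check, not the actual script. It assumes each item is available as a dict in Wikidata's entity JSON layout, where `claims` maps property IDs to lists of statements. The function names and the sample items are hypothetical.

```python
# Wikidata's three basic membership properties and their labels.
MEMBERSHIP_PROPERTIES = {
    "P31": "instance of",
    "P279": "subclass of",
    "P361": "part of",
}


def membership_properties(entity):
    """Return the membership property IDs that have at least one statement."""
    # "claims" can be missing or empty, so fall back to an empty dict.
    claims = entity.get("claims") or {}
    return [pid for pid in MEMBERSHIP_PROPERTIES if claims.get(pid)]


def share_without_membership(entities):
    """Return (count, fraction) of entities with none of the three properties."""
    missing = sum(1 for e in entities if not membership_properties(e))
    return missing, (missing / len(entities) if entities else 0.0)


# Hypothetical, heavily trimmed items; real statements carry far more detail.
sample_entities = [
    {"id": "Q1", "claims": {"P31": [{"rank": "normal"}]}},
    {"id": "Q2", "claims": {"P279": [{"rank": "normal"}], "P494": [{"rank": "normal"}]}},
    {"id": "Q3", "claims": {"P494": [{"rank": "normal"}]}},
    {"id": "Q4", "claims": {}},
]

missing, fraction = share_without_membership(sample_entities)
print(f"{missing} of {len(sample_entities)} items ({fraction:.1%}) have no membership property")
```

Checking all three properties in one pass means an item only counts as uncategorized when every one of them is absent.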

I find that 12,368 articles (42.1%) have none of these three properties set. While that is still a large number, it is about 3,000 fewer than if we just look at "instance of".

The next question is whether the remaining articles have other properties on Wikidata that can help us categorize them. I decided to sample 250 of them to see if I could find some patterns. I stopped after checking about 30, because the vast majority didn't contain any information apart from links to the Wikipedia articles.

I therefore wrote a Python script that would go through all the 12,368 articles and store only those that have at least one claim or property in Wikidata. There are 4,854 articles (39.2%) that do, leaving us with 7,514 articles for which we cannot learn anything from Wikidata. It might be that we can use Wikipedia's category structure for those, something I'll look into later.
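The filtering step could look roughly like the sketch below. It is an illustration under the same assumption as before (items as dicts in Wikidata's entity JSON layout, with `claims` mapping property IDs to statement lists). The helper names and sample items are made up for the example.

```python
def has_any_claim(entity):
    """True if the item has at least one property with at least one statement."""
    claims = entity.get("claims") or {}
    return any(claims.values())


def keep_items_with_claims(entities):
    """Keep only the items that carry some claim we might learn from."""
    return [e for e in entities if has_any_claim(e)]


# Hypothetical items: two with statements, two with nothing but sitelinks.
candidates = [
    {"id": "Q10", "claims": {"P494": [{"rank": "normal"}]}},
    {"id": "Q11", "claims": {}, "sitelinks": {"enwiki": {"title": "Example"}}},
    {"id": "Q12", "claims": {"P646": [{"rank": "normal"}], "P18": [{"rank": "normal"}]}},
    {"id": "Q13", "sitelinks": {"enwiki": {"title": "Another example"}}},
]

usable = keep_items_with_claims(candidates)
print(f"{len(usable)} of {len(candidates)} items have at least one claim")
```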

Across the 4,854 articles, 207 distinct claims/properties are used. I wrote a one-liner to get their Wikidata IDs, then used my previously written labelling script to get their labels, before finally writing a short Python script to count and sort the claim/property usage and write it out with the labels. There are 25 properties that are used more than 100 times:

ID     Label                             Number of uses
P646   Freebase ID                       2,489
P373   Commons category                  1,067
P3827  JSTOR topic ID                    928
P494   ICD-10                            722
P3417  Quora topic ID                    670
P1995  medical specialty                 645
P493   ICD-9                             631
P910   topic's main category             488
P557   DiseasesDB                        428
P17    country                           366
P673   eMedicine                         259
P492   OMIM ID                           227
P856   official website                  225
P1343  described by source               224
P508   BNCF Thesaurus                    208
P486   MeSH ID                           179
P1705  native label                      160
P604   MedlinePlus ID                    149
P571   inception                         147
P18    image                             140
P1402  Foundational Model of Anatomy ID  139
P227   GND ID                            128
P349   NDLAuth ID                        122
P1323  Terminologia Anatomica 98         116
P625   coordinate location               115
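The count-and-sort step that produces a table like the one above can be sketched with `collections.Counter`. This is an illustration rather than the script actually used. It assumes each property is counted once per item (not once per statement) and that labels come from a separate ID-to-label mapping. The sample items and the `labels` excerpt are hypothetical.

```python
from collections import Counter


def count_property_usage(entities):
    """Count, for each property ID, how many items use it at least once."""
    usage = Counter()
    for entity in entities:
        claims = entity.get("claims") or {}
        usage.update(pid for pid, statements in claims.items() if statements)
    return usage


def usage_table(usage, labels, more_than=0):
    """Return (ID, label, uses) rows sorted by descending use, above a threshold."""
    return [
        (pid, labels.get(pid, "(no label)"), uses)
        for pid, uses in usage.most_common()
        if uses > more_than
    ]


# Hypothetical items and a small excerpt of an ID-to-label mapping.
items = [
    {"id": "Q20", "claims": {"P646": [{}], "P494": [{}]}},
    {"id": "Q21", "claims": {"P646": [{}]}},
    {"id": "Q22", "claims": {"P646": [{}], "P494": [{}], "P18": [{}]}},
]
labels = {"P646": "Freebase ID", "P494": "ICD-10", "P18": "image"}

for pid, label, uses in usage_table(count_property_usage(items), labels, more_than=1):
    print(f"{pid}\t{label}\t{uses:,}")
```

With the real data, `more_than=100` would correspond to the cutoff used for the table.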

While some of these are not helpful (e.g. the Freebase ID property), others point to various indexes of medical information (e.g. P494, ICD-10, or P1995, medical specialty), suggesting that we can use some of them to identify things that are within the scope of WPMED.
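One way this idea might be tried out is sketched below. Which properties count as "medical" is an assumption made for illustration: the selection is drawn from the table above, not a vetted list. The function name and sample item are hypothetical.

```python
# Assumed selection of medically oriented properties from the usage table.
MEDICAL_INDEX_PROPERTIES = {
    "P494": "ICD-10",
    "P493": "ICD-9",
    "P1995": "medical specialty",
    "P557": "DiseasesDB",
    "P673": "eMedicine",
    "P492": "OMIM ID",
    "P486": "MeSH ID",
    "P604": "MedlinePlus ID",
}


def medical_index_hits(entity):
    """Return the sorted IDs of medical-index properties present on the item."""
    claims = entity.get("claims") or {}
    return sorted(pid for pid in MEDICAL_INDEX_PROPERTIES if claims.get(pid))


# Hypothetical item with two medical identifiers and one generic property.
item = {"id": "Q30", "claims": {"P494": [{}], "P1995": [{}], "P646": [{}]}}

hits = medical_index_hits(item)
print(item["id"], [MEDICAL_INDEX_PROPERTIES[pid] for pid in hits])
```

An item with one or more hits would be a candidate for being in scope. How many hits, and which properties, should be required is left open here.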