Research:Automated classification of article importance/Wikidata side chain

From Meta, a Wikimedia project coordination wiki

During this research project, we found that some WikiProjects have categories of articles that all share a specific importance rating. For example, all NFL seasons are High-importance in WikiProject National Football League, and all individuals are Low-importance in WikiProject Medicine. One approach to identifying these types of categories is to use Wikidata, where, for example, individuals are identified through their "instance of" relationship with the "human" entity. Another approach would be to use Wikipedia's category system. We chose the Wikidata approach for two reasons: first, it is global, which allows the approach to work across Wikipedia language editions; second, it has subclass/superclass relationships. The latter in particular might not be consistent in Wikipedia's category system.

Once Wikidata was chosen, the next question was how many different types of articles this might apply to. To find out, we decided to build a network of relationships between entities on Wikidata. The code is in graphbuilder.py in our GitHub repository. The starting points of the network are all the entities connected to the articles in the WikiProject. The Wikidata identifier is used as the identifier for all nodes, and the starting nodes are also described by their rating and article title. From these starting nodes, we follow any link of type "instance of", "subclass of", or "part of". Of these three, the "instance of" relationship (e.g. "Douglas Adams is an instance of human") turns out to be the most useful. After this first step, we only follow "subclass of" relationships (e.g. "scientific journal is a subclass of scientific publication"). This is done to make the network smaller and more coherent: if we used the same relationships as in the initial step, the network would consist of many different types of instances, and their relations to other classes are typically not helpful. Note that each node is visited only once, and the search terminates once it no longer discovers any new nodes.
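
The traversal described above can be sketched as a breadth-first search. This is a minimal illustration and not the code in graphbuilder.py: the function name `build_network`, the in-memory `claims` dictionary (a stand-in for Wikidata lookups), and the toy identifiers `Q_person` and `Q_humanity` are all assumptions made for the example. The property IDs P31 ("instance of"), P279 ("subclass of"), and P361 ("part of") are the ones Wikidata uses for these relationships.

```python
from collections import deque

# Wikidata property IDs: P31 = instance of, P279 = subclass of, P361 = part of.
FIRST_STEP = ("P31", "P279", "P361")   # followed from the starting nodes
LATER_STEPS = ("P279",)                # followed from every other node


def build_network(start_nodes, claims):
    """Expand the network breadth-first from the WikiProject's entities.

    start_nodes: dict mapping QID -> {"title": ..., "rating": ...}
    claims: dict mapping QID -> {property ID: [target QIDs]}; a toy
            stand-in for Wikidata lookups (an assumption of this sketch)
    Returns (nodes, edges); edges is a list of (source, property, target).
    """
    nodes = {qid: dict(attrs) for qid, attrs in start_nodes.items()}
    edges = []
    visited = set(start_nodes)
    queue = deque(start_nodes)
    while queue:  # stops once no new nodes are discovered
        qid = queue.popleft()
        props = FIRST_STEP if qid in start_nodes else LATER_STEPS
        for prop in props:
            for target in claims.get(qid, {}).get(prop, []):
                edges.append((qid, prop, target))
                if target not in visited:  # each node is queued only once
                    visited.add(target)
                    nodes.setdefault(target, {})
                    queue.append(target)
    return nodes, edges


# Toy data: Q42 (Douglas Adams) and Q5 (human) are real identifiers,
# the remaining ones are hypothetical placeholders.
start = {"Q42": {"title": "Douglas Adams", "rating": "Low"}}
claims = {
    "Q42": {"P31": ["Q5"]},
    "Q5": {"P279": ["Q_person"], "P361": ["Q_humanity"]},
    "Q_person": {"P279": ["Q5"]},  # a cycle, to show the search still ends
}
nodes, edges = build_network(start, claims)
```

In this toy run, `Q_humanity` is never reached, because "part of" links are only followed from the starting nodes, and the cycle between `Q5` and `Q_person` does not cause repeated visits.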

The main challenge was identifying the various categories of entities in WikiProject Medicine that should be labelled Low-importance. To facilitate this, we wrote a small script that searches the network and identifies parent nodes where the majority of the children are rated Low-importance (see find-majority-low-nodes.py in the GitHub repository). We restricted it to parent nodes with at least 3 children, because with fewer children it is difficult to determine what a "majority" means. The script outputs a dataset of parent nodes based on the proportion of Low-importance nodes, which made it easy to identify key entities (e.g. 3,809 humans) and key relationships (e.g. that "instance of" is the most useful starting relationship). From this we then identified about 80 specific categories that have been encoded and can be used to automatically classify over 6,600 articles in WikiProject Medicine (we developed sidechain.py as a general library, and use process-sidechain.py to identify the articles).
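
The search for majority-Low parent nodes might look roughly like the sketch below. It is not the code in find-majority-low-nodes.py: the function name `majority_low_parents`, the edge and rating formats, and the toy data are assumptions, as are the choices to count only rated children and to read "majority" as strictly more than half.

```python
from collections import defaultdict


def majority_low_parents(edges, ratings, min_children=3):
    """Find parent nodes whose rated children are mostly Low-importance.

    edges: iterable of (child, parent) pairs from the network
    ratings: dict mapping child node -> importance rating
    Returns rows of (parent, n_children, n_low, proportion_low), sorted
    by descending proportion of Low-importance children.
    """
    children = defaultdict(set)
    for child, parent in edges:
        if child in ratings:  # assumption: only rated nodes are counted
            children[parent].add(child)

    rows = []
    for parent, kids in children.items():
        if len(kids) < min_children:
            continue  # too few children for "majority" to be meaningful
        n_low = sum(1 for kid in kids if ratings[kid] == "Low")
        proportion = n_low / len(kids)
        if proportion > 0.5:  # assumption: strictly more than half
            rows.append((parent, len(kids), n_low, proportion))

    rows.sort(key=lambda row: (-row[3], -row[1], row[0]))
    return rows


# Hypothetical toy data, not taken from WikiProject Medicine.
ratings = {"A": "Low", "B": "Low", "C": "Mid", "D": "High", "E": "Low", "F": "Top"}
edges = [
    ("A", "human"), ("B", "human"), ("C", "human"),        # 2 of 3 are Low
    ("D", "disease"), ("E", "disease"), ("F", "disease"),  # 1 of 3 is Low
    ("A", "X"), ("B", "X"),                                # only 2 children
]
result = majority_low_parents(edges, ratings)
```

Here only `human` is returned: `disease` has three children but just one rated Low, and `X` falls below the three-child threshold.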