Research talk:Automated classification of article importance/Work log/2017-04-25

Tuesday, April 25, 2017

Today I'll wrap up the WikiProject candidate selection by getting a list prepared for discussion. Then I'll move over to figuring out how to get view rate data.

Organic inlinks vs all inlinks

I decided to revisit my feasibility study to check whether our earlier conclusion still holds: that using the database to grab inlinks gives us a better signal than using only the links present in the wikitext (in which case links coming in through infoboxes, navboxes, etc. are ignored). Previously, I had found that using all links gave better results, but that might have changed now that we use rank percentiles for views and inlinks as predictors instead. To keep things simple, I used the exact same dataset as before and simply calculated the rank percentiles to use as predictors. Neither the SVM nor the GBM was practical here, because they were either slow to train (SVM) or used too much memory (GBM), so I went with a Random Forest of limited size instead. As before, I used 10-fold cross-validation to decide on forest size and terminal node size, choosing the combination with the highest overall accuracy.

Based on the training "out of bag" estimate, there is no difference in overall accuracy between the two approaches when predicting importance across all of English Wikipedia (using our feasibility study dataset). There are some changes for individual classes, which might mean that the two models make slightly different decisions for individual articles. Either way, once we have a way to compute organic inlink counts, we'll want to study this again.
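As a rough illustration of the rank percentile predictors mentioned above, here is a minimal Python sketch. The log does not say how the percentiles were actually computed or how ties were handled, so the definition used here (the share of articles with a value less than or equal to the given one) is an assumption, and the function name and counts are hypothetical.

```python
from bisect import bisect_right


def rank_percentiles(values):
    """For each value, return the percentage of values that are <= it.

    Ties share the same percentile. This tie handling is an assumption,
    not something stated in the work log.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [100.0 * bisect_right(ordered, v) / n for v in values]


# Hypothetical counts for five articles (not data from the study).
views = [10, 200, 35, 200, 1]
inlinks = [3, 50, 7, 12, 0]

# Each article gets two predictors on a common 0-100 scale.
predictors = list(zip(rank_percentiles(views), rank_percentiles(inlinks)))
```

Putting views and inlinks on the same percentile scale is what allows the same dataset to be reused with either inlink definition: only the raw counts fed into the function change.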

WikiProject candidates

List of candidate WikiProjects with at least 100 non-bot edits in their project space (e.g. "WikiProject China" and all its related talk pages and subpages) over the past 180 days, at least 1,000 edits to their articles over the past 180 days, and at least 25% of their articles rated "unknown" importance:

Project name No. of articles % unknown importance No. of non-bot edits
WikiProject Africa 80,937 43.2 7,324
WikiProject Albums 172,025 41.4 439
WikiProject Beauty Pageants 5,984 53.3 127
WikiProject Buddhism 4,689 53.6 137
WikiProject Chicago 43,604 50.3 127
WikiProject China 50,846 37.6 122
WikiProject Cycling 21,210 50.0 156
WikiProject Dungeons & Dragons 4,062 25.4 186
WikiProject Europe 5,091 48.1 587
WikiProject Historic sites 8,553 38.2 170
WikiProject Horror 12,102 42.7 122
WikiProject Iran 89,620 76.9 125
WikiProject Judaism 11,036 30.2 126
WikiProject Korea 21,635 39.1 151
WikiProject Malaysia 9,046 27.6 104
WikiProject Motorsport 9,615 31.4 119
WikiProject National Football League 27,766 68.8 717
WikiProject Olympics 108,593 43.9 114
WikiProject Pharmacology 10,957 40.7 269
WikiProject Politics 47,556 29.1 225
WikiProject Politics of the United Kingdom 37,165 52.2 131
WikiProject Rock music 15,346 33.7 155
WikiProject Rugby league 14,044 33.4 191
WikiProject Television 107,565 34.5 395
WikiProject Television Stations 9,060 36.8 253
WikiProject United Nations 5,122 64.2 226
WikiProject Yugoslavia 2,722 32.8 211
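
The selection criteria stated above the table can be sketched as a simple filter. This is only an illustration: the `ProjectStats` record, its field names, and the example projects are hypothetical, and the log does not describe how the edit counts were actually gathered.

```python
from dataclasses import dataclass


@dataclass
class ProjectStats:
    """Hypothetical per-project summary; field names are assumptions."""
    name: str
    pct_unknown: float        # % of articles rated "unknown" importance
    project_space_edits: int  # non-bot edits to project pages, past 180 days
    article_edits: int        # edits to the project's articles, past 180 days


def is_candidate(p, min_project_edits=100, min_article_edits=1000,
                 min_pct_unknown=25.0):
    """Apply the three thresholds from the work log to one project."""
    return (p.project_space_edits >= min_project_edits
            and p.article_edits >= min_article_edits
            and p.pct_unknown >= min_pct_unknown)


# Made-up projects, one failing each criterion and one passing all three.
projects = [
    ProjectStats("Example A", 40.0, 150, 5000),  # passes
    ProjectStats("Example B", 40.0, 90, 5000),   # too few project-space edits
    ProjectStats("Example C", 40.0, 150, 800),   # too few article edits
    ProjectStats("Example D", 10.0, 150, 5000),  # too few unknown ratings
]

candidates = [p.name for p in projects if is_candidate(p)]
```

Keeping the thresholds as keyword arguments makes it easy to see how the candidate list grows or shrinks if the cut-offs are revisited during discussion.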