Research talk:Automated classification of article importance/Work log/2017-03-20

From Meta, a Wikimedia project coordination wiki

Monday, March 20, 2017

Today I will work on the WPMED classifier, looking at the articles the new & improved classifier does not predict correctly. Once that is ready I want to sketch out a post to WPMED so we can start chatting about what importance means. Lastly, I'll go through our list of sources of signal and start making some decisions about where to move next.

WPMED prediction errors

As we did last week, I'll generate lists of what are perhaps the most interesting articles.

Top-importance predicted Low-importance: no Top-importance articles were predicted to be Low-importance.
Top-importance predicted Mid-importance
High-importance predicted Low-importance
Mid-importance predicted Top-importance
Low-importance predicted High-importance: there are 1,002 Low-importance articles predicted as High-importance, which is too many to list, so this list is a sample of 50 of them.
Low-importance predicted Top-importance
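Lists like these can be built by grouping the misclassified articles on their (actual rating, predicted rating) pair and sampling any group that is too long to show in full. The following is a minimal sketch of that idea, not the code used for this log: the function name `mismatch_lists`, the row keys `title`, `rating` and `predicted`, and the fixed random seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def mismatch_lists(rows, sample_size=50, seed=42):
    """Group misclassified articles by (actual rating, predicted rating).

    `rows` is an iterable of dicts with the hypothetical keys
    "title", "rating" and "predicted". Groups with more than
    `sample_size` titles are cut down to a random sample of that size.
    Returns (lists, counts): the titles to show and the full group sizes.
    """
    groups = defaultdict(list)
    for row in rows:
        if row["rating"] != row["predicted"]:
            groups[(row["rating"], row["predicted"])].append(row["title"])

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    lists, counts = {}, {}
    for key, titles in groups.items():
        counts[key] = len(titles)
        if len(titles) > sample_size:
            titles = rng.sample(titles, sample_size)
        lists[key] = sorted(titles)
    return lists, counts
```

Calling `mismatch_lists(rows)` on the full dataset would return one list per mismatch pair, with the count kept alongside so that a sampled list can still report how many articles it was drawn from.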

WPMED disambiguation pages

How many of the pages in my dataset are actually disambiguation pages? I need to go figure that out!

I ran this SQL query on Quarry to get a TSV of all disambiguation pages in WikiProject Medicine. There are 108 of them in total. Out of these, 18 have a different prediction from their actual WPMED importance rating. All of them are predicted to be Low-importance, 17 are rated Mid-importance, and one (Drug use) is rated High-importance. Only the last article shows up in our lists.
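The comparison between the Quarry export and the predictions can be sketched roughly as follows. This is an illustration under assumptions, not the script used here: it assumes the TSV has a header row with a column named `page_title`, and that the dataset is available as a mapping from title to a (rating, predicted) pair. Both helper names are hypothetical.

```python
import csv
import io
from collections import Counter


def read_titles(tsv_text, column="page_title"):
    """Read one column of page titles from TSV text with a header row.

    The column name is an assumption about the export format.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[column] for row in reader}


def disambiguation_mismatches(disambig_titles, articles):
    """Find rated disambiguation pages whose prediction differs.

    `articles` maps title -> (rating, predicted). Returns the
    mismatched pages and a tally of (rating, predicted) pairs.
    """
    mismatched = {
        title: articles[title]
        for title in disambig_titles
        if title in articles and articles[title][0] != articles[title][1]
    }
    tally = Counter(mismatched.values())
    return mismatched, tally
```

On the real data, the tally would be expected to show the 17 Mid-importance and one High-importance pages that were predicted Low-importance.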

Inspecting the data, I find that all of them have a low number of inlinks and a reasonably low number of views, and often none of the inlinks come from WPMED. I suspect the latter is because WPMED generally cleans up their articles and makes sure they do not link to disambiguation pages. In other words, a rating of Low-importance seems reasonable (although we might discuss why 18 of these appear to have importance ratings?)
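An inspection like this can be backed by a quick summary of the three signals. The sketch below is only an illustration: the function name `summarize_signals` and the keys `inlinks`, `views` and `wpmed_inlinks` are hypothetical and do not refer to the actual column names in the dataset.

```python
from statistics import median


def summarize_signals(pages):
    """Summarize inlink and view signals for a list of pages.

    Each page is a dict with the hypothetical keys "inlinks",
    "views" and "wpmed_inlinks" (inlinks coming from WPMED articles).
    """
    if not pages:
        raise ValueError("need at least one page")
    return {
        "median_inlinks": median(p["inlinks"] for p in pages),
        "median_views": median(p["views"] for p in pages),
        # Number of pages that no WPMED article links to.
        "no_wpmed_inlinks": sum(1 for p in pages if p["wpmed_inlinks"] == 0),
    }
```

Running the same summary over the disambiguation pages and over the rest of the Mid-importance articles would give a simple side-by-side view of how far apart the two groups are.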

Comparing these importance-rated disambiguation pages with those that did not have a rating suggests that all of them should have been marked as disambiguation pages and not gotten a rating. I went ahead and changed them, partly because that makes a dataset of WPMED importance ratings better.

WPMED communication

I posted to WPMED's assessment talk page with an introduction and some examples of articles we might want to talk about. Hopefully they'll have some comments.