Research:Automatic new article topics suggestion
English Wikipedia's new page patrollers are overloaded. The backlog of new articles needing review is growing, and drastic measures are being proposed. Previous studies suggest that the primary difficulty in reviewing this backlog involves dealing with the Time-Consuming Judgment Calls (TCJCs): pages with potential that need work to bring them up to current standards. If only we had a better way to direct people with the right interests and expertise towards these TCJCs, maybe there'd be hope of addressing the backlog more effectively.
The aim of this project is to develop models for extracting topic information from new article drafts, to help match new article creations with new article reviewers based on their interests and expertise. mw:ORES already provides a "draftquality" model that helps reviewers split the obviously bad new articles (spam/attack/vandalism) from the rest. In this project, we propose to add another model to the flow: one that will perform topic extraction to help route new articles (and especially TCJCs) to subject matter experts.
To do this topic extraction, we'll leverage the WikiProject templates (used to tag articles) and the WikiProject Directory to build a labeled dataset with mid-level categories. We'll then use this dataset to train and test multi-label models for extracting topic information about new article drafts. If we're successful, current users of the backlog processing tools (e.g. Special:RecentChanges and mw:Extension:PageTriage) will be able to focus their reviews on subject-specific cross-sections of the backlog.
Some tools already categorize new pages by WikiProject, but all of them are rule-based and none uses any AI.
The end goals of the research are twofold:
- To host the topic prediction service on the ORES web service and expose its predictions, just as ORES's current predictions about edits are shown on en:Special:RecentChanges.
- To provide analytical insights through statistical machine learning on the data collected around WikiProjects, e.g. which WikiProjects need attention.
There are also some interesting use cases to consider. For example, a query for articles in both the Science and Biography categories should return biographies of scientists.
WikiProjects are an initiative on Wikipedia that brings together editors with a common interest in a specific domain, such as Science or Literature, and makes collaboration easier. WikiProjects Collaboration describes in detail how WikiProject members collaborate to improve articles.
Research:New_page_reviewer_impact_analysis is an ongoing research project providing insights into the number of users doing new page reviews and the pending review backlog of new articles. Typically, once an article passes the initial review process and is not tagged for deletion, it's important that the article be routed to its topic space, where more domain experts can review it. Currently this process is manual, with reviewers tagging articles with WikiProjects, or at most rule-based, through bots.
Categories like this are sparsely populated, and AI could help populate these pages so that new articles are reviewed faster.
Given that WikiProjects are a mature initiative and that an Article Quality system is already deployed on ORES for many Wikipedia language projects, it makes sense to use this user-generated data for further analysis and predictions.
Broadly the project can be divided into the following phases:
- The Wikiclass library extracts WikiProject membership info and user-ratings of articles. This project's extraction mechanism will be similar to what wikiclass does.
- Keeping an updated list of the available WikiProject categories in a machine-readable format (see phab:T172326). This involves creating a machine-readable version of the WikiProject hierarchy as it exists here. Only projects down to the first level (also called mid-level WikiProject categories) will be considered. In short, we have something like:
- Articles in WikiProjects (such as Inveraray Castle) carry their WikiProject membership information on their talk pages, embedded in the wikitext in the form of templates (e.g. the WikiProject:Music template). We will extract this data in a machine-readable format to use as a labeled dataset for building multiclass classification models.
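The template extraction described above can be illustrated with a minimal sketch. It assumes the talk-page wikitext is already available as a string and uses a regular expression rather than a full wikitext parser, so it will miss redirected or unusually named banner templates. The `MID_LEVEL` mapping is a hypothetical excerpt (the Albums and Music entries follow the Performing Arts example given later on this page; the Physics entry is purely illustrative), and the function names are not taken from the project's scripts.

```python
import re

# Hypothetical excerpt of the fine-grained -> mid-level mapping. The real
# mapping is derived from the WikiProject Directory; the Physics entry is
# purely illustrative.
MID_LEVEL = {
    "WikiProject Albums": "Performing Arts",
    "WikiProject Music": "Performing Arts",
    "WikiProject Physics": "Science",
}

# Matches the template name in "{{WikiProject Music|class=B}}" or
# "{{WikiProject_Albums}}", stopping at the first "|" or "}}".
TEMPLATE_RE = re.compile(r"\{\{\s*(WikiProject[ _][^|{}\n]+?)\s*(?=\||\}\})")


def extract_wikiprojects(talk_wikitext):
    """Return WikiProject template names in order of first appearance."""
    names = []
    for match in TEMPLATE_RE.finditer(talk_wikitext):
        # Normalise underscores and repeated whitespace to single spaces.
        name = " ".join(match.group(1).replace("_", " ").split())
        if name.lower() == "wikiproject banner shell":
            continue  # wrapper template, not a project of its own
        if name not in names:
            names.append(name)
    return names


def mid_level_categories(talk_wikitext, mapping=MID_LEVEL):
    """Map the extracted templates to a sorted list of mid-level categories."""
    found = extract_wikiprojects(talk_wikitext)
    return sorted({mapping[name] for name in found if name in mapping})
```

Templates that are absent from the mapping are silently dropped in this sketch; a real labeling run would probably want to log them so that the mapping can be extended.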
This will involve using statistical Machine Learning techniques around existing WikiProject articles to get a representation of the WikiProjects and comparing this representation to the new article's representation to get a similarity score.
The rough series of steps would be as follows:
- Applying algorithms like TF-IDF weighting, LDA, etc. to the cluster of articles in each WikiProject to get useful word features. For example, WikiProject Science might give science, innovation, scientific, mathematics, technology, and so on. Any model will be implemented using the revscoring library.
- Creating and storing a mathematical model for each such WikiProject, indexed by the WikiProject's name.
- When a score is requested for a new article, converting the new article's text to the same mathematical representation as the one used for the WikiProject clusters.
- Feeding the new article's representation and the WikiProject models into a multiclass classifier to get probability scores for each project.
- Multiclass classification works best when there are only a few classes. WikiProjects are quite large in number, more than 100.
- Performance and memory usage are concerns for models hosted by the ORES system.
- Managing the huge number of articles under the top-level WikiProjects is a challenge in itself.
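The representation-and-similarity steps listed earlier (TF-IDF weighting, one stored representation per WikiProject, and scoring a new article against each) can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: it does not use the revscoring library, it substitutes cosine similarity for a trained classifier, so the outputs are similarity scores rather than probabilities, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_idf(documents):
    """Smoothed inverse document frequency over a list of token lists."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector (dict) using only terms seen when building idf."""
    counts = Counter(tok for tok in tokens if tok in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: (count / total) * idf[tok] for tok, count in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def build_project_models(project_texts):
    """project_texts: {project name: [article text, ...]} -> (idf, models)."""
    tokenized = {name: [tokenize(text) for text in texts]
                 for name, texts in project_texts.items()}
    idf = build_idf([doc for docs in tokenized.values() for doc in docs])
    # One vector per project, built from all of its articles pooled together
    # and indexed by the project's name.
    models = {name: tfidf_vector([tok for doc in docs for tok in doc], idf)
              for name, docs in tokenized.items()}
    return idf, models


def score_draft(draft_text, idf, models):
    """Similarity of a new draft to every stored project representation."""
    draft_vector = tfidf_vector(tokenize(draft_text), idf)
    return {name: cosine(draft_vector, model) for name, model in models.items()}
```

Because each project is reduced to a single sparse vector, memory grows with the vocabulary per project rather than with the number of articles, which speaks to the performance concern noted above; the cost is that pooling hides variation within a project.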
The timeline below is a rough estimate and is subject to adjustment.
- 15 days - dataset curation
- 1 month - Testing with various models to aggregate information around WikiProjects.
- 1 month - Deploy the final model on ORES and optimize it for scalability. In particular, ensure that matching against too many categories is not time-consuming. Trimming of unrelated categories might also be needed.
- 15 days - documentation and outreach (primarily directed towards tool developers)
- Published the machine-readable WikiProjects dataset at https://doi.org/10.6084/m9.figshare.5503819.v1 . The dataset was generated by parsing the WikiProjects directory recursively with a handwritten parser. The work log provides detailed steps on how the dataset was generated. GitHub commit - link
- Script to extract and generate the mid-level WikiProjects mapping - This change takes a set of fine-grained WikiProject topics and groups them into higher-level categories called mid-level categories. The idea for mid-level categories comes from the categorization given in the WikiProjects Directory. For example, WikiProject:Albums and WikiProject:Music both come under the broader category Performing Arts. The aim of this step is to reduce the classification to a few broad categories, which is both feasible and logical. Pull request - link
- Script that, given a list of page IDs, extracts all WikiProject templates and mid-level categories associated with each page by querying the MediaWiki API. This prepares the final dataset that will be used to train the multi-label classifier by additionally fetching the text associated with each page. Pull request - link
- Published the labeled WikiProjects set containing 94,500 observations with a talk-page-id to mid-level-categories mapping - https://doi.org/10.6084/m9.figshare.5640526.v1
For a total of 93,000 observations (page_ids), the labeling operation took about 1:18:28 (hh:mm:ss).
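As a rough illustration of the API querying step, the sketch below builds batched MediaWiki Action API requests for page wikitext. It only constructs URLs and makes no network calls. The endpoint, the batch size of 50 (the per-request limit on `pageids` for regular clients), and the function names are assumptions for illustration; the project's actual script may query the API differently.

```python
from urllib.parse import urlencode

# Assumed endpoint; the project's script may target a different wiki.
API_URL = "https://en.wikipedia.org/w/api.php"

# Regular API clients may pass at most 50 page ids per request.
BATCH_SIZE = 50


def batches(items, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def revision_query_url(page_ids):
    """URL requesting the current wikitext of each page in one batch."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "pageids": "|".join(str(page_id) for page_id in page_ids),
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urlencode(params)


def all_query_urls(page_ids):
    """One request URL per batch of page ids."""
    return [revision_query_url(batch) for batch in batches(list(page_ids))]
```

At 50 page IDs per request, 93,000 pages correspond to 1,860 requests, so batching has a large effect on the total running time of a labeling run.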
- Changes to the revscoring library to include either TF-IDF weighting for multi-label classification of WikiProjects data or a word-vector-based approach to classifying new drafts, whichever works better. The basic challenge is handling the scale of the data. - In progress
- Revscoring changes to enable multi-label classification and provide metrics for it.
- Faster cross-validation inside revscoring by scoring instances in batches rather than one by one.
- Support for a OneVsRest classifier inside revscoring, following the binary relevance approach.
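To make the binary relevance idea and the multi-label metrics concrete, here is a small self-contained sketch. It does not use revscoring; the function names are hypothetical, and micro-averaging is only one of several reasonable ways to summarize multi-label performance. Binary relevance treats each mid-level category as its own yes/no problem, so every observation becomes a row of 0/1 values with one column per category.

```python
def binarize(label_sets, labels):
    """One 0/1 row per observation, one column per label (binary relevance)."""
    return [[1 if label in label_set else 0 for label in labels]
            for label_set in label_sets]


def micro_metrics(true_sets, predicted_sets, labels):
    """Micro-averaged precision, recall and F1 over all (row, label) cells."""
    y_true = binarize(true_sets, labels)
    y_pred = binarize(predicted_sets, labels)
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for truth, prediction in zip(true_row, pred_row):
            if truth and prediction:
                tp += 1          # label correctly predicted
            elif prediction:
                fp += 1          # label predicted but not present
            elif truth:
                fn += 1          # label present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For instance, if one article is labeled Science and Biography but predicted only as Science, the missed Biography label counts as a false negative without affecting the correctly predicted Science label.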