Grants:Project/Sumit/Automatic suggestion of topics to drafts

From Meta, a Wikimedia project coordination wiki

status: not selected
Automatic suggestion of topics to drafts
summary: Project to develop a machine learning model that automatically suggests topics for draft articles based on existing WikiProject topics, and to expose it as a web service on ORES.
target: English Wikipedia
amount: 9000 USD
contact: asthana.sumit23(_AT_)
created on: 19:46, 16 September 2017 (UTC)

Project idea

What is the problem you're trying to solve?

What problem are you trying to solve by doing this project? This problem should be small enough that you expect it to be completely or mostly resolved by the end of this project. Remember to review the tutorial for tips on how to answer this question.

English Wikipedia's new page patrollers are overloaded. The backlog of new articles needing review is growing, and drastic measures are being proposed[1]. Previous studies suggest that the primary difficulty in reviewing this backlog lies in the Time-Consuming Judgment Calls (TCJCs)[2]: pages with potential that need work to bring them up to current standards. A better way to direct people with the right interests and expertise towards these TCJCs could help address the backlog more effectively.

The aim of this project is to develop models that extract topic information from new article drafts, so that new article creations can be matched with new article reviewers based on their interests and expertise. ORES already provides a draftquality model that helps reviewers split the obviously bad new articles (spam/attack/vandalism) from the rest. In this project, we propose to add another model to the flow: one that performs topic extraction to help route new articles (and especially TCJCs) to subject matter experts.

Article rerouting with ORES

What is your solution to this problem?

For the problem you identified in the previous section, briefly describe how you would like to address this problem. We recognize that there are many ways to solve a problem. We’d like to understand why you chose this particular solution, and why you think it is worth pursuing. Remember to review the tutorial for tips on how to answer this question.

We intend to approach the problem from a machine learning perspective: build models that learn the distribution of words that characterizes a particular WikiProject, and then compare this distribution with a new article draft to measure how closely the WikiProject relates to that draft.

In particular, we will experiment with algorithms such as the vector space model and latent Dirichlet allocation (LDA) within revscoring to build our model.
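
As a rough illustration of the vector space idea, the sketch below builds one TF-IDF vector per WikiProject and ranks projects by cosine similarity to a draft. It is only a sketch under stated assumptions: the function names, the tokenizer, the smoothed IDF formula, and the two tiny sample texts are hypothetical and chosen for illustration; they are not taken from revscoring or from the proposal.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def build_project_vectors(project_texts):
    """Turn {project name: text} into TF-IDF vectors plus the IDF table used."""
    tokens = {name: tokenize(text) for name, text in project_texts.items()}
    n_docs = len(tokens)
    doc_freq = Counter()
    for toks in tokens.values():
        doc_freq.update(set(toks))
    # Smoothed IDF; one of several common variants.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {
        name: {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for name, toks in tokens.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_projects(draft_text, project_vectors, idf):
    """Return (project, similarity) pairs, most similar project first."""
    counts = Counter(tokenize(draft_text))
    # Terms never seen in any project text carry no weight here.
    draft_vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scores = {name: cosine(draft_vec, vec) for name, vec in project_vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical stand-ins for the text of articles tagged by each WikiProject.
    sample_texts = {
        "Science": "scientist research experiment physics research laboratory",
        "History": "era century ruler war diplomacy empire",
    }
    vectors, idf = build_project_vectors(sample_texts)
    draft = "The physicist ran an experiment in her research laboratory"
    for project, score in rank_projects(draft, vectors, idf):
        print(f"{project}: {score:.3f}")
```

LDA would replace the per-project word vectors with learned topic mixtures, but the final step of comparing a draft's representation against each project's representation would stay much the same.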

Project goals

What are your goals for this project? Your goals should describe the top two or three benefits that will come out of your project. These should be benefits to the Wikimedia projects or Wikimedia communities. They should not be benefits to you individually. Remember to review the tutorial for tips on how to answer this question.

We intend to help new page reviewers on English Wikipedia by:

  • Hosting a topic prediction service for new drafts on the ORES web service and exposing its predictions, much as ORES's current edit predictions are shown on en:Special:RecentChanges
  • Providing analytical insights through statistical machine learning on the data collected around WikiProjects, e.g. which WikiProjects need attention

Some interesting use cases also become possible: for example, a query for articles in both the Science and Biography categories should return biographies of scientists.

overview of the topic suggestion model

The above two could directly aid editors in deciding the WikiProjects for new drafts, so that draft articles can be quickly routed to the relevant WikiProjects, where more attention can be given to them. For example, a new draft in the domain of medical science that deals with medical issues created by war could be put into WikiProject Military history and WikiProject Medicine, where subject experts can devote time to it.

Project impact

How will you know if you have met your goals?

For each of your goals, we’d like you to answer the following questions:

  1. During your project, what will you do to achieve this goal? (These are your outputs.)
  2. Once your project is over, how will it continue to positively impact the Wikimedia community or projects? (These are your outcomes.)

For each of your answers, think about how you will capture this information. Will you capture it with a survey? With a story? Will you measure it with a number? Remember, if you plan to measure a number, you will need to set a numeric target in your proposal (i.e. 45 people, 10 articles, 100 scanned documents). Remember to review the tutorial for tips on how to answer this question.

How will I know if I have met my goals?

The goals of the project are well defined within the Scoring platform team's purview on Phabricator. In the order in which they will be tackled, they are:

  • phab:T172326 - Create machine-readable version of the WikiProject Directory
  • phab:T172325 - Efficient method for mapping a WikiProject template to the WikiProject Directory
  • phab:T172321 - Build mid-level WikiProject category training set
  • phab:T123327 - New article review routing AI

The above are the starting goals. Further goals, yet to be formally defined, are:

  • Extracting WikiProject articles given a WikiProject name
  • Running a classifier on existing WikiProjects to learn their distribution of words, then saving it in models per WikiProject
  • Comparing the new article drafts with the models of existing saved WikiProjects
  • Working out how to restrict model comparison to only the most relevant WikiProjects through some preprocessing
  • Building a new web service within ORES to host this service for broader use

After the project is over, how will it continue to impact Wikimedia projects positively?

Once the project is over, the final goal is to deploy this prediction service on ORES and expose it via an API. Consumers of the API, or the Recent Changes page itself, can then use the topic predictions for new drafts to help reviewers decide which WikiProjects can best host each draft.

Do you have any goals around participation or content?

Are any of your goals related to increasing participation within the Wikimedia movement, or increasing/improving the content on Wikimedia projects? If so, we ask that you look through these three metrics, and include any that are relevant to your project. Please set a numeric target against the metrics, if applicable.

Deploying the topic suggestion API on ORES and integrating it with Wikipedia will help page reviewers tag articles with appropriate WikiProjects. This should in turn lead to quicker improvement of drafts, as topic-specific experts devote time to them.

Project plan


Tell us how you'll carry out your project. What will you and other organizers spend your time doing? What will you have done at the end of your project? How will you follow-up with people that are involved with your project?

All code will be submitted to the drafttopic repo under wiki-ai on GitHub.

Some preliminary work has already been done on parsing the WikiProject Directory and creating a machine-readable format (phab:T172326). I have been working closely with Wikimedia's Scoring platform team and have submitted a number of patches to its various repositories. Halfak will be my mentor for the project. I have broken the project down into the following milestones:

  • Data curation
  • Research and Model generation
  • Deployment on ORES

Phase 1 - Data curation (4 weeks)

  • Keeping an up-to-date list of WikiProject categories in a machine-readable format. See phab:T172326. This involves creating a machine-readable version of the existing WikiProject hierarchy. Only projects down to the first level (also called mid-level WikiProject categories) will be considered. In short, we have something like:
    • Culture
      • culture.Music
      • culture.Performing
      • culture.Plastic
      • culture.Visual
    • STEM
      • stem.Biology
      • stem.Physics
      • ...
  • Articles in WikiProjects (such as Inveraray Castle) carry their WikiProject membership information on their talk pages, embedded in the wikitext in the form of templates (e.g. the WikiProject:Music template). We will extract this data in a machine-readable format to use as a labeled dataset for building multiclass classification models[3].
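
The labeling step described above can be sketched roughly as follows. This is a minimal sketch, assuming talk-page banners look like `{{WikiProject Music|class=...}}`; the regular expression, the function names, and the three-entry `MID_LEVEL_CATEGORIES` mapping are hypothetical illustrations, not the project's actual directory data. A production version would more likely use a proper wikitext parser than a regex.

```python
import re

# Hypothetical excerpt of a machine-readable directory: WikiProject name
# (lowercased) -> mid-level category label. The real mapping would come from
# the parsed WikiProject Directory (phab:T172326).
MID_LEVEL_CATEGORIES = {
    "music": "culture.Music",
    "biology": "stem.Biology",
    "physics": "stem.Physics",
}

# Matches the template name in "{{WikiProject <name>|..." or "{{WikiProject <name>}}".
WIKIPROJECT_TEMPLATE = re.compile(
    r"\{\{\s*WikiProject[ _:]+([^|}]+?)\s*(?:\||\}\})",
    re.IGNORECASE,
)


def extract_wikiprojects(talk_wikitext):
    """Return WikiProject names found in talk-page wikitext, in order, without repeats."""
    names = []
    for match in WIKIPROJECT_TEMPLATE.finditer(talk_wikitext):
        name = match.group(1).strip().replace("_", " ")
        if name not in names:
            names.append(name)
    return names


def label_article(talk_wikitext, directory=MID_LEVEL_CATEGORIES):
    """Map the WikiProjects on a talk page to sorted mid-level category labels.

    Templates that are not in the directory (wrappers, unknown projects)
    are silently skipped.
    """
    labels = {
        directory[name.lower()]
        for name in extract_wikiprojects(talk_wikitext)
        if name.lower() in directory
    }
    return sorted(labels)


if __name__ == "__main__":
    # Hypothetical talk-page wikitext.
    talk = (
        "{{Talk header}}\n"
        "{{WikiProject banner shell|1=\n"
        "{{WikiProject Music|class=Start|importance=Low}}\n"
        "{{WikiProject Physics|class=B}}\n"
        "}}\n"
    )
    print(extract_wikiprojects(talk))
    print(label_article(talk))
```

Because one article can carry several banners, each article may receive more than one label, which is worth keeping in mind when choosing the classification setup.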

Phase 2 - Research and Model generation (8 weeks)

overall pipeline for the model

This will involve using statistical Machine Learning techniques around existing WikiProject articles to get a representation of the WikiProjects and comparing this representation to the new article's representation to get a similarity score.

To sum up in simple terms through an example: suppose Science and History are the only two WikiProjects (topics) that exist. Through the above techniques, we will extract and store a representation of each project. The representation could be mathematical, or simply a list of words for each WikiProject, such as (Science - scientist, research, experiment, math, physics, phenomenon...) and (History - era, century, ruler, war, diplomacy...). Then, for a new article draft, we look for these representations; in the word-list case, we count how often these words occur in the draft and score the closeness of the draft to each WikiProject accordingly.
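
The word-list case from this example can be written down in a few lines. The sketch below uses the two word lists given above; the function name, the tokenizer, and the choice to normalize by draft length are assumptions made for illustration only.

```python
import re
from collections import Counter

# Word lists taken from the Science/History example above.
PROJECT_WORDS = {
    "Science": {"scientist", "research", "experiment", "math", "physics", "phenomenon"},
    "History": {"era", "century", "ruler", "war", "diplomacy"},
}


def score_draft(draft_text, project_words=PROJECT_WORDS):
    """Score a draft against each project as the fraction of its tokens
    that appear in that project's word list."""
    tokens = re.findall(r"[a-z]+", draft_text.lower())
    if not tokens:
        return {project: 0.0 for project in project_words}
    counts = Counter(tokens)
    return {
        project: sum(counts[word] for word in words) / len(tokens)
        for project, words in project_words.items()
    }


if __name__ == "__main__":
    draft = "The ruler ended the war through diplomacy late in the century"
    scores = score_draft(draft)
    best = max(scores, key=scores.get)
    print(scores, "closest project:", best)
```

Normalizing by draft length keeps long drafts from scoring higher merely because they contain more words; a raw count would be an equally valid reading of the example.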

Phase 3 - ORES Deployment (8 weeks)

Once models for WikiProjects have been generated and there is a tool that uses the models to predict topics for new drafts, this stage will involve hosting these models on ORES and exposing their predictions via an API, just like the existing editquality and article quality services on ORES.
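
From a consumer's point of view, using such a service might look roughly like the sketch below. Everything specific here is an assumption: the base URL and query layout are modeled on ORES's scores endpoint, the model name `drafttopic` is borrowed from the repository name, and the response layout and the `top_topics` helper are hypothetical. No request is actually sent.

```python
from urllib.parse import urlencode

# Assumed base URL, modeled on the ORES scores endpoint.
ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def build_score_url(context, model, rev_ids):
    """Build a scores URL for one model and one or more revision IDs."""
    query = urlencode({
        "models": model,
        "revids": "|".join(str(rev_id) for rev_id in rev_ids),
    })
    return f"{ORES_BASE}/{context}/?{query}"


def top_topics(response, context, model, rev_id, k=2):
    """Pick the k most probable topics from a response dict.

    Assumes a layout of
    response[context]["scores"][rev_id][model]["score"]["probability"],
    which is a guess at what a topic model's output could look like.
    """
    probabilities = response[context]["scores"][str(rev_id)][model]["score"]["probability"]
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    print(build_score_url("enwiki", "drafttopic", [123456]))

    # Hypothetical response for illustration only.
    fake_response = {
        "enwiki": {
            "scores": {
                "123456": {
                    "drafttopic": {
                        "score": {
                            "probability": {
                                "stem.Physics": 0.80,
                                "culture.Music": 0.15,
                                "stem.Biology": 0.05,
                            }
                        }
                    }
                }
            }
        }
    }
    print(top_topics(fake_response, "enwiki", "drafttopic", 123456))
```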


How will you use the funds you are requesting? List bullet points for each expense. (You can create a table later if needed.) Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too!

Developer salary: 25 USD/hr

  • Phase 1: 4 weeks * 15 hrs / week = 60 hrs
  • Phase 2: 10 weeks * 15 hrs / week = 150 hrs
  • Phase 3: 10 weeks * 15 hrs / week = 150 hrs

Total = (60 + 150 + 150) * 25 USD = 9000 USD

Community engagement

How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve at various points during your project? Community input and participation helps make projects successful.

The main consumers of the project, WikiProject:Council, have been notified on their talk page.

Get involved


Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.

Grantee - Sumit Asthana (GitHub) - I have been a volunteer contributor to Wikimedia since my GSoC project on the WikidataPageBanner extension in 2015. I have contributed to a number of projects on Wikipedia, and contributed to the Mobile team on Gerrit for quite some time. In May 2017, I started working with Aaron and the Scoring platform team, contributing both code and research work to various projects related to Wikipedia AI.

Advisor - Aaron Halfaker is the advisor of this project. He currently heads the Scoring platform team and leads the development of AI-related projects that aid Wikipedia editors and viewers through automation.

Community notification

You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc. Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions.


  • WikiProject:Council has been notified on the talk page here.
  • A message has also been left on Wikitech-l for general information.
  • New Page Patrol project notified: here.


Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).

  • Strong oppose per my comments on the talk page. This would be a colossal waste of funding based on a theory that the community rejected, and when a volunteer already developed a technology solution that does effectively the same thing for free. TonyBallioni (talk) 00:19, 3 October 2017 (UTC)
    • It's probably a bit late for the grant proposal to change now, but in the future, I can see the use for this as a tweaked proposal per Rentier and MusikAnimal's comments. I'm still opposed to it in its current form, but generally don't oppose further ORES research. TonyBallioni (talk) 16:19, 4 October 2017 (UTC)
  • Strong oppose. We cannot predict the outcome of the ACTRIAL experiment which will largely dictate the next steps. The requested enhancements to the Page Curation/New Pages Feed must first be addressed by the Foundation who created the system. See comments on the talk page. Kudpung (talk) 02:05, 3 October 2017 (UTC)
  • Oppose. I don't see what this proposal adds that we don't have with the new NPP browser. Wait till we get ACTRIAL before implementing more changes. How about you spend these resources on all the Phabricator requests that go ignored instead? Insertcleverphrasehere (talk) 05:18, 3 October 2017 (UTC)
  • Strong oppose. Per Tony, Kudpung and ICP. I read the page twice but failed to realize what purposes it seeks to fulfill other than what is exactly provided by NPPBrowser, developed by one of our volunteers. And maybe you can spend your time and/or resources much better in developing or tending to our long-pending list(s) of NPP improvements gathering dust at Phab. Godric on Leave (talk) 05:52, 3 October 2017 (UTC)
  • Neutral per my comments on the talk page. This could potentially have some valid use, but definitely not within the world of new page patrolling. I'd advise retooling this grant application to remove the connection to new page patrolling. Scottywong (talk) 23:17, 3 October 2017 (UTC)
  • Strong oppose for now: per Tony's comment above, and his comment on the talk page, and per my own commentary on the talk page. Also, since ACTRIAL's deployment, things have drastically improved. I believe the circumstances would be a lot different, for the better, by the time ACTRIAL's trial run is over. So the best time for discussing this would be a month after the trial end of ACTRIAL. —usernamekiran(talk) 01:34, 4 October 2017 (UTC)
  • Oppose. I think the underlying idea is sound and the automated topic sorting would be useful. Not for the new page patrol - it was a mistake to use it as a motivation for the project. The backlog hasn't been growing for some time, the TCJC theory is questionable at best, and even if these assumptions were true, a grant proposal is not the place to suggest such drastic changes in the NPP workflow. The real benefits are hinted at by MusikAnimal on the talk page -- examples include filtering of the recent changes or the watchlist by topic. The proposed model doesn't really duplicate the NPP Browser, which would remain useful for fine-grained filtering of the new pages. In fact, I would probably use the new ORES scores to improve the accuracy of the tool. I would be happy to support a reworded proposal. Rentier (talk) 11:25, 4 October 2017 (UTC)
  • Oppose - as others have stated, NPP is undergoing the ACTRIAL experiment. While I can't support this project, I would definitely support a grant funding the creation of an algorithm to determine NPOV, UNDUE and BALANCE in controversial articles. 🤣 Atsme📞📧 17:47, 4 October 2017 (UTC)
  • Strong oppose. There is a long list of requests around NPP that need to be addressed. NPP handles new articles, not Drafts. Routing new pages to alert morbid wikiprojects will help no one. Reading this proposal makes me wonder if the author has spent a day in NPP themselves to process a few hundred new pages - because if they had they would see exactly how wrong-headed this proposal is. What we need is filters and editors to catch the garbage and delete it, not an alert system no one will watch. Legacypac (talk) 05:31, 6 October 2017 (UTC)