Hi, I'm Sumit. I hold a bachelor's degree in Computer Science and Engineering from IIT Patna. I'm interested in Artificial Intelligence and NLP. Applying mathematics and computer science algorithms to make sense of data is what excites me the most.

This page tracks my research projects over time, both with the Wikimedia Foundation as part of the Scoring Platform team and elsewhere. I also maintain an online blog where I occasionally write about technical topics related to my work.

Wikimedia Foundation

I've been associated with the Scoring Platform team at the Wikimedia Foundation since May 2017, contributing to several projects that build and deploy machine learning models on Wikipedia to ease the work of editors. Some of them are:

Automatic suggestion of topics to new drafts

This is the project I'm currently driving under the guidance of User:Halfak_(WMF), who manages the Scoring Platform team. It uses the data around existing en:Wikipedia:WikiProject pages to gather topical information on these WikiProjects, and then uses that information to predict the topics of new drafts on Wikipedia.

Article rerouting with ORES
  • Script to extract and generate the mid-level WikiProjects mapping - This change takes a set of fine-grained WikiProject topics and groups them into higher-level categories called mid-level categories. The idea for mid-level categories comes from the categorization given in the WikiProjects Directory. For example, WikiProject:Albums and WikiProject:Music both come under the broader category Performing Arts. The aim of this step is to reduce the classification to a few broad categories, which is both feasible and logical. Pull request - link

For a total of 93,000 observations (page_ids), the labeling operation took about 1:18:28.
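The grouping of fine-grained WikiProjects into mid-level categories can be sketched as a simple lookup. This is a minimal illustration, not the script from the pull request: the function names and the `MID_LEVEL` structure are assumptions, and only the Albums/Music to Performing Arts grouping comes from the description above.

```python
# Hypothetical excerpt of a mid-level mapping. Only the grouping of
# Albums and Music under Performing Arts is taken from the text; the
# data layout itself is an assumption made for illustration.
MID_LEVEL = {
    "Performing Arts": ["WikiProject Albums", "WikiProject Music"],
}


def invert(mid_level):
    """Build a WikiProject -> mid-level category lookup."""
    lookup = {}
    for category, projects in mid_level.items():
        for project in projects:
            lookup[project] = category
    return lookup


def label_page(templates, lookup):
    """Return the sorted mid-level categories for a page's WikiProject templates.

    Templates without an entry in the lookup are ignored.
    """
    return sorted({lookup[t] for t in templates if t in lookup})


LOOKUP = invert(MID_LEVEL)

# "WikiProject Biography" has no entry in this toy mapping, so it is skipped.
labels = label_page(
    ["WikiProject Albums", "WikiProject Music", "WikiProject Biography"],
    LOOKUP,
)
print(labels)
```

Because several fine-grained WikiProjects collapse into one category, a page tagged with both Albums and Music ends up with a single Performing Arts label, which is what shrinks the label space for the classifier.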

  • Script that, given a list of page IDs, extracts all WikiProject templates and mid-level categories associated with each page by querying the MediaWiki API. It also fetches the text of each page, producing the final dataset used to train the multi-label classifier. Pull request - link
  • Changes to the revscoring library to include TF-IDF weighting for multi-label classification of WikiProjects data, or a word-vector-based approach to classify new drafts, whichever works better. The basic challenge is handling the scale of the data. - In progress
  • The classification was done using random forest and gradient boosting classifiers with word2vec vectors as features. The binary relevance method was used for multi-label classification and achieved good results.
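Binary relevance, mentioned in the last item, trains one independent binary classifier per label and reports every label whose classifier fires. The sketch below illustrates only that decomposition: a toy nearest-centroid classifier stands in for the random forest and gradient boosting models, the 2-dimensional vectors stand in for word2vec features, and the "Geography" label and all class names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


class CentroidClassifier:
    """Toy binary classifier (a stand-in, not the model used in the project).

    Predicts 1 when x is closer to the centroid of the positive examples
    than to the centroid of the negative examples.
    """

    def fit(self, X, y):
        self.pos = centroid([x for x, t in zip(X, y) if t == 1])
        self.neg = centroid([x for x, t in zip(X, y) if t == 0])
        return self

    def predict(self, x):
        return 1 if math.dist(x, self.pos) < math.dist(x, self.neg) else 0


class BinaryRelevance:
    """One independent binary classifier per label."""

    def __init__(self, labels, make_classifier):
        self.labels = labels
        self.make_classifier = make_classifier

    def fit(self, X, Y):
        # Y is a list of label sets, one set per training example.
        self.models = {}
        for label in self.labels:
            y = [1 if label in tags else 0 for tags in Y]
            self.models[label] = self.make_classifier().fit(X, y)
        return self

    def predict(self, x):
        return {label for label, m in self.models.items() if m.predict(x) == 1}


# Made-up feature vectors and labels, for illustration only.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.8]]
Y = [
    {"Performing Arts"},
    {"Performing Arts"},
    {"Geography"},
    {"Geography"},
    {"Performing Arts", "Geography"},
]

model = BinaryRelevance(["Performing Arts", "Geography"], CentroidClassifier).fit(X, Y)
print(model.predict([0.95, 0.05]))
print(model.predict([0.7, 0.7]))
```

Since each label is decided on its own, a draft can receive zero, one, or several mid-level categories; the trade-off is that correlations between labels are ignored.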

PR curve with random forest classifier for select mid-level-categories:

ROC curve with random forest classifier for select mid-level-categories:


This project deals with determining the quality of new drafts on Wikipedia and classifying them into the categories spam, vandalism, attack, or OK. It aims to automate part of the editors' work under the speedy deletion criteria, which authorize editors to delete, without broader consent, certain articles that violate Wikipedia policies. I worked on:


Completed generation of models for a few of the languages:

  • Damaging and goodfaith models for Albanian Wikipedia - phab:T163009
  • Damaging and goodfaith models for Romanian Wikipedia - phab:T156503
  • Testing the sentiment feature on edits - phab:T170177


New Page Patrol Research

Research on the new page reviewer user right on Wikipedia, introduced in November 2015:

  • The number of re-reviews that patrollers perform - link

  • Number of people who performed new page patrol - link

Undergraduate Thesis

My undergraduate thesis work was on the WSDM Cup 2017 triple scoring task. The poster can be found here.

More information is available in the blog post. The work will be published in the short paper/poster track of ICON 2017.