Grants:IEG/Automated Notability Detection

status: selected
Automated Notability Detection
summary: Using machine learning to determine whether articles are notable. Help Draft reviewers and NPPs make better, faster, easier decisions.
target: Currently English Wikipedia, but it should be easily adaptable to many languages.
strategic priority: increasing participation
theme: tools
amount: 14,775 USD (partial funding: 8,775 USD)
grantee: Bluma.Gelley
contact: bgelley@nyu.edu
created on: 22:24, 9 September 2014 (UTC)



Project idea

What is the problem you're trying to solve?

A large volume of articles is added to Wikipedia, in both the Main namespace and the new Draft namespace. All of these articles require some form of review, and one of the major elements of that review is determining whether or not the article is about a notable topic. Notability is difficult to determine without a significant amount of familiarity with the notability guidelines, and sometimes domain expertise. Reviewers often quickly pronounce an article non-notable, and it is declined or deleted even though it is in fact notable. These overly hasty decisions can drive away well-meaning new users whose good-faith articles are quickly deleted. Because it is difficult to adequately assess a topic's notability quickly, reviewers often end up basing their judgments on the content and quality of the draft or article rather than on the potential notability of its subject.

What is your solution?

We propose to use machine learning to support the reviewing process. We will train classification algorithms to automatically determine whether or not articles are notable. The notability scores (e.g. probability of notability & confidence) calculated by the classifier will be made available via public APIs on Wikimedia Labs. We hope that these scores will give users more information about the notability of articles they are unsure about, helping reviewers find and improve articles that are notable even when they do not appear to be, and supporting better decisions overall.

Note that this system will not be assessing whether an article should be deleted. The classifier's output will only address whether the topic of an article or draft is probably notable. We feel that this is an important distinction. Stated simply, the algorithm is intended to support human judgement -- not replace it. We also do not want our algorithm to become a crutch for users to hastily delete or decline articles without due thought. We therefore plan to return only scores that are > 0.5, i.e., where the article is more than 50% likely to be notable; for anything lower, the user will receive a message that the notability of the article cannot be determined. We will also attempt to include a rationale for the notability score.
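The gating described above could look something like the following minimal Python sketch. It assumes a hypothetical classifier that returns a probability of notability between 0 and 1; the field names and the example rationale are illustrative, not part of any finished design.

    NOTABILITY_THRESHOLD = 0.5  # scores at or below this are not reported

    def notability_response(probability, rationale=None):
        """Return a score only when an article is more than 50% likely to be
        notable; otherwise report that notability cannot be determined."""
        if probability > NOTABILITY_THRESHOLD:
            return {
                "notable_probability": round(probability, 3),
                "rationale": rationale or "no rationale available",
            }
        return {"message": "The notability of this article cannot be determined."}

    # Illustrative calls with made-up scores:
    print(notability_response(0.82, "several independent sources detected"))
    print(notability_response(0.31))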

Project goals

We hope that making reviewing easier will:

  1. improve the quality of notability assessments by calling attention to drafts or articles that seem to be about notable topics
    • Hypothesis 1: Drafts about truly notable topics will be less likely to be deleted/declined.
  2. decrease the workload of current draft reviewers by reducing the burden of assessing notability
    • Hypothesis 2: Reviewers using the notability tool will review drafts more quickly and more effectively.

Project plan

Activities

  1. Create a manually coded sample of articles: In order to train a classifier, we will need to start with labeled data -- a sample of article drafts that have been carefully assessed for notability. In order to obtain this labeled data, we will create a random sample of draft articles and present them to Wikipedians (found via WikiProjects relevant to the articles' content) to determine if they are notable or not.
  2. Define and extract notability features: To support a classifier's ability to differentiate between notable and non-notable drafts, we will need to specify and extract features that carry a relevant "signal" for notability (e.g. How many web search engine results? How many red links appear to reference the topic? Can the article's topic be matched to an existing category? etc.).
  3. Train and test classifiers: We will use the feature set and labeled data to train classifiers and test them against a reserved set of labeled data to identify the fitness of different classification strategies and choose the most accurate (see the sketch after this list).
  4. Serve via API on WMF Labs: Once we have a functioning classifier, we will expose it to wiki tools on a Wikimedia Labs instance. This will allow us and other wiki-tool developers to make use of the service. We will also build a minimal, proof-of-concept wiki gadget to demonstrate the utility of the classifier's scores to draft reviewers.
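To make steps 2 and 3 concrete, here is a minimal sketch in Python using scikit-learn. The feature names are hypothetical placeholders for the signals listed above, and the choice of a random forest is purely illustrative; the actual features and classification strategy will be decided during the project.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    def extract_features(draft):
        # Hypothetical notability signals; the real feature set is still to be defined.
        return [
            draft["search_engine_hits"],           # e.g. number of web search results
            draft["incoming_red_links"],           # red links that appear to reference the topic
            int(draft["matches_known_category"]),  # can the topic be matched to an existing category?
        ]

    def train_notability_classifier(drafts, labels):
        """Train on hand-labeled drafts and report precision/recall on a held-out set."""
        X = [extract_features(d) for d in drafts]
        X_train, X_test, y_train, y_test = train_test_split(
            X, labels, test_size=0.25, random_state=42)
        classifier = RandomForestClassifier(n_estimators=100, random_state=42)
        classifier.fit(X_train, y_train)
        print(classification_report(y_test, classifier.predict(X_test)))
        return classifier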

Timeline:

  • weeks 1-4: Brainstorming features and determining feasibility for obtaining them. Researching methods for determining the topic of an unclassified article. We will try to solicit community suggestions in this step.
  • weeks 1-4 (simultaneous with the above step): Manual labeling of training data by subject-matter experts.
  • weeks 5-9: Implementing the features in code and testing the classifier
  • weeks 10-12: Improving the classifier, adding and removing features as necessary
  • weeks 12-14: Thorough validation; creating a proof-of-concept website for community members to evaluate the classifier
  • weeks 15-18: Building the classifier into an API that returns notability and confidence scores (sketched below)
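As a rough illustration of what such an API might return, here is a minimal Flask sketch. The endpoint path, response fields, and the placeholder scoring function are all assumptions made for illustration; nothing here reflects a settled interface.

    from flask import Flask, jsonify

    app = Flask(__name__)

    def score_article(title):
        # Placeholder: the real service would fetch the draft, extract its
        # features, and run the trained classifier. Values here are made up.
        return {"notable_probability": 0.73, "confidence": 0.61}

    @app.route("/notability/<title>")
    def notability(title):
        scores = score_article(title)
        if scores["notable_probability"] > 0.5:
            return jsonify(title=title, **scores)
        return jsonify(title=title, message="Notability could not be determined.")

    if __name__ == "__main__":
        app.run()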

Budget

Total of 14,775 USD, broken down as follows:

  • Graduate student salary for grantee for the duration of the research: 30 USD/hour for 20 hours a week * 14 weeks: 8,400 USD
  • Cost to hire someone to implement the API: 50 USD/hour for 120 hours: 6,000 USD. If there are any volunteers to do this job, the grant will decrease by this amount.
  • Human resources (finding the API developer, managing their work, etc.): 25 USD/hour for 15 hours = 375 USD

Community engagement

We plan to ask members of various WikiProjects to help in constructing a training set. These volunteers would read a number of articles and mark them as notable or not notable. In this way, we can create a gold-standard data set based on expert judgement. We will also continuously solicit help and advice from the community; we would love for members of the community to suggest possible features for the classifier, based on their experience of which aspects of an article they examine when determining notability.

We are planning to produce only an API; it will be most useful if tool developers build tools that consume the API and integrate our scores into the reviewing workflow. We plan to liaise with the tool-developer community to encourage them to build tools that use the API to help reviewers make better decisions. We're already in touch with the communities around AfC & NPP -- we'll work closely with them on testing, and incentivize deployment that way. One of us (Aaron) is a tool developer & embedded in that community.

Sustainability

We hope that by the time the grant period ends, we will have a working, robust classifier whose decisions are made available through an API. This should allow maximum flexibility for others to build tools using the scores. We will provide detailed documentation of how the system works and open-source the code so that it can be improved by anyone who wishes. We hope that others will build on top of this API and create tools that will help different parts of the community make better decisions. We will also solicit help from the community in continuing to label articles as notable or not so we can keep expanding our training set to make more accurate predictions. (This is the paradigm used by ClueBot NG; see here.)

Measures of success

We will be using machine learning, so we can measure our success by the precision and recall of our classifier. Since notability detection is a hard problem, we will consider accuracy of around 75% to be good. This is more than high enough for a first step in the review process that helps reviewers with their work.
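For reference, a minimal sketch of how those measurements could be computed with scikit-learn is shown below; the labels and predictions are made-up examples rather than project results, and the 75% figure above is a target rather than a measurement.

    from sklearn.metrics import accuracy_score, precision_score, recall_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hand-labeled notability (illustrative only)
    y_pred = [1, 0, 1, 0, 0, 1, 1, 1]  # classifier output (illustrative only)

    print("accuracy: ", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall:   ", recall_score(y_true, y_pred))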

We would also like to get community feedback on our classifier's results. (Thanks, Siko (WMF)!) Though we do not expect to have a fully functional tool at the end of the grant period, we hope to make a web page available where users can input an article, receive our classifier's score for it, and mark whether or not they agree with the classifier. This will allow the community to decide whether our classifier is meeting their needs, and it will also help us improve the classifier by expanding its training set.
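One simple way such feedback could be captured is sketched below, assuming a hypothetical store of (article, score, agreement) records that are later folded back into the training set; the file format and field names are purely illustrative.

    import csv

    def record_feedback(title, classifier_score, user_agrees,
                        path="notability_feedback.csv"):
        """Append a reviewer's verdict on a classifier score so that it can
        later be added to the training set."""
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow([title, classifier_score, int(user_agrees)])

    # Illustrative call:
    record_feedback("Example draft", 0.68, user_agrees=True)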

Get involved

Participants

  • Bluma Gelley : I am a PhD student at New York University and have done a significant amount of research on Wikipedia. In particular, my research looked at some of the problems with the deletion/New Page Patrol process, and with the Articles for Creation/Draft process. Both these processes could be improved by making automated notability detection available to those reviewing/vetting articles.
I have previously done related work on automatically predicting articles that are likely to be deleted; this project would build on that using a new, hand-constructed training set of recent articles for better results. In this published paper, I attempted to detect notability, but I suspect that I was successful only in predicting deletion. Besides a better training set, I also plan to use better features. I already have the framework for the classifier, so part of the work is done.
  • Aaron Halfaker (EpochFail) is a Research Scientist at the Wikimedia Foundation (staff account: Halfak (WMF)) and has developed tools that use intelligent algorithms to support wiki work (e.g. en:WP:Snuggle and R:Screening WikiProject Medicine articles for quality) and has performed extensive research on problems related to deletion, AfC, and newcomer socialization in general.
  • Volunteer I am willing to test such tools, and can also provide advice if needed. Mdann52 (talk) 16:02, 29 October 2014 (UTC)
  • Volunteer Stuartyeates is willing to help with manual labelling & other aspects -- coder with PhD (see notes at talk page)

Community Notification

Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions.

The following communities have been notified. We plan to contact several WikiProjects for help with manual labeling of articles once we have a set of articles to work with.

Recruiting Volunteers

We are recruiting participants for hand-coding here.


Endorsements

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. (Other constructive feedback is welcome on the talk page of this proposal).

  • Community member: add your name and rationale here.
  • As a long-term participant and closer at AfD, and a frequent participant at AfC on ENWIKI, I have long experienced the grave difficulties involved both in working to our notability guidelines and in explaining them to new editors. This effort holds promise for assisting reviewers and new editors at a process I currently describe as a "whisper chipper for new editors." I do not mean to suggest that I expect an immediate panacea here, but this is the right first step towards what I hope will be a longer series of efforts leveraging what technologies we can into solving some of Wikipedia's biggest user engagement hurdles. I strongly endorse this effort. --Joe Decker (talk) 15:35, 16 September 2014 (UTC)
  • I am supportive of this idea, as I would like to start seeing more formal ways to judge notability on our site outside of individual ideas on notability. Kevin Rutherford (talk) 02:04, 17 September 2014 (UTC)
  • Support the idea. Additionally, if this tool could also pick out unsourced articles (which often go hand-in-hand with being non-notable) and be integrated into the helper script, this could massively improve the reviewing workflow. Mdann52 (talk) 07:38, 17 September 2014 (UTC)
  • This does not require machine learning, so will not be part of the preliminary prototype, but should be very easy to integrate into the completed tool. Thanks for your support! Bluma.Gelley (talk) 07:53, 17 September 2014 (UTC)
  • Yeah, I am aware. Just giving suggestions of possible future features, and something that could be picked up on to show no notability :) Mdann52 (talk) 12:35, 17 September 2014 (UTC)
  • Definitely! Please keep the suggestions coming! Bluma.Gelley (talk) 20:40, 17 September 2014 (UTC)
  • Integration into the AfC helper script would work well. I think you should at least create an API for your project, in case anyone would want to also integrate this into a new pages patrol application. Working on the actual AfC helper script would be a waste, but an API would work well. I would support this. If this is an ANN, then I'll support it if there's some sort of measurement of confidence, not a binary decision. If you base it on Google News results, Google Scholar, or any other formula that takes into account specific factors, then you'd need to show the reasoning behind the decision. For example, if an article is about a recent event and seems very notable to an untrained eye, but it turns out that there are no Google News results, I'd like this application to tell me: "Hey, this article doesn't seem notable because there are no Google News results, and it seems to have taken place recently." That is a big red flag, and one that this tool would hopefully alert me to. Chess (talk) 03:16, 18 September 2014 (UTC)
  • Sounds like a worthy project. Since the proposed deliverable is a tool that helps reviewers, the tool should give more than a binary or one-dimensional assessment. For instance, it should highlight which sources in the article are reliable. It could try to determine the subject area and indicate it to reviewers so they know which notability criterion to apply. Teaching a machine to do stuff like this will likely require more than binary pass/fail notability information from trainers. Kvng (talk) 02:17, 19 September 2014 (UTC)
  • Interesting, if ambitious, idea. As a long-time new page patroller, frequent user (and previously beta-tester) of the new pages feed tool in en.wiki, I think there is a genuine possibility that this tool could be useful and also act as a test bed for additional semi-automated means of screening our ever-growing queues of draft and new articles. I will post some additional thoughts on the talk page. VQuakr (talk) 02:59, 19 September 2014 (UTC)
  • This is a lot more complex than just showing some numbers for recall and precision; this can end up as a tool that changes the deletion process to support its own notion of what is possible to delete. If a tool proposes positive changes, it does not matter so much if it spins the direction to the left or right, but when a tool is geared towards negative changes it can be extremely dangerous if it starts to optimize its own decisions towards deletions. That will happen if the tool uses machine learning: it will try to optimize for deletion as its outcome. Deletion processes should be rooted in firm rules, not in continuous machine learning. A system could, however, be used for learning which articles could survive a deletion process driven by those rules, but then it drives positive changes. Note that non-continuous machine learning also has problems, as it hides the underlying rules of its decisions. This is a general problem with nearly all types of machine learning. I can endorse development and testing, but not setting it up as part of the production process unless I have a lot more information about it. — Jeblad 10:58, 21 September 2014 (UTC)
  • Jeblad, you raise some very important points. The possibility of a machine-learned system being biased towards what gets deleted is something we've thought about extensively. For that reason, we do not plan to use deleted articles to train the classifier. We will hopefully get subject-matter experts to make careful judgments on whether articles are notable or not, and use those judgments for training. We are also not aiming this tool specifically at deletion (what you refer to as 'negative changes'); rather, we hope that it can be used at various points in the article creation/improvement workflow to help users make positive decisions as well. Currently, articles are often deleted or rejected, even though potentially notable, because they are missing sources and/or are low quality. We hope that this tool will help reviewers see the potential in articles on notable topics and allow them to develop, rather than rejecting them out of hand. Bluma.Gelley (talk) 12:14, 21 September 2014 (UTC)
To clarify what I wrote: it is about what happens when the errors the machine learning introduces start to pile up, and how that will erode previous knowledge. In a vector space this will look like a slow blurring of the features, and if the surface in the vector space is the defining limit, then you get a constantly evolving notability.
Notability in Wikipedia is about the deletion process, but what you now write seems to be more about quality processes in general. I need more info before I can endorse this. — Jeblad 06:11, 22 September 2014 (UTC)
Notability isn't just used in the deletion process. One of the main quality processes we're thinking about is the en:WP:AfC process; this reviews articles from the en:WP:Article wizard (including those written by IP users & newcomers). Does that make sense? I can point you to a paper we wrote for OpenSym about AfC, and to Aaron's slides. One of our findings was that notability decisions are really difficult for reviewers to make because they're very subject-specific and require significant judgement. But AfC articles are an assortment ("On the same day, a reviewer might consider articles on a World War I soldier, a children's TV special from the 1980's, a South African band, and an Indian village."). Does that make the motivation more clear? Jodi.a.schneider (talk) 16:05, 24 September 2014 (UTC)
  • You do know that notability isn't an issue at speedy deletion, the test for A7 being no credible assertion of importance or significance? Otherwise I like the idea of trying this, but I would be very uncomfortable with a tool that was only 75% confident when it marked an article as probably meriting deletion. Better, if this goes ahead, to identify some articles as almost certainly meriting certain deletion tags, some as almost certainly needing to be marked as patrolled, and another group as needing human review, and that in my view should include anything where the bot is >5% unsure. Where this would be really useful would be in highlighting probable G10 candidates and bringing them to the attention of patrollers and admins; a few simple rules, such as the inclusion of certain phrases or having a previous article deleted G10, should make a really useful difference. Otherwise, presence of references is only relevant to one deletion criterion -- BLPprod -- unless, that is, the tool automates the level of mistakes we already see? WereSpielChequers (talk) 19:20, 22 September 2014 (UTC)
    Hey WereSpielChequers, glad to see you commenting here. The goal isn't to mark pages as probably meriting deletion (for A7 or any other reason). The main goal is to help en:WP:AfC reviewers sift through the backlog of draft articles, and secondarily to speed up human review at en:WP:NPP by identifying how likely something is to be notable. Based on previous comments, we will include confidence scores.
    Bluma.Gelley's the ML expert -- I'll let her answer about whether it's feasible to autodetect probable attack pages (Wikipedia:Criteria_for_speedy_deletion#G10) within the scope of this project. Jodi.a.schneider (talk) 16:35, 24 September 2014 (UTC)
    I think I see what you are trying to do, but I remain nervous that the deletionists will just use this as a way to speed up deletion tagging of articles. I foresee people saying "don't blame me, when I tagged it there was only x% chance of it being notable". Would it be possible to build in something that gave contraindications, for example marking anything less than 24 hours old as either notable or "too early to tell"? WereSpielChequers (talk) 13:53, 26 September 2014 (UTC)
    I definitely understand your concerns; there is always that possibility. However, we do feel strongly that in many cases, the problem is that the article is notable, but its notability is not visible on a superficial review. The point of this project is to combine all the external information not necessarily easily available to reviewers/patrollers, so that they can have much more information when they make their decision, rather than just relying on the text of the article and whatever they can find themselves, if they bother. We hope this will make it easier to keep, rather than easier to delete.
Also, while it may be a good check on the power of this scorer to build in some contraindications, I don't think that we should do that. That would mean making policy decisions (i.e. anything less than 24 hours old cannot be flagged as not notable) in what should be an objective score. Whoever ends up building actual tools to make use of the scores should be the ones to build in such conditions. Bluma.Gelley (talk) 06:49, 28 September 2014 (UTC)
  • It may take some time for this tool to meet reliability standards for reviewers, but I think it has a lot of promise and I certainly would be very interested in testing it out. In principle, I think tools that can help guide decision-making (but not replace it) are very helpful for getting editors interested in a very time-consuming and effortful task. I JethroBT (talk) 22:29, 29 September 2014 (UTC)
  • As a long-term wikipedian and frequent AfD and AfC participant with a PhD in Comp Sci who has published in machine learning, I Support this project. I have some concrete suggestions which I will post on the talk page. Stuartyeates (talk) 20:30, 1 October 2014 (UTC)
  • Comment: I've started a discussion here for users who don't wish to commit at this time. Best, FoCuSandLeArN (talk) 19:42, 3 September 2015 (UTC)

Oppose

  • I strongly oppose this idea. It will simply cause more potential articles to be deleted, this time by a bot. Editors who enjoy deleting other editors' work will point to this bot as proof that the article is not notable. Terrible idea on so many levels. Walterruss (talk) 07:45, 30 September 2014 (UTC)
  • Strongly oppose. Notability is a difficult issue even for mature experts/admins. This tool will bias the wiki towards a deletionist point of view. --Natkeeran (talk) 20:43, 17 October 2014 (UTC)
Thanks for your input, @Natkeeran:. We have spent a long time researching notability and deletion and agree that it is an extremely difficult decision to make. We are therefore not planning on replacing human decision-making at all. All we want to do is provide humans with some additional information so that they can make better-informed decisions. As for biasing WP to a deletionist point of view, our goal is actually the opposite: to help articles that are actually notable, but don't seem like it, get more attention and consideration instead of just being deleted (or rejected at AfC) because a reviewer just didn't have enough information to make the correct decision. - sorry, this was me User:Bluma.Gelley - I forgot to log in. 173.3.194.165 13:59, 22 October 2014 (UTC)
  • As a long-term editor trying primarily to rescue both articles unwisely deemed unsuitable and actually unsuitable articles with possibilities for improvement, I've focused for the last 6 years on deletion and afc/new page patrol processes. I've come to the conclusion that though there is a good portion (perhaps 1/3) of totally impossible articles, detecting them takes judgment and, frequently, investigation. For the middle third, where the notability guidelines are applicable, I find them useless. We keep articles on the basis of what we want to keep, and interpret the extremely ambiguous phrases in the guidelines to suit our feelings about the article. Such terms as substantial coverage, reliable sources, and even independent references are not self-explanatory, and for most disputed articles I could perfectly well construct an argument in either direction. We cannot automate this process, because, given even a perfect AI, it's not rational. Such patterns as AI could detect are often simply patterns of prejudice. We need to teach people judgement, not enshrine principles of mechanical thinking. DGG (talk) 07:22, 20 October 2014 (UTC)
DGG, you are definitely correct that notability decisions require human judgment. We are not trying to supplant human judgment, just supplement it. You mention that the "totally impossible" articles require investigation. Part of what we are trying to do is to do some of that investigation for you, then present the results to the user as a single score, hopefully with some rationale for the score as well. The decision will always be made by human judgment, not AI. We are just trying to support human decisions, as well as potentially help filter articles that require more or less human judgment. For instance, these scores could be used to build a tool that directs the most difficult AfC drafts to the most experienced reviewers. Currently, AfC drafts are 'sorted' by age, which is not necessarily the best allocation of resources.
Machine-learned scores that help humans make more informed decisions are not new on Wikipedia: they are commonly used in vandalism detection, through tools such as Huggle and STiki. We are proposing something very similar. Yes, notability is arguably more difficult to determine than vandalism, but the essentials are the same.
We really appreciate your comments, @DGG: - thoughtful comments like yours from experienced users are really helping us refine and improve our ideas. Would you be willing to help us test an under-development version as soon as we have a working prototype? We really want to make sure that the community's concerns are addressed. Thanks, Bluma.Gelley (talk) 18:54, 22 October 2014 (UTC)
Bluma.Gelley: Sorry for the delay in responding; I'm not at Meta often. You can find me best at enWP. I'm aware of weighted detection algorithms; we use them, for example, for the new article bot for subject sorting, and it's fairly helpful. They might be helpful here also, and I'm certainly willing to take a look at them. But for AfC, what we really need is subject detection -- and the new article bot would do it reasonably well here also, except that nobody is willing to implement it, though I and others have been asking for years. Oddly, sorting by age is not in fact irrelevant -- the less obvious ones tend to get postponed, both at afc and NPP. In fact, I and some others have taken to scanning the new AfCs to identify the obvious keeps -- usually from fairly experienced editors who for some reason still use afc, sometimes because they do not want to register an account -- and the blatant and obvious advertisements, which can go directly to speedy deletion. I have some ideas myself about detection -- the simplest for promotionalism is to scan for the words "we" and "our", unless in a quotation, which takes manual checking. And numbered references in the text without a reference list are usually copyvio, unless they forgot to include the {{Reflist}} template -- which takes manual checking. But these are for more straightforward problems than notability.
For notability, there are some positive indicators: M.P. or its equivalent is notability in enWP -- if it applies to the subject, & that is not easy to determine automatically. A word with an initial capital preceding the word Professor often indicates meeting WP:PROF -- again, if it applies to the subject. Similarly with awards of various sorts. Negative indicators for notability are much harder -- I cannot immediately think of any. DGG (talk) 23:19, 17 December 2014 (UTC)
  • Deletion is a complex set of processes on the English Wikipedia and I suspect other languages as well, and notability is one of the most difficult things to judge. A much better approach would be to look for the lower threshold of an assertion of importance or significance, articles that lack that can be speedy deleted. An automated notability detector would risk encouraging those deletionists who delete or tag for deletion articles they deem not notable as not having a plausible assertion of importance or significance. A more nuanced system that gave three results - clearly notable, not even a claim of importance or significance, and everything in between, would fit with current workflows and be less deletionist in impact. WereSpielChequers (talk) 07:57, 3 September 2015 (UTC)
  • Comment: by the way, where can we find the hand-coding form? Best, FoCuSandLeArN (talk) 19:49, 3 September 2015 (UTC)