Grants talk:IEG/Automated Notability Detection

From Meta, a Wikimedia project coordination wiki

As a service

How will we provide this notability score to other tools? I propose a web API (as a service) and a simple human UI hosted in WMFLabs. --EpochFail (talk) 14:57, 10 September 2014 (UTC)

Should we plan to hire someone to implement this, and add it to the grant? Or keep it as planned, for a later stage? Bluma.Gelley (talk) 12:20, 17 September 2014 (UTC)
Yes, I think it should be an outcome of the grant, and should be in the budget, with plans for recruiting the person who will do this. Jodi.a.schneider (talk) 15:31, 24 September 2014 (UTC)

Which kind of reviewers?

It would be good to explain which kind of reviewers you think this would be helpful for: AfC, NPP, people trolling categories for cruft, ... All of the above? Jodi.a.schneider (talk) 18:38, 10 September 2014 (UTC)

Proof of concept

Thanks for joining today's IEG Hangout. I am looking forward to seeing this proposal develop! Particularly will be curious to see how you decide to move forward with any ideas for putting a lightweight prototype in the hands of real Wikipedian testers as part of this project - ultimately, I'd love to see something heading towards practical application of the classifier, with some measures of success indicating that the classifier is not only accurate, but useful for active community members. Cheers, Siko (WMF) (talk) 23:51, 16 September 2014 (UTC)

Which WikiProjects?

Which WikiProjects would be the best to ask for help in creating the training set? Shall we request feedback from them now? One list of WikiProjects is at en:Category:Wikipedia_WikiProjects or there's a categorized list here. I guess it should be a good subset of content-based ones? Looking at what's declined at AfC also might give ideas of the topics of importance... Or else Wikipedia:Notability and its many, many subguidelines... Let me know how I can help! Jodi.a.schneider (talk) 20:42, 17 September 2014 (UTC)

Possibly contacting the Articles for Creation and NPP people for feedback on how they might find such a tool useful. --Kudpung (talk) 16:09, 18 September 2014 (UTC)
Good thought, Kudpung -- they've both been notified. This project would rely on having some initial data -- about notability (or not) of articles on some given topics. That would come from experts from certain WikiProjects -- I think it would be good to establish which ones, and get some feedback/buy-in. Jodi.a.schneider (talk) 09:14, 19 September 2014 (UTC)
Suggest AfC and NPP. The subject-matter specific WikiProjects are probably too narrowly focused to assess a general-use tool, unless this tool is intended for a large subset of articles (such as historical events or biographies). VQuakr (talk) 04:07, 20 September 2014 (UTC)
I guess the question is whether this is going to be based on en:WP:GNG or whether it's going to be based on subject-specific notability guidelines. Thoughts? Jodi.a.schneider (talk) 15:32, 24 September 2014 (UTC)

Functionality suggestions

I am coming from new pages patrol on en.wiki, where I normally use the new pages feed (NPF) tool [1]. My thoughts are in the context of that tool, and may or may not apply well to other similar applications. That tool currently lists new pages, along with pertinent information such as the creator's edit count, the first sentence or so of article content, and whether it contains categories and/or references. A tool that identified topics that might not meet the General Notability Guideline would complement this information. Given the terse nature of the NPF interface, I think the automated notability assessment would best be indicated by a single scalar score, maybe a 0-100% scale, possibly color-coded red to green. It could be clickable to open a secondary screen with additional analytical information, with a bug/failure reporting link (at least in alpha) for false negatives and false positives. It would be nice if the secondary screen also linked potentially useful (but unused in the article) sources. Triggers for lowering the notability score could include presence in social media but not reliable sources, use of only primary and vanity sources in the article without better sourcing available, and close paraphrase detection from corporate websites and/or social media.
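As a rough illustration of the single scalar score with red-to-green coloring suggested here, the display logic could look like the sketch below. The function names and the linear color ramp are assumptions made for illustration only; nothing here reflects how the NPF tool or the proposed classifier is actually built.

```python
def score_to_color(score: float) -> str:
    """Map a score in [0, 1] to a hex color on a linear red-to-green ramp.

    Hypothetical helper: 0.0 is pure red, 1.0 is pure green.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    red = round(255 * (1.0 - score))
    green = round(255 * score)
    return f"#{red:02x}{green:02x}00"


def format_score(score: float) -> str:
    """Render the score as a whole-number percentage for a terse feed row."""
    return f"{round(score * 100)}%"
```

A feed row would then show something like `format_score(s)` on a background of `score_to_color(s)`, with the detailed breakdown left to the secondary screen.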

Ironholds might have thoughts on how to test modifications integrated into the NPF. Good luck! VQuakr (talk) 03:19, 19 September 2014 (UTC)

Thanks for the suggestion, @VQuakr:. Right now, our proposal is just for creating the scoring system, rather than designing a tool to make use of it. Hopefully, when we get to that stage, your ideas can be incorporated into the design. 173.3.194.165 15:11, 21 October 2014 (UTC)

Validity of results, etc

I endorsed this at first, but then I had second thoughts after reading the reply. It seems like the initial description isn't what the contributors of the proposal are planning, and that makes me confused. Then I read the research paper [2] and got even more confused.

If you train any kind of classifier through supervised learning you need two training sets, one that represents your wanted class and one for the unwanted class. Both must be representative. In this case it seems like the wanted ones are taken from older articles that have evolved past the initial state of the unwanted ones. When you learn from two classes that are at two different stages in their lifespan, you will have inherent differences that are not comparable. That makes the classification task much easier, and will give precision and recall that are much better than in a real-world case.

Then you have the problem of how you present this information to the users and how the algorithm itself will change the process outcome. If it does change the outcome, will it then learn the new outcome and slowly game the process over time? And if the creators of new articles can observe the effect, will they be able to adapt their new articles to get a higher score and bypass the algorithm? Reading the paper it seems like the proposal is to make this a reviewer-only tool for notability (really deletion), and then it won't contribute to overall quality, but from the reply to my endorsement it seems like it is a quality tool for everyone. How the community would interact with such a tool is very important, especially whether they would take the output from the classifier as fact, but the discussion about this in "8. Limitations and concerns" in the paper is somewhat short and builds on a number of unsubstantiated hypotheses.

A very rough idea of a general tool like this would be to add a "feature export" option to AbuseFilter (yeah it should be renamed EditFilter) and then make a classifier that can use the export to train against some logged value. The obvious problem here is how to identify a positive outcome. One option is to log both outcomes from a feature export. Until there is sufficient data it should be possible to manually make training and test sets. Often initial sets can be made from the testing feature in AbuseFilter. It should be possible to reevaluate the classifier as it evolves against the old test set, and how its precision and recall evolve should be logged and presented graphically.
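The idea of re-evaluating each classifier revision against the same fixed test set, and logging precision and recall over time, can be sketched as follows. This is only an illustration under stated assumptions: the names `precision_recall`, `log_evaluation` and `history` are hypothetical, labels are plain booleans, and no AbuseFilter export is involved.

```python
from typing import List, Tuple


def precision_recall(predicted: List[bool], actual: List[bool]) -> Tuple[float, float]:
    """Compute precision and recall for the positive class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# One (version, precision, recall) entry per classifier revision,
# always measured against the same held-out test labels.
history: List[Tuple[str, float, float]] = []


def log_evaluation(version: str, predicted: List[bool], actual: List[bool]) -> None:
    """Append this revision's scores so drift over time can be plotted."""
    precision, recall = precision_recall(predicted, actual)
    history.append((version, precision, recall))
```

Plotting `history` over successive versions would make it visible if the classifier starts to move in an unwanted direction.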

A tool like I have sketched would be more general and could be used for a lot more than just deletion requests. It would also be visible if it starts to propagate in an unwanted direction over time.

In short I think the proposal needs more discussion than the current timeframe allows. Perhaps it could be run more as a feasibility study. — Jeblad 14:18, 22 September 2014 (UTC)

Hi Jeblad, thanks for the detailed thoughts. This is a new project, distinct from the paper you reference above (the methodology is similar as I understand it, but I'm not the ML expert; that's Bluma.Gelley).
The plan is to take a *new* data set (we've been discussing exactly how we'll select and collect that); this will be from the last year (unlike the dataset used in the other paper). Further, we plan to look at each article t seconds after it was started (for some t, not yet decided). Does that address your concern about the classifier? We'd really welcome any thoughts on collecting the training set. I'll look more at the rest of your comment later. Jodi.a.schneider (talk) 15:41, 24 September 2014 (UTC)
@Jeblad: , we've been giving a lot of thought to your comments - thanks! We are definitely aware of the complexity of what we're trying to do, and the potential pitfalls. We do believe, though, that the problems Wikipedia is facing with regard to reviewer shortages at AfC, newcomer retention dropping, possibly overeager deleters, and the like can be helped by technology. The technology is just a tool, though - it's up to the humans using it to make sure that it's used in a way that helps and doesn't harm. That is, your objections and suggestions are all great, and they will be part of what gets taken into consideration when designing tools to make use of the scores the classifier produces. That's a bit further in the future, and I think that the simple classifier itself has validity on its own. 173.3.194.165 15:28, 21 October 2014 (UTC)
I've been toying with some ideas that could be useful and are somewhat similar: establishing a baseline measure of whether some statement seems to be common knowledge. If it's not, then the editor should be given a hint that the statement needs a reference. It is not about proving that the statement is correct; it is about checking whether the coverage on the web of the topic from the statement is sufficient given the subject.
It will be interesting to hear about this project, especially if you get funding. — Jeblad 02:38, 22 October 2014 (UTC)
That sounds like an interesting project @Jeblad:. You might be interested in the use of textual entailment, described in my colleagues' work: Cabrio E., Villata S., Gandon F. A Support Framework for Argumentative Discussions Management in the Web. 10th Extended Semantic Web Conference (ESWC 2013), LNCS, vol. 7882; p. 412-426, 2013. Jodi.a.schneider (talk) 06:38, 22 October 2014 (UTC)

Reminder to finalize your proposal by September 30!

Hi there,

  • Once you're ready to submit your proposal for review, please update the status (|status= in your page's Probox markup) from DRAFT to PROPOSED, as the deadline to propose for this round is September 30th.
  • Let us know here if you've got any questions or need help finalizing your submission.

Cheers, Siko (WMF) (talk) 20:55, 26 September 2014 (UTC)

Suggestions

[Note: I was w:en:WP:CANVASS'ed to come here by Jodi Schneider, based on a mailing list post related to my work at w:en:Wikipedia:WikiProject_New_Zealand/Requested_articles/New_Zealand_academic_biographies, I'm assuming that this is kosher in the grants process.]

  1. The operative parts of the proposal / work need to be rewritten from "notability" to "evidence of notability". Without a huge knowledge of the real world the algorithm is going to be unable to judge notability, but judging the evidence of notability in the draft explicitly constrains the scope to the examination of the draft alone. The term 'evidence' is used frequently in w:en:WP:GNG and commonly in deletion rationales. On en.wiki 'notability' is the outcome of a consensus decision-making activity. 'Evidence of notability' is anything that feeds into that consensus decision.
  2. I suggest that rather than solving the (relatively hard) binary problem of notability, a set of overlapping problems of increasing difficulty are solved. Insertion of w:en:Template:Peacock templates should be pretty trivial, insertion of w:en:Template:Advert slightly harder, etc. See w:en:Category:Wikipedia_articles_with_style_issues and w:en:Category:Wikipedia_article_cleanup for suggestions. Even if the notability problem is not solved (or solved in a manner the community finds unacceptable) the tool will still be useful.
  3. I suggest that the tool be pitched not at reviewers, but at article creators. This would allow better, more immediate feedback to more users to enable them to grow as editors and improve their drafts in the short term and removes the delay of a reviewing queue. Automated tagging of a draft after it had been idle (unedited) for ~24 hours might be suitable (but require a bot approval process).
  4. I suggest that the domains used in URLs in references are likely to be a useful attribute for machine learning
  5. I suggest that a manual labelled corpus is already available by looking at the articles that have been declined in the past. Very informative will be articles which have been declined, are improved and then accepted, since these are matched pairs of articles, one with evidence of notability and one not.
  6. I volunteer to help in the manual labelling.
  7. I volunteer to help with other aspects (I'm a coder with a PhD in a relevant field)

Stuartyeates (talk) 21:21, 1 October 2014 (UTC)
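Suggestion 4 above (using the domains of URLs in references as a machine-learning attribute) could be prototyped roughly as below. This is a minimal sketch, not the project's feature extractor: the regular expression is a crude assumption about how URLs appear in wikitext, and `reference_domains` is a hypothetical name.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Crude URL matcher: stops at whitespace and at common wikitext delimiters.
URL_PATTERN = re.compile(r"https?://[^\s|\[\]<>}]+")


def reference_domains(wikitext: str) -> Counter:
    """Count how often each domain appears among the URLs in a draft."""
    domains: Counter = Counter()
    for url in URL_PATTERN.findall(wikitext):
        host = urlparse(url).hostname
        if not host:
            continue
        # Treat "www.example.org" and "example.org" as the same domain.
        if host.startswith("www."):
            host = host[4:]
        domains[host] += 1
    return domains
```

The resulting counts could feed a classifier directly, or be compared against a list of sites that are commonly cited as evidence of notability.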

It appears that I misread your proposal. You're trying to measure the "absolute notability of topic" rather than the "evidence of notability in the article/draft"? That's altogether a separate kettle of fish and I'll have to think about suggestions for that. Stuartyeates (talk) 22:24, 1 October 2014 (UTC)

Here's my second crack at suggestions:

  1. Understand that notability is a deeply contested concept on en.wiki (and probably other wikis) and take steps to deal with this; otherwise your work will become bogged down in dealing with people who have axes to grind about notability. For example, use a tool name and branding that is self-deprecating (words like 'helper' 'diviner' 'assister' etc)
  2. Attempt to expose at least some of the underlying logic in a useful way; "probably notable based on results from WorldCat, The Moscow Times and The Economist" (with links) is infinitely more useful than "notable with 86.56% confidence"
  3. A tool that can be seen to contribute constructively to AfD as well as at AfC is likely to instil confidence in / understanding of its workings.
  4. Trawl through old AfDs looking for sites which are commonly used as evidence of notability
  5. Have a strong disclaimer and be upfront about w:en:WP:BIAS and things your tool can't possibly be good at (due to non-digitisation of content, lack of authority control, etc).
  6. Don't ask wikipedians whether a topic is notable, ask whether it's notable in their opinion, or what their !vote would be.
  7. Maintain a page / subpage with things that either editors or subjects could have done that might have heavily influenced your algorithm ("The algorithm was confused by the ubiquity of the word 'Einstein' but that could easily have been solved if he'd got himself an ORCID like all real physical scientists"; "This 'John Smith' is indistinguishable from all of these w:en:John Smiths, but we might have had a chance if we'd known dates of birth and death for VIAF.")

Hopefully I've correctly understood what you're trying to do this time. Stuartyeates (talk) 20:39, 2 October 2014 (UTC)

Eligibility confirmed, round 2 2014


This Individual Engagement Grant proposal is under review!

We've confirmed your proposal is eligible for round 2 2014 review. Please feel free to ask questions and make changes to this proposal as discussions continue during this community comments period.

The committee's formal review for round 2 2014 begins on 21 October 2014, and grants will be announced in December. See the schedule for more details.

Questions? Contact us.

Jtud (WMF) (talk) 23:27, 2 October 2014 (UTC)

Data source

I don't like that this project focuses on the English Wikipedia. If you think it's easy to extend to other languages, then you must commit to do so. If not, it's useless.

If you're trying to predict what will be deleted, you'll need to know and analyse what has been deleted in the past, but the proposal doesn't mention a source for this data. Speedydeletion Wikia and deletionpedia maybe? I shouldn't be left guessing. mw:Extension:BayesianFilter already crashed against this wall in the past, for instance, so I think this is important to know. --Nemo 09:40, 17 October 2014 (UTC)

Nemo, a working notability detection system for English Wikipedia would not be useless -- at least this is what our research seems to suggest. Further, English Wikipedia is our home wiki. There are many reasons why it is appropriate that we start there. Regardless, extending to other languages only makes sense once we show that the system can work. The strategies that we plan to employ will be a mixture of language agnostic ones as well as some that are language specific. We will use standard programming practices to make sure that this is abstracted.
Now to address your actual question about our data source. We're not going to look at articles that have been deleted. We're going to use a manual hand-coding scheme to label articles by whether they cover notable topics or not. See activity #1 in the project plan. However, if you *do* want to get a good source for deleted articles, I suggest that you check out Special:Log. Here's an example call that lists deleted pages [3]. --EpochFail (talk) 13:50, 17 October 2014 (UTC)
Here's another way that you can get at the data with R:Quarry: [4] --EpochFail (talk) 13:53, 17 October 2014 (UTC)
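For readers without access to the linked examples, a request for recent deletion log entries through the MediaWiki Action API can be built along these lines. This is a sketch that may differ from the call linked above; it only constructs the URL and does not contact the wiki, and `deletion_log_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# English Wikipedia's Action API endpoint.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def deletion_log_url(limit: int = 10) -> str:
    """Build a query URL listing recent entries from the deletion log."""
    params = {
        "action": "query",
        "list": "logevents",   # the log events list module
        "letype": "delete",    # restrict to the deletion log
        "lelimit": limit,      # number of entries to return
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"
```

Fetching the resulting URL returns log entries (title, timestamp, acting user, comment), not the deleted article text itself.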

Supporting human notability judgements

Hi @DGG:, sorry not to answer earlier, I'm getting email for talk page changes but not main page changes on this IEG. I'm really glad to see you commenting here because you're a real expert on notability and all the problems that go with our systems for determining it. I completely agree that it's difficult to sort out the "totally impossible" articles from the reasonable ones. My goal here is to support triage of AfC drafts. None of the three people putting the project forward want to see it used to promote deletion; there's no way we can automate AfD decisions. Bluma, the machine learning expert here, got interested in the problem in order to SAVE A7 CSD's actually. Aaron is mostly working on newcomer socialization (and deletion is one of the things going wrong there).

I don't think that notability will ever be completely automated. But I do believe that we can help people assess notability, by bringing more information together into the same place. We're planning to create an API, which gives an indication ONLY when something is PROBABLY notable, along with a rationale: e.g. (as Stuart suggested above): "probably notable based on results from WorldCat, The Moscow Times and The Economist" (with links). So our goal is to provide a rough indication to humans -- who will apply their own judgement.
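To make the intended behaviour concrete, here is one possible shape for such an output: an indication is produced only when a topic is probably notable, and it always carries the sources behind it. The `Source` and `Assessment` types and the `rationale` function are illustrative assumptions, not a specification of the planned API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Source:
    name: str
    url: str  # link shown alongside the rationale


@dataclass
class Assessment:
    probably_notable: bool
    sources: List[Source] = field(default_factory=list)


def rationale(assessment: Assessment) -> Optional[str]:
    """Return a human-readable rationale, or None when there is nothing to assert.

    No message is produced for topics that are not flagged as probably
    notable, so the output cannot be read as an argument for deletion.
    """
    if not assessment.probably_notable or not assessment.sources:
        return None
    names = [source.name for source in assessment.sources]
    if len(names) == 1:
        listed = names[0]
    else:
        listed = ", ".join(names[:-1]) + " and " + names[-1]
    return f"probably notable based on results from {listed}"
```

Returning `None` rather than a low score is a deliberate choice in this sketch: reviewers see supporting evidence when it exists and otherwise fall back on their own judgement.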

And while the algorithm would be automated, it would be an extrapolation of human judgement, which could be iteratively examined and critiqued. One of the first steps of this project would be to get some WikiProjects to contribute data, that is, to hand-assess topics (rather than articles). This would give a baseline (which topics are considered notable and not notable) and determine which decisions about notability are more difficult than others. (Aaron followed similar procedures when creating en:Wikipedia:Snuggle.)

If this project goes forward, I would hope that you'd be willing to be closely involved as a critic of what's not working! From that perspective, what this project would create is something to talk about -- judgements that could be critiqued and lambasted without upsetting a person who made them -- so that we could figure out what features are important. That kind of discussion could help us figure out if there's a common procedure experts go through to assess articles, and then create better support tools in the future. I can speak mostly about my own procedures (from studying AfD (where I am a rabid inclusionist) for my dissertation research; and from participating in AfC).

Longer-term, I personally would like to determine how other people determine notability, and whether there are any steps along the way where algorithms could help. Currently my rough model of how to determine notability is:

  1. Identify the article topic (e.g. determine whether the subject is an author, sportsperson, musician, musical group, company, .... to find which guidelines apply)
  2. Determine which sources are reliable in a given area (sometimes instantiated with custom search engines, e.g. Wikipedia:WikiProject Video games/Reference library has a custom search of magazine archives)
  3. Know what is "normal, routine" coverage for that topic (e.g. 2 links does not ensure that GNG is met; "mentions" are not enough)

And: Just because you haven't found sources, it doesn't mean that there aren't any. Understand which topics may be very notable without having online coverage, and the difficulty of finding some materials online (due to non-unique names, search engine "spelling correction", looking in the wrong kind of repository, etc).
Note that: Judgement on sources can be very subject-specific (especially reliability of sources, but also what kind/how much coverage is expected).

Does that seem similar to the process you use? Jodi.a.schneider (talk) 07:52, 22 October 2014 (UTC)

Aggregated feedback from the committee for Automated Notability Detection

Scoring criteria (see the rubric for background). Scores range from 1 (weak alignment) to 10 (strong alignment).
(A) Impact potential: 7.1
  • Does it fit with Wikimedia's strategic priorities?
  • Does it have potential for online impact?
  • Can it be sustained, scaled, or adapted elsewhere after the grant ends?
(B) Innovation and learning: 6.3
  • Does it take an innovative approach to solving a key problem?
  • Is the potential impact greater than the risks?
  • Can we measure success?
(C) Ability to execute: 6.9
  • Can the scope be accomplished in 6 months?
  • How realistic/efficient is the budget?
  • Do the participants have the necessary skills/experience?
(D) Community engagement: 5.9
  • Does it have a specific target community and plan to engage it often?
  • Does it have community support?
  • Does it support diversity?
Comments from the committee:
  • Truly innovative.
  • This will make use of ideas that have previously worked - machine learning, feedback through users (as in the case of ClueBot).
  • Risk of unwanted results / action based on the results seems quite high. Of particular concern is the possibility that the tool would exacerbate existing problems with regard to deletion of new contributors' articles. These risks should be explicitly considered and appropriate mitigation proposed. We see that participants have considered a number of these views and can likely build in safeguards to alleviate some of these problems. It is important to take into account that they shouldn't be trying to foster deletionism.
  • Some risk of reducing collaborative efforts, one of the core assets of Wikipedia.
  • There is a clear plan for a target community - the AFC review process - however community support is mixed. The team should ensure that they communicate further with stakeholders who may oppose the project, so that they can understand their concerns and build a way to alleviate them. There is lots of community engagement so far. To have long-term impact, a significant degree of community engagement and support will be needed.
  • We would appreciate further community engagement on the non-English language wikis – illustrating an understanding of the AFC equivalent process for other wikis, for example, so it can be kept in mind when developing the tool, keeping scale in mind.
  • Impact would be online, and if successful, the tool could be adapted elsewhere.
  • It may be beneficial to consider whether the tool would be more useful if it were to output the analysis underlying the notability score (i.e. the actual notability "feature" metrics), which could be used to justify a notability decision or directly improve the article in question, rather than the score itself.

Thank you for submitting this proposal. The committee is now deliberating based on these scoring results, and WMF is proceeding with its due-diligence. You are welcome to continue making updates to your proposal pages during this period. Funding decisions will be announced by early December. — ΛΧΣ21 16:59, 13 November 2014 (UTC)

IP edits

Someone is editing the proposal page from an IP address (see recent history), and changing important information. I take it that this is one of the prospective grantees? Please confirm so we know to trust the info on the page :) Cheers, Jmorgan (WMF) (talk) 19:51, 20 November 2014 (UTC)

I'm sorry, this is me. <insert abashed face here> I am supposed to stay logged in, but it doesn't always work for some reason, and I don't always notice in time. Is there a way to "claim" IP edits after they've been made? Bluma.Gelley (talk) 01:59, 23 November 2014 (UTC)
Hi Bluma.Gelley. That's no problem! I just wanted to make sure the changes were legit... particularly since they involved the proposal budget ;) There's currently no way to re-assign edits after the fact, but (as a precaution, to protect your privacy) I've hidden the IP address of the recent edits you made while logged out. Cheers, Jmorgan (WMF) (talk) 23:13, 25 November 2014 (UTC)
Thanks a lot, Jmorgan (WMF)! Bluma.Gelley (talk) 02:44, 12 December 2014 (UTC)

Comments based on IEG committee discussion 22-nov2014

After reviewing this proposal again I would like to comment that this proposal makes no mention of including Wikidata in the process. There are tons of articles in non-English Wikipedias that meet the notability requirements for the English Wikipedia, and a quick check is possible using the Authority control template. This bundles various notability checks in one fell swoop. See for example how the Wiki page for en:Barend Cornelis Koekkoek gives quite a list of available sources from the authority control box, based on one parameter only (VIAF via www.viaf.org). Of course, Wikidata items may also be referenced in other ways than sources included in the Authority control wikiproject. Jane023 (talk) 18:50, 23 November 2014 (UTC)

Thanks Jane023. These types of suggestions will really help us to build up a feature set that has good signal. Keep 'em coming!
I suspect that it's very uncommon for new articles to be linked to a Wikidata item while still a draft. But for those that are, I agree that a link to a Wikidata item -- especially one that is linked to other Wikipedia articles -- may provide a strong signal for notability.
As for {{Authority control}}, I'd imagine that it's a similar story, but I look forward to finding out how early that type of template tends to get added to draft articles.
When we start gathering features, we'll be looking for which features tend to be present in the early days of a draft article since it's rescuing those articles that we're focused on. --EpochFail (talk) 01:08, 25 November 2014 (UTC)

Round 2 2014 decision


Congratulations! Your proposal has been selected for an Individual Engagement Grant.

WMF has approved partial funding for this project, in accordance with the committee's recommendation. This project is funded with $8775.

Comments regarding this decision:
We look forward to seeing this project work closely with community members to explore how machine learning might help support human decision-making around notability. As API-building costs remain unclear - they may or may not be needed depending on who takes this on - we’re approving your budget without this cost for now. WMF will be happy to approve additional budget to fund the API build at the point in project when this is needed.

Next steps:

  1. You will be contacted to sign a grant agreement and setup a monthly check-in schedule.
  2. Review the information for grantees.
  3. Make any necessary scope adjustments to your proposal page, as discussed with grantmaking staff.
  4. Use the new buttons on your original proposal to create your project pages.
  5. Start work on your project!

Questions? Contact us.


2015 discussion

I'm opening a medium for discussion here for those of us who want to provide feedback yet are not willing to endorse any definite views as to whether this should be indeed selected for financing.

I'll begin by stating that although many users have presented skeptical positions about the implementation of such an automatic system, I see no harm in approaching this as a convenient tool which might present interesting data, which for that fact alone should be considered viable. I agree with DGG's comments on the Grants page that reviewing the notability of an article's subject is an inherently human ability as of this time, but that doesn't mean an automated tool couldn't be of great help at AfC, as Joe Decker, Chess and Jodi.a.schneider aptly pointed out.

Through its implementation we could at least establish a preliminary selection of possibly non-notable topics for an experienced set of editors to review especially carefully. As the system works now, one can only sort submissions by date. It would go a long way towards solving our constant backlog if we could add topic categories (as is currently being discussed) and simultaneously predetermine (in a similar fashion to earwig's copyvio tool - such as using percentages) a topic's notability based on references presented or a set of established source checks (weighted according to number of mentions in any given source, for example). This numerical data would only be used as a guide for the simple purpose of handling submissions more efficiently at first, but it could run through a set of 1000 or so daily mainspace articles via bot, providing a list of potential articles needing notability checks as an output; we could even set up an appropriate WikiProject for reviewing said listed articles. Anyway, these are my 2 cents to get this conversation going. Cheers, FoCuSandLeArN (talk) 19:41, 3 September 2015 (UTC)

Please contact WMF immediately

Hello Bluma.Gelley,

I have not been able to get a hold of you via the email address I have for you. I need to speak with you urgently about finalizing a plan for your project. There is no problem, but I need to let our Finance Department know your plan by May 15. Can you contact me at mjohnson (_AT_) wikimedia  · org at your earliest convenience? If we don't hear from you, we will have to mark this project as incomplete.

If anyone else reading this knows how to contact Bluma, please let me know!

Warm regards, Marti (WMF) (talk) 05:10, 9 May 2017 (UTC)