Research talk:Revision scoring as a service/Archived

From Meta, a Wikimedia project coordination wiki
See also: Grants talk:IEG/Revision scoring as a service

Work log


Progress report: 2015-10-17

Hello all, your weekly report.

That was your weekly report. -- とある白い猫 chi? 14:34, 2 November 2015 (UTC)

Progress report: 2015-10-24

Hello all, your weekly report.

  • Travis-ci fixed for revscoring. T116397 [1]
  • Testing coverage reports for revscoring. T116402 [2]
  • Draft implementation SigClust in Python. T113761
  • Features from reverted edits extracted to be used in clustering. T110580
  • All Wikibase-related parts of pywikibot are pulled out to pywikibot/wikibase and used as a submodule. T108440

That was your weekly report. -- とある白い猫 chi? 14:41, 2 November 2015 (UTC)

Progress report: 2015-10-31

Hello all, your weekly report.

  • Features for a balanced set of reverted/not-reverted edits in Wikidata extracted. T116983
  • Wikidata revert detection model trained and tested. T116980
  • Wikidata revert model deployed to ORES just in time as a present for the Wikidata:Third Birthday. T116984
  • Configurable logging setup support for ORES implemented. T108421

That was your weekly report. -- とある白い猫 chi? 14:50, 2 November 2015 (UTC)

Thread on Toxic communities from wikimedia-l

See wikimedia-l thread: "On toxic communities"

There's a thread that started recently about aggressive behaviors in community spaces. See the wikimedia-l thread "On toxic communities". I replied to say that I thought there was some untapped potential in getting a dataset of on-wiki discussions -- possibly labeled by "toxicity" or "aggressiveness" -- out there for researchers to study and for machine learning projects like ours to try to make predictions with. Fluffernutter offered to help us gather a labeled dataset. I wanted to start a thread here to gauge interest on three components of this:

Open "conversations" dataset
I've already started work on building a common talk page parser for Wikimedia projects. See github.com/halfak/talk-parser. If we could finish that parser and start releasing regular datasets it generates, that would lubricate the gears of science in this area.
Labeled data
If we have a good dataset of interactions between editors in discussion spaces, we could run various subsets through wiki labels to gather human judgement about the aggressiveness of conversations. This is something that Fluffernutter has volunteered to help us with. Such a labeled dataset would both help basic research into aggressiveness/toxicity and help us potentially build useful models for inclusion into ORES.
Revscoring/ORES model
We have a lot of options in revscoring since the "Scorer" pattern is very general. We don't just have to make basic probabilistic predictions about whether a discussion posting is "toxic" or not. A Scorer could also flag words/phrases that a user should be cautious about using when posting a message. Just so long as we can fit this "score" in a JSON document, we won't have to change revscoring or ores to handle it.

Thoughts? --EpochFail (talk) 16:41, 19 November 2015 (UTC)

Glad to hear it. See also Fluffernutter's comments here: 2015_Community_Wishlist_Survey#Machine-learning_tool_to_reduce_toxic_talk_page_interactions. --Andreas JN466 05:29, 24 November 2015 (UTC)
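
As an illustration of the "Revscoring/ORES model" idea in this thread, here is a minimal sketch of a score that flags cautionary words and still fits in a JSON document. It is not revscoring's Scorer API: the function name `score_posting`, the `CAUTION_WORDS` list and the field names are hypothetical, and the ratio is a crude stand-in, not a trained toxicity model.

```python
import json
import re

# Hypothetical word list, for illustration only.
CAUTION_WORDS = {"idiot", "stupid", "shut"}


def score_posting(text):
    """Return a JSON-serializable "score" for one discussion posting.

    The score lists the flagged words and the share of words that were
    flagged. This is a toy heuristic, not a probabilistic prediction.
    """
    words = re.findall(r"[a-z']+", text.lower())
    hits = [w for w in words if w in CAUTION_WORDS]
    return {
        "flagged_words": sorted(set(hits)),
        "flagged_ratio": round(len(hits) / max(len(words), 1), 3),
    }


# The score is a plain dict, so it serializes to JSON without extra work.
print(json.dumps(score_posting("Don't be stupid, read the policy."), sort_keys=True))
```

Because the result is an ordinary dictionary, a richer score (for example, one that adds character offsets of the flagged words) would serialize the same way.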

Progress report: 2015-11-07

Hello all, your weekly report.

  • We added language features for Dutch, German and Italian. T107590 T109367 T107591
  • Parallelism added to feature extraction. T117422
  • Duplicate clustering with old kmeans strategy T117253
  • Added trim() function for gathering basic (non-modified) features T117424
  • [Spike] Trained a model on sample of 100K edits for wb-vandalism T117258
  • [Spike] Figured out why clustering is behaving weird T118003
  • Compare R sigclust to python sigclust implementation. T118004

That was your weekly report. -- とある白い猫 chi? 07:38, 30 November 2015 (UTC)

Progress report: 2015-11-14

Hello all, your weekly report.

  • We generated a Revert Model for German, Hebrew, Indonesian, Italian, Dutch and Vietnamese Wikipedias T118314 T118316 T118317 T118318 T116937 T118319
  • We deployed to ORES the model generated from the Edit Quality campaign for Turkish. T118008
  • We launched an Edit Quality campaign on wikilabels on Russian, Ukrainian, Spanish and Dutch Wikipedias. T116478 T114502 T114507 T115210
  • We set up backpressure for ORES to limit queue sizes in Celery. T115534
  • We deployed new revert models to ORES. T118564
  • We implemented soft thresholding in Python sigclust. T118583
  • Testing python sigclust for relationship between full cluster & damaging clusters T116403
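
The backpressure item above can be pictured with a small sketch: a bounded queue that refuses new work once it is full, so callers are told to retry instead of the backlog growing without limit. This uses only the Python standard library and does not reflect how ORES or Celery implement it; `BoundedScoreQueue` and `QueueFull` are hypothetical names.

```python
import queue


class QueueFull(Exception):
    """Raised when the bounded queue cannot accept more work."""


class BoundedScoreQueue:
    """A queue of revision IDs that rejects work beyond max_size."""

    def __init__(self, max_size):
        self._queue = queue.Queue(maxsize=max_size)

    def submit(self, rev_id):
        # put_nowait raises queue.Full instead of blocking the caller.
        try:
            self._queue.put_nowait(rev_id)
        except queue.Full:
            raise QueueFull("queue is full; try again later") from None

    def take(self):
        # Removing an item frees one slot for the next submit.
        return self._queue.get_nowait()
```

A web front end using such a queue could translate `QueueFull` into an "overloaded, retry later" response rather than accepting the request.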

That was your weekly report. -- とある白い猫 chi? 08:07, 30 November 2015 (UTC)

Progress report: 2015-11-21

Hello all, your weekly report.

  • We expanded the number of features of the Wikidata revert detector. T117254
  • Security Review of Revscoring and some dependencies T110072

More to come! -- とある白い猫 chi? 06:26, 2 December 2015 (UTC)

Progress report: 2015-11-28

Picking up on a number of ongoing tasks that did not make it to last week's report...

That was your weekly report. -- とある白い猫 chi? 06:45, 2 December 2015 (UTC)

Media coverage for Revscoring/ORES

Pageviews to ORES and Revscoring docs. A graph of daily pageviews for m:Objective Revision Evaluation Service and m:Research:Revision scoring as a service shows a sudden burst in interest after a post on the Wikimedia blog.

Hey folks! Over the past couple of weeks, I have been working with the WMF Communications department and User:DarTar to write a blog post about our project. After going through a few iterations, the comms team got kind of excited about the potential for media attention, so we reached out to a couple of reporters that we knew. Well, coverage of the project has blown up. I've lost count of how many interviews I have given. I'll use this post as a sort of summary of the articles that are out there about the project. Please feel free to extend the list if you find any more articles. --Halfak (WMF) (talk) 17:12, 2 December 2015 (UTC)

I created a dedicated subpage, Research:Revision_scoring_as_a_service/Media, for easy transclusion, cross-linking, etc. --DarTar (talk) 18:11, 6 December 2015 (UTC)

Progress report: 2015-12-04

Hello all, your weekly report of our progress:

  • Flake8 of aetilley/sigclust committed. T118730 [3] [4]
  • Parameter tuning utility implemented in Revscoring. T119769 [5]

That was your weekly report. -- とある白い猫 chi? 15:05, 1 January 2016 (UTC)

Progress report: 2015-12-11

Weekly report for your consumption!

  • Edit quality campaign for Wikidata. T120531 Wikidata:Edit labels
  • Implemented an ORES testing server that can be run against any wiki in a testing environment for vagrant. T120956 [6]
  • Edit quality campaign for Italian Wikipedia launched! T114505 w:it:Wikipedia:Labels
  • Revscoring hyperparameter tuning for all of the feature/label sets in editquality datasets. T121009
  • We had a spike for experimenting with using bag-of-words badwords features and general NLP strategies. T102343
  • We investigated an anomaly with vandalism detection on Water (Q283) caused by bad scaling in some features; fixed with wb-vandalism PR #17. T118731

That was the weekly report. -- とある白い猫 chi? 15:42, 1 January 2016 (UTC)

Progress report: 2015-12-18

Our weekly progress is detailed as follows.

  • We deployed a tuned random forest model for Wikidata. Tuning reports suggest that we can get a very high amount of fitness out of an RF model. See [7]. T121350
  • Init edit type campaign for English Wikipedia! w:en:Wikipedia:Labels/Edit types T117237
  • Edit quality campaign for Indonesian Wikipedia launched! w:id:Wikipedia:Labels T114506
  • Complete beta version of pcfg_scorer and approximate overhead. T121258
  • We deployed edit types pilot campaign for English Wikipedia to gather initial user feedback. T121713

That was your weekly report! -- とある白い猫 chi? 15:51, 1 January 2016 (UTC)

Progress report: 2015-12-25

Presenting our progress for your consumption.

  • We implemented SemanticOperationsSelector for edit types campaign. T121403
  • We implemented config merging for ORES (passwords and connection details). ORES should now be able to read multiple config files so that it can merge private or location-specific information into the public configuration. T122272 [8] This also required a new release of yamlconfig. [9]
  • We switched from an AOF+RDB to an RDB-only persistence strategy for the ORES Redis cache to minimize its file usage. T121658
  • "monolingualtext datatype is not supported yet" bug is fixed. T118565
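
To picture the config merging item above, here is a minimal sketch of a recursive merge in which later configuration documents override earlier ones. It works on plain dictionaries (as they would look after parsing config files) and is not the code from ORES or yamlconfig; `merge_configs` and the example keys are hypothetical.

```python
import copy


def merge_configs(*configs):
    """Recursively merge dicts; values in later configs win."""
    merged = {}
    for config in configs:
        _merge_into(merged, config)
    return merged


def _merge_into(base, override):
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            # Both sides are mappings: merge key by key.
            _merge_into(base[key], value)
        else:
            # Copy so the merged result never aliases the inputs.
            base[key] = copy.deepcopy(value)


# Hypothetical public and private configuration fragments.
public = {"redis": {"host": "localhost", "port": 6379}, "workers": 4}
private = {"redis": {"password": "secret"}}
config = merge_configs(public, private)
```

With this approach the public file can live in a repository while the private file holding passwords stays on the server and is merged in at load time.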

That was your last weekly report of 2015! -- とある白い猫 chi? 16:03, 1 January 2016 (UTC)

Feedback, churnalism and 32C3 video on Watching Algorithms.

Prediction and Control - Watching Algorithms. Helsby (32c3)

"Autorenschwund in der Wikipedia: Algorithmen als Ursache und Lösung?". Netzpolitik. 2015-12-18.

I am quite frustrated with the awful en:Churnalism on ORES by German media, which is mostly blindly copying WMF blog posts and American media reports. German Wikipedia DOES NOT use revert-bots like enwiki, which makes all reference to Research:The Rise and Decline ("In order to maintain the quality of encyclopedic content in the face of exponential growth in the contributor community, Wikipedians developed automated (bots) and semi-automated tools (Huggle, Twinkle, etc.) to make the work of rejecting undesirable contributions waste as little effort as possible. (...) it was the successful implementation of algorithmic bureaucracy in form of bots that turned away larger portions of potential future editors.") pretty pointless with regard to German Wikipedia. --Atlasowa (talk) 20:36, 1 January 2016 (UTC)

Hey Atlasowa, we don't say that automated tools are the cause, but rather an exacerbating symptom of a larger switch toward restrictive quality control and primarily negative feedback for newcomers. German Wikipedia has flagged revisions and Huggle (and other tools, I imagine). IMO, Huggle's quality control dynamics are much more problematic than auto-revert bots that mostly deal in egregious damage because Huggle users interact with what's left -- mostly good-faith newcomers. But I don't want to just point at this. I think that new page patrol is equally problematic. Any time we have a filter in place that primarily affects newcomers and is not designed to help them learn and contribute productively, they won't.
Regarding en:Churnalism, I agree. While it is maybe good for me and ORES that the media parrots our framing of what we are doing, I don't think it suggests good things for humanity/society as a whole. There are a lot of people who are critical of the politics of algorithms in social spaces who I have reached out to for comment. Regretfully, none of them chimed in, so the media is a Wikimedia Blog echo chamber for this round. --EpochFail (talk) 15:24, 21 January 2016 (UTC)

Progress report: 2016-01-01

Presenting our progress for your consumption.

  • We investigated an issue with the Dutch Wikipedia Edit Quality campaign. Dutch Wikipedia's edit quality campaign got loaded with the wrong revision IDs. T122511
  • We looked into error-correcting output codes in scikit-learn. (Spike) T105517
  • We investigated how Chinese writing variants are stored in Chinese Wikipedia. (Spike) T119687

That was your first report of 2016! -- とある白い猫 chi? 19:03, 21 January 2016 (UTC)

Progress report: 2016-01-08

Weekly progress report is as follows.

  • We investigated the precision and recall of the Wikidata revert model to determine what portion of human edits will need to be reviewed. T122687
  • We introduced quality control and newcomer socialization tools with revscoring and ORES. These are as follows: T114246
    • Quality control tools
    • Newcomer socialization tools
    • MediaWiki integration
    • New model types
  • We created a MediaWiki extension for Wikilabels. This eliminates the reliance on the custom user script, which has proven to be confusing for some users. [10] T120664
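
The precision/recall question in the first item of this report can be made concrete with a small sketch: given model scores and known outcomes for a sample of edits, pick a threshold and compute precision, recall and the share of edits that would be sent to human review. The function name `review_stats` and the sample numbers are made up for illustration; they are not results from the Wikidata model.

```python
def review_stats(scored_edits, threshold):
    """Summarize what reviewing every edit scored >= threshold would mean.

    scored_edits is a list of (probability_reverted, was_reverted) pairs.
    """
    flagged = [(p, y) for p, y in scored_edits if p >= threshold]
    true_pos = sum(1 for _, y in flagged if y)
    total_pos = sum(1 for _, y in scored_edits if y)
    return {
        # Share of flagged edits that really were reverted.
        "precision": true_pos / len(flagged) if flagged else 0.0,
        # Share of reverted edits that were flagged.
        "recall": true_pos / total_pos if total_pos else 0.0,
        # Share of all edits a human would have to look at.
        "review_fraction": len(flagged) / len(scored_edits) if scored_edits else 0.0,
    }


# Made-up sample: (model score, whether the edit was actually reverted).
sample = [(0.9, True), (0.8, False), (0.7, True), (0.2, False), (0.1, False)]
stats = review_stats(sample, threshold=0.5)
```

Lowering the threshold raises recall at the cost of a larger review fraction, which is the trade-off the investigation was about.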

That was the weekly report. -- とある白い猫 chi? 19:03, 21 January 2016 (UTC)

Progress report: 2016-01-15

Weekly progress report is summarized below.

  • We resolved a bug in Wikilabels where some revisions did not load while old revision info remained. T122815
  • We implemented word frequency diff features. Badwords and similar terms are now weighted by how often they already appear in the article. For example, adding a curse word to the article about that word, or the word "Nazi" to an article on WW2, is not treated the same as adding it to other articles, where such additions are typically disruptive. T121003
  • We implemented common features between languages as a meta-language feature beyond simple space-delimited words. This paves the way for features for Chinese, Japanese and Korean (CJK) languages. T121008
  • We merged wb-vandalism features/datasources into revscoring. T122304
  • We implemented Meta datasource/feature refactoring for revscoring reducing code duplication. T121005
  • We implemented a balanced not-damaging/maybe-damaging edit extractor for "editquality". This is very useful for wikis dominated by bot edits, particularly smaller wikis. T120999
  • We added documentation on what Wikilabels "Campaigns" are for. T123129
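
The word frequency diff idea from this report can be sketched as follows: count each token before and after the edit, and scale the change by how common the token already was. This is an illustration only, not the revscoring implementation; the function names and the `+ 1` smoothing in the denominator are assumptions.

```python
import re
from collections import Counter


def token_frequencies(text):
    """Count lowercase word tokens in a revision's text."""
    return Counter(re.findall(r"\w+", text.lower()))


def frequency_delta(parent_text, current_text):
    """Change in count for every token whose count changed."""
    parent = token_frequencies(parent_text)
    current = token_frequencies(current_text)
    tokens = set(parent) | set(current)
    return {t: current[t] - parent[t] for t in tokens if current[t] != parent[t]}


def proportional_delta(parent_text, current_text):
    """Scale each change by how often the token already appeared.

    The + 1 avoids division by zero for new tokens (an assumption).
    """
    parent = token_frequencies(parent_text)
    delta = frequency_delta(parent_text, current_text)
    return {t: d / (parent[t] + 1) for t, d in delta.items()}
```

Adding one more "nazi" to a text that already contains it three times gives a proportional change of 0.25, while adding it to a text that never used it gives 1.0, so the second edit stands out more.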

That was the weekly report. -- とある白い猫 chi? 19:03, 21 January 2016 (UTC)

Progress report: 2016-01-22

Documenting our progress for the week.

  • We created Rule and Symbol objects in pcfg.py. This generalizes the types of rules that can be read into a PCFG object. T123759
  • We resolved a bug where ORES "r" flag did not work when grouping in recent changes is disabled. T122766
  • We determined how to build WP phrase-structure tree-bank. [11] T122728
  • We built a simple GUI for ORES. [12] T123348
  • We worked out issues with Sphinx in generating Revscoring docs where attributes were not being documented. T123124 T123758
  • We investigated and resolved RDB snapshot issue on ORES T122666

This was your weekly dose of our progress. -- とある白い猫 chi? 19:03, 21 January 2016 (UTC)

ORES UI visual JSON representation

Visual JSON mockup. A mockup of a JSON visualization is presented next to an ORES prediction

Ladsgroup has been working on a nice UI to sit on top of ORES. Currently, the UI uses a table to represent hierarchical data. I suggested we try some nested HTML divs or tables. I wanted to share a mockup of what I had in mind. See the mockup on the right. --EpochFail (talk) 15:53, 22 January 2016 (UTC)

@EpochFail: That looks similar to e.g. Schema:Analytics. Maybe there is some code which can be reused? Helder 19:07, 10 February 2016 (UTC)
Ooh! Good point. I was digging around looking for a library that would do this for us. I'll dig into the code that presents schemas to see if there's something we can re-use. Thanks for pointing that out! --EpochFail (talk) 20:09, 10 February 2016 (UTC)

Why real-time catch is important

I'm writing this short essay, Research:Revision scoring as a service/Why real-time catch is important. Please read and comment :) Amir (talk) 17:34, 23 January 2016 (UTC)

Two proposals for new ORES behaviors

We should provide a means to give features/datasources to the ORES API that it will use when scoring. This will allow users to, for example, see how an editquality score changes for the same edit between anon & registered users or see how an articlequality score changes with a few more references. I've posted two proposals (both of which I think are good) for how this could be accomplished. See also arlolra's WIP pull request: https://github.com/wiki-ai/ores/pull/115 --EpochFail (talk) 21:21, 9 March 2016 (UTC)
This one has been bugging me for a long time. A user of ORES should never be surprised when we switch from, e.g., a LinearSVC model trained on a balanced set to a GradientBoosting model trained on a representative set, but these two models produce very different score ranges. Still, we should have the flexibility to deploy new modeling strategies. This proposal describes supporting multiple models for the same "modeling problem" in the form of "variants" that would allow ORES users to continue using the same URL pattern they know and love as well as providing them the ability to specify a "variant" that will give better guarantees against sudden changes. This strategy would also allow us to continue updating and refining "variants" as we add new sources of signal. --EpochFail (talk) 21:21, 9 March 2016 (UTC)
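
The first proposal (caller-supplied features) can be illustrated with a toy sketch: extract features as usual, then let injected values override them before scoring. Everything here is hypothetical, including the feature names, the made-up scoring rule and the function `score_with_injection`; it does not show the ORES API or either posted proposal.

```python
def extract_features(revision):
    """Stand-in for real feature extraction (hypothetical feature names)."""
    return {
        "user_is_anon": revision["user_is_anon"],
        "chars_added": len(revision["added_text"]),
    }


def toy_score(features):
    """Made-up scoring rule; not a trained model."""
    score = 0.1
    if features["user_is_anon"]:
        score += 0.3
    if features["chars_added"] > 1000:
        score += 0.2
    return round(score, 2)


def score_with_injection(revision, injected=None):
    """Score a revision, letting caller-supplied features override extracted ones."""
    features = extract_features(revision)
    features.update(injected or {})
    return toy_score(features)


# Same edit, scored as made by an anonymous and then by a registered user.
rev = {"user_is_anon": True, "added_text": "x" * 10}
as_anon = score_with_injection(rev)
as_registered = score_with_injection(rev, {"user_is_anon": False})
```

Comparing `as_anon` with `as_registered` is the kind of "what if" question the proposal aims to support.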

FA questioned

I have nominated some articles based on revscore, but one got summarily rejected. en:Talk:Elizabeth_Catlett#GA_nomination. Duckduckstop (talk) 19:39, 4 April 2016 (UTC)

Hey Duckduckstop! Sorry for the delay. Thanks for letting us know about the false-positive. Generally, I would rely on the prediction models to help give a gist of the quality of an article. In the end, a real review from human eyes will be necessary. Still, it looks like there's some work that we could do in looking for obvious grammatical mistakes and other issues that were brought up. I've filed a task for that. See Phab:T132533. --EpochFail (talk) 23:49, 12 April 2016 (UTC)

Weekly update (April 8th)

Hey folks,

This is the weekly update for the Revision Scoring project for the week of April 2nd through April 8th.

New developments:

  • Solved some issues that block a major performance improvement for score requests using multiple models phab:T134781
  • Improved the performance of feature extraction for features that use mwparserfromhell phab:T134780
  • We applied regex performance optimizations to badwords and informal word detection for many languages phab:T134267

Maintenance and robustness:

  • Solved a regression in ScoredRevisions that caused most revisions in RecentChanges to not be scored phab:T134601
  • Set ORES load balancer to rebalance on 500 responses from a web node phab:T111806
  • Enabled CORS for error responses from ORES -- this makes it easier to report errors from a gadget on a wiki phab:T119325
  • Made the staging instance of Wikilabels look a lot more like the production instance phab:T134627

Stay tuned --EpochFail (talk) 21:11, 10 May 2016 (UTC)

[Cross-post] Including new filter interface in ORES review tool

The new filtering interface demo

Hey folks,

I made a post at mw:Topic:Tflhjj5x1numzg67 about including the new advanced filtering interface that the Collaboration Team is working on in the ORES beta feature. See the original post and add any discussion points there. --EpochFail (talk) 23:05, 18 November 2016 (UTC)

Moved this page!

The new home for this team is mw:Wikimedia Scoring Platform team. See you there! --Halfak (WMF) (talk) 22:25, 16 May 2017 (UTC)

Join my Reddit AMA about ORES

Hey folks, I'm doing an experimental Reddit AMA ("ask me anything") in r/IAmA on June 1st at 21:00 UTC. For those who don't know, I create artificial intelligences, like ORES, that support the volunteers who edit Wikipedia. I've been studying the ways that crowds of volunteers build massive, high quality information resources like Wikipedia for over ten years.

This AMA will allow me to channel that for new audiences in a different (for us) way. I'll be talking about the work I'm doing with the ethics and transparency of the design of AI, how we think about artificial intelligence on Wikipedia, and ways we’re working to counteract vandalism. I'd love to have your feedback, comments, and questions—preferably when the AMA begins, but also on the ORES flow board.

If you'd like to know more about what I do, see my WMF staff user page, this Wired piece about my work or my paper, "The Rise and Decline of an Open Collaboration System: How Wikipedia’s reaction to popularity is causing its decline" --EpochFail (talk) 15:42, 24 May 2017 (UTC)