Grants:IEG/Revision scoring as a service/Renewal/Final
Welcome to this project's final report! This report shares the outcomes, impact and learnings from the Individual Engagement Grantee's 6-month project.
Part 1: The Project
Summary
Our project aims to create an Artificial Intelligence (AI) infrastructure for Wikimedia sites as well as any other MediaWiki installation. We have successfully deployed a working revision scoring service that currently supports 5 languages (enwiki, fawiki, frwiki, ptwiki and trwiki), and we have made the deployment of new languages relatively trivial (specify language-specific features, or simply choose to use only language-independent features). In order to gather en:labeled data for new languages and machine learning problems, we also developed and deployed a generalized crowd-sourced labeling system (see Wiki labels) with translations for all of our supported languages.
We currently support two types of models:
- "reverted" -- Predicts whether an edit will need to be reverted. This is useful for counter-vandalism tools and newcomer support tools like en:WP:Snuggle
- "wp10" -- Predicts the Wikipedia 1.0 assessment rating of an article. This is useful to triage WikiProject assessment backlogs. (e.g. Research:Screening WikiProject Medicine articles for quality)
An "edit_type" model is currently in development.
To accomplish this, we have developed and released a set of libraries and applications that are openly licensed (MIT):
- revscoring -- a python library for building machine learning models to score MediaWiki revisions
- ores -- a python application for hosting revscoring models behind a web API
- wikilabels -- a complex web application built in python/html/css/javascript that uses OAuth to integrate an on-wiki gadget with a WMFLabs-hosted back-end to provide a convenient interface for hosting crowd-sourced labeling tasks on Wikimedia sites
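The core idea behind revscoring is to turn a revision or diff into a vector of features and hand it to a standard classifier. The following is a simplified stand-in using scikit-learn directly; it is not the actual revscoring API, and the features and training data shown are purely illustrative.

```python
# Simplified stand-in for the revscoring approach: hand-crafted features
# plus an off-the-shelf classifier. This is NOT the revscoring API itself;
# it only illustrates the general pattern.
from sklearn.ensemble import RandomForestClassifier

def extract_features(diff_text, comment, user_is_anon):
    """Turn an edit into a small (mostly language-independent) feature vector."""
    return [
        len(diff_text),                       # size of the change
        sum(c.isupper() for c in diff_text),  # shouting is a vandalism signal
        diff_text.count("!"),                 # excessive punctuation
        int(comment == ""),                   # missing edit summary
        int(user_is_anon),                    # anonymous editor
    ]

# Hypothetical labeled examples: (edit, was it reverted?)
X = [
    extract_features("lol u suck!!!", "", True),
    extract_features("Added a citation to Smith (2014).", "add ref", False),
]
y = [True, False]

model = RandomForestClassifier(n_estimators=10).fit(X, y)

# Probability that a new edit will need to be reverted.
new_edit = extract_features("asdfasdf!!!", "", True)
print(model.predict_proba([new_edit])[0])
```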
This project is hardly done and will likely (hopefully) never be done. We leave the IEG period working on developing new models, increasing the system's scalability and coordinating with tool developers (e.g. en:WP:Huggle) to switch to using the revscoring system.
Methods and activities
What did you do in your project?
Please list and describe the activities you've undertaken during this grant. Since you already told us about the setup and first 3 months of activities in your midpoint report, feel free to link back to those sections to give your readers the background, rather than repeating yourself here, and mostly focus on what's happened since your midpoint report in this section.
After our renewal in July 2015, we presented our project in both general and specific terms at Wikimania 2015. We discussed what Artificial Intelligence (AI) can do in general (wm2015:Submissions/Would you like some artificial intelligence with that?) to expose the Wikimedia communities to further applications of AI that could benefit them, as well as the specific scope of our project and its current impact (wm2015:Submissions/The Revision Scoring Service -- Exposing_Quality to Wiki Tools). One of our aims with this was outreach, so that the project could expand to more languages during the second six months.
We expanded our scope to many more language editions of Wikipedia; our current list includes the Arabic, Azerbaijani, German, English, Spanish, Estonian, Persian, French, Hebrew, Indonesian, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Turkish, Ukrainian, Vietnamese and Chinese editions. Furthermore, we invested a significant amount of our time in introducing quality control to Wikidata, where we also launched a Wikilabels campaign.
We also focused on the capacity of our system. One key consideration was tens of bots requesting scores on hundreds of revisions per language edition per second. This could mean, for example, 10 bots watching 20 wikis with 100 revisions in each wiki's RC feed (10×20×100 = 20,000 revision scoring requests per second), which corresponds to roughly 1.2 million revision scoring requests per minute. While we do not expect such usage immediately, with the adoption of our system by tools such as Huggle, this is actually quite a conservative number.
Furthermore, we needed to optimize the system in order to avoid falling behind the recent changes feeds of the wikis we serve. One optimization strategy we adopted was to score each revision only once per model version, so that every subsequent request is served an already computed score. Even so, we noticed that our system was too slow to meet the demand for scores. We therefore implemented pre-caching of the recent changes feed, where new revisions are scored before anyone requests them. Because our service runs on Wikimedia Labs infrastructure, which sits physically close to the Wikimedia production servers, we can pick up recent edits faster than external clients could, effectively providing instantaneous scores.
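A minimal sketch of that caching idea, under the assumption that scores can be keyed by wiki, model, model version and revision ID; the in-memory dictionary and function names are simplifications, not the service's actual storage layer.

```python
# Sketch of scoring each revision only once per model version: scores are
# cached by (wiki, model, model_version, rev_id), and the recent changes
# feed is used to fill the cache before anyone asks.
score_cache = {}  # an in-memory dict stands in for the real cache backend

def get_or_score(wiki, model_name, model_version, rev_id, score_fn):
    """Return a cached score if present; otherwise compute and store it."""
    key = (wiki, model_name, model_version, rev_id)
    if key not in score_cache:
        # Expensive step: feature extraction + model prediction.
        score_cache[key] = score_fn(wiki, rev_id)
    return score_cache[key]

def precache_recent_changes(wiki, model_name, model_version, rc_rev_ids, score_fn):
    """Score edits as they appear in the recent changes feed, before any
    client requests them, so later lookups are effectively instantaneous."""
    for rev_id in rc_rev_ids:
        get_or_score(wiki, model_name, model_version, rev_id, score_fn)
```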
Even this was inadequate to meet demand, so we needed to make our system scalable both horizontally and vertically. For this we adopted a Celery-based setup running parallel virtual servers. Not only does this provide redundancy in case one of the servers fails, it also provides load balancing, so that we can introduce more parallel servers as the load increases (horizontal scalability). Because our servers are virtual, we can also expand their individual resources over time as needed (vertical scalability) without keeping an unnecessary reserve of server resources. We intend to keep a minimal footprint on server resources without compromising our availability or speed.
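Below is a hedged sketch of that Celery layout: the web front end enqueues scoring tasks and any number of parallel workers consume them, which is what provides the redundancy and horizontal scalability described above. The broker URL, task name and scoring body are assumptions for illustration only.

```python
# Sketch of spreading scoring work over parallel workers with Celery.
# The broker URL and task body are illustrative assumptions.
from celery import Celery

app = Celery("revscoring_workers", broker="redis://localhost:6379/0")

@app.task
def score_revision(wiki, model_name, rev_id):
    """Placeholder body; the real service extracts features for the
    revision and applies a trained model for the given wiki."""
    features = [0.0]  # stand-in for real feature extraction
    return {"wiki": wiki, "model": model_name,
            "rev_id": rev_id, "score": sum(features)}

# The web API process enqueues work without blocking, while workers
# started with `celery -A this_module worker` do the heavy lifting:
#   result = score_revision.delay("enwiki", "reverted", 642215410)
#   print(result.get(timeout=10))
```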
We expanded our scope beyond the initial damaging-edit detection to newer problems such as edit type detection. Such a feature would be most helpful in identifying editors' editing patterns, which should in turn improve the efficiency of our damage detection models, since we would be able to provide a different score based on the type of an edit rather than treating all edits as the same type. We launched a pilot Wikilabels campaign to identify typical edit types, which we then provided as options in the full Wikilabels campaign.
Outcomes and impact
Outcomes
What are the results of your project?
Please discuss the outcomes of your experiments or pilot, telling us what you created or changed (organized, built, grew, etc) as a result of your project.
Progress towards stated goals
Please use the below table to:
- List each of your original measures of success (your targets) from your project plan.
- List the actual outcome that was achieved.
- Explain how your outcome compares with the original target. Did you reach your targets? Why or why not?
Planned measure of success (include numeric target, if applicable) | Actual result | Explanation |
Think back to your overall project goals. Do you feel you achieved your goals? Why or why not?
Global Metrics
These metrics have limited relevance for our project: we are primarily building back-end infrastructure, so we do not have a direct impact on content creation. Furthermore, the ultimate goal of our project is to process on-wiki tasks such as counter-vandalism or WP 1.0 assessments and predict future cases as accurately as possible. For this we rely on the feedback of more experienced users, which at this stage of the project significantly limits our interaction with newer users. As such, we do not achieve these global metrics directly so much as provide the means for others to do so.
On reflection, this list seems more relevant to edit-a-thons and other in-person events, where it is easier to count individuals and to track the work they do. Nevertheless, we address the prompts as well as we can below.
For more information and a sample, see Global Metrics.
Metric | Achieved outcome | Explanation |
1. Number of active editors involved | 51 active editors helped us label revisions; several others were involved in discussions of machine learning on Wikipedia in general. It is unclear how many additional users WikiProject Medicine and the WEF account for. | |
2. Number of new editors | 0 | We aren't yet at the stage where our project will have impacted this. |
3. Number of individuals involved | Measurement of this is not tractable. | |
4. Number of new images/media added to Wikimedia articles/pages | N/A | We do not add new content. Media related to our project: Commons:Revision scoring as a service (category). |
5. Number of articles added or improved on Wikimedia projects | 0 | Our system supports triage, so you might count every article where a user used our tool to perform a revert, but that's probably not what is desired here. |
6. Absolute value of bytes added to or deleted from Wikimedia projects | N/A | We don't really edit directly; instead, we guide experienced local users in assessing edit quality for us. This way, machine learning classifiers are trained on the community's demands rather than on arbitrary criteria we would set ourselves, which also promotes wider adoption of our tool since the results are tailor-made for each wiki. We have concluded, are currently running, or are about to start an edit quality campaign in the following 18 Wikipedia editions, and we are also running a similar but distinct campaign on Wikidata. |
- Learning question
- Did your work increase the motivation of contributors, and how do you know?
- This is a larger goal of User:Halfak (WMF)'s research agenda. In his expert opinion, this project provides a critical means of stopping the mass demotivation of newcomers[1]. So, to answer the exact question: we did nothing to increase contributor motivation. To answer the spirit of the question: we have done some critical things to preserve contributor motivation. See Halfak's recent presentation for more discussion.
Indicators of impact
Do you see any indication that your project has had impact towards Wikimedia's strategic priorities? We've provided 3 options below for the strategic priorities that IEG projects are most likely to impact. Select one or more that you think are relevant and share any measures of success you have that point to this impact. You might also consider any other kinds of impact you had not anticipated when you planned this project.
Option A: How did you increase participation in one or more Wikimedia projects?
- By providing scores not only for productive/damaging but also for good faith/bad faith, we hope to reduce the unintentionally negative reception new users receive. In the older workflow, damaging edits of any kind were treated the same and received stern warnings; with our system we help distinguish users who are merely making mistakes (and hence damaging edits) from users whose intent is malicious. This way, new users making mistakes can be triaged into a separate list from those with malicious intent, as sketched below.
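A rough sketch of that triage logic, assuming two probability scores per edit; the field names ("damaging", "goodfaith") and thresholds are illustrative assumptions, not the service's actual output format.

```python
# Sketch: routing edits into review queues using two scores. The field
# names and thresholds below are assumptions for illustration.
def triage(score):
    damaging = score["damaging"]    # probability that the edit is damaging
    goodfaith = score["goodfaith"]  # probability it was made in good faith
    if damaging > 0.8 and goodfaith < 0.2:
        return "likely vandalism: revert/warn queue"
    elif damaging > 0.8:
        return "good-faith mistake: mentoring queue"
    else:
        return "probably fine: no review needed"

print(triage({"damaging": 0.9, "goodfaith": 0.7}))  # -> mentoring queue
```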
Option B: How did you improve quality on one or more Wikimedia projects?
- By reducing the workload of identifying non-problematic edits, we allow recent changes patrollers to focus on edits that are more likely to be problematic. In turn, vandalism is removed more efficiently, increasing quality across every project our system runs on.
Option C: How did you increase the reach (readership) of one or more Wikimedia projects?
- While this is not one of our goals, less vandalism ought to promote this as well.
Project resources
Please provide links to all public, online documents and other artifacts that you created during the course of this project. Examples include: meeting notes, participant lists, photos or graphics uploaded to Wikimedia Commons, template messages sent to participants, wiki pages, social media (Facebook groups, Twitter accounts), datasets, surveys, questionnaires, code repositories... If possible, include a brief summary with each link.
- We developed models based on Wikilabels user community input for English, Persian, Portuguese, Turkish.
- We developed models based on on-wiki reverts for (list of language editions) as well as for Wikilabels.
- We implemented a sustainable API hosted on a series of servers allowing horizontal and vertical scaling.
- Horizontal Scaling
- Vertical Scaling
Learning
The best thing about trying something new is that you learn from it. We want to follow in your footsteps and learn along with you, and we want to know that you took enough risks in your project to have learned something really interesting! Think about what recommendations you have for others who may follow in your footsteps, and use the below sections to describe what worked and what didn’t.
What worked well
What did you try that was successful and you'd recommend others do? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your findings in the form of a link to a learning pattern.
- Your learning pattern link goes here
What didn’t work
What did you try that you learned didn't work? What would you think about doing differently in the future? Please list these as short bullet points.
- Adoption
- We have had a hard time managing adoption; on some wikis we were simply ignored. We feel that more coordination with community liaisons would help circumvent this problem. Wikis with smaller communities often ignore announcements made in English due to the language barrier.
- Some new language editions brought added challenges, as preexisting language utilities are inadequate or even non-existent in some cases. For instance, in Chinese and Japanese, words are not delimited by spaces, unlike, say, English or Spanish, which requires tailor-made language utilities to handle the difference (see the sketch below). What is more, Chinese Wikipedia uses two different character sets (Simplified and Traditional) and handles five variants. We are working to fix such issues in close cooperation with the local communities. It would be beneficial if we could work together on new adoptions with language tool developers as well as langcom, provided they have time for such an endeavor.
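The sketch below illustrates the word-delimitation problem mentioned above; the crude character-level fallback is purely for illustration and is not the approach the project actually uses.

```python
# Sketch: why space-delimited tokenization breaks for Chinese or Japanese.
import re

def naive_words(text):
    """Space/punctuation-delimited tokens work for English or Spanish."""
    return re.findall(r"\w+", text, re.UNICODE)

print(naive_words("The quick brown fox"))  # ['The', 'quick', 'brown', 'fox']
print(naive_words("维基百科是自由的百科全书"))  # one giant "word" -- useless as a feature

def cjk_characters(text):
    """Crude fallback: treat each CJK ideograph as its own token."""
    return [c for c in text if "\u4e00" <= c <= "\u9fff"]

print(cjk_characters("维基百科是自由的百科全书"))
```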
Other recommendations
If you have additional recommendations or reflections that don’t fit into the above sections, please list them here.
Next steps and opportunities
Are there opportunities for future growth of this project, or new areas you have uncovered in the course of this grant that could be fruitful for more exploration (either by yourself, or others)? What ideas or suggestions do you have for future projects based on the work you’ve completed? Please list these as short bullet points.
Part 2: The Grant
Finances
Actual spending
Please copy and paste the completed table from your project finances page. Check that you’ve listed the actual expenditures compared with what was originally planned. If there are differences between the planned and actual use of funds, please use the column provided to explain them.
Expense | Approved amount | Actual funds spent | Difference |
2 grantee developer stipends | $30,000 | $30,000 | $0 |
Total | $30,000 | $30,000 | $0 |
Remaining funds
Do you have any unspent funds from the grant?
Please answer yes or no. If yes, list the amount you did not use and explain why.
- No unspent funds remain.
If you have unspent funds, they must be returned to WMF. Please see the instructions for returning unspent funds and indicate here if this is still in progress, or if this is already completed:
- Yes.
Documentation
Did you send documentation of all expenses paid with grant funds to grantsadmin@wikimedia.org, according to the guidelines here?
Please answer yes or no. If no, include an explanation.
- Yes.
Confirmation of project status
Did you comply with the requirements specified by WMF in the grant agreement?
Please answer yes or no.
- Yes.
Is your project completed?
Please answer yes or no.
- Yes.
Grantee reflection
We’d love to hear any thoughts you have on what this project has meant to you, or how the experience of being an IEGrantee has gone overall. Is there something that surprised you, or that you particularly enjoyed, or that you’ll do differently going forward as a result of the IEG experience? Please share it here!