- Mani Pande
- Nimish Gautam
- Ayush Khanna
This project will evaluate the outcomes of various WMF outreach programs with regard to user contributions and participation in various projects.
We will ask for self-reported information at outreach events, collect it, and then compare contribution activity from the accounts of attendees to gauge the potential effectiveness of each event.
Given that an article has revisions $r_1, \dots, r_n$ (where $r_n$ is the most current revision of the article available at the time of performing the analysis) and the revision we're interested in is $r_i$:
- A byte is considered significant if it is non-whitespace
- A byte is considered to have survived if it was put in by the user in revision $r_i$, and persisted to revision $r_n$
- The set of survived significant bytes for revision $r_i$ is then the set of bytes added in $r_i$ that are both significant and survived
Survival is calculated for this given revision as
Note on reordering text: There's a small, static "bonus" added to the number of significant bytes if any reordering of text was detected in that revision whatsoever (for instance, paragraphs being moved around). The reordered bytes aren't counted otherwise.
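The definitions above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: it assumes revisions are available as UTF-8 text, uses the standard library's `difflib.SequenceMatcher` as a stand-in diff algorithm, and leaves out the reordering bonus because its size is not specified here. The function names and the `parent` / `revision` / `latest` parameters are hypothetical.

```python
from difflib import SequenceMatcher

# Assumption: "whitespace" means the ASCII whitespace bytes.
WHITESPACE = frozenset(b" \t\r\n\x0b\x0c")


def added_positions(parent: bytes, revision: bytes) -> set:
    """Indices in `revision` of bytes that the diff reports as new relative to `parent`."""
    matcher = SequenceMatcher(None, parent, revision, autojunk=False)
    added = set()
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.update(range(j1, j2))
    return added


def persisted_positions(revision: bytes, latest: bytes) -> set:
    """Indices in `revision` of bytes that the diff matches to some byte in `latest`."""
    matcher = SequenceMatcher(None, revision, latest, autojunk=False)
    kept = set()
    for block in matcher.get_matching_blocks():
        kept.update(range(block.a, block.a + block.size))
    return kept


def survived_significant_bytes(parent: str, revision: str, latest: str) -> int:
    """Count non-whitespace bytes added in `revision` that still appear in `latest`."""
    parent_b = parent.encode("utf-8")
    revision_b = revision.encode("utf-8")
    latest_b = latest.encode("utf-8")
    added = added_positions(parent_b, revision_b)
    kept = persisted_positions(revision_b, latest_b)
    return sum(1 for pos in added & kept if revision_b[pos] not in WHITESPACE)


if __name__ == "__main__":
    before = "The cat sat."
    edited = "The cat sat. It purred loudly."
    # The added sentence is still present in the latest revision.
    print(survived_significant_bytes(before, edited, edited))
    # The added sentence was removed again in the latest revision.
    print(survived_significant_bytes(before, edited, before))
```

A character-level diff like this is slow on long articles; a real implementation would more likely diff on tokens or lines first and only then count bytes.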
We want to determine the number of bytes a user has added or changed in a given set of revision differences, and whether those changes persisted, as an approximation of the community's judgement that the added information is of high quality. Although persistence is not always an accurate measure of quality, an edit that has survived 1000 revisions is more likely to be of high quality than one that has survived only 1.
- The ratio of survived significant bytes to edit count can aid in identifying users whose editing patterns consist of high-content, highly survivable edits
- The ratio of Survival to edit count can aid in identifying users with high-content, highly survivable edits with consistency over time.
- The ratio of survived significant bytes to bytes added can aid in identifying users who produce highly survivable edits in general.
- Ranges: still TBD
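As a rough illustration, the three ratios listed above could be computed from per-user totals as follows. The `UserTotals` record and its field names are hypothetical, and the `survival` field is assumed to be a per-user total of the per-revision Survival values computed elsewhere. Users with no edits or no added bytes get `None` rather than a division error.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class UserTotals:
    """Hypothetical per-user aggregates over the revisions being analysed."""
    edit_count: int
    bytes_added: int
    survived_significant_bytes: int
    survival: float  # assumed: total of per-revision Survival, computed elsewhere


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def contribution_ratios(totals: UserTotals) -> Dict[str, Optional[float]]:
    """Compute the three ratios described in the list above."""
    return {
        "survived_bytes_per_edit": ratio(totals.survived_significant_bytes, totals.edit_count),
        "survival_per_edit": ratio(totals.survival, totals.edit_count),
        "survived_bytes_per_byte_added": ratio(totals.survived_significant_bytes, totals.bytes_added),
    }


if __name__ == "__main__":
    example = UserTotals(edit_count=4, bytes_added=200,
                         survived_significant_bytes=100, survival=10.0)
    print(contribution_ratios(example))
```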
- Edits to articles, or sections of articles, whose content changes over time, such as a sports score. If a user enters a score of 40 and the team soon scores 15 more points, the article will say 55 and the user's bytes will be counted as not having survived. Survival is a poor approximation of quality in this case, because the original edit was of high quality.
- Reversions of vandalism. A revert restores a large amount of existing text, so the reverting edit is credited with an unfairly large number of surviving bytes.
- Note: there are numerous methods to detect vandalism reversion, and the code implementation leaves room for these heuristics if they are needed.
- Collaborative editing sessions. This can be remedied by looking at a group of collaborative editors as one unit.
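One common heuristic for spotting reverts is the "identity revert": a revision whose full text is byte-for-byte identical to an earlier revision of the same article. The sketch below shows that idea only; it is not necessarily the heuristic used in the project code. The function name and the `window` default of 15 revisions are assumptions made for the example.

```python
import hashlib
from typing import List


def find_identity_reverts(revision_texts: List[str], window: int = 15) -> List[int]:
    """Return indices of revisions that restore the exact text of an earlier revision.

    `revision_texts` is ordered oldest to newest. A revision identical to its
    immediate parent is treated as a null edit, not a revert.
    """
    digests = [hashlib.sha1(text.encode("utf-8")).hexdigest() for text in revision_texts]
    reverts = []
    for i in range(2, len(digests)):
        if digests[i] == digests[i - 1]:
            continue  # null edit: nothing changed relative to the parent
        # Look back at most `window` revisions, excluding the immediate parent.
        earlier = digests[max(0, i - window):i - 1]
        if digests[i] in earlier:
            reverts.append(i)
    return reverts


if __name__ == "__main__":
    history = ["Good text.", "Good text, expanded.", "VANDALISM", "Good text, expanded."]
    print(find_identity_reverts(history))
```

Revisions flagged this way could be excluded from the survival calculation, or have their restored bytes credited to the original authors instead of the reverting user.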
Code that performs this analysis is available under the GPL in the Wikimedia SVN repository.
All findings will be publicly available on a WMF wiki.
Wikimedia Policies, Ethics, and Human Subjects Protection
Benefits for the Wikimedia community
The community and the Foundation will be able to better gauge and adopt effective outreach practices.