Research:Automated classification of edit types

From Meta, a Wikimedia project coordination wiki
Created
19:34, 16 October 2014 (UTC)

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


The goal of this research project is to develop a working edit classifier for English Wikipedia and a method for constructing classifiers for other languages. The edit type classifier will allow us to predict the type of contribution made by an editor. In turn this will allow us to better understand the type of work performed by individual contributors, provide an algorithmic signature of users as a function of their role, and measure the division of labor within individual articles or entire projects. This project will also allow us to go beyond raw edit counts or bytes added as measurements of contributed content.

Project plan

  1. Specify a list of edit types that we like (expert judgement?) (see /Taxonomy)
  2. Build a hand-coding interface (Done; see Wiki labels)
  3. Gather a group of hand-coders and have them manually classify edits (see en:Wikipedia:Labels/Edit types)
  4. Construct a feature extractor to gather edit metrics that we think will have a high amount of signal (Research)
  5. Train/test a set of classification strategies on the features/labeled edits (Research/Engineering)
  6. Apply the classifier to new edits (Engineering)
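
Steps 4 to 6 of the plan can be illustrated with a minimal sketch. It is not the project's feature extractor or classifier: the four features, the nearest-centroid strategy, and the labels "content addition" and "wikification" are all assumptions chosen to keep the example short and dependent only on the Python standard library.

```python
import difflib
import math
from collections import defaultdict

def extract_features(old_text, new_text):
    """Turn a pair of revisions into a small numeric feature vector.

    The features are hypothetical: tokens added, tokens removed,
    change in "[[" markup, and change in "<ref" tags.
    """
    old_tokens, new_tokens = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    added = removed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1
        if tag in ("replace", "insert"):
            added += j2 - j1
    return [
        float(added),
        float(removed),
        float(new_text.count("[[") - old_text.count("[[")),
        float(new_text.count("<ref") - old_text.count("<ref")),
    ]

def train_centroids(labeled_edits):
    """Average the feature vectors of each label: label -> mean vector."""
    grouped = defaultdict(list)
    for features, label in labeled_edits:
        grouped[label].append(features)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }

def classify(features, centroids):
    """Predict the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))

# Hand-labeled training edits (invented for illustration only).
training = [
    (extract_features("A cat.", "A cat. It sat on the mat."), "content addition"),
    (extract_features("See Paris.", "See [[Paris]]."), "wikification"),
]
centroids = train_centroids(training)

# Apply the trained classifier to a new edit.
prediction = classify(extract_features("Visit Rome.", "Visit [[Rome]]."), centroids)
print(prediction)
```

A real pipeline would hold out part of the hand-coded edits for testing and would compare several classification strategies, as step 5 describes; the nearest-centroid rule here only stands in for that step.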

Use cases

rich revision histories
enrich article revision history pages, user contribution pages, recent changes with tagged edits
predict contributor roles
study whether Wikipedian roles can be predicted from their edit types and design automated recommendations / recruitment strategies for articles in need of specific roles
article lifecycles
analyze the evolution of individual articles (by type of activities) and study how the article lifecycle has changed over time and across languages
edit types and editing interfaces
study if people make different types of edits as a function of the edit interface they are using
newbie task recommendations
Understand which tasks newbies are most likely to pick up successfully, so that recommendations minimize reverts or deletions; study the engagement/retention effects of priming new contributors with the expected response to the quality and type of their contribution
wiki work visualizations
Make it easy to perform project-level or article-level data analysis / visualization by type of contributions
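
As a rough illustration of the analysis-oriented use cases above, the sketch below aggregates already-tagged edits into a per-editor profile of edit-type shares. The function name, the editor names, and the edit-type labels are hypothetical; the input is assumed to be the output of some edit classifier.

```python
from collections import Counter, defaultdict

def edit_type_profiles(tagged_edits):
    """Map each editor to the share of their edits falling in each edit type.

    tagged_edits is an iterable of (editor, edit_type) pairs.
    """
    counts = defaultdict(Counter)
    for editor, edit_type in tagged_edits:
        counts[editor][edit_type] += 1
    return {
        editor: {
            edit_type: n / sum(type_counts.values())
            for edit_type, n in type_counts.items()
        }
        for editor, type_counts in counts.items()
    }

# Invented sample data for illustration only.
sample = [
    ("Alice", "copy-edit"),
    ("Alice", "add citation"),
    ("Bob", "add content"),
]
profiles = edit_type_profiles(sample)
print(profiles)
```

The same grouping could be keyed by article instead of editor to study division of labor within an article.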

Taxonomy

Main article: /Taxonomy

In order to predict the types of changes made in an edit, we need to develop a taxonomy of potential edit types. So far, we've identified several abstract classes that potential edit types may belong to.

Syntactic
These classes describe "what" was done during an edit. (As opposed to "why".) Within this class, we can identify several mechanical operations that are trivial to detect (e.g. inserts a wikilink, removes a category) as well as some that are more complex to identify (e.g. rephrase vs. change meaning).
Semantic
These classes describe "why" an edit was made. They usually amount to subjective applications of policy. (e.g. removing POV)
Complex operations
These classes describe changes that are part of a multi-edit operation (e.g. article merging and talk page archiving)
Discussion
Many edit types have to do with talk pages and other types of discussions (e.g. reply, !vote, etc.)

We've placed all but syntactic changes out of scope for this project.
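
The "trivial to detect" syntactic operations mentioned above can be sketched with a couple of regular expressions. This is an assumption-laden example, not the project's detector: the two patterns cover only simple wikilinks and category links, ignore templates, nowiki sections and localized namespace names, and the label strings are hypothetical.

```python
import re
from collections import Counter

# Simplified, hypothetical patterns; real wikitext has many more edge cases.
CATEGORY_RE = re.compile(r"\[\[\s*Category\s*:\s*([^\]|]+)", re.IGNORECASE)
# Link targets containing ":" are skipped so category links are not counted twice.
WIKILINK_RE = re.compile(r"\[\[\s*([^\]|:]+?)\s*(?:\|[^\]]*)?\]\]")

def syntactic_labels(old_text, new_text):
    """Return the set of syntactic labels detected between two revisions."""
    labels = set()
    for name, pattern in (("wikilink", WIKILINK_RE), ("category", CATEGORY_RE)):
        before = Counter(m.strip() for m in pattern.findall(old_text))
        after = Counter(m.strip() for m in pattern.findall(new_text))
        if after - before:   # something present now that was not there before
            labels.add(f"inserts {name}")
        if before - after:   # something that was there before is gone
            labels.add(f"removes {name}")
    return labels

old = "Einstein was a physicist. [[Category:Physics]]"
new = "[[Albert Einstein|Einstein]] was a physicist."
print(sorted(syntactic_labels(old, new)))
```

Operations such as "rephrase vs. change meaning" cannot be separated this way, which is where the labeled data and trained classifiers from the project plan come in.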

Related work

  • Kriplean et al. (2008)[1] describe the types of work that lead to an editor receiving a barnstar
    • minor -- copy-editing
    • media -- images, audio
    • initiative -- starting articles, stubs
    • major -- substantial textual additions to an article
    • achievement -- shepherding an article to a higher quality level
    • classification -- categorizing articles, adding templates
    • redesign -- large-scale refactoring, merging pages
    • translation -- to or from another language
    • attribution -- citing sources, removing unsourceable info
  • Antin et al (2012)[2] built their own taxonomy on the high-level taxonomy described by Kriplean et al (2008)[1]. They had Mechanical Turk workers apply it to a sample of ~11k revisions saved by new editors, using these categories: adding citations, adding content, changing Wiki markup, creating articles, deleting content, fixing typos, reorganizing text, rephrasing existing text, vandalism and deleting vandalism.
  • Faigley and Witte (1981) developed a taxonomy of textual changes to capture the underlying intentions of the contributor.[3] They introduce a distinction between changes that affect meaning (the "addition or removal of information"), which they call Text-Base Changes, and changes that do not affect meaning, which they call Surface Changes. This distinction is used by later work such as Daxenberger and Gurevych (2012)[4] and (2013)[5].
  • Pfeil et al (2006)[6], used in Ehmann et al (2008)[7]
  • Jones (2008)[8]
  • Boulain et al (2008)[9]
  • Arazy et al (2010)[10] – taxonomy from early version used in Arias Torres (2009)[11]
  • Fong and Biuk-Aghai (2010)[12] and (2011)[13]
  • Liu and Ram (2011)[14]
  • McDonald et al (2011)[15]
  • Antin et al (2012)[16]
  • Bronner and Monz (2012)[17]
  • Daxenberger and Gurevych (2012)[4] and (2013)[5]
  • Ferschke et al (2013)[18]
  • Edit type classification in other collaborative platforms: Stack Overflow – Yang et al (2014)[19]
  • Hale (2015)[20]
  • Graham et al (2015)[21]

Subpages

See also

References

  1. Kriplean, T., Beschastnikh, I., & McDonald, D. W. (2008, November). Articulations of wikiwork: uncovering valued work in wikipedia through barnstars. In Proceedings of the 2008 ACM conference on Computer supported cooperative work (pp. 47-56). ACM.
  2. Antin, J., Cheshire, C., & Nov, O. (2012, February). Technology-mediated contributions: editing behaviors among new wikipedians. In Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work (pp. 373-382). ACM.
  3. Faigley, Lester; Witte, Stephen (1981-12-01). "Analyzing Revision". College Composition and Communication 32 (4): 400–414. doi:10.2307/356602. 
  4. Daxenberger, Johannes; Gurevych, Iryna (2012-01-01). "A Corpus-Based Study of Edit Categories in Featured and Non-Featured Wikipedia Articles" (PDF). 24th International Conference on Computational Linguistics (COLING).
  5. Daxenberger, Johannes; Gurevych, Iryna (2013-01-01). "Automatically Classifying Edit Categories in Wikipedia Revisions". Conference on Empirical Methods in Natural Language Processing (EMNLP 2013).
  6. Pfeil, Ulrike; Zaphiris, Panayiotis; Ang, Chee Siang (2006-10-01). "Cultural Differences in Collaborative Authoring of Wikipedia". Journal of Computer-Mediated Communication 12 (1): 88–113. ISSN 1083-6101. doi:10.1111/j.1083-6101.2006.00316.x. 
  7. Ehmann, Katherine; Large, Andrew; Beheshti, Jamshid (2008-10-06). "Collaboration in context: Comparing article evolution among subject disciplines in Wikipedia". First Monday 13 (10). ISSN 1396-0466. doi:10.5210/fm.v13i10.2217. 
  8. Jones, John (2008-04-01). "Patterns of Revision in Online Writing: A Study of Wikipedia's Featured Articles". Written Communication 25 (2): 262–289. ISSN 0741-0883. doi:10.1177/0741088307312940.
  9. http://eprints.soton.ac.uk/265200/
  10. Arazy, Ofer; Stroulia, Eleni; Ruecker, Stan; Arias, Cristina; Fiorentino, Carlos; Ganev, Veselin; Yau, Timothy (2010-06-01). "Recognizing contributions in wikis: Authorship categories, algorithms, and visualizations". Journal of the American Society for Information Science and Technology 61 (6): 1166–1179. ISSN 1532-2890. doi:10.1002/asi.21326. 
  11. "Visualizing wiki author contributions in higher education". www.editlib.org. Retrieved 2015-10-23. 
  12. Fong, Peter Kin-Fong; Biuk-Aghai, Robert P. (2010-01-01). "What Did They Do? Deriving High-level Edit Histories in Wikis". Proceedings of the 6th International Symposium on Wikis and Open Collaboration. WikiSym '10 (New York, NY, USA: ACM): 2:1–2:10. ISBN 978-1-4503-0056-8. doi:10.1145/1832772.1832775. 
  13. Fong, Peter Kin-Fong; Biuk-Aghai, Robert P. (2011-01-01). "Visualizing Author Contribution Statistics in Wikis Using an Edit Significance Metric". Proceedings of the 7th International Symposium on Wikis and Open Collaboration. WikiSym '11 (New York, NY, USA: ACM): 197–198. ISBN 978-1-4503-0909-7. doi:10.1145/2038558.2038591. 
  14. Liu, Jun; Ram, Sudha (2011-07-01). "Who Does What: Collaboration Patterns in the Wikipedia and Their Impact on Article Quality". ACM Trans. Manage. Inf. Syst. 2 (2): 11:1–11:23. ISSN 2158-656X. doi:10.1145/1985347.1985352.
  15. McDonald, David W.; Javanmardi, Sara; Zachry, Mark (2011-01-01). "Finding Patterns in Behavioral Observations by Automatically Labeling Forms of Wikiwork in Barnstars". Proceedings of the 7th International Symposium on Wikis and Open Collaboration. WikiSym '11 (New York, NY, USA: ACM): 15–24. ISBN 978-1-4503-0909-7. doi:10.1145/2038558.2038562. 
  16. Antin, Judd; Cheshire, Coye; Nov, Oded (2012-01-01). "Technology-mediated Contributions: Editing Behaviors Among New Wikipedians". Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work. CSCW '12 (New York, NY, USA: ACM): 373–382. ISBN 978-1-4503-1086-4. doi:10.1145/2145204.2145264. 
  17. Bronner, A., & Monz, C. (2012, April). User edits classification using document revision histories. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (pp. 356-366). Association for Computational Linguistics.
  18. Ferschke, Oliver; Daxenberger, Johannes; Gurevych, Iryna (2013-01-01). Gurevych, Iryna; Kim, Jungi, eds. A Survey of NLP Methods and Resources for Analyzing the Collaborative Writing Process in Wikipedia (PDF). Theory and Applications of Natural Language Processing. Springer Berlin Heidelberg. pp. 121–160. ISBN 978-3-642-35084-9. doi:10.1007/978-3-642-35085-6_5. 
  19. Yang, Jie; Hauff, Claudia; Bozzon, Alessandro; Houben, Geert-Jan (2014-01-01). "Asking the Right Question in Collaborative Q&A Systems". Proceedings of the 25th ACM Conference on Hypertext and Social Media. HT '14 (New York, NY, USA: ACM): 179–189. ISBN 978-1-4503-2954-5. doi:10.1145/2631775.2631809.
  20. Hale, Scott A. (2015-01-01). "Cross-language Wikipedia Editing of Okinawa, Japan". Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. CHI '15 (New York, NY, USA: ACM): 183–192. ISBN 978-1-4503-3145-6. doi:10.1145/2702123.2702346. 
  21. Graham, Mark; Straumann, Ralph K.; Hogan, Bernie (2015-11-02). "Digital Divisions of Labor and Informational Magnetism: Mapping Participation in Wikipedia". Annals of the Association of American Geographers 105 (6): 1158–1178. ISSN 0004-5608. doi:10.1080/00045608.2015.1072791.