Research:Using event stream prediction to predict vandalistic behaviour on Wikipedia

Contact
Cristian Consonni

This page documents a planned research project.
Information may be incomplete and change before the project starts.


Key Personnel

  • Alberto Montresor
  • Cristian Consonni

Project Summary

The goal of this project is to test whether event stream prediction and graph mining techniques can be successfully applied to detecting and predicting vandalistic behavior on Wikipedia. Previous research (Laxman et al., 2008; Laxman et al., 2005; Ypma and Heskes, 2003) has shown that mixtures of Hidden Markov Models (HMMs), in particular a class of models called Episode Generating HMMs (EGHs), can be successfully applied to build systems that perform event stream prediction. We would like to apply this approach to Wikipedia by building an offline system that analyzes Wikipedia dumps to train EGHs, which can subsequently be used to predict vandalistic behavior. This system will constitute the baseline for the development of a more complex system that will analyze Wikipedia edits as a Time Evolving Graph (TEG). TEGs are receiving increasing attention (Leskovec et al., 2005; Fard et al., 2012) because of the variety of their applications. We would like to explore the possibility of mining data (Sun et al., 2007) from a TEG representation of Wikipedia to discover interesting patterns that can be applied to a prediction task.
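As an illustration only, the frequent-episode framework of Laxman et al. (2005) measures how often a serial episode (an ordered pattern of event types) occurs in an event stream by counting its non-overlapped occurrences. The sketch below shows that counting step on a toy stream. The event labels ("edit", "blank", "revert", "block") and the function name are hypothetical and are not part of the project's design; the sketch does not train an EGH.

```python
from typing import Hashable, Sequence


def count_non_overlapped(stream: Sequence[Hashable],
                         episode: Sequence[Hashable]) -> int:
    """Count non-overlapped occurrences of a serial episode in a stream.

    A single left-to-right scan tracks how much of the episode has been
    matched so far; each completed match is counted and tracking restarts.
    """
    if not episode:
        raise ValueError("episode must contain at least one event type")
    count = 0
    pos = 0  # index of the next episode event we are waiting for
    for event in stream:
        if event == episode[pos]:
            pos += 1
            if pos == len(episode):
                count += 1
                pos = 0  # start looking for the next occurrence
    return count


# Hypothetical edit categories for one page, in chronological order.
stream = ["edit", "blank", "revert", "edit", "blank", "edit", "revert", "block"]
episode = ("blank", "revert")
print(count_non_overlapped(stream, episode))
```

Events that do not belong to the episode may appear between its steps, which is why the second occurrence in the toy stream still counts even though an "edit" separates "blank" from "revert".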

Methods

We will develop our method completely offline, without engaging users or editors. Building the system will involve the automatic analysis and categorization of edits and the construction of an Episode Generating HMM (EGH) and a Time Evolving Graph (TEG). Deletion logs and block logs[1] constitute an important source of information about vandalistic behavior, so we need access to these data.
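One possible way to picture a TEG of Wikipedia edits is as a sequence of graph snapshots, one per time window. The sketch below is a minimal example under stated assumptions: nodes are users and pages, an edge means "this user edited this page", and windows are calendar months. None of these choices is fixed by the project, and the names `build_snapshots` and `new_edges` as well as the sample edits are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Set, Tuple

Edit = Tuple[str, str, datetime]  # (user, page, timestamp)
Edge = Tuple[str, str]            # (user, page)


def build_snapshots(edits: Iterable[Edit]) -> Dict[str, Set[Edge]]:
    """Group (user, page) edges by calendar month, one edge set per snapshot."""
    snapshots: Dict[str, Set[Edge]] = defaultdict(set)
    for user, page, timestamp in edits:
        snapshots[timestamp.strftime("%Y-%m")].add((user, page))
    return dict(snapshots)


def new_edges(snapshots: Dict[str, Set[Edge]],
              previous: str, current: str) -> Set[Edge]:
    """Edges present in the current snapshot but absent from the previous one."""
    return snapshots.get(current, set()) - snapshots.get(previous, set())


# Hypothetical edit records; real input would come from the public dumps.
edits = [
    ("Alice", "Trento", datetime(2015, 6, 3)),
    ("Bob", "Trento", datetime(2015, 6, 20)),
    ("Alice", "Trento", datetime(2015, 7, 1)),
    ("Carol", "Povo", datetime(2015, 7, 9)),
]
snapshots = build_snapshots(edits)
print(sorted(snapshots))
print(new_edges(snapshots, "2015-06", "2015-07"))
```

Comparing consecutive snapshots, as `new_edges` does, is one simple way to surface how the graph changes over time; a real analysis would likely use a dedicated graph library and richer edge attributes.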

Dissemination

We plan to submit papers to conferences and journals in the fields of Knowledge Discovery, Data Mining, Data Management and Databases. We will make our publications available under at least green Open Access.

We will not share the private data we receive from the WMF; the other data will consist of the public dumps.

Wikimedia Policies, Ethics, and Human Subjects Protection

We do not think this project poses any relevant ethical concern. This study has not been submitted for evaluation by an ethics committee.

Benefits for the Wikimedia community

A successful vandalism prediction system could be made available to volunteers, either as a new tool or as new functionality in existing anti-vandalism tools. In the shorter term, the output of this project will consist mainly of scholarly publications about vandalism on Wikipedia and, more generally, about algorithms for event stream prediction.

Timeline

This is a tentative timeline for this project:

Date                             Task
June 2015                        Beginning of software development and data processing
July 2015                        Obtain access to the data requested
September 2015                   Software development is finalized
September 2015 - November 2015   Data analysis, evaluation of results and writing
December 2015                    Submission for publication

Funding

This project is supported by the University of Trento, in the scope of the Doctoral program in Computer Science of the International Doctoral School in Information and Communication Technology (XXX cycle).

References

  • Laxman, Srivatsan, Vikram Tankasali, and Ryen W. White. "Stream prediction using a generative model based on frequent episodes in event sequences." Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2008.
  • Ypma, Alexander, and Tom Heskes. "Automatic categorization of web pages and user clustering with mixtures of hidden markov models." WEBKDD 2002-Mining Web Data for Discovering Usage Patterns and Profiles. Springer Berlin Heidelberg, 2003. 35-49.
  • Laxman, Srivatsan, P. S. Sastry, and K. P. Unnikrishnan. "Discovering frequent episodes and learning hidden markov models: A formal connection." Knowledge and Data Engineering, IEEE Transactions on 17.11 (2005): 1505-1517.
  • Sun, Jimeng, et al. "Graphscope: parameter-free mining of large time-evolving graphs." Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2007.
  • Fard, A., A. Abdolrashidi, L. Ramaswamy, and J. A. Miller. "Towards efficient query processing on massive time-evolving graphs." CollaborateCom, 2012.
  • Leskovec, Jure, Jon Kleinberg, and Christos Faloutsos. "Graphs over time: densification laws, shrinking diameters and possible explanations." Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 2005.

External links

This project has no external website.

Contacts

Cristian Consonni

e-mail address: cristian(dot)consonni(at)unitn.it

postal address:

Cristian Consonni
Università degli studi di Trento
Dipartimento di Ingegneria e Scienza dell'Informazione
Via Sommarive, 9 - Polo Collina (Povo 2)
38123 - Povo (Trento) - TN
Italia

See also

  1. The latter could be obtained from the publicly available logs