Research:Sockpuppet detection in Wikimedia projects

Created
Contact: Srijan Kumar, Tilen Marc, Jure Leskovec
Duration:  2017-September — ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.


Introduction

Sockpuppetry is the use of more than one account on a social platform. The reasons for sockpuppetry may be benign, such as using one account for work and another for personal use, or malicious, such as pushing one's own point of view. On English Wikipedia specifically, the benign and malicious uses are defined in the project's sockpuppetry policy. Sockpuppets are frequently used in a large-scale, coordinated way to create undesirable articles or to insert their operator's point of view into existing articles. Two recent cases, OrangeMoody and Morning277, show the sophistication involved in their operation.

In this project, we aim to design strategies for identifying potential sockpuppet accounts on English Wikipedia using machine learning algorithms. The aim is to develop high-precision detection models using previously identified sockpuppets, which are publicly listed in a category on English Wikipedia.

The expected outcome of this project is a set of open algorithmic methods, and a report on their performance and limitations, that could be integrated later on into tools to support community efforts to identify and flag sockpuppet accounts.

Short literature review

Research on the use of sockpuppets on Wikipedia has relied on publicly available data from sockpuppet investigations and blocked accounts. One of the first studies of Wikipedia sockpuppetry was by Solorio et al.[1], who used the comments made by sockpuppets on talk pages. Extensive stylistic analysis of this data achieved an F1 score of 0.72. Solorio et al.[2] also released a dataset of sockpuppet comments on talk pages. Because editors can edit both articles and talk pages, a combined analysis of the two is needed. Tsikerdekis et al.[3] performed an aggregate analysis of 7,500 sockpuppet accounts from Wikipedia and contrasted their behavior with that of 7,500 legitimate accounts, using non-textual features. However, aggregate analysis does not reveal account-level behavior, for example how sockpuppets are actually used. Yamak et al.[4] studied 5,000 Wikipedia sockpuppet accounts out of 120,000 sockpuppet accounts, using Wikipedia-specific features including the number of edits, reversions, and the time between registration and edits. Linguistic analysis, which has been shown to be very indicative of sockpuppetry[5], was not performed. Zheng et al.[6] performed sockpuppet analysis by splitting the activity of a single user into that of two different accounts and trying to tie them back together. This assumes that sockpuppets write similarly across their multiple accounts.

Data

Data used for this project includes, but is not limited to, edit and user data. Edit information includes edits on public as well as deleted pages, and edits by active and deleted (e.g. banned) users. Information such as the revision text, timestamp, user ID, username, IP address, and user agent may be used. Additional data sources and fields may be added over the course of this research and will be reported on this page. All nonpublic data (e.g. IP addresses) will be handled in accordance with the Wikimedia Foundation's privacy policy, data retention guidelines, and formal collaboration policy.

Lists of declared multiple accounts on Wikipedia include en:Category:Alternative_Wikipedia_accounts and en:Category:Wikipedians_with_alternative_accounts.

Methodology

Qualitative data

We'll collect qualitative data from editors with CheckUser access engaging in sockpuppet detection, to identify strategies and workflows they follow, as well as potential sources of signal and feature candidates for our modeling efforts.

Modeling

Here we briefly describe the proposed machine learning algorithm. We are using a deep learning model that aims to learn a low-dimensional representation (called an embedding) of each account, based on its edit history, such that sockpuppet accounts of the same person have similar embeddings. The edit history includes information related to the sequence of edits that the account makes, such as the page edited, the exact text added and removed, the time of the edit, the users who edited the page previously, the relation between two consecutively edited pages (e.g., the hyperlink distance between the two), and so on. Instead of hand-crafting features that encode this information, we will create a recurrent neural network (RNN) model that generates the embeddings, with an optimization objective that reduces error during training. The training data includes positive examples, i.e., pairs of accounts that belong to the same user, and negative examples, i.e., pairs of random accounts or pairs of sockpuppet accounts of different users. After successful training, the pairs in the positive training examples will be embedded closer together in the embedding space than the pairs in the negative training examples.
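The pairwise training signal described above can be illustrated with a small sketch. The project has not published its objective, so the choice of a contrastive loss over Euclidean distance is an assumption made here for illustration only, as are the function names and the margin value. The embeddings are passed in as plain lists of numbers; in the proposed model they would be produced by the RNN from each account's edit history.

```python
import math
from typing import Iterable, Sequence, Tuple

Embedding = Sequence[float]
# (embedding of account A, embedding of account B, True if same operator)
Pair = Tuple[Embedding, Embedding, bool]


def euclidean_distance(a: Embedding, b: Embedding) -> float:
    """Distance between two account embeddings of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a: Embedding, b: Embedding, same_user: bool,
                     margin: float = 1.0) -> float:
    """Illustrative pairwise loss (an assumption, not the project's objective).

    Positive pairs are penalized by their squared distance, so training
    pulls them together. Negative pairs are penalized only while they are
    closer than `margin`, so training pushes them apart up to that margin.
    """
    d = euclidean_distance(a, b)
    if same_user:
        return d ** 2
    return max(0.0, margin - d) ** 2


def mean_pair_loss(pairs: Iterable[Pair], margin: float = 1.0) -> float:
    """Average loss over a batch of labeled account pairs."""
    pairs = list(pairs)
    return sum(contrastive_loss(a, b, y, margin) for a, b, y in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings for three accounts.
    sock_1 = [0.0, 0.0]
    sock_2 = [0.0, 0.5]   # same operator as sock_1
    unrelated = [3.0, 4.0]
    batch = [(sock_1, sock_2, True), (sock_1, unrelated, False)]
    print(mean_pair_loss(batch))
```

Minimizing such a loss over many labeled pairs is one way to obtain the property stated above, namely that positive pairs end up closer in the embedding space than negative pairs. Other objectives, such as a triplet loss, would serve the same purpose.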

The above is a brief description of the modeling process. We are currently iterating to create a model with high accuracy. This page will be updated frequently with details about the model.

Results and evaluation

Results from this study will be shared here. Feedback on the scope and design of this research project is welcome on its talk page.

References

  1. Solorio, Thamar; Hasan, Ragib; Mizan, Mainul (2013). A Case Study of Sockpuppet Detection in Wikipedia. Workshop on Language Analysis in Social Media (LASM) at NAACL HLT. pp. 59–68. 
  2. Solorio, Thamar; Hasan, Ragib; Mizan, Mainul (2013-10-24). "Sockpuppet Detection in Wikipedia: A Corpus of Real-World Deceptive Writing for Linking Identities". arXiv:1310.6772 [cs]. 
  3. Tsikerdekis, M.; Zeadally, S. (August 2014). "Multiple Account Identity Deception Detection in Social Media Using Nonverbal Behavior". IEEE Transactions on Information Forensics and Security 9 (8): 1311–1321. ISSN 1556-6013. doi:10.1109/tifs.2014.2332820. 
  4. Yamak, Zaher; Saunier, Julien; Vercouter, Laurent (2016). "Detection of Multiple Identity Manipulation in Collaborative Projects". Proceedings of the 25th International Conference Companion on World Wide Web. WWW '16 Companion (Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee): 955–960. ISBN 9781450341448. doi:10.1145/2872518.2890586. 
  5. Kumar, Srijan; Cheng, Justin; Leskovec, Jure; Subrahmanian, V.S. (2017). "An Army of Me: Sockpuppets in Online Discussion Communities". Proceedings of the 26th International Conference on World Wide Web. WWW '17 (Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee): 857–866. ISBN 9781450349130. doi:10.1145/3038912.3052677. 
  6. Zheng, Xueling; Lai, Yiu Ming; Chow, K. P.; Hui, Lucas C. K.; Yiu, S. M. (2011). Detection of sockpuppets in online discussion forums. PhD dissertation, University of Hong Kong.