Research:Automatic Detection of Online Abuse/notes

From Meta, a Wikimedia project coordination wiki

These are notes for Automatic Detection of Online Abuse, the research project for predicting misconduct on English Wikipedia. This was a 2018-19 research project at the School of Data Science at the University of Virginia.

Key resources

Weekly meeting

Fridays from 10:30-11am EST in the Data Science Institute conference room one.

  • Lane Rasberry, user:bluerasberry, Data Science Institute Wikimedian
  • Raf Alvarado, user:Rca2t, faculty, Data Science Institute
    • Rca2t@virginia.edu
  • Sameer Singh, User:Sameer9j
    • ss8gc@virginia.edu
  • Charu Rawat, User:Cr29uva
    • cr4zy@virginia.edu
  • Arnab Sarkar, User:Arnab777as3uj
    • as3uj@virginia.edu
  • Patrick Earley, user:PEarley_(WMF), Trust and Safety, Wikimedia Foundation

project page on Meta-wiki

Wednesday 5 June 2019

  • Jason Wang
  • Arnav
  • Charu
  • Sameer
  • Lane

Meeting to produce video

Copyright: Jason agreed to transfer copyright to DSI; DSI to apply a CC-BY license.

Payment: Jason to send an invoice; possibility of payment into a student CIS account.

In attendance: Arnav and Charu; Sameer flying in from SF.

What conversations do you think the Wikipedia community should have about using machine learning to predict misconduct?

We wrote papers to Samuel for this!

Sameer - I talked broadly about bias. Given that the model is opaque, some communities might be at higher risk than others.

Charu - If your data underrepresents some communities, then the algorithm will amplify that bias. I also talked about user privacy and transparency.

Arnav - I talked about free speech online and the principle of blocking users for what they say.

Charu - one of the ethical issues we discussed early on was whether to do research on the IP addresses. The IP addresses of unregistered users are freely available to the public on the Wikipedia website. Why not use this?

Sometimes it is not obvious what insights data will bring. If we even start to use IP addresses then we might get geographical information which itself carries other kinds of bias and implications which would enter the analysis, perhaps without us understanding it. We could unintentionally uncover information about users which they never imagined that we would be able to report. For this reason, we decided to not analyze the IP addresses.

Arnav We think that admins should have a drop-down menu from which they choose reasons for blocks.

Sameer If there were 10-15 standard reasons, then we would be able to do better analysis. As things are, people can enter free-form answers, which we had to interpret.

Charu This is a small change which would lead to great insights on what kind of protection and research would benefit Wikipedia.

Charu? Do we have any suggestions about the revision activity data?

Arnav There is no way to separate the namespace from the revision data.


Charu The revision activity data which we get from Toolforge gives a lot of metadata, but it does not say which namespace a revision is in. This is fundamental information. If there were a column associating each revision with a namespace, that would save us the time of determining where someone made a comment.


Arnav Just getting access to Toolforge was very difficult! It was a lengthy process. Our professor was the first to get access, then collectively our group of five researchers got permission.

Just accessing the backend database which Wikipedia has was a multistep process. To establish the connection we used SSH (Secure Shell). There is a public key and a private key: on the Toolforge side we put our public key, and on the AWS side we kept our private key. Once we got into Toolforge there were lots of folders, depending on which user account a person uses for access. Different people have different folders; this is personal scratch space for each user. Once we are in Toolforge we are able to access various databases. We were only interested in the English Wikipedia database. We used shell commands to feed the username and password specific to en-wiki, gave it a SQL query, and got the output as a CSV or a text file.

Each Toolforge user has a "cnf" file containing credentials. We can use the cnf file in a shell command to access the en-wiki database.

Once we have the file on the Toolforge side, we still have to bring it into AWS. We use the scp command.

A lot of this is documented on Toolforge's wiki pages. The main problem is how to generate a file on Toolforge itself. We queried Toolforge and wanted to dump the output of a MySQL query. Normally in a database you query whatever you want, and with a graphical interface you can get the output as a CSV. Since everything happens on the server side, we had to make the query as a single command.
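The single-command, server-side query pattern described above can be sketched as command builders. The Toolforge login host, replica database host, cnf path, and user name below are illustrative assumptions, not values recorded by the project.

```python
import shlex

# Hypothetical defaults; real host names and paths depend on the account setup.
LOGIN_HOST = "login.toolforge.org"
DB_HOST = "enwiki.analytics.db.svc.wikimedia.cloud"

def build_query_command(sql, outfile, cnf="~/replica.my.cnf", db="enwiki_p"):
    """Build the server-side shell command: run one SQL query against the
    en-wiki replica using the credentials in the user's cnf file, and dump
    the result into a file in the user's scratch space."""
    return (f"mysql --defaults-file={cnf} -h {DB_HOST} {db} "
            f"-e {shlex.quote(sql)} > {shlex.quote(outfile)}")

def build_fetch_command(user, remote_path, local_path):
    """Build the scp command that copies the dump from Toolforge to AWS."""
    return ["scp", f"{user}@{LOGIN_HOST}:{remote_path}", local_path]
```

A session would then run the first command over SSH on Toolforge, and the second locally on the AWS side.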

We learned this from the Toolforge wiki documentation pages. We were also aware that we could ask questions of the Toolforge wiki community. To learn how to put this into our database we also did Google searches for general documentation.

We also had access to this information through the MySQL interface. We could have used the Quarry web interface. It was useful for finding out what data we wanted because it was quick for small amounts of information. Quarry is not useful for getting large amounts of information, so once we examined the data with Quarry, we got the data through Toolforge. We also could have gotten data through the XML dumps.

Charu XML dumps happen about monthly. These are terabytes of data in each file and they dump about 100 of them, in bzip2 format. These XML dumps hold information on all namespaces of English Wikipedia. We only wanted conversation from namespaces 1 and 3. We used Python code to extract only the text that we wanted.
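The namespace filtering described above can be sketched with a streaming parser. The dump layout here is simplified (real history dumps contain many revisions per page, and this sketch keeps only the last revision text seen per page); the file name is a placeholder.

```python
import bz2
import xml.etree.ElementTree as ET

WANTED_NS = {"1", "3"}  # article talk and user talk namespaces

def talk_page_texts(dump_path):
    """Yield (title, text) for pages in namespaces 1 and 3, streaming the
    bzip2-compressed XML so the multi-GB dump never has to fit in memory."""
    with bz2.open(dump_path, "rb") as fh:
        ns_ok, title, text = False, None, None
        for event, elem in ET.iterparse(fh, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # strip the XML namespace prefix
            if tag == "title":
                title = elem.text
            elif tag == "ns":
                ns_ok = elem.text in WANTED_NS
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                if ns_ok:
                    yield title, text
                elem.clear()  # free memory held by the processed page
```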

Another challenge is that user comments are sorted on the basis of the article where they happen, not by time, and not by the user who made them. Each revision contains the full accumulated text of the page, so every edit made to a page contains all the previous comments as well. This means that to get a single comment we need to filter out the previous text. It was laborious to extract only the change in a given edit.

Revision edits could happen at any time. If you want to perform any analysis pertaining to users, you have to sort through all the revisions.
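A minimal sketch of recovering only the added text by diffing consecutive revisions, assuming the full revision texts are already available in chronological order:

```python
import difflib

def added_lines(prev_text, curr_text):
    """Return the lines present in curr_text but not in prev_text."""
    diff = difflib.ndiff(prev_text.splitlines(), curr_text.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]

def new_text_per_revision(revisions):
    """revisions: full page texts in chronological order.
    Returns, for each revision, only the text it added."""
    out, prev = [], ""
    for text in revisions:
        out.append("\n".join(added_lines(prev, text)))
        prev = text
    return out
```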

Again - two problems with XML dumps.

The edits are not uniquely attributable to each user: the data is structured on a per-page basis, not a per-user basis. Temporal analysis based on each user is very difficult.

Our entire analysis was about a user and their activity, and since it is so challenging to query this, that delayed our analysis.

We built a web scraper with Python.

Arnav We started with the users who were blocked, and everyone else counted as non-blocked users. We built a tool where we fed in individual usernames and it returned all their recent contributions. For non-blocked users we looked two weeks back from the end of the data. For blocked users, we started at the date of their block and looked 8 weeks back.

The first step is to get the links for all the diffs, and the second step is extracting all the information from the diff.
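The lookback windows and the contribution listing can be sketched against the MediaWiki Action API's usercontribs list. The window sizes follow the text (8 weeks for blocked users, 2 weeks for the rest); the parameter choices are assumptions rather than the team's actual scraper, and fetch_contribs performs a live network call.

```python
import json
from datetime import datetime, timedelta
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"  # MediaWiki Action API

def contrib_window(anchor, weeks):
    """Lookback window as API timestamps: anchor is the block date for
    blocked users, or the end of the data for non-blocked users."""
    end = anchor - timedelta(weeks=weeks)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return anchor.strftime(fmt), end.strftime(fmt)

def fetch_contribs(username, anchor, weeks):
    """Fetch one user's contributions inside the window (network call)."""
    ucstart, ucend = contrib_window(anchor, weeks)
    params = urlencode({
        "action": "query", "list": "usercontribs", "format": "json",
        "ucuser": username, "uclimit": "500",
        "ucstart": ucstart, "ucend": ucend,  # newest-first range
    })
    with urlopen(f"{API}?{params}") as resp:
        return json.load(resp)["query"]["usercontribs"]
```

The diff links can then be built from the `revid`/`parentid` fields of each returned contribution.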

80% of the time on this project was data acquisition. In an ideal world this would not happen, but in the real world this is just how things are. It takes time to acquire the data. The first problem we had at the beginning of the project was identifying what datasets were available.

The XML dumps had all the data we needed but were challenging to navigate, so we found it easier to access the information through Python scraping.

For anyone who wants to replicate what we have done:

The metadata is easily available through Toolforge, but the text of the actual edits came more easily from scraping. There is a lot of information available on Wikipedia about what data exists and how to access it, but it is easier to talk with a Wikipedia community member about how to access it than to rely on the documentation alone.

There are pages where you can ask questions about ToolForge. People always replied within 2 days and usually they replied within a day. If someone is not aware of how to work around a database then it can be challenging to get started. The Wiki community members are very helpful.

We also used web scraping to get the ORES scores. There was an API to access the data. Previously we were feeding the username to the user contribs API. We queried to get the scores for the users we wanted.
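Fetching and unpacking ORES scores can be sketched as below, assuming the ORES v3 response shape. The model names shown ("damaging", "goodfaith") are assumptions for illustration; the notes do not fully specify which fields the team used, and fetch_scores performs a live network call.

```python
import json
from urllib.request import urlopen

ORES = "https://ores.wikimedia.org/v3/scores/enwiki"  # ORES scoring service

def fetch_scores(rev_ids, models="damaging|goodfaith"):
    """Query ORES for a batch of revision IDs (network call)."""
    revids = "|".join(str(r) for r in rev_ids)
    with urlopen(f"{ORES}?models={models}&revids={revids}") as resp:
        return json.load(resp)

def damaging_probability(response, rev_id):
    """Pull the damaging=true probability for one revision out of the
    nested ORES response structure."""
    scores = response["enwiki"]["scores"][str(rev_id)]
    return scores["damaging"]["score"]["probability"]["true"]
```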

Charu: For every user we had text of all the edits they made. The ORES score is a scoring system for particular edits. What we wanted was to match the ORES scores to the text of the edits they graded. ORES scores are very generalized and often inaccurate. There is a tradeoff for having ORES applied to everything, which gives some information at lower quality.

We used the ORES scores as an additional feature to our model.

We got the Google files from the CSV published with the Ex Machina project.

We used AWS to process all this because it was too much data to compute on our laptops.

We wish that the Wikimedia platform were equipped to support big data technology, like Spark and Hadoop. When Google did their research they used big data analysis tools which we could not access in this project. It is much better to use the full dataset than to compromise, as we did, by sampling the data. Users with these skills would be able to do the research more effectively. In our course, we learned these skills in January. Our project ran July - May, and if we had had these skills a couple of months earlier, then we might have incorporated them.

Charu Regarding how they structure the XML dump, it is hard for me to understand why they store the information only on the article side.

Arnav The articles are the most important thing in Wikipedia! User data is secondary. Even the comment data is in the form of posts to a page.

Charu It is very challenging to perform any user-centric analysis in the XML dumps. When data scientists post their research about Wikipedia, they typically talk about the articles because that information is easy to access. Since there is a barrier to accessing user data, user-centric research is rarer.

Friday 10 May 2019

Lane emailed the team about these issues:

  • Getting a video recording of the presentation
    • Lane can arrange professional support - he needs info about the team's schedule
    • Ideally everyone can show up at the same time for a professional recording; the alternative is self-recording on one's own computer, with professional editing / matching with slides
  • Copyright of the paper
    • Lane asks about details: what templates did you use from SIEDS? Did you take design or writing elements from anywhere? Can Lane have image copies of charts, etc.?
    • Lane seeks permission to copy to wiki
  • Presentation to the Wikimedia Foundation research team
    • Virtual and recorded; we can do this live or pre-recorded
    • Ideally at least one person there to answer questions - everyone invited
    • Will be scheduled either 12 or 26 June; Lane will be there - can one of you join?
  • ASONAM 2019 paper
    • Can you deposit a "preprint" before submission? Quickest and easiest: https://zenodo.org/ - a project of CERN, respected, stable, and indexed
  • Exit interview
    • Pete Alonzi, DSI staff, volunteers to do an exit interview
    • The advantage of this to you is that Pete + I create some documentation with your name on it, so that future researchers can more easily build on your work
    • What we want is to sit with any of you; Pete asks questions, I take notes; whenever schedules align could work for this

Friday 26 April 2019

This was the IEEE SIEDS 2019 conference. The team presented! http://bart.sys.virginia.edu/sieds19/

result: This project won "Best Paper, IEEE SIEDS 2019 Conference"

Wednesday 27 March 2019


  • Raf
  • Charu
  • Arnab
  • Lane
  • Patrick
  • Claudia
  • Sameer (absent)


The team presented their slides. Charu began with a recap of the project goals and their concept for addressing the challenge. She described the data pipeline: using SQL to get the set of Wikimedia users already blocked, using a SQL query to get a set of user activity data, using Python web scraping to get user comments, using Python to collect ORES edit quality scores, and downloading the published CSV files from the Google Ex Machina project. Aggregating all this data gives the final study corpus.

To establish a machine learning process, the team collected the last five comments and applied various machine learning techniques to generate a toxicity score for each comment. After trying various algorithms, they selected XGBoost.

Arnab presented the workflow. For actual evaluations, the system takes in more data per user than the learning model.

The analysis applies the learning model to try to predict when someone will get blocked before they are blocked. The team did some live testing by detecting probable blocks with incomplete data, then looking further ahead to see if the block actually happened.

Patrick commented that if this tool were developed further then there would be human review to institute a block.

Takeaways - The project's original "toxicity score" was effective as an evaluation tool for predicting the likelihood of a block.

Considering behavior aside from the text of editing is useful. Machine learning can detect characteristics of "sleepers" in ways that humans might not see. For example, there is a trend of some accounts being created with no edits, then remaining dormant, then being disruptive suddenly for a day, then going dormant again. There are some identifiable patterns to how this works.

The team shared the challenges of getting Wikipedia data. Because of the architecture, sometimes it is easier to get data from the database, and sometimes it is easier to get data through web scraping because of odd organization in the Wikipedia database. To go through the XML dump to get user data, one has to go through the entire dump, which is unexpectedly time-intensive.

The team shared that based on this data, there could be a classification overlay to predict why a block should be issued. The team recommended that if there could be more standardization in how admins apply rationales to blocks, then it would be easier to predict with automation why to issue blocks. Patrick commented on social issues in this space, such as the tendency for admins to list one reason when there are multiple offenses, the resistance of the wiki community to standardization, and the wiki community's resistance to change.

Patrick says - really cool that this does not rely on one data source. The route to a more accurate tool is combining those 5 datasets. Patrick commented that the Wikipedia community has responded favorably to machine learning applications, but this project sets a precedent for applying machine learning in a very personal way. Patrick brought up the example of The Minority Report, in which technology predicts "pre-crime" before it happens, an idea in science fiction that many find unsettling. How will the wiki community respond?

Patrick requested that we get the research into The Signpost.

Wednesday 27 March 2019


  • Daniel Mietchen
  • Raf
  • Lane
  • Charu
  • Arnab
  • Sameer


Raf mentioned that the board of trustees of the Data Science Institute is having a meeting here next week. This group is interested in meeting some students and hearing about their work. Would this group be interested in presenting to them? Yes! The team shared that they are also presenting at the Tom Tom Machine Learning Conference, where they have a 6-minute time slot. Similarly, they can present to the board.

The team shared that they expect to have a draft of their paper to share Friday. They wanted to share their work on the paper to this point.

Outline:

  • Data collection
  • Initial analysis and framing of the problem
  • Development of machine learning models
  • Proposal of a toxicity score

The team described that they trained one of their evaluation tools on the Google Corpus. Google shares this collection in association with its Jigsaw project. To create this, Google had humans evaluate the toxicity of comments.

In the classification scheme for this project, they had threshold expectations for precision and recall. For this project false positives are more tolerable than false negatives, so these expectations consider that.

Raf asked for the team's opinion on the extent to which they think the paper being published will be relevant to the Wikimedia Foundation as the client. The "toxicity index" is a research outcome.

The students asked if the client should be listed as co-author. Raf says that this is not the norm unless they contribute substantially to the research themselves. In this case, Raf said that the client did not scope the problem or directly provide the data. He advised asking the clients if they want an acknowledgement.

The team expressed that the journal's format will not permit publishing all of their interesting outcomes, and that they want to share more outcomes in other publishing channels. Lane suggested the main Wikimedia Foundation blog, blog.wikimedia.org.

Daniel asked what scores from ORES the team is using. They said that they are using the "damage" and "???" fields. They are scraping that from the API. The team considered whether ORES could be used for overall evaluation, and in their analysis, they could not find a way to get satisfactory results with this alone. Instead, it is useful as a complement with other data sets. In ORES the dataset is Jade, "judgement and dialog engine". https://www.mediawiki.org/wiki/Jade

Sample edit that ORES scored as “97% damaging”: https://www.wikidata.org/w/index.php?title=Q4961588&curid=4742264&diff=895177544&oldid=867197043&diffmode=source

Daniel asked if they considered activity across wikis. The team said that this research is not dependent on many dictionaries. There is a sentiment analysis component.

Lane asked how the team felt about the amount of conversation they were able to have with Patrick and the WMF. The team said that it was enough. They would not have wanted less and perhaps they did not need more. They said that the parts that they liked the best were the confirmation that their research direction was useful, the comments about which part of the data existed when they had doubts of what was there and what was not, and also the feedback when they gave presentations.

The team said that the data exploration phase was the most difficult part as they learned to access certain parts of the data. They said that they would have liked access to the MySQL database but they did not have access. If there was a royal road to the data then they would have liked it. As things happened, the team did not finish acquiring all the data they wanted until January. Ideally they would have had all this data in September at the beginning of the project. Raf commented that ideally the Wikimedia Foundation would provide a workspace where anyone could do this research.

Monday 25 February 2019


  • Claudia Lo, User:CLo_(WMF)
  • Lane Rasberry
  • Arnab
  • Sameer
  • Charu

Meeting notes:

  • (not shared)

Start with introductions! Charu is coming from the financial analytics industry. Sameer said that he was very happy to be working with the large Wikimedia dataset. Previously he worked with marketing analytics. Arnab said that they had the most interesting dataset and liked seeing the community interactions. Arnab said that previously he worked on engineering and data pipelines.

Claudia said that she worked on qualitative research. She is going to attend editathons to get feedback on harassment on Wikipedia. Previously she did academic research on online moderation.

Charu started by saying what they would discuss. First they would show their datasets, then they would talk about their analysis, then they would say what their outcomes are. The team said that they had questions.

Arnab said that the 3 datasets they have are: first, the IP blocks, which is the list of blocked users and any posted reason for why they are blocked. This includes everyone who has been blocked. Previously they analyzed this data by itself, clustering various reasons for blocks over time. All of this information came from Toolforge; they move it from there into AWS. This is backend data which is already structured.

Similarly they have the second set, the "revisions table", which is temporal user activity. This shows what a user edited and when, so they can match those activities with the block. All of this is structured and readily available, but it is a challenge to manage the large size of the data. While they could examine trends in the list of all blocked users, to manage the data on activities they looked at only the past two years.

The last dataset is communications, which are the messages on user talk pages and article talk pages. For those two namespaces they considered examining the XML datasets. Because that dataset is so large, they instead checked each user's contributions page (the history for a user) for the diff at the relevant point in time. To give some stats, for 2017-18 there was data for 600,000 non-blocked users and about 98,000 blocked users. This is a huge dataset. They needed to make this data smaller, so they considered especially the top 20,000 most active users in both the blocked and non-blocked user categories.

Charu continued - now at this point they have a set of non-blocked users and a set of blocked users, and some activities of the 20,000 most active of each. As the team previously reported to Patrick, there was a massive spike in blocks in 2017-18 due to user Procseebot (a bot account for a human editor) which is blocking a lot of proxy users and only proxy users. Procseebot is a large percentage of all English Wikipedia blocks - perhaps 30-40%.

Charu reported trends - 92% of all blocked users are active in the week prior to their block. 10% are active 2 weeks prior to their block, and the percentage goes down to ~4% 8 weeks before their block. This is significant because by matching this temporal data to the activity data, that greatly reduces the amount of data which the analysis might have to consider. Other models seek to check all behavior, but if we can instead have evidence that checking only the past 1 week is useful, then checks become easier.
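The "active in the N weeks before the block" statistic can be computed as sketched here, with toy data standing in for the real revision records:

```python
from datetime import datetime, timedelta

def share_active_before_block(block_dates, edit_dates, weeks):
    """Fraction of blocked users with at least one edit in the given
    number of weeks before their block.
    block_dates: {user: block datetime}
    edit_dates:  {user: [edit datetimes]}"""
    window = timedelta(weeks=weeks)
    active = sum(
        any(block - window <= e <= block for e in edit_dates.get(user, []))
        for user, block in block_dates.items()
    )
    return active / len(block_dates)
```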

Sameer described the data reduction process. They look at the last week of comment history and the past 5 comments. This is the text corpus, with the assumption that something happening here might indicate cause for a block.

Charu described the analysis model. One part is text based, using natural language processing (NLP), and the other considers activity patterns, like edit count in a period of time. For example, if a user is mostly dormant but then has an activity spike, then that information, paired with other information, could correlate with a block.

Sameer showed a table of their analysis model attempts. They started with models which had about 60% accuracy in predicting blocks on the historical data. Over time, and with a total of 15 attempts with different models, they found a model with 95% success in predicting which user accounts should have a block. These later models incorporated more data and did more tests. The final model used word n-grams, character n-grams, sentiment, username character n-grams, LDA, and temporal user activity as features. Note that this is analysis of those 20k blocked and 20k non-blocked users. With more data and more analysis the model could be refined.
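The n-gram feature families named here can be illustrated with a minimal extractor. In practice the team's models would have used vectorized library implementations; this sketch only shows what the features are:

```python
def word_ngrams(text, n):
    """Lowercased word n-grams, e.g. bigrams of whitespace tokens."""
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def char_ngrams(text, n):
    """Lowercased character n-grams over the raw string."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]
```

The same character-n-gram function could be applied to usernames, matching the "username char n-gram" feature mentioned above.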

Charu said that with more time and consideration they could determine which parts of the algorithm are most useful. Their overall goal would be to develop a model to apply to new users. Charu said that one of their major challenges is determining the inflection point of activity which is the cause of a block. Their wish is that there were a way to find exactly what caused the block.

Claudia spoke up about a curve ball - there is a new partial block feature which allows a person to get a block on a certain page or a set of pages. Currently this is on Italian Wikipedia and is being rolled out to Arabic Wikipedia. Another feature is a "namespace" block to restrict from either the encyclopedia or talkspace.

Patrick said that he does not have good suggestions for limiting the noise in the way they require. Patrick suggested that very small changes to one's own post could be filtered out. Sameer said that if they stripped out small edits, that could work. Patrick suggested removing edits of 5 characters or below.

Patrick suggested that using the existing ORES evaluations on particular edits could help identify which edits are tagged as problematic.

Claudia suggested that some problems go to noticeboards. The 3-revert rule noticeboard is only for violations of that rule, and is a good place to get information on that. The AN/I board is messy and a catch-all for all sorts of problems. Sockpuppet investigations is also well structured. The biography of living persons board is very active and less structured. Perhaps matching people and activities to these boards could identify the cause of a problem.

Patrick said that the database shows how many people are watching a talk page. The data is only public when there are 30 or more watchers. If you looked at edits reverted by experienced users, then that is perhaps a 95% positive indicator that the comment is either misconduct or not policy compliant. Patrick said that if few people are watching a talk page then bad edits could persist for a long time. If the analysis did not take that into account then that could mislead the analysis.

Sameer suggested that they could scrape some of this data for the final model to see how often what they predict actually comes to pass.

Patrick suggested that someone from the team apply to Wikimania to present the results.

Claudia followed up with this message -

As promised, I talked to one of the engineers, Dayllan Maza, on the AHT team with regards to getting data on partial blocks, and he's given me permission to share his script with you and your students: https://github.com/dayllanmaza/wikireplicas-reports/commit/91a2cae4a6f1d36c5ac87ece44a569863451f832

This should get all partial blocks in the last 30 days for the specified database, in this case Italian Wikipedia.

I've cc'ed Dayllan on this email if you need any further help. Dayllan Maza <dmaza@wikimedia.org>

Friday 22 February 2019


  • Lane Rasberry
  • Chris Shilling, Wikimedia Foundation grants


This meeting is to discuss WMF funding of this project. Grants:Project/University_of_Virginia/Machine_learning_to_predict_Wikimedia_user_blocks

The context is that last November Lane requested a WMF grant of US$5000 to fund misconduct research at UVA. Obviously research is ongoing in any case. If money is available then that will guide future research choices here.

Chris said that he has been doing due diligence to form a Wikimedia Foundation staff recommendation about the proposal.

Chris started by describing support: there is a lot of interest in this proposal in the exploratory phase. Chris said that it was especially interesting to look at Wikimedia accounts instead of particular edits.

A concern: machine learning is going to be able to look at some cases and some blocks, but will not be a good fit for other cases. For example, machine learning might detect vandalism, but might not be able to detect hounding or harassment.

Chris said that making practice recommendations as an outcome of the project would be highly desirable. Lane described some surprising outcomes: small innocuous edits over time, seemingly in readiness for spam farming, or tweaking music genre tagging, correlate with getting a block. Chris recommended avoiding an "open conversation", which is a conversation without an agenda to make some change. The WMF really wants research to come with a recommendation for a practical change.

Chris recommended that, if possible, they craft an RfC proposing a policy change. The WMF favors research which leads to community conversation about improving Wikimedia projects. A particular outcome is less important than hosting a conversation and getting community feedback.

Lane asked about budgeting. Chris recommended breaking out funding requests into staffing allocations. He said that this grant is the smallest of any requested in this round and so breaking this one into smaller pieces might not be necessary.

Monday 18 February 2019


  • Arnab
  • Sameer
  • Charu
  • Lane
  • Raf


The team added persondata to the collection of data which they are considering. Previously they had been examining the more easy-to-extract activity. Now they want to consider the degree to which a user is active and what the relationship is between recent activity and a block. This examination would check 8 weeks before the block and consider generally whether a user was active before then, but look more closely at what happened in the 8 weeks before the block. To increase the prediction power of this there needs to be some normalization to understand the typical activity of non-blocked users and consider how blocked users are different. The point of this is to see whether blocked users are prominent immediately before their block or have a history of engagement which ends in a block.

The team has been modeling patterns with random forest and logistic regression. Since the meeting on 4 February the team has scraped more activity data of individual blocked users.

Charu was considering getting a measurement of variance over a 24-week period, then looking at the 8-week period.

With respect to the textual data Sameer reported that they tried to collect 40 comments from user spaces, including their user pages or their own posts in a talk page. Blocked users have an average of 2.5 talk page posts whereas unblocked users with similar activity have 4 talk page posts.

The initial result is that blocked users often have an account for a long time, then have a spike in activity, and within a week of that spike they get blocked.

Features, models tried, and accuracies:

  • word n-gram - random forest, 62.19%
  • character n-gram - random forest, 60.97%
  • word + character n-gram - random forest, 60.99%
  • 1+2+4- ensemble prediction, 61.57%
  • word n-gram, character n-gram, sentiment - SVM Linear - 62.44%
  • word n-gram, character n-gram, sentiment - gradient boosting, 65.23% (best)

and 10 others. All the best models had fit of about 60%. Note that all these various models are reporting different results. It is just a coincidence that each one has a 60% fit, because the 60% of user accounts in each model is different from other models.

Team proposed developing a TXI - the Toxicity Index! This is a vocabulary which considers the terms which correlate with a user getting a block. The team discussed some non-dictionary terms which might correlate with a block. For example, "!!" or "!!!". Sometimes lots of exclamation points can correlate with a likely block.

Raf noted that the team is playing with various pipelines of content into modeling systems and not getting large differences. For that reason, he recommended trying big changes to the features and models rather than seeking to tweak any particular pairing of features and models. It does not make much sense to commit to a particular pairing and tweak it when it is not yet certain which modeling system is best to apply.

Monday 4 February 2019


  • Arnab
  • Sameer
  • Charu
  • Lane
  • Raf


The team is considering whether they wish to meet with Patrick or accept the meeting offer from Claudia Lo, a researcher addressing harassment on Patrick's team. Now that the team is doing data science modeling, they are looking for insights on the efficiency of their plan for doing this.

Team asked Lane for help interpreting Wikipedia article history logs, Wikipedia user history logs, and edit ID numbers. Lane had no advice on the process of querying individual user accounts for the text of all diffs in edits those accounts had made. The team described that they found a way to do this for Wikipedia articles but are having trouble doing this for user accounts.

Monday 28 January 2019[edit]


  • Arnab
  • Sameer
  • Charu
  • Lane
  • Raf


Team reported a wish to meet Patrick at the WMF the week of 11 February but wanted to pause asking for the meeting.

Lane reported a response from user:Slackr, who runs user:ProcseeBot, which has been responsible for many recent Wikipedia user account blocks.

The team reported that they increased their corpus, the XML dump, by 3 GB in SageMaker. This is about 290,000 comments, including 30,000 from users who are blocked. The team is considering whether they can reduce the number of comments examined by checking only some of them, like perhaps comments around the time of the block. Whatever the case, it is challenging to examine all the comments, and it would be helpful to examine only a subset.

Another team strategy is to examine non-textual features and metadata. For example, in considering "blocked users active in week prior to block", almost all users who get a block are active in the week prior to the block. This means that rarely do inactive users get a block, or if they do, it tends to be soon after the cessation of their activity. The point of this is to distinguish users who are consistently active versus users who are active sporadically.
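The "active in week prior to block" feature described above reduces to a timestamp comparison. A minimal sketch with hypothetical data, not the team's actual schema:

```python
from datetime import datetime, timedelta

def active_week_before_block(edit_times, block_time):
    # True if any edit falls within the 7 days before the block.
    window_start = block_time - timedelta(days=7)
    return any(window_start <= t < block_time for t in edit_times)

block = datetime(2018, 10, 1)
# A user who edited three days before the block counts as recently active.
recently_active = active_week_before_block(
    [datetime(2018, 9, 28), datetime(2018, 8, 1)], block)
```

Computed over all blocked and unblocked accounts, this flag separates consistently active users from sporadic ones.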

Raf asked about tf-idf, "term frequency - inverse document frequency", to identify the important words associated with a block. https://en.wikipedia.org/wiki/Tf%E2%80%93idf
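Tf-idf scores a term highly when it is frequent in one document but rare across the corpus. A minimal sketch of the standard formulation, with a toy corpus for illustration (not the team's code):

```python
import math

def tf_idf(term, doc, corpus):
    # tf: relative frequency of the term in this document.
    tf = doc.count(term) / len(doc)
    # df: number of documents in the corpus containing the term.
    df = sum(1 for d in corpus if term in d)
    # idf: log of the inverse document frequency.
    return tf * math.log(len(corpus) / df)

corpus = [["you", "vandal"],
          ["thanks", "for", "the", "edit"],
          ["you", "vandal", "and", "spammer"]]
score = tf_idf("vandal", corpus[0], corpus)
```

Terms that appear in every document get an idf of log(1) = 0, so common filler words drop out automatically.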

Sameer said that they were thinking of two directions for text analysis. One way is to set up a bag of words and look for key terms. Another way is to consider longer strands of text to do sentiment analysis. Charu said that they would experiment in each direction before choosing one to test.

The team said that they had a roadblock in mounting their S3 bucket. The team examined the commands they would use to mount the bucket into their environment. Raf and team talked it through, and the team found the solution to their problem.

Raf asked what the team ran. Arnab said that he already had 1.4GB of data not in the bucket and ran random forest on it. Sameer asked how they should prepare the sample as they build their model. They were planning two different ways to analyze text and would be building the models independently. In the Google study they got good results from logistic regression and NLP, so they wanted to do that. Raf said that it was good that they have classifiers that they want to train, but would they have enough time? Charu said that they would be able to decide the model within the next two weeks. Raf said that this project has about 2 months left of work.

Raf asked if there were more blocks. Charu described a desire for more AWS computing resources. Raf said that if they made a request, then they could get a larger allocation. The team said that they already had access to 61 GB of RAM, which is already quite big. The team talked about how much data there could be in examining the Wikimedia corpus. There are terabytes of data available, which is beyond the team's capacity to examine.

Friday 14 December 2018[edit]


  • Lane Rasberry, Data Science Institute Wikimedian
  • Raf Alvarado, Data Science Institute Faculty
  • Nico Yang, MSDS researcher
  • Justin Ward, MSDS researcher
  • Erfaneh Gharavi, researcher

The researchers made a public presentation of their datasets, problem, results to date, and plans for addressing their research challenges.

The audience was their fellow student researchers, faculty, any of the client sponsors of the research, and anyone from the university who wanted to attend.

Tuesday 20 November 2018[edit]


  • Arnab
  • Sameer
  • Charu
  • Lane
  • Patrick


Sameer said that they have been examining IP blocks and revisions over a one-year time period. The next step will be to incorporate page meta history information.

Charu asked if Patrick could see the screen. Patrick confirmed that he could. Charu started with analysis of the IP-blocks data. This included the list of all users who have ever been blocked in the English Wikipedia community from 2004 until October 2018. She first pointed to the orange line on the "Account blocks over the years" graph. There have been 1.2 million blocks. In 2004-5 there were not many blocks. In 2017-18 the number of blocks greatly increased. Patrick commented that already they uncovered something useful, as the traditional wisdom is based on old data and this new trend is not well known.

The "User share of account blocks" graph showed that the majority of blocks are imposed on users with registered accounts rather than IP accounts.

In the bar graph Charu demonstrated a large uptick in blocks of IP accounts in September and October 2018. Patrick asked if they checked for Eternal September by checking the same time of year in previous years. Charu said yes, and this new trend is unusual.

Charu said that they mined the text to determine what reasons human users give for blocks. For example, they found the word "resurrection". They graphed the "reasons most frequent for blocking" to categorize different sorts of blocks for various reasons. This graph is a high-level view of what they were able to identify. Looking over this, Patrick commented that "sock puppet" can mean misconduct or someone who is trying to make edits which are not misconduct. In "user name" some accounts could be harassment but others could be some other problem. Sameer said that they had to make some decisions to aggregate some of the findings into a category.
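The keyword aggregation Sameer described can be sketched as a simple rule list. The categories and keywords here are hypothetical examples, not the team's actual mapping:

```python
# Map free-text block reasons to coarse categories by keyword.
# Keywords are checked in order; first match wins.
CATEGORIES = [
    ("sock", "sock puppetry"),
    ("vandal", "vandalism"),
    ("spam", "spam"),
    ("proxy", "proxy"),
    ("username", "username violation"),
]

def categorize(reason):
    reason = reason.lower()
    for keyword, category in CATEGORIES:
        if keyword in reason:
            return category
    return "other"
```

Because admins write free-form reasons, any such rule list will mislabel edge cases, which is exactly the aggregation decision the team had to make.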

Charu talked about how the "share of reasons change over time". The large share in this graph was for spam, showing that from 2006-2016 spam was the problem which got the largest percentage of blocks. In the past year, from 2017, there is a massive increase in users getting blocked for "proxy", which was not a reason stated often in the past. Furthermore, these blocks began in September 2018, so something major must have changed in the past two months. Patrick said that this must be coming from user:ProcseeBot, which patrols for these kinds of users.

Charu presented "share of reasons for account blocks 2018" to show just that one year and the difference in proxy blocks.

Top reasons for registered users to be blocked are spam and vandalism. For IP editors, proxy is the biggest reason, with school blocks second. Patrick said that school blocks are usually vandalism. Sameer said that probably these blocks happen over an IP range.

Sameer began to examine the key users performing the most blocks. Materialscientist is the top blocker, at 1730 blocks and 5% of the total.

Sameer made comparisons of who were the top blockers in 2018 versus 2017. In 2017 there were a few admins doing lots of blocks, but then in 2018 ProcseeBot appeared and started doing lots of blocks. In 2016 this user did 6 blocks, in 2017 771, and in 2018 41,000. Most of ProcseeBot's edits are related to proxy editing.

Patrick said that all this was interesting. He noted that Russian Wikipedia is also getting lots of proxy blocks. Patrick said that his suspicion is that somehow ProcseeBot suddenly got better at blocking, and that before September it was blocking at the behest of a human and now it is working on its own.

Patrick said that while there might be insight to gain from ProcseeBot, it might be useful to disregard this information, as we know why ProcseeBot blocks - it blocks proxies.

The meeting had run 25 minutes and reached the end of its scheduled time. Patrick agreed to stay 5 minutes over.

Arnab began to present the "spread of revision count per user in August 2017-October 2018". The finding was that, comparing IP users who are not blocked to those who are, accounts which eventually get a block tend to make more edits than accounts without a block. This means that accounts which are doing misconduct tend to make more edits than accounts which do not. Patrick wondered if accounts which do spam are the major source of this misconduct, such as from posting lots of links.

Arnab reported also that "spread of average revision length per user in August 2017-October 2018" similarly found that blocked users tend to post more bytes/characters in their edits than users who do not get blocked.

Patrick said that he appreciated the update and preliminary results.

Friday 16 November 2018[edit]


  • Sameer
  • Charu
  • Raf
  • Lane


Sameer talked about the data pipeline. He said that they examined PySpark, which seemed to have a steep learning curve to start using. It would also require migrating the data to HDFS, and there is a learning curve for that as well. Apache Spark is a challenge. Hadoop is difficult.

Charu described some challenges with accessing the main database. They can pull information from ToolForge but now need to put the information in a target database, perhaps DevDB. Charu said that they posted questions to Phabricator but the people who replied referred them to someone else.

Sameer said that for their machine learning they want 100,000 comments as an arbitrary number, and not all comments, and they will do manipulation on that dataset. To get this they need about 30 GB of data and can take what they need for that. The plan is for the team to have the compressed file in their own system. They wanted to manipulate it in AWS. Sameer said this will not work because the subscription has a 4GB size limit. Raf said that the department subscribed to a bigger account that ought to be sufficiently large, an ml.p2.xlarge, which is ready to run Apache Spark with 61 GB of RAM. The team said that they share a Wikipedia database with the Cochrane team researching Wikipedia.

Charu said that they have a basic processing task to do to clean up information about diffs, because they have to make sure they can distinguish between the user who posted the comment and anyone else mentioned in it.

Raf asked if the team was planning to use any NLP to analyze the data. Team said yes. Raf said that NLTK is great but it is slow, and it offers such a range of options that it can be confusing to choose an analysis methodology. Raf recommended spaCy as software which accepts a list of strings as input and outputs sentiment, dependency parses, and other analyses. Another option for NLP is IBM's Bluemix. Next Tuesday at 3pm there will be a presentation of this tool. The team cannot go because that is when the client meeting is.

Raf commented that the project seems to be at the stage of making sure that the data pipeline works. Raf said that for the upcoming academic paper they will publish, they should include a high-level block diagram showing where things are in the pipeline and why they go there. A block diagram would also help explain the project in meetings going forward.

Raf asked about the team's coursework. Raf asked if they would like to have regression analysis before the text mining class. Charu said that other classes get tied to statistics, rather than statistics being tied to anything else. She said that she was a data analyst for 3 years and the statistics course covers what she thought a professional should know. She said that the class also went into deep detail. Charu said that she has a degree in math, but never studied statistics. Sameer said that he studied engineering and had two semesters of statistics but did not study it deeply.

Sameer and Charu said that because statistics is so fundamental then that should be part of the summer course. Charu said that the first part of statistics was linear regression then the second part was data modeling.

Charu asked about the curricula for the online data science program. Raf said that it will go at a slower pace for business professionals.

Monday 12 November 2018[edit]


  • Lane
  • Sameer
  • Arnab


Agenda

  • MOU discussion
  • on-wiki profiles
  • team presents to Lane
  • New Trust and Safety staffer Claudia Lo: https://twitter.com/claudiawylo https://www.linkedin.com/in/claudia-lo-2660a448/ https://meta.wikimedia.org/wiki/User:Cwylo

Team has examined two data sets - the blocks and the behavior around the blocks. The team examined the IP blocks and compared them to registered user blocks. Given that a person is getting blocked or not, how is their behavior different?

(showed chart of blocks from 2004-2018)

Block data began in 2005; in 2006 there were 80,000 blocks.

Charu examined the differences between blocks for IP users versus registered users. They showed a chart of account blocks for anonymous users versus registered users.

In September and October the number of blocks for anonymous users have increased massively. Why is this? Was there a policy change?

In the IP block table there is a list of all the account blocks in English Wikipedia along with the metadata. Despite this table having labels which say "IPB" for "IP user block", there are logs for registered users here also.

The team examined ipb_reason, the human written edit summary, and examined the reasons. The reasons were vandalism, spam, account block, (not available), username block, etc.

In August there were many blocks of proxy users, a massive increase, which must have an unusual cause. There could be a new blocking policy, a new tool for blocking, or a massive proxy attack on Wikipedia. Arnab said that he thinks the increase in proxy blocks comes from a policing bot that identifies and blocks these.

The table lacks the information of who has executed the block. The "by" column should indicate who did the block.

The team had various categorizations and metrics about what reasons there were for blocks, how often they happened, what was the change over time, how many edits accounts made before getting blocked, and the amount of content per edit for these accounts. Blocked users tend to make slightly more lengthy edits than users who are not blocked.

The team examined various factors - age of account, edits by age, amount of content, whether they designated it as a minor edit.

Idea to expand the project is to attempt to detect botlike behavior.

Questions from team: Whenever a request is made to block a particular user, how is it requested? The team wants to know when the request was raised and to see it as an inflection point. Lane showed the ANI noticeboard.

When someone is posted for misconduct review on the ANI noticeboard, can the person edit while they are under investigation? Lane said yes.

The team asked if there is cross-referencing of vandalism across Wikimedia projects. If an IP address is blocked in English Wikipedia, are they blocked across Wikimedia projects? Lane said that he thought this would not happen and that it is a problem.

Team had a hypothesis that either the IP blocks table or the revision table should have some connection across Wikimedia platforms.

The team is having difficulty matching blocks that they see logged on English Wikipedia to any revisions on English Wikipedia. This is making them think that there is a superset of blocks across Wikimedia projects that show up in English Wikipedia despite those accounts never having edited there.

Saturday 20 October 2018[edit]

There was a meeting at WikiConference North America.

https://wikiconference.org/wiki/Submissions:2018/Wikimedia_Infrastructure_for_being_nice https://wikiconference.org/wiki/Submissions:2018/Event_safety_workshop

Claudia, Trevor, Patrick, and Sydney each met Lane for casual chat during the conference. Claudia and Trevor came to Lane's talk and Lane went to the Trust and Safety Team talk.

Tuesday 16 October 2018[edit]


  • Patrick Earley, user:PEarley_(WMF), Trust and Safety, Wikimedia Foundation
  • Lane Rasberry, user:bluerasberry, Data Science Institute Wikimedian
  • Raf Alvarado, user:Rca2t, faculty, Data Science Institute
  • Sameer Singh, User:Sameer9j
  • Charu Rawat, User:Cr29uva
  • Arnab Sarkar, User:Arnab777as3uj


First meeting with client!

Charu, Arnab, and Sameer all introduced themselves, sharing their educational and professional backgrounds.

Patrick asked why everyone chose this project among the others. Charu said that she has had an ongoing interest in examining consumer behavior. Patrick said that the work on Wikipedia is still done by people.

Patrick said that he has been a Wikipedia editor since 2008. On English Wikipedia he is an administrator. He joined the Wikimedia Foundation in 2013, when there were about 140 people. Nowadays there are almost 300 people. Compared to platforms with a similar userbase this is much fewer people. The Trust and Safety team does do reactive responses to problems, but prefers to seek proactive solutions.

Charu asked Patrick to explain the challenges which the Trust and Safety team is seeking to address. Patrick said that from a platform perspective, something unusual about this situation is that the Wikimedia community sets the norms for what should and should not happen. The Wikimedia Foundation follows what the community decides rather than dictating what the Wikimedia users should do. This means that there are some special challenges which other platforms do not experience.

Blocks in Wikimedia projects are IP address based. For anyone motivated to do so, changing the IP address is fairly easy to do, so the Wikimedia user base has to address that.

Another challenge is trying to encourage diversity. The userbase is mostly male and mostly Western, and the Wikimedia Foundation would like to make its editing community more representative of readers.

Currently the Trust and Safety team is seeking to develop new ways to report harassment.

Patrick suggested talking about the students' proposal. Sameer explained a two-part proposal: examining the characteristics of users who have been blocked in the past, and using those characteristics to predict which users are likely to be blocked.

Patrick said that this approach seemed new to him. He asked about their familiarity with Google's Jigsaw project in Wikipedia. Everyone said that they had read this. Everyone discussed that Jigsaw incorporated human ratings of edits, but that this current data science project is entirely data based.

Patrick described how there are different sorts of blocks - blocks for spam, blocks for abusing account creation. Patrick began to describe some problems with the block log that the Wikimedia Foundation has recognized: One problem is that it is not annotated in a consistent way. The human admins write whatever notes they like. Some admins link to a policy, and some write their own idiosyncratic notes.

Arnab asked what kind of user would be better to examine for blocks. Is long term abuse the most important kind to examine? Patrick said that it would be his bias to direct them to this kind of abuse, but that addressing any kind of problem is useful.

Sameer said that Jigsaw focused on the user pages and talk pages, and they were thinking of following that lead. Patrick said that people sometimes post insults into articles, like "(this person) is a loser", but this is easily detected. Since it is easily detected, it is not the most urgent problem. More subtle problems on talk pages are the bigger issue.

Sameer asked Patrick for his opinion about trying to incorporate the ORES scores into the evaluation of blocks, and asked whether anyone had done this. Patrick said that he is not aware of anyone reusing this data in that way but that it does seem helpful to explore the possibility.

Charu raised the issue again that certain people seem more susceptible to harassment. To what extent would it be okay to tag user accounts by their gender, location, religion? Patrick said that the available data is what people have self-disclosed about themselves. Unlike Facebook and Twitter, Wikipedia does not have hard data about demographics. Wikipedia does not reveal the IP addresses as a matter of the privacy policy. The data is not deep.

Charu asked whether Wikimedia users would be disturbed to know that this or any research team was examining their edits, even though the data is open and available to anyone. Patrick said that Wikipedia contributors typically feel good about the transparency and openness of Wikimedia projects and take it as a point of pride to adopt this philosophy. In the Detox project (Jigsaw) it came up that some users could be identified as causing problems.

Arnab asked why Wikipedia does not require people to identify themselves. Would that not improve the community mood? Patrick said that the Wikimedia community values the offer that anyone can edit Wikipedia. It is a trade off but the Wikimedia community decides these things.

Lane raised two items on Patrick's agenda - the issue of claiming IP, and the issue of the Wikimedia Foundation Open Access policy. Patrick said that the Wikimedia Foundation can only contribute resources, like his own time, to projects which are open and for everyone. His expectation for participating is that all this information be shared with the Wikimedia community.

After the call Raf led a discussion about patent law, the meaning of intellectual property, and what it means to share information. The question was raised - why would anyone publish open source software? Raf shared a few reasons. One is that people who publish open source software are advertising their skills. Another is that it enables collaboration. Another is the philosophy of the open community.

The team asked how ORES publishes its code and data, because perhaps they can follow that model. ORES generates a list of users and comments which are bad, then a human follows up to further classify them. ORES uses labels like "good faith" and "damaging".

Raf asked if everyone has Tool Forge accounts. Everyone replied yes.

Friday 12 October 2018[edit]

Charu said that queries in Quarry timeout at 30 minutes so many of the queries they need have to be done in another way.

Raf said that Quarry will not work. The team requested ToolForge accounts so that they could do queries through MySQL. Sameer said that they requested ToolForge accounts through the Wikimedia request process and during the request they linked to the research page on meta. As of yet they have not gotten approval for their accounts.

When Raf made his request Andrew Bogott of the Wikimedia Foundation began a conversation with him without granting the access. https://en.wikipedia.org/wiki/User:AndrewBogott_(WMF)

Charu said that she had some conversations with wiki community members who advised her to download XML files. Raf said that this did not seem like the best approach, but it might be.

Raf showed off https://wikitech.wikimedia.org/wiki/Main_Page which seems introductory.

Charu said that she expected that User Contribs had what they wanted. https://www.mediawiki.org/wiki/API:Usercontribs
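The API:Usercontribs module Charu pointed to is queried with parameters like the following. This is a sketch of constructing such a request; the parameter names are the documented ones, but the username is hypothetical and the practical limit depends on the account's API rights:

```python
from urllib.parse import urlencode

# Query parameters for the MediaWiki usercontribs module.
params = {
    "action": "query",
    "list": "usercontribs",
    "ucuser": "ExampleUser",  # hypothetical username
    "uclimit": 500,           # results per request
    "ucprop": "ids|title|timestamp|comment|sizediff",
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
```

The response pages through a user's edits with metadata, but it does not include diff text, which matches the team's difficulty in pulling all diffs per user account.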

Friday 5 October 2018[edit]



(informal meetings not noted)[edit]

Monday 10 September 2018[edit]


  • Raf
  • Charu
  • Sameer
  • Arnab

Notes: the team has access to the SQL database, has lists of users who are blocked, and has some rationale for why users are blocked; they would like data on what users were doing prior to blocking.

Friday 7 September 2018[edit]


  • Raf Alvarado
  • Lane Rasberry
  • Sameer Singh - Trust and Safety
  • Charu Rawat - Trust and Safety
  • Arnab Sarkar - Trust and Safety



Justin graduated in 2018 with an undergraduate degree in mathematics from SUNY Albany. He said that he is good at the quantitative side but wants to learn more statistical methods. His minor is in informatics, and he wants to get more programming experience. He has no preference of language.

Charu graduated in 2015 with a degree in mathematics. She worked for 3 years at a hedge fund based in New York but working in India. She forecasted KPIs for companies so that portfolio managers could build up investment pieces. A lot of this included analyzing user behavior, especially purchasing dynamics. Consumer behavior analysis included examining how food purchases and luxury product purchases vary by season. Her favorite language is Python, even though she has only been working in R since arriving.

Sameer has an undergraduate degree in engineering. He worked in health care for 4 years. He also worked for 6 months at Uber in India as they were launching new products there. He has been working for years with SAS, which is no one's favorite, but he likes R.

Arnab comes from an engineering background; he graduated in 2014. He worked for Tata Consultancy Services with General Electric as a client, in their power and water division. He coordinated the business logic using PL/SQL and Oracle products. He does not have a favorite language.

Lane introduced himself!

Raf began to discuss the schedule for the project.

Raf said that he had a loose approach to planning for the project. He said there are 4 milestones he calls QDAP - asking a question, considering the data, doing analysis, and presenting a product. This project is unusual as compared to the other projects in this year's Data Science Institute cohort as it probably has too much data to consider and part of the challenge will be curating a select dataset instead of using all available data.

This class will be using GitHub to manage issues. In GitHub everyone will make goals in the project then tag them with issues when they arise. Tagging issues will communicate to Raf the progress of the project and also help to predict when certain parts of the project will be complete.

Raf mentioned that students often want to move out of the phase of thinking about the question and move into the phase of playing with the data. Raf cautioned that it helps to review the published literature to see what published methods and algorithms already exist. The field of information science could be a foundation for designing a question and getting a better start into the research. Students at the university have a Box account which has space to store the datasets. Someone asked if they could use Drop Box or Google Drive. Raf responded that the drawback with these is that

As a practical matter, one of the challenges in this project will be moving datasets around. Pete Alonzi here in the department can assist anyone in posting datasets to the department instance of AWS. Raf recommended Zotero as a bibliographic tool for managing the articles they read.

Thursday 22 August 2018[edit]


  • Patrick
  • Lane

Agenda

  • Limits of WMF support: time from Patrick, money / funding, other support?
  • Additional WMF requirements: productization ???
  • Schedule of research
  • Documentation of process for accessing and engaging with WMF data set


Documentation of U of Virginia getting oriented and exploring options

  • (not shared)

List of research ideas in the harassment / blocking space

  • (not shared)

Proposal as a capstone project to students

  • (not shared)

Research proposal on wiki