Talk:Community health initiative/Measuring the effectiveness of blocks

From Meta, a Wikimedia project coordination wiki
Archives: 1

Research about how to measure the effectiveness of blocks

Currently, little is known about how sitewide blocks affect users. This makes it difficult to compare the effects of the new partial block feature against sitewide blocks. To provide all of us with some insight, the Anti-Harassment Tools team would like to examine historical block data to establish a baseline.

Additionally, AHT is particularly interested in learning whether partial blocks are successful as a tool. Therefore, the first part of this research will focus mainly on the short-term utility of partial blocks, in order to understand whether the feature appears to be working and whether changes appear to be needed. These measurements will provide us with insight quickly.

However, our list of proposed measurements also includes a number of longer-term ones, e.g. surveys. These are important and should be considered for implementation later, because they can provide all of us with insight that is otherwise hidden. Analysis of log data can tell us what is happening, but not why, which is where surveys and interviews are useful. This page is the place for people to discuss these topics, too.

Please join us in discussing these proposed measurements. For the Anti-Harassment Tools team and Morten Warncke-Wang. SPoore (WMF) , Trust and Safety Specialist, Community health initiative (talk) 20:29, 28 September 2018 (UTC)

Discussion about how to measure the effectiveness of blocks

I just wanted to highlight the importance of trying measurements like ORES good faith/bad faith ratios post blocking -- it's important to see if blocking actually changes behaviour. Is there any way we could use the new Google-supported interaction model to measure the relative hostility of conversations by blocked users during/after the block? Blocking generally is used most (after clear/blatant vandalism), in my experience, when folks are belligerent in communications via both edit summaries and talk pages. Sadads (talk) 15:21, 1 October 2018 (UTC)

@Sadads: ORES will be able to score on wikis ORES supports, but unfortunately Google/Jigsaw's Detox/Perspective AI system is only effective for English-language content. (There is also a much larger up-front cost to use Detox.) We're looking into Detox, because although it's not perfect it could show us something directional. Similarly, we wanted to determine a way to measure how a block affects other parties (the other user in a case of incivility, or same-page editors in a case of vandalism) but it became very complex and error prone. Would love to identify another feasible way, though! — Trevor Bolliger, WMF Product Manager 🗨 22:04, 1 October 2018 (UTC)
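The ORES idea above can be sketched as a simple before/after comparison. The helper below is a hypothetical illustration, not the team's method: it assumes the per-edit "goodfaith" probabilities have already been fetched (ORES exposes these via its v3 scores API on supported wikis) and only computes the shift in the mean score.

```python
from statistics import mean

def goodfaith_shift(scores_before, scores_after):
    """Compare the mean 'goodfaith' probability of a user's edits before
    and after a block. Each score is a float in [0, 1], as produced by
    ORES's goodfaith model; a positive result suggests improvement."""
    if not scores_before or not scores_after:
        raise ValueError("need at least one score on each side of the block")
    return mean(scores_after) - mean(scores_before)

# Hypothetical scores: this user's post-block edits look more good-faith.
shift = goodfaith_shift([0.35, 0.40, 0.30], [0.70, 0.65])
```

Interpreting such a shift would of course require controlling for edit volume, and it only works on wikis ORES supports, as noted above.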

Definitions need to be linked

This research is firming some fuzzy concepts into technical terms with specific meanings.

I think the first use of these words should wiki-link to someplace with shared definitions that the community workshops. I am not sure if a dictionary already exists.

  • block
  • partial block
  • the two kinds of vandalism currently listed for Russian Wikipedia - Vandalism 54.95% and Vandalism 3.98%
  • I know that "open proxy violation" has many meanings also
  • school block circumstances are not so defined

I am not sure how to do this. I suppose that there could be full discussion pages for each of these various topics. There is a lot to say about each one! Blue Rasberry (talk) 16:24, 1 October 2018 (UTC)

Hi User:Bluerasberry, I agree! Last week, we identified the need for a glossary but decided to go ahead and post on wiki without it, because we felt this project needs crowdsourced work to have a good outcome. And the sooner the work went public, the sooner wikimedians could help inform the way the work will be done. SPoore (WMF) , Trust and Safety Specialist, Community health initiative (talk) 16:31, 1 October 2018 (UTC)
Good call. I expect that these topics might be defined over months or years after someone establishes them. I do not think it is urgent that you start definitions but also somehow we need to get progress on this eventually. I suppose the way to do this is a glossary page which links out to individual concepts if and when anyone wants to define them further. Blue Rasberry (talk) 16:36, 1 October 2018 (UTC)
Good call, these conversations will be flimsy if they're not on common ground. I'll add some clearer language to the page right now, and we can work on how to make a glossary for the longer term, preferably as part of mw:Manual:Glossary.
As for the most common block reasons — I added "identifiable" because that data was produced by building a distribution of the most common comments left in the 'block reason' field by the admin setting the block. Block reasons are optional and free-form. I'd love to make block reasons required and more standardized so it's easier to run analysis on the different types of blocks and to build more sophisticated functionality into the Blocking tool, but that's a topic for another day. And thanks for pointing out the Russian Wikipedia 'vandalism' duplication — I hadn't combined them. Fixed! — Trevor Bolliger, WMF Product Manager 🗨 22:29, 1 October 2018 (UTC)
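The distribution described above can be sketched in a few lines. This is an illustrative reconstruction, not the actual query used for the report: it tallies the free-form 'block reason' strings after light normalization (lowercasing and whitespace collapsing).

```python
from collections import Counter

def reason_distribution(block_reasons):
    """Given free-form 'block reason' strings from the block log, return
    a list of (reason, percent) pairs, most common first. Reasons are
    compared case-insensitively with whitespace collapsed."""
    normalized = [" ".join(r.split()).lower() for r in block_reasons if r.strip()]
    counts = Counter(normalized)
    total = sum(counts.values())
    return [(reason, 100.0 * n / total) for reason, n in counts.most_common()]
```

Because the field is free-form, near-duplicate reasons ("Vandalism" vs "vandalism ") would otherwise split into separate buckets; standardized reasons would make this step unnecessary.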

On how blocks

"In Phabricator task #T190328 our team generated some simple statistics on how blocks." There seems to be a word missing. --Gereon K. (talk) 21:26, 2 October 2018 (UTC)

@Gereon K.: Good catch — it was a rogue "how" from a previous incarnation of that sentence. Removed! — Trevor Bolliger, WMF Product Manager 🗨 22:24, 2 October 2018 (UTC)

Known block data

Your sample period is short, and looks to me strongly affected by some special countervandalism action (?) in ru.wp. Probably it's not representative. Did you consider resampling data from a whole month? --MBq (talk) 12:17, 3 October 2018 (UTC)

Hello MBq, that is a good point that we did not state well enough in our write-up. I completely agree that our sample data is flawed in this way and others. One of the main reasons we ran the original queries on the block data was to see what the data looked like, in order to get a basic idea of the scale we are working with, and also so that we can understand the limitations of the data and find the best ways to collect better data in the future. Phabricator ticket T206021 is our start on planning future reports. We're hoping to get good feedback like yours that will improve the quality of the new data that we pull. SPoore (WMF) , Trust and Safety Specialist, Community health initiative (talk) 13:25, 3 October 2018 (UTC)
Actually, the Russians seem to have been running a bot that blocks open proxies for some time already. It does not seem to be something special during the sampled period. There are, of course, seasonal variations and other factors that should be taken into account.--Strainu (talk) 21:43, 3 October 2018 (UTC)

Block reasons for English and Russian

Are you sure the block reasons are not reversed between those 2 wikis? According to data higher in the page, 65% of all blocks came from ru.wp and 45% of all blocks were exactly 6 months long, mostly from ru.wp, which is the exact length of their bot-imposed blocks for open proxies. Thus, one can deduce that ~2/3 of ru.wp blocks are for open proxies.--Strainu (talk) 21:41, 3 October 2018 (UTC)
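For reference, the deduction uses the two figures quoted above:

```python
# Figures from the comment above: ru.wp accounts for 65% of all blocks
# in the sample, and 45% of all blocks are exactly 6 months long,
# almost all of them from ru.wp (matching its proxy-blocking bot).
six_month_share_of_ruwp = 0.45 / 0.65   # roughly 2/3
```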

@Strainu: Yes, I'm very sure. This is just a byproduct of how ru.wp annotates its blocks with unique information on the proxy between <!-- --> HTML comments, which makes it difficult to create a distribution — thus my labeling it as "identifiable." We decided not to optimize the query, purely to limit the time investment. Here's the raw data if you'd like to do some different analysis:

Trevor Bolliger, WMF Product Manager 🗨 00:00, 4 October 2018 (UTC)
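The annotation problem described above can be handled by stripping the HTML comments before aggregating, so that otherwise-identical reasons collapse into one bucket. A minimal sketch (the example strings and IP, from the documentation range, are made up):

```python
import re

def strip_annotations(reason):
    """Remove <!-- ... --> HTML comments (e.g. ru.wp's per-proxy
    details) from a block reason so that otherwise-identical reasons
    can be grouped together."""
    return re.sub(r"<!--.*?-->", "", reason, flags=re.DOTALL).strip()
```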

Some stats from

Here [1] PMG (talk) 22:02, 3 October 2018 (UTC)

@PMG: ❤️! This is wonderful. It doesn't look like there's any historical reporting though? We're going to be building weekly reports for blocks in phab:T206021 for all wikis, we'll certainly take a look at how this is pulling the data. Thanks, Piotr! — Trevor Bolliger, WMF Product Manager 🗨 00:09, 4 October 2018 (UTC)
@Masti: - do you have any historical reports or can you help Trevor with data? @TBolliger (WMF): - not a problem - I am always ready to help old friends :). PMG (talk) 07:30, 4 October 2018 (UTC)
@TBolliger (WMF): unfortunately I am not logging the outcomes. So historical reports are not available directly but they can be regenerated from block logs. This is the source for data on the page. You can get the generator code from GitHub. masti <talk> 09:32, 4 October 2018 (UTC)
Wonderful! Thank you! — Trevor Bolliger, WMF Product Manager 🗨 17:49, 4 October 2018 (UTC)

List of proposed measurements

The full commentary and details on how these will be measured are under § Proposed Measurements. For the sake of brevity and discussion, here are the Wikimedia Foundation's seven proposed measurements for determining the effectiveness of blocks:

Sitewide block’s effect on a user

  1.  Blocked user does not have their block expanded or reinstated.
  2.  Blocked user returns and makes constructive edits.

Partial block’s effect on the affected users

  1. Partially blocked user makes constructive edits elsewhere while being blocked.
  2. Partially blocked user does not have their block expanded or reinstated.

Partial block’s success as a tool

  1. Partial blocks will lead to a reduction in usage of sitewide blocks.
  2. Partial blocks will lead to a reduction in usage of short-term full page protections.
  3. Partial blocks will retain more constructive contributors than sitewide blocks.

Before we generate the baseline data, we'd like to hear from users who interact with blocked users and participate in the blocking process to make sure these measurements will be meaningful. Are we over-simplifying anything? Forgetting anything important? Other thoughts?

Thank you! — Trevor Bolliger, WMF Product Manager 🗨 21:23, 4 October 2018 (UTC)
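As an illustration of how one of these measurements might be operationalized, here is a sketch of sitewide-block measurement 2 ("Blocked user returns and makes constructive edits"). The data shape, the 30-day window, and the use of "not reverted" as a proxy for "constructive" are all assumptions made for the example, not the team's definitions.

```python
from datetime import datetime, timedelta

def returned_constructively(block_end, edits, window_days=30, threshold=1):
    """Did the blocked user return and make constructive edits?

    `edits` is a list of (timestamp, was_reverted) pairs for the user.
    'Constructive' is proxied here by 'not reverted', which is an
    assumption for illustration. Returns True if at least `threshold`
    unreverted edits were made within `window_days` of the block ending."""
    window_end = block_end + timedelta(days=window_days)
    good = sum(1 for ts, reverted in edits
               if block_end <= ts <= window_end and not reverted)
    return good >= threshold
```

A real version would also need to define what counts as the same user across accounts and IPs, which is exactly the definitional question raised below.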

  • Not sure how you can measure this, but as someone who strongly opposes partial blocks (and who will never place one, as I think there is no possible way for them not to increase disruption), the thing I would be most interested in looking at is how much gaming of restrictions increases because of the technical implementation of this feature. It's going to happen, but we need a way to measure it. TonyBallioni (talk) 15:28, 8 October 2018 (UTC)
    • @TonyBallioni: Thank you for your comment. I agree that an undesirable outcome is that this feature results in more effort spent moderating user misconduct due to nitpicking and wikilawyering around the exact parameters of a partial block. Our best attempt to measure the frequency of this is "Partially blocked user does not have their block expanded [...]" — Trevor Bolliger, WMF Product Manager 🗨 16:37, 8 October 2018 (UTC)
Hmmm, the most obvious issue is the lack of a definition of "user". IP blocks in particular can target more than one person at once. Geni (talk) 21:04, 14 October 2018 (UTC)
@Geni: — Good observation, Geni. We plan to run each of these measurements twice — once for username blocks and once for IP blocks. (We may also treat IP range blocks separately from single IP address blocks.) I should probably update the content page to reflect this. — Trevor Bolliger, WMF Product Manager 🗨 17:47, 15 October 2018 (UTC)

December 14, 2018 update

Hi all, I wanted to provide a status update before the end of the year. Software development on partial blocks is still underway, and in January we’ll be releasing it on our first batch of wikis. (It’s currently testable on Test Wikipedia for those curious!) Meanwhile, our team’s analyst Morten has begun looking into the available data in the Data Lake. You can see the queries and follow our progress at our GitHub repo:

We’re still planning to run calculations for these 7 proposed measurements to determine how effective sitewide and partial blocks are at preventing further harm to wikis. We had originally planned to run this data on a week-by-week basis, but because only monthly data is available in the Data Lake, and because most of these measurements involve several sequential steps (e.g. a user is blocked, then re-blocked), we will instead be calculating these results monthly.

The Data Lake has historical data for sitewide blocks, so we hope to have some initial results for determining the effectiveness of sitewide blocks in December/January, but we will not be able to have results for partial blocks until March 2019, two months after partial blocks have been released.

As a small but interesting tidbit, we looked at the average time between blocks on one account (i.e. a user is blocked once, that block expires, then they are blocked again). Excluding gaps longer than one year, we found that on English Wikipedia the median time to re-block is 12.1 days; the bottom quartile is 2.4 days and the top quartile is 57.4 days. Italian Wikipedia has shorter times: the median is 6.2 days, with the bottom quartile at 0.9 days and the top quartile at 42.9 days. This reflects the different moderation policies on the two wikis. We know that English Wikipedia’s blocks are longer (69% of blocks are longer than 7 days and 20% are indefinite) than Italian Wikipedia’s (2% of blocks are longer than 7 days and 11% are indefinite).
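Summary statistics like these can be reproduced from a list of re-block intervals with Python's statistics module; the one-year cutoff below matches the exclusion described above, while the function name and data shape are assumptions for the sketch.

```python
from statistics import median, quantiles

def reblock_summary(gaps_days, cutoff_days=365):
    """Summarize time-to-re-block intervals (in days), excluding gaps
    longer than `cutoff_days`. Returns (bottom quartile, median,
    top quartile)."""
    kept = [g for g in gaps_days if g <= cutoff_days]
    q1, _, q3 = quantiles(kept, n=4)  # quartiles, exclusive method
    return q1, median(kept), q3
```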

As is true with gathering any metrics, every data point gathered raises more questions than it answers. We’re looking forward to having more data soon so we can start discussing what these results mean and what lengths and types of blocks are actually "effective."

Best, the Anti-Harassment Tools team (posted by Trevor Bolliger, WMF Product Manager 🗨 22:21, 14 December 2018 (UTC))