Talk:Community health initiative/Metrics kit

From Meta, a Wikimedia project coordination wiki

Feedback on metrics

Overall size

One of the longest-running metrics on the English-language Wikipedia is the time between ten million edits, which I seem to have adopted in the last year or so. This has all the advantages and disadvantages of a metric that looks at very raw data: like other metrics, it showed the rise of the edit filters as a negative rather than a positive, but it does pick up on trends such as the decline in overall activity from 2017 to 2018. Activity at present is only about 10% above the 2014 low. I commend this methodology to other wikis, though perhaps in hindsight 10 million was too coarse a granularity. But it does leave me with an interesting thought: could we do something similar, both for this and for other metrics such as active editors, but excluding other edits made by the same account in the same half hour? A fairly normal metric in volunteer-based organisations is to try to measure volunteer hours contributed. When I fix a typo across Wikipedia I may do over a hundred edits per hour; someone copyediting one article may make dozens of little changes but only save once an hour, or once every ten minutes if they have learned about edit conflicts. Activity metrics that measured estimated half hours of time volunteered to Wikipedia would give an interesting metric that put AWB users like myself into a different perspective. WereSpielChequers (talk) 08:10, 18 September 2018 (UTC)

Ah, interesting, thanks WereSpielChequers. The time between edits metric is a good idea for the larger wikis, but I do wonder whether it scales to smaller projects (perhaps 1,000 edits might work better for those). I do think that's a good way to get an overall feel for the general activity level of a project.
I do wonder how that would circle back to indicate the health of a community, since a healthy and an unhealthy one would still, in theory, have roughly the same level of activity. The same may well be true for volunteer hours—which I also think is a good metric to look at, when extrapolated across an entire population on a certain project. We would need to think about how this would relate, I believe, to the notion of a healthy community atmosphere in that regard, but I do think that looking past raw edit counts is sensible. Joe Sutherland (Wikimedia Foundation) (talk) 23:19, 19 September 2018 (UTC)
I would have thought that by definition a declining community is an unhealthy community, though I will concede that a growing community can also be growing in an unhealthy way. WereSpielChequers (talk) 06:35, 20 September 2018 (UTC)
I think the definition of "healthy" can vary in this context; it can refer both to the output of the project and to the atmosphere on the project itself. They're both connected, of course, and part of the challenge of this project will be to best represent them using data we can collect and analyse. Joe Sutherland (Wikimedia Foundation) (talk) 23:24, 24 September 2018 (UTC)
The metric time between ten million edits seems legitimate at first as a proxy for activity level, but the complexity of each edit increases as the corpus grows, so an increased time between edits can be expected. Let's say we make a simple way to propose links from existing articles in the same category; then linking will be faster and the time to reach the next ten million will decrease.
Estimating editing speed in a way that is comparable to previous estimates is difficult when the defining context is changing, unless measuring changes in the defining context is the purpose.
One possibility could be to measure how many non-whitespace characters are entered in text over a one year period, but that too is sensitive to increased complexity. — Jeblad 14:19, 3 October 2018 (UTC)
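
The half-hour idea raised at the start of this thread could be sketched roughly as follows. This is only an illustration, assuming edits are available as (username, UTC timestamp) pairs; the function name `estimated_half_hours` and the sample accounts are hypothetical, not an existing tool or real data.

```python
from collections import defaultdict
from datetime import datetime, timezone

def estimated_half_hours(edits):
    """Count distinct half-hour slots in which each user saved an edit.

    `edits` is an iterable of (username, timezone-aware datetime) pairs.
    All edits a user saves inside the same clock half hour collapse into
    one slot, so a burst of quick semi-automated edits counts the same
    as a single saved copyedit.
    """
    slots = defaultdict(set)
    for user, ts in edits:
        # Slot index: whole half hours (1800 s) since the Unix epoch.
        slots[user].add(int(ts.timestamp() // 1800))
    return {user: len(user_slots) for user, user_slots in slots.items()}

def utc(hour, minute):
    """Helper for the made-up sample below (all on 18 September 2018)."""
    return datetime(2018, 9, 18, hour, minute, tzinfo=timezone.utc)

# Hypothetical sample: a typo fixer with three quick edits, and a
# copyeditor who saves once an hour.
sample = [
    ("TypoFixer", utc(8, 1)),
    ("TypoFixer", utc(8, 2)),
    ("TypoFixer", utc(8, 29)),
    ("Copyeditor", utc(8, 5)),
    ("Copyeditor", utc(9, 5)),
]
half_hours = estimated_half_hours(sample)
```

Here the typo fixer's three edits fall in one slot while the copyeditor's two saves fall in two, which is the change of perspective the comment describes. Fixed clock slots are a simplification; a session-gap rule would be another reasonable choice.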

Mobile re PC

Wikipedia is much more editable on the desktop platform than it is on mobile, and this is very probably a major contributor to the phenomenon of editing not taking off in those societies and demographics most likely to use smartphones. It is likely to be a major contributor both to the greying of the pedia and also to the ethnicity skew. It should be possible to keep track of this by measuring differential rates of edits versus page hits on the mobile and desktop platforms, ideally by country. At the least this should test whether this really is our biggest current problem. WereSpielChequers (talk) 08:10, 18 September 2018 (UTC)

Also a fair suggestion, WereSpielChequers. Though this doesn't directly impact the working atmosphere of a project, new editors coming through mobile is certainly a trend that will become more and more important as time goes on. I know of a few editors who work primarily on mobile these days just using the desktop version of the site on their phones, but this is (in my opinion, at least) a pretty clunky solution. Comparing edit figures to pageviews, particularly across devices, is also interesting for that same "new editors" reason, though I don't know if it would indicate how healthy the project is. Joe Sutherland (Wikimedia Foundation) (talk) 23:23, 19 September 2018 (UTC)
Hi Joe, perhaps we are using different definitions of the word healthy. In many countries the Internet is mainly viewed on smartphones, and if it were easier to edit Wikipedia from mobile devices we would already have a substantial proportion of our editors coming via mobile. To my mind our low rate of converting mobile readers into mobile editors is directly our main cause of recruitment problems, and indirectly I suspect it is the main cause both of our ethnicity skew and possibly a growing age skew as we distance ourselves from the smartphone generation. I assume that the decline in editing on the English Wikipedia is a sign that there is something unhealthy about that project, and if our ethnicity skew is greater than our gender skew then that makes it one of the biggest problems that we face. Or are you only interested in those aspects of community health that are probably down to incivility, such as the gender skew? WereSpielChequers (talk) 06:29, 20 September 2018 (UTC)
We're interested in any and all aspects that may be improved or measured, ideally, which would include both of those. I absolutely agree that mobile views and edits are the future and that comparing them would make for an interesting look at how the projects will develop over time, particularly with the Wikimedia 2030 long-term direction (we already track pageviews by device, at least). The challenge will be to relate that back to the "health" or atmosphere on each project, and how that can inform changes to tools etc. on the projects themselves. Joe Sutherland (Wikimedia Foundation) (talk) 23:24, 24 September 2018 (UTC)
The obvious link is with our ethnicity gap. The ratio of mobile and desktop users varies by country, and I haven't heard anyone seriously dispute either that this is a major contributor to our ethnicity gap or that the ethnicity gap is a big part of our having unhealthy communities. Less obvious is the possibility that this may also be contributing to our gender gap. The biggest drop in gender involvement comes between readers and editors: 46% of our readers are female (I'm not sure how far that is from the ratio of women on the internet, but it is close to parity), yet only 16% of those who make any edits are female. If there is a gender skew in the PC/mobile ratio then that would be a contributor to our gender gap. I suspect it is only a small contributor, but I could be wrong and it would be worth finding out. WereSpielChequers (talk) 12:03, 30 September 2018 (UTC)
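
The "edits versus page hits by platform and country" comparison suggested in this thread might look something like the sketch below. It assumes per-country, per-platform counts of edits and pageviews are already available from somewhere; the function names, the country code "XX" and the figures are placeholders invented for illustration, not real statistics.

```python
def edits_per_million_views(stats):
    """Convert raw counts into edits per million pageviews.

    `stats` maps (country, platform) -> (edits, pageviews).
    Entries with zero pageviews yield None rather than dividing by zero.
    """
    rates = {}
    for key, (edits, views) in stats.items():
        rates[key] = None if views == 0 else edits * 1_000_000 / views
    return rates

def desktop_to_mobile_ratio(rates, country):
    """How many times more edits per view desktop produces than mobile."""
    desktop = rates.get((country, "desktop"))
    mobile = rates.get((country, "mobile"))
    if desktop is None or not mobile:
        return None  # missing data or no mobile edits at all
    return desktop / mobile

# Placeholder figures for an imaginary country "XX".
sample = {
    ("XX", "desktop"): (500, 1_000_000),
    ("XX", "mobile"): (100, 4_000_000),
}
rates = edits_per_million_views(sample)
ratio = desktop_to_mobile_ratio(rates, "XX")
```

Tracking this ratio per country over time would be one way to test whether the mobile conversion gap is as large as suspected.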

Missing metrics

Hi, I know I'm late, but I only learned of this last week. There recently was a WMF initiative, and I guess one of the ideas (Grants:IdeaLab/Identify_and_quantify_toxic_users) might be useful? All the best, --Ghilt (talk) 19:01, 26 September 2018 (UTC)

You aren't late at all :) Thanks for pointing out your idea! Joe Sutherland (Wikimedia Foundation) (talk) 23:33, 27 September 2018 (UTC)

Metric "admins to content"

In my opinion the metric "admin to content" should be "user contributions to total content", where "user contributions" is the total change from active users' contributions, normalized against the total size of the text corpus. This will say something about how large a corpus the active users can maintain at any time. We do not know how much work goes into maintaining the existing corpus, but we can say whether the trend is increasing or decreasing. — Jeblad 04:57, 3 October 2018 (UTC)

The reasoning is that it is not the admins alone who maintain the content, it is the users as a whole, and the users might not update the content evenly.
There is, however, a ratio "admin to active users" that is interesting, but that should be measured as "admin contributions to active non-admin users' contributions" to scale properly with the different groups' contribution patterns. Unless it is normalized against the contribution patterns, a wiki could have a lot of non-active admins and still get a decent number. — Jeblad 13:50, 3 October 2018 (UTC)
This could have other interpretations, like admins' contributions to non-admins' contributions. These are the contributions the admins must police at any moment. — Jeblad 20:07, 4 October 2018 (UTC)

I think it is badly wrong that this Community health initiative should relate almost all of its metrics to admin presence. From the viewpoint of the health and the growth of Wikipedia, the users and their contributions are a thousand times more significant! Texaner (talk) 09:14, 8 October 2018 (UTC)

Metric for "gender bias of content"

This could be analyzed by building a word vector space (w:word2vec) and checking if there are any specific trends that should not be there. Identified problems could then be traced back to grammatical constructs used in specific articles. — Jeblad 05:06, 3 October 2018 (UTC)

I guess this is a bit short, and not obvious at all. It is like measuring bias in given word embeddings, a kind of opposite to debiasing word embeddings.[1] It is pretty straightforward to do this analysis, but it is not obvious how to present the result.

Assume we build a vector space describing the layout of words and terms in a corpus like Wikipedia. Some of the words will then lie along an axis defining a bias, and on this axis each word should be placed close to some defined zero value. For example, the term "nurse" will have a tendency to slide to the female side, but should be at zero.

Note that some languages use gender variants of terms, German for example, and that those gender-specific terms will be inherently biased. The reason why should be pretty obvious. In those cases the two variants should be at equal distances from the zero value. Even if two terms are balanced on one axis, some other bias axis might be unbalanced.

If we analyze our use of various terms, and assign values for how biased they are along different bias axes, then we can analyze an article and assign an overall score to the article. It is also possible to mark words according to how much bias each adds to the overall score, and thus which words would be the best candidates for changes to make the article more neutral.

The simplest way to present this to the editor could be to use a special page, but it would probably be better to somehow integrate it with the editing process. A simple solution could be to add a one-line overlay like the replacement dialog, and in addition add colorization of candidate terms that should be neutralized. The dialog should have a drop-down list of bias axes.

Even if editing has support for showing bias there should be some kind of maintenance page to list pages with bias problems. — Jeblad 13:18, 3 October 2018 (UTC)

Could or should this ignore quotations and song titles? Sometimes Wikipedia is trying to neutrally cover non neutral subjects. WereSpielChequers (talk) 18:49, 3 October 2018 (UTC)
That is quite an interesting question. I guess the metric should not be part of an overall summation for the maintenance page, but it could still be interesting to calculate the metric for the article. It could even be necessary to turn the summation off for individual bias axes. Does that make sense? — Jeblad 23:21, 3 October 2018 (UTC)
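
To make the bias-axis idea in this thread a little more concrete, here is a toy sketch. It assumes word vectors have already been trained (for example with word2vec) and that a bias axis can be approximated by the difference between one pair of seed words; the two-dimensional vectors, the seed pair and all function names are made up for illustration and are not the output of any real model.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def bias_axis(female_seed, male_seed):
    """Unit vector pointing from the male seed word to the female one."""
    return normalize([f - m for f, m in zip(female_seed, male_seed)])

def bias_score(word_vec, axis):
    """Projection of the unit-length word vector onto the bias axis.

    Zero means neutral on this axis; the sign says which side it leans to.
    """
    return sum(w * a for w, a in zip(normalize(word_vec), axis))

def article_scores(words, embeddings, axis):
    """Return (overall score, per-word scores) for the known words."""
    per_word = {w: bias_score(embeddings[w], axis)
                for w in words if w in embeddings}
    overall = sum(per_word.values()) / len(per_word) if per_word else 0.0
    return overall, per_word

# Invented 2-D "embeddings"; real ones have hundreds of dimensions.
toy = {
    "she": [1.0, 1.0], "he": [-1.0, 1.0],
    "nurse": [0.5, 1.0], "engineer": [-0.5, 1.0], "table": [0.0, 1.0],
}
axis = bias_axis(toy["she"], toy["he"])
overall, per_word = article_scores(["nurse", "engineer", "table"], toy, axis)
```

The per-word scores are what an editing overlay could use to colorize candidate terms, and quoted text or song titles could simply be left out of `words` before scoring.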

Metric for "noticeboard activity"

Some rants at Grants talk:IdeaLab/Measure replacement rate among the admins#Additional health metric response time. Simply using the number of sections would not be meaningful. — Jeblad 05:10, 3 October 2018 (UTC)

I'm not sure, but wonder if it is good enough to just add a done-template in a thread if the thread is named in the comment. If a tag is added, then it would be sufficient to analyze the history. The number of unresolved threads on the last revision is also interesting, but I don't believe unresolved archived threads are interesting. It could be, perhaps…

If a long history is analyzed, then thread titles can be duplicated. We can't distinguish such threads without analyzing the content of the revisions. Perhaps new threads could be tagged; that would create a simple workaround.

Seems like we would need some sort of protected tags that can't be removed. — Jeblad 14:04, 3 October 2018 (UTC)

Metric for "contributor retention rates"

This is somewhat similar to a metric I've been calculating for nowiki, but then for new editors.

Assume we list all users active within a time period who are at least autoconfirmed at some level. Then we wait one year and check whether the same group of users is still active. If the time period is too short the error rate will increase, as the probability of observing a rare user is quite low. If the time period is too long the error rate will also increase, because the user may have left even if we observe a contribution. The difference between the two groups, the second subtracted from the first, gives the users who are no longer active; the overlap gives the users still active. It is important that the two time periods are the same length, otherwise it will create an artificial drop. Don't use a second time period that is cut short by the present time; it will show a catastrophic drop. It is also important that the group from the second time period only consists of users observed in the first time period.

The retention r is given by the count of active users at time period 0, minus the intersection of the active users at time period 0 and time period 1, normalized over the count of active users at time period 0. The formula can be reordered and slightly simplified.
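
Written out, the verbal description above appears to correspond to the following, where U_0 and U_1 (symbols chosen here for illustration, not taken from the original post) are the sets of users active in time period 0 and time period 1:

```latex
r = \frac{|U_0| - |U_0 \cap U_1|}{|U_0|}
  = 1 - \frac{|U_0 \cap U_1|}{|U_0|}
```

Read literally, this r is the share of period-0 users who are no longer active in period 1; the share who stayed is the complementary term |U_0 ∩ U_1| / |U_0|.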

It is possible to generate this metric by using sliding windows, but then the calculations must stop before the windows start shrinking because time period 1 starts sliding past the present time, or both windows must shrink at the same rate. In the latter case a graph will become increasingly noisy, and it might exhibit strange behavior. It is probably better to stop in time, as the strange behavior can trigger unwanted (and completely unnecessary) discussions.

I would probably go for three-month fixed windows, so May, June, and July will be compared to the same three-month period from the previous year. The reason the diff goes over a year is to get rid of seasonal variations.
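
A rough sketch of the fixed-window comparison described above, assuming edits are available as (username, date) pairs and that only qualifying (e.g. autoconfirmed) users are included. The function names and the sample data are hypothetical.

```python
from datetime import date

def active_users(edits, start, end):
    """Set of users with at least one edit on a day in [start, end)."""
    return {user for user, day in edits if start <= day < end}

def dropout_and_retained(edits, start0, end0, start1, end1):
    """Compare two equally long windows, typically one year apart.

    Returns (share lost, share retained) relative to the first window,
    or None if nobody was active in the first window.
    """
    first = active_users(edits, start0, end0)
    if not first:
        return None
    # Only users already observed in the first window count in the second.
    still_active = active_users(edits, start1, end1) & first
    retained = len(still_active) / len(first)
    return 1 - retained, retained

# Made-up sample: four users active in May-July 2017, one of whom (A)
# is still active in May-July 2018; E is new in 2018 and is ignored.
sample = [
    ("A", date(2017, 5, 3)), ("B", date(2017, 6, 10)),
    ("C", date(2017, 7, 1)), ("D", date(2017, 7, 30)),
    ("A", date(2018, 6, 15)), ("E", date(2018, 6, 20)),
]
result = dropout_and_retained(
    sample,
    date(2017, 5, 1), date(2017, 8, 1),
    date(2018, 5, 1), date(2018, 8, 1),
)
```

Both windows here cover the same three calendar months, which keeps seasonal variation out of the comparison and avoids the artificial drop that a truncated second window would produce.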

It is possible to get a slightly more representative metric, an impact of contributor retention, by converting the number of users in each time period to the total contribution size in each time period for the active users.

The retention r' is given by the contributions of active users at time period 0, minus the contributions of the active users at time period 0 and time period 1, normalized over the contributions of active users at time period 0. The formula can be reordered and slightly simplified.
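
One possible reading of this description, using symbols introduced here only for illustration: let U_0 and U_1 be the sets of users active in time periods 0 and 1, and let c_0(u) be the contribution size of user u during time period 0. The post does not say in which period the contributions are counted, so counting them in period 0 is an assumption.

```latex
r' = \frac{\sum_{u \in U_0} c_0(u) - \sum_{u \in U_0 \cap U_1} c_0(u)}
          {\sum_{u \in U_0} c_0(u)}
   = 1 - \frac{\sum_{u \in U_0 \cap U_1} c_0(u)}{\sum_{u \in U_0} c_0(u)}
```

Under this reading r' is the share of period-0 contribution volume that came from users who are no longer active in period 1.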

The impact of contributor retention will be less than ordinary contributor retention because most editors who leave the project are new contributors who have fewer contributions than more established ones. If established editors leave the project, however, it can be more visible.

I believe this should be enough to give a general idea how this metric can be calculated. — Jeblad 23:15, 3 October 2018 (UTC)

Availability of admins (or any group)

I wonder if there should be a metric for availability of admins, that is, time coverage where at least one admin is present. This can be measured as time slots of ten minutes where at least one admin pings a dedicated server. The actual client code should not ping the server unless it has a visible window.[2] It is possible to do this slightly more efficiently by tracking stats in the browser, and only occasionally dumping the stats to the dedicated server. — Jeblad 12:07, 10 October 2018 (UTC)
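
On the server side, the ten-minute slot coverage described above could be computed along these lines. This is a minimal sketch that assumes the pings have already been collected as UTC timestamps; the name `coverage` and the sample pings are hypothetical, and the client-side visibility check is not shown.

```python
from datetime import datetime, timedelta, timezone

SLOT = timedelta(minutes=10)

def coverage(pings, start, end):
    """Fraction of ten-minute slots in [start, end) with at least one ping.

    `pings` is an iterable of timezone-aware datetimes, one per admin ping.
    Several pings in the same slot count only once.
    """
    total_slots = (end - start) // SLOT
    if total_slots <= 0:
        return None
    covered = {(ts - start) // SLOT for ts in pings if start <= ts < end}
    return len(covered) / total_slots

def at(hour, minute):
    """Helper for the made-up sample (all on 10 October 2018, UTC)."""
    return datetime(2018, 10, 10, hour, minute, tzinfo=timezone.utc)

# Hypothetical pings during one hour: slots 0, 0, 3 and 5 are hit.
pings = [at(12, 1), at(12, 5), at(12, 31), at(12, 59)]
share = coverage(pings, at(12, 0), at(13, 0))
```

Three of the six slots in the sample hour have at least one admin present, so the coverage is one half. The same function works for any user group whose clients send pings.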

User confidence in administrators

A slight variation on "User confidence in administrators" could be to use randomly chosen "metaadmins" who postulate the outcomes of admin actions, and then give the admin a rating of how well they are doing. If the admins are too much out of sync with the metaadmins, then something must be done.

Metaadmins should not know which admin performed the action, or what the outcome was. Metaadmins are chosen randomly among active users, but so that a user does not rate their own actions. A metaadmin should rate 10-12 actions and then be relieved.

It could be possible to rate ordinary contributions, but that would imply use of readers to do the rating. This could be interesting for some kind of contentious edits. — Jeblad 15:55, 14 October 2018 (UTC)

Gini coefficient

I've been skeptical of the w:Gini coefficient from the first time I heard about it, wondering if it muddles the issue more than it explains it. What it tries to explain is whether the distribution of contributions among users follows some kind of predictable pattern, but to do so you must make some assumptions about the behavior of the users. Those assumptions do not hold when there are no checks in place for new accounts, that is, when users can log out and still edit, in effect creating a new cheap throw-away account.

One way around this could be to only do statistics on logged-in users, but that will give skewed statistics. — Jeblad 12:29, 18 October 2018 (UTC)
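
For reference, the Gini coefficient over per-identity edit counts can be computed as in the sketch below, which also illustrates the concern about throw-away identities. The function name and the sample counts are invented for the example.

```python
def gini(counts):
    """Gini coefficient of a list of non-negative contribution counts.

    0 means everyone contributes equally; values near 1 mean a few
    identities account for almost all contributions.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula on the sorted values.
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

# Invented example: three occasional editors and one heavy editor.
one_identity = [1, 1, 1, 9]
# The same nine edits by the heavy editor, made logged out under three
# different throw-away identities of three edits each.
split_identities = [1, 1, 1, 3, 3, 3]

g_single = gini(one_identity)
g_split = gini(split_identities)
```

The same real-world behavior produces a noticeably lower coefficient once the heavy editor's work is split across identities, which is why the number is hard to interpret without knowing how identities map to people.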

It could be possible to set the previous user contribution level in the browser, and then report that as a w:randomized response even if the user isn't logged in while doing a contribution. The contribution level is reused as long as there is no higher level, or it times out. This works as long as the browser is not in incognito mode. The numbers in this case will be correct for non-logged in regular users.
A different way to do the statistics would be to track present user contributions for each day locally in the browser, and then report that as a randomized response. The obfuscated value could be the contributions with randomly added noise. On average this gives the correct value over the whole population. After pushing to the server the counts are reset. This works even in incognito mode if the browser is kept open. The numbers in this case will be correct for non-logged in new users.
A third method could be to count contributions and then report them back as a vector of binned values. That could also use randomized response. The counter could be initialized while the user is logged in. If the user starts contributing without the counter being initialized, then it is set to zero before being incremented.
For other methods on tracking anonymous users, see Novak, Julie; Feit, Elea McDonnell; Jensen, Shane; Bradlow, Eric; Bayesian Imputation for Anonymous Visits in CRM Data, (December 7, 2015). [3][4] Jeblad 12:34, 18 October 2018 (UTC)
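
The second method above, reporting a daily count with random noise added, can be simulated in a few lines. This is a sketch under stated assumptions: the noise is a uniform integer in a symmetric range (so its mean is zero), and the population, the `spread` parameter and the function names are all invented for the simulation rather than taken from any existing client code.

```python
import random

def obfuscate(count, spread, rng):
    """Client side: report the true count plus zero-mean integer noise.

    A single report reveals little about the individual, and may even
    be negative, but the noise cancels out across many reports.
    """
    return count + rng.randint(-spread, spread)

def estimate_mean(reports):
    """Server side: average of the noisy reports."""
    return sum(reports) / len(reports)

rng = random.Random(42)  # fixed seed only so the simulation is repeatable
# Simulated daily contribution counts for 10,000 browsers.
true_counts = [rng.randint(0, 20) for _ in range(10_000)]
reports = [obfuscate(c, spread=5, rng=rng) for c in true_counts]

true_mean = sum(true_counts) / len(true_counts)
estimated = estimate_mean(reports)
```

With ten thousand reports the estimated mean lands very close to the true mean, even though no individual report can be trusted. A wider `spread` gives more privacy per report at the cost of needing a larger population for the same accuracy.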

Non-logged in users contributions as proxy for community resentment

It is possible to use users' edits made while they are not logged in as a proxy for community resentment. It can be measured as the ratio of previously logged-in users' contributions to logged-in users' contributions. The ratio should be very close to a fixed value. If it changes abruptly then something is going on in the community.

In particular, if specific threads get a high degree of contributions from previously logged-in users, then flags should be raised. Whether we want to flag threads as poisonous is, however, a matter for discussion. — Jeblad 14:10, 18 October 2018 (UTC)

Better informing dispute resolution/sanctions

I would be very interested in having more metrics on how different methods of dispute resolution and sanctions affect the project, especially user retention. For instance, it would be helpful to know the "survival rate" of receiving an IBAN or a topic ban. How many editors who receive an IBAN suffer a significant decrease in activity shortly afterwards? How many such editors receive a block within six months? etc. Having hard data to answer these sorts of questions would allow the community to evaluate how effective our dispute resolution processes and tools are in reality. ~ Rob13Talk 17:56, 21 October 2018 (UTC)
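
As a rough illustration of the kind of "survival rate" figures asked for above, the sketch below counts how many sanctioned editors show a marked drop in activity, and how many are blocked, within a window after the sanction. The 182-day window, the 50% drop threshold, the function name and the sample data are all assumptions made for the example, not agreed definitions.

```python
from datetime import date, timedelta

def sanction_outcomes(sanctions, edits, blocks, window_days=182, drop=0.5):
    """Summarize what happens to editors after a sanction.

    sanctions: {user: date the IBAN or topic ban was imposed}
    edits, blocks: lists of (user, date) pairs
    An editor counts as "activity_dropped" if their edits in the window
    after the sanction fall below `drop` times their edits in the
    equally long window before it.
    """
    window = timedelta(days=window_days)
    dropped = blocked = 0
    for user, day in sanctions.items():
        before = sum(1 for u, d in edits
                     if u == user and day - window <= d < day)
        after = sum(1 for u, d in edits
                    if u == user and day <= d < day + window)
        if before and after < before * drop:
            dropped += 1
        if any(u == user and day <= d < day + window for u, d in blocks):
            blocked += 1
    return {"sanctioned": len(sanctions),
            "activity_dropped": dropped,
            "blocked_within_window": blocked}

# Made-up sample: A goes from 10 edits to 2; B keeps editing but is blocked.
sanctions = {"A": date(2018, 1, 1), "B": date(2018, 1, 1)}
edits = ([("A", date(2017, 12, d)) for d in range(1, 11)]
         + [("A", date(2018, 1, 5)), ("A", date(2018, 2, 5))]
         + [("B", date(2017, 12, d)) for d in range(1, 5)]
         + [("B", date(2018, 1, d)) for d in range(2, 6)])
blocks = [("B", date(2018, 3, 1))]
summary = sanction_outcomes(sanctions, edits, blocks)
```

Comparing such summaries across sanction types, and against a baseline of similar editors who were not sanctioned, would give a first view of how each tool affects retention.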

Feedback on design

API

  • Guessing this can go here? Love the idea (moar data is always good!) - could we make sure this is easily accessible through an API or similar? :3 - TNT 💖 21:39, 2 October 2018 (UTC)
    Thanks There'sNoTime! An API endpoint, or something like it, is a great point to raise and I'll bring that up as we move forward. Joe Sutherland (Wikimedia Foundation) (talk) 21:47, 2 October 2018 (UTC)

Other feedback

  • Most items are about the administration. The apparatus strikes back. Kängurutatze (talk) 22:58, 17 September 2018 (UTC)

Possibility of future updates after the tool has been completed

I like the idea very much. Great work! I'm interested in the gender bias in content and doing research on it. I am hoping to be able to produce a tool for a more fine-grained measure of this by the end of my research. Just wondering, if I finish it, after the tool is up and running, would there be a possibility of adding it to the metrics? (after community approval, of course).--Reem Al-Kashif (talk) 10:51, 12 October 2018 (UTC)

Hi Reem Al-Kashif - thanks for the compliments :) Our intention is for this to remain an adaptable tool, and that could accommodate something like you are suggesting. So yeah, we could totally look into it as an extension of the tool once it's up and running! (P.S. I hope you don't mind but I moved your comment up the page a little bit :) Joe Sutherland (Wikimedia Foundation) (talk) 00:01, 18 October 2018 (UTC)

Sign up for future updates

Feedback from the Wikimedia CEE Meeting 2018

During the Wikimedia CEE Meeting 2018, participants of a workshop about the needs of healthy communities gave feedback regarding the Community Health Metrics Kit. They worked in 5 groups of up to 6 people and were asked to discuss the following questions:

  • Is the proposed kit missing a broad category?
  • Is there anything in the table which is especially important to your community? If so - why?
  • Is there anything missing from the table which would be especially important to your community?
  • Do you have any ideas how you could use such statistics to improve your community?

As the time was limited, most groups focused on one or two of the questions. Here is the feedback we gathered from the discussion at the session:

Metrics considered especially valuable

  • Level of freedom to contribute / government pressure - ?
  • Admin ratios (admins to active editors)
  • Level of abuse filter trips
  • Admin ratios (admins to content)
  • User retention
  • Connection with colleagues (number of affiliates in relation to active editors)

Comments on some of the metrics proposed

  • Number of active users - trolls and vandals can be considered active users, so how can you take that into consideration?
  • Ability to report bugs or concerns - There also needs to be a cultural shift for people to feel comfortable to report. Many people don't report for a variety of reasons
  • Frequency of blocking reasons considered valuable but difficult to measure
  • Gender bias of content considered valuable but difficult to measure (Bar/inclusion criteria?)

Missing metrics

  • User longevity
  • Quality of the content
  • Revert reasons
  • Protection reasons
  • Speciality stats
  • Admin availability
  • Measure heated village pump discussions (keyword?) (AI?)
  • Number of checkuser checks
  • Estimate the number of editors vs the number of speakers of that language
  • Number of internet users per country

Other things missing

  • Study of rules and procedures
  • Qualitative data; structured in-depth interview