Research and Decision Science/Data glossary/Essential metrics

Background

Metrics and measurements for decision making have been in demand in the Wikimedia Foundation and the Movement for many years. Over the past years hundreds of metrics were defined by different Wikimedia Foundation teams alone. While some metrics have benefited from (relatively) significant research, engineering, documentation, and communication resources for their definition, maintenance, and raising awareness about their usage, the majority of the metrics remain for the most part unknown to decision makers, inaccessible, unmaintained/under-resourced, or sometimes contradictory/inconsistent.

Wikimedia Foundation has recognized the above challenges for multiple years and committed some resources to tackle them. More recently, in its fiscal year 2022-2023 annual plan, WMF recognized the need to create and/or maintain a small set of core metrics for decision making. That work resurfaced the need to revisit the existing metrics, standardize them where relevant, and develop criteria for deciding whether to dedicate more resources to their maintenance.

As part of fiscal year 2023-2024, Wikimedia Foundation prioritized work to bring some organization to the way we think about, dedicate resources to, and use metrics. We recognized that while there will always be a need for a small set of metrics (currently referred to as core) to inform important decision making, the organization and the Movement need more than just a few metrics for decision making. At the same time, with limited resources, it was evident that we could not commit to standardizing, maintaining, documenting, and making accessible hundreds of metrics.

Between July and December 2023, we brought together a representative group of experts and stakeholders in WMF as part of a working group to focus on one task: co-develop and propose the criteria for essential metrics with the understanding that the criteria, once accepted, will be used to determine which of the many metrics WMF has developed over the years are to be considered and resourced as essential.

In the rest of this page, you will learn more about the details that came out of the discussions and deliberations of the working group, which are now captured as part of the Essential Metric definition.

Definitions

The definition of Essential Metrics includes multiple terms that require further specifications for us to be able to use the definition in practice. In this section we expand these terms, the intention behind introducing them, and how one can interpret and use them in practice.

General definitions

Accuracy: how close a given set of measurements (observations or readings) are to their true value; or a description of systematic errors.

Benchmark: comparison of the relative value of a given metric or measure based on industry standards or best practices.

Baseline: initial level of a given measure collected for the purpose of serving as a reference point for future data collection, especially for impact evaluation or anomaly detection.

Business data steward: Business data stewards are individuals responsible for the day-to-day management and oversight of specific data assets. They ensure data quality, integrity, and compliance with established policies and procedures.

Credibility: the quality of being convincing and believable, especially by adhering to commonly-accepted scientific practices and methods.

Measurement: an objective representation of a (fairly) concrete concept used to calculate a metric.

Metric: a subjective, quantitative or qualitative representation of an abstract concept used for monitoring, learning, or decision-making. While we recognize that “metric” has a variety of definitions in other contexts (e.g., “software metric”), we specifically define it in distinction to “measurement” to highlight the element of subjective judgment in defining a metric. Both effective decision-making and scientific credibility require that measurements not be conflated with the abstract concept they are meant to represent (see here for an application to software metrics).

Important decision: this is intrinsically subjective, but important decisions that might be made by Wikimedia Foundation and/or affiliate leadership might include:

resourcing decisions, such as which communities to focus interventions on or which product features to build. These might rely on metrics used to identify potential problems or opportunities (possibly by measuring which Wikimedia spaces are growing or have high potential for specific programmatic interventions).

Precision: how close a given set of measurements are to each other, or a description of random errors; a measure of statistical variability.

Technical data steward: Technical data stewards provide support and are associated with specific systems, applications, data stores, and technical processes such as data quality rule enforcement or ETL jobs.

Essential metrics evaluation dimensions

Accessibility

For essential metrics to effectively inform decision-making, they must be accessible and understandable for both:

decision-makers
those affected by the decisions made based on essential metrics and measurements.

What's more, because the Wikimedia Foundation is an organization responsible to a global movement, the decisions it makes based on essential metrics, and the metrics themselves, must be as accessible and understandable as possible to members of that movement.

Relevance

Essential metrics should enable the Wikimedia Foundation and/or Wikimedia Movement leadership to assess progress toward our strategic direction or to support important Wikimedia Foundation operations. Metrics required to measure important operations may not be directly related to the strategic direction of the Movement, but they are needed to understand day-to-day operations that are essential or related to performance.

Trustability

Essential metrics will be used for decision-making and for advancing the understanding of the Wikimedia projects. As a result, it is important that the people who use them and those who are affected by them can trust them.

Essential metrics should be scientifically rigorous. While there can be no single standard of rigor applied uniformly across all essential metrics, essential metrics should meet reasonable standards of rigor based on their anticipated use cases. That is, essential metrics should be as accurate, precise, and credible as feasible given their use cases.

Accurate measurement of an Essential metric is crucial. Essential metrics should address potential sources of error and should be representative of the population or quantity of interest (measurements should be based on representative samples or a complete census of relevant observations).

Precision is an intrinsically relative criterion, but Essential metrics should always utilize measurements with sufficient precision to serve their anticipated use cases. That is, these measurements must be granular enough to detect important changes.

Essential metrics must be credible to decision-makers and those affected by decisions based on them. That is, they must be convincing and believable, especially by adhering to commonly accepted scientific practices and methods.
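As an illustration of the accuracy and precision definitions used above, the following minimal sketch computes a systematic error (bias) and a random-error spread for a set of repeated measurements. The function names, the example readings, and the rule of thumb in `can_detect_change` are assumptions made for this example; they are not part of the Essential Metric definition.

```python
from statistics import mean, stdev


def accuracy_error(measurements, true_value):
    """Systematic error: how far the average measurement is from the true value."""
    return mean(measurements) - true_value


def precision_spread(measurements):
    """Random error: sample standard deviation of the repeated measurements."""
    return stdev(measurements)


def can_detect_change(measurements, important_change):
    """Rough heuristic (an assumption of this sketch, not a standard):
    the spread should be smaller than the change we care about."""
    return precision_spread(measurements) < important_change


# Hypothetical repeated estimates of a quantity whose true value is 100.
readings = [104.0, 105.0, 106.0, 105.0]

bias = accuracy_error(readings, 100.0)   # readings sit consistently above the true value
spread = precision_spread(readings)      # readings are close to each other

print(f"bias={bias:.2f}, spread={spread:.2f}")
print("can detect a change of 2.0:", can_detect_change(readings, 2.0))
print("can detect a change of 0.5:", can_detect_change(readings, 0.5))
```

In this made-up example the readings are precise (they agree closely with each other) but not accurate (they are all offset from the true value), which is why the two properties are assessed separately.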

FAQ

We will expand the FAQ section as we hear questions about the definition and implementation of essential metrics.

Q. Why should an essential metric be in "production" and what does that actually mean in practice?
A. There are many (endogenous and exogenous) factors that can affect the computation of measurements for a metric and the definition of the metric itself. As a result, a metric that is used in frequent decision making must be reviewed and updated as relevant. By "in production" we aim to communicate that for a metric to be considered essential by WMF, the organization must dedicate resources to it. At a minimum, a metric in production must have:

Defined business owners or data stewards who can make decisions and answer questions about the metric.
Planned engineering resources dedicated to enabling standardized data pipelines, data quality checks, monitoring and alerts, service-level objectives, automation, comprehensive technical documentation, and data visualization.
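The minimum requirements listed above could be tracked as a simple checklist. The sketch below is one hypothetical way to do that; the class `MetricStatus`, its fields, and the capability identifiers are invented for illustration and do not describe an existing WMF tool or process.

```python
from dataclasses import dataclass, field

# Capability identifiers paraphrase the list above; the names are hypothetical.
REQUIRED_CAPABILITIES = frozenset({
    "standardized_pipeline",
    "data_quality_checks",
    "monitoring_and_alerts",
    "service_level_objectives",
    "automation",
    "technical_documentation",
    "data_visualization",
})


@dataclass
class MetricStatus:
    """Tracks whether a metric meets the minimum 'in production' requirements."""
    name: str
    stewards: list[str] = field(default_factory=list)     # business owners or data stewards
    capabilities: set[str] = field(default_factory=set)   # engineering capabilities in place

    def missing_requirements(self) -> list[str]:
        """Return the requirements that are not yet satisfied."""
        missing = []
        if not self.stewards:
            missing.append("stewards")
        missing.extend(sorted(REQUIRED_CAPABILITIES - self.capabilities))
        return missing

    def in_production(self) -> bool:
        """A metric counts as in production only when nothing is missing."""
        return not self.missing_requirements()


# Hypothetical metric with a steward but only one capability in place.
status = MetricStatus(
    name="example_metric",
    stewards=["example_steward"],
    capabilities={"standardized_pipeline"},
)
print(status.in_production())
print(status.missing_requirements())
```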

Q. Who can develop metrics that can be considered to become essential metrics?
A. Anyone. We want to make sure to leave the space open for others to develop metrics that we can consider to categorize as essential metrics. This will allow us to benefit from the extensive knowledge of those closest to the places where decisions need to be made as well as those with more resources (for example, those in research or academic institutions) to develop new metrics.

Q. Why is it a consideration for a metric to be used frequently in decision making for it to be considered an essential metric?
A. The primary reason is that our resources are limited and we must choose which metrics we dedicate them to. If a metric is used only a limited number of times for decision making, even when the decisions are of significant importance, we do not need to productionize it or dedicate long-term attention and resources to it. Such a metric is therefore not considered "essential" under the specific definition we use here.

Q. Can new metrics be defined and considered as essential, or are these criteria going to be applied only to already existing metrics?
A. New metrics can be defined and considered as part of the essential metric criteria.

Q. How do you determine who is going to be affected by one or more decisions based on the metric (referring to the "transparency" requirement for essential metrics)?
A. Our vision is that operationalizing transparency for metrics can be done through the development of "Metrics Cards", inspired by Machine Learning Model Cards. If such an implementation is adopted, the proposer of a metric and/or the committee that reviews the proposal will need to call out the audiences/groups that they know may be affected. At the same time, transparently sharing Metrics Cards allows those who may be affected by the decisions to surface potential challenges with making decisions based on specific metrics.
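To make the idea more concrete, here is a minimal sketch of what a Metrics Card could look like as a data structure. The page does not specify the contents of a Metrics Card, so every field and method name below (`MetricsCard`, `affected_audiences`, `add_concern`, and so on) is a hypothetical choice for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class MetricsCard:
    """Hypothetical Metrics Card; all fields are assumptions of this sketch."""
    name: str
    definition: str
    proposer: str
    # Audiences/groups the proposer or review committee knows may be affected.
    affected_audiences: list[str] = field(default_factory=list)
    # Challenges surfaced by people who may be affected by decisions.
    concerns: list[tuple[str, str]] = field(default_factory=list)

    def add_concern(self, raised_by: str, concern: str) -> None:
        """Record a challenge raised about making decisions with this metric."""
        self.concerns.append((raised_by, concern))

    def to_text(self) -> str:
        """Render the card as plain text so it can be shared transparently."""
        lines = [
            f"Metric: {self.name}",
            f"Definition: {self.definition}",
            f"Proposed by: {self.proposer}",
            "Affected audiences: " + (", ".join(self.affected_audiences) or "none listed"),
        ]
        for raised_by, concern in self.concerns:
            lines.append(f"Concern ({raised_by}): {concern}")
        return "\n".join(lines)


# Hypothetical card contents, made up for this example.
card = MetricsCard(
    name="example_metric",
    definition="Placeholder definition of the metric.",
    proposer="example_team",
    affected_audiences=["example_community"],
)
card.add_concern("example_community", "Placeholder concern about how the metric is used.")
print(card.to_text())
```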

Q. Is a metric that is developed or needs to be developed for annual planning purposes automatically considered to be essential?
A. No. How frequently a metric is used in decision making is important to consider when deciding whether to call it essential, because that designation effectively triggers medium- to long-term resources being dedicated to the metric. If we expect a metric to be used for decision making only for a few quarters, or even just a couple of years, by a small group of people, the metric will need resources but is not necessarily considered essential.

Q. What do you mean when you say WMF's "strategic direction"?
A. At the time of developing the criteria for essential metrics, WMF started developing an updated strategy for the organization. The decision making related to this multi-year organization strategy, once it becomes stable, will require development or maintenance of metrics that can be considered as essential.