Learning and Evaluation/Evaluation reports/2013

From Meta, a Wikimedia project coordination wiki




Evaluation Report (beta)

Edit-a-thons • Editing Workshops • On-wiki writing contests • GLAM Content Release Partnerships • Wiki Loves Monuments • Other Photo Initiatives • Wikipedia Education Program


In 2013, the Wikimedia Foundation began an initiative that would lead to a greater understanding of the incredible work that international Wikimedia organizations and individual volunteers are doing around the world to increase content on Wikimedia projects.

This collection of reports is the result of this initial assessment. We invite you to read the overview first (just below): it sets the stage for the individual program reports that introduce our approach, methodology, important definitions, and some findings about the reporting process. Each of the seven programs listed above has a report of its own, including data and findings. Be sure to view the talkpages to see current discussion around questions.

We hope you find these reports useful, and we encourage you to engage in the conversation to bring to light the stories that Wikimedia editors are creating around the world.



Overview
This overview page provides up-to-date information about who submitted data and how we gathered it, along with additional data, for evaluating programs in the Wikimedia movement.

Here you'll find information about:

Introduction and background

Three important definitions

Program

Achim Raschka is a program leader. He organizes the German WikiCup contest and the photography program, Wikipedia Festivalsommer.

A program is a group of projects and activities that share similar theories of change and often have the same mission or goals. A program may be a one-time event, the same recurring event, or a series of events with new or returning participants.

Our team has been looking into a specific set of Wikimedia programs: those that are organized and run by Wikimedia community members. These programs include edit-a-thons, editing workshops, GLAM content donations, on-wiki writing contests, photo upload competitions and events (e.g. Wiki Loves Monuments, Wiki Takes, Wiki Expeditions), and the Wikipedia Education Program. Some of these programs share similar theories of change, and many share goals of editor recruitment, engagement, and retention along with Wikimedia content creation and improvement. We plan to expand the number of programs we examine with each round of evaluation.

Program leader

A program leader is a person who plans, executes and, typically, evaluates programs. Sometimes programs have multiple program leaders. Wikimedia program leaders might be individuals with no chapter affiliation. They might be the (volunteer) president of a chapter, or perhaps a paid employee of a chapter who designs and executes its programs; or a member of an affiliate group recognized by the Wikimedia community; or a librarian who hosts workshops at their library to teach people how to edit Wikipedia. You might be a program leader!

Program implementation

A program implementation is an instance where a program leader plans and executes a program. An implementation may be for a single program round, or it may be ongoing. Either way, a program implementation refers to a particular time and place in which a program may occur.

Current evaluation initiative

The team has designed an evaluation initiative during this initial phase. This initiative consists of:

  • Self-evaluation – Program leaders are responsible for evaluating their own programs. The team is here to support that self-evaluation.
  • Collaboration – Program evaluation and design is a new concept to many community members and Wikimedia Foundation staff members. We are in this together, and by learning and working together as an evaluation community, we can build a shared understanding about evaluation and design to maximize the impact of our programs.
  • Capacity building – Our goal is to provide program leaders with the skills and tools to evaluate and design their programs. Doing this successfully will enable the community to evaluate more easily and effectively.

Community activities

During this pilot period of program evaluation and design we're continuing to plan and execute community activities to engage program leaders. The pilot will allow the team and the Wikimedia community to gain a better understanding of the "state of evaluation" in the community. Our activities thus far include:

Participants in the first Program Evaluation and Design workshop in June 2013

The first Program Evaluation and Design Workshop, June 2013, Budapest

This pilot workshop brought together 21 program leaders from 15 countries to learn the basics about program evaluation and theory, including theory of change and logic models. Between facilitated presentations and hands-on team-based activities, attendees gained first-hand experience at creating and using these tools to further the impact of their programs.

Learn more: "Finding out what works: first program evaluation workshop" via the WMF blog.

Evaluation capability status survey

In August 2013 we sent a request to more than 100 program leaders around the world to take a survey. The data from their responses allowed us and the Wikimedia community to understand what types of data program leaders were tracking and monitoring regarding edit-a-thons, workshops, GLAM content donations, photography contests, online editing contests, and the Wikipedia Education Program.

Learn more: "Survey shows interest in evaluation in Wikimedia movement, with room to grow" via the Wikimedia Foundation blog.

Data collection survey

In October 2013 we sent out a follow-up survey requesting data from program leaders. Unlike the capability survey, which surveyed what kinds of data program leaders were collecting, this survey requested that program leaders voluntarily share actual data that they'd collected for edit-a-thons, workshops, GLAM content donations, photography contests, online editing contests, and the Wikipedia Education Program. The results are included in this, our first evaluation report.

Wikimedia programs: Evaluation report (beta)

Purpose and goals

This initial version of the Evaluation Report aims to provide the Wikimedia community with a first look at data collected from community-run programs around the world and to identify opportunities to further support the community with program evaluation and design. The report includes data from the first Data Collection Survey in addition to data pulled from online tools that are also available to the community.

The goals of this initial report are:

  • To serve as a baseline or starting point for metrics and data reporting, with the hope that program leaders in the Wikimedia community will be inspired to collect and report data that can assist programs in reaching their identified goals.
  • To let the Program Evaluation and Design team and the Wikimedia community use this pilot report to explore methods for improving the collection and reporting of data. These learnings can be applied to the next data collection survey and report! We want to support program leaders by making evaluation and learning easy and fun.

Overall response rates and limitations

Response rate

23 program leaders voluntarily reported on 64 programs they produced. Our team removed six reported programs because we could not disaggregate and confirm the numbers that were shared, bringing the usable total to 58. In addition, one program leader sent the Program Evaluation & Design team a list of cohorts from six workshops they produced, for which we pulled the data ourselves. That data is included in this report. To expand collection, we mined data for 61 additional program implementations from data that was publicly available on wiki (e.g. reports and event pages). This mined data comprised 51% of the data used in this report. It increased our collection of output and outcome data, helping us to fill in some gaps that were not covered by the survey responses. In total, 119 program implementations have been included in this report.
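The counts in this paragraph can be checked with a little arithmetic. The sketch below uses only the figures stated above; the variable names are ours, and it assumes the six workshop cohorts are already counted within these figures, since the stated totals add up without them.

```python
# Figures taken from the response-rate paragraph above.
reported = 64   # programs reported by program leaders
removed = 6     # removed because the numbers could not be confirmed
mined = 61      # implementations mined from public on-wiki records

usable_reported = reported - removed        # directly reported and usable
total = usable_reported + mined             # all implementations in the report
mined_share = round(100 * mined / total)    # mined data as a whole percentage

print(usable_reported, total, mined_share)  # 58 119 51
```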

Response rate.

Data issues and limitations

The survey had a low response rate, which resulted in a small amount of reported data and high variability in the data that was reported.

Because of this, this report includes means, response range, standard deviations, and medians. Because of the wide range of numeric responses, and thus low number of modes, modes are being reported selectively, and not in all the data reported in this report.

Program leaders aren't consistently reporting program budgets and staff/volunteer program implementation hours.
Reporting detailed budgets is a challenge for most program leaders.

Even those who have been tracking their inputs, outputs, and/or outcomes have done so with varying consistency and levels of analysis. For example, while many program leaders track their budgets, they often don't track the budget down to the details—they only track the overall budget. Thus details about how much certain parts of a program cost, and other specifics, are lacking. Out of the 59 programs reported on by program leaders, 64% included a budget report; however, 22% (12 out of 13 reports provided directly) reported no budget but did report hours invested.

Most program leaders who responded to the survey were able to estimate how many staff hours (51% of data reported) and volunteer hours (81% of data reported) went into implementing a program, but very few were able to report exact hours (7% for staff hours, and 5% for volunteer hours). In total, 89% of program leaders reported some type of data about hours, with volunteer hours being reported most often at 86% of the implementations reported directly.

Out of the mined data that we collected based on public records, the only programs that had available budget information were the 24 Wiki Loves Monuments events (44% of mined data). However, the Wiki Loves Monuments data we mined did not provide any staff or volunteer hours. The other programs that we mined, which totaled an additional 30 programs, had no budget or hours that were publicly reported on. In conclusion, report data is lacking in each of the six programs at this time. This also means that we were unable to complete any meaningful cost-benefit analysis at this time.

For content production metrics, only a minority of program leaders were able to report on most measures for their program events (i.e. edit counts, characters added, media uploaded, pages created).

The following percentages of program leaders reported each type of data:

  • 63% – photos/media uploaded
  • 39% – edit counts
  • 27% – amount of text added to Wikipedia's article namespace (for most European languages 1 byte = 1 character)

Finally, under half of the respondents (45%) were able to share data about the retention of new or existing editors. 63% of reports contained partial or complete data for budget, hours, and content production.

The team also acknowledges that the timing of its reporting requests was difficult, since many program leaders were busy wrapping up and reporting on Wiki Loves Monuments 2013.

Supplemental data mining

We did an extra bit of mining to make sure we could tell a better story about evaluation.
We had to collect extra data, to fill in some gaps due to the low response rate.

In addition to collecting self-reported program data from program leaders, we worked hard to identify and locate potential sources for program data. Some program leaders provided us with cohort usernames, event dates, and times, which gave us the opportunity to look into their events and fill in certain data gaps. We collected additional data on the following programs:

  • Edit-a-thons – Edit-a-thons were the most frequently self-reported program type. However, many program leaders did not track usernames of participants in order to track their contributions made before, during, and after the event. We pulled additional data on 20 English Wikipedia edit-a-thons, for which public records of participants were available on wiki. These names were used as cohorts to track user activity rates 30 days prior to the event, during the event, and 30 days after the event. This allowed us to examine content production and user retention related to edit-a-thons. We also pulled 30 day prior and after data for two edit-a-thons submitted by program leaders through the survey.
  • Editing workshops – One program leader submitted usernames, event dates, and program details for six workshops. This allowed us to create cohorts for those six workshops and pull data via Wikimetrics. We pulled data on the cohorts to examine the three and six month retention of new users in the cohort list. Some usernames could not be confirmed via Wikimetrics and additional research, but the majority yielded usable data.
  • On-wiki writing contests – Additional data were pulled for six on-wiki writing contests in three different language Wikipedias. This data, which was publicly available on wiki, covered program dates, budget, number of participants, the content that was created or improved, and the quality of the content at the end of the contests. We worked with program leaders, when possible, to confirm volunteer hours and budget. We were unable to judge retention and characters added because we could not isolate contest-specific data.
  • Wiki Loves Monuments – We used data from three directly reported Wiki Loves Monuments as well as data from 24 Wiki Loves Monuments implementations from 2012 and 2013 that had received Wikimedia Foundation grants (including those from the FDC) and had reported a specific budget for the program. This totaled 27 program implementations of Wiki Loves Monuments for two years. We used publicly collectable data regarding those 24 Wiki Loves Monuments to gather information about the number of: participants, photos added, photos used, and photos named as Featured, Quality, or Valued images. This data was pulled using three community built tools: Wiki Loves Monuments tool by emijrp, GLAMorous, and CatScan 2, the latter two created by Magnus Manske. We also contacted program leaders from the 24 Wiki Loves Monuments to review and confirm numbers gathered, and contribute additional data regarding budget and donated resources.
  • Other photo upload initiatives – We gathered data for an additional five program implementations: three other Wiki Loves events, a Wiki Takes event, and the pilot project Festivalsommer 2013. These programs were selected in order to expand the amount of data regarding other photo upload events. Data pulled was based on publicly available information and on direct reports from program leaders. Data collected included the number of participants, photos uploaded, photos used, and photos named as Featured, Quality or Valued Images. The team used the Wiki Loves Public Art tool created by Wikimedia Österreich to pull selected data. For the Festivalsommer project, additional data was acquired through direct interaction with the program organizer.
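The edit-a-thon cohort tracking described in the list above (activity 30 days before, during, and 30 days after an event) can be sketched as follows. This is only an illustration and not the way Wikimetrics works; the function name `count_edits_by_window` and the sample dates are hypothetical.

```python
from datetime import date, timedelta


def count_edits_by_window(edit_dates, event_start, event_end, days=30):
    """Count edits made before, during, and after an event.

    "before" covers the `days` days preceding the event, "after" covers
    the `days` days following it. Edits outside these windows are ignored.
    """
    before_start = event_start - timedelta(days=days)
    after_end = event_end + timedelta(days=days)
    counts = {"before": 0, "during": 0, "after": 0}
    for d in edit_dates:
        if before_start <= d < event_start:
            counts["before"] += 1
        elif event_start <= d <= event_end:
            counts["during"] += 1
        elif event_end < d <= after_end:
            counts["after"] += 1
    return counts


# Hypothetical edit dates for one participant of a one-day event on 1 June 2013.
edits = [
    date(2013, 5, 20),
    date(2013, 6, 1),
    date(2013, 6, 1),
    date(2013, 6, 15),
    date(2013, 8, 1),  # more than 30 days after the event, so not counted
]
counts = count_edits_by_window(edits, date(2013, 6, 1), date(2013, 6, 1))
print(counts)  # {'before': 1, 'during': 2, 'after': 1}
```

Tracking usernames is what makes this kind of comparison possible: without a cohort list there are no edit dates to sort into windows.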

Data and analysis

We had many questions to ask program leaders!
We have a lot of questions, and the data reported has helped us answer them:
  • What do these programs cost in terms of dollars and hours invested, and what other costs may be hidden in donated resources used?
  • What is the reach of these programs in terms of accessing new and existing editors/contributors?
  • How much content do programs produce in terms of bytes, pages, or photos/media added?
  • What are the costs in terms of dollars and hours input per unit of content (text pages or photos/media added) or per participant/recruit (for workshops which produce no content)?
  • To what extent do program outputs increase the quality of Wikimedia projects?
  • To what extent does program participation produce new active editors/contributors, or retain active editors, at the 3- and 6-month retention points?
  • To what extent does the program have examples for easy sharing and replication?
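The cost-per-unit question above comes down to a simple division. Here is a minimal sketch, assuming a program leader has a total budget, total hours, and a count of photos uploaded; all numbers and the function name `cost_per_unit` are hypothetical and do not come from any reported program.

```python
def cost_per_unit(total_input, units):
    """Return the input (dollars or hours) spent per unit of output.

    Returns None when nothing was produced, since a per-unit cost is
    undefined in that case.
    """
    if units == 0:
        return None
    return total_input / units


# Hypothetical photo upload event.
budget_usd = 1200.0
hours_invested = 150
photos_uploaded = 4800

dollars_per_photo = cost_per_unit(budget_usd, photos_uploaded)
hours_per_photo = cost_per_unit(hours_invested, photos_uploaded)
print(dollars_per_photo)  # 0.25
print(hours_per_photo)    # 0.03125
```

The same function works per participant or per recruit for workshops, by passing the number of participants as `units`.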

Priority goal setting

We worked with the community to discover the most commonly seen goals for programs. The June pilot workshop in Budapest served as a way for our team to identify 18 commonly seen outcomes across programs, which were discovered through conversations, breakout sessions, and logic modeling. Participants in the Data Collection Survey were asked to select the outcomes and targets of the programs they reported on. These 18 priority goals are:

  • Building and engaging community
  • Increasing accuracy and/or quality of contributions (e.g. clean high resolution photographs which are placed in the proper articles)
  • Increasing people's awareness of Wikimedia projects
  • Increasing people's buy-in for the free knowledge/open knowledge/culture movements
  • Increasing contributions to the projects
  • Increasing diversity of contributions and content
  • Increasing diversity of contributors
  • Increasing positive perceptions about Wikimedia projects
  • Increasing reader satisfaction
  • Increasing the usefulness, usability, and use of contributions
  • Increasing the use and access to projects
  • Increasing people's editing/contributing skills
  • Increasing volunteer motivation and commitment
  • Increasing respect for the projects (e.g. higher education acceptance)
  • Making contributing fun
  • Making contributing easier
  • Recruiting new editors/contributors
  • Retaining existing editors/contributors

In the survey, program leaders could also write in other goals in a section titled "other". This set of 18, with the "other" option, was presented for each program that program leaders reported on. They were asked to select priority goals for their reported programs. The number of goals program leaders reported ranged from five to 11. The overall mean for any given program was nine selected priority goals. In general, program leaders demonstrated difficulty in "prioritizing": out of all reports, only 12.5% selected five or fewer priority goals.

Inputs and participation

Inputs

The majority of program leaders reported some type of budget, but the majority didn't report data about hours it took to implement their programs.

Regarding inputs, program leaders were asked to report:

  • Budget – how much it cost them to produce their program in US dollars
  • Staff and volunteer hours – How many actual or estimated hours staff and volunteers put into their program from beginning to end
  • Donated resources – Including equipment, prizes, give-aways, meeting space, and other similar things donated by organizations or individuals to support the program

Most program leaders reported budget data, while a larger number did not provide data about hours. Across the 119 report responses:

  • 55% included budget data (22% of budgets reported were zero dollars)
  • 34% included staff hours (51% of staff hours reported were zero hours)
  • 44% included volunteer hours (2% of volunteer hours reported were zero hours)

Participation

The majority of program leaders were able to report on how many and what kinds of participants were involved in their programs. It takes all kinds!
The majority of program leaders could report how many people participated in their program, while a little over half were able to tell us how many new editors made accounts for their programs.

Regarding participation, program leaders were asked to report:

  • Total number of program participants
  • Number of participants that created new user accounts during the program

The majority of program leaders reported the total number of participants (98%), while a little over half (57%) reported the number of new user accounts created during their program.

GLAM content donations had a slightly different reporting request about participation:

  • Total number of GLAM volunteers involved in the program (78% reported)
  • Total number of GLAM staff involved in the program (89% reported)

Program leaders were also asked to provide the dates, and times, if applicable, for their program.

Content production and quality improvement

Content production

Most program leaders were able to tell us how much media was added during their program, but only a minority were able to report how many characters (bytes) were added during their events, let alone how many editors actually edited during their programs.

Regarding content production, program leaders were asked to provide various types of data, about what happened during their program, depending on the level of data they were able to record and track. These data types were:

  • Total number of characters added (33% reported)
  • Average number of characters added (33% reported)
  • Number of participants that added characters (8% reported)
  • Number of photos/media added (80% reported)
  • Number of Wikimedia project pages created or improved (50% reported)

Content production metrics were not requested of those who reported about editing workshops, since content production is not the main goal of that type of program.

11% of program leaders were able to report on how many Featured Images their program produced.

Quality improvement

Most program leaders were able to report how many of the images uploaded during their program were used in the projects after the program ended. However, most were unable to report about the quality of articles and images, and most that did report on it stated that no featured, good, or valued articles or images came out of their event.

The survey also asked that program leaders report on the quality of the content that was produced during the program. They could report:

  • Total number of good articles (38% reported; 51% of those reported no good articles)
  • Total number of featured articles (34% reported; 77% of those reported no featured articles)
  • Use count of photos added that were being used in Wikimedia project pages (63% reported)
  • Number of unique images used on Wikimedia project pages (63% reported; 9% of those reported that none were being used)
  • Number of Quality Images (27% reported)
  • Number of Valued Images (29% reported)
  • Number of Featured Pictures (28% reported)

Those who reported about edit-a-thons and workshops were not asked to report about image use and quality.

Recruitment and retention

Just over half of respondents were able to tell us how many of their participants were active 3 months following their program and less than half were able to do so 6 months after.

Tools like Wikimetrics can make this possible, which means tracking usernames is important to learning about retention. For edit-a-thons and workshops, the majority of those reported on did not retain new editors six months after the event ended. A retained "active" editor was one who had averaged five or more edits a month.[1]

Regarding the recruitment and retention of active editors, program leaders were asked to report two areas of data. An "active editor" is defined as making 5+ edits a month.[2]

  • Total number of contributors still active 3 months after the event (55% reported, 15% reported zero retained)
  • Total number of contributors still active 6 months after the event (45% reported, 19% reported zero retained)
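The "active editor" definition above can be turned into a small check. The sketch below assumes that monthly edit counts are already available for each username in a cohort and uses the "averaged five or more edits a month" reading of the definition; the usernames, the counts, and the function names are hypothetical.

```python
def is_active(monthly_edits, threshold=5):
    """Return True if the editor averaged `threshold` or more edits a month."""
    if not monthly_edits:
        return False
    return sum(monthly_edits) / len(monthly_edits) >= threshold


def retention_rate(cohort, threshold=5):
    """Return the share of a cohort that counts as retained active editors."""
    if not cohort:
        return 0.0
    retained = sum(1 for edits in cohort.values() if is_active(edits, threshold))
    return retained / len(cohort)


# Hypothetical cohort: edits per month for the three months after an event.
cohort = {
    "UserA": [6, 4, 8],   # averages 6 edits a month: active
    "UserB": [0, 0, 1],   # not active
    "UserC": [5, 5, 5],   # averages exactly 5: active
    "UserD": [10, 0, 0],  # averages about 3.3: not active
}
rate = retention_rate(cohort)
print(rate)  # 0.5
```

Under a stricter reading (five or more edits in every month), `UserA` would not count as active, so the choice of definition matters when comparing retention figures.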

If the program reported on was an edit-a-thon or workshop, program participants may have been split into two groups: new editors or existing editors, in order to learn the retention details about each cohort. This is important, since both edit-a-thons and workshops often attract new and experienced editors, unlike on-wiki writing contests that generally target existing contributors, and the Wikipedia Education Program that generally targets new editors.

In terms of recruitment and 6-month retention of new editors (those who made accounts at or for the event):

  • Edit-a-thons (44% reported, 85% of those reported zero retained)
  • Editing workshop programs (56% reported, 78% of those reported zero retained)
  • Wiki Loves Monuments recruitment and retention data was mined using the entire set of uploader usernames for the 2012 events.
We asked different questions about recruitment and retention for Wikipedia Education Program and GLAM content donations.

We asked about the retention of partnerships between educational institutions or cultural organizations, instead of editor retention. Wikipedia Education Program respondents were asked to identify how many instructors were participating in the program, and GLAM content donation respondents were asked if the GLAM they worked with would continue their partnership with Wikimedia and if the content donation would lead to other GLAM partnerships.

We wanted to know if program leaders had or produced materials to hand out, and had the ability to support other program leaders to do the same.

Replication and shared learning

We wanted to learn if program leaders believed their program(s) could be recreated (or replicated) by others. We also wanted to know if program leaders had developed resources such as booklets, handouts, blogs, press coverage, guides, or how-to's regarding their program. We asked if the program:

  • had been run by an experienced program leader who could help others do the same;
  • had brochures and printed materials developed to tell others about it;
  • had blogs or other online information written to tell others about it (published by yourself or others);
  • had a guide or instructions for how to implement a similar project.

Reporting format and goals

The data, discoveries, and conclusions shared here are based on the data points above. This is the first time that this type of reporting has been done about data collection and evaluation in the Wikimedia community. For this reason, we ask that readers assume good faith and interpret the results of this pilot survey with caution. Here is why:

  • Portions of the data used in this report are based on data that was publicly available for us to mine and access. This means there might be more data that only program leaders know of or have access to, and we hope that program leaders will report on that data in the future.
  • Most reports are partial and incomplete, and the reported data show very wide-ranging, non-normal distributions for which standard measures of "average" can be misleading. This makes examining relationships between data points especially difficult. For this reason, we present ranges and averages as median scores, along with the standard mean and standard deviation statistics in parentheses. Where there are major issues with any particular metrics or data points within any program, these are noted in the response rate and data limitations subsection. (See this section to learn more about these terms and their uses.)
  • Many program leaders did not report on the budget that went into producing their programs and hours were not available for reporting on the mined data. Due to this, and the small number of reports in some cases, areas of the report which examine financial inputs to programs may not be representative of all programs throughout the Wikimedia community.
  • The survey asked about a small set of outcomes and goals that the community worked with us to determine were the most common and important priority goals. We know that there are other outcomes and goals that the community has regarding programs. In the future, we hope to expand to support those outcomes and goals, too.

Next steps and recommendations

Major needs

This is only the beginning! In order to continue on the path of supporting program leaders in evaluation and design:
  1. We need more data, and we need better data.
  2. We need more tools to gather more metrics.
  3. We need to look more carefully at some different program design strategies and experiment with different implementation models – let's be bold!
  4. We need to understand the value of volunteer and staff support as a human resource in the areas where programs are being implemented so we can determine the monetary value for inputs, outputs, and outcomes.
Better tools can help program leaders gather more metrics faster and easier to gauge the impact of their programs.

Next steps for evaluation capacity building

  • We need to provide tools and guidance that make it easy for program leaders to regularly report inputs (e.g., budgets and hours) and outputs (participation, content added, usernames).
  • Create some sort of tool/extension for tracking program participation easily (cohort tags).
  • Need to gain ability within Wikimetrics to easily slice cohorts into "new" vs. "existing" editor/contributor groups to examine these participant groups differently.
  • Need to continue to develop tools and strategies for data collection of these as well as other priority goals (i.e., making contributing easier, making contributing fun, increasing awareness of, and support for, Wikimedia projects and open knowledge/open source in general).
  • Need to seek more implementation data, and additional results from replicated programs, so that we can discover promising practices for further research.

Next steps toward better assessment of quality and other common key outcomes, including

  • Increasing skills for editing/contributing;
  • Increasing motivation/intentions to edit/contribute;
  • Increasing diversity and quality of content;
  • Increasing diversity of participants.

Next steps for identifying and experimenting with different program designs and implementation models

  • Targeting participants — who is being recruited and how? Can we apply different strategies for more purposeful recruiting?
  • How does programming in a series vs. as one-off events change effectiveness and impact?
  • What is the added value of new editor/contributor participation?

Next steps toward return on investment analysis capability

  • Improve the accuracy of hours input and budget information (i.e., detail of program level budgets).
  • Identify hourly/salary pay rates for comparable work being done in local areas to determine the value of staff/volunteer time, outputs and outcomes.
  • Determine the monetary value of donated resources (exact values aren't necessarily needed, but it is important to know a general value since donated resources are very important to some programs).
  • Estimate the perceived monetary value of featured/quality articles or images, images used, active editors created or retained, volunteer hours leveraged, etc.
  • Go global: think about how far dollars go in one city/country versus another, and other economic contexts, and determine strategies to compare differences.

Your guide to understanding the numbers, graphs and charts in this report

We understand that there may be some words, terms, and reporting methods unfamiliar to some readers. Feel free to use the talk page to ask further questions for clarification.

The numbers: range and averages

First, let's define some terms that might be new to readers:

  • Mean – This is the number most commonly called the "average" of a group. You find it by adding all the numbers up and dividing the total by how many numbers you added up.
    Example – A program leader reports on six photo upload competitions that happened over a year. They report the number of participants at these six events: 10, 6, 15, 7, 10 and 8. To find the average, you add those numbers up to get 56. You then divide 56 by 6 (the number of events reported on) to get an average of: 9.3.
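  • Median – The median is the middle number in a group of numbers that has been ordered from lowest to highest. When the group has an even number of values, the median is the mean of the two middle numbers.
    Example – For the set of numbers that is: 3, 5, 6, 7, 7, 7, 8, 9, 10, 10, the two middle numbers are 7 and 7, so the median is 7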
  • Mode – The mode is the most commonly seen number in a group of numbers.
    Example – For the set of numbers that is: 3, 5, 6, 7, 7, 7, 8, 9, 10, 10, the mode is 7
  • Range – This is the difference between the lowest and highest numbers reported. You find a range by subtracting the lowest number from the highest number reported.
    Example – Five edit-a-thons have reported user retention numbers of: 3, 6, 7, 9, 12. To find the range, subtract 3 from 12, which equals 9. 9 is the range.
  • Standard deviation – This measures how spread out the numbers are around the mean. Read the Wikipedia article about standard deviation. The smaller the standard deviation, the more closely the responses cluster around the mean, and vice versa. Importantly, when the standard deviation exceeds the mean, the values are extremely spread out and the mean is an unreliable measure of average.
  • Interquartile range – Also known as the middle 50. When a distribution is ordered and grouped into quartiles, the interquartile range is the range of values covered by the second and third quartiles, that is, the middle 50% of values.
    Example – Imagine that you have a distribution of 3, 5, 6, 7, 7, 7, 8, 9, 10, 10, 11, 12. You order the values and split them into four quartiles of data, in this example 3 numbers each: Quartile 1 from 3 to 6, Quartile 2 from 7 to 7, Quartile 3 from 8 to 10, and Quartile 4 from 10 to 12, surrounding a median of 7.5 (the mean of the two middle numbers, 7 and 8, since the number of values is even). The interquartile range is then 7 to 10. This gives a narrower, "average" range of responses, those falling around the median, and a better picture of average without the distortion of high- or low-performing outliers in the distribution.
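The worked examples above can be checked with a short script. This is only a sketch using Python's standard statistics module; the variable names are illustrative, and the choice of the sample (rather than population) standard deviation is an assumption, since the report does not say which was used.

```python
import statistics

# Participants at six photo upload competitions (the mean example above).
participants = [10, 6, 15, 7, 10, 8]
mean_participants = statistics.mean(participants)  # 56 divided by 6

# The ten-number set used in the median and mode examples above.
values = [3, 5, 6, 7, 7, 7, 8, 9, 10, 10]
median_value = statistics.median(values)  # mean of the two middle numbers
mode_value = statistics.mode(values)      # most frequently seen number

# Retention numbers from five edit-a-thons (the range example above).
retention = [3, 6, 7, 9, 12]
value_range = max(retention) - min(retention)

# Sample standard deviation of the participant counts (assumed variant).
sd_participants = statistics.stdev(participants)

print(round(mean_participants, 1), median_value, mode_value, value_range)
print(round(sd_participants, 1))
```

The printed mean, median, mode and range should match the values worked out by hand in the examples above.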
How are they used in this report?

In this report, the range is a fairly obvious measure of how widely the reported numbers are distributed. The average of a group of numbers is less evident, because an average can be represented as a mean, a median or a mode. Averages in this report refer to the median.[3]
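As note 3 explains, this report gives the median as the average, followed by a rounded mean and standard deviation in parentheses. A minimal sketch of that convention follows; the function name, the exact output format, the sample standard deviation, and the data are all hypothetical and chosen only for illustration.

```python
import statistics

def report_average(values):
    """Format an average as: median (mean = ..., SD = ...).

    The mean and sample standard deviation are rounded to whole
    numbers. The output layout is an assumption, not the report's
    exact format.
    """
    median = statistics.median(values)
    mean = round(statistics.mean(values))
    sd = round(statistics.stdev(values))
    return f"{median:g} (mean = {mean}, SD = {sd})"

# Hypothetical participant counts with one very large event.
hypothetical_counts = [2, 3, 4, 5, 60]
print(report_average(hypothetical_counts))
```

In this made-up data set the single large event pulls the mean well above the median and makes the standard deviation larger than the mean, which is the situation in which the median is the more trustworthy average.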

The graphs

The report uses a variety of graphs to depict data:

Bar graphs
Bar graphs show the extent to which a metric value is observed, or reported, within a set of data. Bar graphs allow for easy comparison of a metric over time or across groups and other categories.
Pie charts
Pie charts show how a whole divides into categories: the size of each "slice" is proportional to the number or share of responses in that category.
Here's an example of a pie chart. This one shows the average length of an edit-a-thon, as reported in 2013.
Scatter plots
Scatter plots are charts that graph the points of intersection of two variables, or points of data reported.
To do this, each variable is assigned to one axis of a grid: one variable is measured along the x-axis and the other along the y-axis.
Each report in a set of data is then plotted as a separate point at the place where its values for the two variables intersect, or meet.
Bubble chart
Bubble charts are more advanced than scatter plots. They show three variables at once by using the size of each bubble as a third dimension.
Just like scatter plots, bubble charts use the x- and y-axes to show the intersection of two different variables for all the pairings of those variables that were reported.
Additionally, the size of each "bubble" shows the value of a third variable for that report.
In this report, the bubbles are labeled with the variable that is represented by their bubble size, and the x- and y-axes have their own values noted on their appropriate scales. Tip – Make sure to always read the axis labels and bubble size legend included with each chart.
Box plot
A box plot is a way of depicting groups of numeric data using their quartiles.
The vertical lines running below and above the box (also called tails or whiskers) depict the full range of values reported for a variable, while the shaded box illustrates the 25% of reported values that fall just below the median (Quartile 2) and the 25% of reported values that fall just above the median (Quartile 3).
That is, the box of the box plot depicts the interquartile range and includes the middle 50% of all reported values, the ones that fall closest to the "average."
The length of the vertical line beyond the box itself shows how far the remaining values extend. Looking at the box, the range of most responses around the median, gives a better picture of average without the distortion of high- or low-performing outliers that exist within the distribution.
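The numbers behind a box plot can be sketched as follows, using the same "split the ordered values into four equal groups" approach as the interquartile range example earlier in this guide. The function name and the returned keys are hypothetical, and the sketch assumes the number of values is a multiple of four. Statistical software usually interpolates quartile boundaries instead, so its results may differ slightly from this simple grouping.

```python
import statistics

def box_plot_summary(values):
    """Return the whisker ends, box ends, and median for a box plot.

    Simplification: the ordered values are split into four equal
    groups, so the number of values must be a multiple of four.
    """
    ordered = sorted(values)
    if not ordered or len(ordered) % 4 != 0:
        raise ValueError("this sketch needs a non-empty multiple of four values")
    size = len(ordered) // 4
    groups = [ordered[i * size:(i + 1) * size] for i in range(4)]
    return {
        "whisker_low": ordered[0],      # lowest reported value
        "box_low": groups[1][0],        # start of Quartile 2
        "median": statistics.median(ordered),
        "box_high": groups[2][-1],      # end of Quartile 3
        "whisker_high": ordered[-1],    # highest reported value
    }

summary = box_plot_summary([3, 5, 6, 7, 7, 7, 8, 9, 10, 10, 11, 12])
print(summary)
```

For the twelve-number example, the box runs from 7 to 10 around a median of 7.5, and the whiskers reach down to 3 and up to 12.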

Notes

  1. This was measured over three-month windows, the first and second quarter years following the program, to allow for minor month-to-month fluctuations. It was often observed that a user name would surpass the threshold in two of the three months and just miss it in the other month, not necessarily in chronological order. For example, an editor making 8 edits in Month 1, 10 edits in Month 2, 4 edits in Month 3, 8 edits in Month 4, 4 edits in Month 5, and 6 edits in Month 6 would be counted as a retained "active" editor at both 3- and 6-month follow-up, rather than inactive at 3 months and active at 6 months, as would be the case if the measure were based on a strict single-month comparison. It will be important to examine this approach further and to design reporting that also captures editors who start off highly active and drop off more rapidly than this quarterly perspective reveals.
  2. mw:Special:MyLanguage/Analytics/Metric definitions#Active editor
  3. Often in reporting, you may see only means reported or depicted. Reports of mean scores are usually accompanied by the standard deviation, which indicates the distribution of values, assuming a normal distribution. Although means and standard deviations are also shared in this report, the data do not meet the assumptions of normality, and mean scores are not the best measure of "average" in most cases. For this reason, averages reported refer to the median response. The means and standard deviations (SD) reported in the parentheses that follow each reported median are there for the purpose of triangulation of averages and to demonstrate the often extreme range of reported values. Means and standard deviations have also been rounded because they are, for the most part, not a meaningful estimate of average, and specificity down to the tenth is extraneous. Modes would also be reported; however, in nearly all cases there were no valid mode values because the range of response values exceeded the number of responses. Where there were valid modes, they are reported.
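The quarterly retention measure described in note 1 can be sketched in code. The note does not spell out the exact rule, so this sketch makes two loudly stated assumptions: an editor counts as active in a month with 5 or more edits, and an editor counts as retained in a quarter when at least two of its three months meet that threshold. All names below are hypothetical.

```python
ACTIVE_THRESHOLD = 5  # assumed: edits per month needed to count as active
MIN_ACTIVE_MONTHS = 2  # assumed: months per quarter that must meet it

def active_in_quarter(monthly_edits):
    """True if enough months in a three-month window meet the threshold."""
    active_months = sum(1 for edits in monthly_edits if edits >= ACTIVE_THRESHOLD)
    return active_months >= MIN_ACTIVE_MONTHS

def quarterly_retention(six_months):
    """Retention at 3- and 6-month follow-up using quarterly windows."""
    return {
        "3_month": active_in_quarter(six_months[:3]),
        "6_month": active_in_quarter(six_months[3:6]),
    }

def single_month_retention(six_months):
    """Strict comparison: only Month 3 and Month 6 are checked."""
    return {
        "3_month": six_months[2] >= ACTIVE_THRESHOLD,
        "6_month": six_months[5] >= ACTIVE_THRESHOLD,
    }

# Monthly edit counts from the example in note 1.
edits = [8, 10, 4, 8, 4, 6]
print(quarterly_retention(edits))
print(single_month_retention(edits))
```

Under these assumptions the editor from note 1 is counted as retained at both follow-up points with the quarterly windows, but as inactive at 3 months and active at 6 months with the strict single-month check.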


This report series was produced by the WMF Program Evaluation and Design team

The team responsible for producing this report is the Wikimedia Foundation's Program Evaluation and Design team, which includes Frank Schulenburg, Senior Director of Programs; Dr. Jaime Anstee, Program Evaluation Specialist; interns Edward Galvez and Yuan Li; and Sarah Stierch, former Community Coordinator.