As Wikimedia program leaders and evaluators work together toward more systematic measurement and evaluation strategies, we face a common challenge: obtaining both systematic quantitative data and qualitative information. With both, we can establish successful practices in outreach programs and understand the depth and variety of programs across our vast Wikimedia landscape. To meet this challenge, we are working to combine efforts and knowledge for two key purposes:
- To generalize program knowledge and design patterns of Wikimedia programs, and
- To deepen our understanding of Wikimedia programs and projects and how they impact our communities.
Often, program leaders and evaluators ask whether quantitative methods and measures are preferred over qualitative ones, or whether qualitative outcomes can be given value in the same way as quantitative outcomes.
A good evaluation requires numbers and stories—there is no meaning of one without the other. How those stories and numbers are collected can, and will, vary greatly, as each program leader seeks to understand their project or program story. Methods and measures for program evaluation should be designed to track how your project or program is doing, what it intends to do, and how well the expected outcomes are reached. Whether those methods and measures are qualitative or quantitative will vary depending on your interests, your measurement point, and your evaluation resources, but should, no matter what, be useful in telling the story of what you did and whether or not your work was a success.
What about the “Quantitative vs. Qualitative” debate?
The divide between quantitative and qualitative is a debate that is theoretical in origin, one between positivism and idealism. The former holds assumptions and beliefs in a single, unchanging truth; the latter holds to constructivism, the belief that truth is socially constructed and is a collection of varied interpretations.
It is a phenomenological divide between physical and non-physical phenomena: between one system of thought that accepts only phenomena which can be systematically sensed through basic human sensory perception (e.g., sneezing, blinking) and/or scientific instrumentation (e.g., temperature, heart rate, brain activity), and another which contemplates phenomena which cannot be sensed through basic sensory perception but are still thought to exist (e.g., love, fear, happiness).
Often, it is also a methodological divide between deduction and induction. Quantitative approaches are used to test and systematically measure the occurrence of a phenomenon, and qualitative methods are used to explore and deepen understanding of a phenomenon.
Although quantitative versus qualitative is a debate at the heart of the philosophy of science, today’s research more often includes both quantitative and qualitative components, using mixed methods and measures, in order to seek deeper understanding of many complex social phenomena which cannot be seen.
Most often, through triangulation of quantitative and qualitative measures, today’s social researchers define approximate measures, or “proxies,” for homing in on and measuring a phenomenon of interest. Together, triangulated qualitative and quantitative measures can tell a better story of outcomes than either can alone. Take, for instance, the phenomenon of volunteer editing behavior:
[Figure: quantitative measures ("edit count", "bytes added/removed", "page views") and qualitative measures ("article subjects", "categories", "quality" ratings) combine to tell A BETTER STORY.]
Evaluating in a mixed-methods world
In practice, quantitative and qualitative tend to be two sides of the measurement coin: related and nearly inseparable. While triangulation and mixed methods are often thought to be disconnected from the background theoretical debate and dissociated from the root philosophical challenge, much of the positivist paradigm, and its belief in a reality of one truth, persists. When we do try to draw out the distinction, it goes something like:
All quantitative measures are based on qualitative judgments
In quantitative measures the judgment just takes place in advance, in anticipation of possible responses, and often with direction for numeric assignment, rather than post-hoc after the data are collected. Numbers do not mean anything without assigning a description. Whether it is a question about physical count data, or about an attitude, we must create the meaning of numbers in measurement.
- With count data, such as «edit count» or «bytes added», we can end up with a story piece that tells us precisely how many times our student cohort hit the save button (i.e., edit count) and how much content was added (i.e., bytes added) but says nothing about the qualities of those contributions or how participants worked to make them.
Even so, some may opt for such metrics, consider them a more rigorous assessment of editing behavior, and choose to measure respondents' edit count directly through the WMFLabs tool Wikimetrics, for instance. Still, what «counts» as an edit is qualitatively defined in the tool's parameters: each time an editor hits the «save» button counts, provided that the edit is «productive», meaning that it is not reverted or deleted. Further, the metric «edit count» does not directly correlate with the amount of text or other data contributed by an edit, or with the time devoted to editing before that save button is hit.
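To make the point about qualitative definitions concrete, here is a minimal sketch of how the same editing history yields different counts depending on what we decide «counts» as an edit. This is not Wikimetrics code: the `Edit` record, its fields, the function names, and the edit history are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    """One press of the save button (hypothetical record, not a Wikimetrics type)."""
    bytes_changed: int      # positive = content added, negative = content removed
    reverted: bool = False  # the edit was later undone
    deleted: bool = False   # the page holding the edit was later deleted


def save_count(edits):
    """Count every save, whatever happened to the edit afterwards."""
    return len(edits)


def productive_edit_count(edits):
    """Count only saves that were neither reverted nor deleted."""
    return sum(1 for e in edits if not (e.reverted or e.deleted))


def bytes_added(edits):
    """Sum the bytes added by productive edits, ignoring removals."""
    return sum(
        e.bytes_changed
        for e in edits
        if e.bytes_changed > 0 and not (e.reverted or e.deleted)
    )


# Invented edit history for one student.
history = [
    Edit(1200),
    Edit(15),
    Edit(-40),
    Edit(300, reverted=True),
    Edit(500, deleted=True),
]

print(save_count(history))             # 5
print(productive_edit_count(history))  # 3
print(bytes_added(history))            # 1215
```

The student pressed save five times, yet the «productive» definition reports three edits. Neither number is wrong; each reflects a judgment made in advance about what the count should mean.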
However, the metrics are very useful for telling a piece of a program story, for instance:
To get a deeper understanding of what took place, we may want to ask participants to self-report their experience editing any Wikimedia project to see if the editing behavior measured on the target project (Wikipedia) tells the whole story of editing behavior.
In presenting a self-report question to learn how often a student edited during a course, we have several choices and routes as well:
- We may ask students to respond with a direct average estimate, “On average, how often did you edit WP during your time in this course?”, and leave it open-ended. This could lead to some giving a number for daily, weekly, or monthly editing sessions, and to a lot of post-coding of the data into a consistent interval scale.
- We could specify the interval, “On average, how often did you edit WP each week during your time in this course?”, and also leave it open-ended, reducing the variability in responses, mostly controlling the response interval, and requiring only minimal post-coding to obtain the full range of responses.
- If we already knew the range of responses might be limited, or if we had a target performance level, we could further control responding and reduce variability, by defining count data in ordinal response categories which are assigned a response numeral (1) through (7). Each of these methods will produce slightly different results and lead to different steps and burden in data cleaning and analysis, depending on how we attach meaning.
How often did you edit Wikipedia, or other Wikimedia projects, during your time in this course?
(1) Less than once a month
(2) 1-3 times a month
(3) 4-5 times a month (about once a week)
(4) 2-3 times a week
(5) 4-5 times a week
(6) 6-7 times a week (about once a day)
(7) More than once a day
Table: Defining count data in ordinal response categories that are assigned a response numeral (1) through (7).
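The difference in data-cleaning burden between the open-ended and closed-ended routes can be sketched in a few lines. The responses below are invented, the function names are hypothetical, and the month-to-week conversion (52 / 12 weeks per month) is an assumption we make for illustration; a real survey would need its own coding rules.

```python
from collections import Counter
from statistics import median

# Weeks per reporting unit. The month value (52 / 12) is an assumed approximation.
WEEKS_PER_UNIT = {"day": 1 / 7, "week": 1.0, "month": 52 / 12}

# Response numerals (1) through (7) from the closed-ended question.
RESPONSE_SCALE = {
    "Less than once a month": 1,
    "1-3 times a month": 2,
    "4-5 times a month (about once a week)": 3,
    "2-3 times a week": 4,
    "4-5 times a week": 5,
    "6-7 times a week (about once a day)": 6,
    "More than once a day": 7,
}


def to_weekly_rate(count, unit):
    """Post-code an open-ended answer, e.g. (3, "day"), into sessions per week."""
    return count / WEEKS_PER_UNIT[unit]


def code_responses(labels):
    """Replace each chosen response label with its numeral."""
    return [RESPONSE_SCALE[label] for label in labels]


# Invented open-ended answers: (number of editing sessions, per unit of time).
open_ended = [(3, "day"), (2, "week"), (6, "month")]
weekly_rates = [round(to_weekly_rate(n, unit), 2) for n, unit in open_ended]

# Invented closed-ended answers from five students.
closed_ended = [
    "2-3 times a week",
    "1-3 times a month",
    "2-3 times a week",
    "More than once a day",
    "4-5 times a month (about once a week)",
]
codes = code_responses(closed_ended)
distribution = Counter(codes)   # how many students chose each numeral
median_code = median(codes)     # middle response category

print(weekly_rates)   # [21.0, 2.0, 1.38]
print(codes)          # [4, 2, 4, 7, 3]
print(median_code)    # 4
```

The sketch summarizes the closed-ended codes with a median rather than a mean because the categories are ordered but not evenly spaced; treating the numerals as an interval scale would be a further assumption about how we attach meaning to them.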
By assigning numeric meaning to quantify behavior we end up with a different story piece, for instance:
On the other hand, we might have a qualitative outcome of interest such as a feeling or attitude about editing. While behaviors are observable and we can easily define and count them, feelings and attitudes cannot so easily be tracked. Similar to the quantitative categorical scale shared above, we can assign numbers to mean different levels of applicability of a sensed state of being which cannot be otherwise observed.
From this, we could end up with another small story piece, for instance:
While we can also observe editing behavior directly to assess whether a person is able to edit by checking how much they edited, we would be making an assumption that performance is equal to experienced preparedness in order to tie the measure back to feeling prepared. Instead, it makes more sense to ask for qualitative experience data. Depending on the measurement point, one may make more sense than the other.
Ideally, the best story would contain multiple descriptive parts: the quantitative observation of online editing behavior, the qualitative description of editors' reported preparedness, and the descriptive information about what that editing behavior worked to improve. Then we could say something more like:
All qualitative measures can be coded and analyzed quantitatively
Conversely, any simple qualitative coding table can be easily converted to quantitative numeric coding and analyzed quantitatively to reveal how qualitative themes relate to one another or how similar or dissimilar interviewee responses were to one another.
As seen in the qualitative and quantitative data tables below, a simple qualitative coding table can be converted to quantitative data through simple binary coding, «0» for not observed and «1» for observed, or into count data using the sum of codable observations.
Once converted, we can run basic correlation analyses to examine how qualitative themes relate to one another or how similar or dissimilar interviewee responses were to one another.
For instance, here we see that all interviewees made mention of their motivation, but were less likely to touch on the learning support theme. Those who discussed their skills did not discuss learning support, and vice versa. Further, there was a 100% positive correlation between interviewees 2 and 4, as well as between interviewees 1 and 5, and a 100% negative correlation between interviewees 2 and 3.
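The conversion described above can be sketched as follows. The coding table here is invented and does not reproduce the data tables from this post; the theme names, the `pearson` helper, and the variable names are hypothetical. The sketch uses binary coding («1» for observed, «0» for not observed) and a plain Pearson correlation.

```python
from math import sqrt

THEMES = ["motivation", "skills", "learning_support"]

# Invented binary coding table: 1 = theme observed in the interview, 0 = not observed.
coded = {
    "interviewee_1": [1, 1, 0],
    "interviewee_2": [1, 0, 1],
    "interviewee_3": [1, 1, 0],
    "interviewee_4": [1, 0, 1],
    "interviewee_5": [1, 1, 0],
}


def pearson(x, y):
    """Pearson correlation, or None when either variable has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    ss_x = sum(d * d for d in dev_x)
    ss_y = sum(d * d for d in dev_y)
    if ss_x == 0 or ss_y == 0:
        return None
    return sum(a * b for a, b in zip(dev_x, dev_y)) / sqrt(ss_x * ss_y)


def theme_column(theme):
    """Collect one theme's codes across all interviewees."""
    i = THEMES.index(theme)
    return [row[i] for row in coded.values()]


# Count data: the number of interviewees in whom each theme was observed.
theme_totals = {t: sum(theme_column(t)) for t in THEMES}

# How themes relate to one another.
skills_vs_support = pearson(theme_column("skills"), theme_column("learning_support"))
motivation_vs_skills = pearson(theme_column("motivation"), theme_column("skills"))

# How similar interviewees' responses were to one another.
similarity_2_4 = pearson(coded["interviewee_2"], coded["interviewee_4"])
similarity_1_2 = pearson(coded["interviewee_1"], coded["interviewee_2"])

print(theme_totals)
print(skills_vs_support, motivation_vs_skills, similarity_2_4, similarity_1_2)
```

In this invented table, skills and learning support never co-occur, so their correlation is at (or within rounding of) -1, and interviewees 2 and 4 gave identical patterns, so theirs is at +1. One caveat worth noting: a theme mentioned by everyone, like motivation here, has no variance, so its correlation with other themes is undefined and the helper returns `None` rather than a number.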
In the end, there is a difference between qualitative and quantitative methods and measures and what they produce in terms of knowing. For the most part, qualitative methods help us to understand more deeply, while quantitative methods allow us to systematically check that understanding across contexts. However, the divide is not so great: they are more dimensions than dichotomy, and more friends than enemies. In Wikimedia, we know context matters. In the work we do, it is likely best for all program evaluations to consider triangulation of quantitative and qualitative measures and the use of a mixed-methods approach as we explore program implementation in each new context together.
Program Leader Next Steps:
Trying to choose the best measures for your Wikimedia project or program?
Check out Measures for Evaluation, a helpful matrix of common outcomes by program goal that we use to map measures and tools.
Have you been collecting data to evaluate a Wikimedia program?
Round II Voluntary Reporting is open - We are looking for data. Last week, the Program Evaluation and Design team initiated the foundation’s second round of voluntary programs reporting. We invite all program leaders and evaluators to participate in, yet again, the most epic data collection and analysis of Wikimedia programs we've done so far. This year we will examine more than ten different programs:
- Editing Workshops
- On-wiki writing contests
- Wikipedia Education Program
- Wikimedian in Residence
- GLAM content donations
- Wiki Loves Monuments
- Wiki Loves Earth, Wiki Takes, Wiki Expeditions, and other photo upload events
Did you lead or evaluate any of these programs September 2013 through August 2014? If so, we need your data! For the full announcement visit our portal news pages.
Reporting is voluntary, but the more people do it, the better our representation of programs will be, helping us understand the depth and impact of programs across different contexts. This voluntary reporting allows us to come together and generate a bird’s eye view of programs so that we can examine further what works best to meet our shared goals for Wikimedia and, together, grow the AWESOME in Wikimedia programs!
- Case in point: within Wikimetrics, this will be changing to give users a count that also includes edits made to pages that have since been deleted.
- Of course, in addition to this assignment of specific meaning to numbers in responses, we also make assumptions about a respondent's understanding of the terms «edit», «Wikipedia», «Wikimedia projects», and «course». We assume that the participant reads and understands both the question and the response options, and that they respond meaningfully and accurately.
- In such a case, we again make an assumption that respondents consistently understand the terms «edit» and «Wikipedia» and the response labels, and that they understand the intended interval nature of the response scale. Alternatively, we can try to observe the editing behavior directly to assess whether a person is able to edit; however, we would be making an assumption that performance is equal to experienced preparedness in order to tie the measure back to feeling prepared.
Further Online Reading
- Ridenour, C. S., and Newman, I. (2008). Mixed methods research: Exploring the interactive continuum. Southern Illinois University Press: Carbondale, Ill. USA. Available online.
- Newman, I., and Benz, C. R. (2006). Qualitative-quantitative research methodology: Exploring the interactive continuum. Southern Illinois University Press: Carbondale, Ill. USA. Available online.
- Sale, J. E. M., Lohfeld, L. H., and Brazil, K. (2002). Revisiting the Quantitative-Qualitative Debate: Implications for Mixed-Methods Research. Quality & Quantity, 36: 43–53. Available online.
- Trochim, W. M. K. (2006). Research Methods Knowledge Base: The Qualitative Debate. Retrieved online.