Talk:Learning and Evaluation/Evaluation reports/2013/WLM

From Meta, a Wikimedia project coordination wiki

Photo quality ratings

« Out of 126,424 photos uploaded in for Wiki Loves Monuments 2012, 932 were rated as Quality Images. »

365,182 photos were uploaded as part of Wiki Loves Monuments 2012, and among those 2,007 are rated as quality images.

I understand the difference in figures comes from the choices made in data collection. The introduction states that the data come from 27 WLM implementations for 2012 and 2013, 24 of them mined from publicly available data; but as far as I can see there is no way to know how many of these 24 are from 2012 (reminder: there were 34/35 WLM implementations in 2012). And the final formulation (very definitive, as quoted above) is highly misleading.

Jean-Fred (talk) 12:00, 25 January 2014 (UTC)


Hello Jean-Fred, The data in this report are specific to the 27 contests for which we collected inputs (12/35 from 2012 and 15/51 from 2013). We refer to these counts throughout this report, and I have now altered the text to add back some specificity as a reminder of our limited examination (a redundancy that had been reduced to prevent a flood of TL;DR comments). If you feel we need to do more here, please share your suggestion. Nearly all data refer only to the 27 events, 12 from 2012 and 15 from 2013, for which we had specific input data. The only data that refer to the totals from WLM are for recruitment and retention: as we could not obtain the usernames by specific event, we analyzed all 2012 contributors.
As for how many were self-reported vs. mined for each year, this is a level of specificity that seemed unnecessary to include in the report text, as the self-reports were minimal and did not appear especially unique in their details. Only three program leaders reported directly (two reported on 2012 events and one on their 2013 event).
Importantly - the contests we were able to examine demonstrated the following compared to the total 2012 set of uploads:
                      Report Count   Report Percent   Total Count   Total Percent
Total Images                126424                         365182
Unique Images Used           21639          17.116%         63639         17.427%
Quality Images                 932           0.737%          2007          0.550%
Valued Images                   30           0.024%            72          0.020%
Featured Pictures               32           0.025%            69          0.019%
As outlined in the table, the Unique Images Used, Valued Image, and Featured Picture rates are quite comparable, while the Quality Image rate differs slightly between our reported sample and the total population. As we note in the section on ratings, this is driven by a couple of contests that seem to be outliers in advancing image ratings.
I hope this comparative data table helps to couch the results a little better. JAnstee (WMF) (talk) 18:39, 25 January 2014 (UTC)
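For anyone who wants to re-check the percentages in the table above, here is a minimal sketch. The counts are copied from the table; the dictionary layout and the names `report`, `overall`, and `rates` are only illustrative and are not part of the report's own tooling.

```python
# Counts copied from the comparison table above.
# The dict keys and variable names are illustrative assumptions.
report = {"total": 126424, "used": 21639, "quality": 932, "valued": 30, "featured": 32}
overall = {"total": 365182, "used": 63639, "quality": 2007, "valued": 72, "featured": 69}


def rates(counts):
    """Return each category as a percentage of the total image count."""
    total = counts["total"]
    return {
        key: round(100 * value / total, 3)
        for key, value in counts.items()
        if key != "total"
    }


report_rates = rates(report)
overall_rates = rates(overall)

# Print the sampled contests next to the full 2012 upload set.
for key in report_rates:
    print(f"{key:9s} {report_rates[key]:7.3f}%  {overall_rates[key]:7.3f}%")
```

Putting the two columns side by side this way makes it easy to see that only the Quality Image rate moves noticeably between the sample and the full set.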

How did we get usage and rating numbers? (A question from mailing list)

For image use counts you can use http://tools.wmflabs.org/glamtools/glamorous.php to look up usage for any Commons category name. For quality ratings and image use you can use https://tools.wmflabs.org/catscan2/catscan2.php to look for template inclusion on the file pages. Here is a link to a learning pattern we posted about it: https://meta.wikimedia.org/wiki/Grants:Learning_patterns/Counting_Featured_pictures,_Quality_and_Valued_images_in_Wikimedia_Commons

General observations

This evaluation of outreach programs is incredibly important, but since it is in its first year, we should all recognize that it's not a perfect measure of what's happening with outreach in general or WLM in specific. We should probably have constructive criticism for the evaluation system itself, as well as for WLM. My attempt follows.

  • Folks from WLM looking at this are probably pretty disappointed: really hard numbers are difficult to come by, and they don't all point to "WLM is an outstanding success." If they did, we wouldn't have any idea of how to improve WLM. In some cases, I think, program specifics need to be mentioned. In other cases, I think the WLM mailing list has a pretty optimistic tone (as it should), coming across as "we are the largest photo contest in the world" and "we can do it!" (We are, and we did!). But that makes coming here a bit shocking. In some cases, the evaluators just need to make sure to underline the good points as well as the bad (and in some cases they did). I believe it is fair to say that WLM has received, by far, the best evaluation of all the programs mentioned, though it's easy to miss where the evaluation says that if you just read through it quickly.

Looking at the key findings 1-by-1:

  • The more participants, the more photos uploaded; however, the more money spent to implement a Wiki Loves Monuments event doesn't equal higher participant counts or more photo uploads.
    • Different countries are coming into this with different histories and backgrounds. E.g. in the US we've had a year-round project similar to WLM (WP:NRHP) going for about 7 years now, and going ahead with WLM only involved a few changes then full speed ahead. I think for other large western countries, the situation was similar. On the other hand, smaller countries, the global south and places without easily available site listings had a very difficult task. The solution could very well be to put money into the smaller countries, even though their quantitative results would almost inevitably be lower.
  • A very small percentage of images (0.08%) from Wiki Loves Monuments 2012 are rated as Quality, Valued and/or Featured pictures
    • Here many countries just ignore these ratings - just another set of hoops to jump through. Better to use the time to get more photos!
    • What are the percentages for Commons as a whole?
  • About 17% of the images uploaded through Wiki Loves Monuments 2012 are in use on our projects.
    • What's the average for all photos on Commons?
    • In many cases a photographer might take and upload 5 pix of a site, but only one will be used in a table, especially if there are no articles yet on the sites.
    • Photos are often used indirectly, most commonly via a Commonscat link to a category on Commons.
  • The majority of Wiki Loves Monuments participants are new users; however, the survival rate of new users is low (1.7% of the 2012 participants made at least one edit and 1.4% uploaded at least one new file to Commons six months after the event).
    • I'd think these are remarkably HIGH compared to other outreach programs, or to the overall experience of Wikimedia projects. I think nobody really knows how or why retention rates go over 1%. Perhaps a study is needed here on the overall determinants and experience of retention rates in different circumstances.
    • That said, I think retention rates for photographers should be higher than for article writers. It can be easier to snap a photo than write an article.
  • Half of the existing editors who participated in Wiki Loves Monuments 2012 also participated in Wiki Loves Monuments 2013, with new users making up for the other half.
    • It's not clear to me whether editors new in 2012 were considered experienced in 2013 (Only clarification needed)
  • The global Wiki Loves Monuments organizing team helps support Wiki Loves Monuments organizers around the world, providing replication opportunities
    • This can be incredibly important according to our goal to increase participation in the global south
  • It's obvious that Wiki Loves Monuments is successful at increasing the number of freely licensed images of historic monuments, but is it successful at educating participants about open knowledge and free licensing?
    • Again it would be nice to have a comparison with other outreach programs

Just my 2 cents worth, hope it helps.

Smallbones (talk) 19:43, 25 January 2014 (UTC)

Hello! Thanks for your constructive commentary. To answer your question: yes, a returnee in 2013 from 2012 would be considered an existing user; however, we have not analyzed that far. For now, a new user is defined as someone who made no contributions in the year prior (so if someone had participated in WLM 2011 and not again until the event in 2012, they would still have been counted as an existing user).
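As a rough illustration of the definition above (a new user is someone with no contributions in the year before the event), here is a minimal sketch. The function name `is_new_user`, the 365-day lookback, and the example dates are assumptions made for illustration; the report's actual computation may differ in its details.

```python
from datetime import date, timedelta


def is_new_user(prior_contribution_dates, event_start, lookback_days=365):
    """Return True if no contribution falls in the lookback window before the event.

    The 365-day window is an assumed reading of "the year prior".
    """
    window_start = event_start - timedelta(days=lookback_days)
    return not any(window_start <= d < event_start for d in prior_contribution_dates)


# Hypothetical event start date, used only for this example.
wlm_2012_start = date(2012, 9, 1)

# Someone whose only earlier activity was during WLM 2011 still counts as existing.
returning = is_new_user([date(2011, 9, 15)], wlm_2012_start)

# Someone with no activity in the prior year counts as new.
lapsed = is_new_user([date(2010, 9, 15)], wlm_2012_start)
```

Under these assumptions `returning` comes out as an existing user and `lapsed` as a new one, which matches the WLM 2011 example in the comment above.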
As for your comments, I think these are all very important points. We need to remain cautious in our interpretations and conclusions, knowing that these are only initial reporting points. These reports will need to expand based on more systematic monitoring and reporting and on program leader capacity to evaluate outcomes related to educational and movement-related goals. Importantly, the rating of photos is clearly one of those areas in which people engage very differently. I think it will be important to examine what efforts are needed to make such a thing happen en masse and whether those efforts are worth the obvious benefit of monitoring for quality, but maybe also beyond assessment in terms of possibly motivating participants, highlighting quality content to increase its use (directly in articles or as examples of quality for new contributors), or increased content promotion via cross-listings that may result.
Lots of ideas will be generated here, I imagine. Thanks for sharing yours here. JAnstee (WMF) (talk) 20:21, 25 January 2014 (UTC)

Key messages

Smallbones said in the section above "Folks from WLM looking at this are probably pretty disappointed". I think they shouldn't be. Let me clarify this by restating what I think are some of the key messages of this report:

  1. Wiki Loves Monuments is a success story. It encourages people to upload tons of highly valuable photos of historical monuments to Commons. When it comes to the sheer amount of freely available content that got added to Commons over the years, the program's success is amazing.
  2. The organizers of the event should not be blamed for the low retention rates. Yes, having all these new people stick around would be great. But without substantial changes to Commons' interface and a more friendly attitude of some parts of the community towards new users, it's hard to imagine that retention rates will be significantly higher in the future.
  3. The effectiveness of the global organizing team in setting the stage for new countries to join the contest is second to none. The global team has not only invented a model that works, it has also been very good at providing people around the world with the necessary support and tools to make that model work in their countries.

I am wondering how we can do better in future reports to highlight key messages around the successes. Any suggestions? --Frank Schulenburg (Wikimedia Foundation) (talk) 23:28, 29 January 2014 (UTC)

“a more friendly attitude of some parts of the community towards new users” − could you please both elaborate on the exact meaning of this, and detail your backing data? Is it an opinion based on empirical observations, or is there any study out there on the survival of newcomers on Wikimedia Commons (as far as I know, all such studies have been conducted on the (English) Wikipedia)? Thanks, Jean-Fred (talk) 01:37, 30 January 2014 (UTC)
Jean-Fred, you're right I should have been more specific about what I mean. I was referring mostly to the fact that warning templates can be very off-putting to new users and have significant impacts on whether those people stay or not. We know this from research that has been done on the English Wikipedia and I actually think that the effect on new Commons contributors won't be much different. We also know that the first communication between new users and the existing community is essential. And I think that a lot can be done to make this first interaction more pleasant for beginners. With that said, I'd be happy to help with thinking about how this problem could be solved (in my capacity as a volunteer). You know that I'm very dedicated to the success of Wikimedia Commons and I'm very eager to improve things. Best, --Frank Schulenburg (Wikimedia Foundation) (talk) 18:58, 30 January 2014 (UTC)
Hey Jean-Fred, it's me again. After reading my remarks again I think I made a mistake in framing it as attitude of the community. I think what's happening is that some people on Commons who leave those warning templates are unintentionally rude. I didn't mean to imply that people are intentionally rude with newcomers. So, "attitude of the community" is the wrong framing and I apologize if I hurt you or anyone else with what I was saying. --Frank Schulenburg (Wikimedia Foundation) (talk) 19:43, 30 January 2014 (UTC)
Retention - maybe one is enough!
Hey Frank!
Good to hear from you. The Program Evaluations are making a good start. Six months ago, I don't think there was anything like this. Did anybody have a clue about which programs were succeeding? Now you've got a methodology started along with experience and feedback. There will certainly be improvements coming in the future!
I didn't mean to jump on you or the program evaluation - though you guys need some evaluation too. There's lots in the evaluation, including WLM successes. Presentation to a diverse audience can be a challenge, especially when many in the audience are volunteers who put a lot of effort into a project. You should remember that any balance sheet has both assets and liabilities. I've been thinking about proposing something like "What does success look like?" for a couple of WMF projects. The WMF/Wikipedia/Commons has a huge number of successes. When it comes to outreach programs it would be nice to be able to go somewhere and see what those successes are and maybe how to emulate them. Maybe some of this can be a bit informal, personal stories and the like, right at the top of the report.
To start you off, I've included a video of a 2012 WLM success story. I'll leave off his name, but he is an octogenarian who contributed about 10 photos to WLM 2012. He's had some problems with categorization and some technical things. Another user noticed some of his covered bridge photos and asked if I could help him a bit. He's up to 3,000 photos contributed to Commons now, most 1 photo per historic site. Sometimes it seems like retaining just one user is worth it.
I think there is a disconnect between reading the WLM mailing list, and its sort-of cheerleading atmosphere, with reading a report like this. Maybe a What does success look like? section right at the top will help.
Whoops, it's getting late. I'll be back tomorrow with specific suggestions about comparisons, retention rate methods etc.
All the best, Smallbones (talk) 05:20, 30 January 2014 (UTC)
Back again. I'm quite serious about "Sometimes it seems like retaining just one user is worth it." In the case above there were 30,000 photos submitted for WLM-US in total for 2012-2013, and the user (mostly outside WLM) has submitted 3,000 photos. He searches for places where we need photos on en:Wikipedia (mostly NRHP sites, but also in articles like List of Museums in Michigan) so I'd guess well over half of his photos are used in articles or lists (or both). So we might be talking about an amount of photos in use equivalent to 10-30% of the amount from all WLM-US, just from one user! Delving into the specifics of this one user (he has some special circumstances) might give a lot of insight into the entire success of WLM-US. There are other individuals worth considering. One of the winners of the US contest in 2012 submitted a couple of hundred photos for WLM 2012, then disappeared for a year, and submitted a couple of hundred for WLM 2013. Mostly 1 per site for previously unillustrated sites. I'm not sure what to conclude from that, but maybe the "output" of the contest comes down to just a couple individuals. A third important case involves one user, who contributed good photos of about 25% of the sites in a medium sized state in WLM 2012. He was about halfway through writing and illustrating a series of 3 e-books about NRHP sites in his state, had used us as a resource, and wanted to give back to the project. I encouraged him to help start articles on some sites and he started doing that, but quit everything in December 2012, because of what he considered to be nasty feedback from our editors. Considering that feedback - it was certainly not friendly, but wasn't much worse than what most editors get starting off.
Comparisons are needed for most numbers - it's hard to interpret their meaning without them. So for example, "the percentage of photos in use" seems to be very important and has an easy interpretation: 100% would be good, though we know that's impossible, and 0% would indicate a major problem. So what's "the usual number?" How low a number would indicate a problem? Perhaps we have the average number on Commons somewhere in our computer system? Maybe you have data on other projects? Well, next year you'll at least have this year to compare to. One possible solution is just to take random samples, e.g. select 100 photos on Commons and check to see if they are in use. You'd have a standard error of at most 5% (at most 2.5% if you sample 400 photos), so the results would be somewhat meaningful even if not as accurate as everybody would like. You might even take a random sample of non-WLM photos uploaded in September to see how "photos in use" change over time for the 2 groups.
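The 5% and 2.5% figures follow from the usual standard error of a sample proportion, sqrt(p(1-p)/n), which is largest when p = 0.5. A minimal sketch, where the function name is illustrative and the 17% value is simply the in-use rate quoted from the report:

```python
import math


def proportion_standard_error(p, n):
    """Standard error of a sample proportion p estimated from n random draws."""
    return math.sqrt(p * (1 - p) / n)


# Worst case (p = 0.5) for the two sample sizes mentioned above.
worst_100 = proportion_standard_error(0.5, 100)
worst_400 = proportion_standard_error(0.5, 400)

# If the true in-use rate is near the 17% reported for WLM 2012,
# the standard error for 100 photos is smaller than the worst case.
at_17_pct = proportion_standard_error(0.17, 100)

print(f"n=100, p=0.5:  {worst_100:.3f}")
print(f"n=400, p=0.5:  {worst_400:.3f}")
print(f"n=100, p=0.17: {at_17_pct:.3f}")
```

Quadrupling the sample size halves the standard error, which is why going from 100 to 400 photos takes the worst case from 5% to 2.5%.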
I spent a lot of time when skimming the report on the retention question (perhaps that's why I got a bit discouraged!) As I understand it we don't really have a handle on retention numbers in general, or in specific situations yet, but for just about any group other than active experienced editors, the retention rate over a year's time is something like 1%. Maybe that's just how Wikipedia works, or maybe there's something we can do about it. It's a big question that would be nice to know something more about (BTW, there is a related article in this week's Signpost). One thing to do to start would be to have an overall retention study, not necessarily mining all of the abundant available data; sampling might work just as well. E.g. start with 5-year-old data, take a random sample of 1,000 new editors and see how many last at least a month, then for editors who last a month, see what the half-life is (the time until 50% of the editors are left). Do this for 4-year-old data, 3-year-old data, etc. At that point you might look at editors in identifiable groups, e.g. folks who contribute to both Commons and Wikipedia, vs. those who only contribute to one of the two. Other identifiable groups might be those who contribute a large amount in their first month, those who started in outreach projects, (maybe!) self-identified female editors, major area of contribution (e.g. pop culture, science and math, history). This might be too much of a scattergun approach, but after completing it, you'd likely have a better starting point for the next study.
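The two quantities suggested above (the share of a sample lasting at least a month, and the half-life of those who do) could be computed roughly as follows. This is only a sketch: the lifetimes are made-up numbers, the function name and 30-day threshold are assumptions, and it ignores editors who are still active when the data are pulled, which a real study would need to handle.

```python
from statistics import median


def one_month_survival_and_half_life(lifetimes_days, threshold_days=30):
    """Return (share lasting at least threshold_days, median lifetime of those survivors).

    lifetimes_days: days between each sampled editor's first and last edit.
    The median lifetime of the survivors is the point at which half of them have left.
    """
    survivors = [d for d in lifetimes_days if d >= threshold_days]
    survival_rate = len(survivors) / len(lifetimes_days)
    half_life = median(survivors) if survivors else None
    return survival_rate, half_life


# Made-up lifetimes (in days) for ten hypothetical new editors.
sample = [1, 1, 2, 5, 40, 90, 400, 1, 3, 700]
rate, half_life = one_month_survival_and_half_life(sample)
print(rate, half_life)
```

Running the same function on samples drawn from different starting years, or from different groups of editors, would give the comparisons described above.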
Another possibility is to work backwards. Starting with a sample of editors who have been active for over a year, follow their histories backward. Was the amount of their contributions immediately higher than those for editors of a similar life-span (are Wikipedia editors born, not made?)
Related to WLM, you could do a matched random sample for new editors, e.g. select 1000 new WLM editors, 1000 new non-WLM Commons contributors and follow their experience over time.
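A minimal sketch of that comparison, using equal-size simple random samples rather than samples matched on any particular characteristic. The cohorts here are synthetic placeholders, not real data, and the names `sample_retention`, `wlm_cohort`, and `other_cohort` are illustrative.

```python
import random


def sample_retention(cohort, sample_size, rng):
    """Draw a simple random sample and return the share still active at follow-up.

    cohort: list of booleans, True if that editor was still active at follow-up.
    """
    picked = rng.sample(cohort, min(sample_size, len(cohort)))
    return sum(picked) / len(picked)


# Synthetic placeholder cohorts; real flags would come from contribution logs.
wlm_cohort = [True] * 20 + [False] * 1980
other_cohort = [True] * 30 + [False] * 1970

rng = random.Random(2014)  # fixed seed so the draw can be repeated
wlm_rate = sample_retention(wlm_cohort, 1000, rng)
other_rate = sample_retention(other_cohort, 1000, rng)
print(f"WLM sample: {wlm_rate:.3f}, non-WLM sample: {other_rate:.3f}")
```

Repeating the draw at several follow-up dates would show how the two groups diverge over time.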
These are just ideas off the top of my head but I hope they might inspire other ideas. BTW, if anybody is interested in working on a possible long-term study of random ARTICLES, please contact me. Smallbones (talk) 18:17, 30 January 2014 (UTC)
Hi Smallbones,
Yes, I'm also very happy that we started the discussion around evaluation. Although we now have access to some initial numbers, I think there's a lot that needs to be done in the future. I think the main limitations of the current results are that (a) the data base is still very narrow (on the one hand, the evaluation team strives to get as much data as possible; on the other hand, we have to be very careful not to overwhelm program leaders with data requests; I think it will be essential to find the right balance in the future), and (b) most of the data is quantitative and we'll need a lot more qualitative data to back things up (the evaluation team is already addressing this issue by helping program leaders to set up surveys).
When it comes to your question of what success looks like, I also agree with you. The example of the one user who uploaded tons of good photos reminded me of a case that happened around 2006. At that time, I was still living in Germany and was very much involved with the German chapter. We organized our first Wikipedia Academy and one of our participants, a retired professor of agriculture, got hooked on Wikipedia. With some in-person support (I visited him several times a month and helped him to understand Wikipedia's guidelines and workflows; when I moved to the US, another local Wikipedian took over), he created more than 500 high-quality articles. Looking back, I think that's a success story. At the same time, I always saw this as some kind of random success. Don't get me wrong, I still think it's as much of a success as your example of the new Commons photographer who came in through Wiki Loves Monuments. But wouldn't it be nice if we could find a way that would make it more predictable whether someone becomes a long-term contributor? If we figured this out, our outreach efforts would become more effective.
So, for Wiki Loves Monuments, that would mean finding out what the conditions are that make people come back or that even turn them into long-term photo uploaders. And the conditions are not the only factor that might come into play. I also think that some people are more prone to becoming long-term contributors than others. For Wikipedia, I guess that the people who are more likely to stay are those who enjoy doing research, who enjoy writing, and who have some special interest. And of course, they have to believe in our values. Now, what would that look like on Commons? Is there a way that we could do some targeted outreach to people who already own a DSLR and who'd enjoy sharing their photos with others? What other characteristics would those people need to have in order to be eager to stay?
One solution, as you pointed out, would be to really deep-dive into analyzing the behaviors and attitudes of people who became long-term contributors to Commons in the past. And I agree with you that this should be something high on our priority list.
Having said that, I really enjoy this conversation. It makes me very happy to see that the evaluation team has been getting so much helpful and constructive feedback over the last couple of months. --Frank Schulenburg (Wikimedia Foundation) (talk) 19:27, 30 January 2014 (UTC)
One quick comment - your example didn't seem to be random at all. Very similar (but not the same) as my example. It's clear that PhDs are very much over-represented among our editorship. As a matter of fact older folks with advanced degrees are a real bulwark on Wikipedia. Maybe 7 out of 10 of my favorite editors fit in this category. Lots of quality, lots of quantity, good quiet leadership and a basic anti-drama stabilizing force. Maybe we should set up a "Faculty Lounge" to go along with the Tea House? Smallbones (talk) 22:39, 30 January 2014 (UTC)

"A third important case involves one user, who contributed good photos of about 25% of the sites in a medium sized state in WLM 2012. He was about halfway through writing and illustrating a series of 3 e-books about NRHP sites in his state, had used us as a resource, and wanted to give back to the project.

I encouraged him to help start articles on some sites and he started doing that, but quit everything in December 2012, because of what he considered to be nasty feedback from our editors. Considering that feedback - it was certainly not friendly, but wasn't much worse than what most editors get starting off."

That is terrible, and we should be focused on preventing it. The evaluation program says this:

"Without a significant investment in technical resources to make the contributing and interacting interface on Wikimedia Commons better and community consensus to make Commons more welcoming to newcomers, it's hard to imagine image upload campaigns like Wiki Loves Monuments will achieve a higher retention rate of new contributors."

I disagree with the first part. The latest file uploading pages are fine. The biggest problem around is reaction by other users. The "Thank" button is a nice gadget which helps a little, but rejection has deeper roots.

I think that editors are very focused on keeping the projects full of good articles and files, which is essential. But they often forget about trying to encourage other users to keep participating in the projects.

Policies sometimes work against user retention. And some editors think that policies are more important than the ultimate goals of the projects. I've tried to propose changes in policies in the Spanish-language Wikipedia, but it's hard to convince people of the issues with current rules.

This is where the Foundation should intervene: convincing the most active project editors to lead such changes. --NaBUru38 (talk) 20:36, 9 February 2014 (UTC)

Warning: non-standard metrics definitions

Note still missing from the page text: the metrics used in this report are (in part) non-standard, compared to mw:Analytics/Metric definitions and (draft) Category:WMF standardized editor classes.[1] Keep this in mind when referring to this report, until a sync between definitions here and in the standards happens. --Nemo 09:12, 1 February 2014 (UTC)

Thank you, Nemo, for posting here. To be clear, in this series of reports we have introduced a new metric of new editors who are retained and active editors at two follow-up points (three and six months, for two quarterly follow-ups). This longer-term "active" editor retention lens is new, and has not been standardized. We describe this metric in the overview and in the report text in the section on "active editor retention", which is split by cohort: new contributors in WLM 2012 or existing contributors in WLM 2012. Further specificity is also provided as notes in the in-text references and report overview description notes for how these metrics were constructed.
Whether or not this way of calculating active editor retention at three and six months will be how we standardize this metric, this is how we have done so here. In essence we are proposing this as a standard metric for evaluating survival and "active editor retention" for programs with that aim. We look at this differently for new vs. existing users because the two groups are expected to differ. Still, retention of active editors, while named by many, may not be among the primary goals of many programs or program leaders; a systematic lens is needed that allows program leaders to see which goals and metrics are appropriate for telling the story of their program. We are adding to that work here. Still, this is only one piece of many future standardized metrics for common program goals; it does not exclude the possible use of others such as "new active editor" or otherwise. We have simply introduced another lens.
I do not understand what the required sync you refer to is as this metric can only be aligned with definitions that do exist (which is only what an "active" editor is); not "retained active editors" at three and six months which is the lens we are applying for programs. This is why we described in the overview page as well as report text and references how we standardized these for this report series. Please, could you tell me more specifically: why you feel we need a warning, and what, if any, suggestions you have regarding the design of this new metric. JAnstee (WMF) (talk) 18:55, 2 February 2014 (UTC)

Humph

“Without […] community consensus to make Commons more welcoming to newcomers, it's hard to imagine image upload campaigns like Wiki Loves Monuments will achieve a higher retention rate of new contributors.”

Huh. This had escaped my first reading. Hum. I suppose this was not the intention, but it makes it sound as if there is some community consensus on Commons to be unwelcoming to newcomers. Please let me know what makes you think so; and please be careful with such constructs − frankly, I find it highly insulting, as a member of the Commons community. Jean-Fred (talk) 01:32, 10 February 2014 (UTC)

See the thread #Key messages. --NaBUru38 (talk) 15:02, 10 February 2014 (UTC)
Wiki Loves Monuments was the first wiki event that I was involved in as an active participant. And today, I am a staff member of the DC chapter. Geraldshields11 (talk) 21:14, 7 April 2014 (UTC)

WLM Participants' Surveys 2012 / 2013

In 2011, 2012, and 2013, a WLM Participants' Survey was carried out. The 2012 and the 2013 editions used a very similar questionnaire, in order to allow for longitudinal comparisons. An Overview with further information on the 2013 edition can be found on Commons, including a link to a draft analysis plan. So far, I have the impression that the analysis of the survey data will lead to valuable insights that are complementary to the analyses that have already been carried out by the program evaluation team.

The 2012 data has been cleansed, and a preliminary analysis has been carried out by students of the Bern University of Applied Sciences. With regard to the 2013 data, I'm about half way through the data cleansing process. I still need to negotiate with the WLM international team to whom and in what form the data can be made available for further analysis.

Main findings of the preliminary analysis of the 2012 data:

  • There are notable differences between age groups as to whether they have edited Wikipedia before participating in WLM (the ratio of new contributors being higher for the age groups 1-20 years as well as 50plus). No notable differences between the age groups have been found with regard to gender, how they found out about WLM, usability, likelihood to participate again or to recommend the contest, or likelihood to contribute to Wikipedia in the future.
  • There are notable differences between male and female respondents as to whether they have edited Wikipedia before participating in WLM (the ratios of females that have never edited Wikipedia before or have done so only sporadically being substantially higher than the ratios of their male counterparts). No notable differences between male and female respondents have been found with regard to usability, likelihood to participate again or to recommend the contest, or likelihood to contribute to Wikipedia in the future.
  • There are notable differences between countries, especially with regard to participants’ demographics (age, gender), participants’ previous contribution to Wikipedia/Wikimedia, participants’ motivation to participate, usability, likelihood of participating again or recommending the contest to friends, as well as likelihood to contribute to Wikipedia/Wikimedia in the future.

I think it is quite safe to conclude from these results that WLM is likely to have a positive impact on age and gender diversity on Wikipedia. Further investigation is needed into country differences. Our students have carried out a cluster analysis which suggests that the main country differences lie in usability (How easy was it to find information about the contest and the monuments? How easy was it to upload pictures?). Furthermore, a regression analysis suggests that there is a link between user experience and likelihood to participate in future contests. These questions are definitely worth looking into further.

See also the draft analysis plan for further suggested analyses.

--Beat Estermann (talk) 11:57, 19 February 2014 (UTC)