Research:Crowdsourced evaluation of Upload Wizard instructions

From Meta, a Wikimedia project coordination wiki
20:56, 25 April 2018 (UTC)
Duration: 2018-April to 2018-May
This page documents a completed research project.

The overall goal of this project is to understand how to phrase instructions in the Upload Wizard interface in order to reduce potential confusion between the "caption" and "description" fields, allowing uploaders to add short captions to each media item they upload that are rich and accurate without directly duplicating the content of the (ideally, longer and more detailed) description field.

Research questions[edit]

  1. Do different sets of instructions elicit captions of different lengths? All things being equal, longer captions contain more metadata about an image. Ideally, we would like uploaders to provide longer captions (within the character limits of the field) rather than shorter ones.
  2. Do different sets of instructions produce better quality captions? The way that people are instructed to describe an image in a caption can influence the overall quality of the captions they generate. Instructions should be as clear and accurate as possible, to ensure the most accurate and detailed captions.

Although small design decisions like how to phrase a particular set of instructions may seem inconsequential in the grand scheme of things, the importance of capturing good-quality captions from Commons contributors at the point of upload, and the opportunity to evaluate a user-testing methodology that has never been tried by the Wikimedia Foundation before, make this study compelling and worthwhile.


Methods[edit]

Perform a controlled study with Amazon Mechanical Turk workers. Each worker will view one of 10 images and be presented with one of 4 different instruction conditions. The instructions tell the worker to add both a caption and a description for the image. We will aggregate the data across all conditions and evaluate which of the 4 instruction sets produced the best overall captions and descriptions.
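The assignment described above amounts to a random draw of one image and one condition per worker. A minimal sketch follows; the identifiers are illustrative placeholders (the page does not publish the actual assignment code), though the condition names match the table below.

```python
import random

# Illustrative identifiers; the real image URLs and condition names
# come from the tables on this page.
IMAGES = [f"image_{i:02d}" for i in range(1, 11)]      # 10 stimulus images
CONDITIONS = ["instruction_1", "instruction_2",
              "instruction_3", "instruction_4"]        # 4 instruction sets

def assign_task(rng=random):
    """Give one Turk worker a random image and a random instruction condition."""
    return rng.choice(IMAGES), rng.choice(CONDITIONS)
```

With enough workers, this uniform draw distributes responses roughly evenly across the 40 image-condition combinations.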

Experimental conditions[edit]

Instructional conditions displayed to Turk workers
Analysis set* instruction set Header text Instruction text
A instruction_1 Preview Text Add a short phrase to convey what this file represents, including only the most relevant information.
B instruction_2 Preview Text Add a one-line explanation of what this file represents, including only the most relevant information.
A instruction_3 Caption Add a short phrase to convey what this file represents, including only the most relevant information.
B instruction_4 Caption Add a one-line explanation of what this file represents, including only the most relevant information.

*We decided to simplify our analysis by ignoring the "Header text" conditions and focusing on the differences in the wording of the instruction text. Hence, our analysis compares only two conditions instead of four; different header text options were randomly distributed among sets A and B.

Images displayed to Turk workers
image number image URL

Policy, Ethics and Human Subjects Research[edit]

This study has been reviewed by the Wikimedia Foundation Legal department. It does not involve collecting private or personal data about the Turk workers who participate. A privacy statement for this study (shown to each Turk worker before they perform the task) is available. We also made sure to follow the best practices for test design, payment, and communication outlined on the WeAreDynamo wiki.[1]


Results[edit]

The final dataset consisted of 286 captions and descriptions distributed randomly across four experimental conditions and ten images. View the source data.


Our first analysis investigated whether different instructions led to captions of different lengths. The average length of captions across the four conditions ranged between 33 and 37 characters. However, we found no statistically significant differences between the average lengths of captions (or descriptions, which varied between 84 and 90 characters on average) generated by Turk workers across our four conditions.
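The page does not state which significance test was used for the length comparison; a two-sample Welch's t-test is one standard choice. A sketch on made-up length samples (the real per-caption data is in the source dataset linked above):

```python
import math
from statistics import mean, variance

# Made-up caption lengths (in characters) for the two analysis sets;
# these are placeholders, not the study's data.
lengths_a = [33, 35, 31, 38, 34, 36, 32, 35]
lengths_b = [37, 34, 39, 35, 36, 38, 33, 36]

def welch_t(xs, ys):
    """Welch's t statistic for two samples with unequal variances."""
    nx, ny = len(xs), len(ys)
    return (mean(xs) - mean(ys)) / math.sqrt(variance(xs) / nx + variance(ys) / ny)

t = welch_t(lengths_a, lengths_b)
```

A |t| below the critical value for the relevant degrees of freedom is what "no statistically significant difference" means here.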


Our second analysis investigated whether the quality of captions was impacted by the instructions. For this analysis, we randomly paired captions generated with one instruction set with other captions (for the same image) generated with the other instruction set. We then performed 126 comparisons between pairs of captions from set A and B for the same image. One coder chose which of the two captions was better, without knowing which experimental condition (set A or B) each caption came from. We were able to come to a decision on 123 of the 126 pairs. The distribution of "wins" and "losses" was:

Coding results
Set Wins Losses
Set A: "short phrase" 50 73
Set B: "one-line explanation" 73 50
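The random pairing procedure above can be sketched as follows. The captions and the per-image data structure are assumed placeholders; the real captions live in the source dataset.

```python
import random

# Placeholder captions per image and instruction set (assumed structure).
captions = {
    "image_01": {"A": ["a cat on a mat", "small cat"],
                 "B": ["a tabby cat on a doormat", "a cat lying down"]},
    "image_02": {"A": ["old bridge"],
                 "B": ["a stone bridge crossing a river"]},
}

def make_pairs(data, rng=random):
    """Randomly pair set-A captions with set-B captions for the same image."""
    pairs = []
    for image, sets in data.items():
        a, b = list(sets["A"]), list(sets["B"])
        rng.shuffle(a)
        rng.shuffle(b)
        pairs.extend((image, ca, cb) for ca, cb in zip(a, b))
    return pairs
```

Each resulting pair is then shown to the coder with the condition labels hidden, which is what keeps the comparison blind.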

A one-proportion Z-test found that captions generated from the "one-line explanation" instructions were significantly more likely (Z=2.085, p=0.0371, 95% CI for the proportion of Set B wins: 50.18% to 68.16%) to be chosen over captions generated from instructions that invited labelers to use a "short phrase" to describe the image.
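The reported statistic can be roughly checked with standard-library arithmetic. This sketch gives Z of about 2.07 rather than the reported 2.085, and a slightly different interval; the discrepancy presumably reflects a continuity correction or a different interval method in the original analysis.

```python
import math

wins_b, wins_a = 73, 50            # from the coding results table
n = wins_b + wins_a                # 123 pairs with a decision
p_hat = wins_b / n                 # observed proportion of Set B wins

# One-proportion Z-test against the null hypothesis p = 0.5
se_null = math.sqrt(0.5 * 0.5 / n)
z = (p_hat - 0.5) / se_null
p_value = math.erfc(z / math.sqrt(2))       # two-sided p-value

# Wald 95% confidence interval for the true proportion of Set B wins
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se_hat, p_hat + 1.96 * se_hat)
```

Since the interval excludes 50%, the preference for Set B captions is unlikely to be chance at the 0.05 level.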


Conclusions[edit]

Instructions inviting labelers to "Add a one-line explanation of what this file represents, including only the most relevant information" about an image produced better quality captions on average. No differences were observed in the length of captions or descriptions produced under any of the experimental conditions. Therefore, we intend to use the phrase "one-line explanation" in the Upload Wizard instructions for eliciting captions.

References[edit]

  1. "Guidelines for Academic Requesters - WeAreDynamo Wiki". Retrieved 2018-05-23.