Research:Crowdsourced evaluation of Upload Wizard instructions

From Meta, a Wikimedia project coordination wiki
20:56, 25 April 2018 (UTC)
Duration: 2018-April to 2018-May
This page documents a completed research project.

The overall goal of this project is to understand how to phrase instructions in the Upload Wizard interface in order to reduce potential confusion between the "caption" and "description" fields, allowing uploaders to add short captions to each media item they upload that are rich and accurate without directly duplicating the content of the (ideally, longer and more detailed) description field.

Research questions[edit]

  1. Do different sets of instructions elicit captions of different lengths? All things being equal, longer captions contain more metadata about an image. Ideally, we would like uploaders to provide longer captions (within the character limits of the field) rather than shorter ones.
  2. Do different sets of instructions produce better quality captions? The way that people are instructed to describe an image in a caption can influence the overall quality of the captions they generate. Instructions should be as clear and accurate as possible, to ensure the most accurate and detailed captions.

Although small design decisions like how to phrase a particular set of instructions may seem inconsequential in the grand scheme of things, the importance of capturing good-quality captions from Commons contributors at the point of upload, and the opportunity to evaluate a user-testing methodology that has never been tried by the Wikimedia Foundation before, make this study compelling and worthwhile.


Methods[edit]

Perform a controlled study with Amazon Mechanical Turk workers. Each worker will view one of 10 images and be presented with one of 4 different instruction conditions. The instructions tell the worker to add both a caption and a description for the image. We will aggregate the data across all conditions and evaluate which of the 4 instruction sets produced the best overall captions and descriptions.
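The assignment described above amounts to a random draw of one image and one condition per worker. A minimal sketch follows; the identifiers are illustrative placeholders (the page does not publish the actual assignment code), though the condition names match the table below.

```python
import random

# Illustrative identifiers; the real image URLs and condition names
# come from the tables on this page.
IMAGES = [f"image_{i:02d}" for i in range(1, 11)]      # 10 stimulus images
CONDITIONS = ["instruction_1", "instruction_2",
              "instruction_3", "instruction_4"]        # 4 instruction sets

def assign_task(rng=random):
    """Give one Turk worker a random image and a random instruction condition."""
    return rng.choice(IMAGES), rng.choice(CONDITIONS)
```

With enough workers, this uniform draw distributes responses roughly evenly across the 40 image-condition combinations.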

Experimental conditions[edit]

Instructional conditions displayed to Turk workers
Analysis set* instruction set Header text Instruction text
A instruction_1 Preview Text Add a short phrase to convey what this file represents, including only the most relevant information.
B instruction_2 Preview Text Add a one-line explanation of what this file represents, including only the most relevant information.
A instruction_3 Caption Add a short phrase to convey what this file represents, including only the most relevant information.
B instruction_4 Caption Add a one-line explanation of what this file represents, including only the most relevant information.

*We decided to simplify our analysis by ignoring the "Header text" conditions and focusing on the differences in the wording of the instruction text. Hence, our analysis compares only two conditions instead of four; different header text options were randomly distributed among sets A and B.

Images displayed to Turk workers
image number image URL

Policy, Ethics and Human Subjects Research[edit]

This study has been reviewed by the Wikimedia Foundation Legal department. It does not involve collecting private or personal data about the Turk workers who participate. A privacy statement for this study (shown to each Turk worker before they perform the task) is available. We also made sure to follow the best practices for test design, payment, and communication outlined on the WeAreDynamo wiki.[1]


Results[edit]

The final dataset consisted of 286 captions and descriptions distributed randomly across four experimental conditions and ten images. View the source data.


Our first analysis investigated whether different instructions led to captions of different lengths. The average length of captions across the four conditions ranged between 33 and 37 characters. However, we found no statistically significant differences between the average lengths of captions (or descriptions, which varied between 84 and 90 characters on average) generated by Turk workers across our four conditions.
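The page does not state which significance test was used for the length comparison; a two-sample Welch's t-test is one standard choice. A sketch on made-up length samples (the real per-caption data is in the source dataset linked above):

```python
import math
from statistics import mean, variance

# Made-up caption lengths (in characters) for the two analysis sets;
# these are placeholders, not the study's data.
lengths_a = [33, 35, 31, 38, 34, 36, 32, 35]
lengths_b = [37, 34, 39, 35, 36, 38, 33, 36]

def welch_t(xs, ys):
    """Welch's t statistic for two samples with unequal variances."""
    nx, ny = len(xs), len(ys)
    return (mean(xs) - mean(ys)) / math.sqrt(variance(xs) / nx + variance(ys) / ny)

t = welch_t(lengths_a, lengths_b)
```

A |t| below the critical value for the relevant degrees of freedom is what "no statistically significant difference" means here.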


Our second analysis investigated whether the quality of captions was impacted by the instructions. For this analysis, we randomly paired captions generated with one instruction set with other captions (for the same image) generated with the other instruction set. We then performed 126 comparisons between pairs of captions from set A and B for the same image. One coder chose which of the two captions was better, without knowing which experimental condition (set A or B) each caption came from. We were able to come to a decision on 123 of the 126 pairs. The distribution of "wins" and "losses" was:

Coding results
Set Wins Losses
Set A: "short phrase" 50 73
Set B: "one-line explanation" 73 50
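The random pairing procedure above can be sketched as follows. The captions and the per-image data structure are assumed placeholders; the real captions live in the source dataset.

```python
import random

# Placeholder captions per image and instruction set (assumed structure).
captions = {
    "image_01": {"A": ["a cat on a mat", "small cat"],
                 "B": ["a tabby cat on a doormat", "a cat lying down"]},
    "image_02": {"A": ["old bridge"],
                 "B": ["a stone bridge crossing a river"]},
}

def make_pairs(data, rng=random):
    """Randomly pair set-A captions with set-B captions for the same image."""
    pairs = []
    for image, sets in data.items():
        a, b = list(sets["A"]), list(sets["B"])
        rng.shuffle(a)
        rng.shuffle(b)
        pairs.extend((image, ca, cb) for ca, cb in zip(a, b))
    return pairs
```

Each resulting pair is then shown to the coder with the condition labels hidden, which is what keeps the comparison blind.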

A one-proportion Z-test found that captions generated from the "one-line explanation" instructions were significantly more likely (Z=2.085, p=0.0371, 95% CI for the proportion of Set B wins: 50.18% to 68.16%) to be chosen over captions generated from instructions that invited labelers to use a "short phrase" to describe the image.
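The reported statistic can be roughly checked with standard-library arithmetic. This sketch gives Z of about 2.07 rather than the reported 2.085, and a slightly different interval; the discrepancy presumably reflects a continuity correction or a different interval method in the original analysis.

```python
import math

wins_b, wins_a = 73, 50            # from the coding results table
n = wins_b + wins_a                # 123 pairs with a decision
p_hat = wins_b / n                 # observed proportion of Set B wins

# One-proportion Z-test against the null hypothesis p = 0.5
se_null = math.sqrt(0.5 * 0.5 / n)
z = (p_hat - 0.5) / se_null
p_value = math.erfc(z / math.sqrt(2))       # two-sided p-value

# Wald 95% confidence interval for the true proportion of Set B wins
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se_hat, p_hat + 1.96 * se_hat)
```

Since the interval excludes 50%, the preference for Set B captions is unlikely to be chance at the 0.05 level.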


Conclusions[edit]

Instructions inviting labelers to "Add a one-line explanation of what this file represents, including only the most relevant information" about an image produced better quality captions on average. No differences were observed in the length of captions or descriptions produced under any of the experimental conditions. Therefore, we intend to use the phrase "one-line explanation" in the Upload Wizard instructions for eliciting captions.

References[edit]

  1. "Guidelines for Academic Requesters - WeAreDynamo Wiki". Retrieved 2018-05-23.