Research:Understanding perception of readability in Wikipedia/Pilot survey


The aim of the pilot survey is to test our approach for measuring how readers perceive readability of Wikipedia articles. This involves not only the survey itself but also the infrastructure to host the survey and recruit participants.

The desired output of the survey is a set of readability scores for N article snippets, based on the ratings provided by the participants.

Methods

Survey design

We develop a survey following the approach by Benoit et al. [1].

The idea is to ask participants to compare two snippets at a time and rate which of the two is easier for them to read. From these relative ratings of pairs, we can derive absolute readability scores for the snippets using the Bradley-Terry model. In principle, one could ask participants to rate each snippet on some predefined scale, but experience shows that humans find it considerably easier to make pairwise comparisons with respect to a trait.
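
As an illustration of how pairwise ratings can be turned into scores, the following is a minimal sketch of fitting a Bradley-Terry model with a standard minorization-maximization (MM) update. The function name, data format, and convergence settings are illustrative assumptions and are not taken from the project's actual analysis code.

```python
import numpy as np

def bradley_terry_scores(n_items, comparisons, n_iter=1000, tol=1e-8):
    """Fit Bradley-Terry strengths from pairwise outcomes via an MM update.

    comparisons: iterable of (winner, loser) index pairs, where the "winner"
    is the snippet rated easier to read in that comparison. Assumes the
    comparison graph is connected and every snippet wins at least once
    (otherwise the estimates degenerate and regularization would be needed).
    Returns strengths normalized to sum to 1; higher means easier to read.
    """
    wins = np.zeros((n_items, n_items))
    for winner, loser in comparisons:
        wins[winner, loser] += 1

    total_wins = wins.sum(axis=1)      # W_i: number of comparisons won by i
    n_ij = wins + wins.T               # number of comparisons between i and j
    p = np.ones(n_items) / n_items     # initial strengths

    for _ in range(n_iter):
        # MM update: p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j)
        denom = (n_ij / (p[:, None] + p[None, :])).sum(axis=1)
        new_p = total_wins / denom
        new_p /= new_p.sum()           # fix the overall scale (identifiability)
        if np.max(np.abs(new_p - p)) < tol:
            return new_p
        p = new_p
    return p

# Toy example: snippet 0 is rated easier than 1 and 2, while 1 and 2 split.
print(bradley_terry_scores(3, [(0, 1), (0, 2), (1, 2), (2, 1)]))
```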

Details about the survey on readability:

  • Snippets: We select N=21 snippets (see Research:Understanding perception of readability in Wikipedia/Pilot survey snippets). All snippets were chosen from the same topic (Food and drink) so that differences in participants' familiarity with different topics do not affect the comparisons. All snippets are of approximately equal length. Snippets are sampled to cover a maximum range of automatic readability scores (Flesch reading ease) in order to make potential differences in readability more pronounced.
  • Pairs of snippets: We randomly select E=40 different pairs of snippets. The pairs are sampled such that the resulting comparison graph (snippets as nodes, pairs as edges between nodes) forms a single connected component; see the sampling sketch after this list. In total, there are N*(N-1)/2 possible pairs (for N=21 that would be 210); however, in practice, it is often sufficient to sample E ~ 2*N pairs.
  • Ratings by the participants: Each participant of the survey is shown a randomly chosen subset of S=10 different pairs, one at a time, and asked to rate, for each pair, which snippet is easier for them to read. We aim to get R=3 ratings per pair. For this, we need to recruit at least P = E*R/S = 40*3/10 = 12 participants. To account for missing ratings (undecided) or uneven sampling, we increase the number of participants by roughly 20%, giving P=15. The order in which pairs are presented is random.
  • Final scores: The survey will yield T = P*S = 150 ratings of the form (snippet 1, snippet 2, rating of which one was easier, index of the participant). Given all the ratings from all the pairs, we can calculate the readability scores of the snippets using the Bradley-Terry model (see the fitting sketch above).
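
To make the connectivity requirement for the sampled pairs concrete, below is a minimal sketch of one possible sampling strategy: first draw a random spanning tree over the snippets (which guarantees a single connected component), then add further random pairs until E pairs are selected. The function name and the strategy are illustrative assumptions, not the project's actual sampling code.

```python
import itertools
import random

def sample_connected_pairs(n_snippets, n_pairs, seed=0):
    """Sample n_pairs distinct snippet pairs whose comparison graph is connected."""
    assert n_pairs >= n_snippets - 1, "too few pairs to connect all snippets"
    rng = random.Random(seed)

    # Random spanning tree: attach each snippet to a randomly chosen earlier one.
    nodes = list(range(n_snippets))
    rng.shuffle(nodes)
    pairs = set()
    for i in range(1, n_snippets):
        a, b = nodes[i], rng.choice(nodes[:i])
        pairs.add((min(a, b), max(a, b)))

    # Fill up with additional distinct random pairs.
    remaining = [p for p in itertools.combinations(range(n_snippets), 2)
                 if p not in pairs]
    rng.shuffle(remaining)
    pairs.update(remaining[: n_pairs - len(pairs)])
    return sorted(pairs)

# Pilot parameters: N=21 snippets, E=40 pairs.
pairs = sample_connected_pairs(21, 40)
print(len(pairs))  # -> 40
```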

Additional questions:

  • Informed consent: We describe the purpose of the study and ask participants for their consent and to acknowledge the survey privacy statement.
  • Language proficiency: We ask participants about their proficiency level in speaking, reading, and understanding English on a scale from 1 to 5.
  • Topical interest: We ask participants how much they are interested in the topic of Food and drink on a scale from 1 to 5. All snippets come from articles labeled with that topic. This question allows us to take each participant's overall familiarity with the topic into account in the analysis.
  • Attention check: We check the attention of the participant using an Instructional Manipulation Check following Prolific's attention check policy (one of two allowed types of attention check). This is not used for rejecting participants from taking the survey, i.e. regardless of the participant's answer, they will continue to take the full survey. Nevertheless, it serves two purposes. First, it can be used when post-processing survey responses, potentially improving data quality. Second, studies suggest that passing attention checks increases the motivation of participants [2].

Survey infrastructure

Recruiting participants

We recruit participants via Prolific, an online platform for conducting research with interested volunteers.

We recruit participants from all available countries using Prolific's standard sample. We apply the following pre-screening criterion:

  • Participant is fluent in English. This is to ensure that participants fully understand the instructions of the survey to evaluate the readability of snippets.

We pay the standard hourly rate to participants.

Hosting the survey

We implement the survey in Limesurvey.

Using the instance run by the Wikimedia Foundation has several advantages over other tools such as Google Forms. Most importantly, Limesurvey is free and open-source software, and we keep full ownership of the data.

Moreover, Limesurvey can be easily integrated with Prolific, where participants are recruited:

  • Prolific -> Limesurvey: After activating the survey in Limesurvey, we can add the survey's public URL (something like https://wikimediafoundation.limesurvey.net/<SURVEY_ID>) to Prolific. This is the link that participants will click to take the survey. We can use Limesurvey's panel integration to record the Prolific ID of the participant in Limesurvey.
  • Limesurvey -> Prolific: Prolific provides a completion URL. We add this to the End URL field in Limesurvey. Upon completing the survey in Limesurvey, the participant is automatically redirected back to Prolific, which lets Prolific know that the study was completed by the respective participant.
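
To make the wiring between the two systems concrete, the following sketch shows how the two URLs fit together. The survey ID, completion code, and parameter placeholder are hypothetical examples based on Prolific's standard URL-parameter mechanism; they are not the values used in this study.

```python
# Study URL entered in Prolific: Limesurvey's public survey link, with a URL
# parameter appended so that Prolific can pass the participant's ID along.
SURVEY_ID = "123456"  # hypothetical survey ID
study_url = (
    f"https://wikimediafoundation.limesurvey.net/{SURVEY_ID}"
    "?PROLIFIC_PID={{%PROLIFIC_PID%}}"  # placeholder filled in by Prolific
)

# End URL entered in Limesurvey: Prolific's completion link, which redirects
# the participant back to Prolific once the survey is finished.
COMPLETION_CODE = "ABC123"  # hypothetical completion code from Prolific
end_url = f"https://app.prolific.com/submissions/complete?cc={COMPLETION_CODE}"

print(study_url)
print(end_url)
```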

Results

In summary, we find that this setup is not a reliable way to assess the perceived readability of Wikipedia articles. This is likely because the task of comparing the readability of two snippets drawn from different articles is too difficult. The agreement between raters comparing the same pair of snippets is very low; in fact, it is virtually indistinguishable from chance. The agreement is poor independent of the difference in the automatically assessed readability scores (Flesch reading ease, FRE). As a result, the readability scores inferred from the ratings using the Bradley-Terry model are not statistically significantly correlated with the automatic readability scores.
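
The analysis behind these findings is not spelled out above, so the following is only an illustrative sketch, under assumed data formats, of how chance-level agreement between raters and a rank correlation between Bradley-Terry scores and FRE scores could be checked.

```python
from collections import defaultdict
from itertools import combinations

from scipy.stats import spearmanr

def rater_agreement(ratings):
    """Fraction of rater pairs that agree, pooled over snippet pairs.

    ratings: list of (snippet_a, snippet_b, choice) tuples, where choice is
    the snippet rated easier. With two options, chance agreement is ~0.5.
    """
    by_pair = defaultdict(list)
    for a, b, choice in ratings:
        by_pair[(min(a, b), max(a, b))].append(choice)
    agree = total = 0
    for choices in by_pair.values():
        for c1, c2 in combinations(choices, 2):
            agree += (c1 == c2)
            total += 1
    return agree / total if total else float("nan")

# Rank correlation between inferred and automatic scores (hypothetical arrays
# bt_scores and fre_scores would hold per-snippet values in the same order):
# rho, p_value = spearmanr(bt_scores, fre_scores)
```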

The main conclusion from this pilot is that we need to revise the conceptual approach to how we ask readers about the readability of articles. However, the technological setup worked as expected and can be re-used.

References

  1. Benoit, K., Munger, K., & Spirling, A. (2019). Measuring and Explaining Political Sophistication through Textual Complexity. American Journal of Political Science, 63(2), 491–508. https://doi.org/10.1111/ajps.12423
  2. Shamon, H., & Berning, C. C. (2020). Attention check items and instructions in online surveys: Boon or bane for data quality? Survey Research Methods. https://doi.org/10.18148/SRM/2020.V14I1.7374