Research:Article feedback/Quality assessment

This article presents a summary of the analyses performed during an experimental trial of version 5 of the Article Feedback Tool (AFTv5). The primary focus of this analysis was to determine which interface gadget (see the three gadget options) elicited the best signal-to-noise ratio.

The methods section describes the analytical approach. The proportion of useful feedback for each version of AFTv5 is presented in bar plots and explained in each section.

Questions[edit]

  1. Which version of the gadget interface elicits the most useful feedback?
    1. Did the proportion of useful feedback change when the link was made prominent?
  2. What types of feedback do the versions of the gadget solicit?
  3. How is the usefulness of feedback different for popular/controversial articles?
  4. Which version of the gadget elicits the most inappropriate feedback?

Methods[edit]

In order to evaluate the quality of feedback received by the different versions of AFTv5, a combined qualitative and quantitative approach was used. Four samples of feedback elicited by AFTv5 were gathered (three from randomly picked articles, one from hand-picked articles) and hand-coded by a group of 17 volunteers. The coded feedback was then analyzed using proportion tests between the different categories.

Feedback samples[edit]

Four samples of feedback data were acquired from two different populations of articles: randomly picked and hand-picked. The randomly picked set consists of 22,480 articles (11,611 until January 3, 2012) drawn at random from the pool of encyclopedia articles to allow for inference across the set of all Wikipedia articles. The hand-picked set consists of 115 articles that were picked by WMF staff and Wikipedians to test the feedback interface.

Three random samples over the randomly picked articles were gathered over the course of the experiment. These samples were gathered in such a way that no two feedback items belonged to the same article, which reduced the effect of pages that received a large amount of feedback and would have otherwise dominated the samples (a code sketch of this one-item-per-page sampling follows the list below).

  • 12-27: This sample was gathered on Dec. 27th, 2011 and consists of 99 option #1, 99 option #2 and 31 option #3 feedback submissions.
    • Note that the lower number of option #3 submissions was due to a bug that caused some feedback submitted via that interface to be ignored. This bug was fixed before any of the other samples were taken.
  • 01-09: In order to supplement the sample acquired on Dec. 27th, an additional sample was gathered on Jan. 9th, 2012 that consists of 151 option #1, 151 option #2 and 112 option #3 feedback submissions.
  • post_01-11: To look for differences in usefulness after a change to the feedback link placement, another sample was gathered consisting of 100 option #1, 100 option #2 and 100 option #3 feedback postings.
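
A minimal sketch of the one-item-per-page sampling described above, assuming the feedback items live in a pandas DataFrame with hypothetical page_id and option columns (the rows below are placeholders, not the study data):

    import pandas as pd

    # Hypothetical feedback table: one row per feedback item, tagged with the
    # article it was left on and the interface option that solicited it.
    feedback = pd.DataFrame({
        "page_id": [101, 101, 102, 103, 103, 103, 104],
        "option":  [1,   2,   1,   3,   3,   1,   2],
        "text":    ["a", "b", "c", "d", "e", "f", "g"],
    })

    # Keep at most one randomly chosen feedback item per article so that pages
    # receiving a large amount of feedback cannot dominate the sample.
    one_per_page = (
        feedback.sample(frac=1, random_state=42)  # shuffle the rows
                .groupby("page_id")
                .head(1)                          # one (random) item per page
    )

    # Then draw the per-option sample sizes (e.g. 99/99/31 for the 12-27 sample).
    sample = one_per_page.groupby("option", group_keys=False).apply(
        lambda g: g.sample(n=min(len(g), 99), random_state=42)
    )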

A random sample of feedback from hand-picked articles was generated to explore what differences in feedback usefulness exist for popular articles. Unlike the randomly picked article samples, this sample was *not* limited to one feedback item per page, because the intent was to measure the overall usefulness of feedback on popular pages, including those that dominate the sample.

  • hand-picked: 100 option #1, 100 option #2 and 100 option #3 feedback postings to hand-picked articles

Qualitative Hand Coding[edit]

All sampled feedback was loaded into the Feedback Evaluation System (FES, pronounced "fez" like pez), a dynamic user interface developed specifically to aid volunteers in quickly performing this task. Seventeen volunteers coded between 50 and 359 items each. Each feedback item was coded by at least two volunteers so that their evaluations could be verified and analysis could be performed over their agreements and disagreements.[1]
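
The report does not state how agreement between raters was quantified; one common approach is raw agreement supplemented with Cohen's kappa, sketched below on made-up codings (the rater lists are illustrative only):

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical boolean "useful" judgements from the two raters of each item.
    rater_a = [1, 1, 0, 1, 0, 1, 1, 0]
    rater_b = [1, 0, 0, 1, 0, 1, 1, 1]

    # Raw agreement: share of items where the two raters gave the same code.
    raw_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

    # Cohen's kappa corrects that share for agreement expected by chance.
    kappa = cohen_kappa_score(rater_a, rater_b)
    print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.2f}")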

Volunteers were also asked to categorize feedback according to a set of pre-determined categories derived from an analysis of early-submitted feedback performed by EpochFail and Ironholds. This categorical data is used to further differentiate the quality of feedback by the type of feedback left and to estimate the rate of abuse that should be expected in a wider deployment.

Quantitative analysis[edit]

Since the usefulness of feedback was qualitatively coded as a boolean outcome, proportion tests were used to look for significant differences between the signal-to-noise ratios of the three gadget options for AFTv5 and between the signal-to-noise ratios of the various categories of feedback.

Where possible, bar plots of proportions include standard error bars based on the normal approximation of a binomial distribution.
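
As one way to reproduce this kind of comparison, the sketch below runs a two-sample test of proportions with statsmodels and computes the normal-approximation standard error used for the error bars; the counts are placeholders, not the study data:

    import math
    from statsmodels.stats.proportion import proportions_ztest

    # Placeholder counts: items coded useful and items sampled, for two options.
    useful = [64, 58]   # hypothetical "useful" counts per gadget option
    total = [99, 99]    # hypothetical sample sizes per gadget option

    # Two-sample test of proportions (normal approximation).
    z_stat, p_value = proportions_ztest(count=useful, nobs=total)
    print(f"z = {z_stat:.3f}, p = {p_value:.3f}")

    # Standard error of each proportion under the normal approximation of a
    # binomial distribution, as used for the error bars in the figures.
    for k, n in zip(useful, total):
        p = k / n
        se = math.sqrt(p * (1 - p) / n)
        print(f"p = {p:.3f} ± {se:.3f}")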

Results[edit]

Which option produces the most useful feedback?[edit]

Analysis of the 12-27 and 01-09 samples showed that the three gadget options did not differ significantly in the proportion of useful feedback they acquired. Figure 1 shows consistent but non-significant differences between the proportion of useful feedback solicited by each of the interface options for each of the rating aggregation methods (defined below). Overall, about 65% of the feedback evaluated was marked as useful by at least one coder, 45% was marked as useful by both, and 38% was marked as useful by both with neither unsure about the assessment. This result suggests that a substantial amount of useful feedback is being acquired by AFTv5, but also that the particular interface gadget used to solicit this feedback did not have a significant effect on usefulness.

Rating aggregation schemes:

  • "someone" = at least one rater marked the item as useful
  • "everyone" = both raters agreed that it was useful
  • "strict" = both raters agreed and neither marked that they were unsure

Figure 2 shows the difference in feedback utility elicited via the more prominent link. Although we originally hypothesized that feedback acquired via this link would be lower in quality (and therefore usefulness), our analysis suggests that the quality of feedback from the prominent link may be higher than that received directly via the feedback form at the bottom of the article. However, this effect is small and not quite significant. Analysis of a larger sample will be necessary to determine whether the effect is significant.

Figure 1.
Proportion of useful feedback for each option by agreement
Error bars represent standard error based on the normal approximation of a binomial.
Figure 2.
Proportion of useful feedback by link placement by agreement
Error bars represent standard error based on the normal approximation of a binomial.


What types of feedback do the versions of the gadget solicit?[edit]

Figure 3 shows minor differences between the types of feedback solicited by each version of the AFTv5 interface. A chi-squared test was used to determine whether any differences were significant (a code sketch of such a test follows the list below).

  • The proportion of useful issues raised is higher for option #2 than option #1 (diff=0.225, p=0.035)
  • Option #3 acquired more praise than option #2 (diff=0.067, p=0.084 [marginal])
  • Option #3 acquired fewer questions than both option #1 (diff=0.1, p=0.005) and option #2 (diff=0.064, p=0.059 [marginal])
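
A minimal sketch of such a chi-squared test, using scipy on a placeholder contingency table of feedback-type counts per option (the counts are illustrative, not the study data):

    from scipy.stats import chi2_contingency

    # Placeholder contingency table: rows are gadget options, columns are
    # feedback categories (e.g. issue, praise, question, suggestion, other).
    counts = [
        [30, 25, 20, 15, 9],    # option #1 (hypothetical counts)
        [35, 22, 14, 18, 11],   # option #2
        [20, 30, 10, 22, 8],    # option #3
    ]

    chi2, p, dof, expected = chi2_contingency(counts)
    print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")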

Figure 4 shows the proportion of feedback determined useful (by the "everyone" aggregation method). Suggestions and issues appeared to have the highest ratio of useful feedback, with 82% and 76% utility respectively. It's interesting to note that praise and questions were also agreed to be useful more than 50% of the time. As expected, abuse and irrelevant feedback were rarely marked as useful, and such small proportions could be attributed to evaluator mistakes.


Figure 3.
The proportion of the category of feedback solicited by AFTv5 is plotted as a stacked bar of "useful" and "not useful" feedback (using the "everyone" aggregation method for utility).
Figure 4.
Proportion of useful feedback by link placement and by the intention of the feedback leaver (using the "everyone" aggregation method for utility).
Error bars represent standard error based on the normal approximation of a binomial.


How is the usefulness of feedback different for popular/controversial articles?[edit]

Figure 5 shows the proportion of useful feedback by sample. The proportion of useful feedback left on the hand-picked articles is substantially lower than that of the feedback left on articles from the three random samples (which controlled for the effects of popular pages). Taken together, this suggests that the average page receives higher quality feedback than pages picked for their popularity or controversial topics. This result confirms our hypothesis, although it is interesting to note that 33% of the feedback received on popular/controversial pages was still agreed to be useful by both Wikipedians who reviewed it.

Figure 6 gives us an idea of the types of feedback solicited on the differing articles. Unsurprisingly, abuse and irrelevant feedback are much more common in the hand-picked articles than in the random sample, but suggestions are substantially less common. This could be because the hand-picked articles are all of relatively high quality, making it more difficult for viewers to find something to suggest.

Figure 5.
The overall signal (proportion of useful feedback) is plotted for all samples. Note that the post_01-11 sample contains feedback submitted via the prominent placement link while 12-27 and 01-09 do not. Error bars represent standard error based on the normal approximation of a binomial.
Figure 6.
Proportion of feedback is plotted by intention of the feedback leaver and usefulness of the feedback.

Which version of the gadget elicits the most inappropriate feedback?[edit]

To determine how much feedback would eventually need to be hidden, FES was updated to include the question "Should this feedback be hidden?" and the volunteer coders were asked to make that judgement call. Figure 7 shows that a significantly larger proportion of inappropriate feedback is submitted via hand-picked articles (30.1-40.2%) than randomly sampled articles (18.6-30.1%). However, figure 8 fails to show a significant difference between the three interface options regardless of the aggregation method.

Figure 7.
The proportion of inappropriate feedback is plotted for both recent samples. Error bars represent standard error based on the normal approximation of a binomial.
Figure 8.
The proportion of inappropriate feedback is plotted for each interface option and for both recent samples. Error bars represent standard error based on the normal approximation of a binomial.

Is the found/not found question for Option #1 useful?[edit]

Since Option #1 elicited the most feedback, supplemental analysis was performed to examine whether there was signal in the "Did you find what you were looking for?" responses. After dividing the responses between those who answered "yes" (found=yes) and "no" (found=no), there were significant differences in the proportions of feedback types. As figure 9 suggests, there were significantly more issues reported (diff=0.163, p<0.001) and questions raised (diff=0.163, p<0.001) when the users leaving the feedback indicated that they had not found what they were looking for, but significantly more praise for editors and Wikipedia (diff=0.297, p<0.001) when users indicated that they had found what they were looking for. Overall, we did not find any significant differences in the usefulness of feedback between the two cases or for any of the feedback types.

This result suggests that the answers users leave to the question are consistent with what we might expect them to leave as feedback in the free-form text field, and that the yes/no question is likely to carry signal that may be useful to editors. The fact that Option #1 generates a significantly higher volume of feedback with text flagged as "not found" than "found" also suggests that the comments we collect will overrepresent questions and issues over other types of feedback.

Figure 9.
The proportion of feedback is plotted for two conditions of feedback elicited by interface option #1: where the user indicated that they found what they were looking for (found=yes) and where they indicated that they had not (found=no). Substantial differences in the type of feedback left in these two cases suggest an intuitive relationship between the free form text feedback left and the answer to the "found" question.

Summary[edit]

The results of this analysis do not show a clear "winner" among the three experimental gadget options for AFTv5 despite relatively strong statistical power. This suggests that there is unlikely to be any substantial difference between the quality of feedback elicited by the three interfaces.

Minor differences were found between the types of feedback elicited. Option #3 elicited more praise than option #2 and fewer questions than both options #1 and #2, while option #2 elicited a higher proportion of useful issues than option #1. It's not immediately clear to us what these differences might mean for deployment decisions, but it is our hope that they will be informative to subsequent analysis.

Of the productive types of feedback (praise, question, issue and suggestion), issues and suggestions tended to have the highest signal-to-noise ratio. Designers of the next version of the Article Feedback Tool may consider emphasizing these types of feedback in order to direct readers of Wikipedia articles towards feedback that Wikipedians find useful.

Overall, the quality of feedback produced on a random sample of pages (unbiased and therefore dominated by unpopular pages that receive little feedback) is much higher than that of the 115 hand-picked page sample. As expected, this result appears to be driven by abusive feedback and nonsense/irrelevant postings on the hand-picked articles.


References & Footnotes[edit]

  1. No substantial/effectual disagreements were discovered upon analysis, as reported in the results.