Research:Article feedback/Stage 2/Quality assessment

From Meta, a Wikimedia project coordination wiki
Article Feedback v5 Data & Metrics
  • Stage 1: Design (December 2011 - March 2012)
  • Stage 2: Placement (March 2012 - April 2012)
  • Stage 3: Impact on engagement (April 2012 - May 2012)


Methods[edit]

Experimental conditions[edit]

  • 1X: No prominent link, form appears at the bottom of the page.
  • 1A: Less prominent link to form at the top of page, form also appears at the bottom of the page.
  • 1E: Most prominent link to form fixed to lower-right side of screen, form also appears at the bottom of the page.

Sampling[edit]

To look for differences between feedback received via the prominent placement links, we gathered a random sample of 300 feedback items for each experimental condition.

Quality assessment[edit]

The feedback evaluation system (FES) (see the documentation and phase 1 analysis) was used by [N] volunteers to rate the 900 sampled feedback items. Each item was rated by at least two Wikipedian volunteers for:

  • Is this feedback item useful: yes/no
  • Are you unsure of your evaluation: yes/no
  • Should this feedback item be hidden: yes/no
  • How would you categorize the intention of this feedback:
    • Abuse - Offensive
    • Irrelevant - Nonsense or other uselessness
    • Issue - Points out a problem
    • Suggestion - Suggests a change
    • Question - Asks a question
    • Praise - Praises Wikipedia or editors

Rating aggregation[edit]

The usefulness of a feedback item was determined by aggregating the (>= 2) ratings using three different strategies:

  • someone = at least one Wikipedian thought it was useful
  • everyone = all Wikipedians (2) thought it was useful
  • strict = all Wikipedians thought it was useful and none marked that they were unsure

Intention was determined by the union of all intentions selected by the volunteer raters. For example, if one rater selected {Suggestion, Question} and another selected {Question, Issue}, the intentions for an item would be determined to be {Suggestion, Question, Issue}.
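The three aggregation strategies and the intention union can be sketched in a few lines of Python. The field and function names below are illustrative only; this is not the FES implementation.

```python
# Illustrative sketch of the aggregation strategies described above.
# Each rating is a dict like {'useful': bool, 'unsure': bool, 'intentions': set}.

def aggregate_usefulness(ratings):
    """Apply the someone / everyone / strict strategies to >= 2 ratings."""
    return {
        'someone': any(r['useful'] for r in ratings),
        'everyone': all(r['useful'] for r in ratings),
        'strict': all(r['useful'] and not r['unsure'] for r in ratings),
    }

def aggregate_intentions(ratings):
    """Union of all intention labels selected by any rater."""
    intentions = set()
    for r in ratings:
        intentions |= r['intentions']
    return intentions

# The worked example from the text: two raters, overlapping intentions.
ratings = [
    {'useful': True, 'unsure': False, 'intentions': {'Suggestion', 'Question'}},
    {'useful': True, 'unsure': True, 'intentions': {'Question', 'Issue'}},
]
usefulness = aggregate_usefulness(ratings)
# someone=True and everyone=True, but strict=False because one rater was unsure
intentions = aggregate_intentions(ratings)
# union: {'Suggestion', 'Question', 'Issue'}
```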

Research questions[edit]

Which experimental interface elicits the highest quality feedback?[edit]

To explore the quality of feedback submitted via each of the interfaces, we measured the proportion of useful comments submitted under the three aggregation strategies and plotted them to look for differences. Utility by experimental condition shows no significant differences in the proportion of useful comments among the experimental conditions for any of the aggregation approaches.

Utility by experimental condition. The proportion of useful feedback is plotted for the three experimental cases using three aggregation strategies for reviews.

This result is interesting due to the massive difference in the amount of feedback elicited via the three different conditions (1E elicited more feedback than 1A and 1X combined). One might expect that, when eliciting feedback from a wider array of readers, quality would suffer. However, this result strongly refutes that hypothesis.
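Pairwise comparisons of useful-feedback proportions like those above can be computed with a standard two-proportion z-test. The counts below are illustrative only, not the study's actual data.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions.

    x1/n1 and x2/n2 are useful-feedback counts over sample sizes.
    Returns (difference in proportions, two-sided p-value).
    """
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF (via erf).
    pval = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p1 - p2, pval

# Hypothetical counts: 120/300 vs. 110/300 items rated useful.
diff, pval = two_proportion_ztest(120, 300, 110, 300)
```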


What types of feedback does each interface elicit?[edit]

To determine whether the different experimental interfaces elicited different types of feedback, we used the intention categorization to explore the proportion of feedback by category and the quality of feedback within each category. Utility by intention shows the proportion of feedback in each category, split by whether it was rated useful or not. Superficially, this plot appears to show the same pattern across experimental conditions.

1A. Condition 1A elicited a higher proportion of irrelevant feedback than 1X (marginal: diff=0.079, pval=0.063), a higher proportion of issues than both 1E (diff=0.1, pval=0.003) and 1X (marginal: diff=0.06, pval=0.094), and a lower proportion of suggestions than 1X (diff=-0.087, pval=0.04). 1A also elicited a lower proportion of useful questions than 1X (diff=-0.203, pval=0.051).

1E. Condition 1E elicited a higher proportion of irrelevant feedback than 1X (marginal: diff=0.079, pval=0.062), a lower proportion of issues than 1A (diff=-0.1, pval=0.003) and a lower proportion of suggestions than 1X (marginal: diff=-0.079, pval=0.062).

Utility by intention. The proportion of feedback is plotted by category of intention and split by useful/not useful. (aggregation strategy = everyone)


How is the quality of feedback submitted via prominent links different?[edit]

To explore how the quality of feedback submitted via prominent links differs, we compared the proportion of useful feedback submitted via the link to the proportion of useful feedback submitted directly via the form at the bottom of the article. Utility by origin shows a minor, insignificant difference in the proportion of useful feedback between the two origins in 1A. The error bars around origin=link for 1A are very large due to the small proportion of feedback submitted via the link (6.0%, n=18). However, the proportion of useful feedback submitted via the link in the 1E condition is significantly lower than that submitted directly via the form (someone: p=0.002, everyone: p=0.027, strict: p=0.029).

To explore this difference in quality, we examined the categorization of feedback submitted via the two origins. Intention & utility by origin for 1E shows the categorization of feedback submitted via the form and the link. A chi^2 test across the categories shows that substantially more irrelevant feedback is submitted via the link than the form (diff=0.133, p=0.031). More questions (diff=0.12, p=0.007) and suggestions (marginal: diff=0.116, p=0.055) were submitted directly via the form as well.
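A chi^2 test over a contingency table of feedback counts (origin by category) can be sketched as follows; the counts here are hypothetical, not the study's data.

```python
# Pearson chi^2 statistic for an r x c contingency table of counts.
def chi2_stat(table):
    """Return (chi^2 statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical counts: rows = origins (form, link), cols = two categories.
stat, dof = chi2_stat([[10, 20], [20, 10]])
# With dof=1, compare stat against the 3.84 critical value (alpha=0.05).
```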

While we hypothesized that this difference could be due to increased usage of the prominent link by anonymous or bad-faith users, our sample of feedback from 1E suggests that registered editors favor the prominent link. While 4.7% of the feedback submitted via the link came from registered editors, we observed that no registered editors submitted feedback via the form, which, given our number of observations, suggests a significant difference (diff=0.047, p=0.035). As Utility of prominent link submission by user status shows, we were surprised to observe that the proportion of useful feedback submitted by registered editors was substantially lower than that of anonymous users. However, due to the low number of feedback items submitted by registered editors in this case (n=8), we can't determine whether this difference is significant with a chi^2 test.

Utility by origin. The proportion of useful feedback is plotted for each condition by the origin of the feedback submission.
Intention & utility by origin for 1E. The proportion of feedback submitted via the prominent link of the 1E experimental case is plotted, broken up by useful/not useful.
Utility of prominent link submission by user status. The proportion of useful feedback submitted via the 1E prominent link is plotted by whether the submitting user was registered or anonymous.

Although these results may appear to suggest that the prominent link in 1E elicits lower quality feedback, the results above show that, overall, the quality of feedback elicited by 1E is consistent with the other two conditions. This suggests that readers who submit useless feedback via the link would have eventually found the form at the bottom of the article, and that low quality feedback is simply more likely to come via the prominent link when it is available. Also worth noting is that more than half of the sampled feedback submitted in the 1E condition came through the prominent link (56.3%, n=169), and by our strictest measure, 32.6% of that feedback was useful.

Conclusion[edit]

The primary motivation for this analysis was to explore differences in the quality of feedback elicited by AFTv5. The research question was: when eliciting feedback from a wider range of readers by increasing the prominence of the form, how does quality change? It seems only natural to expect that quality would decline when eliciting feedback from a wider audience, and specifically from those readers who would not have otherwise found the feedback interface. However, the results above offer a strong rejection of that hypothesis. Overall, we found no indication that the quality of feedback changes substantially when eliciting feedback from more readers.

These results may have wider implications for what it means to elicit broader participation in open contribution systems like Wikipedia. Projects like the Visual Editor, aimed at increasing participation by minimizing technical barriers for new editors, have been criticized (e.g. [1]) over concerns that the new contributions such a system would elicit would be of lower quality and, therefore, undesirable. Although this experiment did not test the reduction of a technical barrier, it does show that eliciting contributions from a wider array of readers does not reduce contribution quality, which is an important step toward assuaging these concerns.