Research:Article feedback/Moderation Tools Usability Study


Overview

The simplification of moderation tools presents design ideas for the Article Feedback Tool to help editors moderate feedback more effectively and reduce their workload. A usability test was conducted based on an interactive prototype to verify that some of these ideas worked as expected.

Four categories are provided to classify feedback (different wordings were used in the tests).

Classification takes place immediately to make the workflow more fluent.

The goals for these design changes were:

Making the workflow more fluent. Initially, a dialog asked for optional comments after each moderation action. Most of the time, posts are self-explanatory enough not to require further description, so the dialog was only a barrier to a fluent workflow. With the new design, the dialog is shown only at the user's request. Users can apply several moderation actions without having to explicitly confirm them or provide comments. Users can "undo" an action or provide comments after the fact if they want.
Better mapping user mental models for classification. The designs propose a shift from action-based tools (e.g., hide) to classification actions (e.g., unusable). With action-based tools, users need to evaluate the relevance of the feedback and then make a second effort to decide which action is appropriate for that kind of feedback. With the classification approach, users no longer have to decide on actions: the user indicates how good, neutral or bad the feedback is, and the system takes action automatically.
Increasing the classification coverage. With the initial actions, "feature" was rarely used (less than 1%), and it was not clear to most users what to do with positive but non-constructive feedback. As a consequence, most feedback consumed reviewers' time without a final action. The new designs propose a classification scale that includes specific categories for positive, neutral and negative feedback, so that feedback needs to be processed only once to be classified.
Simplicity. Elements such as "view activity" and the "yes/no" voting were removed for moderators, since they distracted users from the task at hand while adding little value (the moderation actions and the "view details" link already addressed those use cases).
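The workflow described in the goals above (classify immediately, let the system choose the action, allow undo and optional comments afterwards) can be illustrated with a minimal sketch. This is not the Article Feedback Tool's implementation: the class, the method names and the action names are all hypothetical. Only the pairings of "Useful" with featuring and "No action needed" with hiding echo this report; the other two pairings are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Category(Enum):
    USEFUL = "Useful"
    RESOLVED = "Resolved"
    NO_ACTION_NEEDED = "No action needed"
    INAPPROPRIATE = "Inappropriate"


# Hypothetical mapping from a classification to the action the system takes.
# "feature" and "hide" follow the equivalents mentioned in this report;
# "mark_resolved" and "flag_for_removal" are assumed names.
AUTOMATIC_ACTION = {
    Category.USEFUL: "feature",
    Category.RESOLVED: "mark_resolved",
    Category.NO_ACTION_NEEDED: "hide",
    Category.INAPPROPRIATE: "flag_for_removal",
}


@dataclass
class FeedbackPost:
    text: str
    category: Optional[Category] = None
    note: Optional[str] = None
    history: list = field(default_factory=list)

    def classify(self, category, note=None):
        """Apply a classification immediately, with no confirmation dialog."""
        # Remember the previous state so the moderator can undo later.
        self.history.append((self.category, self.note))
        self.category = category
        self.note = note
        return AUTOMATIC_ACTION[category]

    def add_note(self, note):
        """Attach an optional comment after the fact."""
        self.note = note

    def undo(self):
        """Revert to the state before the most recent classification."""
        if self.history:
            self.category, self.note = self.history.pop()
```

In this sketch the moderator only ever chooses a category; the action is looked up rather than selected, which mirrors the shift from action-based tools to classification.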

The usability test was designed not only to verify that the above goals were met but also to inform some of the design choices. In particular, several classification schemes for feedback were considered during the testing sessions. Previous discussions with the community had identified concerns with some of the proposed terms, but testing them in practice made it possible to (1) evaluate how each term was used when shown next to real feedback, and (2) evaluate how a set of terms worked together as a scale for evaluation, as opposed to individual terms.

Method

An HTML prototype that simulated the new features was used in five usability testing sessions to observe the monitoring workflow and identify problems in the designs.

The prototype

The prototype presented the user with real feedback from one of the following Wikipedia articles: Barack Obama, Global warming, and Higgs boson. Each article contained between 26 and 29 feedback posts. The selection of the articles was biased towards high-traffic articles since those are the ones that contain a greater percentage of the non-useful feedback we were targeting.

Users were asked to classify the feedback for the three pages, using a different classification scheme for each one. The classification schemes considered were:

Usable, Done, Unusable, Inappropriate. View prototype for Barack Obama, Global warming, and Higgs boson.
Useful, Done, Unusable. View prototype for Barack Obama, Global warming, and Higgs boson.
Usable, Done, No action needed, Inappropriate. View prototype for Barack Obama, Global warming, and Higgs boson.
Useful, Resolved, Not applicable, Inappropriate. View prototype for Barack Obama, Global warming, and Higgs boson.
Useful, Resolved, No action needed, Inappropriate. View prototype for Barack Obama, Global warming, and Higgs boson.

The testing sessions

Five users from the UK, the US and India participated in the testing sessions, held from 11 to 21 December 2012. Users were asked to process feedback for three different combinations of page and classification scheme. They were encouraged to think aloud while classifying feedback. They were also asked how they interpreted the classification options, and for additional details about specific feedback that they left unclassified, had doubts about, or placed in an unexpected category.

The testing sessions were conducted remotely. The session recordings were analysed and shared with the members of the Editor Engagement team.

Results

The proposed simplifications proved effective in helping users moderate feedback. The conclusions drawn from the user observations are provided below for each of the goals we defined:

User classifying feedback for the "Higgs Boson" article with the final classification scheme.

User classifying feedback for the "Global Warming" article with the final classification scheme.

User interaction shows the problems of not having an explicit "inappropriate" option and the need for a neutral action such as "no action needed"

User finds "no action needed" better than previous options and emphasises the need to communicate with users to clarify some feedback.

Making the workflow more fluent

The proposed design allowed users to focus on the essence of their task: deciding whether posts were positive, neutral or negative. Removing the need to confirm their actions resulted in a more fluent workflow. Difficulties in classification were mainly due to problems understanding what the author of a post was trying to express, or to a lack of information, rather than to problems using the moderation tools. Participants commented on the need to communicate with feedback authors to clarify the situation or ask for more information.

The overall moderation experience is affected not only by how comfortable moderation is but also by the amount of feedback to moderate. Features that automatically reduce the amount of neutral and negative feedback should therefore also be applied for a successful moderation experience.

Better mapping user mental models for classification

The classification scheme that worked best was: Useful, Resolved, No action needed, and Inappropriate.

Useful. Both terms considered for positive feedback, "usable" and "useful", were clearly understood as feedback that can be used to improve the article.

Resolved. The use of "Done" was not initially clear and was sometimes confused with the neutral action (read as "nothing more to process"). "Resolved" better conveys the idea that an item requiring action was completed, which reduces the possible overlap with terms such as "no action needed".

No action needed. For feedback that is not useful for improving the article but is not abuse either, "No action needed" worked well. This kind of feedback can take many forms, but "no action needed" is neutral enough for classifying feedback that users verbalised while using the prototype as "irrelevant", "not useful", "useless", "nonsense", "I don't know what that means", or "I don't know what to do with it". This neutrality may make the term a bit unclear for some users initially, but because it is part of a scale of clear actions it was not a problem in practice for classifying feedback, and the addition of tooltips helped with the first-time learning experience. Alternatives such as "unusable" presented problems for classifying some feedback due to their negative bias.

Inappropriate. "Inappropriate" was not only well understood as identifying abusive posts but also helped users to better understand the scale of classification. One of the classification schemes used "unusable" to classify both negative and neutral feedback, which caused problems for classifying neutral feedback.

Increasing the classification coverage

Although users were not told to classify all the posts presented, the test environment led them to attempt most of the cases in order to complete the whole list. Only one user did not read the whole list. In every classification attempt, users were able to select a category from the ones proposed. The proposed classification scheme helped to classify comments that would otherwise have remained unclassified. Users were more prone to classify posts as "Useful" and "No action needed" than to use the equivalent previous actions ("Featured" and "hidden").

Reducing the number of unclassified comments is valuable not only for surfacing useful comments but also for reducing the number of times a post is exposed to moderators, which lowers the perceived number of posts.

Simplicity

Removing elements such as the readers' voting system and direct access to the post history did not cause any problems for users classifying posts. Removing non-essential tools helped users to clearly identify the entry point for moderation. The iconography used for the classification scheme did not cause confusion for users.