Research:Voice and exit in a voluntary work environment/Elicit new editor interests
Upon signing up, new Wikipedia editors are confronted with a vast wealth of articles, without knowing what articles and topics exist or which articles are in need of contributions. In the framework of a recommendation system for Wikipedia editors, this is known as the user cold-start problem. One of the primary motivations for this project is that the gender and ethnicity distribution of active Wikipedia users is currently heavily skewed, and a system that guides newcomers through their first steps can help alleviate this problem by encouraging more of them to stay and contribute. This can be achieved through recommendations of articles and, more importantly, of users: pairing newcomers with veterans.
Currently, no such system exists in Wikipedia. Users who want to get past the cold-start phase must do so through a combination of getting to know other users and sub-communities and getting to know different topics, all without systematic guidance. This requires a level of determination and dedication that might not be present in every user. We believe that a bit of systematic hand-holding can go a long way towards encouraging users to stay within the system.
Our work aims to create a questionnaire that captures topical dichotomies based on article content and user preferences. The questionnaire will be presented to users upon signing up, and their answers will form an initial profile that allows us to recommend articles, and their prominent editors, to the newcomer. Newcomers can then pair up with those editors and get into editing quickly.
Adapting recommender systems to Wikipedia
Creating a recommendation system for Wikipedia poses two challenges in particular:
- Wikipedia editors do not "rate" articles; the only information available about their preferences is their editing history.
- As opposed to most systems, which recommend "consumable" items such as movies to watch, recommending Wikipedia articles to editors would be not for their consumption, but for their contribution; and while every movie can be watched, not every Wikipedia article needs contributions (and not every user will be qualified to contribute to a given article).
The former point prevents us from using most existing questionnaire methods, which are built for explicit-feedback systems such as Netflix. The only data we have available on users is their editing history, extracted from the revision history data. This data tells us who has edited which article and how often (along with additional information such as the size of the article after the edit and whether the edit was flagged as minor); if a user has a history of repeatedly editing an article, we therefore know that they are interested in that article. The problem of implicit feedback arises when we have few or no interactions between a user and an article. If interactions exist but are few in number, we cannot reliably say that the user dislikes the article, but we cannot be sure that they like it either. Non-existent interactions, on the other hand, are treated as unknown in explicit-feedback systems; if treated as unknown in an implicit-feedback system, the system would effectively only have 'like' and 'unknown', instead of 'like', 'dislike' and 'unknown'. This has prompted the existing literature on implicit-feedback systems to treat unknown data as dislikes, while incorporating a confidence value that increases with the number of interactions, thereby putting more emphasis on known and frequent interactions than on infrequent ones.
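The confidence-weighting scheme described above can be sketched as follows; the toy editing matrix and the scaling constant `alpha` are hypothetical choices for illustration, not values from our system:

```python
import numpy as np

# Hypothetical toy editing matrix: rows are users, columns are articles,
# entries are edit counts (0 means no observed interaction).
edits = np.array([
    [5, 0, 1],
    [0, 3, 0],
])

alpha = 40.0  # confidence scaling constant (an illustrative choice)

# Binary preference: any interaction is treated as a 'like',
# absence of interaction as a (low-confidence) 'dislike'.
preference = (edits > 0).astype(float)

# Confidence grows with the number of interactions, so frequent edits
# dominate the loss while unobserved pairs carry minimal weight.
confidence = 1.0 + alpha * edits
```

A factorisation model would then fit the `preference` matrix under a loss weighted element-wise by `confidence`, rather than fitting the raw counts directly.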
The latter problem can be solved by pairing newcomers with veterans who have significant experience in Wikipedia and expertise in the topic in question. This way, experienced users who know which articles require edits can point the newcomers towards them, in addition to assessing their qualifications.
In this section we will discuss our questionnaire generation method in brief, and we will elaborate on our recommendation and evaluation methods. A flowchart of the entire pipeline is displayed on the right.
The data we use consists of two parts:
- The content of Wikipedia articles.
- The revision history of Wikipedia articles.
The former is used to generate a bag-of-words representation of each article, which allows us to represent each article as a vector whose dimensions correspond to terms in the vocabulary, the value of each element being the count of the corresponding term in the article. This is the content-based element of our recommendation system.
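A minimal sketch of this representation, using two toy stand-ins for article texts (the real system operates on full Wikipedia articles):

```python
from collections import Counter

# Toy article texts; titles and contents are hypothetical.
articles = {
    "Cat": "cat cat mammal pet",
    "Dog": "dog mammal pet pet",
}

# Build a fixed vocabulary over all articles, then represent each article
# as a vector of raw term counts over that vocabulary.
vocab = sorted({term for text in articles.values() for term in text.split()})
term_index = {term: i for i, term in enumerate(vocab)}

def bag_of_words(text):
    vec = [0] * len(vocab)
    for term, count in Counter(text.split()).items():
        vec[term_index[term]] = count
    return vec

vectors = {title: bag_of_words(text) for title, text in articles.items()}
```

In practice the raw counts are reweighted by TF-IDF before topic extraction, which downweights terms that appear in nearly every article.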
The revision history data informs us of the editing history of all users. From the raw revision data, we extract a user-by-article 'editing matrix', each element of which is the number of times that user has edited that article. This is the collaborative element of our recommendation system, and it allows us to capture user preferences and detect groups of users with similar interests.
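Building the editing matrix from raw revisions can be sketched as follows; the revision log here is a hypothetical toy example:

```python
import numpy as np

# Hypothetical revision log: one (user, article) pair per edit.
revisions = [
    ("alice", "Cat"), ("alice", "Cat"), ("alice", "Dog"),
    ("bob", "Dog"),
]

users = sorted({u for u, _ in revisions})
articles = sorted({a for _, a in revisions})
u_idx = {u: i for i, u in enumerate(users)}
a_idx = {a: j for j, a in enumerate(articles)}

# Editing matrix: entry (i, j) counts how often user i edited article j.
editing_matrix = np.zeros((len(users), len(articles)), dtype=int)
for u, a in revisions:
    editing_matrix[u_idx[u], a_idx[a]] += 1
```

At Wikipedia scale this matrix is extremely sparse, so a sparse representation would be used in practice; the dense array here is only for clarity.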
Throughout this document, the words 'article' and 'document' both refer to Wikipedia articles, and may be used interchangeably.
Extracting the topics
Our basic idea is founded on Latent Semantic Analysis (LSA), a method that extracts topics from a set of documents by applying Singular Value Decomposition (SVD) to their bag-of-words (or TF-IDF) representation. Since we want to accommodate both the content of the articles and user preferences in our system, we use a joint topic extraction method that utilises both of these data sources. Our method is a matrix factorisation that attempts to factorise the editing matrix as the product of a user latent matrix and a document latent matrix, while keeping the document latent matrix close to the result of LSA on the content TF-IDF matrix. The desired output of this algorithm is a document latent matrix, each row of which is the representation of the corresponding article in our latent space. We consider each dimension of the latent space to be a 'topic', as determined by the distribution of words in documents and by the editing behaviour of users across different documents.
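The LSA step that initialises the document latent matrix can be sketched as below; the toy document-term matrix is hypothetical, and the full joint optimisation over the editing matrix is omitted:

```python
import numpy as np

# Toy TF-IDF-like document-term matrix: rows are articles, columns are terms.
# Rows 0-1 share one vocabulary block, rows 2-3 another.
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.8, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.2],
    [0.0, 0.0, 0.9, 1.0],
])

k = 2  # number of latent topics

# LSA: truncated SVD of the document-term matrix. Each row of doc_latent
# is the k-dimensional topic representation of one article; the joint
# method then refines these vectors while factorising the editing matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_latent = U[:, :k] * s[:k]
```

With this block-structured toy input, articles that share vocabulary end up with aligned latent vectors, while articles from different blocks are (near-)orthogonal in the latent space.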
In our tests, we will primarily consider two different systems: one using the original LSA on the document content matrix, and another using the joint method. In our offline tests we have also used an SVD on the editing matrix and a random recommender baseline.
To generate questions, we take each dimension of the latent space (i.e. each topic), and look at the top 20 and bottom 20 articles. The top 20 are the articles with the highest positive weights in that dimension, and the bottom 20 are articles with the highest (in terms of absolute value) negative weights. The question generated from this topic takes the form of "Which of these two sets of documents would you be more interested in editing?", with a 'neither' answer also being possible. The user's answers to these questions (which will be presented to the user as a list) will function as an initial profile for the user, based on which we will provide initial recommendations, which we will discuss in the next section.
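Question generation from a topic dimension can be sketched as follows; the titles and latent weights are hypothetical, and the set size is reduced from 20 to 2 for readability:

```python
import numpy as np

# Hypothetical document latent matrix: rows are articles, columns are topics.
titles = ["Cat", "Dog", "Car", "Train"]
doc_latent = np.array([
    [ 0.9, 0.1],
    [ 0.8, 0.0],
    [-0.7, 0.2],
    [-0.9, 0.1],
])

def topic_question(topic, n=2):
    """Return the n most positive and n most negative articles on a topic
    (the real system uses n = 20)."""
    order = np.argsort(doc_latent[:, topic])
    top = [titles[i] for i in order[::-1][:n]]      # highest positive weights
    bottom = [titles[i] for i in order[:n]]          # most negative weights
    return top, bottom

top, bottom = topic_question(0)
```

The two returned sets form the two sides of one question; the user picks the side they would rather edit, or 'neither'.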
We have two options for generating the recommendations. The simpler one is to use the article-space profile of the new user: selecting one of the sets in each question sets the entries of the user's initial article-space profile corresponding to the articles in the selected set to 1. The entries of articles that do not appear in any of the user's selected sets are zero. Using this profile, we can give the user recommendations through a nearest-neighbour search, finding other users whose article-space profiles most closely match this user's. We can then recommend those users, or recommend the articles most edited by the K nearest users.
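The article-space option can be sketched as follows; all users, question sets and profiles are hypothetical toy data:

```python
import numpy as np

# Articles and the sets shown in two hypothetical questions.
titles = ["Cat", "Dog", "Car", "Train"]
question_sets = {0: ["Cat", "Dog"], 1: ["Car", "Train"]}

# The newcomer picked the set of question 0 and answered 'neither' to question 1.
answers = [0]

# Article-space profile: 1 for every article in a selected set, 0 elsewhere.
profile = np.zeros(len(titles))
for q in answers:
    for title in question_sets[q]:
        profile[titles.index(title)] = 1.0

# Existing users' article-space profiles (rows), e.g. normalised edit counts.
user_profiles = np.array([
    [1.0, 1.0, 0.0, 0.0],   # a cat-and-dog editor
    [0.0, 0.0, 1.0, 1.0],   # a vehicle editor
])

# Nearest neighbour by cosine similarity.
sims = (user_profiles @ profile) / (
    np.linalg.norm(user_profiles, axis=1) * np.linalg.norm(profile)
)
nearest_user = int(np.argmax(sims))
```

The K most similar users (here just the single nearest one) can then be recommended directly, or their most edited articles recommended instead.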
The other option is to use the latent representations we have for the articles. Using the user's answers, we can construct a latent-space profile for this user, and we can calculate the most similar articles to this profile using the dot product. We can also calculate the dot product between this user's profile and other users' profiles, thereby directly recommending users. This option is superior, because our latent space captures connections that the original article space does not. For example, if a set in a question contains articles on several types of cars, a user who has edited articles about many other types of cars, but not the ones in that set, will not be considered similar to a newcomer who chooses that set, even though the two are in fact very similar. We therefore use the latent space for recommendations.
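Scoring articles against a latent-space profile is a single matrix-vector product; the latent vectors and the new user's profile below are hypothetical:

```python
import numpy as np

# Hypothetical latent vectors (2 topics) for four articles.
titles = ["Cat", "Dog", "Car", "Train"]
doc_latent = np.array([
    [ 0.9, 0.1],
    [ 0.8, 0.0],
    [-0.7, 0.2],
    [-0.9, 0.1],
])

# Latent profile built from the newcomer's questionnaire answers:
# they chose the positive ('animal') side of topic 0.
new_user = np.array([1.0, 0.0])

# Score every article by dot product with the user's latent profile
# and rank articles from most to least recommended.
scores = doc_latent @ new_user
ranking = [titles[i] for i in np.argsort(scores)[::-1]]
```

The same dot product against other users' latent profiles would yield direct user recommendations.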
According to our experiments, directly recommending users using the latent space profiles does not lead to satisfactory results. Recommending users directly has two issues:
- It is much easier for a user to see and 'feel' that we have captured their interests by looking at an article than by seeing a username and a list of that user's most (or most recently) edited articles, especially since, according to our tests, the most edited articles of the directly recommended users tend to differ from the answers given to the questions.
- Generally, the document latent vectors are of higher quality, because they are initialised to relatively good vectors (namely, the results of LSA on the content TF-IDF matrix). Since our objective is non-convex and may fall into local minima, the better-initialised vectors tend to end up of better quality.
Therefore, we recommend users indirectly, by recommending documents to the user, and then listing their top recent editors.
We use two types of evaluation: offline and online. The offline evaluation is performed by holding out a set of users, then holding out a set of their edits, simulating the question-answering process using their remaining edits, and then giving them recommendations. Evaluation is based on precision and recall of the recommendations, treating the held-out edits as 'positives' and everything else as 'negatives'. Obviously, the edits are a weak substitute for a real user's mental model, and therefore we have only used the offline evaluation as proof of concept. Our online evaluation will be the more realistic test of the system. A plot of the results of the offline evaluation (recall) is shown on the right.
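The precision and recall computation over held-out edits can be sketched as follows; the recommendation list, held-out set and cutoff `k` are hypothetical:

```python
# Offline evaluation sketch: held-out edited articles are the 'positives',
# and precision/recall are computed over the top-k recommendations.
def precision_recall_at_k(recommended, held_out, k):
    top_k = recommended[:k]
    hits = len(set(top_k) & set(held_out))
    precision = hits / k
    recall = hits / len(held_out)
    return precision, recall

# Hypothetical example: two of the user's three held-out articles
# appear in the top-4 recommendations.
p, r = precision_recall_at_k(
    ["Cat", "Dog", "Car", "Train"], ["Cat", "Train", "Plane"], k=4)
```

Everything outside the held-out set counts as a 'negative', which is pessimistic: an unedited article may still interest the user, which is one reason we treat this evaluation only as a proof of concept.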