Research:Voice and exit in a voluntary work environment/Elicit new editor interests
Upon signing up, new Wikipedia editors are confronted with a vast wealth of articles, without knowing what articles and topics exist or which articles are in need of contributions. In the framework of a recommendation system for Wikipedia editors, this is known as the user cold-start problem. One of the primary motivations for this project is that the gender and ethnicity distribution of active Wikipedia users is currently heavily skewed, and a system that guides newcomers through their first steps can help alleviate this problem by encouraging more of them to stay and contribute. This can be achieved through recommendations of articles and, more importantly, of users: pairing newcomers with veterans.
Currently, no such system exists in Wikipedia. Users who want to get past the cold-start phase must do so through a combination of getting to know other users and sub-communities and getting to know different topics, all without systematic guidance. This requires a level of determination and dedication that might not be present in every user. We believe that a bit of systematic hand-holding can go a long way towards encouraging users to stay within the system.
Our work aims to create a questionnaire that captures topical dichotomies based on article content and user preferences. The questionnaire will be presented to users upon signing up, and their answers will form an initial profile that allows us to recommend articles, and their prominent editors, to the newcomer. Newcomers can then pair up with those editors and get into editing quickly.
Adapting recommender systems to Wikipedia
Creating a recommendation system for Wikipedia poses two challenges in particular:
- Wikipedia editors do not "rate" articles; the only information available about their preferences is their editing history.
- As opposed to most systems, which recommend "consumable" items such as movies to watch, recommending Wikipedia articles to editors would be not for their consumption, but for their contribution; and while every movie can be watched, not every Wikipedia article needs contributions (and not every user will be qualified to contribute to a given article).
The former point prevents us from using most existing questionnaire methods, which are built for explicit-feedback systems such as Netflix. The only data we have available on users is their editing history, extracted from the revision history data. This data tells us who has edited which article and how often (along with additional information such as the size of the article after the edit and whether the edit was flagged as minor); if a user has a history of repeatedly editing an article, we therefore know that they are interested in that article. The problem of implicit feedback arises when we have few or no interactions between a user and an article. If interactions exist but are few in number, we cannot reliably say that the user dislikes the article, but we cannot be sure that they like it either. Non-existent interactions, on the other hand, are treated as unknown in explicit-feedback systems; if treated as unknown in an implicit-feedback system, the system would effectively only have 'like' and 'unknown', instead of 'like', 'dislike' and 'unknown'. This has prompted the existing literature on implicit-feedback systems to treat unknown data as dislikes, while incorporating a confidence value that increases with the number of interactions, thereby putting more emphasis on known and frequent interactions than on infrequent ones.
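The confidence-weighting scheme described above can be sketched as follows; the toy editing matrix and the scaling constant `alpha` are hypothetical choices for illustration, not values from our system:

```python
import numpy as np

# Hypothetical toy editing matrix: rows are users, columns are articles,
# entries are edit counts (0 means no observed interaction).
edits = np.array([
    [5, 0, 1],
    [0, 3, 0],
])

alpha = 40.0  # confidence scaling constant (an illustrative choice)

# Binary preference: any interaction is treated as a 'like',
# absence of interaction as a (low-confidence) 'dislike'.
preference = (edits > 0).astype(float)

# Confidence grows with the number of interactions, so frequent edits
# dominate the loss while unobserved pairs carry minimal weight.
confidence = 1.0 + alpha * edits
```

A factorisation model would then fit the `preference` matrix under a loss weighted element-wise by `confidence`, rather than fitting the raw counts directly.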
The latter problem can be solved by pairing newcomers with veterans who have significant experience in Wikipedia and expertise in the topic in question. This way, experienced users who know which articles require edits can point the newcomers towards them, in addition to assessing their qualifications.
In this section we will discuss our questionnaire generation method in brief, and we will elaborate on our recommendation and evaluation methods. A flowchart of the entire pipeline is displayed on the right.
The data we use consists of two parts:
- The content of Wikipedia articles.
- The revision history of Wikipedia articles.
The former is used to generate a bag-of-words representation of each article, which allows us to represent each article as a vector whose dimensions correspond to terms in the vocabulary, the value of each element being the count of the corresponding term in the article. This is the content-based element of our recommendation system.
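A minimal sketch of this representation, using two toy stand-ins for article texts (the real system operates on full Wikipedia articles):

```python
from collections import Counter

# Toy article texts; titles and contents are hypothetical.
articles = {
    "Cat": "cat cat mammal pet",
    "Dog": "dog mammal pet pet",
}

# Build a fixed vocabulary over all articles, then represent each article
# as a vector of raw term counts over that vocabulary.
vocab = sorted({term for text in articles.values() for term in text.split()})
term_index = {term: i for i, term in enumerate(vocab)}

def bag_of_words(text):
    vec = [0] * len(vocab)
    for term, count in Counter(text.split()).items():
        vec[term_index[term]] = count
    return vec

vectors = {title: bag_of_words(text) for title, text in articles.items()}
```

In practice the raw counts are reweighted by TF-IDF before topic extraction, which downweights terms that appear in nearly every article.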
The revision history data informs us of the editing history of all users. From the raw revision data, we extract a user-by-article 'editing matrix', each element of which is the number of times that user has edited that article. This is the collaborative element of our recommendation system, and it allows us to capture user preferences and detect groups of users with similar interests.
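Building the editing matrix from raw revisions can be sketched as follows; the revision log here is a hypothetical toy example:

```python
import numpy as np

# Hypothetical revision log: one (user, article) pair per edit.
revisions = [
    ("alice", "Cat"), ("alice", "Cat"), ("alice", "Dog"),
    ("bob", "Dog"),
]

users = sorted({u for u, _ in revisions})
articles = sorted({a for _, a in revisions})
u_idx = {u: i for i, u in enumerate(users)}
a_idx = {a: j for j, a in enumerate(articles)}

# Editing matrix: entry (i, j) counts how often user i edited article j.
editing_matrix = np.zeros((len(users), len(articles)), dtype=int)
for u, a in revisions:
    editing_matrix[u_idx[u], a_idx[a]] += 1
```

At Wikipedia scale this matrix is extremely sparse, so a sparse representation would be used in practice; the dense array here is only for clarity.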
Throughout this document, the words 'article' and 'document' both refer to Wikipedia articles, and may be used interchangeably.
Extracting the topics
Our basic idea is founded on Latent Semantic Analysis (LSA), a method that extracts topics from a set of documents by applying Singular Value Decomposition (SVD) to their bag-of-words (or TF-IDF) representation. Since we want to accommodate both the content of the articles and user preferences in our system, we use a joint topic extraction method that utilises both of these data sources. Our method is a matrix factorisation that attempts to factorise the editing matrix as the product of a user latent matrix and a document latent matrix, while keeping the document latent matrix close to the result of LSA on the content TF-IDF matrix. The desired output of this algorithm is a document latent matrix, each row of which is the representation of the corresponding article in our latent space. We consider each dimension of the latent space to be a 'topic', as determined by the distribution of words in documents and by the editing behaviour of users across different documents.
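The LSA step that initialises the document latent matrix can be sketched as below; the toy document-term matrix is hypothetical, and the full joint optimisation over the editing matrix is omitted:

```python
import numpy as np

# Toy TF-IDF-like document-term matrix: rows are articles, columns are terms.
# Rows 0-1 share one vocabulary block, rows 2-3 another.
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.8, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.2],
    [0.0, 0.0, 0.9, 1.0],
])

k = 2  # number of latent topics

# LSA: truncated SVD of the document-term matrix. Each row of doc_latent
# is the k-dimensional topic representation of one article; the joint
# method then refines these vectors while factorising the editing matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_latent = U[:, :k] * s[:k]
```

With this block-structured toy input, articles that share vocabulary end up with aligned latent vectors, while articles from different blocks are (near-)orthogonal in the latent space.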
In our tests, we will primarily consider two different systems: one using the original LSA on the document content matrix, and another using the joint method. In our offline tests we have also used an SVD on the editing matrix and a random recommender baseline.
To generate questions, we take each dimension of the latent space (i.e. each topic), and look at the top 20 and bottom 20 articles. The top 20 are the articles with the highest positive weights in that dimension, and the bottom 20 are articles with the highest (in terms of absolute value) negative weights. The question generated from this topic takes the form of "Which of these two sets of documents would you be more interested in editing?", with a 'neither' answer also being possible. The user's answers to these questions (which will be presented to the user as a list) will function as an initial profile for the user, based on which we will provide initial recommendations, which we will discuss in the next section.
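Question generation from a topic dimension can be sketched as follows; the titles and latent weights are hypothetical, and the set size is reduced from 20 to 2 for readability:

```python
import numpy as np

# Hypothetical document latent matrix: rows are articles, columns are topics.
titles = ["Cat", "Dog", "Car", "Train"]
doc_latent = np.array([
    [ 0.9, 0.1],
    [ 0.8, 0.0],
    [-0.7, 0.2],
    [-0.9, 0.1],
])

def topic_question(topic, n=2):
    """Return the n most positive and n most negative articles on a topic
    (the real system uses n = 20)."""
    order = np.argsort(doc_latent[:, topic])
    top = [titles[i] for i in order[::-1][:n]]      # highest positive weights
    bottom = [titles[i] for i in order[:n]]          # most negative weights
    return top, bottom

top, bottom = topic_question(0)
```

The two returned sets form the two sides of one question; the user picks the side they would rather edit, or 'neither'.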
We have two options for generating the recommendations. The simpler one is to use the article-space profile of the new user: selecting one of the sets in each question sets the entries of the user's initial article-space profile corresponding to the articles in the selected set to 1. The entries of articles that do not appear in any of the user's selected sets are zero. Using this profile, we can give the user recommendations through a nearest-neighbour search, finding other users whose article-space profiles most closely match this user's. We can then recommend those users, or recommend the articles most edited by the K nearest users.
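The article-space option can be sketched as follows; all users, question sets and profiles are hypothetical toy data:

```python
import numpy as np

# Articles and the sets shown in two hypothetical questions.
titles = ["Cat", "Dog", "Car", "Train"]
question_sets = {0: ["Cat", "Dog"], 1: ["Car", "Train"]}

# The newcomer picked the set of question 0 and answered 'neither' to question 1.
answers = [0]

# Article-space profile: 1 for every article in a selected set, 0 elsewhere.
profile = np.zeros(len(titles))
for q in answers:
    for title in question_sets[q]:
        profile[titles.index(title)] = 1.0

# Existing users' article-space profiles (rows), e.g. normalised edit counts.
user_profiles = np.array([
    [1.0, 1.0, 0.0, 0.0],   # a cat-and-dog editor
    [0.0, 0.0, 1.0, 1.0],   # a vehicle editor
])

# Nearest neighbour by cosine similarity.
sims = (user_profiles @ profile) / (
    np.linalg.norm(user_profiles, axis=1) * np.linalg.norm(profile)
)
nearest_user = int(np.argmax(sims))
```

The K most similar users (here just the single nearest one) can then be recommended directly, or their most edited articles recommended instead.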
The other option is to use the latent representations we have for the articles. Using the user's answers, we can construct a latent-space profile for this user, and we can calculate the most similar articles to this profile using the dot product. We can also calculate the dot product between this user's profile and other users' profiles, thereby directly recommending users. This option is superior, because our latent space captures connections that the original article space does not. For example, if a set in a question contains articles on several types of cars, a user who has edited articles about many other types of cars, but not the ones in that set, will not be considered similar to a newcomer who chooses that set, even though the two are in fact very similar. We therefore use the latent space for recommendations.
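Scoring articles against a latent-space profile is a single matrix-vector product; the latent vectors and the new user's profile below are hypothetical:

```python
import numpy as np

# Hypothetical latent vectors (2 topics) for four articles.
titles = ["Cat", "Dog", "Car", "Train"]
doc_latent = np.array([
    [ 0.9, 0.1],
    [ 0.8, 0.0],
    [-0.7, 0.2],
    [-0.9, 0.1],
])

# Latent profile built from the newcomer's questionnaire answers:
# they chose the positive ('animal') side of topic 0.
new_user = np.array([1.0, 0.0])

# Score every article by dot product with the user's latent profile
# and rank articles from most to least recommended.
scores = doc_latent @ new_user
ranking = [titles[i] for i in np.argsort(scores)[::-1]]
```

The same dot product against other users' latent profiles would yield direct user recommendations.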
According to our experiments, directly recommending users using the latent space profiles does not lead to satisfactory results. Recommending users directly has two issues:
- It is much easier for a user to see and 'feel' that we have captured their interests by looking at an article than by seeing a username and a list of that user's most (or most recently) edited articles, especially since, according to our tests, the most edited articles of the directly recommended users tend to differ from the answers given to the questions.
- Generally, the document latent vectors are of higher quality, because they are initialised to relatively good vectors (namely, the results of LSA on the content TF-IDF matrix). Since our objective is non-convex and may fall into local minima, the better-initialised vectors tend to end up of better quality.
Therefore, we recommend users indirectly, by recommending documents to the user, and then listing their top recent editors.
We use two types of evaluation: offline and online. The offline evaluation is performed by holding out a set of users, then holding out a set of their edits, simulating the question-answering process using their remaining edits, and then giving them recommendations. Evaluation is based on precision and recall of the recommendations, treating the held-out edits as 'positives' and everything else as 'negatives'. Obviously, the edits are a weak substitute for a real user's mental model, and therefore we have only used the offline evaluation as proof of concept. Our online evaluation will be the more realistic test of the system. A plot of the results of the offline evaluation (recall) is shown on the right.
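The precision and recall computation over held-out edits can be sketched as follows; the recommendation list, held-out set and cutoff `k` are hypothetical:

```python
# Offline evaluation sketch: held-out edited articles are the 'positives',
# and precision/recall are computed over the top-k recommendations.
def precision_recall_at_k(recommended, held_out, k):
    top_k = recommended[:k]
    hits = len(set(top_k) & set(held_out))
    precision = hits / k
    recall = hits / len(held_out)
    return precision, recall

# Hypothetical example: two of the user's three held-out articles
# appear in the top-4 recommendations.
p, r = precision_recall_at_k(
    ["Cat", "Dog", "Car", "Train"], ["Cat", "Train", "Plane"], k=4)
```

Everything outside the held-out set counts as a 'negative', which is pessimistic: an unedited article may still interest the user, which is one reason we treat this evaluation only as a proof of concept.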