Research:Task recommendations/Qualitative evaluation of morelike
The purpose of this study is to determine whether the "morelike" feature in mw:Extension:CirrusSearch would be useful for identifying articles that are in a similar topic area.
- Base article: the article from which similar article recommendations are generated
- Similar article: an article returned by morelike for a base article
- morelike: a feature of mw:Extension:ElasticSearch that finds "more articles" that are "like" a base article
Sampling base articles
In order to get a sense for how morelike will work for recommending articles to newcomers, we wanted a representative sample of articles that newcomers are likely to land on -- and potentially edit -- after registering their account. To do this, we used the returnTo field in Schema:ServerSideAccountCreation. We gathered a random sample of returnTo articles in the article namespace for newly registered users during the seven-day period between 2014-07-16 and 2014-07-23.
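The sampling step can be sketched as follows. Only the returnTo field, the article-namespace restriction, and the date window come from the study; the event records and helper function here are illustrative stand-ins for the Schema:ServerSideAccountCreation data.

```python
import random

# Illustrative event records; only the returnTo field and the date
# window are taken from the study's description.
events = [
    {"timestamp": "20140717123045", "namespace": 0, "returnTo": "Coffee"},
    {"timestamp": "20140718090210", "namespace": 0, "returnTo": "Tea"},
    {"timestamp": "20140719110002", "namespace": 1, "returnTo": "Talk:Tea"},
    {"timestamp": "20140720154500", "namespace": 0, "returnTo": "Espresso"},
]

def sample_base_articles(events, start, end, n):
    """Randomly sample returnTo titles in the article namespace (ns 0)
    for accounts registered within [start, end)."""
    candidates = [
        e["returnTo"]
        for e in events
        if e["namespace"] == 0 and start <= e["timestamp"] < end
    ]
    return random.sample(candidates, min(n, len(candidates)))

base_articles = sample_base_articles(events, "20140716", "20140724", 2)
print(base_articles)
```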
Gathering similar articles
In order to replicate the proposed behavior of mw:Task_recommendations, we implemented the following filters:
- article_length > 0: Sanity check that it's not blank
- Filter out en:Category:Living people: no biographies of living people -- too difficult for newbies to edit without being reverted
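A minimal sketch of the two filters above, assuming each candidate article carries its length and category list (the field names are hypothetical stand-ins for data returned by the search API):

```python
def passes_filters(article):
    """Apply the recommendation filters: non-blank, and not a BLP.
    The 'length' and 'categories' fields are hypothetical."""
    if article["length"] <= 0:                        # sanity check: not blank
        return False
    if "Living people" in article["categories"]:      # no BLPs
        return False
    return True

candidates = [
    {"title": "Coffee", "length": 54321, "categories": ["Coffee", "Crops"]},
    {"title": "Empty page", "length": 0, "categories": []},
    {"title": "Some Person", "length": 1200, "categories": ["Living people"]},
]
similar = [a["title"] for a in candidates if passes_filters(a)]
print(similar)  # → ['Coffee']
```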
Subsamples and snippets
Hand-coding the top 50 similar articles for all 195 base articles (9,750 pairs) would be a lot to ask, so we subsampled further at this point, taking a random set of 5 similar articles per base article for a total of 975 (base article, similar article) pairs.
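The subsampling step can be sketched as below; the base and similar article names are placeholders, but the counts (195 base articles, top 50 similar each, 5 sampled per base) match the study.

```python
import random

random.seed(0)  # for reproducibility of this sketch

# Hypothetical morelike results: 50 ranked similar articles per base.
morelike_results = {
    f"Base {i}": [f"Similar {i}-{rank}" for rank in range(1, 51)]
    for i in range(1, 196)  # 195 base articles
}

pairs = []
for base, similar in morelike_results.items():
    for s in random.sample(similar, 5):  # 5 random picks from the top 50
        pairs.append((base, s))

print(len(pairs))  # 195 base articles * 5 = 975 pairs
```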
In order to aid hand-coding, we took a second pass over this dataset to generate snippets of relevant content. For similar articles, morelike returned a snippet. For base articles, we gathered the wiki markup of the first section, filtered out templates using mwparserfromhell, and saved the first 200 characters.
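The base-article snippet step can be approximated as follows. The study used mwparserfromhell for template stripping; to keep this sketch self-contained, a simple regex stands in for it here (it ignores nested templates, which mwparserfromhell handles properly).

```python
import re

def snippet(wikitext, length=200):
    """Return the first `length` characters of the first section,
    with top-level templates ({{...}}) dropped. Regex approximation
    of the mwparserfromhell-based step described in the text."""
    first_section = wikitext.split("\n==")[0]            # text before the first heading
    no_templates = re.sub(r"\{\{[^{}]*\}\}", "", first_section)
    return no_templates.strip()[:length]

text = "{{Infobox drink}}'''Coffee''' is a brewed drink.\n== History ==\n..."
print(snippet(text))  # → '''Coffee''' is a brewed drink.
```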
These (base article, similar article) pairs with snippets were split evenly between hand-coders, with a 54-item overlap to test for inter-rater reliability. Items were loaded into Google Spreadsheets and hand-coders were asked to rate each item as similar or not (1) based only on the titles and (2) after reviewing the text snippets. Hand-coders were volunteers from the Wikimedia Foundation staff: Halfak (WMF), Steven (WMF) and Maryana (WMF).
We performed a Fleiss' kappa test on the 54-item overlap to look for evidence of reliability and found very little agreement between assessments based on titles (kappa = 0.257) or text (kappa = 0.156), so the assessments of individual coders were examined separately.
Fleiss Kappa test outputs
> kappam.fleiss(overlap[,list(title.aaron>0, title.maryana>0, title.steven>0)])
 Fleiss' Kappa for m Raters

 Subjects = 54
   Raters = 3
    Kappa = 0.257

        z = 3.28
  p-value = 0.00105

> kappam.fleiss(overlap[,list(text.aaron>0, text.maryana>0, text.steven>0)])
 Fleiss' Kappa for m Raters

 Subjects = 54
   Raters = 3
    Kappa = 0.156

        z = 1.98
  p-value = 0.0477
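The statistic reported above can also be computed without the R irr package; a minimal Python implementation of Fleiss' kappa from its standard definition (the count matrix below is a made-up example, not the study's data):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a subjects-by-categories count matrix:
    ratings[i][j] = number of raters who put subject i in category j.
    Every row must sum to the same number of raters."""
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Per-subject agreement P_i and per-category proportions p_j.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_j = [
        sum(row[j] for row in ratings) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]

    p_bar = sum(p_i) / n_subjects    # observed agreement
    p_e = sum(p * p for p in p_j)    # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Perfect agreement among 3 raters on 3 binary-coded items:
print(fleiss_kappa([[3, 0], [0, 3], [3, 0]]))  # → 1.0
```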
Since our inter-rater reliability scores were low, we could not combine the codings of our three separate coders into a single dataset; instead, we analyze each coder separately. Since there was such a low number of observations per rank, we combine ranks into buckets of 5, labelling each bucket by its lowest rank. For example, ranks 1-5 appear in bucket 1 and ranks 6-10 appear in bucket 6.
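The rank-bucketing scheme (buckets of 5, each labelled by its lowest rank) can be written as:

```python
def rank_bucket(rank):
    """Map a rank to its bucket of 5, labelled by the lowest rank:
    ranks 1-5 -> bucket 1, ranks 6-10 -> bucket 6, ..., 46-50 -> bucket 46."""
    return ((rank - 1) // 5) * 5 + 1

print([rank_bucket(r) for r in (1, 5, 6, 10, 50)])  # → [1, 1, 6, 6, 46]
```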
#Similarity by rank suggests a clear, high rate of similarity once hand-coders reviewed the text of the base article and similar article. Even as rank approaches 50, about 75% of articles returned by morelike are considered similar by the hand-coders. That means that, if 3 articles are recommended by morelike between rank 45 and 50, there is only a 25% * 25% * 25% = 1.6% chance that none of them will actually be similar.
For the first 5 similar articles, raters found about 94% to be similar. That means there is only a 6% * 6% * 6% ≈ 0.02% chance that no similar article appears among 3 recommendations drawn from the top 5.
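The failure-probability arithmetic above assumes each of the 3 recommendations is independently similar with probability p, so the chance that none is similar is (1 - p)^3:

```python
def p_none_similar(p_similar, n_recommendations=3):
    """Probability that none of n independent recommendations is similar."""
    return (1 - p_similar) ** n_recommendations

# Ranks 45-50 (~75% similar) vs. the top ranks (~94% similar):
print(round(p_none_similar(0.75), 4))  # 0.25^3 = 0.0156 (~1.6%)
print(round(p_none_similar(0.94), 6))  # 0.06^3 = 0.000216 (~0.02%)
```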