Research talk:Automated classification of article quality/Work log/2016-03-25


Friday, March 25, 2016

Today, I'm trying to find out whether some modifications that User:Nettrom and I made to the label extraction process improve the fitness of our English Wikipedia model.

So, if I use Nettrom's dataset, the one he extracted for [1], then I get ~60% accuracy. But if I use the dataset from my extractor, I get 54% accuracy.

After getting on a call on Monday, we worked out that Morten used the state of the article the *first* time that it was classified in a particular way. Let's look at a theoretical example. Here are a few pretend edits to en:Waffle:

  1. Oct 2015, true_article_quality=C, ratings=(WikiProject_Breakfast:C)
  2. Nov 2015, true_article_quality=C+, ratings=(WikiProject_Breakfast:C, WikiProject_Food_and_drink:C)
  3. Dec 2015, true_article_quality=B, ratings=(WikiProject_Breakfast:B, WikiProject_Food_and_drink:C)
  4. Jan 2016, true_article_quality=GA-, ratings=(WikiProject_Breakfast:B, WikiProject_Food_and_drink:B)

Using my old strategy, I would include an observation in the set for every new project/article/quality_class triplet. So, I'd have observations for (Oct 2015, WikiProject_Breakfast:C), (Nov 2015, WikiProject_Food_and_drink:C), (Dec 2015, WikiProject_Breakfast:B), and (Jan 2016, WikiProject_Food_and_drink:B). We suspect that a big part of the lost accuracy comes from the fact that (Nov 2015, WikiProject_Food_and_drink:C) corresponds to a C+ article quality and (Jan 2016, WikiProject_Food_and_drink:B) corresponds to a GA- article quality -- confusing the model and making testing look bad.

So, we've changed the extractor so that it now only includes an observation when an assessment changes the article's quality class away from the last-applied one -- in other words, the first time a new quality class is applied to the article. In the example above, that would limit our observations to (Oct 2015, WikiProject_Breakfast:C) and (Dec 2015, WikiProject_Breakfast:B), since those were the assessments that first introduced each quality class. This removes the (theoretically) problematic observations from the dataset and should help us train and test better.
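
To make the difference concrete, here's a minimal sketch of the two strategies applied to the pretend en:Waffle history above. This is illustrative Python, not the actual extractor code -- the data structure and function names are mine.

HISTORY = [
    # (timestamp, {project: assessed_class}) for the pretend en:Waffle edits
    ("2015-10", {"WikiProject_Breakfast": "C"}),
    ("2015-11", {"WikiProject_Breakfast": "C", "WikiProject_Food_and_drink": "C"}),
    ("2015-12", {"WikiProject_Breakfast": "B", "WikiProject_Food_and_drink": "C"}),
    ("2016-01", {"WikiProject_Breakfast": "B", "WikiProject_Food_and_drink": "B"}),
]

def old_strategy(history):
    """Emit an observation for every new (project, quality_class) pair."""
    seen = set()
    for timestamp, ratings in history:
        for project, quality_class in ratings.items():
            if (project, quality_class) not in seen:
                seen.add((project, quality_class))
                yield (timestamp, project, quality_class)

def new_strategy(history):
    """Emit an observation only the first time a quality class is applied
    to the article, by whichever project applied it first."""
    seen_classes = set()
    for timestamp, ratings in history:
        for project, quality_class in ratings.items():
            if quality_class not in seen_classes:
                seen_classes.add(quality_class)
                yield (timestamp, project, quality_class)

print(list(old_strategy(HISTORY)))  # 4 observations, incl. the two mismatched ones
print(list(new_strategy(HISTORY)))  # just (2015-10, ..., "C") and (2015-12, ..., "B")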

Here's my new dataset:

$ wc enwiki.observations.first_labelings.20160204.json
  5481779  54747936 607000580 enwiki.observations.first_labelings.20160204.json
$ cat enwiki.observations.first_labelings.20160204.json | grep '"a"' | wc
   8042   81132  864339
$ cat enwiki.observations.first_labelings.20160204.json | grep '"fa"' | wc
  15442  159230 1687140
$ cat enwiki.observations.first_labelings.20160204.json | grep '"ga"' | wc
  60564  627865 6628472
$ cat enwiki.observations.first_labelings.20160204.json | grep '"b"' | wc
 201920 2061693 21895177
$ cat enwiki.observations.first_labelings.20160204.json | grep '"c"' | wc
 306668 3124785 33202345
$ cat enwiki.observations.first_labelings.20160204.json | grep '"start"' | wc
1681712 17104298 189119614
$ cat enwiki.observations.first_labelings.20160204.json | grep '"stub"' | wc
3207585 31590444 353618867

OK. Now for some quick spot-checking.

$ cat enwiki.observations.first_labelings.20160204.json | grep '"ga"' | head
{"project": "biography", "label": "ga", "timestamp": "20071205084657", "page_title": "Caligula"}
{"project": "milhist", "label": "ga", "timestamp": "20150128113516", "page_title": "Caligula"}
{"project": "politics", "label": "ga", "timestamp": "20090723230422", "page_title": "Caligula"}
{"project": "lgbt studies", "label": "ga", "timestamp": "20101019092951", "page_title": "Caligula"}
{"project": "technology", "label": "ga", "timestamp": "20131012212120", "page_title": "Fat Man"}
{"project": "military history", "label": "ga", "timestamp": "20131123125448", "page_title": "Fat Man"}
{"project": "aviation", "label": "ga", "timestamp": "20131012212120", "page_title": "Fat Man"}
{"project": "1.0", "label": "ga", "timestamp": "20131012212120", "page_title": "Fat Man"}
{"project": "united states", "label": "ga", "timestamp": "20131012212120", "page_title": "Fat Man"}
{"project": "engineering", "label": "ga", "timestamp": "20131012212120", "page_title": "Fat Man"}

Darn it. It looks like I must have made some mistake during the process. We can see many "ga" assessments of en:Fat Man and even one duplicate "ga" assessment by WikiProject_Military_history at 20131123125448. Back to the extractor code to see what's going on. :/ --EpochFail (talk) 14:29, 25 March 2016 (UTC)
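
Before digging into the extractor, one rough way to confirm the scale of the problem is to count how often each (page_title, label) pair occurs -- assuming the intended strategy should produce at most one observation per quality class per article, any pair appearing more than once is suspect. A sketch under that assumption, using the same line format as above:

import json
from collections import Counter

# Tally how many observations each (page_title, label) pair has.
pair_counts = Counter()
with open("enwiki.observations.first_labelings.20160204.json") as f:
    for line in f:
        observation = json.loads(line)
        pair_counts[(observation["page_title"], observation["label"])] += 1

# Pairs with more than one observation shouldn't exist under the new strategy.
duplicates = {pair: n for pair, n in pair_counts.items() if n > 1}
print(len(duplicates), "page/label pairs with more than one observation")
for (page_title, label), n in sorted(duplicates.items(), key=lambda kv: -kv[1])[:10]:
    print(page_title, label, n)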

  1. Warncke-Wang, M., Ayukaev, V. R., Hecht, B., and Terveen, L. "The Success and Failure of Quality Improvement Projects in Peer Production Communities". In the proceedings of CSCW 2015.