User:EpochFail/Journal/2011-07-25

Monday, July 25th

It looks like the solution was to just re-run the aggregation process and index the new table. Right now, I've got most of the indexes on the table and I'm going to try to produce Fabian's dataset today. I'm also hoping to finish up my sprint on first edit sessions today. I might need to run a couple more plots before I can package up my results. --Ahalfaker 16:16, 25 July 2011 (UTC)

Tuesday, July 26th

I'm starting to get a little lax on my journaling. I've been skipping around between projects. Right now I've got 3 fronts:

  1. Check Fabian's work: I'm trying to add a column to the rev_len_changed table so that I can set up my massive group-by index, but it is slow going. This is really frustrating me because I planned to be ready next week.
  2. Gather data about hugglings: I'm writing a script to track the huggle messages that were posted for the experiment that Stu and I are running. I've got some of the hard parts finished; specifically, I can use the Wiki-diff system to pull out our experimental templates. I'm just putting the parts together now, and I think I'll be able to run a test by the end of today.
  3. Test my conclusions about rejection: This is in preparation for user modeling and just to make sure I know what I am talking about. I think the reason that new editors are being rejected more is that they are editing a more complete encyclopedia. That should mean they are editing more complete pages on average, and that should be a predictor of rejection. It turns out this is exactly what my first sprint showed, but I still need to build a solid case around it. This task is longer running and will require a bit of writing. For right now, I am running a script that re-aggregates my user sample with the average length of pages that users edited when they first came to the site (a rough sketch of that aggregation is below).
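For my own reference, the re-aggregation in item 3 looks roughly like the sketch below. This is not the actual script: the input file, the column layout, and the one-hour session cutoff are all stand-ins.

import csv
from collections import defaultdict

SESSION_CUTOFF = 60 * 60  # assumed inter-edit gap (seconds) that ends a "session"

first_session_lens = defaultdict(list)  # user_id -> page lengths edited in session 1
last_edit = {}                          # user_id -> timestamp of the last edit seen

# Assumed input: one row per revision (user_id, unix timestamp, page length),
# sorted by user and then by timestamp.
with open("user_sample_revisions.tsv") as f:
    for user_id, timestamp, page_len in csv.reader(f, delimiter="\t"):
        timestamp, page_len = int(timestamp), int(page_len)
        if user_id in last_edit and timestamp - last_edit[user_id] > SESSION_CUTOFF:
            continue  # past the first session, so ignore the rest of this user's edits
        first_session_lens[user_id].append(page_len)
        last_edit[user_id] = timestamp

for user_id, lens in first_session_lens.items():
    print(user_id, sum(lens) / float(len(lens)))  # average first-session page length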

This is all keeping me pretty busy minute to minute so I'm not looking for more tasks, although I am interested in the WikiProject work that Shawn and Jonathan are working on. I'll talk to them if I suddenly find I have free time. --Ahalfaker 22:58, 26 July 2011 (UTC)

I just finished writing the script that uses the Wikipedia API's diffing system to gather the talk page posts for the huggle experiment. I still have to add a bit of code to deal with random gateway errors when contacting Wikipedia, though :(. They tend to happen about once per 1,000 calls. --Ahalfaker 16:17, 27 July 2011 (UTC)
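The fix I have in mind is just a retry wrapper around the API call. A rough sketch follows; the endpoint parameters here are an example diff request, not exactly what my script sends.

import time
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def api_get(params, retries=5, backoff=2.0):
    # GET from the API, retrying on 5xx gateway errors and connection hiccups.
    for attempt in range(retries):
        try:
            resp = requests.get(API_URL, params=params, timeout=30)
            if resp.status_code >= 500:  # 502/503/504 gateway errors
                raise IOError("HTTP %d" % resp.status_code)
            return resp.json()
        except (IOError, requests.RequestException):
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))  # back off a little more each time

# Example: ask for the diff of a revision against its predecessor.
doc = api_get({"action": "query", "prop": "revisions", "revids": "12345",
               "rvdiffto": "prev", "format": "json"})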

Wednesday, July 27th

I just finished another regression over non-vandal new editors in my sample. This time, I have included the length of the pages they were editing when they saved their edits. I've also made this one a little easier to read by changing the independent variable names. All independent variables have been scaled by their standard deviation so that their coefficients can be compared.

  • investment: The number of edits performed in the first edit session
  • rejection: The proportion of edits in the first three edit sessions that were deleted or reverted
  • page_len: The average length of articles edited by an editor in their first session
  • year: Number of years since 2001 (decimal to capture sub-year effects)


Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.4909  -1.0986   0.6339   0.9081   1.7906  

Coefficients:
                                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)                         0.662354   0.018791  35.248  < 2e-16 ***
investment                          0.392645   0.039513   9.937  < 2e-16 ***
page_len                           -0.058795   0.028392  -2.071  0.03838 *  
year                               -0.467640   0.018902 -24.740  < 2e-16 ***
rejection                          -0.476950   0.020222 -23.586  < 2e-16 ***
investment:page_len                -0.009701   0.059504  -0.163  0.87049    
investment:year                     0.018595   0.038824   0.479  0.63197    
page_len:year                       0.002801   0.030128   0.093  0.92594    
investment:rejection               -0.071488   0.037836  -1.889  0.05884 .  
page_len:rejection                  0.051593   0.031682   1.628  0.10343    
year:rejection                      0.091330   0.020459   4.464 8.04e-06 ***
investment:page_len:year           -0.106218   0.070901  -1.498  0.13410    
investment:page_len:rejection      -0.013098   0.038033  -0.344  0.73056    
investment:year:rejection           0.124372   0.041041   3.030  0.00244 ** 
page_len:year:rejection            -0.038006   0.035750  -1.063  0.28774    
investment:page_len:year:rejection -0.018580   0.081368  -0.228  0.81938    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 22199  on 17037  degrees of freedom
Residual deviance: 20060  on 17022  degrees of freedom
AIC: 20092

Number of Fisher Scoring iterations: 5
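For the record, the summary above is standard R glm output for a binomial model. A hedged sketch of the same kind of fit in Python/statsmodels is below: predictors divided by their standard deviation, then the full four-way interaction formula. The synthetic data is only there so the snippet runs; the column names mirror the variables listed above.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.RandomState(0)
n = 1000
df = pd.DataFrame({
    "investment": rng.poisson(5, n),         # edits in the first session
    "rejection":  rng.beta(1, 4, n),         # proportion of early edits reverted/deleted
    "page_len":   rng.exponential(5000, n),  # mean length of pages edited in session 1
    "year":       rng.uniform(0, 10, n),     # years since 2001
    "surviving":  rng.binomial(1, 0.6, n),   # whether the editor survived
})

# Scale predictors by their standard deviation so coefficients are comparable.
for col in ("investment", "rejection", "page_len", "year"):
    df[col] = df[col] / df[col].std()

model = smf.glm(
    "surviving ~ investment * page_len * year * rejection",
    data=df,
    family=sm.families.Binomial(),
).fit()
print(model.summary())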

I'm surprised about a few things in the results of this regression.

  • The effect of initial investment is still substantial, but it has been discounted.
  • The length of pages edited has a significant negative effect on survival, but is not as substantial as I expected.
  • There is still a strong temporal effect based on year. In other words, there is still a lot that has been changing over time that this model only captures with this (stupid) term.
  • Rejection is still a very strong predictor of survival. Although it is slightly stronger than the temporal effect (year), the difference is small enough that it could just be the luck of the sample.
  • Rejection still has more of an effect on editors who are more heavily invested.
  • Rejection still has less of an effect on survival than it used to.
  • Rejection of heavily invested editors is less of a big deal than it used to be.

It's occurred to me that page_len could be stupid. Since I'm looking at *all* pages edited, I could be pulling in the page_len of WP:HD and other long talk pages. I'm currently looking at my script to find a way to break these edits out. --Ahalfaker 18:17, 27 July 2011 (UTC)

I just re-wrote the script to aggregate session activity by main vs. other namespaces. I'd need to add a lot of columns/rows to account for the mass of other namespaces, so this will have to do for now. If I'm going to learn something interesting, I'll find it in this dataset. I'm also waiting on MySQL to get Fabian's dataset, and I've finished getting all the experimental hugglings for the last 8 days. --EpochFail 20:17, 27 July 2011 (UTC)
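The split itself is simple. Roughly the sketch below, though the input layout and file names are stand-ins, and the real script aggregates more measures than edit counts and bytes changed.

import csv
from collections import defaultdict

MAIN_NS = 0  # the article namespace in MediaWiki

# (user_id, "main" or "other") -> [edit_count, bytes_changed]
aggregates = defaultdict(lambda: [0, 0])

with open("first_session_revisions.tsv") as f:
    for user_id, namespace, len_changed in csv.reader(f, delimiter="\t"):
        bucket = "main" if int(namespace) == MAIN_NS else "other"
        aggregates[(user_id, bucket)][0] += 1
        aggregates[(user_id, bucket)][1] += int(len_changed)

with open("session_activity_by_ns.tsv", "w") as out:
    writer = csv.writer(out, delimiter="\t")
    for (user_id, bucket), (edits, bytes_changed) in aggregates.items():
        writer.writerow([user_id, bucket, edits, bytes_changed])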

Since I am waiting on queries and aggregation, I created a template and reformatted the WSOR11 page to be a bit more inviting, but possibly less scroll-bar friendly.

I've mostly been waiting on getting the damn namespace column in my rev_len_changed table so I can group over it. Every time I do an update on the table, it locks up replication and pauses logging for the huggle experiment (!!!). I've now created a duplicate page table and I'm working with that instead. Hopefully replication doesn't get blocked again. --EpochFail 22:59, 27 July 2011 (UTC)

Thursday, July 28th

My update still hasn't finished! I'm just trying to add a column for the page namespace to my rev_len_changed table and it has been running for 24 hours. This is a simple inner join with a primary key and it should be completed in a matter of minutes... or at worst, hours. This is ridiculous. I'm going to produce Fabian's dataset in python immediately and then aggregate the whole table in python for indexing. --EpochFail 16:39, 28 July 2011 (UTC)

I just switched back to my own laptop because I finally got it back from Lenovo, tested it, and upgraded the memory. Also, the loaner crashed on me, so it seemed like a good time. Now that I've gotten back to my running server processes, I see that my join between revision and page is creating a temp table on disk. So, that's not going to work.  :( I guess I'm going to re-write this bastard to do the join the way it should have been done... by using the primary key on page and the foreign key on revision (rev_page). I expect a runtime of less than 24 hours. I should *not* be able to beat MySQL at its own game, but I'm going to annihilate it by an order of magnitude or more. --EpochFail 18:53, 28 July 2011 (UTC)
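The plan, roughly, looks like the sketch below: walk the page table in primary key order and pull revisions per page via rev_page, instead of letting MySQL build an on-disk temp table for the join. Connection details, the batch size, and the output layout are placeholders.

import MySQLdb

conn = MySQLdb.connect(read_default_file="~/.my.cnf", db="enwiki")
pages = conn.cursor()
revs = conn.cursor()

out = open("rev_len_changed_with_ns.tsv", "w")

last_page_id = 0
while True:
    # Batch through pages in primary key order.
    pages.execute(
        "SELECT page_id, page_namespace FROM page "
        "WHERE page_id > %s ORDER BY page_id LIMIT 1000",
        (last_page_id,),
    )
    rows = pages.fetchall()
    if not rows:
        break
    for page_id, namespace in rows:
        # rev_page is an indexed foreign key, so this lookup is cheap.
        revs.execute(
            "SELECT rev_id, rev_len FROM revision WHERE rev_page = %s",
            (page_id,),
        )
        for rev_id, rev_len in revs.fetchall():
            out.write("%d\t%d\t%d\t%s\n" % (rev_id, page_id, namespace, rev_len))
        last_page_id = page_id

out.close()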

I just kicked off the re-aggregation that is getting page namespace into the table. It's as fast as expected. :) I'm running through pages in order and then querying for revisions. It turns out that is very fast. I'm already through 20 million revisions. --EpochFail 20:31, 28 July 2011 (UTC)

I spent some time looking through the hugglings dataset. It looks like there's only a tiny bit of weirdness (woot!) with editors getting more than one 1st-level warning posted on their talk page. In those cases, I think it will be reasonable to simply ignore that the second message occurred; it's the first warning that we want. Most importantly, we need to make sure that the warnings posted are the first warning on their talk pages. This could be difficult since anon talk pages are weird. I think I'll limit it to the first post within 32 hours, since Huggle uses the same wait time to reset its warnings. --EpochFail 23:15, 28 July 2011 (UTC)
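The rule I have in mind is something like this sketch, which assumes the hugglings come in as (talk_page, timestamp, warning_level) tuples sorted by timestamp; the real dataset has more fields.

RESET_WINDOW = 32 * 60 * 60  # Huggle's warning reset window, in seconds

def first_warnings(hugglings):
    # Yield only warnings with no prior post to the same talk page within 32 hours.
    last_post = {}  # talk_page -> timestamp of the most recent post seen
    for talk_page, timestamp, level in hugglings:
        prior = last_post.get(talk_page)
        if prior is None or timestamp - prior > RESET_WINDOW:
            yield (talk_page, timestamp, level)
        last_post[talk_page] = timestamp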

I just finished loading the hugglings into a table. MySQL doesn't understand booleans and converts them to "tinyint"s. I wouldn't be mad, but mysqlimport doesn't understand that "False" = 0 and "True" = 1. Oh well. That was a simple problem to work out. I'm ~120 million revisions through my rev_len_changed dataset. That should be ready in the morning--now with namespaces! I'm excited. --EpochFail 00:07, 29 July 2011 (UTC)
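The workaround was just a pre-processing pass over the file before handing it to mysqlimport, something like the sketch below; the file names and the tab-separated layout are assumptions.

import csv

BOOL_MAP = {"True": "1", "False": "0"}  # what mysqlimport won't do for me

with open("hugglings.tsv") as infile, open("hugglings.mysql.tsv", "w") as outfile:
    reader = csv.reader(infile, delimiter="\t")
    writer = csv.writer(outfile, delimiter="\t")
    for row in reader:
        writer.writerow([BOOL_MAP.get(field, field) for field in row])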

Friday, July 29th

This is the third time I'm writing this entry due to stability problems with VirtualBox in Windows. Working from my laptop yesterday was amazing. Today, VirtualBox has a demon.

Otherwise, the script that was processing stuff for Fabian stopped last night because, for some reason, querying for pages from the page table sometimes returns nothing when there is most definitely something to return. I've already modified it to pick up where it left off. --EpochFail 16:55, 29 July 2011 (UTC)
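The resume logic is nothing clever: find the last page_id already written to the output and restart the page scan from there. A sketch, assuming a tab-separated output with page_id in the second column:

def last_processed_page_id(path):
    last_id = 0
    try:
        with open(path) as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                last_id = max(last_id, int(fields[1]))
    except IOError:
        pass  # no output yet, so start from the beginning
    return last_id

start_from = last_processed_page_id("output.tsv")
# ... then seed the page scan's "WHERE page_id > %s" with start_from.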

I've rebooted into pure Linux mode and things are better. I had a chat about Mechanical Turk with Jonathan, and about Huggle/Twinkle/ClueBot with Staeiou. It looks like the best directions for me to focus on at the moment are:

  1. Getting this dataset ready for Fabian. It was embarrassing that it wasn't done in a couple of days. Now I'm starting to question my methods.
  2. Linking hugglings of anons up to account creations.

The dataset is being generated. I expect that it will be ready to load in the afternoon, and if MySQL participates, I should have it indexed by the end of the day. --EpochFail 18:14, 29 July 2011 (UTC)