User:EpochFail/Journal/2011-08-01

Monday, August 1st[edit]

I'm writing this entry after-the-fact. I was sick on Monday, but I did get some things done. I found out that MySQL won't perform efficient grouping operations with some aggregate functions. Most notably, SUM(). I just... I can't understand... I guess MySQL wasn't really built for performing the kinds of aggregations that I'd like to do. So... I wrote a script that will perform the aggregation in a way that I suspect will be a little more efficiently, but sadly, it still won't use the index. I plan to test my assumption by racing the queries. I don't think this will be wasteful since I think that MySQL is going to write a table to the disk, perform a merge sort and then aggregate. If I am right, this will result in at least an order of magnitude more time. If not, I'll be very happy to be wrong :) --EpochFail 17:56, 2 August 2011 (UTC)

Tuesday, August 2nd[edit]

I started today by trying to finishing loading the data into MySQL for Fabian's dataset so I can race MySQL vs. my aggregation script. Zack had a query running from the table that blocked my load, so while waiting for that to finish or him to shut it down, I started loading data into one of the new analytics machines. This is a perfect platform for testing the capabilities of MySQL aggregation w/o the index vs. my own aggregation script, since they'll be querying from different machines. Once this finishes and I can get things moving (about lunchtime), I'll start working on huggle outcomes and ANOVA. --EpochFail 17:56, 2 August 2011 (UTC)

Data loaded into MySQL and indexed. Query running. I worked on some huggle data today and am planning to do some hand coding tomorrow to check out outcomes for editors that continue to edit after reading their huggle warning. I hope to find a simple way to algorithmically predict success. Either way, I'll be producing and ANOVA of registration (as a metric of huggled anon editor success) tomorrow. --EpochFail 00:51, 3 August 2011 (UTC)

Wednesday, August 3rd[edit]

MySQL to the rescue! The large GROUP BY finished *much* faster than expected when I ran by myself on one of the new analytical database machines. I now have a dataset that groups edit activity by user_id, rev_year, rev_month and namespace with the following fields:

first_edit: An editor's first edit
first_edit_year: An editor's first edit year (int)
first_edit_month: An editor's first edit month (int)
reverting_edits: The number of reverting edits performed
noop_edits: The number of edits that did not changed the length of an article
add_edits: The number of edits that increased the length of an article
remove_edits: The number of edits that reduced the length of an article
len_added: The total length added to articles
len_removed: The total length removed articles

It's time to check Fabian's work! --EpochFail 16:39, 3 August 2011 (UTC)