User:Halfak (WMF)/WMF research libraries

From Meta, a Wikimedia project coordination wiki

I'd like to perform a substantial upgrade and consolidation of our (WMF's) Python code for research in preparation for some dramatic improvements to my analysis/development environment. I'll use this page to document some of those ideas.

Python 3

Transitioning from Python 2.7 to 3 is annoying, so I plan to bundle it with a larger transition. I'm also hoping to move away from R (love the community, hate the language) for statistical work. That move will rely heavily on support from NumPy and SciPy.
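As a rough sketch of what the statistical side might look like outside R, here is a two-sample comparison done with NumPy and SciPy. The data is synthetic and the variable names are made up for illustration; `equal_var=False` requests Welch's t-test, which is also what R's `t.test()` does by default.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only; the group names are hypothetical.
rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=200)
treatment = rng.normal(loc=11.0, scale=2.0, size=200)

# Welch's t-test (unequal variances), comparable to R's default t.test().
result = stats.ttest_ind(treatment, control, equal_var=False)
diff = treatment.mean() - control.mean()

print(f"difference in means: {diff:.2f}")
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```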

Python as an analysis environment

IPython Notebook

The environment is relatively straightforward. I found myself picking up Markdown in a matter of minutes. It's fun to run code and then complain about what happened. I do have a few complaints: for example, I have to reach for my mouse to switch from code mode to Markdown mode. However, the system mostly just works, and it's much smarter and cleaner than an R document. 22:24, 4 November 2013 (UTC)

Pandas for data tables

I just finished a quick run through the Pandas documentation and checked for some of the functionality that I regularly use in R. Most of it is there, but many of the transformations and filters I'd like to do are a little quirky and convoluted. I find myself missing R's data.table a lot, but I can do what I need to do. 22:24, 4 November 2013 (UTC)
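To make the comparison concrete, here is a minimal sketch of a filter followed by a grouped aggregation in Pandas, with the rough data.table equivalents noted in comments. The table and its column names are hypothetical, and the named-aggregation syntax assumes a reasonably recent Pandas release.

```python
import pandas as pd

# Hypothetical revision data; the column names are made up for illustration.
revisions = pd.DataFrame({
    "user": ["alice", "bob", "alice", "carol", "bob", "alice"],
    "namespace": [0, 0, 1, 0, 0, 0],
    "bytes_changed": [120, -30, 15, 400, 75, -10],
})

# Filter rows; roughly revisions[namespace == 0] in data.table.
articles = revisions[revisions["namespace"] == 0]

# Grouped aggregation; roughly
# articles[, .(edits = .N, total_bytes = sum(bytes_changed)), by = user].
per_user = (
    articles.groupby("user")
    .agg(edits=("bytes_changed", "size"), total_bytes=("bytes_changed", "sum"))
    .reset_index()
)

print(per_user)
```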

Plotting with Bokeh

I ran through a few of the examples for plotting in IPython Notebook. The library seems quite capable, but it's not ready yet. For example, I couldn't find an equivalent of geom_errorbar, one of my favorite functions, and that's just one gap. I think I'll try Bokeh again another time, but I'm worried that moving away from my awesome R plotting environment will make me less productive. 22:24, 4 November 2013 (UTC)

Map reduce

Python streaming

Utilities

There are two sets of problems that I'd like to solve in a set of utilities.

Common scripts (see clize)
A set of utility scripts for extracting information from the database, dumps, etc. and for performing operations and transformations or gathering stats.
Common utilities
A set of python modules that supports extension of these common actions (e.g. XML dump processing scripts & Persistence) or data transformations (e.g.
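A common script of the kind described above might be split into a reusable function plus a thin command-line wrapper. The following is only a sketch under stated assumptions: it uses `argparse` from the standard library rather than clize, and the names `count_by_column` and `main`, as well as the tab-separated input format, are hypothetical.

```python
import argparse
import csv
import io
import sys
from collections import Counter


def count_by_column(lines, column):
    """Count how often each value appears in one column of TSV input."""
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row[column] for row in reader)


def main(argv=None, stdin=sys.stdin, stdout=sys.stdout):
    """Command-line wrapper: read TSV from stdin, write value/count pairs."""
    parser = argparse.ArgumentParser(
        description="Count rows per value of a TSV column."
    )
    parser.add_argument("column", help="name of the column to count by")
    args = parser.parse_args(argv)

    # Most frequent values first.
    for value, count in count_by_column(stdin, args.column).most_common():
        stdout.write(f"{value}\t{count}\n")


# Demonstration with in-memory input instead of a real dump or query result.
sample = io.StringIO("user\tpage\nalice\tFoo\nbob\tBar\nalice\tBaz\n")
out = io.StringIO()
main(["user"], stdin=sample, stdout=out)
print(out.getvalue(), end="")
```

Keeping the counting logic in its own function means the same code can be imported by other modules, while the wrapper stays small enough to swap for a different argument parser later.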

User stats ~ Wikimetrics

It's important that any statistics generation/extraction is closely tied to m:Wikimetrics or we'll end up duplicating a bunch of work.