This page covers the daily progress of the research sprint on editor lifecycle. The main sprint page is here.
Defined cohorts (E1000+ users who registered in the same year), fetched the revision timestamps for those users from the database, and took a first stab at the data. Plots to come soon.
Preliminary plots of daily editing activity for the 2002 cohort (65 editors) show a high degree of variability among editors -- as expected. Moreover, the distribution of account ages shows that most users in the sample have still edited this year, hence most might still be active. This kind of user is not particularly suited for our analysis. Modified the data collection queries to filter out bots (looking at the user_groups table). Will modify the plotting script to filter out editors who have been active recently (a 6-month window has been commonly used in other studies too).
Performed a first comparison between two cohorts of E1000+ users from 2004 and 2007, respectively. Trends seem to make sense (i.e., people who have been inactive for the past 6 months tend to have a higher initial activity and then tend to burn out), but the comparison is biased in two ways. First, it is biased between active and inactive users, since inactive users reached the same activity level in a shorter time span (i.e., 6 months less). Second, it is biased towards showing a faster burnout for the 2007 cohort, since these users have the same minimum editcount (1000+) but over a shorter span (i.e., 7 years for 2004 versus 4 years for 2007).
For the first bias I will need to modify the script so as to discount the 6 months of inactivity from user_editcount when selecting inactive users. For the second bias I have redefined the cohorts so that they have the same yearly activity rate.
As an extra safety measure against bot accounts that are not listed in the user_groups table, I am adding a second form of filtering: I am removing from the cohorts all users whose user_name field contains the case-insensitive sequence 'bot' as either a word prefix or a word suffix.
The work of this morning yielded these two little graphs:
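A minimal sketch of the name-based filter described above. This is one possible reading of "word prefix or suffix", not the query actually used: it assumes a word is a run of ASCII letters, so digits, spaces and underscores count as separators. The names `looks_like_bot` and `filter_bots` are hypothetical.

```python
import re

# 'bot' at the start of a run of letters, or at the end of one (case-insensitive).
# Assumption: words are runs of ASCII letters; anything else separates words.
BOT_PATTERN = re.compile(r"(?<![a-z])bot|bot(?![a-z])", re.IGNORECASE)


def looks_like_bot(user_name):
    """Return True if user_name has 'bot' as a word prefix or word suffix."""
    return BOT_PATTERN.search(user_name) is not None


def filter_bots(user_names):
    """Keep only the user names that do not match the bot pattern."""
    return [name for name in user_names if not looks_like_bot(name)]


kept = filter_bots(["Alice", "ClueBot NG", "Abbott", "Robotics fan"])
```

Note that a heuristic of this kind also drops human accounts such as "Botanist", so it trades a few false positives for a cleaner sample.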
The inset plot is the inverse cumulative distribution of account lifetimes (i.e. time elapsed since registration to last revision in the data), and gives an idea of the size of the samples on which average daily rates are computed.
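For reference, the inverse cumulative distribution used in the inset can be computed along these lines. This is a sketch assuming lifetimes are given in days as a plain list or array; the function name is hypothetical.

```python
import numpy as np


def inverse_cumulative(lifetimes):
    """Return (values, fraction of accounts whose lifetime is >= value)."""
    x = np.asarray(lifetimes, dtype=float)
    values, counts = np.unique(x, return_counts=True)  # sorted distinct lifetimes
    # Reverse cumulative sum: number of accounts with lifetime >= each value.
    at_least = counts[::-1].cumsum()[::-1]
    return values, at_least / x.size


values, fraction = inverse_cumulative([10, 20, 20, 40])
```

Multiplying `fraction` by the cohort size gives the number of editors still contributing to the average daily rate at each account age.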
The plot for 2007 has a very smooth decay that resembles either an exponential law or a power law. It would be worth doing some model fitting. It would also be nice to plot the full sample to get a sense of the individual variability.
After a day of (failed) efforts spent trying to pack more and more information into a single plot, I shifted my attention towards a more fruitful topic, namely model fitting. A first attempt at fitting an exponential function yielded very poor results, which was kind of expected. I tried a power-law function but haven't managed to get a decent least-squares solution out of it (besides, I shouldn't be using least squares to fit a power law at all, but for now I just need to get an idea of whether a decent fit is possible). Finally I tried the stretched exponential function, and the results look interesting, at least on the 2007 cohort. The stretched exponential is an interesting model for this kind of data, because it was originally devised to explain relaxation phenomena in disordered systems (e.g. the discharge of a capacitor), so it could provide an interesting theoretical framework and an intuitive metaphor for explaining the underlying social dynamics.
Developing the fitting code has taken me longer than expected, but now I have a fairly flexible script that lets me define new parametric models, fit them via least squares (i.e. chi-squared fitting), constrain any parameter of the model to a specific value, and compute goodness-of-fit measures. I tried three simple models of a monotonically decreasing curve: the exponential function, the power-law function, and the stretched exponential function.
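For readers unfamiliar with it, the stretched exponential is usually written in the form below. The symbols are my own choice of notation rather than the one used in the fitting script: f(t) is the average daily editing rate at account age t, A is the initial rate, \tau a characteristic time, and \beta the stretching exponent.

```latex
f(t) = A \exp\!\left[-\left(\frac{t}{\tau}\right)^{\beta}\right],
\qquad 0 < \beta \le 1
```

With \beta = 1 this reduces to the ordinary exponential decay, while smaller values of \beta give a fast initial drop followed by a slower-than-exponential tail.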
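The following is a rough sketch of how such a fitting workflow could look with SciPy. It is not the actual script: the parameterizations, the function `fit_model`, the starting values, and the synthetic data are all assumptions made for illustration. A parameter is constrained by wrapping the model in a function that pins it to a fixed value.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import chi2


def exponential(t, a, tau):
    return a * np.exp(-t / tau)


def power_law(t, a, alpha):
    return a * t ** (-alpha)


def stretched_exponential(t, a, tau, beta):
    return a * np.exp(-((t / tau) ** beta))


def fit_model(model, t, y, sigma, p0):
    """Weighted least-squares fit; returns (params, chi-squared, p-value)."""
    popt, _ = curve_fit(model, t, y, p0=p0, sigma=sigma,
                        absolute_sigma=True, bounds=(0, np.inf))
    chisq = float(np.sum(((y - model(t, *popt)) / sigma) ** 2))
    dof = len(t) - len(popt)  # data points minus free parameters
    return popt, chisq, float(chi2.sf(chisq, dof))


# Synthetic stand-in for an average daily activity curve (not real data).
rng = np.random.default_rng(0)
t = np.arange(1.0, 201.0)
sigma = np.full_like(t, 0.05)
y = stretched_exponential(t, 10.0, 30.0, 0.5) + rng.normal(0.0, 0.05, t.size)

fits = {
    "exponential": fit_model(exponential, t, y, sigma, (y.max(), t.mean())),
    "power law": fit_model(power_law, t, y, sigma, (y.max(), 0.5)),
    "stretched": fit_model(stretched_exponential, t, y, sigma,
                           (y.max(), t.mean(), 1.0)),
}

# Constraining a parameter: pin beta to 1, leaving only a and tau free.
pinned = fit_model(lambda t, a, tau: stretched_exponential(t, a, tau, 1.0),
                   t, y, sigma, (y.max(), t.mean()))
```

In this convention a large p-value means the chi-squared test does not reject the model, so a good fit is one with p above the chosen threshold.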
The results of the fit are the following:
Here "inact" stands for the group of inactive users (i.e. no activity in the past 6 months). I marked in bold the fit of the 2007 data for the inactive group, which is the only one that the goodness-of-fit test does not reject (p>0.1). The Exponential and the Power-law models do pretty badly compared to the Stretched Exponential function. Moreover, for the 2007 data on inactive editors the K2 statistic from a normality test on the residuals is 4.60 (p>0.1), so normality of the residuals is not rejected either. We can therefore say that the 2007 data support this model.
The fit (left) and the residuals (right) are shown in the two figures below:
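A residual check of this kind can be sketched as follows. I am assuming here that K2 refers to the D'Agostino-Pearson statistic, which is what `scipy.stats.normaltest` returns; the helper name, the 0.1 threshold as a default, and the synthetic residuals are illustrative only.

```python
import numpy as np
from scipy.stats import chi2, normaltest


def residuals_look_normal(residuals, alpha=0.1):
    """Return (K2, p, ok), where ok is True if normality is not rejected."""
    k2, p = normaltest(residuals)
    return float(k2), float(p), bool(p > alpha)


# Under the null hypothesis K2 follows a chi-squared law with 2 degrees of
# freedom, so a reported K2 can be converted to a p-value directly.
p_reported = float(chi2.sf(4.60, 2))

# Synthetic examples (not the sprint data): skewed residuals should fail.
rng = np.random.default_rng(1)
skewed = residuals_look_normal(rng.exponential(size=500))
```

For K2 = 4.60 this conversion gives a p-value just above 0.1, consistent with the figure quoted above.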
Even though I have neglected this page, I have made substantial progress on this sprint -- or should I rather say marathon at this point -- in the past few weeks. Therefore, at least this time, I am posting stuff directly to the main page.
Grand presentation coming up in two days, and I am trying to round up the flood of data and charts I produced. I drastically changed the cohort definition, since the previous one had a fundamental error in that it did not take into account the actual lifespan of activity of users. Cohorts are now defined by average activity rate and month of first edit. This means that, from the two cohorts I had initially, I now have 10 years x 7 orders of magnitude of editing activity = 70 different cohorts (I am binning activity logarithmically)! What is interesting is that the activity lifecycle is very different in high-activity cohorts, and much closer to my initial hypothesis: users start editing slowly, reach a peak of productivity, and then slowly wear off. This means that the stretched exponential assumption holds only for the low ranges of editing activity -- even though, I suspect, the decay after the productivity peak must have its own characteristic time. I thus moved to a nonparametric estimation approach to regress activity over time, and I am going to present how peak activity has changed over time. In the meantime, I made some changes to the main page to document the cohort composition approach.
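To illustrate the logarithmic binning, a cohort label could be derived roughly as below. This is only a sketch: it assumes the activity rate is measured in edits per day over the active lifespan and groups users by year of first edit, and the function name and arguments are hypothetical.

```python
import math


def cohort_key(first_edit_year, edits, active_days):
    """Label a user by year of first edit and order of magnitude of activity."""
    rate = edits / active_days  # assumed unit: average edits per day
    # floor(log10(rate)) puts rates in [10^k, 10^(k+1)) into bin k.
    return first_edit_year, math.floor(math.log10(rate))


heavy = cohort_key(2004, 1500, 365)   # roughly 4 edits per day
light = cohort_key(2007, 50, 1000)    # 0.05 edits per day
```

Users whose rates differ by less than a factor of ten can still fall into adjacent bins, which is the usual price of binning by order of magnitude.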
Presentation coming tomorrow. Last graphs prepared and packed into slides. I am going to update the page tomorrow with the latest plots and the full results.