Research:Usage of talk pages/2019-11-11
In discussion with the Editing team, it became clear that the previous analysis was too coarse-grained. Specifically, the Editing team's focus is on understanding junior contributors (100 or fewer edits).
Therefore, we only consider the first 90 days after registration (more precisely, after the first edit was made) for all editors who registered in the period 2018-01-01 to 2019-01-01.
Number of edits: talk vs subject
We count the number of edits per user and per namespace.
Spark query to count number of edits per namespace:

import os, sys
import numpy as np
import datetime
import calendar
import time
import pandas as pd
from pyspark.sql import functions as F, types as T, Window

## note: `spark` is not defined here; the query assumes an existing SparkSession with that name

snapshot = '2019-09'
wiki = 'dewiki'

date_start_reg = datetime.datetime(2018, 1, 1, 0)
date_end_reg = datetime.datetime(2018, 2, 1, 0)
n_days = 90  ## get edits for this number of days after registration

row_timestamp_edit = F.unix_timestamp(F.col('event_timestamp'))

## time-window to consider for editing.
ts_start_reg = calendar.timegm(date_start_reg.timetuple())
ts_end_reg = calendar.timegm(date_end_reg.timetuple())
row_timestamp_reg = F.unix_timestamp(F.col('event_user_first_edit_timestamp'))

df = (
    ## select table
    spark.read.table('wmf.mediawiki_history')
    ## select wiki project
    .where(F.col('wiki_db') == wiki)
    .where(F.col('snapshot') == snapshot)
    ## time window registration (first edit)
    .where(row_timestamp_reg >= ts_start_reg)
    .where(row_timestamp_reg < ts_end_reg)
    ## edits
    .where(F.col('event_entity') == 'revision')
    ## only registered users
    .where(F.col('event_user_is_anonymous') == False)
    ## whether user is detected as bot
    ## https://phabricator.wikimedia.org/T219177
    ## filter bots (array is empty)
    .where(F.size(F.col('event_user_is_bot_by')) == 0)
    ## keep only edits made within n_days after the first edit
    .withColumn('delta_t_sec', row_timestamp_edit - row_timestamp_reg)
    .where(F.col('delta_t_sec') < n_days * 24 * 3600)
    ## one row per user, one column per namespace
    .groupBy(F.col('event_user_id'))
    .pivot('page_namespace')
    .count()
)

df = df.toPandas()
## drop the column for edits without a namespace
df = df.drop(columns='null')
From this we can easily show the relation between the number of edits to talk-pages and the number of edits to subject-pages. As in the previous analysis, we see that an increase in the number of edits to talk-pages is related to a disproportionate increase in the number of edits to subject pages across all wikis considered here. Note that the plots are in double-logarithmic scale, such that a slope greater than 1 implies faster-than-linear growth.
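To make the reading of the double-logarithmic plots concrete, the following minimal sketch fits a straight line in log-log space. The data is synthetic and not taken from any wiki; the exponent 1.5, the prefactor 3 and the variable names are illustrative assumptions.

```python
import numpy as np

# Synthetic example (assumption): subject edits grow as talk_edits ** 1.5
talk_edits = np.array([1, 2, 5, 10, 20, 50], dtype=float)
subject_edits = 3.0 * talk_edits ** 1.5

# A power law y = a * x**b is a straight line in log-log space:
# log(y) = log(a) + b * log(x), so the fitted slope estimates b.
slope, intercept = np.polyfit(np.log10(talk_edits), np.log10(subject_edits), deg=1)

# A slope above 1 corresponds to faster-than-linear growth.
print(f"slope = {slope:.2f}")
```

On real counts the points scatter around the line, so the fitted slope is only an estimate of the exponent.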
This non-linear effect is similar to 'buy 2 and get 1 for free'. Interestingly, the effect only holds for editors with a small number of edits (10-50 edits to talk pages).
Predicting future editing behaviour
The weak point of the analysis above is that it is purely correlational. To take one step further towards a causal mechanism for how much interactions on talk-pages drive overall editing behaviour, we try to predict future editing activity. We split the first 90 days into two halves and, based on the editing activity in different namespaces in the 1st half, predict the number of edits to the main namespace in the 2nd half. More specifically, we use as features the number of edits in the 1st half to each namespace (e.g., N1_ns_0), the total number of edits (N1_total), and the combined number of edits to talk namespaces (N1_ns_talk) and to subject namespaces (N1_ns_subject); the target variable is the number of edits to namespace 0 in the 2nd half (N2_ns_0). To account for the wide variation in the number of edits, we use log(count+1) instead of the raw counts (the +1 ensures that counts of 0 can also be included).
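The feature construction described above could be sketched as follows. The input table `counts_1st_half`, the helper `build_features` and the example numbers are hypothetical; only the feature names and the log(count+1) transform come from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical 1st-half edit counts: one row per user, one column per namespace id.
counts_1st_half = pd.DataFrame(
    {0: [12, 0, 3], 1: [2, 0, 0], 2: [1, 4, 0], 3: [0, 5, 1]},
    index=pd.Index([101, 102, 103], name="event_user_id"),
)

def build_features(counts: pd.DataFrame) -> pd.DataFrame:
    """Return log(count + 1) features named as in the text (N1_ns_0, N1_ns_talk, ...)."""
    feats = counts.rename(columns=lambda ns: f"N1_ns_{ns}")
    talk_cols = [ns for ns in counts.columns if ns % 2 == 1]     # odd ids: talk
    subject_cols = [ns for ns in counts.columns if ns % 2 == 0]  # even ids: subject
    feats["N1_ns_talk"] = counts[talk_cols].sum(axis=1)
    feats["N1_ns_subject"] = counts[subject_cols].sum(axis=1)
    feats["N1_total"] = counts.sum(axis=1)
    # log(count + 1) keeps users with 0 edits in a namespace at exactly 0
    return np.log1p(feats)

features = build_features(counts_1st_half)
print(features.round(2))
```

The target N2_ns_0 would be built the same way from the 2nd-half counts of namespace 0.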
For the prediction, we use one of sklearn's tree-based ensemble regressors with default parameters. We train the model on 90% of the data and evaluate it on the remaining 10%, and report averages over 10 different realizations of this split.
Comparing the predicted vs the true number of edits, we get a Pearson correlation of ~0.5, indicating that we can predict future editing activity based on the past to some degree. Accuracy is typically higher for larger wikis, where we have more data.
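A minimal sketch of this evaluation procedure is shown below. It runs on synthetic data, and the choice of `GradientBoostingRegressor` is an assumption made for illustration; the page does not preserve which sklearn model was used. The fixed `random_state` values are added only to make the sketch reproducible.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # assumed model, see lead-in
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 3 log(count + 1) features from the 1st half.
n_users = 500
X = np.log1p(rng.geometric(0.1, size=(n_users, 3)) - 1)
# Synthetic target: log(count + 1) of 2nd-half edits to the main namespace.
y = 0.8 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0.0, 0.3, size=n_users)

# 10 random realizations of a 90% train / 10% test split.
splitter = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
correlations = []
for train_idx, test_idx in splitter.split(X):
    model = GradientBoostingRegressor(random_state=0)  # otherwise default parameters
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    # Pearson correlation between predicted and true values on the held-out 10%.
    correlations.append(np.corrcoef(y_pred, y[test_idx])[0, 1])

mean_r = float(np.mean(correlations))
print(f"mean Pearson r over {len(correlations)} splits: {mean_r:.2f}")
```

The correlation obtained here reflects only the synthetic data and says nothing about the values reported for the wikis.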
We quantify the effect of different features (and in particular that of talk-pages) by looking at the feature importance score and at dependence plots indicating the effect of a feature on the target variable.
- The feature importance is a measure of how much influence a feature has on the predicted variable. As one post describes it: "The measures are based on the number of times a variable is selected for splitting, weighted by the squared improvement to the model as a result of each split, and averaged over all trees."
- The partial dependence plot shows how changing the value of a feature influences the target variable (magnitude and direction).
Key for features used:
- N1 or N2 refer to the number of edits made in the 1st or 2nd half of the observation window
- _ns_<name> refers to the number of edits made to a particular Wikipedia namespace
  - <name> = 0,1,2,...: single namespace
  - <name> = talk: all talk namespaces combined (1,3,5,7,...)
  - <name> = subject: all subject namespaces combined (0,2,4,6,...)
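The two diagnostics can be illustrated with a short sketch on synthetic data. The model (`GradientBoostingRegressor`), the data-generating coefficients and the helper `partial_dependence_curve` are assumptions for illustration. sklearn's `feature_importances_` is an impurity-based score and may differ in detail from the measure quoted above. The partial dependence is computed by hand here, by forcing one feature to a fixed value for all users and averaging the predictions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # assumed model

rng = np.random.default_rng(1)
feature_names = ["N1_ns_0", "N1_ns_1", "N1_ns_3"]

# Synthetic data (assumption): the target depends strongly on N1_ns_0,
# weakly and positively on N1_ns_1, and weakly and negatively on N1_ns_3.
X = np.log1p(rng.geometric(0.1, size=(1000, 3)) - 1)
y = 1.0 * X[:, 0] + 0.2 * X[:, 1] - 0.2 * X[:, 2] + rng.normal(0.0, 0.1, size=1000)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Impurity-based importances, normalized by sklearn to sum to 1.
importances = dict(zip(feature_names, model.feature_importances_))

def partial_dependence_curve(model, X, feature_idx, grid):
    """Average prediction when one feature is forced to each grid value."""
    curve = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # overwrite the feature for every user
        curve.append(model.predict(X_mod).mean())
    return np.array(curve)

grid = np.linspace(0.0, 3.0, 7)
pd_ns0 = partial_dependence_curve(model, X, 0, grid)  # rises with N1_ns_0
pd_ns3 = partial_dependence_curve(model, X, 2, grid)  # falls with N1_ns_3

for name, score in importances.items():
    print(f"{name}: {score:.3f}")
```

A rising curve indicates a positive effect of the feature on the predicted target, a falling curve a negative one; the importance score alone does not carry this direction.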
The most important feature is, unsurprisingly, N1_ns_0. The first talk-page feature in the importance ranking is N1_ns_3 (user talk pages); however, the partial dependence suggests that the more interactions a user has on user talk-pages, the fewer edits the user will make to the main namespace in the future. The article talk-pages have a small (positive) effect.
The most predictive feature is again N1_ns_0. However, the effect of talk-pages is very different. First, edits to article talk-pages (N1_ns_1) usually have a larger feature importance. Second, the effect of user talk-pages is positive (or at least not negative); one exception is kowiki, where it is slightly negative.