Research:Wiki Participation Challenge prognoZit

From Meta, a Wikimedia project coordination wiki

About the Algorithm[edit]

Authors: Ben Roth and Fridolin Roth

Team Name: prognoZit

Source code: dumps.wikimedia.org

Ranking in Wikichallenge: 1st

Dependencies[edit]

  • Python
  • Octave

Basic Approach[edit]

We obtain our prediction vector y, i.e. column 2 of the submitted data, by where

  • B∈Mn,m is a real matrix with one row for each of the n editors and is a number of features we use for the prediction (e.g. number of edits in a certain time interval)
  • is a certain weight vector

Features extracted from the data set[edit]

The columns of the matrix B are as follows:

  • Column 1: Sum of the deltas of edits from 2001-11-01 and before 2010-04-01
  • Column 2: Sum of the deltas of edits from 2010-04-01 and before 2010-09-01
  • Column 3: Number of reverted edits from 2001-11-01 and before 2010-04-01
  • Column 4: Number of reverted edits from 2010-04-01 and before 2010-09-01
  • Column 5: Number of edits from 2001-11-01 and before 2006-05-01
  • Column 6: Number of edits from 2006-05-01 and before 2008-11-01
  • Column 7: Number of edits from 2008-11-01 and before 2009-12-01
  • Column 8: Number of edits from 2009-12-01 and before 2010-03-01
  • Column 9: Number of edits from 2010-03-01 and before 2010-06-01
  • Column 10: Number of edits from 2010-06-01 and before 2010-07-01
  • Column 11: Number of edits from 2010-07-01 and before 2010-08-01
  • Column 12: Number of edits from 2010-08-01 and before 2010-08-15
  • Column 13: Number of edits from 2010-08-15 and before 2010-09-01

An extra column of ones is added (as an offset of the linear regression, see Determination of the weight vector). These are the features for testing (submission, matrix B). For training (determining the weight vector ) the same features are used but the time periods are shifted backwards by 5 months. We call this training matrix .

Determination of the weight vector[edit]

To determine the weight vector we use the time-shifted matrix and the vector containing the number of edits each editor made within the five months prior to the five months to be predicted. The weight vector is then obtained as the least square linear regression using the octave function .

Working in the log space[edit]

We have obtained our solution after transforming all the data to the log space, i.e. we have modified all entries by the function before estimating the weight vector. Accordingly, the solution was retransformed by the function

Partitioning the set of editors[edit]

We have obtained our solution by applying the approach described above not to the whole set of editors at once but to two subsets of editors separately. These subsets are of equal size and are obtained by grouping editors with data provided in the even respectively odd lines of the provided data set.

Data used[edit]

We have only used the data set provided by the organizers of the competition:

  • Number of edits in certain time intervals prior to 2010-09-01
  • “Deltas” in certain time intervals prior to 2010-09-01
  • Number of reverts in certain time intervals prior to 2010-09-01

Pseudo-code explanation[edit]

function training_features(i,j) 
	Return a vector of counts (edits, deltas, reverts - dependent on feature j), for a certain training time interval and editor i from the provided dataset.
endfunction

function testing_features(i,j) 
	Return a vector of counts (edits, deltas, reverts - dependent on feature j), for a certain testing time interval and editor i.
endfunction

n <- number of editors 
m <- number of features

let A, B be n x (m+1) matrices

for i = 1 to n: 
	A(i,1) <- 1 
	B(i,1) <- 1

for j = 2 to m + 1: 
	A(i,j) <- training_feature(i,j) 
	B(i,j) <- testing_feature(i,j)
	
	Take the logarithm of all entries of A and B: 
	A(i,j) <- log((A(i,j) + 1) 
	B(i,j) <- log((B(i,j) + 1)
	
	Let z be an n vector containing the number of edits done in the 5 months prior to the prediction interval (= the 5 months after the last training time interval).
	
z <- log(z+1)
	
for i = 1 to n / 2: 
	A_even(i,j) <- A(2i,j) 
	A_odd(i,j) <- A(2i - 1,j) 
	B_even(i,j) <- B(2i,j) 
	B_odd(i,j) <- B(2i - 1,j) 
	z_even(i) <- z(2i) 
	z_odd(i) <- z(2i - 1)
	
Solve least square linear regression for A and z, even and odd indices separately: 
w_even <- A_even \ z_even 
w_odd <- A_odd \ z_odd
	
Calculate the prediction vector in the log space
for i = 1 to n / 2: 
	y(2i) <- B_even(i,:)*w_even 
	y(2i - 1) <- B_odd(i,:)*w_odd

Retransform to the non-log space: y <- max(0, exp(y)-1)

return y