Research:Wiki Participation Challenge Aardvarks

From Meta, a Wikimedia project coordination wiki

About the Algorithm

Authors: Roopesh Ranjan and Kalpit Desai

Team: Aardvarks

Source code: dumps.wikimedia.org

Ranking in Wikichallenge: Honorable Mention (see leaderboard)

Dependencies

  • Python
  • R


Synopsis

This document describes the algorithm used by team Aardvarks. The final model is an ensemble of eight individual models.

How to Run:

  • Using the files located in the “FeatureCreation” folder, generate all features:
    1. Copy “training.tsv” and “regdate.tsv” into the FeatureCreation folder.
    2. Run the three Python scripts located in that folder.
    3. Run all the “.r” files located in that folder, in the order in which they were modified.
  • Run the R script “EnsembleCombination.R” located in the main folder to generate the final result file “Pick3.csv”. This script runs all eight constituent models and generates the final result.
  • The “Models” folder contains the eight constituent files used in modelling.

Model Description

1. RF model

Input -- Features113_

Output -- "RF_29Aug11_2nd.csv"

This model trains a separate Random Forest model for editors who joined before "2009-09-01 0:0:0" (the OLD guys) and for those who joined after that time (the NEW guys). This partition is motivated by a potential sampling effect in the data: only editors who made at least one edit between 2009-09-01 and 2010-09-01 were part of the dataset. The RF model for the new guys uses 14 variables, whereas the RF model for the old guys uses 19 variables.
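
The partition itself can be sketched in a few lines of Python. This is only an illustration: the record layout (`user_id`, `joined`) is hypothetical, and assigning an editor who joined exactly at the cutoff to the NEW group is an assumption, since the text only says "before" and "after".

```python
from datetime import datetime

# Cutoff taken from the text; the record layout below is hypothetical.
CUTOFF = datetime(2009, 9, 1, 0, 0, 0)

def split_by_join_date(editors, cutoff=CUTOFF):
    """Return (old, new): editors who joined before the cutoff vs. at or after it."""
    old, new = [], []
    for editor in editors:
        # Assumption: an editor who joined exactly at the cutoff counts as NEW.
        (old if editor["joined"] < cutoff else new).append(editor)
    return old, new

editors = [
    {"user_id": 1, "joined": datetime(2007, 3, 14)},
    {"user_id": 2, "joined": datetime(2009, 9, 1)},
    {"user_id": 3, "joined": datetime(2010, 2, 20)},
]
old_editors, new_editors = split_by_join_date(editors)
```

Each of the two groups would then be passed to its own Random Forest, with its own list of variables.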

2. not_so_simple_9a

Input -- Features113

Output -- "withoptim_not_so_Simple9a_nested_segs_LB.csv"

This model applies two levels of segmentation to users. The first level separates those who have been in the system for at least 1 year from those who are newer than 1 year. The second level segments users by the number of unique days on which they made edits in the last 5 months. For each segment, a linear model is fit on chosen features (different for each segment) by running a nonlinear optimiser that attempts to minimize the RMSLE loss. In addition, 25 models are fitted for each segment, and their predictions are aggregated by taking the median.
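
As a rough illustration of two ingredients named here, the sketch below defines RMSLE in its usual form (root mean squared difference of `log(1 + x)` values) and aggregates the predictions of several models by the per-user median. The function names and the sample numbers are made up for the example; the fitting of the linear models themselves is not shown.

```python
import math
from statistics import median

def rmsle(predicted, actual):
    """Root mean squared logarithmic error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(math.log1p(p) - math.log1p(a)) ** 2
               for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def median_aggregate(per_model_predictions):
    """Combine predictions from several models by taking the per-user median."""
    return [median(column) for column in zip(*per_model_predictions)]

# Three hypothetical models (the text uses 25 per segment), four users each.
per_model = [
    [0.0, 2.0, 10.0, 1.0],
    [1.0, 3.0, 12.0, 1.0],
    [0.0, 9.0, 11.0, 4.0],
]
combined = median_aggregate(per_model)  # [0.0, 3.0, 11.0, 1.0]
```

The median is insensitive to a single model that predicts far too high or too low for a user, which is presumably part of its appeal when many models are fitted per segment.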

3. Seven Segs V10

Input -- Featureskd_wrevs

Output -- Seven_Segs_v10_corrected_bagging.csv

This is very similar to not_so_simple_9a, except that the predictions are aggregated by taking the geometric mean. The choice of features also differs somewhat.
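
For comparison with median aggregation, a geometric mean over models can be sketched as follows. The function name is hypothetical, and the sketch assumes strictly positive predictions; how zero predictions were handled in the original code is not stated here.

```python
import math

def geometric_mean_aggregate(per_model_predictions):
    """Combine per-model predictions by the per-user geometric mean.

    Assumes strictly positive predictions; zeros would need separate handling.
    """
    combined = []
    for column in zip(*per_model_predictions):
        mean_log = sum(math.log(value) for value in column) / len(column)
        combined.append(math.exp(mean_log))
    return combined

# Two hypothetical models, two users each.
per_model = [
    [1.0, 2.0],
    [4.0, 8.0],
]
combined = geometric_mean_aggregate(per_model)  # approximately [2.0, 4.0]
```

Averaging in log space fits naturally with a logarithmic loss such as RMSLE.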

Feature Generation Process

This section outlines the steps for creating three feature sets.

1. OrigFeatures_*.csv
2. Features113_*.csv
3. Featureskd_wrev_*.csv

Preprocessing

1. Create a text file with comma-separated timestamps for the edits made by each editor. There is one line per editor, for a total of 44514 lines. Each timestamp is stored as a number that represents the days elapsed since midnight, Jan 01, 2001.

  • eid_time_table.py: Rounds each timestamp down to a whole day (floor operation), creates edit_times.csv
  • eid_time_table_unrounded.py: Stores the unrounded timestamp with 6 decimal places of precision, creates edit_times_unrounded.csv
  • editors_mnthly_edits.r: Computes monthly edits for each editor, based on rounded timestamps. Each month is of equal length, approximately 30.4 days. Also creates a file regdate.csv, which contains the registration date for each editor; when the registration date is missing, it is replaced by the FSDT (the first edit time, as in step 2).
  • editors_mnthly_edits_unrounded.r: Same thing, but based on unrounded timestamps.
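
The timestamp encoding and the equal-length monthly bins can be sketched as below. This is not the content of the scripts listed above: the timestamp format string, the function names, and the convention of counting bins backwards from the end of the observation window are all assumptions made for the illustration.

```python
import math
from datetime import datetime

EPOCH = datetime(2001, 1, 1)   # midnight Jan 01, 2001, as stated in the text
MONTH_LENGTH_DAYS = 30.4       # approximate equal-length month from the text

def days_since_epoch(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a timestamp string and return fractional days elapsed since EPOCH."""
    delta = datetime.strptime(timestamp, fmt) - EPOCH
    return delta.total_seconds() / 86400.0

def monthly_edit_counts(edit_days, window_end, n_months):
    """Count edits in n_months equal-length bins ending at window_end (in days).

    Assumption: bins are counted backwards from window_end; the last list
    element is the most recent month.
    """
    counts = [0] * n_months
    for day in edit_days:
        months_back = int((window_end - day) // MONTH_LENGTH_DAYS)
        if 0 <= months_back < n_months:
            counts[n_months - 1 - months_back] += 1
    return counts

unrounded = round(days_since_epoch("2001-01-02 12:00:00"), 6)  # 1.5
rounded = math.floor(unrounded)                                 # 1
counts = monthly_edit_counts([5, 35, 40, 60], window_end=61, n_months=2)  # [1, 3]
```

Edits that fall outside the window (before the first bin or after `window_end`) are simply ignored in this sketch.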

2. Create a CSV file that has three columns: user_id, FirstEditTime (unrounded), RegistrationTime

  • eid_fsdt_regdate_table_creation.r --> eid_fsdt_regdate_table.csv
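
A minimal sketch of how such a three-column table could be assembled is shown below, including the fallback to the first edit time when the registration date is missing. The input dictionaries and the function name are hypothetical; the original is an R script whose internals are not reproduced here.

```python
def build_fsdt_regdate_rows(edit_times, reg_dates):
    """Return (user_id, first_edit_time, registration_time) tuples.

    edit_times: dict of user_id -> list of edit times (days since the epoch)
    reg_dates:  dict of user_id -> registration time, or None when missing
    """
    rows = []
    for user_id in sorted(edit_times):
        first_edit = min(edit_times[user_id])
        registered = reg_dates.get(user_id)
        if registered is None:
            registered = first_edit  # fall back to the first edit time
        rows.append((user_id, first_edit, registered))
    return rows

# Made-up values for two editors; the second has no registration date.
edit_times = {101: [3200.25, 3150.5], 102: [3301.0]}
reg_dates = {101: 3100.0, 102: None}
rows = build_fsdt_regdate_rows(edit_times, reg_dates)
```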

3. Revert-related features: revert_features.py --> reverts_related_features_training_and_LB.csv

4. Read the whole training.tsv file, parse the useful columns and write a CSV file containing the chosen columns

  • create_raw_parsed_dump.r --> Parsed_RawDump_Full.csv
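
The general shape of this step, reading a tab-separated file and writing a comma-separated file with only some of its columns, might look as follows in Python. The column names are hypothetical, as is the assumption that the input has a header row; the actual columns kept by create_raw_parsed_dump.r are not listed here.

```python
import csv
import io

def extract_columns(tsv_file, csv_file, columns):
    """Copy the chosen columns from a tab-separated file to a comma-separated one."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    writer = csv.DictWriter(csv_file, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # columns not listed in `columns` are dropped

# Hypothetical column names; an in-memory sample stands in for training.tsv.
sample = io.StringIO(
    "user_id\tarticle_id\ttimestamp\tcomment\n"
    "7\t42\t2010-01-05 08:30:00\tfix typo\n"
)
out = io.StringIO()
extract_columns(sample, out, ["user_id", "timestamp"])
```

With real files, the two `io.StringIO` objects would be replaced by file handles opened with `newline=""`, as the `csv` module recommends.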

OrigFeatures

5. Create the whole feature matrix and populate a subset of features that are based on edit_times_unrounded.csv

  • OrigFeatures_subset1.r --> OrigFeatures_subset1_train.csv, OrigFeatures_subset1_lead.csv

6. Create what we called "Original Feature set"

  • create_orig_features.r --> OrigFeatures_xinp1p2_yinp3.csv, OrigFeatures_xinp1p2p3.csv

Features113

7. OrigFeatures_to_Feature113.r (self-explanatory)

Featureskd_wrev

8. OrigFeatures_to_Featurekd_wrev.r (self-explanatory)