About the Algorithm
Authors: Roopesh Ranjan and Kalpit Desai
Source code: dumps.wikimedia.org
Ranking in Wikichallenge: Honorable Mention (see leaderboard)
This document describes the algorithm used by team Aardvarks. The final model is an ensemble of eight individual models.
How to Run:
- Using the files located in the “FeatureCreation” folder, generate all features:
- Import “training.tsv” and “regdate.tsv” into the “FeatureCreation” folder
- Run the three Python scripts located in that folder
- Run all the “.r” files located in that folder, in the order in which they were modified
- Run the R script “EnsembleCombination.R” located in the main folder to generate the final result file “Pick3.csv”. This script runs all eight constituent models and generates the final result.
- The “Models” folder contains the eight constituent files used in modelling.
1. RF model
Input -- Features113_
Output -- "RF_29Aug11_2nd.csv"
This model trains a separate Random Forest model for editors who joined before "2009-09-01 0:0:0" (OLD guys) and for those who joined after that time (NEW guys). The partition is motivated by a potential sampling effect in the data: only editors who made at least one edit between 2009-09-01 and 2010-09-01 were included in the dataset. The RF model for new guys uses 14 variables, whereas the RF model for old guys uses 19 variables.
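The old/new partition can be illustrated with a minimal Python sketch. It only shows the split on the cutoff quoted above, not the Random Forest training itself. The editor ids and registration dates are hypothetical, and assigning an editor who registered exactly at the cutoff to the NEW group is an assumption, since the text does not say which side that case falls on.

```python
from datetime import datetime

# Cutoff quoted in the text; editors who registered before it are "old".
CUTOFF = datetime(2009, 9, 1, 0, 0, 0)

def split_old_new(reg_dates):
    """Split {editor_id: registration datetime} into (old_ids, new_ids)."""
    old_ids, new_ids = [], []
    for editor_id, reg_date in sorted(reg_dates.items()):
        if reg_date < CUTOFF:
            old_ids.append(editor_id)
        else:
            # Assumption: an editor registered exactly at the cutoff counts as new.
            new_ids.append(editor_id)
    return old_ids, new_ids

# Hypothetical editors, for illustration only.
reg_dates = {
    101: datetime(2006, 3, 14),
    102: datetime(2009, 8, 31, 23, 59, 59),
    103: datetime(2009, 9, 1),
    104: datetime(2010, 2, 2),
}
old_ids, new_ids = split_old_new(reg_dates)
print(old_ids, new_ids)
```

Each group would then be passed to its own model with its own variable list.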
2. Not So Simple 9a
Input -- Features113
Output -- "withoptim_not_so_Simple9a_nested_segs_LB.csv"
This model applies two levels of segmentation to users. First, it separates those who have been in the system for at least 1 year from those who are newer than 1 year. Then it segments based on the number of unique edit days on which the user was active in the last 5 months. For each of these segments a linear model is fit on chosen features (different for each segment) by running a nonlinear optimiser that attempts to minimize the RMSLE loss. In addition, 25 models are fitted for each segment, and their predictions are aggregated by taking the median.
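As a rough illustration of two pieces of this description, the sketch below defines the RMSLE loss in its usual form (root mean squared difference of log(1 + x) values) and aggregates several models' predictions by taking the per-editor median. The optimiser and the segment-specific features are not shown, and the prediction values are made up for the example.

```python
import math
from statistics import median

def rmsle(predicted, actual):
    """Root mean squared logarithmic error between two equal-length sequences."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2
               for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def aggregate_median(predictions_per_model):
    """Combine per-model prediction lists by taking the median for each editor."""
    return [median(column) for column in zip(*predictions_per_model)]

# Three hypothetical fitted models, each predicting for the same three editors.
predictions_per_model = [
    [0.0, 2.0, 10.0],
    [1.0, 3.0, 12.0],
    [0.5, 9.0, 11.0],
]
combined = aggregate_median(predictions_per_model)
print(combined)
print(rmsle(combined, [0.0, 3.0, 11.0]))
```

The median makes the aggregate insensitive to a single outlying fit, such as the 9.0 predicted by the third hypothetical model for the second editor.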
3. Seven Segs V10
Input -- Featureskd_wrevs
Output -- Seven_Segs_v10_corrected_bagging.csv
This is very similar to not_so_simple9a, except that the predictions are aggregated by taking the geometric mean. The choice of features also differs somewhat.
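A minimal sketch of geometric-mean aggregation in Python follows. It assumes strictly positive predictions; how zero predictions are handled in the actual model is not stated here, so that case is left out. The numbers are illustrative only.

```python
import math

def aggregate_geometric_mean(predictions_per_model):
    """Per-editor geometric mean across models (needs strictly positive values)."""
    combined = []
    for column in zip(*predictions_per_model):
        # Average in log space, then map back with exp.
        log_mean = sum(math.log(value) for value in column) / len(column)
        combined.append(math.exp(log_mean))
    return combined

# Two hypothetical models predicting for two editors.
predictions_per_model = [
    [1.0, 2.0],
    [4.0, 8.0],
]
geo_combined = aggregate_geometric_mean(predictions_per_model)
print(geo_combined)
```

Compared with the arithmetic mean, the geometric mean gives less weight to unusually large predictions, which suits a loss measured on a log scale.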
Feature Generation Process
This document outlines the steps for creating three feature sets: OrigFeatures_*.csv, Features113_*.csv, and Featureskd_wrev_*.csv.
1. Create a text file with comma-separated timestamps for the edits made by each editor. There is one line per editor, so 44514 lines in total. Each timestamp is stored as a number that represents the days elapsed since midnight, Jan 01, 2001.
- eid_time_table.py: rounds each timestamp down to a whole day (floor operation), creates edit_times.csv
- eid_time_table_unrounded.py: stores the unrounded timestamp with six decimal places of precision, creates edit_times_unrounded.csv
- editors_mnthly_edits.r: Computes monthly edits for each editor, based on rounded timestamps. Each month is of equal length, approximately 30.4 days. Also creates a file regdate.csv, which contains the registration date for each editor; when the registration date is missing, it is replaced by the FSDT.
- editors_mnthly_edits_unrounded.r: Same as editors_mnthly_edits.r, but based on unrounded timestamps.
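The timestamp encoding and the equal-length monthly binning described in step 1 can be sketched as follows. This is not the code from the scripts named above: the function names, the `window_start` parameter, and the sample values are hypothetical. Only the epoch, the floor operation, the six-decimal precision, and the 30.4-day month come from the text.

```python
from datetime import datetime
import math

EPOCH = datetime(2001, 1, 1)   # midnight, Jan 01, 2001
MONTH_LENGTH_DAYS = 30.4       # approximate, equal-length months

def days_since_epoch(timestamp):
    """Fractional days elapsed between the epoch and the given datetime."""
    return (timestamp - EPOCH).total_seconds() / 86400.0

def rounded_day(timestamp):
    """Whole days since the epoch (floor operation)."""
    return math.floor(days_since_epoch(timestamp))

def unrounded_day(timestamp):
    """Days since the epoch, kept to six decimal places."""
    return round(days_since_epoch(timestamp), 6)

def monthly_edit_counts(edit_days, window_start, n_months):
    """Count edits per equal-length month, starting at window_start (in days)."""
    counts = [0] * n_months
    for day in edit_days:
        index = int((day - window_start) // MONTH_LENGTH_DAYS)
        if 0 <= index < n_months:   # edits outside the window are ignored
            counts[index] += 1
    return counts

edit = datetime(2001, 1, 2, 12, 0, 0)
print(rounded_day(edit), unrounded_day(edit))
print(monthly_edit_counts([0, 10, 31, 61, 100], window_start=0, n_months=3))
```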
2. Create a csv file that has three columns: user_id, FirstEditTime(unrounded), RegistrationTime
- eid_fsdt_regdate_table_creation.r --> eid_fsdt_regdate_table.csv
3. Reverts related features: revert_features.py --> reverts_related_features_training_and_LB.csv
4. Read the whole training.tsv file, parse useful columns and dump a csv file with chosen columns
- create_raw_parsed_dump.r --> Parsed_RawDump_Full.csv
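Step 4 is done by an R script; purely as an illustration, the same idea (read a tab-separated file, keep a subset of columns, write them out as CSV) might look like this in Python. The column names and the sample row are hypothetical and are not taken from training.tsv.

```python
import csv
import io

def dump_chosen_columns(tsv_file, csv_file, columns):
    """Copy only the chosen columns from a TSV stream to a CSV stream."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    writer = csv.DictWriter(csv_file, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)   # columns not listed in fieldnames are dropped

# Hypothetical in-memory sample standing in for the real input file.
sample_tsv = io.StringIO(
    "user_id\tarticle_id\ttimestamp\n"
    "7\t42\t2010-01-05 10:00:00\n"
)
parsed = io.StringIO()
dump_chosen_columns(sample_tsv, parsed, ["user_id", "timestamp"])
print(parsed.getvalue())
```

For the real file, the two streams would be replaced by file handles opened with `newline=""`, as the `csv` module recommends.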
5. Create the whole feature matrix and populate a subset of features that are based on edit_times_unrounded.csv
- OrigFeatures_subset1.r --> OrigFeatures_subset1_train.csv, OrigFeatures_subset1_lead.csv
6. Create what we called "Original Feature set"
- create_orig_features.r --> OrigFeatures_xinp1p2_yinp3.csv, OrigFeatures_xinp1p2p3.csv
7. OrigFeatures_to_Feature113.r (self explanatory)
8. OrigFeatures_to_Featurekd_wrev.r (self explanatory)