From Meta, a Wikimedia project coordination wiki

This page documents technical information, procedures, processes, and other knowledge that Discovery's Analysts should be aware of. An Employee Operations Manual (EOM), if you will. For onboarding steps, please refer to the Wikimedia Discovery/Team/Analyst onboarding page on Certain information has been withheld and is available as a supplement on the internal Office wiki. This EOM, its supplement, and the onboarding page provide all the necessary documentation and instructions for a newly hired Data Analyst in the Discovery Department.

Data Sources, Databases, and Datasets

Be sure to check out Discovery data access guidelines and data retention guidelines.

Data Sources

Web requests

  • Web requests are generated when a browser (or API client) navigates to a Wikimedia site.
  • There may then be subsequent requests for additional content, e.g. images and other data referenced by the page.
  • Some subset of those web requests are counted as "page views" based on the profile of the request.
  • "Page views" is a metric that attempts to capture how many times a page has been viewed by a real human being.
  • Several tools exist for exploring page view data.


EventLogging

  • Almost all, but not *all*, EventLogging requires JavaScript to be enabled in the client.
  • It's often used to track user interactions with features, to determine how well a feature serves users' needs.
  • As you perform different actions, those actions fire events from the JavaScript engine in your browser, which are sent to our EventLogging servers.



MySQL

Once SSH'd into stat1002 ("stat2") or stat1003 ("stat3"), connect to the MySQL server with the following command: mysql -h analytics-store.eqiad.wmnet
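You can also run one-off statements non-interactively with the client's -e flag, e.g.:

```shell
# Print the list of databases without opening an interactive session:
mysql -h analytics-store.eqiad.wmnet -e "SHOW DATABASES;"
```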


The log database contains the EventLogging tables, which are defined by schemas. The following tables are of particular interest:

  • Search_* (Schema:Search) captures the events from the autocomplete text field in the top right corner of wikis on desktop.
  • MobileWikiAppSearch_* (Schema:MobileWikiAppSearch) captures events from people searching on the Wikipedia app on Android and iOS.
  • MobileWebSearch_* (Schema:MobileWebSearch) captures events from people searching on the *.m.* domains.
  • TestSearchSatisfaction2_* (Schema:TestSearchSatisfaction2) allows us to track user sessions and derive certain metrics.
  • WikipediaPortal_* (Schema:WikipediaPortal) captures events from people visiting the Wikipedia Portal.
  • PrefUpdate_* (Schema:PrefUpdate) contains users' preference change events (e.g. opting in to / out of a beta feature).
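As a sketch of querying these (hedged: EventLogging tables are suffixed with the schema's revision number, which changes whenever the schema is updated, so replace <revision> with the current one; the capsule's timestamp field uses the 14-digit MediaWiki format):

```sql
-- Count yesterday's events per day from a search satisfaction table:
SELECT LEFT(timestamp, 8) AS date, COUNT(*) AS events
FROM log.TestSearchSatisfaction2_<revision>
WHERE timestamp >= DATE_FORMAT(CURDATE() - INTERVAL 1 DAY, '%Y%m%d')
GROUP BY date;
```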

Hadoop Cluster

The cluster contains many databases, some of which we'll make a note of and describe in this section. It has A LOT of data. So much, in fact, that we can only retain about a month of it at any point.


The WebRequest table in the wmf database contains refined request data stored in the Parquet column-based format (as opposed to the raw JSON imported directly from Kafka to wmf_raw.WebRequest). Refined means some fields (columns) are computed from the raw data using user-defined functions (UDFs) in Analytics' Refinery. For example, client_ip is computed from ip and x_forwarded_for to reveal the true IP of the request, and geocoded_data (country code, etc.) is computed from client_ip so we can easily write queries that fetch requests from a specific country.

Due to the volume of the data, the requests are written out to specific partitions indexed by webrequest_source (e.g. "text" or "misc"), year, month, day, and hour. It is important to include a WHERE webrequest_source = "__" AND year = YYYY AND etc. clause in your Hive query to avoid querying all partitions, which will take a very, very, VERY long time.
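For instance, a partition-pruned Hive query counting pageviews by country for a single hour might look like the following (a sketch: field names are per the wmf.webrequest schema described above; pick a date within the retention window):

```sql
USE wmf;
-- Restrict to one webrequest_source and one hour so only one partition is scanned:
SELECT geocoded_data['country_code'] AS country,
       SUM(IF(is_pageview, 1, 0)) AS pageviews
FROM webrequest
WHERE webrequest_source = 'text'
  AND year = 2016 AND month = 1 AND day = 1 AND hour = 12
GROUP BY geocoded_data['country_code'];
```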


The CirrusSearchRequestSet table in the wmf_raw database contains raw JSON of Cirrus searches.

User-defined Functions

User-defined functions (UDFs) are custom functions written in Java that can be called in a Hive query to perform complex tasks that can't be done with the built-in Hive QL functions. To see examples of UDFs, look in Analytics' Refinery source (e.g. IsPageview UDF).

To get started with writing your own UDFs: clone the refinery source repository, install Maven (e.g. brew install maven on Mac OS X if you have Homebrew installed), and then package the codebase into a Java .jar file (via mvn package) that you can import into Hive:

ADD JAR /home/bearloga/Code/analytics-refinery-jars/refinery-hive.jar;

USE [Database];
SELECT my_udf([Field]) AS [UDF-processed Field] FROM [Table]
  WHERE year = YYYY AND month = MM AND day = DD;

I have a clone of the analytics refinery source repository in /home/bearloga/Code, along with an update script scheduled via cron (12 20 * * *) that checks whether origin/master is ahead of the clone; if it is, the script pulls the updates, runs mvn package, and copies the latest JARs to /home/bearloga/Code/analytics-refinery-jars, so our data collection scripts (as of T130083) can always use the latest and greatest. The script is:

cd analytics-refinery-source

MERGES=$(git log HEAD..origin/master --oneline)
if [ ! -z "$MERGES" ]; then
  git pull origin master
  mvn package
  for refinery in core tools hive camus job cassandra; do
    jar=$(ls -t "/home/bearloga/Code/analytics-refinery-source/refinery-${refinery}/target" | grep 'SNAPSHOT.jar' | head -1)
    cp "/home/bearloga/Code/analytics-refinery-source/refinery-${refinery}/target/${jar}" "/home/bearloga/Code/analytics-refinery-jars/refinery-${refinery}.jar"
  done
fi

Public Datasets

Our golden (data) retriever codebase fetches data from the above-described MySQL and Hadoop databases, does additional munging & tidying-up, and then writes the data out to the /a/aggregate-datasets/[search|portal|maps|wdqs|external_traffic] directories on stat1002, which are rsynced to a public datasets host. It is a collection of R scripts that run MySQL and HiveQL queries. All the scripts are executed daily by run.R, which is scheduled as a crontab job (in Mikhail's bearloga account) that runs daily at 20:00 UTC:

0 20 * * * cd /a/discovery/golden/ && sh

We still need to get it Puppet-ized so it's not dependent on any staff account. Note that this means that any time golden is updated, it needs to be git pull'd in /a/discovery/.


Process Overview

Typically, tasks go through Product Manager approval before they go into our backlog (currently the "Analysis" column on the Discovery Phabricator board). During sprint planning meetings, we pull tasks from that column into our sprint board to work on during that sprint. Occasionally we sidestep that process and add emergency ("Unbreak now!") tasks directly to the sprint. For a more thorough description of the process at Discovery, see this article on MediaWiki.

Analysis with R

In the past, we've focused on doing our analyses in R (statistical analysis software and programming language). In this section we describe packages we've developed internally, as well as important packages we tend to use to accomplish our tasks. Remember to set proxies, though, if you're going to be downloading packages from CRAN and GitHub:

Sys.setenv("http_proxy" = "http://webproxy.eqiad.wmnet:8080")
Sys.setenv("https_proxy" = "http://webproxy.eqiad.wmnet:8080")

Internal Codebases

  • polloi contains common functions used by the dashboards
  • golden isn't a package but contains all the data collection scripts that retrieve and aggregate data from the MySQL and Hadoop databases
  • wmf contains common functions used in analyses and data collection (e.g. querying Hive/MySQL)

By the way, all of our Gerrit-hosted repositories are mirrored to GitHub under the wikimedia organization, so you can also install the wmf package from its GitHub mirror with devtools::install_github().
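For example (a sketch: the mirror repository name below is a guess based on the Gerrit repo naming convention; check the wikimedia GitHub organization for the actual name):

```r
# Install from the GitHub mirror (hypothetical repository name):
devtools::install_github("wikimedia/wikimedia-discovery-wmf")
```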


Common Packages

install.packages(c("data.table", "magrittr", "dplyr", "tidyr", "broom",
                   "ggplot2", "cowplot", "RColorBrewer", "ggthemes",
                   "readr", "haven", "tibble", "rvest", "lubridate", "httr",
                   "devtools", "xml2", "git2r", "testthat",
                   "Rcpp", "RcppEigen", "RcppParallel", "doMC",
                   "shiny", "shinydashboard", "dygraphs", "shinythemes",
                   "xts", "forecast", "rstan", "gbm", "lme4",
                   "randomForest", "xgboost", "C50", "caret",
                   "xtable", "knitr", "rmarkdown", "toOrdinal",
                   "urltools", "rgeolocate", "iptools"),
                 repos = c(CRAN = ""))

devtools::install_github("aoles/shinyURL") # for Metrics dashboard

# RStudio Add-ons
install.packages(c("addinslist", # Browse and install RStudio addins
                   "shinyjs", # Colour picker
                   "ggExtra", # Add marginal plots to ggplot2
                   "ggThemeAssist"), # Customize your ggplot theme
                 repos = c(CRAN = ""))

There's also uaparser for parsing user agents, but installing it is awful. (Note: it's really slow and should only be used on data extracted from the MySQL event logs, since we already have a UA-parsing UDF and the refined webrequests already contain parsed UAs.)

Statistical Computing

Sometimes we may need to run computationally intensive jobs (e.g. ML, MCMC) without worrying about hogging up stat100x. For tasks like that, we have a project on Wikimedia Labs called discovery-stats, under which we can create 2-core, 4-core, and 8-core (with 16 GB of RAM!) instances. The instances must be managed through Horizon and can be set up with the following shell script after SSH-ing in (ssh <LDAP username>@<instance name>.eqiad.wmflabs):


sudo sh -c 'echo "deb jessie-cran3/" >> /etc/apt/sources.list'
sudo apt-key adv --keyserver --recv-key 6212B7B7931C4BB16280BA1306F90DE5381BA480

sudo apt-get update --fix-missing && sudo apt-get -y upgrade
sudo apt-get -fy install gcc-4.8 g++-4.8 gfortran-4.8 make \
 libxml2-dev libssl-dev libcurl4-openssl-dev \
 libopenblas-dev libnlopt-dev libeigen3-dev libarmadillo-dev libboost-all-dev \
 liblapack-dev libmlpack-dev libdlib18 libdlib-dev libdlib-data \
 libgsl0ldbl gsl-bin libgsl0-dev \
 libcairo2-dev libyaml-cpp-dev \
 r-base r-base-dev r-recommended

sudo su - -c "R -e \"dotR <- file.path(Sys.getenv('HOME'), '.R'); \
if (!dir.exists(dotR)) { dir.create(dotR) }; \
M <- file.path(dotR, 'Makevars'); \
if (!file.exists(M)) { file.create(M) }; \
cat('\nCXXFLAGS+=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function', file = M, sep = '\n', append = TRUE); \
cat('\nCXXFLAGS+=-flto -ffat-lto-objects -Wno-unused-local-typedefs', file = M, sep = '\n', append = TRUE)\""

sudo su - -c "R -e \"install.packages(c('arm', 'bayesplot', 'bclust', 'betareg', 'bfast', 'BH', 'BMS', 'brms', 'bsts', 'C50', 'caret', 'coda', 'countrycode', 'data.table', 'data.tree', 'deepnet', 'devtools', 'e1071', 'ElemStatLearn', 'forecast', 'gbm', 'ggExtra', 'ggfortify', 'ggthemes', 'glmnet', 'Hmisc', 'import', 'inline', 'iptools', 'irlba', 'ISOcodes', 'knitr', 'lda', 'LearnBayes', 'lme4', 'magrittr', 'markdown', 'mclust', 'mcmc', 'MCMCpack', 'mcmcplots', 'mice', 'mlbench', 'mlr', 'nlme', 'nloptr', 'NLP', 'nnet', 'neuralnet', 'prettyunits', 'pROC', 'progress', 'quanteda', 'randomForest', 'randomForestSRC', 'Rcpp', 'RcppArmadillo', 'RcppDE', 'RcppDL', 'RcppEigen', 'RcppGSL', 'RcppParallel', 'reconstructr', 'rgeolocate', 'rstan', 'rstanarm', 'scales', 'sde', 'tidytext', 'tidyverse', 'tm', 'triebeard', 'urltools', 'viridis', 'xgboost', 'xtable', 'xts', 'zoo'), repos = c(CRAN = ''))\""

This will install all the necessary software packages, libraries, and R packages at an instance-level. Additional R packages can then be installed at a user-level in the user's home dir.

PAWS Internal

We can perform data analyses using Jupyter notebooks (with R and Python kernels) via PAWS Internal.

Assuming you’ve got production access and SSH configured (see Discovery/Analytics on Office wiki for more examples of SSH configs), you need to create an SSH tunnel like you would if you wanted to query Analytics-Store on your local machine:

ssh -N notebook1001.eqiad.wmnet -L 8000:

Then navigate to localhost:8000 in your favorite browser and log in with your LDAP credentials (the username/password you use to log in to Wikitech).

By the way, if you want a quick way to get into PAWS Internal, you can make an alias (in, say, ~/.bash_profile) that creates the SSH tunnel and opens the browser:

alias PAWS="ssh -N notebook1001.eqiad.wmnet -L 8000: & open http://localhost:8000/"

Then in terminal: PAWS

This will launch your default browser and output a numeric process ID. When you want to close the tunnel: kill [pid]
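If you'd rather not hunt for the PID, a pair of hypothetical helper functions (not part of any standard setup) can capture it for you; set FORWARD_DEST to the same forward destination used in the alias above:

```shell
# Hypothetical helpers: open the tunnel and remember its PID, then close it later.
paws_open() {
  ssh -N notebook1001.eqiad.wmnet -L "8000:${FORWARD_DEST}" &
  PAWS_PID=$!
  open "http://localhost:8000/"
}
paws_close() {
  kill "$PAWS_PID"
}
```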

R in PAWS Internal

Sys.setenv("http_proxy" = "http://webproxy.eqiad.wmnet:8080")
Sys.setenv("https_proxy" = "http://webproxy.eqiad.wmnet:8080")
options(repos = c(CRAN = ""))

If you want, you can put those 3 lines in ~/.Rprofile and they will be executed every time you launch R.

We need to install the devtools and Discovery’s wmf packages. If you want to work with user agents in R, I’ve included the command to install the uaparser package.


# uaparser:
devtools::install_github("ua-parser/uap-r", configure.args = "-I/usr/include/yaml-cpp -I/usr/include/boost")

MySQL in PAWS Internal with R

The mysql_connect function in wmf will try to look for some common MySQL config files (those vary between stat1002, stat1003, and notebook1001). It’ll let you know if it encounters any problems but I doubt you’ll have any. Try querying with mysql_read:

log_tables <- wmf::mysql_read("SHOW TABLES;", "log")
## Fetched 375 rows and 1 columns.

Hive in PAWS Internal with R

Since Hive has been configured on notebook1001, we don't need to do anything extra to use the query_hive function in wmf:

wmf_tables <- wmf::query_hive("USE wmf; SHOW TABLES;")

Installing Python modules on PAWS Internal

Madhu said that the global version of pip is out of date and needs to be updated on a per-user basis.

Upgrading should get you to pip 8+, after which wheels (the new Python distribution format) get installed instead of eggs.

You can update & install within the notebook, but if you prefer to do it in Terminal after SSH'ing to notebook1001.eqiad.wmnet, add the following to your ~/.bash_profile:

[[ -r ~/.bashrc ]] && . ~/.bashrc
export PATH=${PATH}:~/venv/bin
export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080

Then you can use and upgrade pip:

!pip install --upgrade pip
Downloading/unpacking pip from
  Downloading pip-9.0.1-py2.py3-none-any.whl (1.3MB): 1.3MB downloaded
Installing collected packages: pip
  Found existing installation: pip 1.5.6
    Uninstalling pip:
      Successfully uninstalled pip
Successfully installed pip
Cleaning up...

Then we can install (for example):

  • Data: Pandas, pandas-datareader, Requests, Beautiful Soup 4, Feather
  • Visualization: Seaborn, Bokeh
  • Statistical Modeling and Machine Learning
    • StatsModels for statistical analysis
    • Scikit-Learn for machine learning
    • PyStan interface to Stan probabilistic programming language for Bayesian inference
    • PyMC3 for Bayesian modeling and probabilistic machine learning
    • Patsy for describing statistical models (especially linear models, or models that have a linear component) and building design matrices. (Patsy brings the convenience of R “formulas” to Python)
    • TensorFlow for machine learning using data flow graphs
    • Edward for probabilistic modeling, inference, and criticism
pip install \
    pandas pandas-datareader requests beautifulsoup4 feather-format \
    seaborn bokeh \
    statsmodels scikit-learn pystan pymc3 patsy

Warning: TensorFlow v0.12.0 and 0.12.1 broke compatibility with Edward. Use at most TensorFlow v0.11.0 for now:

# First, set TF_BINARY_URL to the URL of a TensorFlow v0.11.0 wheel:
pip install $TF_BINARY_URL
pip install edward

This nifty command to update installed Python modules comes to us courtesy of rbp at Stack Overflow:

pip freeze --local | grep -v '^\-e' | cut -d = -f 1  | xargs -n1 pip install -U

Event Logging

MediaWiki-powered websites use the EventLogging extension, while the Wikipedia Portal has a "lite" version with components copied from the MW extension.

Wikipedia Portal


The relevant files are located in the dev/ directory.


Refer to the Wikimedia Portals repository's documentation for detailed instructions on contributing. Here are the instructions (assuming you're using a Mac):

First, download and install node.js 0.12.7 pkg because apparently node-gyp isn't supported anymore and doesn't work under Node.js 4[1]. Install Homebrew if you don't have it already.

# Run the following after installing node.js 0.12.7:
brew update && brew upgrade && brew install npm casperjs

# Repository set up:
sudo apachectl start
cd /Library/WebServer/Documents
# (give yourself R+W permissions on Documents)
git clone ssh://
cd portals

# Install NPM modules:
npm config set python python2.7
npm install

# Do not use: `npm install --python=python2.7 node-gyp postcss cssnext handlebars imagemin jshint jscs lwip sprity-lwip sprity gulp`

git checkout -b patch_nickname

gulp watch --portal
# ^ Watches for changes in dev/ and generates an index.html file at dev/

# ...coding...

# Test and debug by browsing to: http://localhost/portals/dev/

# Generate the production version with minified JS & CSS assets:
gulp --portal
# Test the production version by browsing to: http://localhost/portals/prod/

git commit -a -m "message"
# git commit --amend # detailed patch notes
git review

There are some sort-of unit tests in the tests/ folder which can be run using npm.
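Presumably those tests are wired up as an npm script (check the "scripts" section of package.json for the exact entry point); if so, running them looks like:

```shell
cd /Library/WebServer/Documents/portals
npm test
```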


Dashboards

As of 25 Feb 2016, our team maintains 5 dashboards (Search Metrics, Portal Metrics, WDQS Metrics, Maps Metrics, and External Referral Metrics). Each dashboard has its own repository on Gerrit. The links to the repositories can be found on the Discovery Dashboards homepage. The dashboards are powered by Shiny, a web application framework for R, and run on Shiny Server inside a Vagrant container on a Labs instance. We've documented most of the dashboarding process in the Building a Shiny Dashboard article on Wikitech. To ensure a unique but uniform look, we might use the shinythemes package for the dashboard's overall appearance.

Discovery Dashboards

The dashboards live in /srv/dashboards/shiny-server on the discovery-production and discovery-testing instances, and deploying new versions of the dashboards differs between beta and production. By the way, adding dashboards is covered in the Discovery Dashboards README.

Experimental Dashboards

We have a space for deploying dashboard prototypes, so that we can try them out very quickly without adding them to our existing dashboards, which are very beefy. The prototypes have their own homepage on the experimental instance.

Dashboard Development

mkdir Discovery\ Dashboards

# Data Retrieval:
git clone ssh://[USERNAME] Discovery\ Dashboards/Data\ Retrieval

# Internal R package:
mkdir R\ Packages
git clone ssh://[USERNAME] R\ Packages/wmf
git clone R\ Packages/BCDA

# Common Files and Functions for Dashboards:
git clone ssh://[USERNAME] Discovery\ Dashboards/Common

# mwvagrant and shiny-server (cloned on our search-datavis instances):
git clone ssh://[USERNAME] Discovery\ Dashboards/Server

# Dashboards:

## Search Metrics
git clone ssh://[USERNAME] Discovery\ Dashboards/Metrics\ Dashboard

## Portal
git clone ssh://[USERNAME] Discovery\ Dashboards/Portal\ Dashboard

## WDQS
git clone ssh://[USERNAME] Discovery\ Dashboards/WDQS\ Dashboard

## External Traffic
git clone ssh://[USERNAME] Discovery\ Dashboards/External\ Traffic\ Dashboard

## Maps
git clone ssh://[USERNAME] Discovery\ Dashboards/Maps\ Dashboard

Deploying to Beta

All this step involves is +2'ing a patch: once a patch has been merged into the dashboard's master branch, a regularly scheduled script on the beta instance automatically pulls the new version of the dashboard and restarts the shiny server service. Note the use of comments below; this team is very much for commenting your code.

cd /srv/dashboards

# We need to check if the dashboards repo has been
# updated (e.g. packages added to Only
# then can we pull latest versions of dashboards.

# Bring remote references up to date:
git fetch origin

# Check if there are remote changes that
# need to be merged in:
MERGES=$(git log HEAD..origin/master --oneline)
if [ ! -z "$MERGES" ]; then
  CHANGES=$(git diff --shortstat)
  if [ ! -z "$CHANGES" ]; then
    # Clean out uncommitted references to dashboards:
    git submodule update --init
    # ^ avoids conflicts when pulling origin/master
  fi
  # Bring this repo up-to-date:
  git pull origin master
  # Re-provision the vagrant container:
  vagrant provision
  vagrant reload
fi

# Pull latest version of each dashboard:
git submodule foreach git pull origin master

# Check if newer ver's of dashboards were downloaded...
# ...but first let's ensure we don't get an error:
if [ ! -e '/home/bearloga/submodules_status.txt' ]; then
  touch /home/bearloga/submodules_status.txt
fi
# ...okay, let's actually do the check now:
OLDSTATUS=`cat /home/bearloga/submodules_status.txt`
NEWSTATUS=$(git submodule status)
if [ "$OLDSTATUS" != "$NEWSTATUS" ]; then
  # Restart because different (newer?) dashboards were dl'd:
  mwvagrant ssh -c "sudo service shiny-server restart"
  # Save hashes for next checkup:
  git submodule status > /home/bearloga/submodules_status.txt
fi

Updating polloi is also handled via a regularly scheduled script:

LOCAL_SHA1=$(cat /usr/local/lib/R/site-library/polloi/DESCRIPTION | grep 'RemoteSha' | sed 's/RemoteSha: //' | sed 's/\s//g')
URL=$(cat /usr/local/lib/R/site-library/polloi/DESCRIPTION | grep 'RemoteUrl' | sed 's/RemoteUrl: //')
REMOTE_SHA1=$(git ls-remote -h ${URL} | grep 'refs/heads/master' | sed 's=refs/heads/master==' | sed 's/\s//g')
if [[ "$LOCAL_SHA1" != "$REMOTE_SHA1" ]]; then
  /usr/bin/R -e "devtools::install_git('${URL}')"
  service shiny-server restart
fi

As of 30 September 2016, we actually just run update_r_package polloi when the vagrant container is re-provisioned; it is a call to devtools::update_packages("polloi"). The code above is just for archive and reference.

Deploying to Production

Once the dashboards have been sufficiently tested on the beta instance (where it's easy to deploy patches to fix bugs and make minor changes), it is time to update the hash pointers (references) on the main dashboard repository via git submodule foreach git pull origin master, which goes through the dashboards – "rainbow" (metrics), "twilightsparql" (wdqs), "prince" (portal), etc. – and updates the submodule references. Then submit these reference changes to Gerrit (add, commit, review) for +2'ing by another Discovery Analyst (if available). Once the patch is merged, SSH to the discovery-production instance, cd to /srv/dashboards, and run sudo make update. This will make the appropriate changes, re-provision the vagrant container, and restart the shiny server service inside it.

Specifically, here are the steps (from scratch):

  1. From scratch: git clone ssh://[LDAP username] Shiny\ Server && cd Shiny\ Server && git submodule update --init --recursive
  2. If already did step 1:
    1. cd Shiny\ Server (or whatever you decided to call it)
    2. git submodule foreach git pull origin master
    3. Update and list what's new in this deployment
    4. git commit -a -m "Deploying [...]"
    5. git review
  3. Have yourself or colleague +2 the patch on gerrit and merge it.
  4. Deploy the submodule updates OR if you've added R packages to
    1. ssh [LDAP username]@discovery-production.eqiad.wmflabs
    2. cd /srv/dashboards
    3. sudo make update
    4. mwvagrant ssh -c "sudo service shiny-server restart"

If you want to update the version of polloi that's installed on the production instance:

# Assumes SSH config is set correctly.
$> ssh [LDAP username]@search-datavis.eqiad.wmflabs
ssh> cd /srv/dashboards
ssh> mwvagrant ssh
vagrant> sudo R
R> devtools::install_git('')
vagrant> sudo service shiny-server restart

Remember that deploying to the production server should not be done lightly. Product Managers and Leads use the dashboards to make data-informed decisions, so it's really important that they are stable and always up.

New Labs Instances

  1. Spin up an instance: Wikitech → Manage Instances → Select "shiny-r" as the project → Set filter → Add instance (e.g. search-datavis.eqiad.wmflabs)
  2. Create a proxy: Wikitech → Manage Web Proxies → Select "shiny-r" as the project → Set filter → Create proxy (e.g.

Shiny/Dashboarding Resources

Research and Testing

We prefer to upload our analysis codebases and report sources (we mostly write our reports in RMarkdown and compile/knit them into PDFs to be uploaded to Wikimedia Commons) to GitHub where we have the wikimedia-research organization. We use the following naming convention for the repositories: Discovery-[Research|Search|Portal|WDQS]-[Test|Adhoc]-NameOrDescription.

User Satisfaction

We have an ongoing research project to measure user satisfaction, which is one of our Key Performance Indicators (KPIs). We currently have an event logging schema that tracks randomly-selected users' sessions. We hope to use the Reading team's QuickSurveys technology to gather qualitative feedback from users and then build a predictive model that uses the event logging data to predict users' satisfaction with our search system. The current method combines the clickthrough rate with session dwell time, and we track the metric – which we're currently calling "user engagement" (or "augmented clickthroughs") – daily on the Search Metrics dashboard.

A/B Tests

Our past and current A/B tests – and process guidelines for future tests – are documented on Discovery's Testing page. We perform the analyses of clickthroughs using our BCDA R package. We prefer the Bayesian approach because traditional, null hypothesis significance testing methods do not work with the volume of data we usually generate from our A/B tests.

Forecasting Usage

We have an ongoing research project to forecast usage volume. A prototype dashboard for this endeavor is live on the experimental instance. Mikhail has started working on a Bayesian approach.

Miscellaneous Responsibilities

Monthly Metrics

We try to keep Discovery's KPI table on the Wikimedia Product page updated with the previous month's metrics, ideally before the 15th of the month. To that end, we created the Monthly Metrics module on the Search Metrics dashboard, which allows us to very quickly fill out the table with the latest numbers. The table also has sparklines, which can be generated from scratch using the following R code:


  sha1 = "de913ed700673554f2f6bd584fb47f322b381b74"


smoothed_load_times <- list(
    Desktop = desktop_load_data,
    Mobile = mobile_load_data,
    Android = android_load_data,
    iOS = ios_load_data
  ) %>%
  dplyr::bind_rows(.id = "platform") %>%
  dplyr::group_by(date) %>%
  dplyr::summarize(Median = median(Median)) %>%
  polloi::smoother("month", rename = FALSE) %>%
  dplyr::rename(value = Median)
smoothed_zrr <- polloi::smoother(failure_data_with_automata, "month", rename = FALSE) %>%
  dplyr::rename(value = rate)
smoothed_api <- split_dataset %>%
  dplyr::bind_rows(.id = "api") %>%
  dplyr::group_by(date) %>%
  dplyr::summarize(total = sum(calls)) %>%
  polloi::smoother("month", rename = FALSE) %>%
  dplyr::rename(value = total)
smoothed_engagement <- augmented_clickthroughs %>%
  dplyr::select(c(date, `User engagement`)) %>%
  polloi::smoother("month", rename = FALSE) %>%
  dplyr::rename(value = `User engagement`)

smoothed_data <- dplyr::bind_rows(list(
  `user engagement` = smoothed_engagement,
  `zero rate` = smoothed_zrr,
  `api usage` = smoothed_api,
  `load times` = smoothed_load_times
), .id = "KPI") %>%
  dplyr::arrange(date, KPI) %>%
  dplyr::distinct(KPI, date, .keep_all = TRUE) %>%
  tidyr::spread(KPI, value, fill = NA) %>%
  dplyr::filter(lubridate::mday(date) == 1) %>%
  dplyr::mutate(unix_time = as.numeric(as.POSIXct(date))) %>%
  dplyr::filter(date < lubridate::floor_date(Sys.Date(), "month"))

sparkline <- function(dates, values) {
  unix_time <- as.numeric(as.POSIXct(dates)) # "x values should be Unix epoch timestamps"
  if (any(, 10)))) {
    offset <- max(which(
    unix_time <- unix_time[(offset+1):length(values)]
    values <- values[(offset+1):length(values)]
  }
  # mw:Template:Sparkline supports up to 24 values
  unix_time <- tail(unix_time, 24); values <- tail(values, 24)
  points <- paste0("x", 1:length(unix_time), " = ", unix_time, " | ", "y", 1:length(unix_time), " = ", values, "|")
  return(paste0(c("|{{Sparkline|", paste0(points, collapse = "\n"), "}}"), collapse = "\n"))
}

sparkline(smoothed_data$date, smoothed_data$`user engagement`) %>% cat("\n")
sparkline(smoothed_data$date, smoothed_data$`zero rate`) %>% cat("\n")
sparkline(smoothed_data$date, smoothed_data$`api usage`) %>% cat("\n")
sparkline(smoothed_data$date, smoothed_data$`load times`) %>% cat("\n")

This generates wiki markup for inserting Sparklines into tables.

Responsible Development

Version control is very important on this team. All our codebases use version control, which allows us to track changes, collaborate without clashing, and review each other's work before it is deployed. Git is a tool that tracks changes to your code and shares those changes with others. You no longer have to email files back and forth, or fight over who's editing which file in Dropbox. Instead, you can work independently, and trust Git to combine (aka merge) your work. Git allows you to go back in time to before you made that horrific mistake. You can replay history to see exactly what you did, and track a bug back to the moment of its creation. If you haven't used Git before but have 15 minutes and want to learn Git, try this interactive lesson inside your web browser.


Code pipeline: commit → fetch → review (make sure to install git-review)

Resources for learning how to use Gerrit:

Someone on the team (senior analyst?) should add themselves to the wikimedia/discovery/* section on the Git/Reviewers page so that they are automatically added as a reviewer to all wm/discovery repositories.
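As a sketch, a typical patch cycle against one of our Gerrit-hosted repos looks roughly like this (assumes git-review is installed and the repository has a .gitreview file):

```shell
git checkout -b my-fix          # work on a topic branch
# ...edit files...
git commit -a -m "Describe the change"
git review                      # push the commit to Gerrit for review
# After addressing review comments, amend so Gerrit sees one patchset lineage:
git commit -a --amend
git review
```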


Code pipeline after forking or branching: commit → push → pull request.

If somebody else owns the repository ("repo"), you fork to have your own copy of the repo that you can experiment with. Once you are ready to submit a commit for review (and potentially deployment), you create a pull request that allows the owner to accept or reject your proposed changes.

Resources for learning how to use GitHub:

Style Guide

This section provides tips for consistent and efficient aesthetics and code.


ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species), data = iris) +
  geom_point(size = 3) +
  ggtitle("Fisher's or Anderson's iris data") +
  xlab("Sepal length (cm)") +
  ylab("Sepal width (cm)")

Programming Style

We recommend reading and adopting Hadley Wickham's style guide (based on Google's style guide for R). The following are some of our additional suggestions:

# Okay, but could be better:
foo <- function(x) {
  if (x > 0) {
  } else {

# Better:
foo <- function(x) {
  if (x > 0) {
Because if x > 0, return("positive") prevents execution of return("negative") anyway.

x <- TRUE

# Unnecessary comparison:
if ( x == TRUE ) return("x is true")

# Efficient:
if (x) return("x is true")

The same goes for ifelse(<condition>, TRUE, FALSE), which can simply be <condition>. :P

Learning Resources and References

Data Visualization

Machine Learning

Statistics / Data Science

Blogs by Statisticians / Data Scientists