User:MPopov (WMF)/Notes/ETL User Guide

This page explains how to schedule ETL jobs using the Product Analytics jobs repo and a Puppet-managed systemd timer. There are other ways to schedule ETL jobs, but this approach exists because the alternatives are currently impractical.

Prerequisites

You need to have production data access and be in the analytics-product-users user group. You can verify membership by reviewing this file. This allows you to run commands on the stat boxes as the analytics-product system user.
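If you already have shell access to one of the stat boxes, a quick additional check is to list your groups (a hedged example; replace jdoe with your own shell username):

# List the groups your shell user belongs to; analytics-product-users
# should appear in the output.
groups jdoe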

You also need a Gerrit account with SSH configured (see the page on advanced usage) and git-review installed.

You need to have clones of the following repos:
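Based on the repositories referenced later on this page, these include the Product Analytics jobs repo (analytics/wmf-product/jobs) and the operations/puppet repo used in Step 4. A hedged sketch of cloning them, assuming the standard Gerrit clone URLs:

# Clone the jobs repo (your ETL job will live here):
git clone "https://gerrit.wikimedia.org/r/analytics/wmf-product/jobs"

# Clone the Puppet repo (the systemd timer in Step 4 lives here):
git clone "https://gerrit.wikimedia.org/r/operations/puppet"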

Step 1: Create Job

These instructions assume you are using wmfdata and a Jupyter notebook.

Structure

You want the following structure:

jobs/
  |- movement_metrics/
  |- ...
  |- your_job_name/
       |- main.sh
       |- hive/
       |- notebooks/

It is strongly recommended that you also create a CHANGES.md and a README.md and document everything thoroughly.

The notebooks/ directory is for notebook files with the ".ipynb" file extension – the extension matters, because main.sh only picks up files matching *.ipynb. If you want your notebooks to be executed in a specific order, name them so they sort into that order (for example, 01_extract.ipynb, 02_load.ipynb), since main.sh runs them alphabetically.

Create table queries

The hive/ directory is for CREATE TABLE queries. Do not specify the database inside the query; leaving it out means the user specifies the database when executing the query via the Hive CLI, which makes it easy to run the same query against a test database later. As a best practice, name the file create_<table_name>.hql. For example, to create a table called my_table:

hive -f hive/create_my_table.hql --database your_database

Example query

As you review the query below, pay attention to the style (e.g. spacing, capitalization, and a comment documenting every non-obvious column).

-- Usage
--   hive -f create_my_table.hql --database wmf_product

CREATE TABLE IF NOT EXISTS `my_table` (
    `action`    STRING  COMMENT 'Type of interaction',
    `sessions`  BIGINT  COMMENT 'Total number of sessions with a specific interaction',
    `events`    BIGINT  COMMENT 'Total number of interactions'
)
PARTITIONED BY (
    `year`   INT,
    `month`  INT
)
STORED AS PARQUET;

main.sh

The contents of main.sh will be:

#!/bin/bash
notebooks_dir="/srv/product_analytics/jobs/your_job_name/notebooks"

# Clean up nbconvert notebooks from *previous* run:
rm -f "$notebooks_dir"/*.nbconvert.ipynb

# Activate Anaconda-WMF base environment:
source /usr/lib/anaconda-wmf/bin/activate
for notebook in "$notebooks_dir"/*.ipynb
do
  python -m jupyter nbconvert --ExecutePreprocessor.timeout=None --to notebook --execute "$notebook" || exit 1
done

The only thing you should change is your_job_name in the second line.

Run the following command to make that shell script executable:

chmod +x main.sh

Permissions

Most jobs will load data into our team's centralized database, wmf_product.

In the last cell of the notebook, you should update the permissions on your table's files in HDFS so that other users can read it, for example from Superset.

import os

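# Make the table's files world-readable so other users (e.g. via Superset) can read them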
os.system("hdfs dfs -chmod -R o+r /user/hive/warehouse/wmf_product.db/my_table")

We can now upload the patch to Gerrit for review.
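If you have not used git-review before, here is a minimal sketch of uploading the patch (the path and commit message are just placeholders):

# From your clone of the jobs repo, stage and commit the new job:
git add your_job_name/
git commit -m "Add your_job_name ETL job"

# Push the commit to Gerrit as a patch for review:
git review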

Step 2: Test Notebook

Before merging, you'll need to verify that the ETL actually works. To do that, run the CREATE TABLE query using your own database. For example:

hive -f hive/create_my_table.hql --database bearloga

Change all the references to wmf_product in the notebook queries to bearloga (but YOUR database, not mine). Don't forget to adjust the permissions-fixing cell at the end as well!

Then run ./main.sh to see if all the notebooks can be executed successfully. Afterwards, query all the affected tables to check that the data in them is what you expect.
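For example, a couple of hedged spot checks (substitute your own database and table names):

# Make sure no queries still point at the production database:
grep -rn "wmf_product" notebooks/

# Check that the test table contains the data you expect:
hive -e "SELECT * FROM bearloga.my_table LIMIT 10;"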

Step 3: Prepare Database

Now that you've verified that the ETL works, it's time to prepare the centralized database. Following the Kerberos user guide, run the following on stat1007.eqiad.wmnet:

sudo -u analytics-product kerberos-run-command analytics-product hive -f hive/create_my_table.hql --database wmf_product
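To confirm that the table now exists in wmf_product, one hedged check (adjust the table name) is:

sudo -u analytics-product kerberos-run-command analytics-product hive -e "DESCRIBE wmf_product.my_table"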

Step 4: Scheduling

To schedule the job, you will need to add a kerberos::systemd_timer resource like the one below to the product_analytics Puppet manifest.

kerberos::systemd_timer { 'product-analytics-your-job-name':
    ensure                    => 'present',
    description               => 'Product Analytics monthly ???? run',
    command                   => "${jobs_dir}/your_job_name/main.sh",
    interval                  => '*-*-1 00:00:00',
    user                      => $user,
    logfile_basedir           => $log_dir,
    logfile_name              => 'your-job-name.log',
    logfile_owner             => $user,
    logfile_group             => $group,
    monitoring_enabled        => true,
    monitoring_contact_groups => 'team-product-analytics',
    syslog_force_stop         => true,
    syslog_identifier         => 'product-analytics-your-job-name',
    slice                     => 'user.slice',
    require                   => [
        Class['::statistics::compute'],
        Git::Clone['analytics/wmf-product/jobs']
    ],
}

The values to change:

  • The resource title: product-analytics-your-job-name
    • Note the use of hyphens instead of underscores
  • description: Product Analytics monthly ???? run
    • Change 'monthly' to 'daily' if appropriate
  • command: ${jobs_dir}/your_job_name/main.sh
  • interval: *-*-1 00:00:00
    • A systemd calendar expression; this one fires at 00:00 on the first day of every month
  • logfile_name: your-job-name.log
  • syslog_identifier: product-analytics-your-job-name
    • This should be the same as the resource title
After jenkins-bot verifies the patch, ping an SRE from Data Engineering for code review. For now, that is only Ben Tullis.
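Once the change is merged and Puppet has applied it on the stat host, you can verify the timer from there (a hedged example; the unit name follows the resource title above):

# Check that the timer is registered and when it will next run:
systemctl list-timers | grep product-analytics-your-job-name

# Inspect the job's recent output:
sudo journalctl -u product-analytics-your-job-name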

Step 5: Backfill

Step 6: QA