User:GModena (WMF)/Pipelines Repo Structure


This page contains design notes for https://phabricator.wikimedia.org/T292743.

Assumptions

This document describes a monorepo where Airflow DAGs and project code live together. This repo structure matches how we currently deploy software to our own Airflow instance.

A broader discussion on how to organise and share Airflow DAGs at WMF (across teams) is happening at https://phabricator.wikimedia.org/T290664. For us this means that the `dags` and `tests` top-level directories will move out of the repo described in this document.

Their deployment process will be owned by Data Engineering. We’ll be responsible for the lifecycle (e.g. software development, CI, code standards, artefact publication, dependency management, etc.) of project code bases.

In the future production scenario we’ll end up with two repos:

1. A central (monorepo) place to keep `dags`. This will be owned by DE.

2. A repo for project code (potentially broken down into multiple sub-repos). This will be owned by us.

Needs decision

My preference would be to look at our use case and the DE shared airflow repo as parallel tracks.

Most of the decisions we'll make at team level impact project code (code standards, CI, etc.), which is orthogonal to DAGs. To me it is important that we don't break the dev experience. Right now we have (simple) tooling to deploy changes to our Airflow instance. I'd rather not spend time adapting it to eventual repo changes until we are able to piggyback on shared deployment tooling.

README.md

https://gitlab.wikimedia.org/gmodena/platform-airflow-dags

This repo contains data pipelines operationalised by the Generated Datasets Platform team.

Data pipelines

> […] a pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion. […]
>
> https://en.wikipedia.org/wiki/Pipeline_(computing)

A Generated Datasets Platform data pipeline is made up of two components:

1. Project-specific tasks and data transformations that operate on input and output data. We depend on Apache Spark for elastic compute. Currently we support Python-based projects; Scala support is planned.

2. An [Airflow DAG](https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html), that is a thin orchestration layer that composes and executes tasks.
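As an illustration of how thin that orchestration layer is, below is a minimal, hypothetical DAG sketch. Names, paths and the schedule are made up, and the `SparkSubmitOperator` from the Airflow Spark provider package is used only as one possible way of delegating work to project code:

```python
# Hypothetical sketch: the DAG only wires project tasks together and schedules
# them; all data transformation logic lives in the project code it invokes.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="example_pipeline",  # hypothetical name
    start_date=datetime(2021, 10, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    generate_dataset = SparkSubmitOperator(
        task_id="generate_dataset",
        application="example-project/spark/transform.py",  # hypothetical project task
        conn_id="spark_default",
    )
    publish_dataset = SparkSubmitOperator(
        task_id="publish_dataset",
        application="example-project/spark/publish.py",  # hypothetical project task
        conn_id="spark_default",
    )

    # Composition: the output of one task is the input of the next.
    generate_dataset >> publish_dataset
```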


Repo layout

We structure code with a [monorepo](https://en.wikipedia.org/wiki/Monorepo) strategy. Its structure matches the layout of `AIRFLOW_HOME` on the PET-owned [an-airflow1003.eqiad.wmnet](https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow#platform_eng) Airflow instance.

  • `dags` contains Airflow DAGs for all projects. Each DAG schedules a data pipeline. No business logic is contained in the DAG.
  • `tests/` contains the `dags` validation test suite. Project-specific tests are implemented under `<project-name>`.
  • `<project-name>` directories contain tasks and data transformations that operate on input and output data. For an example, see `image-matching`.
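For orientation, the top-level layout looks roughly like this (an illustrative sketch, not an exhaustive listing):

```
platform-airflow-dags/
├── dags/             # one Airflow DAG per data pipeline, no business logic
├── tests/            # dags validation test suite
└── image-matching/   # example project: tasks and data transformations
```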

Note that these are currently physical directories, but we could split and modularise the repo using git submodules. For a (conceptual) example of this approach see: https://gitlab.wikimedia.org/gmodena/platform-data-pipelines. This is how a third party could contribute a project to the repo (see https://gitlab.wikimedia.org/gmodena/platform-data-pipelines/-/tree/add-project-b).
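Should we go down the submodule route, adding a project would look something like the following (illustrative commands with a placeholder repo URL, not an agreed workflow):

```
git submodule add <project-repo-url> image-matching
git commit -m "Add image-matching project as a submodule"
```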

TODO: should we move these project code directories under `projects/`?


Environments

Data pipelines are executed on Hadoop. Elastic compute is provided by Spark (jobs are deployed in cluster mode). Scheduling and orchestration are delegated to Apache Airflow.
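Here "cluster mode" means that the Spark driver itself runs on the Hadoop (YARN) cluster rather than on the submitting host, along the lines of the following (illustrative command with a hypothetical script path):

```
spark-submit --master yarn --deploy-mode cluster example-project/spark/transform.py
```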

We don’t have a canonical Airflow production environment yet. Generated Datasets Platform pipelines are currently targeting an `Airflow Development` instance.

Airflow Development

DAGs are currently deployed and scheduled on a PET Airflow instance at [an-airflow1003.eqiad.wmnet](https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow#platform_eng). We treat this as our target development and experimentation environment. This host runs both a Scheduler and a LocalExecutor instance.
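In Airflow terms the single-host setup corresponds to the following `airflow.cfg` setting (shown only to illustrate what a LocalExecutor instance means; the actual instance configuration is managed elsewhere):

```
[core]
executor = LocalExecutor
```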

Airflow Production

There is an initiative to migrate all production `dags`, across tech teams, to a central WMF monorepo. This will break the current layout and deployment approach. For more details see: https://docs.google.com/document/d/1hp6JYVy3SLRgTx1BYfnNOCPk5VFJeZ4jMpxD8WJKVB0/edit#heading=h.b8kl66xfhz9v.

Python projects structure

TODO:

• [ ] Automate project skeleton creation with cookiecutter-like templates.
• [ ] Add a template for Scala/Java projects.

Python projects are structured as follows:

• `conf` contains job-specific config files.
• `spark` contains Spark-based data processing tasks.
• `sql` contains SQL/HQL-based data processing tasks.
• `test` contains the project's test suite.

For an example of this layout see the `image-matching` project.
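As an illustration of what lives under `spark/`, a task could look like the following. This is a hypothetical sketch, not code taken from `image-matching`; the table names are made up:

```python
# Hypothetical sketch of a Spark-based data processing task.
from pyspark.sql import SparkSession


def run() -> None:
    spark = SparkSession.builder.appName("example_transform").getOrCreate()
    # Read an input table, apply a transformation, and write the output dataset.
    df = spark.read.table("example_db.example_input")
    df.filter("some_column IS NOT NULL").write.mode("overwrite").saveAsTable(
        "example_db.example_output"
    )
    spark.stop()


if __name__ == "__main__":
    run()
```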

Coding standards

TODO:

• [ ] Add Tox support.


We favour test-driven development with `pytest`, lint with `flake8`, and type check with `mypy`. We encourage the use of `isort` and `black` for code formatting. We log errors and informational messages with the Python `logging` library.
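For example, a helper following these conventions would carry type hints (for `mypy`) and use a module-level logger from the standard `logging` library rather than `print()`. This is a hypothetical snippet, not taken from any project:

```python
import logging

logger = logging.getLogger(__name__)


def parse_threshold(value: str) -> float:
    """Parse a confidence threshold, logging the error before re-raising."""
    try:
        return float(value)
    except ValueError:
        logger.error("Invalid threshold value: %s", value)
        raise
```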

For more details refer to https://phabricator.wikimedia.org/T292220

WIP:

* https://phabricator.wikimedia.org/T293382
* https://phabricator.wikimedia.org/T292741


JVM projects structure

TBD

Contribution

For more details on code contribution and project onboarding see …

TODO: we need to write this. See https://phabricator.wikimedia.org/T292738


Deployment

See https://phabricator.wikimedia.org/T292739
