User talk:GModena (WMF)/Pipelines Repo Structure

Deployment process of shared airflow-dags repository

The document mentions that the shared repository's deployment process will be owned by the Data Engineering team, but I think it would be better if teams could deploy by themselves. One idea is that each team has access to deployment.eqiad.wmnet and can run a scap job with their team name as a parameter. Scap would recognize the team name and deploy airflow-dags to the corresponding instance, with some Puppet config indicating the corresponding DAG_FOLDER for that team. Note that this way, a given team's Airflow instance would only execute the DAGs in its corresponding directory. Like this, we wouldn't have any cross-team dependency in the deployment process, no? That said, the Data Engineering team would indeed own the scap deployment code (Puppet and scripts), so if a team needed a change in the deployment process itself, then yes, the Data Engineering team would be the one responsible for it. Mforns (WMF) (talk) 14:45, 8 November 2021 (UTC)

  • Thanks for clarifying. Being able to trigger deployments ourselves would be a requirement, though I still can't picture how that would work on a monorepo with a layout split by team. That's on me for not knowing how scap works :). The last part, re: Puppet, is what I mean by "owning" the process. GModena (WMF) (talk)
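
For illustration, here is a rough sketch of how the per-team DAG_FOLDER split described above could work. The directory names, team identifiers, and deploy path below are assumptions made for the sake of the example, not an agreed layout or the actual deployment tooling:

```python
# Hypothetical monorepo layout (names are assumptions, not an agreed convention):
#
#   airflow-dags/
#       analytics/dags/...
#       platform_eng/dags/...
#       research/dags/...
#
# Each team's Airflow instance would point its dags_folder at its own
# subdirectory, e.g. via Airflow's standard environment-variable override:
#
#   AIRFLOW__CORE__DAGS_FOLDER=/srv/deployment/airflow-dags/platform_eng/dags
#
# so the scheduler only parses that team's DAG files. A deploy wrapper could
# derive the path from a team parameter along these lines:

from pathlib import Path

AIRFLOW_DAGS_ROOT = Path("/srv/deployment/airflow-dags")  # assumed deploy target


def dags_folder_for(team: str) -> Path:
    """Return the DAG folder a given team's Airflow instance should use."""
    folder = AIRFLOW_DAGS_ROOT / team / "dags"
    if not folder.is_dir():
        raise ValueError(f"No DAG folder found for team {team!r}: {folder}")
    return folder


if __name__ == "__main__":
    print(dags_folder_for("platform_eng"))
```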


Development experience and environments

The document mentions that the dev experience is very important, and that you prefer not to break the current setup until there's tooling in place for the shared airflow-dags repo. I agree with that; otherwise we'd be blocking teams while potentially making rushed decisions. Regarding environments, this is the idea we're trying to implement for the shared airflow-dags repository:

  • Production environment: The Airflow instances we currently have are not yet called production, because we're still missing some pieces (scap deployment, dependency sync script, etc.). But the Data Engineering team's idea is to add those missing pieces to the current instances and make them production, rather than adding more production instances on top of those. Would that be OK?
 * No objection from me about reusing the current hosts, provided we can carry on development in one of the envs you describe below. However, I'd like us to be a bit more specific about SLOs and a RACI for prod systems and processes. Our goal is to provide a (semi) self-service platform for onboarding data pipelines, and we'll need to manage expectations (and provide guarantees) to client teams. It does not have to be anything fancy, and there are templates we can use. What's your take on this? GModena (WMF) (talk)
  • Development environment: The shared airflow-dags repository has a script (or set of scripts), maintained by the Data Engineering team, that allows you to quickly spin up an Airflow instance. This could be used on a stats100* machine to quickly test and troubleshoot DAGs at dev time. For instance: ssh into a stats machine, git clone the repo (if it's not already there), get the latest code (either fetch a branch or scp the code over), execute the script to spin up an Airflow instance, create an ssh tunnel, and run the job from the UI (a minimal sketch of what such a script could do is included after this comment).
 * This sounds great, and something we could easily integrate into our workflow (we already orchestrate via ssh). I would like it if the dev envs could match (to some degree) prod/staging conventions (e.g. file system hierarchy, binaries, etc.). Would this be feasible? GModena (WMF) (talk)
  • Staging environment (tentative): Each team has a dedicated permanent branch (e.g. platform-staging) that is automatically deployed every, say, 10 minutes to a secondary Airflow instance running on the same machine as the production instance (or maybe better, an Airflow staging machine shared by all teams, but with a separate instance per team). Whenever a new DAG has been tested in development, it can be merged to the staging branch and the job activated in the corresponding staging instance (a sketch of the branch sync is also included after this comment).
* +1. Do you already plan any sandboxing at the YARN/HDFS level? Is that something we'd be responsible for? GModena (WMF) (talk)
  • CI and code standards: I think it would be cool to have a single CI setup and code standard for all code inside the shared airflow-dags repo. We could, for starters, copy all your ideas to the shared repo, no? Let's discuss with other teams, but I like the ideas mentioned in the doc (pytest, flake8, mypy, etc.).

Mforns (WMF) (talk) 15:19, 8 November 2021 (UTC)
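
To make the development-environment idea above a bit more concrete, here is a minimal sketch of what such a spin-up script could do on a stats100* host. The checkout path, team directory, port, and the use of `airflow standalone` are assumptions for illustration, not the actual Data Engineering tooling:

```python
# Minimal dev spin-up sketch (assumed: a local git clone, a hypothetical team
# directory, and a recent Airflow 2.x CLI that provides `airflow standalone`).
import os
import subprocess
from pathlib import Path

REPO = Path.home() / "airflow-dags"   # assumed location of the clone on the stats host
TEAM = "platform_eng"                 # hypothetical team directory name

env = os.environ.copy()
env["AIRFLOW_HOME"] = str(Path.home() / "airflow-dev")
env["AIRFLOW__CORE__DAGS_FOLDER"] = str(REPO / TEAM / "dags")
env["AIRFLOW__CORE__LOAD_EXAMPLES"] = "False"
env["AIRFLOW__WEBSERVER__WEB_SERVER_PORT"] = "8081"  # tunnel this port to reach the UI

# `airflow standalone` initialises the metadata DB and runs the scheduler and
# webserver together, which is enough for testing and troubleshooting DAGs.
subprocess.run(["airflow", "standalone"], env=env, check=True)
```

An ssh tunnel to the chosen port (8081 in this sketch) would then expose the UI on your laptop.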
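
Similarly, a hedged sketch of the staging sync: a small job, run every ~10 minutes (e.g. from a timer or cron), that fast-forwards the team's staging branch on the staging instance. The checkout path and branch name below are assumptions:

```python
# Hypothetical staging sync for one team's Airflow staging instance.
import subprocess
from pathlib import Path

STAGING_CHECKOUT = Path("/srv/airflow-staging/platform_eng/airflow-dags")  # assumed path
STAGING_BRANCH = "platform-staging"  # per-team permanent branch, as proposed above


def sync() -> None:
    subprocess.run(["git", "fetch", "origin", STAGING_BRANCH],
                   cwd=STAGING_CHECKOUT, check=True)
    # Fast-forward only, so a rewritten or diverged branch fails loudly instead
    # of silently changing what the staging instance is running.
    subprocess.run(["git", "merge", "--ff-only", f"origin/{STAGING_BRANCH}"],
                   cwd=STAGING_CHECKOUT, check=True)


if __name__ == "__main__":
    sync()
```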

* +1 for sharing standards in the airflow-dags repo. Note that we don't make many assumptions about DAGs themselves at this stage, other than having a validation suite (a minimal example is sketched below). The discussion (and code checks) around linting/type checking/coverage applies to project code, which will be owned/operationalised by the Generated Data Platform team. It's not meant to be prescriptive for anything in airflow-dags (though I'd be happy to chip in on shared standards). GModena (WMF) (talk)
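
As a concrete example of what such a validation suite could look like, here is a minimal pytest sketch. The per-team `*/dags` layout is an assumption; the DagBag import check itself is a fairly common Airflow testing pattern:

```python
# Every DAG file in the shared repo must import cleanly under each team's folder.
from pathlib import Path

import pytest
from airflow.models import DagBag

# Assumed layout: <repo root>/<team>/dags/*.py
TEAM_DAG_FOLDERS = sorted(p for p in Path(".").glob("*/dags") if p.is_dir())


@pytest.mark.parametrize("dag_folder", TEAM_DAG_FOLDERS, ids=str)
def test_dags_import_without_errors(dag_folder):
    dagbag = DagBag(dag_folder=str(dag_folder), include_examples=False)
    assert not dagbag.import_errors, f"DAG import errors: {dagbag.import_errors}"
    assert dagbag.dags, f"No DAGs found under {dag_folder}"
```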

SQL code

An interesting topic is: where does SQL code go? Historically, we in Data Engineering have placed the SQL code in the same repo as the DAG code. The reason is probably that it doesn't need to be compiled and, on the other hand, is difficult to treat as a packaged dependency (or at least, it's easier if we don't). So I'd say one easy option is to allow SQL code to live in the shared airflow-dags repo. The downside is that it makes the boundary between DAG code and data-processing code fuzzy... But I don't think we can forbid non-DAG code from existing in the airflow-dags repo, because there will be many small utilities and jobs that need some Python code or a small bit of SQL, and it would be a bit discouraging to force those to be packaged as dependencies etc., no? Mforns (WMF) (talk) 16:18, 8 November 2021 (UTC)

  • My current proposal draws a hard line between Airflow DAGs and "project code", and SQL is part of the latter. As a norm, my line of thinking would be to only put DAGs in the shared `airflow-dags` repo. This suits our vendoring model and current practices. That said, I can definitely see SQL / Python code that is used cross-team, and not bound to a specific Generated Data Platform project, living in the shared airflow-dags repo; I'm thinking Operators, data loaders, etc. (a small sketch follows below). Does that make sense? GModena (WMF) (talk)
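
To illustrate the kind of shared, non-project-specific code this could cover, here is a small sketch of a helper that lets SQL snippets sit next to the DAGs that use them. The directory convention and the helper itself are hypothetical:

```python
# Hypothetical shared helper living in the airflow-dags repo (e.g. in a common
# package) so small SQL files can live next to the DAGs that use them, without
# being packaged as a separate dependency.
from pathlib import Path


def load_sql(dag_file: str, name: str) -> str:
    """Read a SQL file stored in a sql/ directory next to the calling DAG file.

    Assumed layout (not an agreed convention):
        my_team/dags/pageviews.py
        my_team/dags/sql/pageviews_daily.sql
    """
    return (Path(dag_file).parent / "sql" / name).read_text()


# Inside a DAG file a team could then do:
#   query = load_sql(__file__, "pageviews_daily.sql")
# and pass `query` to whichever SQL-capable operator the job uses.
```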