Apache Airflow (data pipeline orchestration)
A workflow scheduler that runs multi-step data pipelines on a schedule and in the right order — each pipeline is a DAG (directed acyclic graph) of tasks: fetch, cloud-optimize, validate, publish, register.
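To make "a DAG of tasks" concrete, here is a minimal plain-Python sketch of how a dependency graph determines run order. It uses the standard-library graphlib module rather than Airflow's own API, and the task names and the strictly linear dependencies are assumptions taken from the list above.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks that must
# finish before it can start (names assumed from the list above).
pipeline = {
    "fetch": set(),
    "cloud_optimize": {"fetch"},
    "validate": {"cloud_optimize"},
    "publish": {"validate"},
    "register": {"publish"},
}


def run_order(dag):
    """Return one valid execution order for the dependency graph.

    Raises graphlib.CycleError if the graph contains a cycle, which is
    why the "acyclic" part of DAG matters.
    """
    return list(TopologicalSorter(dag).static_order())


print(run_order(pipeline))
```

In Airflow itself the same dependencies are declared between task objects (for example with the >> operator), and the scheduler works out the order.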
Why it matters
Getting raw data into a cloud platform isn’t one step — it’s a chain (download → convert to COG/Zarr → write STAC metadata → load into the catalog). Airflow runs that chain reliably, retries failures, and re-runs when new data arrives, so the catalog and stores stay current without anyone babysitting them.
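The "retries failures" behaviour can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Airflow's implementation; Airflow configures retries per task through settings such as retries and retry_delay. The names run_with_retries and flaky_download are hypothetical.

```python
import time


def run_with_retries(task, retries=2, delay_seconds=0.0):
    """Call task(); if it raises, try again up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the failure
            time.sleep(delay_seconds)  # wait before the next attempt


# Hypothetical flaky step: fails on the first call, succeeds on the second.
calls = {"n": 0}


def flaky_download():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient network error")
    return "granule.tif"


result = run_with_retries(flaky_download)
```

A transient network error on the first attempt no longer stops the chain: the step is called again and the downstream steps proceed with its result.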
Where you’ll meet it
- VEDA’s veda-data-airflow orchestrates the ingestion that feeds the PgSTAC catalog and the ARCO data store.
- The “Tools for cloud-optimizing & publishing data” box in VEDA’s architecture is Airflow-driven.
- Most cloud-EO platforms run Airflow (or a similar orchestrator like Prefect/Argo) behind the scenes for ingestion.
- It’s the piece that turns a provider’s new granule into a searchable, renderable item automatically.
In plain terms
The receiving-and-shelving crew of the library, on a timetable: every shipment gets unpacked, labelled, catalogued, and put on the right shelf — in order, every time, on its own.