g41·concept

Apache Airflow (data pipeline orchestration)

A workflow scheduler that runs multi-step data pipelines on a schedule and in the right order — each pipeline is a DAG (directed acyclic graph) of tasks: fetch, cloud-optimize, validate, publish, register.
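To make the DAG idea concrete, here is a minimal sketch in plain Python. It uses the standard-library graphlib module, not Airflow's own API, and only shows how declared dependencies determine a valid run order. The task names come from the list above; the strictly linear dependencies and the names PIPELINE and run_order are assumptions for illustration.

```python
from graphlib import TopologicalSorter

# Each key maps a task to the set of tasks that must finish before it.
# The linear chain is an assumption; a real pipeline may branch and rejoin.
PIPELINE = {
    "fetch": set(),
    "cloud-optimize": {"fetch"},
    "validate": {"cloud-optimize"},
    "publish": {"validate"},
    "register": {"publish"},
}


def run_order(dag):
    """Return one valid execution order for the graph.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is why the graph has to be acyclic.
    """
    return list(TopologicalSorter(dag).static_order())


order = run_order(PIPELINE)
print(order)
```

An orchestrator does the same bookkeeping at scale: it starts a task only once everything upstream of it has succeeded.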

Why it matters

Getting raw data into a cloud platform isn’t one step — it’s a chain (download → convert to COG/Zarr → write STAC metadata → load into the catalog). Airflow runs that chain reliably, retries failures, and re-runs when new data arrives, so the catalog and stores stay current without anyone babysitting them.
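The "retries failures" behaviour can be sketched in a few lines of plain Python. This is not Airflow's implementation or API, only an illustration of the idea under stated assumptions: the step names mirror the chain above, while run_with_retries, run_chain, flaky_convert, and the retry count are hypothetical.

```python
import time


def run_with_retries(task, retries=2, delay_seconds=0.0):
    """Call task(); if it raises, try again up to `retries` more times."""
    for attempt in range(1, retries + 2):
        try:
            return task()
        except Exception:
            if attempt > retries:
                raise  # out of retries: let the failure surface
            time.sleep(delay_seconds)


def run_chain(steps, retries=2):
    """Run (name, callable) steps in order and return the names that completed.

    A step that still fails after its retries raises, so later steps never run.
    """
    completed = []
    for name, step in steps:
        run_with_retries(step, retries=retries)
        completed.append(name)
    return completed


# Hypothetical steps: "convert" fails once with a transient error, then succeeds.
calls = {"convert": 0}


def flaky_convert():
    calls["convert"] += 1
    if calls["convert"] == 1:
        raise RuntimeError("transient failure")


steps = [
    ("download", lambda: None),
    ("convert", flaky_convert),
    ("write_metadata", lambda: None),
    ("load_catalog", lambda: None),
]
completed = run_chain(steps)
print(completed, calls)
```

Here the transient failure in the convert step is absorbed by a retry, so the chain still finishes. A step that keeps failing would stop the chain before anything downstream runs.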

Where you’ll meet it

  • VEDA’s veda-data-airflow orchestrates the ingestion that feeds the PgSTAC catalog and the ARCO data store.
  • The “Tools for cloud-optimizing & publishing data” box in VEDA’s architecture is Airflow-driven.
  • Many cloud Earth-observation (EO) platforms run Airflow, or a similar orchestrator such as Prefect or Argo, behind the scenes for ingestion.
  • It’s the piece that turns a provider’s new granule into a searchable, renderable item automatically.

In plain terms

The receiving-and-shelving crew of the library, on a timetable: every shipment gets unpacked, labelled, catalogued, and put on the right shelf — in order, every time, on its own.