Cross-DAAC composition — joining HLS + GEDI + ICESat-2 + SMAP in one workflow
Cross-DAAC composition
“Joining HLS (LP) + GEDI (ORNL) + ICESat-2 (NSIDC) + SMAP (NSIDC) for a single science question requires juggling 4 access libraries, 3 auth flows, and inconsistent formats.” — Research Agent D, identifying this as Pattern B (one of three recurring friction patterns across the NASA Earth-data ecosystem)
The pattern below dissolves that friction.
The pattern in 5 steps
- Authenticate once. Earthdata Login covers all 12 DAACs. Use `earthaccess.login(strategy="netrc")` and never log in again per-DAAC. (If you see a tutorial that has you log in 4 times, it’s out of date.)
- Search once, federated. `earthaccess.search_data(short_name=..., bounding_box=..., temporal=...)` queries CMR, which federates across DAACs. You don’t pick a DAAC; you pick a dataset.
- Open via the format-appropriate library, not the DAAC-appropriate one. HLS = COG → `rioxarray.open_rasterio`. GEDI = HDF5 → `h5py` or `earthaccess.open` + custom reader. ICESat-2 = HDF5 → `h5py` + `icepyx` helpers. SMAP = HDF5 → `xarray` with `h5netcdf` engine. The format dictates the reader, not the DAAC.
- Align in a single xarray Dataset or DataFrame. Resample temporally (HLS is observation-time, GEDI is footprint, SMAP is daily, ICESat-2 is orbit-time). Resample spatially (HLS 30m, GEDI footprint, SMAP 9km, ICESat-2 along-track). Pick the coarsest common spatial grid for analysis (usually SMAP at 9km) and aggregate the finer products up; or pick a sparse-vector representation (one row per GEDI footprint, with HLS / SMAP / ICESat-2 values sampled at the footprint).
- Cache aggressively. Cloud-direct from `s3://` is fast in us-west-2 (where most NASA-EO cloud data lives). Out of region, download once locally and re-read. Don’t pay egress on every script run.
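The "aggregate the finer products up" option in step 4 can be sketched with plain pandas. This is a minimal illustration, not the recipe's own method: it assumes a regular latitude/longitude grid with a hypothetical 0.1° cell size as a stand-in for the real SMAP grid (which is not regular in degrees), and the function name `aggregate_to_grid` and the synthetic footprints are invented for the example.

```python
import numpy as np
import pandas as pd


def aggregate_to_grid(points, value_col, cell_deg=0.1):
    """Average point values onto a regular lat/lon grid.

    Assumption: a regular grid in degrees stands in for the real SMAP
    grid; `cell_deg` and this function name are illustrative only.
    """
    out = points.copy()
    # Index of the cell each point falls in (floor handles negative coords).
    out["row"] = np.floor(out["lat"] / cell_deg).astype(int)
    out["col"] = np.floor(out["lon"] / cell_deg).astype(int)
    grid = (
        out.groupby(["row", "col"])[value_col]
        .agg(["mean", "count"])
        .reset_index()
    )
    # Report cell centres so the result can be joined to other gridded data.
    grid["lat"] = (grid["row"] + 0.5) * cell_deg
    grid["lon"] = (grid["col"] + 0.5) * cell_deg
    return grid[["lat", "lon", "mean", "count"]]


# Three synthetic footprints: two share a cell, one sits in the next cell north.
footprints = pd.DataFrame(
    {
        "lat": [39.01, 39.04, 39.15],
        "lon": [-104.98, -104.93, -104.98],
        "agbd": [10.0, 30.0, 50.0],
    }
)
cells = aggregate_to_grid(footprints, "agbd")
```

Keeping the per-cell `count` alongside the mean makes it possible to drop cells backed by only one or two footprints before comparing against a coarse product.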
Minimal worked example
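import earthaccess
import h5py
import pandas as pd
import rioxarray
import xarray as xr

earthaccess.login(strategy="netrc")

aoi = (-105, 38, -102, 41)  # Front Range CO: (west, south, east, north)
window = ("2022-06-01", "2022-08-31")

# --- HLS L30 (LP DAAC) ---
# cloud_cover is passed as a (min, max) range; a bare 20 would be read as the minimum
hls = earthaccess.search_data(short_name="HLSL30", bounding_box=aoi, temporal=window, cloud_cover=(0, 20))

def hls_ndvi(granule):
    # each HLS granule holds one COG per band; open only B04 (Red) and B05 (NIR)
    urls = sorted(u for u in granule.data_links() if u.endswith(("B04.tif", "B05.tif")))
    red, nir = (rioxarray.open_rasterio(fh, masked=True).squeeze("band", drop=True) for fh in earthaccess.open(urls))
    return (nir - red) / (nir + red)

# compute NDVI from B4 (Red) and B5 (NIR), per-tile per-date
hls_ndvi_tiles = [hls_ndvi(g) for g in hls[:5]]

# --- GEDI L4A (ORNL DAAC) ---
gedi = earthaccess.search_data(short_name="GEDI_L4A_AGB_Density_V2_1_2056", bounding_box=aoi, temporal=window)
gedi_records = []
for fh in earthaccess.open(gedi[:5]):
    with h5py.File(fh, "r") as f:
        for beam in [k for k in f.keys() if k.startswith("BEAM")]:
            lats = f[f"{beam}/lat_lowestmode"][:]
            lons = f[f"{beam}/lon_lowestmode"][:]
            agbd = f[f"{beam}/agbd"][:]
            # append inside the beam loop so every beam is kept, not only the last
            gedi_records.append(pd.DataFrame({"lat": lats, "lon": lons, "agbd": agbd, "beam": beam}))
gedi_df = pd.concat(gedi_records, ignore_index=True)
# orbit granules extend well beyond the AOI, so clip footprints to the bounding box
gedi_df = gedi_df[(gedi_df.lat.between(aoi[1], aoi[3])) & (gedi_df.lon.between(aoi[0], aoi[2]))]

# --- SMAP L3 (NSIDC DAAC) ---
# SPL3SMP_E is the enhanced 9 km product; plain SPL3SMP is on the 36 km grid
smap = earthaccess.search_data(short_name="SPL3SMP_E", bounding_box=aoi, temporal=window)
smap_ds = xr.open_mfdataset(
    earthaccess.open(smap[:10]),
    engine="h5netcdf",
    group="Soil_Moisture_Retrieval_Data_AM",  # assumed group name for the AM overpass
    phony_dims="sort",  # SMAP HDF5 files carry no named dimensions
    combine="nested",
    concat_dim="time",  # one file per day, stacked along a new time axis
)
# extract `soil_moisture` per 9km pixel, daily

# --- Sample SMAP at GEDI footprint locations + dates ---
# (left as exercise; pseudo-code below)
# for each gedi row, find nearest SMAP pixel + nearest SMAP date → join
# Now you have one DataFrame: lat · lon · date · agbd (GEDI) · ndvi (HLS) · sm (SMAP)
# Cross-dataset analysis ready.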
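The sampling step that the example leaves as pseudo-code (nearest pixel plus nearest date for each footprint) can be sketched as follows. This is one possible approach under stated assumptions, not the recipe's own implementation: it assumes the gridded product has already been reduced to a `(time, lat, lon)` NumPy array with 1-D cell-centre coordinates, and that the footprint table carries a `date` column. The helper names (`to_days`, `nearest_index`, `sample_grid_at_points`) and all data are synthetic stand-ins.

```python
import numpy as np
import pandas as pd


def to_days(dates):
    """Convert datetimes to fractional days since 1970-01-01."""
    delta = pd.DatetimeIndex(dates) - pd.Timestamp("1970-01-01")
    return np.asarray(delta / pd.Timedelta(days=1), dtype="float64")


def nearest_index(centers, values):
    """Index of the nearest entry in `centers` for each entry in `values`."""
    centers = np.asarray(centers, dtype="float64")
    values = np.asarray(values, dtype="float64")
    # Brute-force distance table: fine for a sketch, memory-hungry at scale.
    return np.abs(values[:, None] - centers[None, :]).argmin(axis=1)


def sample_grid_at_points(points, grid, lat_centers, lon_centers, times):
    """Attach the nearest grid value (in space and time) to each point.

    Assumptions: `grid` has shape (time, lat, lon) with 1-D cell-centre
    coordinates, and `points` has `lat`, `lon`, and `date` columns.
    """
    t_idx = nearest_index(to_days(times), to_days(points["date"]))
    y_idx = nearest_index(lat_centers, points["lat"])
    x_idx = nearest_index(lon_centers, points["lon"])
    out = points.copy()
    out["sm"] = grid[t_idx, y_idx, x_idx]
    return out


# Synthetic stand-ins for a daily gridded product and two footprints.
times = pd.date_range("2022-06-01", periods=3, freq="D")
lat_centers = np.array([38.5, 39.5, 40.5])
lon_centers = np.array([-104.5, -103.5, -102.5])
grid = np.arange(27, dtype="float64").reshape(3, 3, 3)  # (time, lat, lon)

gedi_points = pd.DataFrame(
    {
        "lat": [38.6, 40.4],
        "lon": [-104.4, -102.6],
        "date": pd.to_datetime(["2022-06-01 10:00", "2022-06-03 02:00"]),
        "agbd": [12.0, 85.0],
    }
)
joined = sample_grid_at_points(gedi_points, grid, lat_centers, lon_centers, times)
```

The result keeps one row per footprint, which matches the sparse-vector representation described in step 4. For real data a distance cutoff is worth adding so that footprints far from any valid pixel or date are not silently matched.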
Common gotchas
- `earthaccess.open` returns file-like handles, not paths. Some readers (`rioxarray`, `h5py`) accept them directly; others need a download first.
- us-west-2 S3 credentials expire in 1 hour and are only valid in us-west-2. If your script takes >1 hr, refresh with `earthaccess.get_s3_credentials(...)`. If you’re outside us-west-2, you’re using HTTPS, not S3: fine for small jobs but slow + egress-billed for large.
- GEDI footprint coords are not on a grid. They’re discrete along-orbit samples. Don’t try to `xarray`-stack them; treat as a sparse vector layer.
- CMR pagination caps at 2000 per page but cmr-stac caps at 100. If you use cmr-stac for search, expect ~28× slower than python-cmr / earthaccess for the same query (per issue #411).
- Temporal alignment is brutal. HLS revisit is 2–3 days. GEDI footprints don’t revisit at all. SMAP is daily. ICESat-2 revisit is 91 days. Pick the question’s time resolution and aggregate accordingly.
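Picking one time resolution and aggregating to it can be sketched with pandas. The example below assumes a monthly resolution purely for illustration; the helper name `to_monthly`, the column names, and the values are synthetic, and the right resolution depends on the science question.

```python
import pandas as pd


def to_monthly(df, value_col, date_col="date"):
    """Collapse irregular observations to one mean value per calendar month."""
    months = pd.to_datetime(df[date_col]).dt.to_period("M")
    return df.groupby(months)[value_col].mean()


# Synthetic stand-ins: sparse NDVI observations and a daily soil-moisture series.
ndvi_obs = pd.DataFrame(
    {
        "date": ["2022-06-02", "2022-06-05", "2022-07-01"],
        "ndvi": [0.4, 0.6, 0.7],
    }
)
sm_obs = pd.DataFrame(
    {
        "date": pd.date_range("2022-06-29", periods=4, freq="D"),
        "sm": [0.10, 0.20, 0.30, 0.50],
    }
)

# Both series now share a monthly index and can sit side by side.
monthly = pd.concat(
    [to_monthly(ndvi_obs, "ndvi"), to_monthly(sm_obs, "sm")], axis=1
)
```

Aggregating before joining avoids pairing a single footprint or scene with an arbitrary nearby day, at the cost of smoothing out short events.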
When this pattern fails (and what to do)
- If you need sub-daily. GEO satellites (GOES, TEMPO, Himawari) live in the same CMR index but are hosted differently. The pattern adapts but the time resolution shifts.
- If you need pre-2000. Some legacy MODIS, AIRS, MERRA-2 archives are still in non-cloud-optimized formats. The `earthaccess.open` path may not work; you’ll need `earthaccess.download` first.
- If your AOI is huge. Cloud-direct stops being faster than mass-download around ~10° square per query. Switch to Harmony async + S3 download.
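The "try to open, otherwise download first" fallback can be sketched as a small wrapper. This is an assumption-laden illustration: `open_or_download` is a hypothetical helper, and the opener and downloader are injected as plain callables so the sketch runs without network access. The fake functions below only imitate the call shape `(granules)` and `(granules, local_path)`.

```python
import tempfile
from pathlib import Path


def open_or_download(granules, opener, downloader, cache_dir):
    """Try streaming access first; fall back to a local download.

    `opener(granules)` and `downloader(granules, local_path)` are injected
    callables (hypothetical stand-ins for the real access functions).
    """
    try:
        return opener(granules)
    except Exception:
        # Broad catch is deliberate in this sketch: any streaming failure
        # (for example a non-cloud-optimized archive) triggers the fallback.
        cache = Path(cache_dir)
        cache.mkdir(parents=True, exist_ok=True)
        return downloader(granules, str(cache))


def failing_opener(granules):
    """Fake opener that imitates an archive that cannot be streamed."""
    raise OSError("streaming not supported for this archive")


def fake_downloader(granules, local_path):
    """Fake downloader that writes placeholder files and returns their paths."""
    paths = []
    for name in granules:
        path = Path(local_path) / name
        path.write_text("placeholder")
        paths.append(str(path))
    return paths


with tempfile.TemporaryDirectory() as tmp:
    local_files = open_or_download(
        ["a.hdf", "b.hdf"], failing_opener, fake_downloader, tmp
    )
    all_exist = all(Path(p).exists() for p in local_files)
```

In a real workflow the two callables would presumably be `earthaccess.open` and `earthaccess.download`. Pointing `cache_dir` at a persistent directory, rather than a temporary one, is what lets later runs re-read local files instead of fetching them again.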
Power-user variant (Claude Code + MCP)
For repeated cross-DAAC composition with an agent loop, see recipes/r0X-mcp-power-user.mdx (TBD), which exposes the same dataset access as MCP tools so Claude Code or Cursor can run these joins for you. (This is the subordinated MCP-server work from paths/run-2026-05-25-001.)