the Ability To Duplicate
The problem
Nothing lasts forever. In the digital age, nothing lasts.
Companies come and go, platforms and services are stood up and then abandoned, and new government administrations limit access to public data. When data become available, it's only natural to want to get it and save it, just in case it goes away.
Data duplication isn't just for hoarders. Commercial companies copy data across continents to enable faster access for nearby customers. Small, focused stores of data might be useful when you don't want to rely on large, potentially flakey government services for your core business operations (here's to you, NASA DAACs).
Tooling
There's a couple tools that help with geospatial data duplication:
- STAC: The SpatioTemporal Asset Catalog (STAC) specification is designed for search and discovery. Even if a data provider doesn't have a STAC catalog for their data, you can make one yourself to help learn what's there.
- obstore: Whether it's one of the big three cloud providers in the US, or smaller providers in Europe, obstore is a new Python library that is designed to get and put data from all of them. obstore is built with Rust for high performance.
- stac-geoparquet: One of the core tenants of cloud-native geospatial is that data at rest should still be useful. stac-geoparquet is a small specification for putting STAC into geoparquet files. This enables both compressed storage and queryability via DuckDB.
Putting it together
I've made a proof-of-concept script at developmentseed/atd to demonstrate how you might bring these concepts together to duplicate data quickly and efficiently. It's annotated source code is available at http://developmentseed.org/atd/ if you want to check it out. The script isn't intended for production usage, but more to demonstrate how you might make a duplication pipeline using these modern tools.
Motivation
This work was motivated by an excellent conversation with our partners during Development Seed Team Week 2025 in Washington, DC. In response to the question "How do we get away from making unnecessary copies of data?", one partner replied (paraphrasing):
While we don't always want to duplicate data, we always want to maintain the ability to duplicate.
That phrase stuck with me, and here we are.