6 changes: 6 additions & 0 deletions _quarto.yml
@@ -71,6 +71,12 @@ website:
- section: PMTiles
contents:
- pmtiles/intro.qmd
- section: Data Producer Guide
contents:
- data-producer-guide/index.qmd
- data-producer-guide/data-preparation.qmd
- data-producer-guide/providing-data.qmd
- data-producer-guide/inclusion-workflow.qmd
- href: glossary.qmd
text: Glossary
- href: cookbooks/index.qmd
274 changes: 274 additions & 0 deletions data-producer-guide/data-preparation.qmd
@@ -0,0 +1,274 @@
---
title: "Data Preparation"
---

The format and structure of your data can either hinder visualization or ensure a smooth experience in any VEDA instance. This section lays out the formats, file sizes, and data structures that deliver the best performance.

## Suggested File Formats

### Raster Data

**1. Cloud Optimized GeoTIFF (COG)**

- ***Best for:***
- 2D gridded data (e.g., NO₂, AOD, precipitation snapshots)
- ***Why use it:***
- Enables HTTP range requests for fast map visualization
- Optimized for web-based tile access
- ***Recommendations:***
- Use internal tiling
- Generate overviews for multi-resolution access

See [COG details](/cloud-optimized-geotiffs/cogs-details.qmd) for more background.

**2. Zarr (Preferred for Multidimensional Data)**

- ***Best for:***
- Time-series and multi-variable datasets (e.g., GEOS-CF, IMERG)
- Large-scale analytics and cloud-native workflows

- ***Why use it:***
- Designed for cloud object storage with parallel, chunk-based access
- Enables efficient subsetting across time and space
- Works seamlessly with tools like `xarray`, `dask`, and Python analytics ecosystems

- ***Recommendations:***
- Align chunking with access patterns (e.g., time vs spatial queries)
- Store data in cloud object storage (e.g., S3) with clear directory structure
- Consolidate metadata for faster access

See [Zarr details](/zarr/intro.qmd) for more background.


**3. NetCDF4 (CF-compliant)**

- ***Best for:***
- Widely distributed scientific datasets and legacy data formats
- Interoperability with existing Earth science tools and workflows

- ***Why use it:***
- Mature, widely supported format across the scientific community
- Supports chunking and compression (via HDF5)
- Compatible with most analysis tools and libraries

- ***Recommendations:***
- Ensure chunking is properly configured (avoid default chunking)
- Use compression (e.g., gzip with shuffle filter)
- Consider complementing with a **Virtual Zarr layer (e.g., Kerchunk)** for improved cloud performance and scalability

See [NetCDF details](/kerchunk/intro.qmd) for more background.

### **Vector Data**

**1. GeoJSON (Preferred)**

- ***Best for:***
- Small to medium-sized vector datasets
- Web-based visualization and APIs

- ***Why use it:***
- Widely supported across web mapping tools
- Human-readable and easy to debug
- Native support in most visualization libraries

- ***Recommendations:***
- Avoid for large datasets due to performance limitations


**2. Shapefiles**

- ***Best for:***
- Legacy GIS workflows
- Interoperability with traditional GIS tools

- ***Why use it:***
- Broad compatibility across GIS software
- Common exchange format in many datasets

- ***Recommendations:***
- Avoid for cloud-native workflows
- Consider converting to GeoJSON or GeoParquet for better performance


**3. GeoParquet (Emerging / Future Recommended)**

- ***Best for:***
- Large-scale vector datasets
- Analytical and cloud-native workflows

- ***Why use it:***
- Columnar format optimized for performance
- Efficient storage and query capabilities
- Well-suited for big data processing

- ***Recommendations:***
- Use for large datasets where performance is critical
- Partition data appropriately for scalable access


---

### **Tabular Data**

**1. CSV**

- ***Best for:***
- Simple, small tabular datasets
- Data exchange and quick inspection

- ***Why use it:***
- Universally supported
- Easy to read and use

- ***Recommendations:***
- Avoid for large datasets due to size and performance limitations


**2. JSON (Structured as Tabular)**

- ***Best for:***
- Lightweight structured data
- API responses and metadata

- ***Why use it:***
- Flexible schema
- Easy integration with web applications

- ***Recommendations:***
- Ensure data is structured as records (row-based)
- Avoid deeply nested structures for tabular use cases
- Not recommended for large datasets


**3. Parquet (Preferred for Large Tabular Data)**

- ***Best for:***
- Large-scale tabular datasets
- Analytical and cloud-native workflows

- ***Why use it:***
- Columnar storage enables efficient queries
- Compressed and optimized for performance
- Works well with analytics tools (e.g., Spark, Pandas)

- ***Recommendations:***
- Use for large datasets instead of CSV or JSON
- Partition data for scalable access in cloud environments

## **Optimal File Sizes**

### **Cloud Optimized GeoTIFF (COG)**

- ***Recommended:***
- **10 MB – 1 GB** per file

- ***Avoid:***
- Very small files (<10 MB) — high request overhead
- Very large files (>2–5 GB) — reduced interactivity and slower partial reads


### **Zarr Stores**

- ***Recommended:***
- Total dataset size can scale to TBs or more
- Individual chunk sizes: **~1–10 MB**

- ***Notes:***
- Performance depends on chunking strategy rather than total size
- Designed for scalable, cloud-native access


### **NetCDF4 (HDF5-based)**

- ***Recommended:***
- File size should align with dataset scope and access patterns
- Moderate file sizes (e.g., ~x MB – x GB) are often practical

- ***Notes:***
- Performance is driven primarily by chunking and metadata layout
- Large files are acceptable if properly chunked and cloud-optimized (e.g., via Kerchunk)


---

## **Optimal Chunks for Visualization**

### **Chunking Guidelines**

- ***Chunk Size Target:***
- **~1–10 MB per chunk**

- ***Avoid:***
- Very small chunks (<256 KB) — request overhead dominates
- Very large chunks (>50–100 MB) — slow reads and poor interactivity


### **Key Considerations**

- Chunking should reflect **expected access patterns**:
- Spatial access (map visualization)
- Temporal access (time-series analysis)

- Balanced chunking across dimensions is recommended for mixed-use workloads (e.g., dashboards)

- Many legacy datasets use:
- Very small chunks
- Inefficient file layouts

→ This leads to **severely degraded cloud performance**

- Cloud-native best practices emphasize:
- Minimizing the number of network requests
- While still enabling efficient subsetting of data

### **Balanced Strategy**

The chunking strategy should ultimately be determined by the dataset’s expected use cases. Below is a balanced approach that supports both map visualization and time-series access:

| Dimension | Recommended |
|-----------|-------------|
| Time | 1–24 timesteps per chunk |
| Spatial | 256 × 256 or 512 × 512 pixels |
| Chunk size | ~1–10 MB per chunk |

**Why this matters:**

- Enables efficient subsetting across both time and space
- Reduces unnecessary data reads (minimizes I/O overhead)
- Improves responsiveness in interactive dashboards (e.g., VEDA)
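
A quick way to sanity-check a candidate chunk shape against the 1–10 MB target is simple arithmetic over the shape and dtype:

```python
import numpy as np

def chunk_size_mb(shape, dtype="float32"):
    """Size in MB of a single chunk with the given shape and dtype."""
    return int(np.prod(shape)) * np.dtype(dtype).itemsize / 1e6

# A balanced chunk per the table above: 8 timesteps of 512 x 512 float32 pixels.
print(f"{chunk_size_mb((8, 512, 512)):.1f} MB")  # → 8.4 MB, within the target
```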


---

## **Compression**

### **Best Practices**

- Use **DEFLATE (gzip)** compression
- Use moderate compression levels (e.g., **~x–y**) to balance size and performance
- Apply a **shuffle filter** to improve compression efficiency
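
At the HDF5 level, these three practices map directly onto `h5py` dataset-creation options; the dataset name and chunk shape here are illustrative:

```python
import h5py
import numpy as np

data = np.random.rand(720, 1440).astype("float32")

with h5py.File("field.h5", "w") as f:
    f.create_dataset(
        "field",
        data=data,
        chunks=(360, 720),       # explicit chunking (~1 MB per chunk)
        compression="gzip",      # DEFLATE
        compression_opts=4,      # moderate compression level
        shuffle=True,            # shuffle filter before compression
    )
```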


---

### **HDF/NetCDF-Specific Considerations**

When preparing HDF/NetCDF data for cloud environments:

- Use formats that support **chunking, compression, and consolidated metadata**
*(HDF5 / NetCDF4; not supported in HDF4 or NetCDF3)*

- Consolidate metadata to enable efficient access in a **single request**

- Tune chunk sizes appropriately:
- Preferred: **~1–10 MB per chunk**
- Acceptable range: **100 KB – 16 MB**

- Design chunk shapes based on expected access patterns:
- Spatial visualization (map-based access)
- Time-series analysis

- Apply **gzip (deflate) compression with shuffle filter**

- Include in data product documentation:
- Instructions for **direct cloud access**
- Guidance on using client libraries (e.g., `xarray`, `kerchunk`) for efficient data access
64 changes: 64 additions & 0 deletions data-producer-guide/inclusion-workflow.qmd
@@ -0,0 +1,64 @@
---
title: "Data Inclusion Workflow"
---

## **What the Data Producer Should Do**

1. **Prepare Data**
- Convert data to a cloud-optimized format (**COG**, **Zarr**, or **CF-compliant NetCDF4**)
- Apply appropriate **chunking and compression** (especially for NetCDF4)

2. **Validate Data**
- Verify coordinate reference systems (CRS)
- Ensure metadata is complete, consistent, and CF-compliant (where applicable)

3. **Upload to Cloud Storage**
- Upload data to:
- The S3 bucket/prefix provided by the VEDA team, **or**
- A publicly accessible S3 bucket managed by the data provider
- Ensure correct access permissions (e.g., public-read if applicable)

4. **Provide Metadata**
- Dataset description and purpose
- Variables and units
- Temporal and spatial coverage
- Preferred colormaps and rescaling parameters (for visualization)
- Citation information (DOI, authors, version, etc.)


---

## **What the ODSI/DSE Team Does**

1. **Data Review**
- Validate format, accessibility, and performance
- Ensure compatibility with VEDA infrastructure

2. **Ingestion**
   - Register the dataset in the **STAC catalog**
- Create a **Virtual Zarr store** (e.g., via Kerchunk) if needed

3. **Optimization** *(if needed)*
- Rechunking
- Format conversion (e.g., NetCDF → Zarr)

4. **Integration**
   - Integrate the dataset into the target VEDA instance (e.g., AIR4US)
- Configure visualization layers and access endpoints

5. **QA/QC**
- Verify rendering and map performance
- Validate query and analytics workflows


---

## **Timeline for Data Inclusion**

- Timelines vary based on:
- Dataset size
- Format readiness
- Required optimization steps
- Team capacity

- For current estimates, please contact the ODSI/DSE team.
29 changes: 29 additions & 0 deletions data-producer-guide/index.qmd
@@ -0,0 +1,29 @@
---
title: "Data Producer Guide"
---

This guide provides recommendations for preparing and submitting data for inclusion in VEDA instances. Follow the sections below to get your dataset ready for ingestion.

## **VEDA Instances**

Data prepared using this guide can be integrated into the following VEDA-powered platforms:

- [U.S. Greenhouse Gas Center](https://earth.gov/ghgcenter)
- [MAAP (Multi-Mission Algorithm and Analysis Platform)](https://maap-project.org/)
- [Disasters Learning Portal](https://disasters.nasa.gov/)
- [Earth Information Center](https://earth.gov/)
- [Water Insight](https://earth.gov/water)
- [Air Quality (AIR4US)](https://www.earth.gov/air4us/)

---

## **Contents**

### [Data Preparation](data-preparation.qmd)
Guidance on recommended file formats (COG, Zarr, NetCDF, GeoParquet), optimal file sizes, chunking strategies, and compression best practices.

### [Providing Data](providing-data.qmd)
Information on where to host your data, storage location guidelines, open data requirements, and how to include proper data citation.

### [Data Inclusion Workflow](inclusion-workflow.qmd)
A step-by-step workflow for both data producers and the ODSI/DSE team, from data preparation and upload through ingestion, optimization, and QA/QC.