Data Processing & Format Conversion
Purpose:
This document describes the data processing operations performed by Riskwolf's automated extraction services to convert raw environmental data into standardized formats for parametric insurance applications.
1. Overview
Data processing occurs within each of the 12 specialized extraction services during the download and conversion phase. These operations transform raw data from authoritative sources into consistent, standardized formats before upload to cloud storage, bridging the gap between data extraction and Data Analytics & Modelling.
Processing Objectives
- Format Standardization: Convert diverse source formats to consistent output formats
- Coordinate System Alignment: Standardize all spatial data to EPSG:4326 (WGS84)
- Temporal Filtering: Extract data within specified date ranges
- Geographic Filtering: Limit data to relevant geographic regions
- File Validation: Ensure data integrity and completeness before storage
2. Processing Operations by Data Type
2.1 Weather & Climate Data Processing
ECMWF Data (ERA5, IFS)
- Direct Download: NetCDF files retrieved via CDS API with authentication
- Date Range Filtering: Downloads limited to specified temporal windows (minimum 2-day ranges for ERA5)
- Variable Selection: Extract specific parameters (precipitation, temperature, wind components)
- Geographic Subsetting: Focus on target regions (Indonesia, Philippines, New Zealand, USA)
- Format Preservation: Maintain original NetCDF4 format with metadata
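A minimal sketch of the retrieval described above, using the public cdsapi client; the dataset name, variable list, date window, and Indonesian bounding box are illustrative, and the exact request keys depend on the CDS API version in use:

```python
import cdsapi

# Authenticated CDS client; credentials are typically read from ~/.cdsapirc.
client = cdsapi.Client()

# Illustrative ERA5 request: selected variables, a two-day window (the minimum
# range noted above), and a geographic subset (area is [North, West, South, East]).
client.retrieve(
    "reanalysis-era5-single-levels",
    {
        "product_type": "reanalysis",
        "variable": [
            "total_precipitation", "2m_temperature",
            "10m_u_component_of_wind", "10m_v_component_of_wind",
        ],
        "year": "2024",
        "month": "01",
        "day": ["01", "02"],
        "time": [f"{h:02d}:00" for h in range(24)],
        "area": [6, 95, -11, 141],   # rough Indonesia bounding box
        "format": "netcdf",          # keep the NetCDF4 output with metadata
    },
    "era5_indonesia_20240101_20240102.nc",
)
```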
CHIRPS Precipitation Data
- FTP Download: Monthly NetCDF files from Climate Hazards Center servers
- Temporal Organization: Daily data organized in a year-month directory structure
- Coverage Validation: Verify quasi-global coverage (50°S to 50°N)
- Format Standardization: Preserve NetCDF4 format for climate applications
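A hedged sketch of the monthly FTP retrieval using Python's ftplib; the host, directory, and file-name pattern are placeholders rather than the exact Climate Hazards Center layout:

```python
from ftplib import FTP
from pathlib import Path

# Placeholder host and directory; the real Climate Hazards Center layout may differ.
FTP_HOST = "ftp.chc.ucsb.edu"
REMOTE_DIR = "/pub/org/chc/products/CHIRPS-2.0/global_daily/netcdf/p05"

def download_chirps_month(year: int, month: int, out_dir: Path) -> Path:
    """Fetch one monthly NetCDF file and store it under a year/month layout."""
    filename = f"chirps-v2.0.{year}.{month:02d}.days_p05.nc"  # illustrative name
    local_path = out_dir / str(year) / f"{month:02d}" / filename
    local_path.parent.mkdir(parents=True, exist_ok=True)

    with FTP(FTP_HOST) as ftp:
        ftp.login()                                  # anonymous access
        ftp.cwd(REMOTE_DIR)
        with open(local_path, "wb") as fh:
            ftp.retrbinary(f"RETR {filename}", fh.write)
    return local_path
```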
IMD Regional Data
- API Integration: Direct connection to India Meteorological Department services
- Regional Filtering: Focus on Indian subcontinent geographic bounds
- Daily Processing: Process daily rainfall and temperature observations
- Coordinate Validation: Ensure proper geographic referencing for regional data
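A hedged sketch of the regional retrieval and bounds check; the endpoint URL, query parameters, and response fields are hypothetical stand-ins for the actual IMD service:

```python
import requests

# Hypothetical endpoint and response schema; the real IMD service will differ.
IMD_URL = "https://imd.example.org/api/daily-rainfall"
INDIA_BOUNDS = {"lat": (6.0, 38.0), "lon": (68.0, 98.0)}  # rough subcontinent bounds

resp = requests.get(IMD_URL, params={"date": "2024-01-01"}, timeout=60)
resp.raise_for_status()

# Keep only observations with coordinates inside the regional bounds.
records = [
    r for r in resp.json()["stations"]
    if INDIA_BOUNDS["lat"][0] <= r["lat"] <= INDIA_BOUNDS["lat"][1]
    and INDIA_BOUNDS["lon"][0] <= r["lon"] <= INDIA_BOUNDS["lon"][1]
]
```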
2.2 Natural Disaster Data Processing
Earthquake Data (USGS)
- Real-time API Calls: GeoJSON feeds from USGS Earthquake Hazards Program
- Magnitude Filtering: Focus on magnitude 2.5+ events relevant for impact assessment
- Temporal Windows: Hourly/daily updates based on data availability
- Format Conversion: Convert from GeoJSON to GeoParquet for efficient spatial queries
- PGA Data Integration: Link earthquake events with Peak Ground Acceleration maps
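A minimal sketch of the GeoJSON-to-GeoParquet step, assuming geopandas (with pyarrow) and the documented USGS FDSN event query; the date window and output file name are illustrative:

```python
import geopandas as gpd

# USGS FDSN event query: magnitude 2.5+ events within a temporal window.
url = (
    "https://earthquake.usgs.gov/fdsnws/event/1/query"
    "?format=geojson&minmagnitude=2.5"
    "&starttime=2024-01-01&endtime=2024-01-02"
)

# GeoJSON feed -> GeoDataFrame of epicenter points with event attributes.
quakes = gpd.read_file(url)

# GeoParquet output for efficient spatial queries downstream (requires pyarrow).
quakes.to_parquet("usgs_earthquakes_20240101_20240102.parquet")
```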
Wildfire Data (NIFC)
- ArcGIS REST API: Live feeds from National Interagency Fire Center
- Date Filtering: Extract events within specified time ranges using pandas datetime operations
- Geometry Processing: Convert from ArcGIS JSON to standard GeoDataFrame format
- Spatial Indexing: Optimize data for geographic queries and analysis
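A hedged sketch of the ArcGIS REST query and GeoDataFrame conversion; the feature-service URL and the discovery-date field name are placeholders, not the exact NIFC service definition:

```python
import geopandas as gpd
import pandas as pd
import requests

# Placeholder feature-service URL; the actual NIFC layer path may differ.
SERVICE = "https://services.arcgis.com/NIFC/arcgis/rest/services/Incidents/FeatureServer/0/query"

params = {"where": "1=1", "outFields": "*", "f": "geojson"}
resp = requests.get(SERVICE, params=params, timeout=120)
resp.raise_for_status()

# ArcGIS GeoJSON response -> GeoDataFrame of wildfire incident features.
fires = gpd.GeoDataFrame.from_features(resp.json()["features"], crs="EPSG:4326")

# Date filtering with pandas datetime operations ("discovery_date" is illustrative).
fires["discovered"] = pd.to_datetime(fires["discovery_date"], errors="coerce", utc=True)
start, end = pd.Timestamp("2024-01-01", tz="UTC"), pd.Timestamp("2024-02-01", tz="UTC")
fires = fires[fires["discovered"].between(start, end)]
```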
Tropical Storm Data (NOAA IBTrACS)
- Batch Processing: Download cyclone track data in monthly archives
- Track Reconstruction: Process individual storm paths with wind speed and pressure data
- Event-based Organization: Structure data by individual storm events and tracks
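A minimal sketch of track reconstruction from the point fixes, assuming the IBTrACS CSV layout (SID, LAT, LON columns with a units row below the header); the download URL is a placeholder:

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import LineString

# Placeholder IBTrACS CSV location; the NOAA NCEI path and version may differ.
IBTRACS_CSV = "https://example.org/ibtracs/ibtracs.ALL.list.v04r00.csv"

# Each row is one storm fix; SID identifies the event, with wind/pressure columns alongside.
fixes = pd.read_csv(IBTRACS_CSV, skiprows=[1], low_memory=False)  # skip the units row
fixes["LAT"] = pd.to_numeric(fixes["LAT"], errors="coerce")
fixes["LON"] = pd.to_numeric(fixes["LON"], errors="coerce")
fixes = fixes.dropna(subset=["LAT", "LON"])

# Reconstruct one track geometry per storm event (SID) from its ordered fixes.
fixes = fixes.groupby("SID").filter(lambda g: len(g) > 1)
tracks = fixes.groupby("SID").apply(lambda g: LineString(list(zip(g["LON"], g["LAT"]))))
tracks_gdf = gpd.GeoDataFrame({"sid": tracks.index}, geometry=tracks.values, crs="EPSG:4326")
```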
2.3 Geographic Reference Data Processing
Administrative Boundaries (GADM/GeoBoundaries)
- Shapefile Processing: Download and convert administrative boundary shapefiles
- Level Processing: Handle multiple administrative levels (0-5) separately
- Format Conversion: Convert to GeoJSON for web-compatible applications
- Country Organization: Structure data by ISO country codes for efficient access
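A minimal sketch of the shapefile-to-GeoJSON conversion with geopandas, organized by ISO country code; the function, file names, and directory layout are illustrative:

```python
from pathlib import Path

import geopandas as gpd

def shapefile_to_geojson(shp_path: Path, iso_code: str, level: int, out_root: Path) -> Path:
    """Convert one administrative-boundary shapefile to web-compatible GeoJSON."""
    gdf = gpd.read_file(shp_path).to_crs("EPSG:4326")   # enforce the standard CRS
    out_path = out_root / iso_code / f"adm{level}.geojson"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    gdf.to_file(out_path, driver="GeoJSON")
    return out_path

# e.g. shapefile_to_geojson(Path("gadm41_IDN_2.shp"), "IDN", 2, Path("boundaries"))
```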
3. Format Conversion & Standardization
3.1 Coordinate System Processing
Standard Projection
- Target CRS: All spatial data standardized to EPSG:4326 (WGS84)
- CRS Transformation: Automatic conversion from source coordinate systems
- Longitude Range: Standardized to -180° to +180° decimal degrees
- Latitude Range: Standardized to -90° to +90° decimal degrees
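A minimal sketch of both standardizations, assuming geopandas for vector layers and xarray for gridded products; the function names are illustrative:

```python
import geopandas as gpd
import xarray as xr

def standardize_crs(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Reproject a vector layer from its source CRS to EPSG:4326 (WGS84)."""
    if gdf.crs is None:
        raise ValueError("input layer has no CRS defined")
    return gdf.to_crs("EPSG:4326")

def wrap_longitudes(ds: xr.Dataset, lon_name: str = "longitude") -> xr.Dataset:
    """Shift a 0-360 degree longitude axis to the -180..+180 convention and re-sort."""
    ds = ds.assign_coords({lon_name: ((ds[lon_name] + 180) % 360) - 180})
    return ds.sortby(lon_name)
```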
Spatial Data Handling
- Point Data: Earthquake epicenters and wildfire incidents processed as geographic points
- Polygon Data: Administrative boundaries and storm tracks maintained as vector geometries
- Grid Data: Weather and climate data preserved in original grid structure with proper CRS metadata
3.2 Temporal Processing
Date Range Filtering
- Environment Configuration: Date ranges specified via RW_DATE_FROM and RW_DATE_TO parameters
- Pandas DateTime: Temporal filtering using pandas datetime operations for precise date matching
- UTC Standardization: All timestamps converted to UTC for consistency
- Date Splitting: Large date ranges split into manageable chunks for processing efficiency
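A minimal sketch of the window filtering and chunk splitting, assuming the RW_DATE_FROM/RW_DATE_TO environment variables named above; the column name and seven-day chunk size are illustrative:

```python
import os

import pandas as pd

# Date window from environment configuration, normalized to UTC.
DATE_FROM = pd.to_datetime(os.environ["RW_DATE_FROM"], utc=True)
DATE_TO = pd.to_datetime(os.environ["RW_DATE_TO"], utc=True)

def filter_window(df: pd.DataFrame, time_col: str = "timestamp") -> pd.DataFrame:
    """Keep only rows whose UTC timestamp falls inside the configured window."""
    ts = pd.to_datetime(df[time_col], utc=True)
    return df.loc[ts.between(DATE_FROM, DATE_TO)]

def split_into_chunks(start: pd.Timestamp, end: pd.Timestamp, freq: str = "7D"):
    """Split a large date range into smaller (start, end) chunks for processing."""
    edges = pd.date_range(start, end, freq=freq)
    if edges[-1] < end:
        edges = edges.append(pd.DatetimeIndex([end]))
    return list(zip(edges[:-1], edges[1:]))
```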
Time Series Handling
- Preservation of Original Frequency: Maintain data at source temporal resolution (hourly, daily, event-based)
- No Temporal Aggregation: Raw temporal resolution preserved for downstream analytics
- Metadata Retention: Original time zone and temporal metadata preserved in output files
3.3 Geographic Boundary Processing
Regional Filtering
- Country-based Filtering: Focus on target regions (Indonesia, Philippines, New Zealand, USA)
- Bounding Box Application: Geographic subsetting using coordinate bounds
- Spatial Masking: Remove data outside areas of parametric insurance interest
- Coverage Validation: Ensure complete coverage for target geographic regions
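A minimal sketch of bounding-box filtering with the geopandas .cx spatial slice; the coordinate bounds for each region are rough, illustrative values:

```python
import geopandas as gpd

# Rough, illustrative bounding boxes per target region (minx, miny, maxx, maxy).
REGION_BOUNDS = {
    "IDN": (94.0, -11.0, 141.0, 6.0),
    "PHL": (116.0, 4.0, 127.0, 21.0),
    "NZL": (166.0, -47.5, 179.0, -34.0),
    "USA": (-125.0, 24.0, -66.0, 50.0),
}

def clip_to_region(gdf: gpd.GeoDataFrame, iso_code: str) -> gpd.GeoDataFrame:
    """Drop features that fall outside the region of parametric insurance interest."""
    minx, miny, maxx, maxy = REGION_BOUNDS[iso_code]
    return gdf.cx[minx:maxx, miny:maxy]   # geopandas bounding-box slice
```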
4. Data Validation & Quality Checks
4.1 File Integrity Validation
Download Verification
- HTTP Status Checking: Verify successful API responses (200 status codes)
- Content Length Validation: Confirm expected file sizes match downloaded content
- Format Validation: Ensure downloaded files match expected formats (GeoJSON, NetCDF, Shapefile)
- Corruption Detection: Basic file structure validation before processing
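A minimal sketch of these download checks using requests; note that the Content-Length header is not guaranteed to be present (or to match the decoded size for compressed responses), so the size check is best-effort:

```python
import os

import requests

def download_with_checks(url: str, out_path: str) -> None:
    """Download a file with HTTP status and best-effort size validation."""
    resp = requests.get(url, stream=True, timeout=300)
    if resp.status_code != 200:                        # HTTP status checking
        raise RuntimeError(f"unexpected status {resp.status_code} for {url}")

    with open(out_path, "wb") as fh:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            fh.write(chunk)

    expected = resp.headers.get("Content-Length")      # content-length validation
    actual = os.path.getsize(out_path)
    if expected is not None and int(expected) != actual:
        raise RuntimeError(f"size mismatch: expected {expected} bytes, got {actual}")
```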
Data Completeness Checks
- Feature Count Validation: Verify expected number of records/features in downloaded data
- Date Range Verification: Confirm data covers requested temporal periods
- Geographic Coverage: Validate spatial extent matches expected boundaries
- Variable Presence: Ensure required data variables are present in downloaded files
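A hedged sketch of the variable-presence and date-range checks for a NetCDF download, assuming xarray and a coordinate named time; variable and coordinate names vary by dataset:

```python
import pandas as pd
import xarray as xr

def check_netcdf_completeness(path: str, required_vars: list[str],
                              date_from: str, date_to: str) -> None:
    """Verify required variables exist and the time axis covers the requested period."""
    ds = xr.open_dataset(path)

    missing = [v for v in required_vars if v not in ds.data_vars]   # variable presence
    if missing:
        raise ValueError(f"missing variables: {missing}")

    times = pd.to_datetime(ds["time"].values)                       # date range verification
    if times.min() > pd.Timestamp(date_from) or times.max() < pd.Timestamp(date_to):
        raise ValueError("time axis does not cover the requested date range")
```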
4.2 Processing Validation
Conversion Success Tracking
- Format Conversion Monitoring: Track successful transformation from source to target formats
- Geometry Validation: Verify spatial data maintains valid geometries after processing
- Coordinate System Verification: Confirm proper CRS transformation to EPSG:4326
- Error Logging: Capture and log processing failures for investigation
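A minimal sketch of the post-conversion geometry and CRS checks with geopandas; the function name is illustrative:

```python
import geopandas as gpd

def validate_processed_layer(gdf: gpd.GeoDataFrame) -> None:
    """Confirm the layer is in EPSG:4326 and all geometries survived processing intact."""
    if gdf.crs is None or gdf.crs.to_epsg() != 4326:    # coordinate system verification
        raise ValueError(f"unexpected CRS: {gdf.crs}")

    invalid = int((~gdf.geometry.is_valid).sum())       # geometry validation
    if invalid:
        raise ValueError(f"{invalid} invalid geometries after processing")
```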
Output Quality Assurance
- File Size Validation: Ensure processed files are within expected size ranges
- Schema Compliance: Verify output files conform to expected data schemas
- Metadata Preservation: Confirm essential metadata is retained through processing
- Upload Verification: Validate successful upload to AWS S3 cloud storage
5. Cloud Storage Integration
5.1 Automated Upload Process
AWS S3 Organization
- Standardized Directory Structure: Organized in a provider/dataset/variable/date hierarchy
- File Naming Conventions: Consistent naming patterns for automated discovery
- Metadata Files: Companion metadata files for data lineage and provenance
- Compression Optimization: Efficient storage using appropriate compression methods
Upload Validation
- Transfer Verification: Confirm successful upload to cloud storage
- Checksum Validation: Verify file integrity during transfer process
- Access Permissions: Ensure proper security settings for data access
- Cleanup Operations: Remove temporary local files after successful upload
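A hedged boto3 sketch combining the key layout, checksum-verified upload, post-upload check, and local cleanup; the bucket name, key pattern, and function are illustrative, not Riskwolf's actual upload code:

```python
import base64
import hashlib
import os

import boto3

s3 = boto3.client("s3")

def upload_and_verify(local_path: str, bucket: str, provider: str,
                      dataset: str, variable: str, date: str) -> str:
    """Upload under a provider/dataset/variable/date key, verify, then clean up locally."""
    key = f"{provider}/{dataset}/{variable}/{date}/{os.path.basename(local_path)}"

    # Content-MD5 makes S3 reject the object if the payload is corrupted in transit.
    with open(local_path, "rb") as fh:
        body = fh.read()
    content_md5 = base64.b64encode(hashlib.md5(body).digest()).decode()
    s3.put_object(Bucket=bucket, Key=key, Body=body, ContentMD5=content_md5)

    # Confirm the object landed with the expected size before removing the local copy.
    head = s3.head_object(Bucket=bucket, Key=key)
    if head["ContentLength"] != len(body):
        raise RuntimeError(f"upload size mismatch for s3://{bucket}/{key}")
    os.remove(local_path)
    return key
```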
6. Output Format Specifications
6.1 Standardized Output Formats
GeoParquet
- Point Event Data: Earthquake epicenters and wildfire incidents stored in an efficient spatial format
- Spatial Indexing: Optimized for geographic queries and spatial analysis
- Metadata Retention: Event attributes, magnitudes, timestamps preserved
- Compression: Efficient storage with fast query performance
NetCDF4
- Climate Data Standard: Weather and precipitation datasets in CF-compliant format
- Temporal Series: Gridded time series data with proper temporal dimensions
- Metadata Preservation: Units, coordinate systems, and data provenance retained
- Compression: Chunking and compression for efficient access
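A minimal sketch of writing a chunked, compressed NetCDF4 file with xarray; the compression level and 100-element chunk size are illustrative choices, not prescribed values:

```python
import xarray as xr

def write_netcdf(ds: xr.Dataset, path: str) -> None:
    """Write a dataset to NetCDF4 with per-variable chunking and zlib compression."""
    encoding = {}
    for name, var in ds.data_vars.items():
        enc = {"zlib": True, "complevel": 4}
        if var.ndim:   # chunk each dimension (100 is an illustrative chunk size)
            enc["chunksizes"] = tuple(min(size, 100) for size in var.shape)
        encoding[name] = enc
    ds.to_netcdf(path, format="NETCDF4", encoding=encoding)
```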
GeoJSON
- Administrative Boundaries: Vector boundary data in web-compatible format
- Storm Tracks: Tropical cyclone paths and intensity data
- Feature Properties: Rich attribute data for each geographic feature
- Standards Compliance: RFC 7946 compliant for web mapping applications
6.2 File Organization Standards
Directory Structure
- Provider-based Organization: Data organized by source agency (USGS, ECMWF, NOAA)
- Dataset Separation: Different data types stored in separate directory trees
- Date-based Partitioning: Temporal organization for efficient data discovery
- Variable Segregation: Different measured variables stored separately
File Naming Conventions
- Timestamp Integration: File names include data date ranges and processing timestamps
- Provider Identification: Clear indication of data source and dataset type
- Version Control: As-at timestamps for data lineage and version tracking
- Format Identification: File extensions clearly indicate format type
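A minimal sketch of composing such a file name; the exact pattern, separators, and as-at timestamp format are illustrative rather than the actual convention:

```python
from datetime import datetime, timezone

def build_filename(provider: str, dataset: str, date_from: str, date_to: str, ext: str) -> str:
    """Compose a name carrying provider, dataset, data date range, and an as-at timestamp."""
    as_at = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{provider}_{dataset}_{date_from}_{date_to}_asat{as_at}.{ext}"

# e.g. build_filename("usgs", "earthquakes", "20240101", "20240102", "parquet")
# -> "usgs_earthquakes_20240101_20240102_asat20240103T041500Z.parquet"
```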
7. Integration with Analytics Pipeline
The processed data from extraction services provides standardized inputs for Riskwolf's Data Analytics & Modelling framework:
Ready-to-Use Data Products
- Spatially Consistent: All data in EPSG:4326 coordinate system for seamless integration
- Temporally Aligned: UTC timestamps enable cross-dataset temporal analysis
- Format Standardized: Consistent output formats reduce downstream processing complexity
- Quality Validated: Basic integrity checks ensure reliable data for modeling
Preserved Data Characteristics
- Original Resolution: Source temporal and spatial resolution maintained for analytical flexibility
- Complete Metadata: Data lineage and source information preserved for audit trails
- Error Tracking: Processing logs available for quality assessment and troubleshooting
For detailed information on data extraction processes and source specifications, refer to our Data Extraction System documentation.