Onboarding Custom Datasets to Riskwolf Platform
Document Version: 1.0
Date: 2026-01-06
Purpose: This guide provides step-by-step instructions for onboarding custom point-wise or polygon-wise datasets as raster data to the Riskwolf platform.
Overview
The Riskwolf platform requires all custom datasets to be:
- Converted from vector (point or polygon) data to raster format
- Rasterized to a standard 0.1° × 0.1° grid resolution
- Exported as NetCDF4 (`.nc4`) files, organized by year
- Accompanied by standardized metadata in JSON format
Prerequisites
Required Python Packages
```python
import numpy as np
import pandas as pd
import xarray as xr
import datetime as dt
import geopandas as gpd
import netCDF4
from pyproj import CRS
from shapely.geometry import box
from geocube.api.core import make_geocube
from geocube.rasterize import rasterize_image
```
Install dependencies:

```shell
pip install numpy pandas xarray geopandas netCDF4 geocube shapely pyproj
```
Step 1: File Structure
1.1 Required Format
- Format: NetCDF4 (`.nc4` extension)
- Compression: zlib level 1
1.2 Dataset Structure
```
<xarray.Dataset>
Dimensions:  (time: N, lat: M, lon: L)
Coordinates:
  * time     (time) datetime64[ns]
  * lat      (lat) float64
  * lon      (lon) float64
Data variables:
    value    (time, lat, lon) float64
```
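A dataset with this structure can be constructed and checked directly in xarray. The sketch below uses illustrative coordinates (a 3-day, 5 × 5 subset of the standard grid), not real data:

```python
import numpy as np
import pandas as pd
import xarray as xr

# illustrative coordinates on the standard 0.1-degree grid
time = pd.date_range("2005-01-01", "2005-01-03", freq="D")
lat = np.arange(6.5, 7.0, 0.1)
lon = np.arange(68.0, 68.5, 0.1)

ds = xr.Dataset(
    data_vars={
        "value": (("time", "lat", "lon"),
                  np.full((len(time), len(lat), len(lon)), np.nan)),
    },
    coords={"time": time, "lat": lat, "lon": lon},
)

# dimension order must be (time, lat, lon)
print(ds["value"].dims)  # ('time', 'lat', 'lon')
```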
Step 2: Prepare Input Data
Input Data Format
Your input data must be a pandas DataFrame with the following required columns:
| Column | Type | Description | Example |
|---|---|---|---|
| `time` | datetime | Timestamp for each observation | 2005-01-01 |
| `year` | int | Year of the observation | 2005 |
| `lat` | float | Latitude in decimal degrees (WGS84) | 7.507 |
| `lon` | float | Longitude in decimal degrees (WGS84) | 93.597 |
| `value` | float | The measured/observed value | 27.31 |
Example Input Data
```
>>> data
             time  year     lat     lon  value
0      2005-01-01  2005   7.507  93.597  27.31
1      2005-01-02  2005   7.507  93.597  27.01
2      2005-01-03  2005   7.507  93.597  27.29
3      2005-01-04  2005   7.507  93.597  27.22
4      2005-01-05  2005   7.507  93.597  27.41
...           ...   ...     ...     ...    ...
22822  2025-10-27  2025  11.508  92.644  27.35
22823  2025-10-28  2025  11.508  92.644  28.02
22824  2025-10-29  2025  11.508  92.644  27.72
22825  2025-10-30  2025  11.508  92.644  27.56
22826  2025-10-31  2025  11.508  92.644  27.30

[22827 rows x 5 columns]
```
Notes:
- Coordinates must be in WGS84 (EPSG:4326)
- The `time` column should contain datetime objects or pandas Timestamps
- Missing values are not permitted; remove or impute them before proceeding, so that no NaN values remain in the final input
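Before rasterizing, it can help to fail fast on the requirements above. The `validate_input` helper below is an illustrative sketch, not part of the platform; it checks columns, dtypes, NaN values, and coordinate ranges:

```python
import pandas as pd

def validate_input(data: pd.DataFrame) -> pd.DataFrame:
    """Check required columns, dtypes, and absence of NaN values."""
    required = ["time", "year", "lat", "lon", "value"]
    missing = [c for c in required if c not in data.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    if not pd.api.types.is_datetime64_any_dtype(data["time"]):
        raise TypeError("'time' must be datetime64 (use pd.to_datetime)")
    if data[required].isna().any().any():
        raise ValueError("NaN values are not allowed in the input")
    # coordinates must be plausible WGS84 decimal degrees
    if not data["lat"].between(-90, 90).all() or not data["lon"].between(-180, 180).all():
        raise ValueError("lat/lon outside valid WGS84 ranges")
    return data

# example with a minimal valid frame
df = pd.DataFrame({
    "time": pd.to_datetime(["2005-01-01", "2005-01-02"]),
    "year": [2005, 2005],
    "lat": [7.507, 7.507],
    "lon": [93.597, 93.597],
    "value": [27.31, 27.01],
})
validate_input(df)
```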
Step 3: Prepare Metadata
Create a JSON metadata file that describes your dataset. This metadata is required for platform integration.
Metadata Schema
```json
{
  "country": "ind",
  "provider": "custom-provider",
  "dataset": "archive",
  "resolution": "0.1x0.1",
  "tz": "UTC",
  "variable": "temp-min-24hr",
  "created_at": "2026-01-16T22:11:59Z",
  "modified_at": "2026-01-16T22:11:59Z"
}
```
Metadata Field Descriptions
| Field | Required | Description |
|---|---|---|
| `country` | Yes | ISO3 country code (e.g., "ind", "phl", "idn") |
| `provider` | Yes | Provider identifier - must be requested from admin |
| `dataset` | Yes | Dataset identifier - must be requested from admin |
| `resolution` | Yes | Always "0.1x0.1" (standard grid resolution) |
| `tz` | Yes | Timezone of the source data (IANA timezone name or "UTC") |
| `variable` | Yes | Variable name - must be requested from admin |
| `created_at` | Yes | ISO 8601 timestamp in UTC |
| `modified_at` | Yes | ISO 8601 timestamp in UTC |
Important: Before onboarding, contact the admin to obtain approved values for the `provider`, `dataset`, and `variable` fields.
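The metadata file can be generated programmatically. A minimal sketch, using the placeholder identifiers from the schema above (real `provider`, `dataset`, and `variable` values must come from the admin):

```python
import json
import datetime as dt

# placeholder identifiers: provider/dataset/variable must be approved by the admin
metadata = {
    "country": "ind",
    "provider": "custom-provider",
    "dataset": "archive",
    "resolution": "0.1x0.1",
    "tz": "UTC",
    "variable": "temp-min-24hr",
}

# ISO 8601 UTC timestamps with a trailing "Z", as in the schema above
now = dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
metadata["created_at"] = now
metadata["modified_at"] = now

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```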
Step 4: Determine Bounding Box
You need to specify the geographic bounding box for rasterization. Use the bounding box corresponding to your country from the reference list below.
Supported Country Bounding Boxes
| ID | Country | xmin | xmax | ymin | ymax |
|---|---|---|---|---|---|
| phl | Philippines | 116 | 127 | 4 | 21.5 |
| ind | India | 68 | 97.5 | 6.5 | 36 |
| idn | Indonesia | 94 | 142 | -12 | 7 |
| mys | Malaysia | 99 | 121 | 0 | 8 |
| tha | Thailand | 95 | 107 | 5 | 21 |
| col | Colombia | -82 | -66 | -5 | 17 |
| nga | Nigeria | 2 | 15 | 4 | 14 |
| bra | Brazil | -75 | -28.5 | -34 | 6 |
| bgd | Bangladesh | 87 | 93 | 20 | 27 |
| alb | Albania | 19 | 21.5 | 39.5 | 43 |
| aut | Austria | 9 | 17.5 | 46 | 49.5 |
| che | Switzerland | 5.5 | 11 | 45.5 | 48 |
| khm | Cambodia | 102 | 108 | 9.8 | 15 |
| arg | Argentina | -74 | -53 | -56 | -21 |
| bol | Bolivia | -70 | -57 | -23 | -9 |
| civ | Ivory Coast | -9 | -2 | 4 | 11 |
| gtm | Guatemala | -93 | -88 | 13.5 | 18 |
| hnd | Honduras | -90 | -83 | 12.5 | 17 |
| ken | Kenya | 33 | 42 | -5 | 6 |
| kgz | Kyrgyzstan | 69 | 81 | 39 | 44 |
| mex | Mexico | -119 | -86 | 14 | 33 |
| moz | Mozambique | 30 | 41 | -27 | -10 |
| mwi | Malawi | 32 | 36 | -18 | -9 |
| per | Peru | -82 | -68 | -19 | 1 |
| rwa | Rwanda | 28.5 | 31 | -3 | -1 |
| tjk | Tajikistan | 67 | 76 | 36 | 41.5 |
| zwe | Zimbabwe | 25 | 34 | -23 | -15 |
Note: If your country is not in this list, contact the admin to request a bounding box definition.
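A small lookup keyed by the country IDs above makes it easy to verify that your data points actually fall inside the chosen bounding box. The sketch below covers only a subset of the table; `points_inside` is an illustrative helper, not a platform API:

```python
# subset of the bounding boxes from the table above, keyed by country ID
BBOXES = {
    "ind": {"xmin": 68, "xmax": 97.5, "ymin": 6.5, "ymax": 36},
    "phl": {"xmin": 116, "xmax": 127, "ymin": 4, "ymax": 21.5},
    "idn": {"xmin": 94, "xmax": 142, "ymin": -12, "ymax": 7},
}

def points_inside(lats, lons, bbox):
    """Return True if every (lat, lon) point falls within the bounding box."""
    return all(
        bbox["ymin"] <= la <= bbox["ymax"] and bbox["xmin"] <= lo <= bbox["xmax"]
        for la, lo in zip(lats, lons)
    )

# points taken from the example input data in Step 2
print(points_inside([7.507, 11.508], [93.597, 92.644], BBOXES["ind"]))  # True
```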
Step 5: Standard Variable Names
| Variable ID | Variable Name |
|---|---|
| `temp-mean-24hr` | Temperature mean daily |
| `temp-max-24hr` | Temperature maximum daily |
| `temp-min-24hr` | Temperature minimum daily |
| `relhum-mean-24hr` | Relative humidity mean daily |
| `precip-sum-24hr` | Precipitation sum daily |
| `wind-mean-24hr` | Wind speed mean daily |
| `wind-max-24hr` | Wind speed maximum daily |
| `wind-min-24hr` | Wind speed minimum daily |
Note: For any new variables, please contact Riskwolf support.
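A quick guard against typos in the variable ID can be built from the table above. `check_variable` is an illustrative helper, not a platform API:

```python
# standard variable IDs from the table above
STANDARD_VARIABLES = {
    "temp-mean-24hr", "temp-max-24hr", "temp-min-24hr",
    "relhum-mean-24hr", "precip-sum-24hr",
    "wind-mean-24hr", "wind-max-24hr", "wind-min-24hr",
}

def check_variable(variable_id: str) -> str:
    """Fail early if the metadata uses an unknown variable ID."""
    if variable_id not in STANDARD_VARIABLES:
        raise ValueError(f"Unknown variable '{variable_id}': contact Riskwolf support")
    return variable_id

check_variable("temp-min-24hr")
```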
Step 6: Rasterization Process
Rasterization Function
The following function converts vector (point or polygon) data into a raster grid:
```python
import numpy as np
import pandas as pd
import xarray as xr
import datetime as dt
from pyproj import CRS
import geopandas as gpd
from shapely.geometry import box
from geocube.api.core import make_geocube
from geocube.rasterize import rasterize_image


def rasterize(
    vector_data: gpd.GeoDataFrame,
    bbox_dict: dict[str, float],
    min_date: dt.date,
    max_date: dt.date,
    metadata: dict[str, str],
) -> xr.Dataset:
    """
    Rasterize vector data to a 0.1° × 0.1° grid.

    Parameters
    ----------
    vector_data : gpd.GeoDataFrame
        GeoDataFrame containing point or polygon geometries with a 'value' column
    bbox_dict : dict[str, float]
        Bounding box dictionary with keys: 'xmin', 'xmax', 'ymin', 'ymax'
    min_date : dt.date
        Minimum date for temporal filtering
    max_date : dt.date
        Maximum date for temporal filtering
    metadata : dict[str, str]
        Metadata dictionary containing dataset information
        (country, provider, dataset, variable, etc.)

    Returns
    -------
    xr.Dataset
        Rasterized dataset with dimensions (time, lat, lon)
    """
    def rasterize_function(**kwargs):
        return rasterize_image(all_touched=True, **kwargs)

    # create bounding box geometry
    bounding_box = box(
        minx=bbox_dict["xmin"],
        miny=bbox_dict["ymin"],
        maxx=bbox_dict["xmax"],
        maxy=bbox_dict["ymax"],
    )

    # create geocube (rasterize)
    resolution = 0.1
    time = pd.date_range(start=min_date, end=max_date, freq="D")
    lat = np.arange(bbox_dict["ymin"], bbox_dict["ymax"] + resolution, resolution)
    lon = np.arange(bbox_dict["xmin"], bbox_dict["xmax"] + resolution, resolution)

    template = xr.Dataset(coords={"time": time, "y": lat, "x": lon})
    template = template.assign(value=np.nan)
    template.attrs["crs"] = CRS.from_epsg(4326).to_wkt()

    grid = make_geocube(
        vector_data=vector_data,
        measurements=["value"],
        rasterize_function=rasterize_function,
        datetime_measurements=["time"],
        geom=bounding_box,
        fill=np.nan,
        like=template,
        group_by="time",
    )

    # rename coordinates and clean up
    grid = grid \
        .rename({"x": "lon", "y": "lat"}) \
        .reset_coords(["spatial_ref"], drop=True)

    # reindex to ensure complete grid coverage
    grid = grid.reindex(time=time, lon=lon, lat=lat, method="nearest")

    # sort by time
    grid = grid.sortby("time")

    grid.attrs = metadata
    grid.attrs["crs"] = template.attrs["crs"]

    return grid
```
Key Rasterization Parameters

- Resolution: fixed at `(-0.1, 0.1)` degrees (0.1° × 0.1° grid)
- CRS: WGS84 (EPSG:4326)
- Rasterization method: `all_touched=True` (includes all cells touched by geometries)
- Fill value: `np.nan` for missing data
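The axis construction inside `rasterize()` determines the output grid size. For example, the India bounding box yields a 296 × 296 grid, because the stop value is shifted by one resolution step so that both endpoints are included:

```python
import numpy as np

# reproduces the axis construction inside rasterize() for the India bbox
resolution = 0.1
bbox = {"xmin": 68, "xmax": 97.5, "ymin": 6.5, "ymax": 36}

# stop is shifted by one resolution step so the upper bound is included
lat = np.arange(bbox["ymin"], bbox["ymax"] + resolution, resolution)
lon = np.arange(bbox["xmin"], bbox["xmax"] + resolution, resolution)

print(len(lat), len(lon))  # 296 296
```

Note that `np.arange` with a float step is sensitive to rounding; if exact endpoint control matters, `np.linspace` with an explicit point count is a common alternative.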
Step 7: Complete Processing Pipeline
Full Processing Script
Here's the complete script to process your data from input to output:
```python
import os
import numpy as np
import pandas as pd
import xarray as xr
import datetime as dt
import geopandas as gpd
import netCDF4

# rasterize() is the function defined in Step 6

# load your data (adjust this based on your data source)
# data = pd.read_csv("your_data.csv")
# data['time'] = pd.to_datetime(data['time'])

# define your bounding box (use the appropriate country bbox)
bbox_dict = {
    "xmin": 68,    # Example for India
    "xmax": 97.5,
    "ymin": 6.5,
    "ymax": 36,
}

# metadata as prepared in Step 3 (or load it from your metadata.json)
metadata = {
    "country": "ind",
    "provider": "custom-provider",
    "dataset": "archive",
    "resolution": "0.1x0.1",
    "tz": "UTC",
    "variable": "temp-min-24hr",
    "created_at": "2026-01-16T22:11:59Z",
    "modified_at": "2026-01-16T22:11:59Z",
}

# define output directory
output_dir = "outputs"
os.makedirs(output_dir, exist_ok=True)

# variable name from metadata (for logging)
variable = metadata["variable"]

# process data year by year
for year, subset in data.groupby("year"):
    # create output file path
    file = f"{output_dir}/{year}.nc4"

    # convert to GeoDataFrame
    vector_data: gpd.GeoDataFrame = gpd.GeoDataFrame(
        subset,
        geometry=gpd.points_from_xy(subset["lon"], subset["lat"]),
        crs="EPSG:4326",
    )

    # rasterize the data
    raster_data: xr.Dataset = rasterize(
        vector_data=vector_data,
        bbox_dict=bbox_dict,
        min_date=dt.date(year, 1, 1),
        max_date=dt.date(year, 12, 31),
        metadata=metadata,
    )

    # export to NetCDF4
    raster_data.to_netcdf(file)
    print(f"[{variable} | {year}] Saved: {file}")
```
Processing Steps Summary
1. Load Data: Read your input data into a pandas DataFrame
2. Group by Year: Process data year-by-year for efficient handling
3. Convert to GeoDataFrame: Create a GeoDataFrame with point geometries
4. Rasterize: Apply the rasterization function to create a grid
5. Export: Save each year's raster as a separate NetCDF4 file
Step 8: Output Format
Output File Structure
- Format: NetCDF4 (`.nc4` extension)
- Organization: One file per variable per year (e.g., `2005.nc4`, `2006.nc4`, ...)
- Location: All files in the `outputs/` directory (or your specified directory)
Output Dataset Structure
Each NetCDF4 file contains an xarray.Dataset with:
- Dimensions:
  - `time`: Temporal dimension (daily timestamps)
  - `lat`: Latitude dimension (0.1° intervals)
  - `lon`: Longitude dimension (0.1° intervals)
- Data Variables:
  - `value`: Rasterized values (float, with NaN for missing data)
- Coordinates:
  - `time`: Array of timestamps
  - `lat`: Array of latitude values
  - `lon`: Array of longitude values
Example Output File
```
outputs/
├── 2005.nc4
├── 2006.nc4
├── 2007.nc4
└── ...
```
Step 9: Validation Checklist
Format:
- [ ] NetCDF4 format (`.nc4` extension)
- [ ] Compression applied

Structure:
- [ ] Variables: `time`, `lat`, `lon`, `value`
- [ ] Data types: `datetime64[ns]`, `float64`, `float64`, `float64`
- [ ] Dimension order: `(time, lat, lon)`

Data:
- [ ] No NaN in coordinates
- [ ] Precision: 6 decimals (coords), 4 decimals (values)
- [ ] Valid coordinate ranges

Metadata:
- [ ] All required attributes present
- [ ] Valid variable name
- [ ] Correct attribute format
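Several of the structural checks above can be automated. The sketch below validates an in-memory `xarray.Dataset`; `validate_dataset` is an illustrative helper, demonstrated on a synthetic dataset rather than a real output file (for a real file, load it first with `xr.open_dataset`):

```python
import numpy as np
import pandas as pd
import xarray as xr

def validate_dataset(ds: xr.Dataset) -> list:
    """Return a list of checklist violations (empty list = pass)."""
    problems = []
    if "value" not in ds.data_vars:
        problems.append("missing 'value' data variable")
    elif ds["value"].dims != ("time", "lat", "lon"):
        problems.append(f"dimension order is {ds['value'].dims}, expected (time, lat, lon)")
    for coord in ("time", "lat", "lon"):
        if coord not in ds.coords:
            problems.append(f"missing coordinate '{coord}'")
    if "time" in ds.coords and not np.issubdtype(ds["time"].dtype, np.datetime64):
        problems.append("'time' is not datetime64")
    for coord in ("lat", "lon"):
        if coord in ds.coords and np.isnan(ds[coord].values).any():
            problems.append(f"NaN values in coordinate '{coord}'")
    return problems

# demonstrate on a minimal in-memory dataset shaped like the expected output
time = pd.date_range("2005-01-01", periods=2, freq="D")
ds = xr.Dataset(
    {"value": (("time", "lat", "lon"), np.full((2, 3, 3), np.nan))},
    coords={"time": time, "lat": [6.5, 6.6, 6.7], "lon": [68.0, 68.1, 68.2]},
)
print(validate_dataset(ds))  # []
```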
Summary
The onboarding process transforms vector (point/polygon) data into standardized raster grids:
- Input: Pandas DataFrame with time, coordinates, and values
- Metadata: JSON file with dataset information
- Processing: Rasterization to 0.1° × 0.1° grid using geocube
- Output: NetCDF4 files organized by year
This standardized format ensures compatibility with the Riskwolf platform's data ingestion and analysis pipelines.