Onboarding Custom Datasets to Riskwolf Platform

Document Version: 1.0

Date: 2026-01-06

Purpose: This guide provides step-by-step instructions for onboarding custom point-wise or polygon-wise datasets as raster data to the Riskwolf platform.

Overview

The Riskwolf platform requires all custom datasets to be:

  • Converted from vector (point or polygon) data to raster format
  • Rasterized to a standard 0.1° × 0.1° grid resolution
  • Exported as NetCDF4 (.nc4) files, organized by year
  • Accompanied by standardized metadata in JSON format

Prerequisites

Required Python Packages

import numpy as np
import pandas as pd
import xarray as xr
import datetime as dt
import geopandas as gpd
import netCDF4
from pyproj import CRS
from shapely.geometry import box
from geocube.api.core import make_geocube
from geocube.rasterize import rasterize_image

Install dependencies:

pip install numpy pandas xarray geopandas netCDF4 pyproj geocube shapely

Step 1: File Structure

1.1 Required Format

  • Format: NetCDF4 (.nc4 extension)
  • Compression: zlib level 1

1.2 Dataset Structure

<xarray.Dataset>
Dimensions:  (time: N, lat: M, lon: L)
Coordinates:
  * time     (time) datetime64[ns]
  * lat      (lat) float64
  * lon      (lon) float64
Data variables:
    value    (time, lat, lon) float64
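
As a quick sanity check, this structure can be reproduced with a tiny in-memory dataset (the dimension sizes and values below are purely illustrative):

```python
import numpy as np
import pandas as pd
import xarray as xr

# illustrative sizes: N=2 timestamps, M=3 latitudes, L=4 longitudes
time = pd.date_range("2005-01-01", periods=2, freq="D")
lat = np.array([6.5, 6.6, 6.7])
lon = np.array([68.0, 68.1, 68.2, 68.3])

ds = xr.Dataset(
    {"value": (("time", "lat", "lon"), np.zeros((len(time), len(lat), len(lon))))},
    coords={"time": time, "lat": lat, "lon": lon},
)

# dimension order must be (time, lat, lon)
assert ds["value"].dims == ("time", "lat", "lon")
```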

Step 2: Prepare Input Data

Input Data Format

Your input data must be a pandas DataFrame with the following required columns:

Column  Type      Description                           Example
------  --------  ------------------------------------  ----------
time    datetime  Timestamp for each observation        2005-01-01
year    int       Year of the observation               2005
lat     float     Latitude in decimal degrees (WGS84)   7.507
lon     float     Longitude in decimal degrees (WGS84)  93.597
value   float     The measured/observed value           27.31

Example Input Data

>>> data
            time  year     lat     lon  value
0     2005-01-01  2005   7.507  93.597  27.31
1     2005-01-02  2005   7.507  93.597  27.01
2     2005-01-03  2005   7.507  93.597  27.29
3     2005-01-04  2005   7.507  93.597  27.22
4     2005-01-05  2005   7.507  93.597  27.41
...          ...   ...     ...     ...    ...
22822 2025-10-27  2025  11.508  92.644  27.35
22823 2025-10-28  2025  11.508  92.644  28.02
22824 2025-10-29  2025  11.508  92.644  27.72
22825 2025-10-30  2025  11.508  92.644  27.56
22826 2025-10-31  2025  11.508  92.644  27.30

[22827 rows x 5 columns]

Notes:

  • Coordinates must be in WGS84 (EPSG:4326)
  • The time column should be datetime objects or pandas Timestamps
  • Missing values (NaN) are not permitted in the final input; remove or impute them before proceeding
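
A minimal pre-flight check of the input DataFrame (required columns, datetime dtype, NaN, coordinate ranges) might look like this; the example rows are synthetic:

```python
import pandas as pd

REQUIRED_COLUMNS = ["time", "year", "lat", "lon", "value"]

def check_input(data: pd.DataFrame) -> None:
    """Raise ValueError if the input DataFrame violates the requirements above."""
    missing = [c for c in REQUIRED_COLUMNS if c not in data.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if not pd.api.types.is_datetime64_any_dtype(data["time"]):
        raise ValueError("'time' must be datetime64")
    if data[REQUIRED_COLUMNS].isna().any().any():
        raise ValueError("NaN values are not allowed in the input")
    if not data["lat"].between(-90, 90).all() or not data["lon"].between(-180, 180).all():
        raise ValueError("coordinates must be valid WGS84 decimal degrees")

# synthetic example rows in the required format
data = pd.DataFrame({
    "time": pd.to_datetime(["2005-01-01", "2005-01-02"]),
    "year": [2005, 2005],
    "lat": [7.507, 7.507],
    "lon": [93.597, 93.597],
    "value": [27.31, 27.01],
})
check_input(data)  # passes silently
```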

Step 3: Prepare Metadata

Create a JSON metadata file that describes your dataset. This metadata is required for platform integration.

Metadata Schema

{
    "country": "ind",
    "provider": "custom-provider",
    "dataset": "archive",
    "resolution": "0.1x0.1",
    "tz": "UTC",
    "variable": "temp-min-24hr",
    "created_at": "2026-01-16T22:11:59Z",
    "modified_at": "2026-01-16T22:11:59Z"
}

Metadata Field Descriptions

Field        Required  Description
-----------  --------  --------------------------------------------------------
country      Yes       ISO3 country code (e.g., "ind", "phl", "idn")
provider     Yes       Provider identifier - must be requested from admin
dataset      Yes       Dataset identifier - must be requested from admin
resolution   Yes       Always "0.1x0.1" (standard grid resolution)
tz           Yes       Timezone of the source data (IANA timezone name or "UTC")
variable     Yes       Variable name - must be requested from admin
created_at   Yes       ISO 8601 timestamp in UTC
modified_at  Yes       ISO 8601 timestamp in UTC

Important: Before onboarding, contact the admin to obtain approved values for provider, dataset, and variable fields.
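
One way to generate the metadata file programmatically; the provider, dataset, and variable values below are placeholders that must be replaced with admin-approved values:

```python
import json
import datetime as dt

# ISO 8601 timestamp in UTC, matching the schema above
now = dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

metadata = {
    "country": "ind",
    "provider": "custom-provider",  # placeholder - request from admin
    "dataset": "archive",           # placeholder - request from admin
    "resolution": "0.1x0.1",        # fixed standard grid resolution
    "tz": "UTC",
    "variable": "temp-min-24hr",    # placeholder - request from admin
    "created_at": now,
    "modified_at": now,
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=4)
```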


Step 4: Determine Bounding Box

You need to specify the geographic bounding box for rasterization. Use the bounding box corresponding to your country from the reference list below.

Supported Country Bounding Boxes

ID   Country      xmin   xmax   ymin   ymax
---  -----------  -----  -----  -----  -----
phl  Philippines  116    127    4      21.5
ind  India        68     97.5   6.5    36
idn  Indonesia    94     142    -12    7
mys  Malaysia     99     121    0      8
tha  Thailand     95     107    5      21
col  Colombia     -82    -66    -5     17
nga  Nigeria      2      15     4      14
bra  Brazil       -75    -28.5  -34    6
bgd  Bangladesh   87     93     20     27
alb  Albania      19     21.5   39.5   43
aut  Austria      9      17.5   46     49.5
che  Switzerland  5.5    11     45.5   48
khm  Cambodia     102    108    9.8    15
arg  Argentina    -74    -53    -56    -21
bol  Bolivia      -70    -57    -23    -9
civ  Ivory Coast  -9     -2     4      11
gtm  Guatemala    -93    -88    13.5   18
hnd  Honduras     -90    -83    12.5   17
ken  Kenya        33     42     -5     6
kgz  Kyrgyzstan   69     81     39     44
mex  Mexico       -119   -86    14     33
moz  Mozambique   30     41     -27    -10
mwi  Malawi       32     36     -18    -9
per  Peru         -82    -68    -19    1
rwa  Rwanda       28.5   31     -3     -1
tjk  Tajikistan   67     76     36     41.5
zwe  Zimbabwe     25     34     -23    -15

Note: If your country is not in this list, contact the admin to request a bounding box definition.
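
The table can be mirrored in code as a small lookup keyed by ISO3 code; only a few entries are shown here, with values copied from the table above:

```python
# subset of the bounding boxes from the table above, keyed by ISO3 country code
BBOXES: dict[str, dict[str, float]] = {
    "ind": {"xmin": 68, "xmax": 97.5, "ymin": 6.5, "ymax": 36},
    "phl": {"xmin": 116, "xmax": 127, "ymin": 4, "ymax": 21.5},
    "tha": {"xmin": 95, "xmax": 107, "ymin": 5, "ymax": 21},
}

def get_bbox(country: str) -> dict[str, float]:
    """Look up the rasterization bounding box for a supported country."""
    try:
        return BBOXES[country]
    except KeyError:
        raise KeyError(
            f"no bounding box for '{country}' - contact the admin to request one"
        )

bbox_dict = get_bbox("ind")
```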


Step 5: Standard Variable Names

Variable ID       Variable Name
----------------  ----------------------------
temp-mean-24hr    Temperature mean daily
temp-max-24hr     Temperature maximum daily
temp-min-24hr     Temperature minimum daily
relhum-mean-24hr  Relative humidity mean daily
precip-sum-24hr   Precipitation sum daily
wind-mean-24hr    Wind speed mean daily
wind-max-24hr     Wind speed maximum daily
wind-min-24hr     Wind speed minimum daily

Note: For any new variables please contact Riskwolf support.
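
For convenience, the same table can be kept as a dictionary and used to check that a metadata "variable" value is one of the standard IDs:

```python
# standard variable IDs and names, copied from the table above
STANDARD_VARIABLES = {
    "temp-mean-24hr": "Temperature mean daily",
    "temp-max-24hr": "Temperature maximum daily",
    "temp-min-24hr": "Temperature minimum daily",
    "relhum-mean-24hr": "Relative humidity mean daily",
    "precip-sum-24hr": "Precipitation sum daily",
    "wind-mean-24hr": "Wind speed mean daily",
    "wind-max-24hr": "Wind speed maximum daily",
    "wind-min-24hr": "Wind speed minimum daily",
}

# example: verify a metadata variable before onboarding
assert "temp-min-24hr" in STANDARD_VARIABLES
```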

Step 6: Rasterization Process

Rasterization Function

The following function converts vector (point or polygon) data into a raster grid:

import numpy as np
import pandas as pd
import xarray as xr
import datetime as dt
from pyproj import CRS
import geopandas as gpd
from shapely.geometry import box
from geocube.api.core import make_geocube
from geocube.rasterize import rasterize_image

def rasterize(
    vector_data: gpd.GeoDataFrame,
    bbox_dict: dict[str, float],
    min_date: dt.date,
    max_date: dt.date,
    metadata: dict[str, str]
) -> xr.Dataset:
    """
    Rasterize vector data to a 0.1° × 0.1° grid.

    Parameters:
    -----------
    vector_data : gpd.GeoDataFrame
        GeoDataFrame containing point or polygon geometries with 'value' column
    bbox_dict : dict[str, float]
        Bounding box dictionary with keys: 'xmin', 'xmax', 'ymin', 'ymax'
    min_date : dt.date
        Minimum date for temporal filtering
    max_date : dt.date
        Maximum date for temporal filtering
    metadata : dict[str, str]
        Metadata dictionary containing dataset information (country, provider, dataset, variable, etc.)

    Returns:
    --------
    xr.Dataset
        Rasterized dataset with dimensions (time, lat, lon)
    """
    def rasterize_function(**kwargs):
        return rasterize_image(all_touched=True, **kwargs)

    # create bounding box geometry
    bounding_box = box(
        minx=bbox_dict['xmin'],
        miny=bbox_dict['ymin'], 
        maxx=bbox_dict['xmax'],
        maxy=bbox_dict['ymax']
    )

    # create geocube (rasterize)
    resolution = 0.1
    time = pd.date_range(start=min_date, end=max_date, freq='D')
    lat = np.arange(bbox_dict['ymin'], bbox_dict['ymax'] + resolution, resolution)
    lon = np.arange(bbox_dict['xmin'], bbox_dict['xmax'] + resolution, resolution)
    template = xr.Dataset(coords={"time": time, "y": lat, "x": lon})
    template = template.assign(value=np.nan)
    template.attrs["crs"] = CRS.from_epsg(4326).to_wkt()

    grid = make_geocube(
        vector_data=vector_data,
        measurements=["value"],
        rasterize_function=rasterize_function,
        datetime_measurements=["time"],
        geom=bounding_box,
        fill=np.nan,
        like=template,
        group_by="time",
    )

    # rename coordinates and clean up
    grid = grid \
        .rename({"x": "lon", "y": "lat"}) \
        .reset_coords(["spatial_ref"], drop=True)

    # reindex to ensure complete grid coverage
    grid = grid.reindex(time=time, lon=lon, lat=lat, method="nearest")

    # sort by time
    grid = grid.sortby("time")

    grid.attrs = metadata
    grid.attrs["crs"] = template.attrs["crs"]

    return grid

Key Rasterization Parameters

  • Resolution: Fixed at 0.1° × 0.1° (set via the 0.1° template grid passed as like=template)
  • CRS: WGS84 (EPSG:4326)
  • Rasterization method: all_touched=True (includes all cells touched by geometries)
  • Fill value: np.nan for missing data
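
The grid axes implied by these parameters can be previewed without running geocube; this repeats the axis construction used inside rasterize(), with the India bounding box from Step 4:

```python
import numpy as np

resolution = 0.1
bbox_dict = {"xmin": 68, "xmax": 97.5, "ymin": 6.5, "ymax": 36}  # India (Step 4)

# same axis construction as in the rasterize() template above;
# np.arange is half-open, so + resolution makes the upper bound inclusive
lat = np.arange(bbox_dict["ymin"], bbox_dict["ymax"] + resolution, resolution)
lon = np.arange(bbox_dict["xmin"], bbox_dict["xmax"] + resolution, resolution)

print(len(lat), len(lon))
```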

Step 7: Complete Processing Pipeline

Full Processing Script

Here's the complete script to process your data from input to output:

import os
import numpy as np
import pandas as pd
import xarray as xr
import datetime as dt
import geopandas as gpd
import netCDF4
from shapely.geometry import box
from geocube.api.core import make_geocube
from geocube.rasterize import rasterize_image

# load your data (adjust this based on your data source)
# data = pd.read_csv("your_data.csv")
# data['time'] = pd.to_datetime(data['time'])

# define your bounding box (use the appropriate country bbox)
bbox_dict = {
    "xmin": 68,    # Example for India
    "xmax": 97.5,
    "ymin": 6.5,
    "ymax": 36
}

# define output directory
output_dir = "outputs"
os.makedirs(output_dir, exist_ok=True)

# metadata describing the dataset (see Step 3)
metadata = {
    "country": "ind",
    "provider": "custom-provider",
    "dataset": "archive",
    "resolution": "0.1x0.1",
    "tz": "UTC",
    "variable": "temp-min-24hr",
    "created_at": "2026-01-16T22:11:59Z",
    "modified_at": "2026-01-16T22:11:59Z"
}
variable = metadata["variable"]

# process data year by year (rasterize() is defined in Step 6)
for year, subset in data.groupby("year"):

    # create output file path
    file = f"{output_dir}/{year}.nc4"

    # convert to GeoDataFrame
    vector_data: gpd.GeoDataFrame = gpd.GeoDataFrame(
        subset,
        geometry=gpd.points_from_xy(subset["lon"], subset["lat"]),
        crs="EPSG:4326"
    )

    # rasterize the data
    raster_data: xr.Dataset = rasterize(
        vector_data=vector_data,
        bbox_dict=bbox_dict,
        min_date=dt.date(year, 1, 1),
        max_date=dt.date(year, 12, 31),
        metadata=metadata
    )

    # export to NetCDF4 with zlib level-1 compression (see Step 1)
    raster_data.to_netcdf(
        file,
        encoding={"value": {"zlib": True, "complevel": 1}}
    )
    print(f"[{variable} | {year}] Saved: {file}")

Processing Steps Summary

  1. Load Data: Read your input data into a pandas DataFrame
  2. Group by Year: Process data year-by-year for efficient handling
  3. Convert to GeoDataFrame: Create a GeoDataFrame with point geometries
  4. Rasterize: Apply the rasterization function to create a grid
  5. Export: Save each year's raster as a separate NetCDF4 file

Step 8: Output Format

Output File Structure

  • Format: NetCDF4 (.nc4 extension)
  • Organization: One file per variable per year (e.g., 2005.nc4, 2006.nc4, ...)
  • Location: All files in the outputs/ directory (or your specified directory)

Output Dataset Structure

Each NetCDF4 file contains an xarray.Dataset with:

  • Dimensions:
      • time: Temporal dimension (daily timestamps)
      • lat: Latitude dimension (0.1° intervals)
      • lon: Longitude dimension (0.1° intervals)

  • Data Variables:
      • value: Rasterized values (float, with NaN for missing data)

  • Coordinates:
      • time: Array of timestamps
      • lat: Array of latitude values
      • lon: Array of longitude values

Example Output File

outputs/
├── 2005.nc4
├── 2006.nc4
├── 2007.nc4
└── ...

Step 9: Validation Checklist

Format:

  • [ ] NetCDF4 format (.nc4 extension)
  • [ ] Compression applied

Structure:

  • [ ] Variables: time, lat, lon, value
  • [ ] Data types: datetime64[ns], float64, float64, float64
  • [ ] Dimension order: (time, lat, lon)

Data:

  • [ ] No NaN in coordinates
  • [ ] Precision: 6 decimals (coords), 4 decimals (values)
  • [ ] Valid coordinate ranges

Metadata:

  • [ ] All required attributes present
  • [ ] Valid variable name
  • [ ] Correct attribute format
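
Parts of this checklist can be automated. A sketch, assuming the output conventions from Step 8 (the structure and attribute checks only; compression and precision checks are omitted):

```python
import xarray as xr

REQUIRED_ATTRS = [
    "country", "provider", "dataset", "resolution",
    "tz", "variable", "created_at", "modified_at",
]

def validate(ds: xr.Dataset) -> list[str]:
    """Return a list of checklist violations (empty list = all checks pass)."""
    issues = []
    for coord in ("time", "lat", "lon"):
        if coord not in ds.coords:
            issues.append(f"missing coordinate '{coord}'")
        elif bool(ds[coord].isnull().any()):
            issues.append(f"NaN in coordinate '{coord}'")
    if "value" not in ds.data_vars:
        issues.append("missing 'value' data variable")
    elif ds["value"].dims != ("time", "lat", "lon"):
        issues.append("dimension order must be (time, lat, lon)")
    for attr in REQUIRED_ATTRS:
        if attr not in ds.attrs:
            issues.append(f"missing attribute '{attr}'")
    return issues
```

A produced file would then be checked with validate(xr.open_dataset("outputs/2005.nc4")) before submission.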

Summary

The onboarding process transforms vector (point/polygon) data into standardized raster grids:

  1. Input: Pandas DataFrame with time, coordinates, and values
  2. Metadata: JSON file with dataset information
  3. Processing: Rasterization to 0.1° × 0.1° grid using geocube
  4. Output: NetCDF4 files organized by year

This standardized format ensures compatibility with the Riskwolf platform's data ingestion and analysis pipelines.