Onboarding Custom Datasets to Riskwolf Platform

Document Version: 1.0

Date: 2026-01-06

Purpose: This guide provides step-by-step instructions for onboarding custom point-wise or polygon-wise datasets as raster data to the Riskwolf platform.

Overview

The Riskwolf platform requires all custom datasets to be:

  • Converted from vector (point or polygon) data to raster format
  • Rasterized to a standard 0.1° × 0.1° grid resolution
  • Exported as NetCDF4 (.nc4) files, organized by year
  • Accompanied by standardized metadata in JSON format

Prerequisites

Required Python Packages

import numpy as np
import pandas as pd
import xarray as xr
import datetime as dt
import geopandas as gpd
import netCDF4
from pyproj import CRS
from shapely.geometry import box
from geocube.api.core import make_geocube
from geocube.rasterize import rasterize_image

Install dependencies:

pip install numpy pandas xarray geopandas netCDF4 pyproj geocube shapely

Step 1: File Structure

1.1 Required Format

  • Format: NetCDF4 (.nc4 extension)
  • Compression: zlib level 1

1.2 Dataset Structure

<xarray.Dataset>
Dimensions:  (time: N, lat: M, lon: L)
Coordinates:
  * time     (time) datetime64[ns]
  * lat      (lat) float64
  * lon      (lon) float64
Data variables:
    value    (time, lat, lon) float64
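
As a quick sanity check, this structure can be reproduced with a tiny in-memory dataset (the dimension sizes and values below are purely illustrative):

```python
import numpy as np
import pandas as pd
import xarray as xr

# illustrative sizes: N=2 timestamps, M=3 latitudes, L=4 longitudes
time = pd.date_range("2005-01-01", periods=2, freq="D")
lat = np.array([6.5, 6.6, 6.7])
lon = np.array([68.0, 68.1, 68.2, 68.3])

ds = xr.Dataset(
    {"value": (("time", "lat", "lon"), np.zeros((len(time), len(lat), len(lon))))},
    coords={"time": time, "lat": lat, "lon": lon},
)

# dimension order must be (time, lat, lon)
assert ds["value"].dims == ("time", "lat", "lon")
```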

Step 2: Prepare Input Data

Input Data Format

Your input data must be a pandas DataFrame with the following required columns:

Column  Type      Description                           Example
------  --------  ------------------------------------  ----------
time    datetime  Timestamp for each observation        2005-01-01
year    int       Year of the observation               2005
lat     float     Latitude in decimal degrees (WGS84)   7.507
lon     float     Longitude in decimal degrees (WGS84)  93.597
value   float     The measured/observed value           27.31

Example Input Data

>>> data
            time  year     lat     lon  value
0     2005-01-01  2005   7.507  93.597  27.31
1     2005-01-02  2005   7.507  93.597  27.01
2     2005-01-03  2005   7.507  93.597  27.29
3     2005-01-04  2005   7.507  93.597  27.22
4     2005-01-05  2005   7.507  93.597  27.41
...          ...   ...     ...     ...    ...
22822 2025-10-27  2025  11.508  92.644  27.35
22823 2025-10-28  2025  11.508  92.644  28.02
22824 2025-10-29  2025  11.508  92.644  27.72
22825 2025-10-30  2025  11.508  92.644  27.56
22826 2025-10-31  2025  11.508  92.644  27.30

[22827 rows x 5 columns]

Notes:

  • Coordinates must be in WGS84 (EPSG:4326)
  • The time column should be datetime objects or pandas Timestamps
  • Missing values (NaN) are not permitted in the final input; remove or impute them before proceeding
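
A minimal pre-flight check of the input DataFrame (required columns, datetime dtype, NaN, coordinate ranges) might look like this; the example rows are synthetic:

```python
import pandas as pd

REQUIRED_COLUMNS = ["time", "year", "lat", "lon", "value"]

def check_input(data: pd.DataFrame) -> None:
    """Raise ValueError if the input DataFrame violates the requirements above."""
    missing = [c for c in REQUIRED_COLUMNS if c not in data.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if not pd.api.types.is_datetime64_any_dtype(data["time"]):
        raise ValueError("'time' must be datetime64")
    if data[REQUIRED_COLUMNS].isna().any().any():
        raise ValueError("NaN values are not allowed in the input")
    if not data["lat"].between(-90, 90).all() or not data["lon"].between(-180, 180).all():
        raise ValueError("coordinates must be valid WGS84 decimal degrees")

# synthetic example rows in the required format
data = pd.DataFrame({
    "time": pd.to_datetime(["2005-01-01", "2005-01-02"]),
    "year": [2005, 2005],
    "lat": [7.507, 7.507],
    "lon": [93.597, 93.597],
    "value": [27.31, 27.01],
})
check_input(data)  # passes silently
```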

Step 3: Prepare Metadata

Create a JSON metadata file that describes your dataset. This metadata is required for platform integration.

Metadata Schema

{
    "country": "ind",
    "provider": "custom-provider",
    "dataset": "archive",
    "resolution": "0.1x0.1",
    "tz": "UTC",
    "variable": "temp-min-24hr",
    "created_at": "2026-01-16T22:11:59Z",
    "modified_at": "2026-01-16T22:11:59Z"
}

Metadata Field Descriptions

Field        Required  Description
-----------  --------  --------------------------------------------------------
country      Yes       ISO3 country code (e.g., "ind", "phl", "idn")
provider     Yes       Provider identifier - must be requested from admin
dataset      Yes       Dataset identifier - must be requested from admin
resolution   Yes       Always "0.1x0.1" (standard grid resolution)
tz           Yes       Timezone of the source data (IANA timezone name or "UTC")
variable     Yes       Variable name - must be requested from admin
created_at   Yes       ISO 8601 timestamp in UTC
modified_at  Yes       ISO 8601 timestamp in UTC

Important: Before onboarding, contact the admin to obtain approved values for provider, dataset, and variable fields.
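
One way to generate the metadata file programmatically; the provider, dataset, and variable values below are placeholders that must be replaced with admin-approved values:

```python
import json
import datetime as dt

# ISO 8601 timestamp in UTC, matching the schema above
now = dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

metadata = {
    "country": "ind",
    "provider": "custom-provider",  # placeholder - request from admin
    "dataset": "archive",           # placeholder - request from admin
    "resolution": "0.1x0.1",        # fixed standard grid resolution
    "tz": "UTC",
    "variable": "temp-min-24hr",    # placeholder - request from admin
    "created_at": now,
    "modified_at": now,
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=4)
```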


Step 4: Determine Bounding Box

You need to specify the geographic bounding box for rasterization. Use the bounding box corresponding to your country from the reference list below.

Supported Country Bounding Boxes

ID   Country      xmin   xmax   ymin   ymax
---  -----------  -----  -----  -----  -----
phl  Philippines  116    127    4      21.5
ind  India        68     97.5   6.5    36
idn  Indonesia    94     142    -12    7
mys  Malaysia     99     121    0      8
tha  Thailand     95     107    5      21
col  Colombia     -82    -66    -5     17
nga  Nigeria      2      15     4      14
bra  Brazil       -75    -28.5  -34    6
bgd  Bangladesh   87     93     20     27
alb  Albania      19     21.5   39.5   43
aut  Austria      9      17.5   46     49.5
che  Switzerland  5.5    11     45.5   48
khm  Cambodia     102    108    9.8    15
arg  Argentina    -74    -53    -56    -21
bol  Bolivia      -70    -57    -23    -9
civ  Ivory Coast  -9     -2     4      11
gtm  Guatemala    -93    -88    13.5   18
hnd  Honduras     -90    -83    12.5   17
ken  Kenya        33     42     -5     6
kgz  Kyrgyzstan   69     81     39     44
mex  Mexico       -119   -86    14     33
moz  Mozambique   30     41     -27    -10
mwi  Malawi       32     36     -18    -9
per  Peru         -82    -68    -19    1
rwa  Rwanda       28.5   31     -3     -1
tjk  Tajikistan   67     76     36     41.5
zwe  Zimbabwe     25     34     -23    -15

Note: If your country is not in this list, contact the admin to request a bounding box definition.
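
The table can be mirrored in code as a small lookup keyed by ISO3 code; only a few entries are shown here, with values copied from the table above:

```python
# subset of the bounding boxes from the table above, keyed by ISO3 country code
BBOXES: dict[str, dict[str, float]] = {
    "ind": {"xmin": 68, "xmax": 97.5, "ymin": 6.5, "ymax": 36},
    "phl": {"xmin": 116, "xmax": 127, "ymin": 4, "ymax": 21.5},
    "tha": {"xmin": 95, "xmax": 107, "ymin": 5, "ymax": 21},
}

def get_bbox(country: str) -> dict[str, float]:
    """Look up the rasterization bounding box for a supported country."""
    try:
        return BBOXES[country]
    except KeyError:
        raise KeyError(
            f"no bounding box for '{country}' - contact the admin to request one"
        )

bbox_dict = get_bbox("ind")
```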


Step 5: Standard Variable Names

Variable ID       Variable Name
----------------  ----------------------------
temp-mean-24hr    Temperature mean daily
temp-max-24hr     Temperature maximum daily
temp-min-24hr     Temperature minimum daily
relhum-mean-24hr  Relative humidity mean daily
precip-sum-24hr   Precipitation sum daily
wind-mean-24hr    Wind speed mean daily
wind-max-24hr     Wind speed maximum daily
wind-min-24hr     Wind speed minimum daily

Note: For any new variables please contact Riskwolf support.
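
For convenience, the same table can be kept as a dictionary and used to check that a metadata "variable" value is one of the standard IDs:

```python
# standard variable IDs and names, copied from the table above
STANDARD_VARIABLES = {
    "temp-mean-24hr": "Temperature mean daily",
    "temp-max-24hr": "Temperature maximum daily",
    "temp-min-24hr": "Temperature minimum daily",
    "relhum-mean-24hr": "Relative humidity mean daily",
    "precip-sum-24hr": "Precipitation sum daily",
    "wind-mean-24hr": "Wind speed mean daily",
    "wind-max-24hr": "Wind speed maximum daily",
    "wind-min-24hr": "Wind speed minimum daily",
}

# example: verify a metadata variable before onboarding
assert "temp-min-24hr" in STANDARD_VARIABLES
```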

Step 6: Rasterization Process

Rasterization Function

The following function converts vector (point or polygon) data into a raster grid:

import numpy as np
import pandas as pd
import xarray as xr
import datetime as dt
from pyproj import CRS
import geopandas as gpd
from shapely.geometry import box
from geocube.api.core import make_geocube
from geocube.rasterize import rasterize_image

def rasterize(
    vector_data: gpd.GeoDataFrame,
    bbox_dict: dict[str, float],
    min_date: dt.date,
    max_date: dt.date,
    metadata: dict[str, str]
) -> xr.Dataset:
    """
    Rasterize vector data to a 0.1° × 0.1° grid.

    Parameters:
    -----------
    vector_data : gpd.GeoDataFrame
        GeoDataFrame containing point or polygon geometries with 'value' column
    bbox_dict : dict[str, float]
        Bounding box dictionary with keys: 'xmin', 'xmax', 'ymin', 'ymax'
    min_date : dt.date
        Minimum date for temporal filtering
    max_date : dt.date
        Maximum date for temporal filtering
    metadata : dict[str, str]
        Metadata dictionary containing dataset information (country, provider, dataset, variable, etc.)

    Returns:
    --------
    xr.Dataset
        Rasterized dataset with dimensions (time, lat, lon)
    """
    def rasterize_function(**kwargs):
        return rasterize_image(all_touched=True, **kwargs)

    # create bounding box geometry
    bounding_box = box(
        minx=bbox_dict['xmin'],
        miny=bbox_dict['ymin'], 
        maxx=bbox_dict['xmax'],
        maxy=bbox_dict['ymax']
    )

    # create geocube (rasterize)
    resolution = 0.1
    time = pd.date_range(start=min_date, end=max_date, freq='D')
    lat = np.arange(bbox_dict['ymin'], bbox_dict['ymax'] + resolution, resolution)
    lon = np.arange(bbox_dict['xmin'], bbox_dict['xmax'] + resolution, resolution)
    template = xr.Dataset(coords={"time": time, "y": lat, "x": lon})
    template = template.assign(value=np.nan)
    template.attrs["crs"] = CRS.from_epsg(4326).to_wkt()

    grid = make_geocube(
        vector_data=vector_data,
        measurements=["value"],
        rasterize_function=rasterize_function,
        datetime_measurements=["time"],
        geom=bounding_box,
        fill=np.nan,
        like=template,
        group_by="time",
    )

    # rename coordinates and clean up
    grid = grid \
        .rename({"x": "lon", "y": "lat"}) \
        .reset_coords(["spatial_ref"], drop=True)

    # reindex to ensure complete grid coverage
    grid = grid.reindex(time=time, lon=lon, lat=lat, method="nearest")

    # sort by time
    grid = grid.sortby("time")

    grid.attrs = metadata
    grid.attrs["crs"] = template.attrs["crs"]

    return grid

Key Rasterization Parameters

  • Resolution: Fixed at 0.1° × 0.1° (set via the 0.1° template grid passed as like=template)
  • CRS: WGS84 (EPSG:4326)
  • Rasterization method: all_touched=True (includes all cells touched by geometries)
  • Fill value: np.nan for missing data
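
The grid axes implied by these parameters can be previewed without running geocube; this repeats the axis construction used inside rasterize(), with the India bounding box from Step 4:

```python
import numpy as np

resolution = 0.1
bbox_dict = {"xmin": 68, "xmax": 97.5, "ymin": 6.5, "ymax": 36}  # India (Step 4)

# same axis construction as in the rasterize() template above;
# np.arange is half-open, so + resolution makes the upper bound inclusive
lat = np.arange(bbox_dict["ymin"], bbox_dict["ymax"] + resolution, resolution)
lon = np.arange(bbox_dict["xmin"], bbox_dict["xmax"] + resolution, resolution)

print(len(lat), len(lon))
```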

Step 7: Complete Processing Pipeline

Full Processing Script

Here's the complete script to process your data from input to output:

import os
import numpy as np
import pandas as pd
import xarray as xr
import datetime as dt
import geopandas as gpd
import netCDF4
from shapely.geometry import box
from geocube.api.core import make_geocube
from geocube.rasterize import rasterize_image

# load your data (adjust this based on your data source)
# data = pd.read_csv("your_data.csv")
# data['time'] = pd.to_datetime(data['time'])

# define your bounding box (use the appropriate country bbox)
bbox_dict = {
    "xmin": 68,    # Example for India
    "xmax": 97.5,
    "ymin": 6.5,
    "ymax": 36
}

# define output directory
output_dir = "outputs"
os.makedirs(output_dir, exist_ok=True)

# metadata describing the dataset (see Step 3)
metadata = {
    "country": "ind",
    "provider": "custom-provider",
    "dataset": "archive",
    "resolution": "0.1x0.1",
    "tz": "UTC",
    "variable": "temp-min-24hr",
    "created_at": "2026-01-16T22:11:59Z",
    "modified_at": "2026-01-16T22:11:59Z"
}
variable = metadata["variable"]

# process data year by year (rasterize() is defined in Step 6)
for year, subset in data.groupby("year"):

    # create output file path
    file = f"{output_dir}/{year}.nc4"

    # convert to GeoDataFrame
    vector_data: gpd.GeoDataFrame = gpd.GeoDataFrame(
        subset,
        geometry=gpd.points_from_xy(subset["lon"], subset["lat"]),
        crs="EPSG:4326"
    )

    # rasterize the data
    raster_data: xr.Dataset = rasterize(
        vector_data=vector_data,
        bbox_dict=bbox_dict,
        min_date=dt.date(year, 1, 1),
        max_date=dt.date(year, 12, 31),
        metadata=metadata
    )

    # export to NetCDF4 with zlib level-1 compression (see Step 1)
    raster_data.to_netcdf(
        file,
        encoding={"value": {"zlib": True, "complevel": 1}}
    )
    print(f"[{variable} | {year}] Saved: {file}")

Processing Steps Summary

  1. Load Data: Read your input data into a pandas DataFrame
  2. Group by Year: Process data year-by-year for efficient handling
  3. Convert to GeoDataFrame: Create a GeoDataFrame with point geometries
  4. Rasterize: Apply the rasterization function to create a grid
  5. Export: Save each year's raster as a separate NetCDF4 file

Step 8: Output Format

Output File Structure

  • Format: NetCDF4 (.nc4 extension)
  • Organization: One file per variable per year (e.g., 2005.nc4, 2006.nc4, ...)
  • Location: All files in the outputs/ directory (or your specified directory)

Output Dataset Structure

Each NetCDF4 file contains an xarray.Dataset with:

  • Dimensions:
      • time: Temporal dimension (daily timestamps)
      • lat: Latitude dimension (0.1° intervals)
      • lon: Longitude dimension (0.1° intervals)

  • Data Variables:
      • value: Rasterized values (float, with NaN for missing data)

  • Coordinates:
      • time: Array of timestamps
      • lat: Array of latitude values
      • lon: Array of longitude values

Example Output File

outputs/
├── 2005.nc4
├── 2006.nc4
├── 2007.nc4
└── ...

Step 9: Validation Checklist

Format:

  • [ ] NetCDF4 format (.nc4 extension)
  • [ ] Compression applied

Structure:

  • [ ] Variables: time, lat, lon, value
  • [ ] Data types: datetime64[ns], float64, float64, float64
  • [ ] Dimension order: (time, lat, lon)

Data:

  • [ ] No NaN in coordinates
  • [ ] Precision: 6 decimals (coords), 4 decimals (values)
  • [ ] Valid coordinate ranges

Metadata:

  • [ ] All required attributes present
  • [ ] Valid variable name
  • [ ] Correct attribute format
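
Parts of this checklist can be automated. A sketch, assuming the output conventions from Step 8 (the structure and attribute checks only; compression and precision checks are omitted):

```python
import xarray as xr

REQUIRED_ATTRS = [
    "country", "provider", "dataset", "resolution",
    "tz", "variable", "created_at", "modified_at",
]

def validate(ds: xr.Dataset) -> list[str]:
    """Return a list of checklist violations (empty list = all checks pass)."""
    issues = []
    for coord in ("time", "lat", "lon"):
        if coord not in ds.coords:
            issues.append(f"missing coordinate '{coord}'")
        elif bool(ds[coord].isnull().any()):
            issues.append(f"NaN in coordinate '{coord}'")
    if "value" not in ds.data_vars:
        issues.append("missing 'value' data variable")
    elif ds["value"].dims != ("time", "lat", "lon"):
        issues.append("dimension order must be (time, lat, lon)")
    for attr in REQUIRED_ATTRS:
        if attr not in ds.attrs:
            issues.append(f"missing attribute '{attr}'")
    return issues
```

A produced file would then be checked with validate(xr.open_dataset("outputs/2005.nc4")) before submission.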

Summary

The onboarding process transforms vector (point/polygon) data into standardized raster grids:

  1. Input: Pandas DataFrame with time, coordinates, and values
  2. Metadata: JSON file with dataset information
  3. Processing: Rasterization to 0.1° × 0.1° grid using geocube
  4. Output: NetCDF4 files organized by year

This standardized format ensures compatibility with the Riskwolf platform's data ingestion and analysis pipelines.