Data Quality Assurance

Purpose:
This document outlines the Data Quality Framework that Riskwolf uses to ensure the integrity, accuracy, and reliability of all datasets. It gives a structured overview of the controls that monitor and log data quality across the data processing pipeline.


1. Overview of Controls

Riskwolf pipelines use a set of predefined controls that monitor data quality as data moves through the pipeline (Download → Load → Process). Each control is implemented via the Control class, which internally uses the ControlBinary handler and the associated enums to maintain structured state and metrics.

Control | Description | Pipeline Stage | Check Type | Dimensions
Corrupt Files | Detects unreadable or corrupt files | Load | File count | Success / Failure / Total
Processed Files | Validates correctness of processed files | Process | Valid conversion | Success / Failure / Total
Required Inputs | Ensures availability of all dependent raw input files | Process | Variable exists | Success / Failure / Total
Bounding Box Check | Verifies spatial alignment of new data with existing datasets | Process | BBOX match | Success / Failure / Total
Third-Party Outages | Logs failures caused by external providers or APIs | Download | Third-party error | Warning / Success / Total

2. Logging and Traceability

All controls use a standardized logging structure through the log_control method. Each log entry contains:

  • Run ID: Unique pipeline execution identifier
  • Job ID (ControlJob): Specifies the processing job (e.g., geo-downloader)
  • Stage (ControlStage): Pipeline stage in which the control ran (e.g., load)
  • Provider / Dataset / Variable / Country: Metadata for context
  • Control ID (ControlId): Identifies the specific control (e.g., CORRUPT_FILES)
  • Check Type (ControlCheckType) & Dimension (ControlDimension): Type of quality check and result dimension
  • Value: Count recorded for the given dimension
  • Timestamp (asAt): Execution time of the control

Because every entry carries the same fields, each quality metric can be traced back to a specific run, job, and dataset.
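
As a rough illustration of the entry structure listed above, the sketch below assembles those fields into a dictionary and writes them with Python's standard logging module. It is not the log_control method itself: the function name log_control_entry, its parameters, and the logger name are assumptions made for this example, and only the field keys are taken from the sample log output in this document.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger("controls")  # logger name is an assumption


def log_control_entry(message, *, run_id, job_id, stage, provider, dataset,
                      variable, country, control_id, check_type, dimension,
                      value, as_at=None):
    """Hypothetical stand-in for log_control: build and log one entry."""
    as_at = as_at or datetime.now(timezone.utc)
    extra = {
        "runId": run_id,
        "jobId": job_id,
        "stage": stage,
        "provider": provider,
        "dataset": dataset,
        "variable": variable,
        "country": country,
        "asAt": as_at.isoformat(timespec="seconds"),
        "controlId": control_id,
        "checkType": check_type,
        "dimension": dimension,
        "value": value,
    }
    # Render the fields into the message so they appear as "extra={...}".
    logger.info("[%s] %s extra=%s", control_id, message, extra)
    return extra
```

Returning the dictionary makes the entry easy to inspect in tests without capturing log output.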


Example Control Log

The following excerpt shows two controls, corrupt-files in the Load stage and process-files in the Process stage, each logging its success, failure, and total counts:

[2025-09-10 14:43:54] [INFO ] [corrupt-files] Total number of successes: 0 extra={'runId': '78ec1b04-c4af-4e0d-84ab-fb9e96a5d82d', 'jobId': 'geo-extractor', 'stage': 'load', 'provider': 'noaa', 'dataset': 'ibtracs', 'variable': 'wind-max-6hr', 'country': 'phl', 'asAt': '2025-09-10T09:13:50+00:00', 'controlId': 'corrupt-files', 'checkType': 'file-count', 'dimension': 'success', 'value': 0}

[2025-09-10 14:43:54] [INFO ] [corrupt-files] Total number of failures: 1 extra={'runId': '78ec1b04-c4af-4e0d-84ab-fb9e96a5d82d', 'jobId': 'geo-extractor', 'stage': 'load', 'provider': 'noaa', 'dataset': 'ibtracs', 'variable': 'wind-max-6hr', 'country': 'phl', 'asAt': '2025-09-10T09:13:50+00:00', 'controlId': 'corrupt-files', 'checkType': 'file-count', 'dimension': 'failure', 'value': 1}

[2025-09-10 14:43:54] [INFO ] [corrupt-files] Total number of files processed: 1 extra={'runId': '78ec1b04-c4af-4e0d-84ab-fb9e96a5d82d', 'jobId': 'geo-extractor', 'stage': 'load', 'provider': 'noaa', 'dataset': 'ibtracs', 'variable': 'wind-max-6hr', 'country': 'phl', 'asAt': '2025-09-10T09:13:50+00:00', 'controlId': 'corrupt-files', 'checkType': 'file-count', 'dimension': 'total', 'value': 1}

[2025-09-10 14:43:54] [INFO ] [process-files] Total number of successes: 2 extra={'runId': '78ec1b04-c4af-4e0d-84ab-fb9e96a5d82d', 'jobId': 'geo-extractor', 'stage': 'process', 'provider': 'noaa', 'dataset': 'ibtracs', 'variable': 'wind-max-6hr', 'country': 'phl', 'asAt': '2025-09-10T09:13:50+00:00', 'controlId': 'process-files', 'checkType': 'valid-conversion', 'dimension': 'success', 'value': 2}

[2025-09-10 14:43:54] [INFO ] [process-files] Total number of failures: 1 extra={'runId': '78ec1b04-c4af-4e0d-84ab-fb9e96a5d82d', 'jobId': 'geo-extractor', 'stage': 'process', 'provider': 'noaa', 'dataset': 'ibtracs', 'variable': 'wind-max-6hr', 'country': 'phl', 'asAt': '2025-09-10T09:13:50+00:00', 'controlId': 'process-files', 'checkType': 'valid-conversion', 'dimension': 'failure', 'value': 1}

[2025-09-10 14:43:54] [INFO ] [process-files] Total number of files processed: 3 extra={'runId': '78ec1b04-c4af-4e0d-84ab-fb9e96a5d82d', 'jobId': 'geo-extractor', 'stage': 'process', 'provider': 'noaa', 'dataset': 'ibtracs', 'variable': 'wind-max-6hr', 'country': 'phl', 'asAt': '2025-09-10T09:13:50+00:00', 'controlId': 'process-files', 'checkType': 'valid-conversion', 'dimension': 'total', 'value': 3}
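
Entries in this shape can be checked mechanically. The sketch below assumes the bracketed prefix and the Python-literal extra={...} suffix shown above; it parses each line and verifies, per control, that success plus failure equals total. The helper names are hypothetical, and treating total as success + failure is an assumption that happens to hold for this excerpt.

```python
import ast
import re

LINE_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] \[(?P<level>[^\]]+)\] \[(?P<control>[^\]]+)\] "
    r"(?P<message>.*?) extra=(?P<extra>\{.*\})$"
)


def parse_control_line(line):
    """Return the extra fields of one log line as a dict, or None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # The extra payload is printed as a Python dict literal.
    return ast.literal_eval(match.group("extra"))


def totals_consistent(lines):
    """Map each controlId to whether success + failure == total."""
    counts = {}
    for line in lines:
        fields = parse_control_line(line)
        if fields is None:
            continue  # skip blank or unrelated lines
        by_dimension = counts.setdefault(fields["controlId"], {})
        by_dimension[fields["dimension"]] = fields["value"]
    return {
        control_id: dims.get("success", 0) + dims.get("failure", 0) == dims.get("total")
        for control_id, dims in counts.items()
    }
```

ast.literal_eval is used rather than eval because it only accepts literals, so a malformed or hostile log line cannot execute code.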

3. Control State and Metrics

The state and counts of each control are held in a ControlBinary instance, which tracks:

  • State (ControlState): INACTIVE or ACTIVE
  • Total: Number of items processed
  • Success / Failure / Warning: Counts for each dimension
  • Threshold: Minimum value for success criteria
  • Success Rate: Calculated as success / total while the control is ACTIVE

The ControlBinary class provides methods to record successes, failures, and warnings, and generates dimension-specific messages for logging purposes.
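
To make the tracked fields concrete, here is a minimal sketch of a counter with the same shape. It is not the actual ControlBinary class: the class name BinaryControlSketch, the method names, the rule that the first recorded item activates the control, and the reading of the threshold as a minimum success rate are all assumptions. The message wording follows the sample log, except for the warning message, which the log does not show.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ControlState(Enum):
    INACTIVE = "inactive"
    ACTIVE = "active"


@dataclass
class BinaryControlSketch:
    """Illustrative counter; not the actual ControlBinary class."""
    threshold: float = 1.0  # assumed here to be a minimum success rate
    state: ControlState = ControlState.INACTIVE
    total: int = 0
    success: int = 0
    failure: int = 0
    warning: int = 0

    def _record(self, dimension: str, count: int) -> None:
        # Assumption: the first recorded item activates the control.
        self.state = ControlState.ACTIVE
        setattr(self, dimension, getattr(self, dimension) + count)
        self.total += count

    def record_success(self, count: int = 1) -> None:
        self._record("success", count)

    def record_failure(self, count: int = 1) -> None:
        self._record("failure", count)

    def record_warning(self, count: int = 1) -> None:
        self._record("warning", count)

    @property
    def success_rate(self) -> Optional[float]:
        """success / total, defined only for an active, non-empty control."""
        if self.state is not ControlState.ACTIVE or self.total == 0:
            return None
        return self.success / self.total

    def meets_threshold(self) -> bool:
        """Assumed pass rule: success rate is at least the threshold."""
        rate = self.success_rate
        return rate is not None and rate >= self.threshold

    def message(self, dimension: str) -> str:
        """Dimension-specific message in the wording of the sample log."""
        labels = {
            "success": "successes",
            "failure": "failures",
            "warning": "warnings",  # wording assumed; not shown in the log
            "total": "files processed",
        }
        return f"Total number of {labels[dimension]}: {getattr(self, dimension)}"
```

Returning None for the success rate of an inactive or empty control avoids a division by zero and keeps "no data" distinct from "0% success".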


4. Control Descriptions

4.1 Corrupt Files

  • Monitors the number of files successfully read vs. corrupted.
  • Detects data integrity issues early during the Load stage.

4.2 Processed Files

  • Validates that files are correctly converted or processed.
  • Captures anomalies or processing failures during the Process stage.

4.3 Required Inputs

  • Ensures that all required input variables exist before processing.
  • Prevents downstream errors due to missing or incomplete source data.
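
A minimal sketch of such a check, assuming required inputs are identified by variable name. The function name and the second variable name are made up for illustration; only wind-max-6hr appears in this document.

```python
def check_required_inputs(required, available):
    """Count required variables that are present or missing."""
    available = set(available)
    missing = [name for name in required if name not in available]
    return {
        "success": len(required) - len(missing),
        "failure": len(missing),
        "total": len(required),
        "missing": missing,
    }


result = check_required_inputs(
    required=["wind-max-6hr", "pressure-min"],  # second name is made up
    available=["wind-max-6hr"],
)
if result["failure"]:
    # Stop before processing instead of failing later on absent data.
    print("Skipping processing; missing inputs:", result["missing"])
```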

4.4 Bounding Box Check

  • Verifies spatial alignment of incoming data against existing datasets.
  • Prevents latitude/longitude mismatches in geospatial datasets.
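
The comparison could look like the sketch below. This document does not say whether a match means exact equality, equality within a tolerance, or containment; the sketch assumes equality of all four edges within a small tolerance. The BBox type, the function name, and the coordinates are illustrative only.

```python
import math
from typing import NamedTuple


class BBox(NamedTuple):
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def bbox_matches(new: BBox, existing: BBox, tol: float = 1e-6) -> bool:
    """True when every edge of the two boxes agrees within tol degrees."""
    return all(
        math.isclose(a, b, rel_tol=0.0, abs_tol=tol)
        for a, b in zip(new, existing)
    )


existing = BBox(116.0, 4.0, 127.0, 22.0)   # illustrative extent only
shifted = BBox(116.25, 4.0, 127.25, 22.0)  # same size, offset in longitude
```

With these values, bbox_matches(shifted, existing) is False: the two boxes have the same size, but the longitude edges are offset by a quarter of a degree.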

4.5 Third-Party Outages

  • Logs failures caused by external providers or APIs.
  • Records these events under the Warning dimension (see the Dimensions column in Section 1), supporting proactive monitoring.
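
A minimal sketch of this control, assuming outages surface as ConnectionError or TimeoutError and are counted under the Warning dimension, as in the Dimensions column of the overview table. The function name and the choice of exception types are assumptions.

```python
def run_downloads(tasks, outage_errors=(ConnectionError, TimeoutError)):
    """Run download callables, counting provider errors as warnings."""
    counts = {"success": 0, "warning": 0, "total": 0}
    for task in tasks:
        counts["total"] += 1
        try:
            task()
        except outage_errors:
            # Provider-side problem: record it, but keep the run going.
            counts["warning"] += 1
        else:
            counts["success"] += 1
    return counts
```

Any exception outside outage_errors still propagates, so genuine bugs in the pipeline are not silently counted as provider outages.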

5. Enums Reference

The following enums are used to standardize control states, stages, jobs, IDs, check types, and dimensions:

  • ControlState: INACTIVE, ACTIVE
  • ControlStage: DOWNLOAD, LOAD, PROCESS, MASK, UPLOAD, CLEANUP
  • ControlJob: GEO_DOWNLOADER, GEO_EXTRACTOR, GEO_TRANSFORMER, GEO_STORE
  • ControlId: CORRUPT_FILES, PROCESS_FILES, REQUIRED_INPUTS, BBOX_MATCH, THIRD_PARTY_OUTAGE
  • ControlCheckType: ROW_COUNT, MISSING_VALUES, OUT_OF_RANGE, FILE_COUNT, VALID_CONVERSION, VARIABLE_EXISTS, BBOX_MATCH, THIRD_PARTY_ERROR
  • ControlDimension: SUCCESS, WARNING, FAILURE, TOTAL
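
For orientation, the enums could be declared along the following lines. The member names come from the list above. The string values are an assumption: the sample log shows lower-case, hyphenated forms for some members (for example corrupt-files, file-count, success), and the sketch extends that scheme to all members.

```python
from enum import Enum


def _kebab(name: str) -> str:
    """CORRUPT_FILES -> corrupt-files (naming scheme inferred from the log)."""
    return name.lower().replace("_", "-")


def _make_enum(enum_name: str, members: list) -> type:
    # Functional Enum API: a mapping of member names to values.
    return Enum(enum_name, {member: _kebab(member) for member in members})


ControlState = _make_enum("ControlState", ["INACTIVE", "ACTIVE"])
ControlStage = _make_enum(
    "ControlStage",
    ["DOWNLOAD", "LOAD", "PROCESS", "MASK", "UPLOAD", "CLEANUP"],
)
ControlJob = _make_enum(
    "ControlJob",
    ["GEO_DOWNLOADER", "GEO_EXTRACTOR", "GEO_TRANSFORMER", "GEO_STORE"],
)
ControlId = _make_enum(
    "ControlId",
    ["CORRUPT_FILES", "PROCESS_FILES", "REQUIRED_INPUTS", "BBOX_MATCH",
     "THIRD_PARTY_OUTAGE"],
)
ControlCheckType = _make_enum(
    "ControlCheckType",
    ["ROW_COUNT", "MISSING_VALUES", "OUT_OF_RANGE", "FILE_COUNT",
     "VALID_CONVERSION", "VARIABLE_EXISTS", "BBOX_MATCH", "THIRD_PARTY_ERROR"],
)
ControlDimension = _make_enum(
    "ControlDimension", ["SUCCESS", "WARNING", "FAILURE", "TOTAL"]
)
```

A logged string can then be mapped back to its member by value, for example ControlId("process-files").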