Data Quality Assurance
Purpose:
This document outlines the Data Quality Framework that Riskwolf uses to ensure the integrity, accuracy, and reliability of its datasets. It gives a structured overview of the controls that monitor and log data quality across the data processing pipeline.
1. Overview of Controls
Riskwolf pipelines use a set of predefined controls that monitor data quality across all pipeline stages (Download → Load → Process). Each control is implemented via the Control class, which internally uses the ControlBinary handler and the associated enums to maintain structured state and metrics.
Control | Description | Pipeline Stage | Check Type | Dimensions |
---|---|---|---|---|
Corrupt Files | Detects unreadable or corrupt files | Load | File count | Success / Failure / Total |
Processed Files | Validates correctness of processed files | Process | Valid conversion | Success / Failure / Total |
Required Inputs | Ensures availability of all dependent raw input files | Process | Variable exists | Success / Failure / Total |
Bounding Box Check | Verifies spatial alignment of new data with existing datasets | Process | BBOX match | Success / Failure / Total |
Third-Party Outages | Logs failures caused by external providers or APIs | Download | Third-party error | Warning / Success / Total |
2. Logging and Traceability
All controls use a standardized logging structure through the log_control method. Each log entry contains:
- Run ID: Unique pipeline execution identifier
- Job ID (ControlJob): Specifies the processing job (e.g., geo-downloader)
- Provider / Dataset / Variable / Country: Metadata for context
- Control ID (ControlId): Identifies the specific control (e.g., CORRUPT_FILES)
- Check Type (ControlCheckType) & Dimension (ControlDimension): Type of quality check and result dimension
- Timestamp (asAt): Execution time of the control
Because every entry carries the same fields, each quality metric can be traced back to a specific run, job, dataset, and control.
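As a rough illustration of the fields listed above, the sketch below assembles the payload for one log entry. The function name build_control_extra and its signature are assumptions made for this example, and the actual log_control method may be structured differently. The key names follow the example log in this document.

```python
from datetime import datetime, timezone


def build_control_extra(run_id, job_id, stage, provider, dataset, variable,
                        country, control_id, check_type, dimension, value,
                        as_at=None):
    """Assemble the structured payload for one control log entry.

    Hypothetical helper: the real log_control method may differ.
    """
    as_at = as_at or datetime.now(timezone.utc)
    return {
        "runId": run_id,
        "jobId": job_id,
        "stage": stage,
        "provider": provider,
        "dataset": dataset,
        "variable": variable,
        "country": country,
        "asAt": as_at.isoformat(timespec="seconds"),
        "controlId": control_id,
        "checkType": check_type,
        "dimension": dimension,
        "value": value,
    }


payload = build_control_extra(
    run_id="78ec1b04-c4af-4e0d-84ab-fb9e96a5d82d",
    job_id="geo-extractor",
    stage="load",
    provider="noaa",
    dataset="ibtracs",
    variable="wind-max-6hr",
    country="phl",
    control_id="corrupt-files",
    check_type="file-count",
    dimension="failure",
    value=1,
    as_at=datetime(2025, 9, 10, 9, 13, 50, tzinfo=timezone.utc),
)
print(f"[{payload['controlId']}] Total number of failures: "
      f"{payload['value']} extra={payload}")
```

Passing as_at explicitly keeps the timestamp identical across the success, failure, and total entries of one control.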
Example Control Log
The example below shows two controls from a single run, corrupt-files in the load stage and process-files in the process stage, each logging its success, failure, and total counts:
[2025-09-10 14:43:54] [INFO ] [corrupt-files] Total number of successes: 0 extra={'runId': '78ec1b04-c4af-4e0d-84ab-fb9e96a5d82d', 'jobId': 'geo-extractor', 'stage': 'load', 'provider': 'noaa', 'dataset': 'ibtracs', 'variable': 'wind-max-6hr', 'country': 'phl', 'asAt': '2025-09-10T09:13:50+00:00', 'controlId': 'corrupt-files', 'checkType': 'file-count', 'dimension': 'success', 'value': 0}
[2025-09-10 14:43:54] [INFO ] [corrupt-files] Total number of failures: 1 extra={'runId': '78ec1b04-c4af-4e0d-84ab-fb9e96a5d82d', 'jobId': 'geo-extractor', 'stage': 'load', 'provider': 'noaa', 'dataset': 'ibtracs', 'variable': 'wind-max-6hr', 'country': 'phl', 'asAt': '2025-09-10T09:13:50+00:00', 'controlId': 'corrupt-files', 'checkType': 'file-count', 'dimension': 'failure', 'value': 1}
[2025-09-10 14:43:54] [INFO ] [corrupt-files] Total number of files processed: 1 extra={'runId': '78ec1b04-c4af-4e0d-84ab-fb9e96a5d82d', 'jobId': 'geo-extractor', 'stage': 'load', 'provider': 'noaa', 'dataset': 'ibtracs', 'variable': 'wind-max-6hr', 'country': 'phl', 'asAt': '2025-09-10T09:13:50+00:00', 'controlId': 'corrupt-files', 'checkType': 'file-count', 'dimension': 'total', 'value': 1}
[2025-09-10 14:43:54] [INFO ] [process-files] Total number of successes: 2 extra={'runId': '78ec1b04-c4af-4e0d-84ab-fb9e96a5d82d', 'jobId': 'geo-extractor', 'stage': 'process', 'provider': 'noaa', 'dataset': 'ibtracs', 'variable': 'wind-max-6hr', 'country': 'phl', 'asAt': '2025-09-10T09:13:50+00:00', 'controlId': 'process-files', 'checkType': 'valid-conversion', 'dimension': 'success', 'value': 2}
[2025-09-10 14:43:54] [INFO ] [process-files] Total number of failures: 1 extra={'runId': '78ec1b04-c4af-4e0d-84ab-fb9e96a5d82d', 'jobId': 'geo-extractor', 'stage': 'process', 'provider': 'noaa', 'dataset': 'ibtracs', 'variable': 'wind-max-6hr', 'country': 'phl', 'asAt': '2025-09-10T09:13:50+00:00', 'controlId': 'process-files', 'checkType': 'valid-conversion', 'dimension': 'failure', 'value': 1}
[2025-09-10 14:43:54] [INFO ] [process-files] Total number of files processed: 3 extra={'runId': '78ec1b04-c4af-4e0d-84ab-fb9e96a5d82d', 'jobId': 'geo-extractor', 'stage': 'process', 'provider': 'noaa', 'dataset': 'ibtracs', 'variable': 'wind-max-6hr', 'country': 'phl', 'asAt': '2025-09-10T09:13:50+00:00', 'controlId': 'process-files', 'checkType': 'valid-conversion', 'dimension': 'total', 'value': 3}
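For downstream monitoring, lines in this format can be turned back into structured records. The following is a minimal sketch, assuming the line layout stays exactly as shown above and that the extra payload is a Python dict literal. The function name parse_control_line is hypothetical.

```python
import ast
import re

# Layout assumed from the example log: [time] [level] [control] message extra={...}
LINE_PATTERN = re.compile(
    r"^\[(?P<logged_at>[^\]]+)\] \[(?P<level>[^\]]+)\] \[(?P<control>[^\]]+)\] "
    r"(?P<message>.*?) extra=(?P<extra>\{.*\})$"
)


def parse_control_line(line):
    """Return the parsed fields of one control log line, or None if it does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["level"] = record["level"].strip()  # the level is padded, e.g. "INFO "
    # literal_eval only accepts Python literals, so no code in the payload is executed.
    record["extra"] = ast.literal_eval(record["extra"])
    return record


sample = (
    "[2025-09-10 14:43:54] [INFO ] [corrupt-files] Total number of failures: 1 "
    "extra={'runId': '78ec1b04-c4af-4e0d-84ab-fb9e96a5d82d', 'jobId': 'geo-extractor', "
    "'stage': 'load', 'provider': 'noaa', 'dataset': 'ibtracs', "
    "'variable': 'wind-max-6hr', 'country': 'phl', "
    "'asAt': '2025-09-10T09:13:50+00:00', 'controlId': 'corrupt-files', "
    "'checkType': 'file-count', 'dimension': 'failure', 'value': 1}"
)
record = parse_control_line(sample)
print(record["control"], record["extra"]["dimension"], record["extra"]["value"])
```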
3. Control State and Metrics
Each control is represented using the ControlBinary class, which tracks:
- State (ControlState): INACTIVE or ACTIVE
- Total: Number of items processed
- Success / Failure / Warning: Counts for each dimension
- Threshold: Minimum value for success criteria
- Success Rate: Calculated as success / total when active
The ControlBinary class provides methods to record successes, failures, and warnings, and generates dimension-specific messages for logging purposes.
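The state and metrics described above can be pictured with a small stand-in class. This is a sketch under stated assumptions, not the actual ControlBinary implementation: the class name BinaryControlSketch, the method names, the rule that a control becomes ACTIVE on its first recorded item, the total computed as the sum of the three counts, and the reading of the threshold as a minimum success rate are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class ControlState(Enum):
    INACTIVE = "inactive"  # string values are assumptions
    ACTIVE = "active"


@dataclass
class BinaryControlSketch:
    """Hypothetical stand-in for ControlBinary."""
    threshold: float = 1.0  # assumed to be a minimum success rate
    success: int = 0
    failure: int = 0
    warning: int = 0
    state: ControlState = ControlState.INACTIVE

    @property
    def total(self) -> int:
        # Assumption: every processed item is counted in exactly one dimension.
        return self.success + self.failure + self.warning

    def record_success(self, count: int = 1) -> None:
        self.success += count
        self.state = ControlState.ACTIVE

    def record_failure(self, count: int = 1) -> None:
        self.failure += count
        self.state = ControlState.ACTIVE

    def record_warning(self, count: int = 1) -> None:
        self.warning += count
        self.state = ControlState.ACTIVE

    @property
    def success_rate(self):
        # Only defined once the control is active and has seen at least one item.
        if self.state is not ControlState.ACTIVE or self.total == 0:
            return None
        return self.success / self.total

    def passed(self) -> bool:
        rate = self.success_rate
        return rate is not None and rate >= self.threshold


# Mirrors the process-files entries in the example log: 2 successes, 1 failure.
control = BinaryControlSketch(threshold=0.5)
control.record_success(2)
control.record_failure()
print(control.total, control.success_rate, control.passed())
```

Returning None for an inactive control avoids a division by zero and keeps "no data yet" distinct from a success rate of 0.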
4. Control Descriptions
4.1 Corrupt Files
- Monitors the number of files successfully read vs. corrupted.
- Detects data integrity issues early during the Load stage.
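The counting behind this control can be sketched as follows. The helper count_readable and the JSON stand-in reader are assumptions for the demo; the pipeline's own readers and file formats are not described in this document.

```python
import json
import tempfile
from pathlib import Path


def count_readable(paths, reader):
    """Try to read each file and return (success, failure, total)."""
    success = failure = 0
    for path in paths:
        try:
            reader(path)
            success += 1
        except Exception:  # any read error marks the file as corrupt
            failure += 1
    return success, failure, success + failure


def read_json(path):
    # Stand-in reader for the demo; raises ValueError on malformed content.
    return json.loads(Path(path).read_text())


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.json"
    bad = Path(tmp) / "bad.json"
    good.write_text('{"wind": 42}')
    bad.write_text('{"wind": ')  # truncated on purpose
    counts = count_readable([good, bad], read_json)

print(counts)
```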
4.2 Processed Files
- Validates that files are correctly converted or processed.
- Captures anomalies or processing failures during the Process stage.
4.3 Required Inputs
- Ensures that all required input variables exist before processing.
- Prevents downstream errors due to missing or incomplete source data.
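A variable-exists check of this kind reduces to a set comparison. The sketch below is illustrative only: the function name check_required_inputs and the variable pressure-min-6hr are hypothetical, while wind-max-6hr is taken from the example log.

```python
def check_required_inputs(required, available):
    """Count required variables that are present or missing before processing."""
    available = set(available)
    missing = [name for name in required if name not in available]
    return {
        "success": len(required) - len(missing),
        "failure": len(missing),
        "total": len(required),
        "missing": missing,
    }


result = check_required_inputs(
    required=["wind-max-6hr", "pressure-min-6hr"],  # second name is hypothetical
    available=["wind-max-6hr"],
)
print(result)
```

Returning the names of the missing variables alongside the counts makes the failure actionable without a second lookup.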
4.4 Bounding Box Check
- Verifies spatial alignment of incoming data against existing datasets.
- Prevents latitude/longitude mismatches in geospatial datasets.
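One way to picture a bounding-box match is a tolerance-based comparison of the four corner coordinates. This is a minimal sketch, assuming boxes are stored as (min_lon, min_lat, max_lon, max_lat) tuples; the function name, the tuple order, the tolerance, and the coordinate values are assumptions for illustration.

```python
import math


def bbox_matches(new_bbox, reference_bbox, tolerance=1e-6):
    """Compare two (min_lon, min_lat, max_lon, max_lat) boxes within a tolerance."""
    if len(new_bbox) != 4 or len(reference_bbox) != 4:
        return False
    return all(
        math.isclose(a, b, rel_tol=0.0, abs_tol=tolerance)
        for a, b in zip(new_bbox, reference_bbox)
    )


reference = (116.0, 4.0, 127.0, 21.5)  # illustrative values only
swapped = (4.0, 116.0, 21.5, 127.0)    # latitude and longitude swapped

print(bbox_matches(reference, reference))
print(bbox_matches(swapped, reference))
```

The swapped box fails the check, which is the kind of latitude/longitude mismatch this control is meant to catch.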
4.5 Third-Party Outages
- Logs failures caused by external providers or APIs.
- Categorizes events as warnings or failures, supporting proactive monitoring.
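The sketch below shows one way provider errors could be counted instead of aborting a download run. It follows the Warning / Success / Total dimensions listed for this control in the overview table. The function name, the choice of ConnectionError and TimeoutError as outage signals, and the source names are assumptions for illustration.

```python
def download_with_outage_tracking(fetch, sources):
    """Fetch each source and count provider errors as warnings instead of raising."""
    counts = {"success": 0, "warning": 0, "total": 0}
    results = {}
    for source in sources:
        counts["total"] += 1
        try:
            results[source] = fetch(source)
            counts["success"] += 1
        except (ConnectionError, TimeoutError):
            # Assumption: these exception types signal a third-party outage.
            counts["warning"] += 1
    return results, counts


def fake_fetch(source):
    # Demo stand-in for a provider call; one source simulates an outage.
    if source == "unavailable-source":
        raise ConnectionError("provider unreachable")
    return b"payload"


results, counts = download_with_outage_tracking(
    fake_fetch, ["available-source", "unavailable-source"]
)
print(counts)
```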
5. Enums Reference
The following enums are used to standardize control states, stages, jobs, IDs, check types, and dimensions:
- ControlState: INACTIVE, ACTIVE
- ControlStage: DOWNLOAD, LOAD, PROCESS, MASK, UPLOAD, CLEANUP
- ControlJob: GEO_DOWNLOADER, GEO_EXTRACTOR, GEO_TRANSFORMER, GEO_STORE
- ControlId: CORRUPT_FILES, PROCESS_FILES, REQUIRED_INPUTS, BBOX_MATCH, THIRD_PARTY_OUTAGE
- ControlCheckType: ROW_COUNT, MISSING_VALUES, OUT_OF_RANGE, FILE_COUNT, VALID_CONVERSION, VARIABLE_EXISTS, BBOX_MATCH, THIRD_PARTY_ERROR
- ControlDimension: SUCCESS, WARNING, FAILURE, TOTAL
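As an illustration of how two of these enums might look in Python, the sketch below pairs each member with a kebab-case string value. Only the values that appear in the example log (corrupt-files, process-files, success, failure, total) are taken from this document; the remaining values and the use of a str mixin are assumptions.

```python
from enum import Enum


class ControlId(str, Enum):
    CORRUPT_FILES = "corrupt-files"            # value seen in the example log
    PROCESS_FILES = "process-files"            # value seen in the example log
    REQUIRED_INPUTS = "required-inputs"        # assumed value
    BBOX_MATCH = "bbox-match"                  # assumed value
    THIRD_PARTY_OUTAGE = "third-party-outage"  # assumed value


class ControlDimension(str, Enum):
    SUCCESS = "success"  # value seen in the example log
    WARNING = "warning"  # assumed value
    FAILURE = "failure"  # value seen in the example log
    TOTAL = "total"      # value seen in the example log


# Look a member up from the string stored in a log entry.
control_id = ControlId("corrupt-files")
print(control_id.name, ControlDimension.TOTAL.value)
```

Looking members up by value lets a log consumer reject unknown control IDs early, since an unrecognized string raises ValueError.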