# Testing & Validation
Ensuring data quality is critical for reliable analytics. This guide covers testing strategies, validation techniques, and monitoring approaches for the DataSuite ETL pipeline.
## DBT Testing Framework

### Built-in Tests
```yaml
# models/schema.yml
version: 2

models:
  - name: fact_sale
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - date_key
            - territory_key
            - product_key
            - salesorderid
    columns:
      - name: sale_key
        tests:
          - unique
          - not_null
      - name: total_due
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
```
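With the schema in place, run the tests from the dbt project directory; `--select` limits the run to a single model:

```bash
# Run every test attached to the fact_sale model
dbt test --select fact_sale

# Or run the project's full test suite
dbt test
```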
### Custom Tests

Create `tests/fact_sale_data_integrity.sql`:
```sql
-- Test that fact_sale row counts match the source data
SELECT
    'Row count mismatch' AS test_name,
    source_count,
    fact_count,
    ABS(source_count - fact_count) AS difference
FROM (
    SELECT COUNT(*) AS source_count
    FROM {{ source('bronze_layer', 'sales_orders') }}
) source
CROSS JOIN (
    SELECT COUNT(*) AS fact_count
    FROM {{ ref('fact_sale') }}
) fact
WHERE ABS(source_count - fact_count) > 100
```

A singular test passes when it returns zero rows, so this check fails only when the source and fact row counts diverge by more than 100 records.
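Singular tests are selected by their file name, so this check can be run on its own:

```bash
# Run only the custom data-integrity test
dbt test --select fact_sale_data_integrity
```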
## Data Quality Monitoring

### Quality Checks in ClickHouse
```sql
-- Data freshness check
-- dateDiff returns whole minutes, matching the lag_minutes alias
SELECT
    'bronze_layer' AS layer,
    MAX(_ingested_at) AS last_update,
    dateDiff('minute', MAX(_ingested_at), now()) AS lag_minutes
FROM bronze_layer.sales_orders
UNION ALL
SELECT
    'gold_layer' AS layer,
    MAX(created_at) AS last_update,
    dateDiff('minute', MAX(created_at), now()) AS lag_minutes
FROM gold_layer.fact_sale;
```
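The same check can be scripted against ClickHouse's HTTP interface. This is a minimal sketch reusing the credentials from the validation script below; the 15-minute threshold is an illustrative choice:

```bash
# Fetch bronze-layer lag in minutes over the HTTP interface
LAG_MINUTES=$(curl -s -u admin:clickhouse123 http://localhost:8123/ -d "
    SELECT dateDiff('minute', MAX(_ingested_at), now())
    FROM bronze_layer.sales_orders
    FORMAT TabSeparated")

# Flag stale data; adjust the threshold to your SLA
if [ "$LAG_MINUTES" -gt 15 ]; then
    echo "WARNING: bronze_layer is ${LAG_MINUTES} minutes stale"
fi
```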
## Pipeline Validation

### End-to-End Testing Script

Create `scripts/validate-pipeline.sh`:
```bash
#!/bin/bash
set -e

echo "🔍 DataSuite ETL Pipeline Validation"
echo "===================================="

# Test data ingestion by comparing row counts at both ends
echo "Testing data ingestion..."
MYSQL_COUNT=$(docker exec mysql mysql -uroot -ppassword adventureworks -e "SELECT COUNT(*) FROM Sales.SalesOrderHeader;" -s -N)
CH_COUNT=$(curl -s -u admin:clickhouse123 http://localhost:8123/ -d "SELECT COUNT(*) FROM bronze_layer.sales_orders FORMAT TabSeparated")

echo "MySQL records: $MYSQL_COUNT"
echo "ClickHouse records: $CH_COUNT"

if [ "$MYSQL_COUNT" != "$CH_COUNT" ]; then
    echo "⚠️  Row count mismatch between MySQL and ClickHouse"
fi

# Run DBT tests (set -e aborts the script if any test fails)
echo "Running DBT tests..."
cd /path/to/dbt/project
dbt test

echo "Validation complete!"
```
## Automated Quality Monitoring

### Data Quality Dashboard

Monitor key metrics (a collection sketch follows the list):

- Row counts at each layer
- Data freshness (time since last update)
- Test pass/fail rates
- Pipeline execution times
- Error rates and patterns
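As a minimal sketch of metric collection over the HTTP interface (table names are taken from the earlier examples; adapt them to your schemas):

```bash
# Collect per-layer row counts for the dashboard
curl -s -u admin:clickhouse123 http://localhost:8123/ -d "
    SELECT 'bronze_layer' AS layer, COUNT(*) AS row_count FROM bronze_layer.sales_orders
    UNION ALL
    SELECT 'gold_layer' AS layer, COUNT(*) AS row_count FROM gold_layer.fact_sale
    FORMAT TabSeparated"
```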
### Alerting Rules

Set up alerts for the following (a baseline-comparison sketch follows the list):

- Data pipeline failures
- Significant row count changes (>10%)
- Data freshness exceeding thresholds
- Test failures
- Performance degradation
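Alert delivery depends on your stack; as one hedged example, a cron-driven script can compare the current fact_sale row count against a stored baseline and warn on a >10% swing (the baseline file path is illustrative):

```bash
#!/bin/bash
# Alert on significant fact_sale row count changes (>10%)
BASELINE_FILE=/var/lib/datasuite/fact_sale.count   # illustrative path

CURRENT=$(curl -s -u admin:clickhouse123 http://localhost:8123/ \
    -d "SELECT COUNT(*) FROM gold_layer.fact_sale FORMAT TabSeparated")

if [ -f "$BASELINE_FILE" ]; then
    BASELINE=$(cat "$BASELINE_FILE")
    DIFF=$(( CURRENT > BASELINE ? CURRENT - BASELINE : BASELINE - CURRENT ))
    # Fire when the change exceeds 10% of the baseline
    if [ "$BASELINE" -gt 0 ] && [ $(( DIFF * 100 / BASELINE )) -gt 10 ]; then
        echo "ALERT: fact_sale row count changed by >10% ($BASELINE -> $CURRENT)"
    fi
fi

echo "$CURRENT" > "$BASELINE_FILE"
```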
## Next Steps

With testing and validation in place, your DataSuite ETL pipeline is production-ready. Consider:

- Setting up automated monitoring dashboards
- Implementing data lineage tracking
- Adding more sophisticated data quality rules
- Creating data catalog documentation
Your pipeline now has comprehensive quality assurance for reliable, accurate analytics data.