Testing & Validation

Ensuring data quality is critical for reliable analytics. This guide covers testing strategies, validation techniques, and monitoring approaches for the DataSuite ETL pipeline.

dbt Testing Framework

Built-in Tests

# models/schema.yml
version: 2

models:
  - name: fact_sale
    tests:
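      # dbt_utils tests come from the dbt_utils package (declare it in packages.yml and run dbt deps)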
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - date_key
            - territory_key
            - product_key
            - salesorderid
    columns:
      - name: sale_key
        tests:
          - unique
          - not_null
      - name: total_due
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
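
For intuition, the built-in unique test compiles to a query of roughly this shape; this is a simplified sketch, assuming fact_sale is materialized in gold_layer as in the freshness check below:

-- Simplified form of the query dbt generates for the unique test on sale_key;
-- the test fails if any rows are returned.
SELECT
    sale_key,
    COUNT(*) as occurrences
FROM gold_layer.fact_sale
GROUP BY sale_key
HAVING COUNT(*) > 1;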

Custom Tests

Singular tests live in the tests/ directory and fail when the query returns one or more rows. Create tests/fact_sale_data_integrity.sql:

-- Test that fact_sale row counts stay in sync with the source (100-row tolerance)
SELECT 
    'Row count mismatch' as test_name,
    source_count,
    fact_count,
    ABS(source_count - fact_count) as difference
FROM (
    SELECT COUNT(*) as source_count
    FROM {{ source('bronze_layer', 'sales_orders') }}
) source
CROSS JOIN (
    SELECT COUNT(*) as fact_count  
    FROM {{ ref('fact_sale') }}
) fact
WHERE ABS(source_count - fact_count) > 100
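
Singular tests also work well for referential-integrity checks. A minimal sketch, assuming a dim_product model keyed on product_key (the dimension name is hypothetical):

-- tests/fact_sale_orphaned_products.sql
-- Fails if any fact row references a product_key missing from the dimension
SELECT f.product_key
FROM {{ ref('fact_sale') }} f
LEFT JOIN {{ ref('dim_product') }} d
    ON f.product_key = d.product_key
WHERE d.product_key IS NULL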

Data Quality Monitoring

Quality Checks in ClickHouse

-- Data freshness check: minutes since the most recent row in each layer
SELECT 
    'bronze_layer' as layer,
    MAX(_ingested_at) as last_update,
    dateDiff('minute', MAX(_ingested_at), now()) as lag_minutes
FROM bronze_layer.sales_orders

UNION ALL

SELECT 
    'gold_layer' as layer, 
    MAX(created_at) as last_update,
    dateDiff('minute', MAX(created_at), now()) as lag_minutes
FROM gold_layer.fact_sale;
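
A complementary completeness check can catch null keys and out-of-range values; a sketch using the fact_sale columns from the schema above:

-- Completeness check: null surrogate keys and negative totals
SELECT
    COUNT(*) as total_rows,
    countIf(sale_key IS NULL) as null_sale_keys,
    countIf(total_due < 0) as negative_totals
FROM gold_layer.fact_sale;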

Pipeline Validation

End-to-End Testing Script

Create scripts/validate-pipeline.sh:

#!/bin/bash
set -euo pipefail

echo "🔍 DataSuite ETL Pipeline Validation"
echo "===================================="

# Compare source and bronze-layer row counts
echo "Testing data ingestion..."
MYSQL_COUNT=$(docker exec mysql mysql -uroot -ppassword adventureworks -e "SELECT COUNT(*) FROM Sales.SalesOrderHeader;" -s -N)
CH_COUNT=$(curl -s -u admin:clickhouse123 http://localhost:8123/ -d "SELECT COUNT(*) FROM bronze_layer.sales_orders FORMAT TabSeparated")

echo "MySQL records: $MYSQL_COUNT"
echo "ClickHouse records: $CH_COUNT"

if [ "$MYSQL_COUNT" != "$CH_COUNT" ]; then
    echo "⚠️  Row count mismatch between MySQL and ClickHouse"
fi

# Run dbt tests
echo "Running dbt tests..."
cd /path/to/dbt/project
dbt test

echo "Validation complete!"

Automated Quality Monitoring

Data Quality Dashboard

Monitor key metrics such as the following; a sample row-count query appears after the list:

  • Row counts at each layer

  • Data freshness (time since last update)

  • Test pass/fail rates

  • Pipeline execution times

  • Error rates and patterns
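
A minimal sketch of a row-count query that could feed such a dashboard, using the layer tables referenced earlier:

-- Row counts per layer, suitable for a scheduled dashboard query
SELECT 'bronze_layer.sales_orders' as table_name, COUNT(*) as row_count
FROM bronze_layer.sales_orders
UNION ALL
SELECT 'gold_layer.fact_sale' as table_name, COUNT(*) as row_count
FROM gold_layer.fact_sale;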

Alerting Rules

Set up alerts for the following conditions; a sample check for the row-count rule follows the list:

  • Data pipeline failures

  • Significant row count changes (>10%)

  • Data freshness exceeding thresholds

  • Test failures

  • Performance degradation
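
As a sketch of the row-count rule in ClickHouse, the day-over-day change can be computed from the _ingested_at timestamp; the alert itself would be fired by whatever scheduler runs the query:

-- Day-over-day row-count change for the bronze layer;
-- trigger an alert when pct_change exceeds 0.10 (10%).
WITH
    (SELECT COUNT(*) FROM bronze_layer.sales_orders
     WHERE toDate(_ingested_at) = today()) as today_count,
    (SELECT COUNT(*) FROM bronze_layer.sales_orders
     WHERE toDate(_ingested_at) = yesterday()) as yesterday_count
SELECT
    today_count,
    yesterday_count,
    -- division is floating-point in ClickHouse, so a zero yesterday_count
    -- yields inf/nan rather than an error
    abs(today_count - yesterday_count) / yesterday_count as pct_change;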

Next Steps

With testing and validation in place, your DataSuite ETL pipeline is production-ready. Consider:

  1. Setting up automated monitoring dashboards

  2. Implementing data lineage tracking

  3. Adding more sophisticated data quality rules

  4. Creating data catalog documentation

Your pipeline now has comprehensive quality assurance to ensure reliable, accurate analytics data.
