# Testing & Validation

Ensuring data quality is critical for reliable analytics. This guide covers testing strategies, validation techniques, and monitoring approaches for the DataSuite ETL pipeline.

## DBT Testing Framework

### Built-in Tests

```yaml
# models/schema.yml
version: 2

models:
  - name: fact_sale
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - date_key
            - territory_key
            - product_key
            - salesorderid
    columns:
      - name: sale_key
        tests:
          - unique
          - not_null
      - name: total_due
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
```

### Custom Tests

Create `tests/fact_sale_data_integrity.sql`:

```sql
-- Test that fact_sale totals match source data
SELECT 
    'Row count mismatch' as test_name,
    source_count,
    fact_count,
    ABS(source_count - fact_count) as difference
FROM (
    SELECT COUNT(*) as source_count
    FROM {{ source('bronze_layer', 'sales_orders') }}
) source
CROSS JOIN (
    SELECT COUNT(*) as fact_count  
    FROM {{ ref('fact_sale') }}
) fact
WHERE ABS(source_count - fact_count) > 100
```

## Data Quality Monitoring

### Quality Checks in ClickHouse

```sql
-- Data freshness check
SELECT 
    'bronze_layer' as layer,
    MAX(_ingested_at) as last_update,
    CURRENT_TIMESTAMP() - MAX(_ingested_at) as lag_minutes
FROM bronze_layer.sales_orders

UNION ALL

SELECT 
    'gold_layer' as layer, 
    MAX(created_at) as last_update,
    CURRENT_TIMESTAMP() - MAX(created_at) as lag_minutes
FROM gold_layer.fact_sale;
```

## Pipeline Validation

### End-to-End Testing Script

Create `scripts/validate-pipeline.sh`:

```bash
#!/bin/bash

echo "🔍 DataSuite ETL Pipeline Validation"
echo "===================================="

# Test data ingestion
echo "Testing data ingestion..."
MYSQL_COUNT=$(docker exec mysql mysql -uroot -ppassword adventureworks -e "SELECT COUNT(*) FROM Sales.SalesOrderHeader;" -s -N)
CH_COUNT=$(curl -s -u admin:clickhouse123 http://localhost:8123/ -d "SELECT COUNT(*) FROM bronze_layer.sales_orders FORMAT TabSeparated")

echo "MySQL records: $MYSQL_COUNT"
echo "ClickHouse records: $CH_COUNT"

# Run DBT tests
echo "Running DBT tests..."
cd /path/to/dbt/project
dbt test

echo "Validation complete!"
```

## Automated Quality Monitoring

### Data Quality Dashboard

Monitor key metrics:

* Row counts at each layer
* Data freshness (time since last update)
* Test pass/fail rates
* Pipeline execution times
* Error rates and patterns

### Alerting Rules

Set up alerts for:

* Data pipeline failures
* Significant row count changes (>10%)
* Data freshness exceeding thresholds
* Test failures
* Performance degradation

## Next Steps

With testing and validation in place, your DataSuite ETL pipeline is production-ready. Consider:

1. Setting up automated monitoring dashboards
2. Implementing data lineage tracking
3. Adding more sophisticated data quality rules
4. Creating data catalog documentation

Your pipeline now has comprehensive quality assurance to ensure reliable, accurate analytics data.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.datasuite.vn/data-platform/index/development/testing-validation.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
