Data Flow
This document explains how data moves through the DataSuite ETL pipeline, from source systems to analytics-ready dimensional models.
End-to-End Data Flow
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│       MySQL      │    │     LogStash     │    │    ClickHouse    │    │        DBT       │
│   (Adventure     │────│ (Data Ingestion) │────│ (Data Warehouse  │────│ (Transformations │
│   Works Source)  │    │                  │    │   Bronze Layer)  │    │    Gold Layer)   │
└──────────────────┘    └──────────────────┘    └──────────────────┘    └──────────────────┘
         │                       │                       │                       │
         │ Transactional         │ Real-time             │ Batch                 │ Scheduled
         │ Updates               │ Streaming             │ Processing            │ Execution
         └───────────────────────┼───────────────────────┼───────────────────────┘
                                 │                       │
                       ┌──────────────────┐    ┌──────────────────┐
                       │   Data Quality   │    │     Analytics    │
                       │    Monitoring    │    │   & Reporting    │
                       └──────────────────┘    └──────────────────┘

Phase 1: Source Data Generation
Operational Activity
What Happens: Business transactions occur in the AdventureWorks MySQL database
Customers place orders through web and mobile applications
Inventory systems update product availability
Sales teams create customer records
Financial systems process payments
Data Characteristics:
Volume: 1,000-10,000 transactions per day
Velocity: Real-time as business events occur
Variety: Structured relational data across multiple schemas
Update Pattern: INSERT and UPDATE operations with timestamp tracking
Key Tables Updated: e.g. SalesOrderHeader, SalesOrderDetail, Customer, and Product
Phase 2: Data Ingestion (LogStash)
Incremental Data Capture
What Happens: LogStash continuously monitors MySQL for changes
Process Flow:
Connection Establishment: JDBC connection to MySQL every 30 seconds
Change Detection: Query for records where ModifiedDate > last_processed_timestamp
Data Extraction: Pull changed records using optimized SQL queries
Format Transformation: Convert MySQL result set to JSON format
Data Transmission: HTTP POST to ClickHouse bronze layer
Sample LogStash Query:
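The original sample did not survive extraction; the sketch below is a plausible reconstruction, assuming the LogStash JDBC input's standard :sql_last_value tracking placeholder and a SalesOrderHeader source table:

```sql
-- Incremental capture: pull only rows modified since the last successful run.
-- :sql_last_value is the Logstash JDBC input's tracking placeholder (assumed setup).
SELECT *
FROM SalesOrderHeader
WHERE ModifiedDate > :sql_last_value
ORDER BY ModifiedDate ASC
LIMIT 1000;  -- matches the 1,000-record batch size
```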
Performance Characteristics:
Polling Frequency: Every 30 seconds
Batch Size: 1,000 records per query
Throughput: ~1,000 records/second (peak processing rate; sustained rate is bounded by the polling interval)
Latency: Average 45 seconds from source to bronze layer
Data Quality at Ingestion
Validation Rules:
Primary key uniqueness checks
Data type validation and conversion
NULL value handling for required fields
Timestamp consistency verification
Phase 3: Bronze Layer Storage (ClickHouse)
Raw Data Persistence
What Happens: LogStash delivers JSON payloads to ClickHouse HTTP interface
Storage Strategy:
Table Structure: Mirror source schema with minimal modifications
Partitioning: Date-based partitioning for query performance
Ordering: Clustered by date and primary key for optimal compression
Retention: Configurable retention policies (default: 2 years)
Example Bronze Table:
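The original DDL is not reproduced here; the following is a sketch of what such a bronze table might look like in ClickHouse, with table and column names assumed rather than taken from the project:

```sql
-- Hypothetical bronze table mirroring the source SalesOrderHeader schema.
CREATE TABLE bronze.sales_order_header
(
    SalesOrderID  UInt32,
    CustomerID    UInt32,
    OrderDate     Date,
    TotalDue      Decimal(19, 4),
    ModifiedDate  DateTime,
    _ingested_at  DateTime DEFAULT now()   -- ingestion timestamp for auditability
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(OrderDate)           -- date-based partitioning
ORDER BY (OrderDate, SalesOrderID)         -- date + primary key ordering
TTL OrderDate + INTERVAL 2 YEAR;           -- default 2-year retention
```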
Data Characteristics in Bronze:
Fidelity: Exact copy of source data with no business logic applied
Completeness: All source columns preserved, including system fields
Auditability: Complete history of all changes with ingestion timestamps
Reprocessability: Ability to rebuild gold layer from bronze at any time
Phase 4: Data Transformation (DBT)
Staging Layer Processing
What Happens: DBT creates cleaned, standardized views of bronze data
Staging Transformations:
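A staging model could be sketched as follows; the model, source, and column names are illustrative, not taken from the project:

```sql
-- Hypothetical DBT staging model (e.g. models/staging/stg_sales_orders.sql).
SELECT
    SalesOrderID            AS order_id,
    CustomerID              AS customer_id,
    toDate(OrderDate)       AS order_date,   -- data type standardization
    coalesce(TotalDue, 0)   AS total_due,    -- null handling with default value
    ModifiedDate            AS modified_at
FROM {{ source('bronze', 'sales_order_header') }}
WHERE SalesOrderID IS NOT NULL               -- drop rows failing basic checks
```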
Quality Checks:
Business rule validation
Data type standardization
Null handling and default values
Referential integrity verification
Dimensional Model Creation
What Happens: DBT builds star schema optimized for analytics
Fact Table Creation:
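An incremental fact model might look like this; names and columns are assumptions for illustration:

```sql
-- Hypothetical incremental fact model (e.g. models/marts/fct_sales.sql).
{{ config(materialized='incremental', unique_key='order_id') }}

SELECT
    order_id,
    customer_id,
    order_date,
    total_due,
    modified_at
FROM {{ ref('stg_sales_orders') }}
{% if is_incremental() %}
-- Only process rows newer than what the fact table already holds.
WHERE modified_at > (SELECT max(modified_at) FROM {{ this }})
{% endif %}
```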
Dimension Table Creation:
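An SCD Type 1 dimension, where the latest attribute values simply overwrite history, could be sketched as (model and column names assumed):

```sql
-- Hypothetical SCD Type 1 customer dimension (e.g. models/marts/dim_customer.sql).
{{ config(materialized='table') }}   -- full refresh: latest values overwrite history

SELECT
    customer_id,
    argMax(customer_name, modified_at) AS customer_name,  -- most recent value wins
    max(modified_at)                   AS last_updated
FROM {{ ref('stg_customers') }}
GROUP BY customer_id
```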
Phase 5: Gold Layer Materialization
Star Schema Implementation
What Happens: DBT executes SQL to create final dimensional models in ClickHouse
Materialization Strategy:
Fact Tables: Incremental materialization for performance
Dimension Tables: Full refresh for SCD Type 1, incremental for SCD Type 2
Partitioning: Date-based partitioning aligned with query patterns
Indexing: Optimized for star schema join patterns
Final Schema Structure:
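As an illustration of the star-schema join pattern the gold layer is optimized for, a typical analytical query might look like this (gold table and column names are assumed):

```sql
-- Illustrative star-schema query over the assumed fact and dimension tables.
SELECT
    d.calendar_month,
    c.customer_segment,
    sum(f.total_due) AS revenue
FROM gold.fct_sales    AS f
JOIN gold.dim_date     AS d ON f.order_date  = d.date_key
JOIN gold.dim_customer AS c ON f.customer_id = c.customer_id
GROUP BY d.calendar_month, c.customer_segment
ORDER BY d.calendar_month;
```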
Data Quality & Monitoring
Quality Gates Throughout Pipeline
Source Quality (MySQL):
Primary key constraints prevent duplicates
Foreign key relationships ensure referential integrity
NOT NULL constraints on required fields
Check constraints validate business rules
Ingestion Quality (LogStash):
Row count reconciliation between source and bronze
Data type validation during JSON conversion
Error logging and dead letter queuing for failed records
Throughput monitoring and alerting
Transformation Quality (DBT):
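DBT quality checks are commonly expressed as tests that must return zero rows to pass; a hypothetical singular test (file name and model names assumed) verifying referential integrity between fact and dimension:

```sql
-- Hypothetical DBT singular test (e.g. tests/assert_fct_sales_has_customer.sql).
-- The test fails if any fact row lacks a matching dimension row.
SELECT f.order_id
FROM {{ ref('fct_sales') }}         AS f
LEFT JOIN {{ ref('dim_customer') }} AS c USING (customer_id)
WHERE c.customer_id IS NULL
```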
End-to-End Monitoring
Freshness Monitoring:
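A freshness check can be run directly against the bronze layer; the table and ingestion-timestamp column below are assumptions:

```sql
-- Hypothetical freshness check against the bronze layer (ClickHouse).
SELECT
    max(_ingested_at)                            AS last_ingest,
    dateDiff('minute', max(_ingested_at), now()) AS minutes_stale
FROM bronze.sales_order_header;
-- Alerting when minutes_stale exceeds the expected end-to-end latency
-- flags a stalled pipeline.
```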
Volume Monitoring:
Row count tracking at each layer
Growth rate analysis and anomaly detection
Data distribution analysis for skew detection
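Row-count reconciliation between layers can be sketched as (table names assumed):

```sql
-- Hypothetical row-count reconciliation across layers.
SELECT
    (SELECT count() FROM bronze.sales_order_header) AS bronze_rows,
    (SELECT count() FROM gold.fct_sales)            AS gold_rows;
-- A widening gap between the two counts signals dropped or stuck records.
```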
Performance Characteristics
End-to-End Latency
Source to Bronze: 30-60 seconds (LogStash polling interval)
Bronze to Gold: 5-15 minutes (DBT execution time)
Total Latency: ~6-16 minutes from source change to analytics-ready gold layer (the sum of the two stages above, depending on where a change lands in the polling and scheduling windows)
Throughput Capacity
LogStash Ingestion: 1,000 records/second per pipeline
ClickHouse Storage: 100,000+ inserts/second
DBT Processing: 1M+ records/minute for transformations
Query Performance: Sub-second response for most analytical queries
Scalability Patterns
Horizontal Scaling: Multiple LogStash instances for increased throughput
Vertical Scaling: Larger ClickHouse nodes for complex queries
Partitioning: Date-based partitioning for query pruning
Parallel Processing: DBT model parallelization for faster builds
Next Steps
Now that you understand the data flow, you can:
Set up your environment to see this flow in action
Configure LogStash for data ingestion
Build DBT models for transformations
The data flow provides the foundation for understanding how to troubleshoot issues and optimize performance in your own implementation.