System Overview
The DataSuite ETL system is a modern data engineering platform that extracts data from operational systems, transforms it using dimensional modeling techniques, and loads it into a high-performance analytics warehouse.
Architecture Overview
@startuml
database DataSource
component LogStash
component Airflow {
  component EtlJob
}
database ClickHouse {
  component BronzeLayer
  component GoldLayer
}
DataSource <- LogStash: query data from the source system
LogStash -> BronzeLayer: push raw data to ClickHouse
EtlJob --> BronzeLayer: read raw data from ClickHouse
EtlJob --> GoldLayer: persist transformed data after ETL
@enduml
Data Flow Pattern
The system follows a medallion architecture pattern with three distinct layers:
Source Layer → Raw operational data from MySQL (AdventureWorks database)
Bronze Layer → Raw data ingested via LogStash into ClickHouse
Gold Layer → Transformed dimensional models created by DBT
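To make the bronze-to-gold step concrete, a gold-layer DBT model typically reads raw bronze rows and reshapes them into a dimension. A minimal sketch follows; the source alias, table, and column names are assumptions for illustration, not names taken from this project:

-- models/gold/dim_customer.sql (hypothetical DBT model)
-- Reshape raw bronze customer rows into a gold-layer dimension.
SELECT
    CustomerID  AS customer_key,   -- hypothetical source columns
    FirstName   AS first_name,
    LastName    AS last_name,
    TerritoryID AS territory_key
FROM {{ source('bronze', 'raw_customers') }}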
Key Design Principles
🔄 Event-Driven Processing
LogStash monitors source systems for changes using timestamp-based incremental loading
Changes are propagated to the bronze layer on each polling cycle, giving near real-time freshness
DBT transformations run on demand or on a schedule
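In practice, timestamp-based incremental loading boils down to a polling query. A sketch of the statement LogStash's JDBC input could run (the table and timestamp column are assumptions; :sql_last_value is the plugin's built-in bookmark for the last value it processed):

-- Hypothetical polling statement for the LogStash JDBC input.
-- Only rows changed since the previous run are fetched.
SELECT *
FROM sales_order_header
WHERE modified_date > :sql_last_value
ORDER BY modified_date ASC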
📊 Dimensional Modeling
Star schema design optimized for analytical queries
Fact tables store measurable events (sales transactions)
Dimension tables provide context (customers, products, dates, territories)
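As an illustrative sketch of such a star schema in ClickHouse DDL (all table and column names are assumptions made for the example):

-- Hypothetical fact and dimension tables for the gold layer.
CREATE TABLE gold.fact_sales
(
    order_date   Date,
    customer_key UInt64,
    product_key  UInt64,
    quantity     UInt32,
    sales_amount Decimal(18, 4)
)
ENGINE = MergeTree
ORDER BY (order_date, product_key);

CREATE TABLE gold.dim_customer
(
    customer_key  UInt64,
    customer_name String,
    territory     String
)
ENGINE = MergeTree
ORDER BY customer_key;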
🏗️ Layered Architecture
Bronze Layer: Raw, unprocessed data for auditing and reprocessing
Gold Layer: Business-ready dimensional models for analytics
Serving Layer: Optimized views and aggregations for specific use cases
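Continuing the hypothetical schema above, a serving-layer object could be a pre-joined view or aggregation, for example:

-- Hypothetical serving-layer view: monthly sales by territory.
CREATE VIEW serving.monthly_sales_by_territory AS
SELECT
    toStartOfMonth(f.order_date) AS month,
    c.territory,
    sum(f.sales_amount)          AS total_sales
FROM gold.fact_sales AS f
JOIN gold.dim_customer AS c ON f.customer_key = c.customer_key
GROUP BY toStartOfMonth(f.order_date), c.territory;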
⚡ High Performance
ClickHouse columnar storage for fast analytical queries
Incremental processing to minimize compute resources
Parallel processing capabilities across all components
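To see why columnar storage helps, consider a typical analytical query against the hypothetical tables above: it names only three columns, so ClickHouse reads a small fraction of the table from disk regardless of how wide the fact table is:

-- Reads only order_date, product_key, and sales_amount; all other
-- columns are never touched, which is where columnar storage pays off.
SELECT
    product_key,
    sum(sales_amount) AS total_sales
FROM gold.fact_sales
WHERE order_date >= '2024-01-01'
GROUP BY product_key
ORDER BY total_sales DESC
LIMIT 10;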
Technology Stack
Data Source: MySQL 8.0 (operational database with AdventureWorks sample data)
Data Ingestion: LogStash 8.11 (near real-time data extraction and loading)
Data Warehouse: ClickHouse (high-performance columnar analytics database)
Data Transformation: DBT Core (SQL-based transformation and modeling)
Orchestration: Apache Airflow (workflow scheduling and monitoring)
Containerization: Docker/Docker Compose (development environment and deployment)
System Benefits
For Data Engineers
Rapid Development: Pre-configured stack reduces setup time
Best Practices: Implements proven patterns and architectures
Scalability: Components can scale independently based on load
Maintainability: Clear separation of concerns and standardized tooling
For Data Analysts
Fast Queries: ClickHouse delivers sub-second response times
Familiar Interface: Standard SQL queries work across all layers
Data Quality: Built-in testing ensures reliable, accurate data (see the test sketch after this list)
Self-Service: Well-documented dimensional models enable independent analysis
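In DBT, a data-quality check can be as simple as a singular test: a SELECT that returns the failing rows, so the test passes only when the result is empty. A hypothetical example (the model name and rule are assumptions):

-- tests/assert_no_negative_sales.sql (hypothetical DBT singular test)
-- The test fails if this query returns any rows.
SELECT *
FROM {{ ref('fact_sales') }}
WHERE sales_amount < 0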
For Business Users
Real-Time Insights: Near real-time data availability for decision making
Consistent Metrics: Single source of truth for business definitions
Historical Analysis: Complete data lineage and change tracking
Reliable Reporting: Automated quality checks prevent data issues
Next Steps
Now that you understand the high-level architecture, explore the detailed component information:
Component Details - Deep dive into each system component
Data Flow - Detailed data movement and transformation process
Once you're comfortable with the architecture, proceed to:
Environment Setup - Get your development environment running