System Overview

The DataSuite ETL system is a modern data engineering platform that extracts data from operational systems, transforms it using dimensional modeling techniques, and loads it into a high-performance analytics warehouse.

Architecture Overview

@startuml
database DataSource
component LogStash
component Airflow {
    component EtlJob
}
database Clickhouse {
    component BronzeLayer
    component GoldLayer
}
DataSource <- LogStash: read and query data from data source
LogStash -> BronzeLayer: push data to ClickHouse
EtlJob --> BronzeLayer: read raw data from ClickHouse
EtlJob --> GoldLayer: persist data after ETL to ClickHouse
@enduml

Data Flow Pattern

The system follows a medallion architecture pattern with three distinct layers:

  1. Source Layer → Raw operational data from MySQL (AdventureWorks database)

  2. Bronze Layer → Raw data ingested via LogStash into ClickHouse

  3. Gold Layer → Transformed dimensional models created by DBT
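
As a rough sketch of how a record moves from bronze to gold, the snippet below cleans a raw ingested row into a typed fact record. The field names are illustrative, not the actual AdventureWorks schema.

```python
# Illustrative medallion flow: a raw bronze record (all strings, as
# ingested) is normalized into a gold-layer fact record.
# Field names are hypothetical, not the real AdventureWorks schema.

def bronze_to_gold(raw: dict) -> dict:
    """Normalize one raw sales row into a gold-layer fact record."""
    return {
        "order_id": int(raw["OrderID"]),
        "customer_key": int(raw["CustomerID"]),
        "order_date": raw["OrderDate"][:10],        # keep the date part only
        "amount": round(float(raw["TotalDue"]), 2),  # cast and round currency
    }

raw_row = {"OrderID": "43659", "CustomerID": "29825",
           "OrderDate": "2011-05-31T00:00:00", "TotalDue": "23153.2339"}
gold_row = bronze_to_gold(raw_row)
```

In the real pipeline this reshaping is expressed as DBT SQL models rather than Python, but the cleaning steps (casting, truncating, rounding) are the same in spirit.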

Key Design Principles

🔄 Event-Driven Processing

  • LogStash monitors source systems for changes using timestamp-based incremental loading

  • Changes are immediately propagated to the bronze layer

  • DBT transformations can run on demand or scheduled basis
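
The timestamp-based incremental strategy can be sketched as follows. This is a simplified model of what LogStash's tracking does, with hypothetical row shapes; it is not the actual LogStash implementation.

```python
from datetime import datetime

def incremental_load(rows, watermark):
    """Return rows changed since `watermark` and the advanced watermark.

    A sketch of timestamp-based incremental loading: only rows with
    updated_at strictly after the last-seen timestamp are re-extracted.
    """
    fresh = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_watermark

source_rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 3)},
]
fresh, watermark = incremental_load(source_rows, datetime(2024, 1, 2))
```

Each run only touches rows newer than the stored watermark, which is what keeps change propagation cheap enough to run continuously.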

📊 Dimensional Modeling

  • Star schema design optimized for analytical queries

  • Fact tables store measurable events (sales transactions)

  • Dimension tables provide context (customers, products, dates, territories)
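
To make the star-schema relationship concrete, here is a toy in-memory model: a fact row carries surrogate keys that resolve against dimension tables at query time. Table names and keys are invented for illustration.

```python
# Hypothetical star schema: one fact table referencing three dimensions.
dim_customer = {101: {"name": "Ana Ruiz", "territory_key": 7}}
dim_product = {55: {"name": "Road Bike", "category": "Bikes"}}
dim_territory = {7: {"territory": "Northwest"}}

fact_sales = [
    {"customer_key": 101, "product_key": 55,
     "date_key": 20240115, "amount": 1250.0},
]

def denormalize(fact: dict) -> dict:
    """Resolve surrogate keys into a business-readable report row."""
    customer = dim_customer[fact["customer_key"]]
    return {
        "customer": customer["name"],
        "territory": dim_territory[customer["territory_key"]]["territory"],
        "product": dim_product[fact["product_key"]]["name"],
        "amount": fact["amount"],
    }

report = [denormalize(f) for f in fact_sales]
```

In the warehouse the same resolution is a SQL join of the fact table against each dimension; the star layout keeps those joins shallow and predictable.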

🏗️ Layered Architecture

  • Bronze Layer: Raw, unprocessed data for auditing and reprocessing

  • Gold Layer: Business-ready dimensional models for analytics

  • Serving Layer: Optimized views and aggregations for specific use cases
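
A serving-layer view is typically just a pre-computed aggregation over gold-layer facts. The sketch below shows the idea with a daily revenue rollup; in ClickHouse this would be a materialized view rather than Python.

```python
from collections import defaultdict

# Gold-layer fact rows (illustrative shape).
gold_fact = [
    {"order_date": "2024-01-15", "amount": 100.0},
    {"order_date": "2024-01-15", "amount": 50.0},
    {"order_date": "2024-01-16", "amount": 75.0},
]

# Serving-layer rollup: total revenue per day.
daily_revenue = defaultdict(float)
for row in gold_fact:
    daily_revenue[row["order_date"]] += row["amount"]
```

Dashboards then read the small aggregated view instead of scanning the full fact table on every query.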

⚡ High Performance

  • ClickHouse columnar storage for fast analytical queries

  • Incremental processing to minimize compute resources

  • Parallel processing capabilities across all components
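
Why columnar storage speeds up analytics can be shown with a toy model: when each column is stored as its own array, an aggregate scans only the column it needs instead of deserializing whole rows. This is a conceptual sketch, not how ClickHouse is implemented internally.

```python
# Row-oriented layout: an aggregate must visit every row object.
row_store = [
    {"order_id": 1, "region": "NW", "amount": 100.0},
    {"order_id": 2, "region": "SE", "amount": 50.0},
]

# Column-oriented layout of the same data: each column is contiguous.
col_store = {
    "order_id": [1, 2],
    "region": ["NW", "SE"],
    "amount": [100.0, 50.0],
}

# SUM(amount) reads one array and ignores the other columns entirely.
total = sum(col_store["amount"])
```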

Technology Stack

| Component | Technology | Purpose |
| --- | --- | --- |
| Data Source | MySQL 8.0 | Operational database with AdventureWorks sample data |
| Data Ingestion | LogStash 8.11 | Real-time data extraction and loading |
| Data Warehouse | ClickHouse | High-performance columnar analytics database |
| Data Transformation | DBT Core | SQL-based transformation and modeling |
| Orchestration | Apache Airflow | Workflow scheduling and monitoring |
| Containerization | Docker / Docker Compose | Development environment and deployment |

System Benefits

For Data Engineers

  • Rapid Development: Pre-configured stack reduces setup time

  • Best Practices: Implements proven patterns and architectures

  • Scalability: Components can scale independently based on load

  • Maintainability: Clear separation of concerns and standardized tooling

For Data Analysts

  • Fast Queries: ClickHouse delivers sub-second response times

  • Familiar Interface: Standard SQL queries work across all layers

  • Data Quality: Built-in testing ensures reliable, accurate data

  • Self-Service: Well-documented dimensional models enable independent analysis

For Business Users

  • Real-Time Insights: Near real-time data availability for decision making

  • Consistent Metrics: Single source of truth for business definitions

  • Historical Analysis: Complete data lineage and change tracking

  • Reliable Reporting: Automated quality checks prevent data issues

Next Steps

Now that you understand the high-level architecture, explore the detailed documentation for each component. Once you're comfortable with the architecture, proceed to the setup and configuration guides.
