Component Details

This section provides detailed information about each component in the DataSuite ETL system, including configuration, capabilities, and integration points.

System Components

  • Data Source: External data sources that provide raw data for processing

  • LogStash: Data collection engine that reads and queries data from various sources

  • Airflow: Workflow orchestration platform containing ETL jobs

  • EtlJob: Data transformation jobs that process raw data

  • Clickhouse: Columnar database for analytical workloads

  • BronzeLayer: Raw data storage layer in ClickHouse

  • GoldLayer: Processed and refined data storage layer in ClickHouse

Data Source Layer

MySQL Database (AdventureWorks)

Purpose: Operational database containing business transaction data

Key Characteristics:

  • Database: MySQL 8.0 with AdventureWorks 2019 sample dataset

  • Schema Structure: Normalized relational design with multiple schemas:

    • Sales schema: Customer orders, products, sales data

    • Production schema: Product catalog, inventory

    • Person schema: Customer and employee information

  • Data Volume: ~500MB with realistic business data

  • Update Frequency: Continuous transactional updates

Sample Data Structure:
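
The original sample DDL is not reproduced here. As a stand-in, a minimal sketch of one representative Sales table, assuming the common MySQL port of AdventureWorks in which schema names become table-name prefixes (column set abridged):

```sql
-- Illustrative sketch of an AdventureWorks-style sales header table;
-- the exact DDL depends on the MySQL port in use.
CREATE TABLE salesorderheader (
    SalesOrderID INT           NOT NULL AUTO_INCREMENT,
    OrderDate    DATETIME      NOT NULL,
    CustomerID   INT           NOT NULL,
    TerritoryID  INT           NULL,
    SubTotal     DECIMAL(19,4) NOT NULL DEFAULT 0.0000,
    TaxAmt       DECIMAL(19,4) NOT NULL DEFAULT 0.0000,
    TotalDue     DECIMAL(19,4) NOT NULL DEFAULT 0.0000,
    ModifiedDate DATETIME      NOT NULL,  -- drives incremental extraction
    PRIMARY KEY (SalesOrderID)
) ENGINE=InnoDB;
```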

Connection Details:

  • Host: mysql (container) / localhost:3306 (local)

  • Authentication: Username/password with SSL support

  • Access Pattern: Read-only queries via JDBC connector
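
The read-only access pattern can be enforced at the database level. A minimal sketch, assuming a dedicated account for the connector (the etl_reader account and database name are illustrative):

```sql
-- Hypothetical read-only account for the LogStash JDBC connector.
CREATE USER 'etl_reader'@'%' IDENTIFIED BY '<password>' REQUIRE SSL;
GRANT SELECT ON adventureworks.* TO 'etl_reader'@'%';
FLUSH PRIVILEGES;
```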

Data Ingestion Layer

LogStash

Purpose: Real-time data extraction and initial transformation

Key Capabilities:

  • JDBC Input Plugin: Connects to MySQL using JDBC drivers

  • Incremental Loading: Timestamp-based change data capture

  • Data Filtering: Basic transformations and data cleansing

  • Output Formatting: JSON formatting for ClickHouse compatibility

Configuration Highlights:
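
The full pipeline definition is not reproduced here, but the heart of the incremental load is the SQL statement the JDBC input plugin runs on each poll. A minimal sketch, assuming LogStash's standard :sql_last_value high-water-mark parameter and the illustrative salesorderheader table from above:

```sql
-- Statement run by the JDBC input on each 30-second poll. LogStash
-- persists :sql_last_value (the largest ModifiedDate seen so far),
-- which yields timestamp-based change data capture.
SELECT *
FROM salesorderheader
WHERE ModifiedDate > :sql_last_value
ORDER BY ModifiedDate ASC
```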

Performance Characteristics:

  • Throughput: ~1,000 records/second

  • Latency: Near real-time (30-second polling)

  • Memory Usage: ~512MB per pipeline

  • Scaling: Horizontal scaling via multiple LogStash instances

Data Warehouse Layer

ClickHouse

Purpose: High-performance columnar database for analytical workloads

Architecture:

  • Storage Engine: MergeTree family optimized for analytics

  • Compression: Built-in LZ4/ZSTD compression (80%+ reduction)

  • Indexing: Sparse indexes and bloom filters for fast queries

  • Replication: Multi-master replication for high availability

Database Structure:
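
The full DDL is environment-specific; a minimal sketch of the database layout, assuming the bronze and gold database names used for the layers described later in this section:

```sql
-- Hypothetical two-database layout mirroring the Bronze/Gold design.
CREATE DATABASE IF NOT EXISTS bronze;  -- raw data as received from LogStash
CREATE DATABASE IF NOT EXISTS gold;    -- dimensional models built by DBT
```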

Performance Specifications:

  • Query Speed: Sub-second response for most analytical queries

  • Compression Ratio: 8:1 average compression vs raw data

  • Concurrent Users: 100+ simultaneous connections

  • Data Volume: Scales to petabyte level

Data Transformation Layer

DBT (Data Build Tool)

Purpose: SQL-based transformation and dimensional modeling

Project Structure:
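
The original project tree is not reproduced here; a typical DBT layout that matches the model naming conventions listed below (directory names are illustrative):

```
dbt_project/
├── dbt_project.yml        # project and model configuration
├── models/
│   ├── staging/           # stg_* models
│   ├── intermediate/      # int_* models
│   └── marts/             # dim_* and fact_* models
├── tests/                 # custom data quality tests
└── macros/                # reusable SQL snippets
```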

Key Features:

  • Incremental Models: Efficient processing of only changed data

  • Testing Framework: Built-in data quality validation

  • Documentation: Auto-generated lineage and column descriptions

  • Version Control: Git-based development workflow

Model Types:

  1. Staging Models (stg_*): Light transformations on raw data (see the sketch after this list)

  2. Intermediate Models (int_*): Business logic and calculations

  3. Mart Models (dim_*, fact_*): Final dimensional models
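
To make the incremental pattern concrete, here is a minimal staging-model sketch; the model and source names are assumptions, and the bronze source would need to be declared in a DBT sources file:

```sql
-- models/staging/stg_sales_orders.sql (hypothetical model)
-- On incremental runs, only rows newer than the target table's
-- high-water mark are processed; the first run builds the full table.
{{ config(materialized='incremental', unique_key='SalesOrderID') }}

select
    SalesOrderID,
    OrderDate,
    CustomerID,
    TotalDue,
    _ingested_at
from {{ source('bronze', 'sales_orders_raw') }}

{% if is_incremental() %}
where _ingested_at > (select max(_ingested_at) from {{ this }})
{% endif %}
```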

Bronze Layer (Raw Data)

Purpose and Design

Objective: Store raw, unprocessed data exactly as received from source systems

Key Characteristics:

  • Data Fidelity: Exact copy of source data with minimal transformation

  • Audit Trail: Complete history of all ingested data

  • Reprocessing Capability: Ability to replay transformations from raw data

  • Schema Evolution: Flexible schema to accommodate source changes

Table Structure:
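
The production DDL is not shown here; a sketch of one bronze table with the audit columns implied above (table and column names are illustrative):

```sql
-- Hypothetical bronze table: source columns stored as received, plus
-- ingestion metadata supporting the audit trail and replay.
CREATE TABLE IF NOT EXISTS bronze.sales_orders_raw
(
    SalesOrderID UInt32,
    OrderDate    DateTime,
    CustomerID   UInt32,
    TotalDue     Decimal(19, 4),
    _ingested_at DateTime DEFAULT now(),  -- when LogStash delivered the row
    _source      LowCardinality(String)   -- originating system
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(_ingested_at)
ORDER BY (SalesOrderID, _ingested_at);
```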

Gold Layer (Dimensional Models)

Star Schema Design

Objective: Business-ready dimensional models optimized for analytics

Fact Tables:

  • fact_sale: Central sales transaction fact with measures and foreign keys

  • Grain: One row per sales order line item

  • Measures: Quantities, amounts, costs, profits

Dimension Tables:

  • dim_date: Pre-built calendar with business and fiscal periods

  • dim_customer: Customer master with hierarchies and attributes

  • dim_product: Product catalog with categories and pricing

  • dim_territory: Geographic sales territories and regions
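
A typical query against this schema joins the fact to its dimensions; a sketch using the table names above (key and measure column names are assumptions):

```sql
-- Monthly profit by product category across the star schema.
SELECT
    d.calendar_month,
    p.category,
    sum(f.line_total) AS revenue,
    sum(f.profit)     AS profit
FROM gold.fact_sale AS f
JOIN gold.dim_date    AS d ON f.date_key    = d.date_key
JOIN gold.dim_product AS p ON f.product_key = p.product_key
GROUP BY d.calendar_month, p.category
ORDER BY d.calendar_month, p.category;
```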

Performance Optimizations:

  • Partitioning: Date-based partitioning for query pruning

  • Indexes: Optimized for common query patterns

  • Aggregations: Pre-computed summaries for dashboard queries

  • Materialized Views: Real-time aggregations for operational reporting
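
In ClickHouse, a materialized view populates its target as rows are inserted into the source table. A sketch of a daily rollup for dashboard queries (table and column names are illustrative):

```sql
-- Hypothetical rollup: new fact_sale rows are aggregated into a small
-- daily summary at insert time, so dashboards avoid scanning the fact.
CREATE MATERIALIZED VIEW IF NOT EXISTS gold.mv_daily_sales
ENGINE = SummingMergeTree
ORDER BY order_date
AS
SELECT
    toDate(order_datetime) AS order_date,
    sum(line_total)        AS revenue,
    count()                AS order_lines
FROM gold.fact_sale
GROUP BY order_date;
```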

Integration Points

Data Lineage

Data moves through the platform in one direction: MySQL (AdventureWorks) → LogStash → ClickHouse Bronze Layer → DBT transformations → ClickHouse Gold Layer → downstream dashboards and reports. Each hop is covered by one of the quality gates below.

Quality Gates

  1. Source Validation: Primary key and data type checks at ingestion

  2. Bronze Quality: Row count reconciliation and duplicate detection

  3. Gold Quality: Business rule validation and referential integrity

  4. Serving Quality: Performance monitoring and freshness checks
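
Gate 2's row-count reconciliation can be expressed as a direct comparison between layers. A minimal sketch, assuming the illustrative bronze and gold table names from earlier; in practice both sides would be filtered to the same load window:

```sql
-- A nonzero difference indicates rows dropped or duplicated in transit.
SELECT
    (SELECT count() FROM bronze.sales_orders_raw) AS bronze_rows,
    (SELECT count() FROM gold.fact_sale)          AS gold_rows,
    toInt64(bronze_rows) - toInt64(gold_rows)     AS difference;
```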

Monitoring Integration

  • LogStash: Pipeline throughput and error rates

  • ClickHouse: Query performance and storage metrics

  • DBT: Model execution times and test results

  • End-to-End: Data freshness and completeness monitoring
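
ClickHouse-side query monitoring needs no external tooling: the server records finished queries in the system.query_log table. A sketch of a slow-query check (the threshold is illustrative):

```sql
-- Today's queries slower than one second, slowest first.
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    substring(query, 1, 80) AS query_snippet
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_date = today()
  AND query_duration_ms > 1000
ORDER BY query_duration_ms DESC
LIMIT 20;
```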

Next Steps

  • Data Flow - Understand how data moves through these components

  • Environment Setup - Start configuring these components in your environment
