Component Details
This section provides detailed information about each component in the DataSuite ETL system, including configuration, capabilities, and integration points.
System Components
Data Source
External data sources that provide raw data for processing
LogStash
Data collection engine that reads and queries data from various sources
Airflow
Workflow orchestration platform containing ETL jobs
EtlJob
Data transformation jobs that process raw data
Clickhouse
Columnar database for analytical workloads
BronzeLayer
Raw data storage layer in ClickHouse
GoldLayer
Processed and refined data storage layer in ClickHouse
Data Source Layer
MySQL Database (AdventureWorks)
Purpose: Operational database containing business transaction data
Key Characteristics:
Database: MySQL 8.0 with AdventureWorks 2019 sample dataset
Schema Structure: Normalized relational design with multiple schemas:
Sales schema: Customer orders, products, sales data
Production schema: Product catalog, inventory
Person schema: Customer and employee information
Data Volume: ~500MB with realistic business data
Update Frequency: Continuous transactional updates
Sample Data Structure:
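The project's actual DDL is not reproduced here; a minimal sketch of a representative sales order table, with column names assumed from the AdventureWorks 2019 layout, looks like this:

```sql
-- Hypothetical excerpt of the Sales schema; table and column names are
-- assumptions based on AdventureWorks 2019, not the project's actual DDL.
CREATE TABLE salesorderheader (
    SalesOrderID  INT UNSIGNED   NOT NULL AUTO_INCREMENT,
    CustomerID    INT UNSIGNED   NOT NULL,
    OrderDate     DATETIME       NOT NULL,
    TotalDue      DECIMAL(19, 4) NOT NULL,
    ModifiedDate  DATETIME       NOT NULL,  -- driver for incremental loading
    PRIMARY KEY (SalesOrderID)
);
```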
Connection Details:
Host: mysql (container) / localhost:3306 (local)
Authentication: Username/password with SSL support
Access Pattern: Read-only queries via JDBC connector
Data Ingestion Layer
LogStash
Purpose: Real-time data extraction and initial transformation
Key Capabilities:
JDBC Input Plugin: Connects to MySQL using JDBC drivers
Incremental Loading: Timestamp-based change data capture
Data Filtering: Basic transformations and data cleansing
Output Formatting: JSON formatting for ClickHouse compatibility
Configuration Highlights:
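The pipeline configuration itself is not shown here; the sketch below illustrates the JDBC input with timestamp-based incremental loading and a JSON output to ClickHouse's HTTP interface. Connection strings, credentials, table names, and the output target are assumptions, not the project's actual settings.

```conf
# Hypothetical LogStash pipeline; all names and endpoints are assumptions.
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://mysql:3306/adventureworks"
    jdbc_user              => "etl_reader"
    jdbc_driver_class      => "com.mysql.cj.jdbc.Driver"
    schedule               => "*/30 * * * * *"   # 30-second polling
    statement              => "SELECT * FROM salesorderheader WHERE ModifiedDate > :sql_last_value"
    use_column_value       => true
    tracking_column        => "modifieddate"     # timestamp-based change data capture
    tracking_column_type   => "timestamp"
  }
}
output {
  http {
    # Assumed target: ClickHouse HTTP interface inserting into a bronze table.
    url         => "http://clickhouse:8123/?query=INSERT+INTO+bronze.sales_orders+FORMAT+JSONEachRow"
    http_method => "post"
    format      => "json"
  }
}
```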
Performance Characteristics:
Throughput: ~1,000 records/second
Latency: Near real-time (30-second polling)
Memory Usage: ~512MB per pipeline
Scaling: Horizontal scaling via multiple LogStash instances
Data Warehouse Layer
ClickHouse
Purpose: High-performance columnar database for analytical workloads
Architecture:
Storage Engine: MergeTree family optimized for analytics
Compression: Built-in LZ4/ZSTD compression (80%+ reduction)
Indexing: Sparse indexes and bloom filters for fast queries
Replication: Multi-master replication for high availability
Database Structure:
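The concrete table schemas are covered in the Bronze and Gold layer sections below; at the database level, the layering described in this document implies a layout along these lines (names are assumptions):

```sql
-- Illustrative layout only; database names are assumptions inferred from
-- the Bronze/Gold layering described in this document.
CREATE DATABASE IF NOT EXISTS bronze;  -- raw data landed by LogStash
CREATE DATABASE IF NOT EXISTS gold;    -- dimensional models built by DBT
```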
Performance Specifications:
Query Speed: Sub-second response for most analytical queries
Compression Ratio: 8:1 average compression vs raw data
Concurrent Users: 100+ simultaneous connections
Data Volume: Scales to petabyte level
Data Transformation Layer
DBT (Data Build Tool)
Purpose: SQL-based transformation and dimensional modeling
Project Structure:
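The repository layout is not reproduced here; a typical DBT project following the defaults would look roughly like this (directory names are the dbt defaults and may differ from the actual project):

```
dbt_project.yml          # project configuration and model paths
models/
  staging/               # stg_* models reading from the bronze layer
  intermediate/          # int_* models with business logic
  marts/                 # dim_* and fact_* dimensional models
tests/                   # custom data quality tests
macros/                  # reusable SQL snippets
```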
Key Features:
Incremental Models: Efficient processing of only changed data
Testing Framework: Built-in data quality validation
Documentation: Auto-generated lineage and column descriptions
Version Control: Git-based development workflow
Model Types:
Staging Models (stg_*): Light transformations on raw data (see the sketch after this list)
Intermediate Models (int_*): Business logic and calculations
Mart Models (dim_*, fact_*): Final dimensional models
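As an illustration of a staging model, the sketch below reads the raw orders table and applies light renaming and casting; the source, model, and column names are assumptions, not the project's actual models.

```sql
-- models/staging/stg_sales_orders.sql (hypothetical example)
-- Light transformation of the bronze orders table: rename columns, cast types.
{{ config(materialized='view') }}

select
    SalesOrderID       as sales_order_id,
    CustomerID         as customer_id,
    toDate(OrderDate)  as order_date,
    TotalDue           as total_due
from {{ source('bronze', 'sales_orders') }}
```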
Bronze Layer (Raw Data)
Purpose and Design
Objective: Store raw, unprocessed data exactly as received from source systems
Key Characteristics:
Data Fidelity: Exact copy of source data with minimal transformation
Audit Trail: Complete history of all ingested data
Reprocessing Capability: Ability to replay transformations from raw data
Schema Evolution: Flexible schema to accommodate source changes
Table Structure:
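The exact DDL is not included here; a hedged sketch of a bronze table, keeping source columns as-is and appending ingestion metadata for the audit trail, might look like this:

```sql
-- Hypothetical bronze table; database, table, and column names are assumptions.
CREATE TABLE bronze.sales_orders
(
    SalesOrderID  UInt32,
    CustomerID    UInt32,
    OrderDate     DateTime,
    TotalDue      Decimal(19, 4),
    ModifiedDate  DateTime,
    _ingested_at  DateTime DEFAULT now(),    -- audit trail: load timestamp
    _source       String   DEFAULT 'mysql'   -- audit trail: origin system
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(OrderDate)
ORDER BY (OrderDate, SalesOrderID);
```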
Gold Layer (Dimensional Models)
Star Schema Design
Objective: Business-ready dimensional models optimized for analytics
Fact Tables:
fact_sale: Central sales transaction fact with measures and foreign keys
Grain: One row per sales order line item
Measures: Quantities, amounts, costs, profits
Dimension Tables:
dim_date: Pre-built calendar with business and fiscal periods
dim_customer: Customer master with hierarchies and attributes
dim_product: Product catalog with categories and pricing
dim_territory: Geographic sales territories and regions
Performance Optimizations:
Partitioning: Date-based partitioning for query pruning
Indexes: Optimized for common query patterns
Aggregations: Pre-computed summaries for dashboard queries
Materialized Views: Real-time aggregations for operational reporting
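To show how the star schema is used, the hedged query below rolls sales up by month and product category; table and column names follow the models listed above but are assumptions rather than the project's exact schema.

```sql
-- Hypothetical dashboard query against the gold layer star schema.
SELECT
    d.calendar_month,
    p.product_category,
    sum(f.line_total) AS revenue,
    sum(f.order_qty)  AS units_sold
FROM gold.fact_sale AS f
INNER JOIN gold.dim_date    AS d ON f.date_key    = d.date_key
INNER JOIN gold.dim_product AS p ON f.product_key = p.product_key
WHERE d.calendar_year = 2014
GROUP BY d.calendar_month, p.product_category
ORDER BY d.calendar_month;
```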
Integration Points
Data Lineage
Data flows from the MySQL source through LogStash into the ClickHouse Bronze layer, where DBT transformations build the Gold layer dimensional models used for analytics.
Quality Gates
Source Validation: Primary key and data type checks at ingestion
Bronze Quality: Row count reconciliation and duplicate detection
Gold Quality: Business rule validation and referential integrity
Serving Quality: Performance monitoring and freshness checks
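Several of these gates map onto DBT's built-in testing framework; the sketch below shows how uniqueness, not-null, and referential-integrity tests could be declared, with model and column names assumed rather than taken from the project.

```yaml
# models/marts/schema.yml (hypothetical example)
version: 2
models:
  - name: fact_sale
    columns:
      - name: sales_order_detail_key
        tests:
          - unique        # duplicate detection
          - not_null
      - name: customer_key
        tests:
          - relationships:            # referential integrity against dim_customer
              to: ref('dim_customer')
              field: customer_key
```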
Monitoring Integration
LogStash: Pipeline throughput and error rates
ClickHouse: Query performance and storage metrics
DBT: Model execution times and test results
End-to-End: Data freshness and completeness monitoring
Next Steps
Data Flow - Understand how data moves through these components
Environment Setup - Start configuring these components in your environment