# Component Details

This section provides detailed information about each component in the DataSuite ETL system, including configuration, capabilities, and integration points.

## System Components

| ID          | Description                                                             |
| ----------- | ----------------------------------------------------------------------- |
| Data Source | External data sources that provide raw data for processing              |
| LogStash    | Data collection engine that extracts and forwards data from source systems |
| Airflow     | Workflow orchestration platform containing ETL jobs                     |
| EtlJob      | Data transformation jobs that process raw data                          |
| Clickhouse  | Columnar database for analytical workloads                              |
| BronzeLayer | Raw data storage layer in ClickHouse                                    |
| GoldLayer   | Processed and refined data storage layer in ClickHouse                  |

## Data Source Layer

### MySQL Database (AdventureWorks)

**Purpose**: Operational database containing business transaction data

**Key Characteristics:**

* **Database**: MySQL 8.0 with AdventureWorks 2019 sample dataset
* **Schema Structure**: Normalized relational design with multiple schemas:
  * `Sales` schema: Customer orders, products, sales data
  * `Production` schema: Product catalog, inventory
  * `Person` schema: Customer and employee information
* **Data Volume**: ~500 MB with realistic business data
* **Update Frequency**: Continuous transactional updates

**Sample Data Structure:**

```plantuml
class Sales_SalesOrderDetail
class Sales_SalesOrderHeader
class Sales_Customer
class Sales_SalesTerritory

Sales_SalesOrderHeader --> Sales_SalesOrderDetail
Sales_SalesOrderHeader --> Sales_Customer
Sales_SalesOrderHeader --> Sales_SalesTerritory
```

**Connection Details:**

* **Host**: mysql (container) / localhost:3306 (local)
* **Authentication**: Username/password with SSL support
* **Access Pattern**: Read-only queries via JDBC connector
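
To make the access pattern concrete, a query along these lines (a sketch; table and column names follow the standard AdventureWorks layout) joins order headers to customers and territories during extraction:

```sql
-- Illustrative read-only extraction query; names assume the
-- standard AdventureWorks 2019 schema.
SELECT
    soh.SalesOrderID,
    soh.OrderDate,
    soh.TotalDue,
    c.CustomerID,
    st.Name AS TerritoryName
FROM Sales.SalesOrderHeader AS soh
JOIN Sales.Customer         AS c  ON soh.CustomerID  = c.CustomerID
JOIN Sales.SalesTerritory   AS st ON soh.TerritoryID = st.TerritoryID
WHERE soh.ModifiedDate > '2019-01-01'  -- incremental extraction window
ORDER BY soh.ModifiedDate;
```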

## Data Ingestion Layer

### LogStash

**Purpose**: Real-time data extraction and initial transformation

**Key Capabilities:**

* **JDBC Input Plugin**: Connects to MySQL using JDBC drivers
* **Incremental Loading**: Timestamp-based change data capture
* **Data Filtering**: Basic transformations and data cleansing
* **Output Formatting**: JSON formatting for ClickHouse compatibility

**Configuration Highlights:**

```ruby
input {
  jdbc {
    jdbc_driver_library => "/usr/share/logstash/drivers/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://mysql:3306/adventureworks"
    jdbc_user => "${MYSQL_USER}"          # credentials resolved from environment/keystore
    jdbc_password => "${MYSQL_PASSWORD}"
    statement => "SELECT * FROM Sales.SalesOrderHeader WHERE ModifiedDate > :sql_last_value"
    use_column_value => true
    tracking_column => "modifieddate"     # the JDBC input lowercases column names by default
    tracking_column_type => "timestamp"   # required when tracking a non-numeric column
    schedule => "*/30 * * * * *"          # every 30 seconds
  }
}
```

**Performance Characteristics:**

* **Throughput**: ~1,000 records/second
* **Latency**: Near real-time (30-second polling)
* **Memory Usage**: ~512 MB per pipeline
* **Scaling**: Horizontal scaling via multiple LogStash instances

## Data Warehouse Layer

### ClickHouse

**Purpose**: High-performance columnar database for analytical workloads

**Architecture:**

* **Storage Engine**: MergeTree family optimized for analytics
* **Compression**: Built-in LZ4/ZSTD compression (80%+ reduction)
* **Indexing**: Sparse indexes and bloom filters for fast queries
* **Replication**: Multi-master replication for high availability

**Database Structure:**

```sql
-- Bronze Layer Schema
bronze_layer.sales_orders          -- Raw sales data from LogStash
bronze_layer.customers             -- Customer master data  
bronze_layer.products              -- Product catalog
bronze_layer.sales_order_details   -- Order line items

-- Gold Layer Schema  
gold_layer.fact_sale               -- Sales fact table
gold_layer.dim_date                -- Date dimension
gold_layer.dim_customer            -- Customer dimension
gold_layer.dim_product             -- Product dimension
gold_layer.dim_territory           -- Territory dimension
```

**Performance Specifications:**

* **Query Speed**: Sub-second response for most analytical queries
* **Compression Ratio**: 8:1 average compression vs raw data
* **Concurrent Users**: 100+ simultaneous connections
* **Data Volume**: Scales to petabyte level
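
As a sketch of the analytical access pattern these specifications support (the column names here are assumptions, not taken from the actual DDL), a typical dashboard query aggregates the fact table across dimensions:

```sql
-- Illustrative gold-layer query; column names are assumed.
SELECT
    d.calendar_month,
    t.territory_name,
    sum(f.line_total) AS revenue,
    sum(f.order_qty)  AS units_sold
FROM gold_layer.fact_sale AS f
JOIN gold_layer.dim_date      AS d ON f.date_key      = d.date_key
JOIN gold_layer.dim_territory AS t ON f.territory_key = t.territory_key
WHERE d.calendar_year = 2019
GROUP BY d.calendar_month, t.territory_name
ORDER BY d.calendar_month;
```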

## Data Transformation Layer

### DBT (Data Build Tool)

**Purpose**: SQL-based transformation and dimensional modeling

**Project Structure:**

```
adventureworks_analytics/
├── models/
│   ├── staging/           # Bronze layer data cleaning
│   └── marts/             # Gold layer dimensional models
├── tests/                # Data quality tests
├── macros/               # Reusable SQL functions
└── docs/                 # Generated documentation
```

**Key Features:**

* **Incremental Models**: Efficient processing of only changed data
* **Testing Framework**: Built-in data quality validation
* **Documentation**: Auto-generated lineage and column descriptions
* **Version Control**: Git-based development workflow

**Model Types:**

1. **Staging Models** (`stg_*`): Light transformations on raw data
2. **Intermediate Models** (`int_*`): Business logic and calculations
3. **Mart Models** (`dim_*`, `fact_*`): Final dimensional models
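
A minimal sketch of an incremental staging model (the source definition and column list are assumptions, not the project's actual code) shows how DBT processes only changed data:

```sql
-- models/staging/stg_sales_orders.sql — a minimal sketch; assumes
-- a 'bronze_layer' source is declared in the project's sources YAML.
{{ config(materialized='incremental', unique_key='salesorderid') }}

SELECT
    salesorderid,
    customerid,
    orderdate,
    totaldue,
    _ingested_at
FROM {{ source('bronze_layer', 'sales_orders') }}
{% if is_incremental() %}
  -- On incremental runs, pick up only rows newer than the last load
  WHERE _ingested_at > (SELECT max(_ingested_at) FROM {{ this }})
{% endif %}
```

The first run builds the full table; subsequent runs append or upsert only new rows on `salesorderid`, with the exact merge behavior depending on the ClickHouse adapter.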

## Bronze Layer (Raw Data)

### Purpose and Design

**Objective**: Store raw, unprocessed data exactly as received from source systems

**Key Characteristics:**

* **Data Fidelity**: Exact copy of source data with minimal transformation
* **Audit Trail**: Complete history of all ingested data
* **Reprocessing Capability**: Ability to replay transformations from raw data
* **Schema Evolution**: Flexible schema to accommodate source changes

**Table Structure:**

```sql
CREATE TABLE bronze_layer.sales_orders (
    -- Original source columns
    salesorderid Int32,
    customerid Int32, 
    orderdate DateTime,
    subtotal Decimal64(4),
    taxamt Decimal64(4),
    totaldue Decimal64(4),
    
    -- Metadata columns
    _ingested_at DateTime DEFAULT now(),
    _source_file String DEFAULT '',
    _pipeline_run_id String DEFAULT ''
) ENGINE = MergeTree()
ORDER BY (orderdate, salesorderid);
```
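
Because the bronze layer is append-only, the metadata columns make replay straightforward; a sketch of recovering the latest version of each order:

```sql
-- Sketch: reconstruct the current state of each order from the
-- append-only bronze table, using the _ingested_at metadata column.
SELECT
    salesorderid,
    argMax(customerid, _ingested_at) AS customerid,
    argMax(orderdate,  _ingested_at) AS orderdate,
    argMax(totaldue,   _ingested_at) AS totaldue,
    max(_ingested_at)                AS last_seen
FROM bronze_layer.sales_orders
GROUP BY salesorderid;
```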

## Gold Layer (Dimensional Models)

### Star Schema Design

**Objective**: Business-ready dimensional models optimized for analytics

**Fact Tables:**

* **fact_sale**: Central sales transaction fact with measures and foreign keys
  * **Grain**: One row per sales order line item
  * **Measures**: Quantities, amounts, costs, profits

**Dimension Tables:**

* **dim_date**: Pre-built calendar with business and fiscal periods
* **dim_customer**: Customer master with hierarchies and attributes
* **dim_product**: Product catalog with categories and pricing
* **dim_territory**: Geographic sales territories and regions

**Performance Optimizations:**

* **Partitioning**: Date-based partitioning for query pruning
* **Indexes**: Optimized for common query patterns
* **Aggregations**: Pre-computed summaries for dashboard queries
* **Materialized Views**: Real-time aggregations for operational reporting
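
A sketch of how the partitioning and materialized-view optimizations could look in ClickHouse DDL (the columns and the rollup are illustrative assumptions, not the deployed schema):

```sql
-- Hypothetical fact table with date partitioning for query pruning.
CREATE TABLE gold_layer.fact_sale (
    date_key      Date,
    customer_key  UInt32,
    product_key   UInt32,
    territory_key UInt32,
    order_qty     UInt16,
    line_total    Decimal64(4)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(date_key)
ORDER BY (date_key, territory_key, product_key);

-- Materialized view maintaining a daily revenue rollup in real time.
CREATE MATERIALIZED VIEW gold_layer.mv_daily_revenue
ENGINE = SummingMergeTree()
ORDER BY (date_key, territory_key)
AS SELECT
    date_key,
    territory_key,
    sum(line_total) AS revenue
FROM gold_layer.fact_sale
GROUP BY date_key, territory_key;
```

Note that queries against the rollup should still apply `sum()` over the key columns, since SummingMergeTree collapses rows only as background merges run.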

## Integration Points

### Data Lineage

```
MySQL → LogStash → ClickHouse Bronze → DBT → ClickHouse Gold → Analytics Tools
```

### Quality Gates

1. **Source Validation**: Primary key and data type checks at ingestion
2. **Bronze Quality**: Row count reconciliation and duplicate detection
3. **Gold Quality**: Business rule validation and referential integrity
4. **Serving Quality**: Performance monitoring and freshness checks
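
As a sketch, the bronze and gold checks might be expressed as ClickHouse queries like these (the checks are illustrative; table names follow the schemas above):

```sql
-- Bronze: duplicate detection on the natural key.
SELECT salesorderid, count() AS copies
FROM bronze_layer.sales_orders
GROUP BY salesorderid
HAVING copies > 1;

-- Gold: referential integrity — fact rows with no matching customer.
SELECT count() AS orphaned_rows
FROM gold_layer.fact_sale AS f
LEFT ANTI JOIN gold_layer.dim_customer AS c
    ON f.customer_key = c.customer_key;
```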

### Monitoring Integration

* **LogStash**: Pipeline throughput and error rates
* **ClickHouse**: Query performance and storage metrics
* **DBT**: Model execution times and test results
* **End-to-End**: Data freshness and completeness monitoring
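
On the ClickHouse side, query performance can be pulled directly from the built-in `system.query_log` table; a sketch of a slow-query report:

```sql
-- Ten slowest queries over the last hour, from ClickHouse's query log.
SELECT
    substring(query, 1, 80) AS query_preview,
    query_duration_ms,
    read_rows,
    event_time
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;
```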

## Next Steps

* [**Data Flow**](https://docs.datasuite.vn/data-platform/index/architecture/data-flow) - Understand how data moves through these components
* [**Environment Setup**](https://docs.datasuite.vn/data-platform/index/setup/prerequisites) - Start configuring these components in your environment
