# Component Details

This section provides detailed information about each component in the DataSuite ETL system, including configuration, capabilities, and integration points.

## System Components

| ID          | Description                                                             |
| ----------- | ----------------------------------------------------------------------- |
| Data Source | External data sources that provide raw data for processing              |
| LogStash    | Data collection engine that reads and queries data from various sources |
| Airflow     | Workflow orchestration platform containing ETL jobs                     |
| EtlJob      | Data transformation jobs that process raw data                          |
| Clickhouse  | Columnar database for analytical workloads                              |
| BronzeLayer | Raw data storage layer in ClickHouse                                    |
| GoldLayer   | Processed and refined data storage layer in ClickHouse                  |

## Data Source Layer

### MySQL Database (AdventureWorks)

**Purpose**: Operational database containing business transaction data

**Key Characteristics:**

* **Database**: MySQL 8.0 with AdventureWorks 2019 sample dataset
* **Schema Structure**: Normalized relational design with multiple schemas:
  * `Sales` schema: Customer orders, products, sales data
  * `Production` schema: Product catalog, inventory
  * `Person` schema: Customer and employee information
* **Data Volume**: \~500MB with realistic business data
* **Update Frequency**: Continuous transactional updates

**Sample Data Structure:**

```plantuml
class Sales_SalesOrderDetail
class Sales_SaleOrderHeader  
class Sales_Customer
class Sales_SalesTerritory

Sales_SaleOrderHeader --> Sales_SalesOrderDetail
Sales_SaleOrderHeader -> Sales_Customer
Sales_SaleOrderHeader -> Sales_SalesTerritory
```

**Connection Details:**

* **Host**: mysql (container) / localhost:3306 (local)
* **Authentication**: Username/password with SSL support
* **Access Pattern**: Read-only queries via JDBC connector

## Data Ingestion Layer

### LogStash

**Purpose**: Real-time data extraction and initial transformation

**Key Capabilities:**

* **JDBC Input Plugin**: Connects to MySQL using JDBC drivers
* **Incremental Loading**: Timestamp-based change data capture
* **Data Filtering**: Basic transformations and data cleansing
* **Output Formatting**: JSON formatting for ClickHouse compatibility

**Configuration Highlights:**

```ruby
input {
  jdbc {
    jdbc_driver_library => "/usr/share/logstash/drivers/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://mysql:3306/adventureworks"
    statement => "SELECT * FROM Sales.SalesOrderHeader WHERE ModifiedDate > :sql_last_value"
    use_column_value => true
    tracking_column => "ModifiedDate"
    schedule => "*/30 * * * * *"  # Every 30 seconds
  }
}
```

**Performance Characteristics:**

* **Throughput**: \~1,000 records/second
* **Latency**: Near real-time (30-second polling)
* **Memory Usage**: \~512MB per pipeline
* **Scaling**: Horizontal scaling via multiple LogStash instances

## Data Warehouse Layer

### ClickHouse

**Purpose**: High-performance columnar database for analytical workloads

**Architecture:**

* **Storage Engine**: MergeTree family optimized for analytics
* **Compression**: Built-in LZ4/ZSTD compression (80%+ reduction)
* **Indexing**: Sparse indexes and bloom filters for fast queries
* **Replication**: Multi-master replication for high availability

**Database Structure:**

```sql
-- Bronze Layer Schema
bronze_layer.sales_orders          -- Raw sales data from LogStash
bronze_layer.customers             -- Customer master data  
bronze_layer.products              -- Product catalog
bronze_layer.sales_order_details   -- Order line items

-- Gold Layer Schema  
gold_layer.fact_sale               -- Sales fact table
gold_layer.dim_date                -- Date dimension
gold_layer.dim_customer            -- Customer dimension
gold_layer.dim_product             -- Product dimension
gold_layer.dim_territory           -- Territory dimension
```

**Performance Specifications:**

* **Query Speed**: Sub-second response for most analytical queries
* **Compression Ratio**: 8:1 average compression vs raw data
* **Concurrent Users**: 100+ simultaneous connections
* **Data Volume**: Scales to petabyte level

## Data Transformation Layer

### DBT (Data Build Tool)

**Purpose**: SQL-based transformation and dimensional modeling

**Project Structure:**

```
adventureworks_analytics/
├── models/
│   ├── staging/           # Bronze layer data cleaning
│   └── marts/            # Gold layer dimensional models
├── tests/                # Data quality tests
├── macros/               # Reusable SQL functions
└── docs/                 # Generated documentation
```

**Key Features:**

* **Incremental Models**: Efficient processing of only changed data
* **Testing Framework**: Built-in data quality validation
* **Documentation**: Auto-generated lineage and column descriptions
* **Version Control**: Git-based development workflow

**Model Types:**

1. **Staging Models** (`stg_*`): Light transformations on raw data
2. **Intermediate Models** (`int_*`): Business logic and calculations
3. **Mart Models** (`dim_*`, `fact_*`): Final dimensional models

## Bronze Layer (Raw Data)

### Purpose and Design

**Objective**: Store raw, unprocessed data exactly as received from source systems

**Key Characteristics:**

* **Data Fidelity**: Exact copy of source data with minimal transformation
* **Audit Trail**: Complete history of all ingested data
* **Reprocessing Capability**: Ability to replay transformations from raw data
* **Schema Evolution**: Flexible schema to accommodate source changes

**Table Structure:**

```sql
CREATE TABLE bronze_layer.sales_orders (
    -- Original source columns
    salesorderid Int32,
    customerid Int32, 
    orderdate DateTime,
    subtotal Decimal64(4),
    taxamt Decimal64(4),
    totaldue Decimal64(4),
    
    -- Metadata columns
    _ingested_at DateTime DEFAULT now(),
    _source_file String DEFAULT '',
    _pipeline_run_id String DEFAULT ''
) ENGINE = MergeTree()
ORDER BY (orderdate, salesorderid);
```

## Gold Layer (Dimensional Models)

### Star Schema Design

**Objective**: Business-ready dimensional models optimized for analytics

**Fact Tables:**

* **fact\_sale**: Central sales transaction fact with measures and foreign keys
* Grain: One row per sales order line item
* Measures: Quantities, amounts, costs, profits

**Dimension Tables:**

* **dim\_date**: Pre-built calendar with business and fiscal periods
* **dim\_customer**: Customer master with hierarchies and attributes
* **dim\_product**: Product catalog with categories and pricing
* **dim\_territory**: Geographic sales territories and regions

**Performance Optimizations:**

* **Partitioning**: Date-based partitioning for query pruning
* **Indexes**: Optimized for common query patterns
* **Aggregations**: Pre-computed summaries for dashboard queries
* **Materialized Views**: Real-time aggregations for operational reporting

## Integration Points

### Data Lineage

```
MySQL → LogStash → ClickHouse Bronze → DBT → ClickHouse Gold → Analytics Tools
```

### Quality Gates

1. **Source Validation**: Primary key and data type checks at ingestion
2. **Bronze Quality**: Row count reconciliation and duplicate detection
3. **Gold Quality**: Business rule validation and referential integrity
4. **Serving Quality**: Performance monitoring and freshness checks

### Monitoring Integration

* **LogStash**: Pipeline throughput and error rates
* **ClickHouse**: Query performance and storage metrics
* **DBT**: Model execution times and test results
* **End-to-End**: Data freshness and completeness monitoring

## Next Steps

* [**Data Flow**](/data-platform/index/architecture/data-flow.md) - Understand how data moves through these components
* [**Environment Setup**](/data-platform/index/setup/prerequisites.md) - Start configuring these components in your environment


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.datasuite.vn/data-platform/index/architecture/components.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
