Data Platform Documentation

A user guide for developers getting started with the data engineering workflow on the Data suite.

Document outline

  • The overall architecture of the system

  • Getting started

    • Installation using Docker

      • Explain the architecture of the local stack

      • Step-by-step installation using individual containers

      • Installation using Docker Compose

  • Write ETL code using DBT

    • The overall data architecture

    • Step-by-step guide to writing DBT code

    • Executing the code and producing the output data

Architecture

Overview

Verification Steps

  1. Check MySQL: docker exec mysql mysql -uroot -ppassword -e "SHOW DATABASES;"

  2. Check ClickHouse: curl -u admin:clickhouse123 http://localhost:8123/ping

  3. Check LogStash: curl http://localhost:5044

  4. Check Airflow: Visit http://localhost:8080 (admin/admin)

Components

| Component | Description |
| --- | --- |
| Data Source | External data sources that provide raw data for processing |
| LogStash | Data collection engine that reads and queries data from various sources |
| Airflow | Workflow orchestration platform containing ETL jobs |
| EtlJob | Data transformation jobs that process raw data |
| ClickHouse | Columnar database for analytical workloads |
| BronzeLayer | Raw data storage layer in ClickHouse |
| GoldLayer | Processed and refined data storage layer in ClickHouse |

Getting started

Installation for local development using Docker

Prerequisites

  • Docker and Docker Compose installed

  • Git for cloning repositories

  • 8GB+ RAM recommended for full stack

Option 1: Individual Container Setup

  1. Install MySQL database

  2. Seed Adventure Works data into MySQL

Adventure Works Data Structure:

  • Sales schema: Customer orders, products, sales data

  • Production schema: Product catalog, inventory

  • Person schema: Customer and employee information

  • Sample query: SELECT COUNT(*) FROM Sales.SalesOrderHeader;

  3. Install ClickHouse data warehouse

  4. Install LogStash with MySQL connector
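As a minimal sketch, the individual containers can be started with `docker run`. Image tags are examples, and the credentials simply match the verification commands earlier in this guide; adjust them for your environment:

```shell
# Start MySQL (root password matches the verification step above)
docker run -d --name mysql \
  -e MYSQL_ROOT_PASSWORD=password \
  -p 3306:3306 \
  mysql:8.0

# Start ClickHouse with the admin user used elsewhere in this guide
docker run -d --name clickhouse \
  -e CLICKHOUSE_USER=admin \
  -e CLICKHOUSE_PASSWORD=clickhouse123 \
  -p 8123:8123 -p 9000:9000 \
  clickhouse/clickhouse-server:latest
```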

Sample LogStash Configuration (logstash.conf):
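A sketch of what this configuration could look like, using the LogStash `jdbc` input plugin and the `http` output plugin to write into ClickHouse over its HTTP interface. The bronze table name, driver path, and schedule are illustrative assumptions:

```conf
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://mysql:3306/adventureworks"
    jdbc_user => "root"
    jdbc_password => "password"
    jdbc_driver_library => "/usr/share/logstash/mysql-connector-j.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    statement => "SELECT * FROM Sales_SalesOrderHeader"
    schedule => "*/5 * * * *"   # poll every 5 minutes
  }
}

output {
  http {
    # Assumed bronze-layer target table
    url => "http://clickhouse:8123/?query=INSERT%20INTO%20bronze.sales_order_header%20FORMAT%20JSONEachRow"
    http_method => "post"
    format => "json"
    user => "admin"
    password => "clickhouse123"
  }
}
```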

Option 2: Docker Compose Setup

Create docker-compose.yml:
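A minimal compose file for the three services could look like the following sketch; image versions, credentials, and mount paths are assumptions to adapt:

```yaml
version: "3.8"

services:
  mysql:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: password
      MYSQL_DATABASE: adventureworks
    ports:
      - "3306:3306"

  clickhouse:
    image: clickhouse/clickhouse-server:latest
    environment:
      CLICKHOUSE_USER: admin
      CLICKHOUSE_PASSWORD: clickhouse123
    ports:
      - "8123:8123"
      - "9000:9000"

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    depends_on: [mysql, clickhouse]
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
      - ./mysql-connector-j.jar:/usr/share/logstash/mysql-connector-j.jar
    ports:
      - "5044:5044"
```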

Download MySQL JDBC driver and start all services:
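For example (the driver version below is illustrative; pick one compatible with your MySQL server):

```shell
# Fetch the MySQL JDBC driver referenced by logstash.conf
curl -L -o mysql-connector-j.jar \
  https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/8.3.0/mysql-connector-j-8.3.0.jar

# Start the full stack in the background
docker compose up -d
```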

  1. Expose ClickHouse securely via tunneling

Method 1: SSH Tunnel
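For example, from a client machine, forward a local port to the ClickHouse HTTP port on the platform host (hostname is a placeholder):

```shell
# ClickHouse then answers on http://localhost:8123 on the client
ssh -N -L 8123:localhost:8123 user@data-platform-host.example.com
```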

Method 2: ngrok (for development/testing)
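With ngrok installed and authenticated, a public tunnel to the local ClickHouse HTTP port is:

```shell
ngrok http 8123
```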

Method 3: ngrok with authentication
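ngrok can also require HTTP basic auth on the tunnel (credentials below are placeholders):

```shell
ngrok http 8123 --basic-auth "admin:change-me"
```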

Security Best Practices:

  • Use IP whitelisting in ngrok dashboard

  • Enable ClickHouse authentication: users.xml configuration

  • Use HTTPS/TLS connections for production

  • Rotate access credentials regularly

  • Monitor access logs for suspicious activity

Write DBT code to ETL data from the warehouse bronze layer to the gold layer

Step 2: Dimensional Model for the Sales Process

Overview

| Fact Table | Dim Date | Dim Territory | Dim Product |
| --- | --- | --- | --- |
| fact_sale | x | x | x |

Original table ERD Structure

Table Explanations

Fact Table

| Table Name | Purpose | Description | Key Metrics | Grain |
| --- | --- | --- | --- | --- |
| fact_sale | Central fact table for sales transactions | Contains measurable sales events with foreign keys to dimension tables. Each row represents a sales order line item with associated quantities, amounts, and dimensional context. | OrderQty (quantity sold) • UnitPrice (price per unit) • LineTotal (extended price) • SubTotal (order subtotal) • TaxAmt (tax amount) • TotalDue (final amount) | One row per sales order detail line item |

Dimension Tables

| Table Name | Purpose | Description | Key Attributes | Type |
| --- | --- | --- | --- | --- |
| dim_date | Time dimension | Provides temporal context for sales analysis with pre-computed date attributes for easy filtering and grouping by various time periods. | Date (primary date) • Year, Quarter, Month, Day • DayOfWeek, WeekOfYear • IsWeekend, IsHoliday • FiscalYear, FiscalQuarter | Type 1 SCD (Static) |
| dim_territory | Geographic dimension | Contains sales territory information for regional analysis and geographic reporting. Supports hierarchical geographic analysis from region to country. | TerritoryID (business key) • TerritoryName • CountryRegionCode • TerritoryGroup • RegionName | Type 1 SCD |
| dim_product | Product dimension | Product catalog information enabling product-based analysis. Contains current and historical product attributes for trend analysis. | ProductID (business key) • ProductName • ProductNumber • ProductCategory • ProductSubcategory • StandardCost, ListPrice • ProductLine, ProductModel | Type 2 SCD (Historical) |

Star Schema Relationships

Data Sources and Lineage

| Dimension/Fact | Source Tables | Transformation Notes |
| --- | --- | --- |
| fact_sale | Sales_SalesOrderHeader, Sales_SalesOrderDetail | Joined on SalesOrderID, aggregated by line item |
| dim_date | Date functions/calendar table | Generated programmatically with business calendar rules |
| dim_territory | Sales_SalesTerritory | Direct mapping with cleansing and standardization |
| dim_product | Production_Product, Production_ProductCategory, Production_ProductSubcategory | Denormalized product hierarchy for performance |
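Since dim_date is generated programmatically, here is a minimal sketch of that generation in Python. The column names follow the key attributes listed above; fiscal and holiday logic is omitted because the business calendar rules are project-specific:

```python
from datetime import date, timedelta

def generate_dim_date(start: date, end: date) -> list[dict]:
    """Generate one dim_date row per day with pre-computed attributes."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "Date": current.isoformat(),
            "Year": current.year,
            "Quarter": (current.month - 1) // 3 + 1,
            "Month": current.month,
            "Day": current.day,
            "DayOfWeek": current.isoweekday(),  # 1 = Monday ... 7 = Sunday
            "WeekOfYear": current.isocalendar()[1],
            "IsWeekend": current.isoweekday() >= 6,
        })
        current += timedelta(days=1)
    return rows

# First week of 2024 (2024-01-01 is a Monday)
rows = generate_dim_date(date(2024, 1, 1), date(2024, 1, 7))
```

In practice the same logic is usually expressed in SQL (e.g. a dbt model over a date spine), but the row shape is the same.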

DBT Project Setup Steps

Step 1: Environment Setup

Step 2: Create New DBT Project

Step 3: Configure Database Connection

Update profiles.yml file (usually located in ~/.dbt/profiles.yml):
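A sketch of the profile, assuming the dbt-clickhouse adapter and the local credentials used earlier; the profile name and target schema are assumptions:

```yaml
adventureworks:
  target: dev
  outputs:
    dev:
      type: clickhouse
      host: localhost
      port: 8123
      user: admin
      password: clickhouse123
      schema: bronze
      secure: false
```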

Step 4: Install Custom Data Suite Adapter (Optional)

Step 5: Configure Project Structure

Update dbt_project.yml:
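For example, materializing staging models as views and mart models as tables in a gold schema (names are illustrative):

```yaml
name: adventureworks
version: "1.0.0"
profile: adventureworks

model-paths: ["models"]

models:
  adventureworks:
    staging:
      +materialized: view
    marts:
      core:
        +materialized: table
        +schema: gold
```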

Step 6: Create Layer Structure

Step 7: Define Data Sources

Create models/staging/sources.yml:
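A sketch declaring the bronze-layer tables listed in the lineage section as dbt sources (the source and schema names are assumptions):

```yaml
version: 2

sources:
  - name: bronze
    schema: bronze
    tables:
      - name: Sales_SalesOrderHeader
      - name: Sales_SalesOrderDetail
      - name: Sales_SalesTerritory
      - name: Production_Product
      - name: Production_ProductCategory
      - name: Production_ProductSubcategory
```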

Step 8: Create Staging Models

Create staging models to clean and standardize bronze layer data:
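For example, a staging model for the order header (file name, column subset, and renames are illustrative):

```sql
-- models/staging/stg_sales_order_header.sql
select
    SalesOrderID as sales_order_id,
    OrderDate    as order_date,
    TerritoryID  as territory_id,
    SubTotal     as sub_total,
    TaxAmt       as tax_amt,
    TotalDue     as total_due
from {{ source('bronze', 'Sales_SalesOrderHeader') }}
```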

Step 9: Build Dimensional Models

Create dimensional model files:

  • models/marts/core/dim_date.sql

  • models/marts/core/dim_territory.sql

  • models/marts/core/dim_product.sql

  • models/marts/core/fact_sale.sql
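As a sketch of the fact model, joining header and detail on SalesOrderID per the lineage table; it assumes staging models named stg_sales_order_header and stg_sales_order_detail, and the key names are illustrative:

```sql
-- models/marts/core/fact_sale.sql
select
    d.SalesOrderDetailID as sale_key,
    h.OrderDate          as date_key,
    h.TerritoryID        as territory_key,
    d.ProductID          as product_key,
    d.OrderQty,
    d.UnitPrice,
    d.LineTotal,
    h.SubTotal,
    h.TaxAmt,
    h.TotalDue
from {{ ref('stg_sales_order_detail') }} d
join {{ ref('stg_sales_order_header') }} h
  on d.SalesOrderID = h.SalesOrderID
```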

Step 10: Run DBT Pipeline
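A typical run sequence from the project directory looks like:

```shell
dbt debug          # verify the profile can reach ClickHouse
dbt run            # build staging models, then dimensional models
dbt test           # run schema and data tests
dbt docs generate  # optional: build project documentation
```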

Step 11: Schedule with Airflow (Optional)

Create DAG for automated runs:
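A minimal sketch of such a DAG using BashOperator; the DAG id, project path, and schedule are assumptions:

```python
# dags/dbt_adventureworks.py
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_adventureworks",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",  # nightly at 02:00
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/adventureworks && dbt run",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/dbt/adventureworks && dbt test",
    )
    dbt_run >> dbt_test
```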

Step 12: Data Quality & Monitoring

  • Set up data quality tests in tests/ directory

  • Configure alerting for test failures

  • Monitor pipeline performance and data freshness

  • Set up incremental model strategies for large tables

Create and deploy ETL code to Airflow (production)

Overview

This section covers deploying the complete AdventureWorks ETL pipeline to Apache Airflow for production use, including LogStash data ingestion, DBT transformations, and monitoring.

Production Architecture

Prerequisites

  • Docker and Docker Compose installed

  • Apache Airflow 2.7+ running

  • ClickHouse database accessible

  • DBT project configured (from previous section)

  • LogStash configuration files ready

Step 1: Airflow Environment Setup

1.1 Docker Compose Configuration

Create docker-compose.yml for Airflow production setup:
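A heavily simplified sketch; for production, the official docker-compose.yaml from the Airflow documentation is the recommended starting point. Service names, credentials, and the LocalExecutor choice below are assumptions:

```yaml
version: "3.8"

services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow

  airflow:
    build: .   # custom image with requirements.txt installed
    depends_on: [postgres]
    environment:
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    volumes:
      - ./dags:/opt/airflow/dags
    ports:
      - "8080:8080"
```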

1.2 Install Required Packages

Create requirements.txt for additional Python packages:
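An illustrative package set for this pipeline (pin versions in a real deployment):

```
dbt-core
dbt-clickhouse
clickhouse-connect
```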

Build custom Airflow image with requirements:
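For example, with a Dockerfile extending the official image (the Airflow version tag is an example):

```dockerfile
FROM apache/airflow:2.7.3
COPY requirements.txt /requirements.txt
RUN pip install --no-cache-dir -r /requirements.txt
```

Then build it: `docker build -t adventureworks-airflow:latest .`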

Step 2: Create Production DAGs

2.1 Main ETL Pipeline DAG

Create dags/adventureworks_etl_pipeline.py:
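A sketch of the main pipeline: one task triggers a batch LogStash run into the bronze layer, then dbt builds the gold layer. Paths and commands are assumptions, and the one-shot LogStash invocation assumes the jdbc input has no schedule so the process exits after the load:

```python
# dags/adventureworks_etl_pipeline.py
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="adventureworks_etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest_bronze = BashOperator(
        task_id="ingest_bronze",
        bash_command="logstash -f /etc/logstash/conf.d/logstash.conf",
    )
    transform_gold = BashOperator(
        task_id="transform_gold",
        bash_command="cd /opt/dbt/adventureworks && dbt build",
    )
    ingest_bronze >> transform_gold
```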

2.2 Incremental Loading DAG

Create dags/adventureworks_incremental_etl.py:

Step 3: Configuration Management

3.1 Airflow Variables

Set up Airflow variables via Web UI or CLI:
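Via the CLI, for example (variable names and values are placeholders):

```shell
airflow variables set dbt_project_dir /opt/dbt/adventureworks
airflow variables set clickhouse_database bronze
```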

3.2 Connections

Create Airflow connections for database access:
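For example, a connection to ClickHouse over its HTTP interface (connection id and parameters are placeholders):

```shell
airflow connections add clickhouse_default \
  --conn-type http \
  --conn-host clickhouse \
  --conn-port 8123 \
  --conn-login admin \
  --conn-password clickhouse123
```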

Step 4: Deployment Process

4.1 Pre-deployment Checklist

4.2 Production Deployment

Step 5: Monitoring and Alerting

5.1 Set up Airflow Monitoring

5.2 Alerting Configuration

Add to airflow.cfg:
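For email alerts on task failure, the [smtp] section needs to point at a reachable mail server (host and sender below are placeholders):

```
[smtp]
smtp_host = smtp.example.com
smtp_starttls = True
smtp_port = 587
smtp_mail_from = airflow-alerts@example.com
```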

Step 6: Performance Optimization

6.1 Resource Allocation

6.2 Parallel Execution

Step 7: Backup and Recovery

7.1 Automated Backups

7.2 Disaster Recovery Plan

  1. Data Recovery: Restore from ClickHouse backups

  2. Pipeline Recovery: Redeploy from version control

  3. State Recovery: Restore Airflow metadata database

  4. Validation: Run data quality checks post-recovery

Step 8: Security Best Practices

  1. Secrets Management: Use Airflow's secret backend

  2. Access Control: Implement RBAC for Airflow users

  3. Network Security: Use VPN/private networks

  4. Encryption: Enable SSL/TLS for all connections

  5. Audit Logging: Monitor all pipeline activities

Step 9: Documentation and Handover

  1. Runbook: Document troubleshooting procedures

  2. Architecture Diagrams: Maintain up-to-date system diagrams

  3. Change Management: Document all configuration changes

  4. Training: Provide team training on pipeline operations
