Data Platform Documentation
A user guide for developers getting started with the data engineering workflow on the Data Suite.
Document outline
The overall architecture of the system
Getting started
Installation using Docker
The architecture of the local stack
Step-by-step installation using individual containers
Installation using Docker Compose
Write ETL code using DBT
The overall data architecture
Writing DBT code step by step
Executing the code and producing the output data
Architecture
Overview
Verification Steps
Check MySQL:
docker exec mysql mysql -uroot -ppassword -e "SHOW DATABASES;"
Check ClickHouse:
curl -u admin:clickhouse123 http://localhost:8123/ping
Check LogStash:
curl http://localhost:5044
Check Airflow: visit http://localhost:8080 (admin/admin)
Components
Data Source
External data sources that provide raw data for processing
LogStash
Data collection engine that reads and queries data from various sources
Airflow
Workflow orchestration platform containing ETL jobs
EtlJob
Data transformation jobs that process raw data
ClickHouse
Columnar database for analytical workloads
BronzeLayer
Raw data storage layer in ClickHouse
GoldLayer
Processed and refined data storage layer in ClickHouse
Getting started
Installation for local development using Docker
Prerequisites
Docker and Docker Compose installed
Git for cloning repositories
8GB+ RAM recommended for full stack
Option 1: Individual Container Setup
Install MySQL database
Seed Adventure Works data into MySQL
Adventure Works Data Structure:
Sales schema: Customer orders, products, sales data
Production schema: Product catalog, inventory
Person schema: Customer and employee information
Sample query:
SELECT COUNT(*) FROM Sales.SalesOrderHeader;
Install ClickHouse data warehouse
Install LogStash with MySQL connector
Sample LogStash Configuration (logstash.conf):
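A minimal sketch of what this pipeline could look like, assuming the JDBC input plugin polls MySQL and the HTTP output plugin posts rows into a ClickHouse bronze table; paths, credentials, table names, and the HTTP-based ClickHouse insert are illustrative assumptions, not the project's actual configuration:

```conf
input {
  jdbc {
    # Path to the MySQL JDBC driver downloaded earlier (assumed location)
    jdbc_driver_library => "/usr/share/logstash/mysql-connector-j.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://mysql:3306/adventureworks"
    jdbc_user => "root"
    jdbc_password => "password"
    schedule => "*/5 * * * *"   # poll every five minutes
    statement => "SELECT * FROM Sales.SalesOrderHeader"
  }
}

output {
  # Insert each event into ClickHouse over its HTTP interface (one possible approach)
  http {
    url => "http://clickhouse:8123/?query=INSERT%20INTO%20bronze.Sales_SalesOrderHeader%20FORMAT%20JSONEachRow"
    http_method => "post"
    format => "json"
    user => "admin"
    password => "clickhouse123"
  }
}
```

One pipeline file per source table keeps the configuration simple; alternatively, a single pipeline can route by table using conditionals.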
Option 2: Docker Compose Setup
Create docker-compose.yml:
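A hedged sketch of the compose file, wiring together the four services described in the architecture section; image tags, ports, and credentials are illustrative and should be adjusted to your environment (they mirror the verification commands above):

```yaml
services:
  mysql:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: password
    ports: ["3306:3306"]

  clickhouse:
    image: clickhouse/clickhouse-server:23.8
    environment:
      CLICKHOUSE_USER: admin
      CLICKHOUSE_PASSWORD: clickhouse123
    ports: ["8123:8123", "9000:9000"]

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.1
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
      - ./mysql-connector-j.jar:/usr/share/logstash/mysql-connector-j.jar
    ports: ["5044:5044"]
    depends_on: [mysql, clickhouse]

  airflow:
    image: apache/airflow:2.7.3
    command: standalone
    ports: ["8080:8080"]
```

Run `docker compose up -d` and use the verification steps above to confirm each service is healthy.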
Download MySQL JDBC driver and start all services:
Expose ClickHouse securely via tunneling
Method 1: SSH Tunnel
Method 2: ngrok (for development/testing)
Method 3: ngrok with authentication
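The three methods above can be sketched as follows; host names and credentials are placeholders, and the ngrok flags assume ngrok v3:

```sh
# Method 1: forward local port 8123 to ClickHouse on a remote host
ssh -N -L 8123:localhost:8123 user@remote-host

# Method 2: expose the local ClickHouse HTTP port (development/testing only)
ngrok http 8123

# Method 3: require basic auth on the ngrok endpoint
ngrok http 8123 --basic-auth "admin:clickhouse123"
```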
Security Best Practices:
Use IP whitelisting in ngrok dashboard
Enable ClickHouse authentication via the users.xml configuration
Use HTTPS/TLS connections for production
Rotate access credentials regularly
Monitor access logs for suspicious activity
Write DBT code to transform data from the warehouse bronze layer to the gold layer
Step 2: Dimensional Model for the Sales Process
Overview
The model centers on the fact_sale table, joined to the dimension tables described below.
Original table ERD Structure
Table Explanations
Fact Table
fact_sale
Central fact table for sales transactions
Contains measurable sales events with foreign keys to dimension tables. Each row represents a sales order line item with associated quantities, amounts, and dimensional context.
• OrderQty (quantity sold)
• UnitPrice (price per unit)
• LineTotal (extended price)
• SubTotal (order subtotal)
• TaxAmt (tax amount)
• TotalDue (final amount)
One row per sales order detail line item
Dimension Tables
dim_date
Time dimension
Provides temporal context for sales analysis with pre-computed date attributes for easy filtering and grouping by various time periods.
• Date (primary date)
• Year, Quarter, Month, Day
• DayOfWeek, WeekOfYear
• IsWeekend, IsHoliday
• FiscalYear, FiscalQuarter
Type 1 SCD (Static)
dim_territory
Geographic dimension
Contains sales territory information for regional analysis and geographic reporting. Supports hierarchical geographic analysis from region to country.
• TerritoryID (business key)
• TerritoryName
• CountryRegionCode
• TerritoryGroup
• RegionName
Type 1 SCD
dim_product
Product dimension
Product catalog information enabling product-based analysis. Contains current and historical product attributes for trend analysis.
• ProductID (business key)
• ProductName
• ProductNumber
• ProductCategory
• ProductSubcategory
• StandardCost, ListPrice
• ProductLine, ProductModel
Type 2 SCD (Historical)
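dim_product is tracked as a Type 2 SCD, meaning changed attributes expire the current row and insert a new version rather than overwriting it. In DBT this is typically handled by snapshots, but the core logic can be sketched in plain Python (illustrative only; column names follow the dimension above):

```python
from datetime import date

def scd2_merge(current_rows, incoming_rows, business_key, tracked_cols, today=None):
    """Type 2 SCD merge: expire changed rows, insert new versions.

    current_rows: existing dimension rows, each with 'valid_from' and
    'valid_to' (None means the row is the current version).
    incoming_rows: fresh source rows keyed by the same business key.
    """
    today = today or date.today()
    # Index the *current* version of each business key
    by_key = {r[business_key]: r for r in current_rows if r["valid_to"] is None}
    result = list(current_rows)
    for new in incoming_rows:
        cur = by_key.get(new[business_key])
        changed = cur is None or any(cur[c] != new[c] for c in tracked_cols)
        if not changed:
            continue
        if cur is not None:
            cur["valid_to"] = today          # expire the old version
        row = dict(new)
        row["valid_from"], row["valid_to"] = today, None
        result.append(row)                   # insert the new version
    return result
```

DBT snapshots implement the same idea declaratively (`strategy: check` over the tracked columns), so in practice you would configure a snapshot rather than hand-write this merge.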
Star Schema Relationships
Data Sources and Lineage
fact_sale
Sales_SalesOrderHeader
Sales_SalesOrderDetail
Joined on SalesOrderID, aggregated by line item
dim_date
Date functions/calendar table
Generated programmatically with business calendar rules
dim_territory
Sales_SalesTerritory
Direct mapping with cleansing and standardization
dim_product
Production_Product
Production_ProductCategory
Production_ProductSubcategory
Denormalized product hierarchy for performance
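The lineage above notes that dim_date is generated programmatically. A minimal stdlib sketch of that generation follows; the July fiscal-year start is an assumption, and IsHoliday is left False because holiday rules come from a business calendar not shown here:

```python
from datetime import date, timedelta

def build_dim_date(start, end, fiscal_start_month=7):
    """Generate one dim_date row per calendar day in [start, end]."""
    rows = []
    d = start
    while d <= end:
        # Fiscal year is named for the calendar year in which it ends
        fiscal_year = d.year + 1 if d.month >= fiscal_start_month else d.year
        fiscal_quarter = ((d.month - fiscal_start_month) % 12) // 3 + 1
        rows.append({
            "Date": d,
            "Year": d.year,
            "Quarter": (d.month - 1) // 3 + 1,
            "Month": d.month,
            "Day": d.day,
            "DayOfWeek": d.isoweekday(),        # 1 = Monday ... 7 = Sunday
            "WeekOfYear": d.isocalendar()[1],
            "IsWeekend": d.isoweekday() >= 6,
            "IsHoliday": False,                 # join a holiday table in practice
            "FiscalYear": fiscal_year,
            "FiscalQuarter": fiscal_quarter,
        })
        d += timedelta(days=1)
    return rows
```

In the DBT project itself the equivalent is usually a SQL model built from a generated date spine (e.g. dbt_utils.date_spine) rather than Python.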
DBT Project Setup Steps
Step 1: Environment Setup
Step 2: Create New DBT Project
Step 3: Configure Database Connection
Update profiles.yml file (usually located in ~/.dbt/profiles.yml):
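Assuming the dbt-clickhouse adapter and the ClickHouse credentials used earlier in this guide, the profile might look like the following; the profile name must match the `profile:` key in dbt_project.yml, and all values are illustrative:

```yaml
data_suite:
  target: dev
  outputs:
    dev:
      type: clickhouse
      schema: gold          # target ClickHouse database for built models
      host: localhost
      port: 8123
      user: admin
      password: clickhouse123
      secure: false
```

Run `dbt debug` afterwards to confirm the connection resolves.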
Step 4: Install Custom Data Suite Adapter (Optional)
Step 5: Configure Project Structure
Update dbt_project.yml:
Step 6: Create Layer Structure
Step 7: Define Data Sources
Create models/staging/sources.yml:
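A hedged example of the sources file, declaring the bronze tables listed in the lineage section; the `bronze` database name assumes LogStash writes there, so adjust it to your ingestion setup:

```yaml
version: 2

sources:
  - name: bronze
    database: bronze        # ClickHouse database populated by LogStash
    tables:
      - name: Sales_SalesOrderHeader
      - name: Sales_SalesOrderDetail
      - name: Sales_SalesTerritory
      - name: Production_Product
      - name: Production_ProductCategory
      - name: Production_ProductSubcategory
```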
Step 8: Create Staging Models
Create staging models to clean and standardize bronze layer data:
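One possible staging model, sketched under the assumption that the bronze table mirrors AdventureWorks column names; the file name, casts, and filters are illustrative:

```sql
-- models/staging/stg_sales_orderheader.sql (illustrative sketch)
with source as (
    select * from {{ source('bronze', 'Sales_SalesOrderHeader') }}
)

select
    SalesOrderID,
    CustomerID,
    TerritoryID,
    toDate(OrderDate)        as OrderDate,
    toDecimal64(SubTotal, 4) as SubTotal,
    toDecimal64(TaxAmt, 4)   as TaxAmt,
    toDecimal64(TotalDue, 4) as TotalDue
from source
where SalesOrderID is not null
```

A matching stg_sales_orderdetail model would stage the order-line table the same way.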
Step 9: Build Dimensional Models
Create dimensional model files:
models/marts/core/dim_date.sql
models/marts/core/dim_territory.sql
models/marts/core/dim_product.sql
models/marts/core/fact_sale.sql
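The fact model could look like the following sketch, joining header and detail on SalesOrderID as described in the lineage section; the staging-model names and column list are assumptions:

```sql
-- models/marts/core/fact_sale.sql (illustrative sketch)
select
    d.SalesOrderID,
    d.SalesOrderDetailID,
    h.OrderDate,
    h.TerritoryID,
    d.ProductID,
    d.OrderQty,
    d.UnitPrice,
    d.LineTotal,
    h.SubTotal,
    h.TaxAmt,
    h.TotalDue
from {{ ref('stg_sales_orderdetail') }} d
join {{ ref('stg_sales_orderheader') }} h
    on d.SalesOrderID = h.SalesOrderID
```

This yields one row per sales order detail line item, matching the grain stated for fact_sale above.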
Step 10: Run DBT Pipeline
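A typical run sequence uses the standard dbt CLI commands (run from the project root, against the profile configured earlier):

```sh
dbt debug          # verify the profiles.yml connection
dbt run            # build staging and mart models
dbt test           # run schema and data tests
dbt docs generate  # build the documentation site
```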
Step 11: Schedule with Airflow (Optional)
Create DAG for automated runs:
Step 12: Data Quality & Monitoring
Set up data quality tests in the tests/ directory
Configure alerting for test failures
Monitor pipeline performance and data freshness
Set up incremental model strategies for large tables
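DBT's built-in not_null and unique tests cover most of these checks declaratively; the underlying idea is simple enough to sketch in plain Python (illustrative, not the pipeline's actual test code):

```python
def check_not_null(rows, column):
    """Return rows where `column` is NULL -- an empty list means the test passes."""
    return [r for r in rows if r.get(column) is None]

def check_unique(rows, column):
    """Return values of `column` that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        v = r[column]
        if v in seen:
            dupes.add(v)
        seen.add(v)
    return sorted(dupes, key=repr)
```

In DBT these become `tests: [not_null, unique]` entries in a model's schema.yml, and a failing test surfaces in `dbt test` output for alerting.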
Create and deploy ETL code to Airflow (production)
Overview
This section covers deploying the complete AdventureWorks ETL pipeline to Apache Airflow for production use, including LogStash data ingestion, DBT transformations, and monitoring.
Production Architecture
Prerequisites
Docker and Docker Compose installed
Apache Airflow 2.7+ running
ClickHouse database accessible
DBT project configured (from previous section)
LogStash configuration files ready
Step 1: Airflow Environment Setup
1.1 Docker Compose Configuration
Create docker-compose.yml for Airflow production setup:
1.2 Install Required Packages
Create requirements.txt for additional Python packages:
Build custom Airflow image with requirements:
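A minimal Dockerfile for this step, assuming requirements.txt lists the extra packages (e.g. dbt-core and dbt-clickhouse); the base image tag is illustrative:

```dockerfile
FROM apache/airflow:2.7.3

COPY requirements.txt /requirements.txt
RUN pip install --no-cache-dir -r /requirements.txt
```

Build and tag it with `docker build -t custom-airflow:latest .`, then reference that image in the Airflow docker-compose.yml.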
Step 2: Create Production DAGs
2.1 Main ETL Pipeline DAG
Create dags/adventureworks_etl_pipeline.py:
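A hedged sketch of what this DAG could contain: a nightly full run that triggers LogStash ingestion, then DBT build and test. The task commands, paths, and schedule are assumptions to adapt to your deployment:

```python
# dags/adventureworks_etl_pipeline.py -- illustrative sketch
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="adventureworks_etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",       # nightly at 02:00
    catchup=False,
    tags=["adventureworks"],
) as dag:
    ingest = BashOperator(
        task_id="logstash_ingest",
        bash_command="logstash -f /opt/logstash/logstash.conf",  # assumed path
    )
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/data_suite && dbt run",        # assumed path
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="cd /opt/dbt/data_suite && dbt test",
    )

    ingest >> dbt_run >> dbt_test
```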
2.2 Incremental Loading DAG
Create dags/adventureworks_incremental_etl.py:
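The incremental DAG's core idea is a high-watermark: persist the newest ModifiedDate seen in the previous run (e.g. as an Airflow Variable) and load only rows newer than it. A stdlib sketch of that logic (column name and storage mechanism are assumptions):

```python
def filter_new_rows(rows, watermark, ts_column="ModifiedDate"):
    """Keep only rows modified after the last successful load.

    `watermark` is the max ts_column value from the previous run;
    None means this is the first run, so everything loads.
    """
    if watermark is None:
        return list(rows)
    return [r for r in rows if r[ts_column] > watermark]

def next_watermark(rows, watermark, ts_column="ModifiedDate"):
    """Advance the watermark to the newest timestamp seen."""
    candidates = [r[ts_column] for r in rows]
    if watermark is not None:
        candidates.append(watermark)
    return max(candidates)
```

In the DAG, a first task reads the stored watermark, the load task uses it in the extraction query's WHERE clause, and a final task writes the new watermark back only after the load succeeds.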
Step 3: Configuration Management
3.1 Airflow Variables
Set up Airflow variables via Web UI or CLI:
3.2 Connections
Create Airflow connections for database access:
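Hedged CLI examples for both steps; the variable names, connection id, and values are placeholders chosen to match the stack in this guide:

```sh
# Variables consumed by the DAGs
airflow variables set dbt_project_dir /opt/dbt/data_suite
airflow variables set etl_schedule "0 2 * * *"

# Connection to ClickHouse over its HTTP interface
airflow connections add clickhouse_default \
    --conn-type http \
    --conn-host clickhouse \
    --conn-port 8123 \
    --conn-login admin \
    --conn-password clickhouse123
```

The same values can be entered via Admin → Variables and Admin → Connections in the web UI.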
Step 4: Deployment Process
4.1 Pre-deployment Checklist
4.2 Production Deployment
Step 5: Monitoring and Alerting
5.1 Set up Airflow Monitoring
5.2 Alerting Configuration
Add to airflow.cfg:
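For email alerting, the relevant section is `[smtp]`; the host and addresses below are placeholders, and the password should come from a secrets backend rather than the config file:

```ini
[smtp]
smtp_host = smtp.example.com
smtp_starttls = True
smtp_ssl = False
smtp_user = airflow@example.com
smtp_port = 587
smtp_mail_from = airflow@example.com
```

With SMTP configured, set `email` and `email_on_failure=True` in the DAGs' default_args to receive failure notifications.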
Step 6: Performance Optimization
6.1 Resource Allocation
6.2 Parallel Execution
Step 7: Backup and Recovery
7.1 Automated Backups
7.2 Disaster Recovery Plan
Data Recovery: Restore from ClickHouse backups
Pipeline Recovery: Redeploy from version control
State Recovery: Restore Airflow metadata database
Validation: Run data quality checks post-recovery
Step 8: Security Best Practices
Secrets Management: Use Airflow's secret backend
Access Control: Implement RBAC for Airflow users
Network Security: Use VPN/private networks
Encryption: Enable SSL/TLS for all connections
Audit Logging: Monitor all pipeline activities
Step 9: Documentation and Handover
Runbook: Document troubleshooting procedures
Architecture Diagrams: Maintain up-to-date system diagrams
Change Management: Document all configuration changes
Training: Provide team training on pipeline operations