ETL Production Guide - Deployment on Kubernetes
Overview
This documentation provides a comprehensive guide for deploying the complete AdventureWorks ETL pipeline on Kubernetes, including MySQL source database, LogStash data ingestion, ClickHouse data warehouse, DBT transformations, and Apache Airflow orchestration.
Kubernetes Architecture
Prerequisites
Kubernetes cluster (v1.25+) with at least 16GB RAM and 8 CPU cores
kubectl configured with cluster access
Helm 3.x installed
Persistent Volume support (StorageClass configured)
LoadBalancer or Ingress controller for external access
Container registry access (Docker Hub or private registry)
Namespace Strategy
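The pipeline is split across dedicated namespaces so quotas, RBAC, and network policies can be scoped per concern. A minimal sketch; the names `etl-pipeline`, `airflow`, and `monitoring` are assumptions used consistently in the examples below, so adjust them to your conventions:

```yaml
# namespaces.yaml -- assumed namespace layout for this guide
apiVersion: v1
kind: Namespace
metadata:
  name: etl-pipeline      # MySQL, ClickHouse, LogStash
---
apiVersion: v1
kind: Namespace
metadata:
  name: airflow           # orchestration components
---
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring        # Prometheus and Grafana
```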
Step 1: MySQL Source Database Deployment
1.1 MySQL ConfigMap
Create mysql/mysql-configmap.yaml:
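A minimal sketch of the server configuration; the binlog settings are assumptions chosen so downstream ingestion tooling can track row-level changes:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql-config
  namespace: etl-pipeline
data:
  my.cnf: |
    [mysqld]
    server-id=1
    log-bin=mysql-bin       # enable binary logging for change capture
    binlog_format=ROW
    max_connections=500
```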
1.2 MySQL Secret
Create mysql/mysql-secret.yaml:
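A sketch of the credentials Secret. The values below are placeholders; in production, source them from a secret manager (e.g. External Secrets or Sealed Secrets) rather than committing them:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: mysql-secret
  namespace: etl-pipeline
type: Opaque
stringData:
  MYSQL_ROOT_PASSWORD: change-me   # placeholder
  MYSQL_DATABASE: adventureworks
  MYSQL_USER: etl_user             # assumed application user
  MYSQL_PASSWORD: change-me        # placeholder
```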
1.3 MySQL Persistent Volume
Create mysql/mysql-pvc.yaml:
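A sketch of the claim; the StorageClass name and size are assumptions to adjust per cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-pvc
  namespace: etl-pipeline
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard   # assumed; use your cluster's StorageClass
  resources:
    requests:
      storage: 20Gi
```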
1.4 MySQL Deployment
Create mysql/mysql-deployment.yaml:
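A sketch of the Deployment, wired to the ConfigMap, Secret, and PVC above. It runs a single replica with a `Recreate` strategy because MySQL cannot safely share a ReadWriteOnce volume between two pods; resource figures are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql
  namespace: etl-pipeline
spec:
  replicas: 1
  strategy:
    type: Recreate            # never run two pods on the same RWO volume
  selector:
    matchLabels: { app: mysql }
  template:
    metadata:
      labels: { app: mysql }
    spec:
      containers:
        - name: mysql
          image: mysql:8.0
          ports: [{ containerPort: 3306 }]
          envFrom:
            - secretRef: { name: mysql-secret }
          volumeMounts:
            - { name: data, mountPath: /var/lib/mysql }
            - { name: config, mountPath: /etc/mysql/conf.d }
          resources:
            requests: { cpu: 500m, memory: 1Gi }
            limits: { cpu: "2", memory: 2Gi }
          livenessProbe:
            exec: { command: ["mysqladmin", "ping", "-h", "127.0.0.1"] }
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: data
          persistentVolumeClaim: { claimName: mysql-pvc }
        - name: config
          configMap: { name: mysql-config }
```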
1.5 MySQL Service
Create mysql/mysql-service.yaml:
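A ClusterIP Service gives the database a stable in-cluster DNS name (`mysql.etl-pipeline.svc.cluster.local`), which the LogStash configuration later assumes:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mysql
  namespace: etl-pipeline
spec:
  selector: { app: mysql }
  ports:
    - port: 3306
      targetPort: 3306
```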
1.6 MySQL Data Initialization
Create mysql/mysql-initdb-configmap.yaml:
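The official `mysql` image executes any `.sql` files mounted under `/docker-entrypoint-initdb.d` on first startup, so seeding can be done with a ConfigMap mounted at that path. The schema below is an illustrative placeholder, not the real AdventureWorks dump:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql-initdb
  namespace: etl-pipeline
data:
  01-schema.sql: |
    -- Placeholder; replace with the actual AdventureWorks schema and data
    CREATE DATABASE IF NOT EXISTS adventureworks;
    USE adventureworks;
    CREATE TABLE IF NOT EXISTS sales_order_header (
      SalesOrderID INT PRIMARY KEY,
      OrderDate    DATETIME,
      TotalDue     DECIMAL(19,4)
    );
```

Mount it in the MySQL Deployment as an extra volume at `/docker-entrypoint-initdb.d`.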
Step 2: ClickHouse Data Warehouse Deployment
2.1 ClickHouse ConfigMap
Create clickhouse/clickhouse-config.yaml:
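ClickHouse merges any XML dropped into `config.d` over its defaults, so a small override file is enough; the settings shown are assumptions:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: clickhouse-config
  namespace: etl-pipeline
data:
  custom.xml: |
    <clickhouse>
      <listen_host>0.0.0.0</listen_host>
      <max_connections>1024</max_connections>
    </clickhouse>
```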
2.2 ClickHouse Deployment
Create clickhouse/clickhouse-deployment.yaml:
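A single-replica sketch exposing both the HTTP (8123) and native (9000) interfaces; image tag and resource figures are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clickhouse
  namespace: etl-pipeline
spec:
  replicas: 1
  strategy: { type: Recreate }   # single writer on an RWO volume
  selector:
    matchLabels: { app: clickhouse }
  template:
    metadata:
      labels: { app: clickhouse }
    spec:
      containers:
        - name: clickhouse
          image: clickhouse/clickhouse-server:24.3
          ports:
            - containerPort: 8123   # HTTP interface
            - containerPort: 9000   # native protocol
          volumeMounts:
            - { name: data, mountPath: /var/lib/clickhouse }
            - { name: config, mountPath: /etc/clickhouse-server/config.d }
          resources:
            requests: { cpu: "1", memory: 2Gi }
            limits: { cpu: "4", memory: 8Gi }
      volumes:
        - name: data
          persistentVolumeClaim: { claimName: clickhouse-pvc }
        - name: config
          configMap: { name: clickhouse-config }
```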
2.3 ClickHouse PVC and Service
Create clickhouse/clickhouse-pvc.yaml:
Create clickhouse/clickhouse-service.yaml:
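Sketches of both files (sizes and StorageClass are assumptions); the Service name becomes the DNS target the ingestion layer writes to:

```yaml
# clickhouse-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clickhouse-pvc
  namespace: etl-pipeline
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard
  resources:
    requests: { storage: 50Gi }
---
# clickhouse-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: clickhouse
  namespace: etl-pipeline
spec:
  selector: { app: clickhouse }
  ports:
    - { name: http, port: 8123 }
    - { name: native, port: 9000 }
```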
Step 3: LogStash Data Ingestion Deployment
3.1 LogStash ConfigMap
Create logstash/logstash-config.yaml:
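One sketch of a pipeline that polls MySQL over JDBC and posts rows to ClickHouse's HTTP interface as `JSONEachRow`. The table, query, and target database names are illustrative assumptions, and the JDBC input plugin additionally requires the MySQL Connector/J JAR to be available in the image:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: logstash-config
  namespace: etl-pipeline
data:
  pipeline.conf: |
    input {
      jdbc {
        jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
        jdbc_connection_string => "jdbc:mysql://mysql.etl-pipeline.svc:3306/adventureworks"
        jdbc_user => "etl_user"
        jdbc_password => "${MYSQL_PASSWORD}"
        schedule => "*/5 * * * *"   # poll every five minutes
        statement => "SELECT * FROM sales_order_header WHERE OrderDate > :sql_last_value"
        use_column_value => true
        tracking_column => "orderdate"
        tracking_column_type => "timestamp"
      }
    }
    output {
      http {
        url => "http://clickhouse.etl-pipeline.svc:8123/?query=INSERT%20INTO%20raw.sales_order_header%20FORMAT%20JSONEachRow"
        http_method => "post"
        format => "json"
      }
    }
```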
3.2 LogStash Deployment
Create logstash/logstash-deployment.yaml:
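A sketch that mounts the pipeline ConfigMap and injects the MySQL password from the Secret defined in Step 1; the image tag and resources are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logstash
  namespace: etl-pipeline
spec:
  replicas: 1
  selector:
    matchLabels: { app: logstash }
  template:
    metadata:
      labels: { app: logstash }
    spec:
      containers:
        - name: logstash
          image: docker.elastic.co/logstash/logstash:8.13.0
          env:
            - name: MYSQL_PASSWORD
              valueFrom:
                secretKeyRef: { name: mysql-secret, key: MYSQL_PASSWORD }
          volumeMounts:
            - name: config
              mountPath: /usr/share/logstash/pipeline   # pipeline files live here
          resources:
            requests: { cpu: 500m, memory: 1Gi }
            limits: { cpu: "2", memory: 2Gi }
      volumes:
        - name: config
          configMap: { name: logstash-config }
```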
3.3 LogStash Service
Create logstash/logstash-service.yaml:
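LogStash does not receive pipeline traffic here, so the Service only needs to expose the monitoring API (port 9600), which Prometheus can later scrape:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: logstash
  namespace: etl-pipeline
spec:
  selector: { app: logstash }
  ports:
    - { name: api, port: 9600 }
```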
Step 4: Apache Airflow Orchestration Deployment
4.1 Airflow Namespace and RBAC
Create airflow/namespace-rbac.yaml:
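A sketch granting a dedicated ServiceAccount the pod permissions Airflow typically needs (e.g. for log retrieval or pod-based task execution); tighten the verbs to what your executor actually uses:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: airflow
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: airflow
  namespace: airflow
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airflow
  namespace: airflow
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airflow
  namespace: airflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: airflow
subjects:
  - kind: ServiceAccount
    name: airflow
    namespace: airflow
```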
4.2 Airflow PostgreSQL Database
Create airflow/postgres-deployment.yaml:
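A sketch of the metadata database; credentials are placeholders (move them to a Secret in production) and `airflow-postgres-pvc` is assumed to be defined like the MySQL claim above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-postgres
  namespace: airflow
spec:
  replicas: 1
  strategy: { type: Recreate }
  selector:
    matchLabels: { app: airflow-postgres }
  template:
    metadata:
      labels: { app: airflow-postgres }
    spec:
      containers:
        - name: postgres
          image: postgres:15
          ports: [{ containerPort: 5432 }]
          env:
            - { name: POSTGRES_USER, value: airflow }
            - { name: POSTGRES_PASSWORD, value: airflow }  # placeholder; use a Secret
            - { name: POSTGRES_DB, value: airflow }
          volumeMounts:
            - { name: data, mountPath: /var/lib/postgresql/data }
      volumes:
        - name: data
          persistentVolumeClaim: { claimName: airflow-postgres-pvc }
---
apiVersion: v1
kind: Service
metadata:
  name: airflow-postgres
  namespace: airflow
spec:
  selector: { app: airflow-postgres }
  ports: [{ port: 5432 }]
```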
4.3 Airflow Redis
Create airflow/redis-deployment.yaml:
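Redis serves as the Celery broker; since its contents are transient queue state, a sketch without persistent storage is usually acceptable:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-redis
  namespace: airflow
spec:
  replicas: 1
  selector:
    matchLabels: { app: airflow-redis }
  template:
    metadata:
      labels: { app: airflow-redis }
    spec:
      containers:
        - name: redis
          image: redis:7
          ports: [{ containerPort: 6379 }]
---
apiVersion: v1
kind: Service
metadata:
  name: airflow-redis
  namespace: airflow
spec:
  selector: { app: airflow-redis }
  ports: [{ port: 6379 }]
```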
4.4 Airflow Configuration
Create airflow/airflow-config.yaml:
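A sketch of shared configuration, expressed as Airflow's `AIRFLOW__SECTION__KEY` environment variables so every component can consume it via `envFrom`. The connection strings embed placeholder credentials for brevity; in production, split those into a Secret:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-config
  namespace: airflow
data:
  AIRFLOW__CORE__EXECUTOR: CeleryExecutor
  AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@airflow-postgres:5432/airflow
  AIRFLOW__CELERY__BROKER_URL: redis://airflow-redis:6379/0
  AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@airflow-postgres:5432/airflow
  AIRFLOW__CORE__LOAD_EXAMPLES: "False"
```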
4.5 Airflow Webserver
Create airflow/airflow-webserver.yaml:
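A sketch exposing the UI through a LoadBalancer Service; the image tag is an assumption, and an init Job running `airflow db migrate` is assumed to have prepared the metadata database first:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-webserver
  namespace: airflow
spec:
  replicas: 1
  selector:
    matchLabels: { app: airflow-webserver }
  template:
    metadata:
      labels: { app: airflow-webserver }
    spec:
      serviceAccountName: airflow
      containers:
        - name: webserver
          image: apache/airflow:2.8.1
          args: ["webserver"]
          ports: [{ containerPort: 8080 }]
          envFrom:
            - configMapRef: { name: airflow-config }
          readinessProbe:
            httpGet: { path: /health, port: 8080 }
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: airflow-webserver
  namespace: airflow
spec:
  type: LoadBalancer
  selector: { app: airflow-webserver }
  ports: [{ port: 8080, targetPort: 8080 }]
```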
4.6 Airflow Scheduler
Create airflow/airflow-scheduler.yaml:
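A sketch of the scheduler, mounting the DAGs ConfigMap from section 4.7. Note that with `CeleryExecutor`, a separate worker Deployment (same image, `args: ["celery", "worker"]`) is also required even though this outline does not list one:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-scheduler
  namespace: airflow
spec:
  replicas: 1
  selector:
    matchLabels: { app: airflow-scheduler }
  template:
    metadata:
      labels: { app: airflow-scheduler }
    spec:
      serviceAccountName: airflow
      containers:
        - name: scheduler
          image: apache/airflow:2.8.1
          args: ["scheduler"]
          envFrom:
            - configMapRef: { name: airflow-config }
          volumeMounts:
            - { name: dags, mountPath: /opt/airflow/dags }
      volumes:
        - name: dags
          configMap: { name: airflow-dags }
```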
4.7 Airflow DAGs ConfigMap
Create airflow/airflow-dags.yaml:
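Shipping DAGs in a ConfigMap works for small pipelines (git-sync or a shared volume scales better). The DAG below is an illustrative placeholder, not the real pipeline:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-dags
  namespace: airflow
data:
  etl_pipeline.py: |
    # Placeholder DAG; replace with the actual AdventureWorks pipeline
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="adventureworks_etl",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        run_dbt = BashOperator(task_id="dbt_run", bash_command="dbt run")
```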
Step 5: Monitoring with Prometheus and Grafana
5.1 Prometheus Deployment
Create monitoring/prometheus.yaml:
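A sketch pairing a scrape configuration with the server Deployment. Pod service discovery additionally requires a ClusterRole allowing Prometheus to list pods, which is omitted here for brevity:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels: { app: prometheus }
  template:
    metadata:
      labels: { app: prometheus }
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.51.0
          ports: [{ containerPort: 9090 }]
          volumeMounts:
            - { name: config, mountPath: /etc/prometheus }
      volumes:
        - name: config
          configMap: { name: prometheus-config }
```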
5.2 Grafana Deployment
Create monitoring/grafana.yaml:
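A sketch of the dashboard layer; the admin password is a placeholder that should come from a Secret, and the Prometheus data source can be added via the UI or a provisioning ConfigMap:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels: { app: grafana }
  template:
    metadata:
      labels: { app: grafana }
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:10.4.0
          ports: [{ containerPort: 3000 }]
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: change-me   # placeholder; source from a Secret in production
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  type: LoadBalancer
  selector: { app: grafana }
  ports: [{ port: 3000 }]
```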
Step 6: Deployment Scripts and Automation
6.1 Main Deployment Script
Create deploy-k8s.sh:
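A sketch of a deployment driver, assuming the directory layout used in the earlier steps (`mysql/`, `clickhouse/`, `logstash/`, `airflow/`, `monitoring/`) and a `namespaces.yaml` at the repository root. The key point is ordering: databases must be ready before the layers that depend on them:

```shell
#!/usr/bin/env bash
set -euo pipefail

kubectl apply -f namespaces.yaml

# Databases first, so dependents find their endpoints at startup
kubectl apply -f mysql/
kubectl apply -f clickhouse/
kubectl rollout status deployment/mysql -n etl-pipeline --timeout=300s
kubectl rollout status deployment/clickhouse -n etl-pipeline --timeout=300s

# Then ingestion, orchestration, and monitoring
kubectl apply -f logstash/
kubectl apply -f airflow/
kubectl apply -f monitoring/

echo "Deployment submitted; run ./health-check.sh to verify."
```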
6.2 Cleanup Script
Create cleanup-k8s.sh:
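A sketch that tears down the stack by deleting the namespaces; whether volumes are also deleted depends on your StorageClass's reclaim policy:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Deleting a namespace removes every workload, Service, and PVC inside it
for ns in etl-pipeline airflow monitoring; do
  kubectl delete namespace "$ns" --ignore-not-found
done
```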
6.3 Health Check Script
Create health-check.sh:
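A sketch that lists pod status per namespace and flags anything not running:

```shell
#!/usr/bin/env bash
set -euo pipefail

for ns in etl-pipeline airflow monitoring; do
  echo "== ${ns} =="
  kubectl get pods -n "$ns" -o wide
done

# Surface any pod that is not in the Running phase
kubectl get pods -A --field-selector=status.phase!=Running \
  | grep -E 'etl-pipeline|airflow|monitoring' \
  || echo "All pipeline pods are Running."
```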
Step 7: Production Best Practices
7.1 Resource Management
Create resource-quotas.yaml:
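A sketch combining a namespace-level ResourceQuota with a LimitRange that supplies defaults for containers that omit requests/limits; the figures are assumptions sized against the prerequisites above:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: etl-quota
  namespace: etl-pipeline
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    persistentvolumeclaims: "10"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: etl-defaults
  namespace: etl-pipeline
spec:
  limits:
    - type: Container
      default:        { cpu: "1", memory: 1Gi }     # applied when limits are omitted
      defaultRequest: { cpu: 250m, memory: 256Mi }  # applied when requests are omitted
```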
7.2 Network Policies
Create network-policies.yaml:
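A sketch of the usual pattern: deny all ingress by default, then allow only the flows the pipeline needs (shown here for LogStash-to-MySQL; analogous policies are needed for ClickHouse and monitoring):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: etl-pipeline
spec:
  podSelector: {}        # selects every pod in the namespace
  policyTypes: [Ingress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-logstash-to-mysql
  namespace: etl-pipeline
spec:
  podSelector:
    matchLabels: { app: mysql }
  ingress:
    - from:
        - podSelector:
            matchLabels: { app: logstash }
      ports:
        - port: 3306
```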
7.3 Backup Strategy
Create backup-cronjob.yaml:
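A sketch of a nightly logical backup with `mysqldump`, writing to an assumed `mysql-backup-pvc` claim; for large databases, consider streaming the dump to object storage instead:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mysql-backup
  namespace: etl-pipeline
spec:
  schedule: "0 2 * * *"    # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mysqldump
              image: mysql:8.0
              command: ["/bin/sh", "-c"]
              args:
                - mysqldump -h mysql -u root -p"$MYSQL_ROOT_PASSWORD"
                  --single-transaction adventureworks
                  > /backup/adventureworks-$(date +%F).sql
              env:
                - name: MYSQL_ROOT_PASSWORD
                  valueFrom:
                    secretKeyRef: { name: mysql-secret, key: MYSQL_ROOT_PASSWORD }
              volumeMounts:
                - { name: backup, mountPath: /backup }
          volumes:
            - name: backup
              persistentVolumeClaim: { claimName: mysql-backup-pvc }   # assumed claim
```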
Step 8: Troubleshooting Guide
Common Issues and Solutions
Pod Stuck in Pending State
Service Connection Issues
LogStash Connection Failures
ClickHouse Performance Issues
Airflow DAG Issues
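A sketch of diagnostic commands for the issues above (`<pod>` is a placeholder for an actual pod name):

```shell
# Pending pods: usually unschedulable resources or an unbound PVC
kubectl describe pod <pod> -n etl-pipeline | tail -n 20
kubectl get events -n etl-pipeline --sort-by=.lastTimestamp

# Service connection issues: confirm endpoints exist and DNS resolves
kubectl get endpoints mysql -n etl-pipeline
kubectl run -it --rm dns-test --image=busybox:1.36 --restart=Never \
  -n etl-pipeline -- nslookup mysql.etl-pipeline.svc.cluster.local

# LogStash failures: inspect logs for JDBC or output errors
kubectl logs deployment/logstash -n etl-pipeline --tail=100

# ClickHouse performance: inspect running merges from inside the pod
kubectl exec -it deployment/clickhouse -n etl-pipeline -- \
  clickhouse-client --query "SELECT * FROM system.merges"

# Airflow DAG issues: scheduler logs and import errors
kubectl logs deployment/airflow-scheduler -n airflow --tail=100
```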
Step 9: Scaling and Performance Tuning
Horizontal Pod Autoscaler
Create hpa.yaml:
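A sketch scaling the LogStash tier on CPU utilization (the target and bounds are assumptions; HPA requires the metrics server to be installed):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: logstash-hpa
  namespace: etl-pipeline
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: logstash
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```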
Vertical Pod Autoscaler
Create vpa.yaml:
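VPA is a separate add-on (its CRDs and controllers must be installed first); a sketch letting it right-size ClickHouse's requests over time:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: clickhouse-vpa
  namespace: etl-pipeline
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: clickhouse
  updatePolicy:
    updateMode: "Auto"   # evicts and recreates pods with updated requests
```

Avoid combining VPA in `Auto` mode with an HPA that scales on the same CPU metric, as the two controllers will fight.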
Conclusion
This Kubernetes deployment provides a scalable, production-ready infrastructure for the AdventureWorks ETL pipeline with:
High Availability: Multi-replica deployments with health checks
Scalability: Horizontal and vertical autoscaling capabilities
Security: Network policies, RBAC, and secret management
Monitoring: Comprehensive observability with Prometheus and Grafana
Automation: GitOps-ready YAML manifests and deployment scripts
Reliability: Persistent storage, backup strategies, and disaster recovery
The infrastructure is designed to handle enterprise workloads while maintaining operational excellence and cost efficiency in Kubernetes environments.