ETL Production Guide: Deploying on Kubernetes

Overview

This documentation provides a comprehensive guide for deploying the complete AdventureWorks ETL pipeline on Kubernetes, including the MySQL source database, Logstash data ingestion, the ClickHouse data warehouse, dbt transformations, and Apache Airflow orchestration.

Kubernetes Architecture

Prerequisites

  • Kubernetes cluster (v1.25+) with at least 16GB RAM and 8 CPU cores

  • kubectl configured with cluster access

  • Helm 3.x installed

  • Persistent Volume support (StorageClass configured)

  • LoadBalancer or Ingress controller for external access

  • Container registry access (Docker Hub or private registry)

Namespace Strategy
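Grouping workloads into dedicated namespaces keeps RBAC rules, resource quotas, and network policies scoped per concern. As a sketch, three namespaces could separate the data services, orchestration, and observability (the names etl-pipeline, airflow, and monitoring are illustrative assumptions):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: etl-pipeline    # MySQL, Logstash, ClickHouse
---
apiVersion: v1
kind: Namespace
metadata:
  name: airflow         # Airflow and its backing services
---
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring      # Prometheus and Grafana
```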

Step 1: MySQL Source Database Deployment

1.1 MySQL ConfigMap

Create mysql/mysql-configmap.yaml:
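A minimal version of this ConfigMap might look like the following sketch (the namespace and tuning values are illustrative assumptions, not the project's actual settings):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql-config
  namespace: etl-pipeline
data:
  my.cnf: |
    [mysqld]
    # Binary logging lets downstream ingestion read row-level changes
    server-id=1
    log-bin=mysql-bin
    binlog-format=ROW
    max_connections=500
    innodb_buffer_pool_size=1G
```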

1.2 MySQL Secret

Create mysql/mysql-secret.yaml:
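A sketch of the Secret, with placeholder credentials that must be replaced (or, better, generated via a tool such as Sealed Secrets or an external secret manager):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: mysql-secret
  namespace: etl-pipeline
type: Opaque
stringData:
  MYSQL_ROOT_PASSWORD: changeme-root   # placeholder: replace before deploying
  MYSQL_USER: adventureworks
  MYSQL_PASSWORD: changeme-user        # placeholder: replace before deploying
```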

1.3 MySQL Persistent Volume

Create mysql/mysql-pvc.yaml:
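A minimal PVC sketch (the StorageClass name and size are assumptions; match them to your cluster):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-pvc
  namespace: etl-pipeline
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard   # match your cluster's StorageClass
  resources:
    requests:
      storage: 20Gi
```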

1.4 MySQL Deployment

Create mysql/mysql-deployment.yaml:
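The Deployment wires together the Secret, config, init scripts, and storage defined above. A sketch, with illustrative image tag and resource figures:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql
  namespace: etl-pipeline
spec:
  replicas: 1                  # single writer; MySQL is not scaled horizontally here
  selector:
    matchLabels: {app: mysql}
  template:
    metadata:
      labels: {app: mysql}
    spec:
      containers:
        - name: mysql
          image: mysql:8.0
          ports:
            - containerPort: 3306
          envFrom:
            - secretRef: {name: mysql-secret}   # provides root/user credentials
          env:
            - name: MYSQL_DATABASE
              value: adventureworks
          volumeMounts:
            - name: data
              mountPath: /var/lib/mysql
            - name: config
              mountPath: /etc/mysql/conf.d
            - name: initdb
              mountPath: /docker-entrypoint-initdb.d
          livenessProbe:
            exec:
              command: ["mysqladmin", "ping", "-h", "127.0.0.1"]
            initialDelaySeconds: 30
            periodSeconds: 10
          resources:
            requests: {cpu: 500m, memory: 1Gi}
            limits: {cpu: "2", memory: 2Gi}
      volumes:
        - name: data
          persistentVolumeClaim: {claimName: mysql-pvc}
        - name: config
          configMap: {name: mysql-config}
        - name: initdb
          configMap: {name: mysql-initdb}
```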

1.5 MySQL Service

Create mysql/mysql-service.yaml:
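A ClusterIP Service makes the database reachable inside the cluster as mysql.etl-pipeline.svc.cluster.local:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mysql
  namespace: etl-pipeline
spec:
  selector: {app: mysql}
  ports:
    - port: 3306
      targetPort: 3306
```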

1.6 MySQL Data Initialization

Create mysql/mysql-initdb-configmap.yaml:
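Scripts mounted at /docker-entrypoint-initdb.d run once on first startup. The sketch below uses placeholders rather than the actual AdventureWorks schema; note that ConfigMaps are capped at roughly 1 MiB, so a full database dump should instead be loaded from object storage by an initContainer or a one-off Job:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql-initdb
  namespace: etl-pipeline
data:
  01-schema.sql: |
    -- Placeholder: the AdventureWorks schema and seed data go here
    CREATE DATABASE IF NOT EXISTS adventureworks;
  02-grants.sql: |
    GRANT SELECT ON adventureworks.* TO 'adventureworks'@'%';
    FLUSH PRIVILEGES;
```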

Step 2: ClickHouse Data Warehouse Deployment

2.1 ClickHouse ConfigMap

Create clickhouse/clickhouse-config.yaml:
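ClickHouse reads XML override files from users.d/ and config.d/; a ConfigMap can carry them. A sketch (the memory limit is an illustrative value):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: clickhouse-config
  namespace: etl-pipeline
data:
  users.xml: |
    <clickhouse>
      <profiles>
        <default>
          <!-- cap per-query memory at ~4 GB -->
          <max_memory_usage>4000000000</max_memory_usage>
        </default>
      </profiles>
    </clickhouse>
```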

2.2 ClickHouse Deployment

Create clickhouse/clickhouse-deployment.yaml:
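A single-replica Deployment sketch exposing both the HTTP (8123) and native TCP (9000) interfaces; the image tag and probe timings are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clickhouse
  namespace: etl-pipeline
spec:
  replicas: 1
  selector:
    matchLabels: {app: clickhouse}
  template:
    metadata:
      labels: {app: clickhouse}
    spec:
      containers:
        - name: clickhouse
          image: clickhouse/clickhouse-server:24.3
          ports:
            - containerPort: 8123   # HTTP interface
            - containerPort: 9000   # native TCP interface
          volumeMounts:
            - name: data
              mountPath: /var/lib/clickhouse
            - name: config
              mountPath: /etc/clickhouse-server/users.d
          readinessProbe:
            httpGet: {path: /ping, port: 8123}
            initialDelaySeconds: 10
          resources:
            requests: {cpu: "1", memory: 2Gi}
            limits: {cpu: "4", memory: 8Gi}
      volumes:
        - name: data
          persistentVolumeClaim: {claimName: clickhouse-pvc}
        - name: config
          configMap: {name: clickhouse-config}
```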

2.3 ClickHouse PVC and Service

Create clickhouse/clickhouse-pvc.yaml:
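A PVC sketch for the warehouse volume (size and StorageClass are assumptions):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clickhouse-pvc
  namespace: etl-pipeline
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 50Gi
```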

Create clickhouse/clickhouse-service.yaml:
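And the matching in-cluster Service:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: clickhouse
  namespace: etl-pipeline
spec:
  selector: {app: clickhouse}
  ports:
    - name: http
      port: 8123
    - name: native
      port: 9000
```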

Step 3: Logstash Data Ingestion Deployment

3.1 Logstash ConfigMap

Create logstash/logstash-config.yaml:
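One common pattern is a Logstash jdbc input polling MySQL on a schedule and an http output posting JSONEachRow batches to ClickHouse. The pipeline below is an illustrative sketch only: the query, target table (raw.events), and schedule are placeholders, and the jdbc input additionally requires the MySQL Connector/J jar to be present in the image:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: logstash-config
  namespace: etl-pipeline
data:
  logstash.conf: |
    input {
      jdbc {
        jdbc_connection_string => "jdbc:mysql://mysql:3306/adventureworks"
        jdbc_user => "adventureworks"
        jdbc_password => "${MYSQL_PASSWORD}"
        jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
        statement => "SELECT ..."        # placeholder query
        schedule => "*/5 * * * *"        # poll every 5 minutes
      }
    }
    output {
      http {
        url => "http://clickhouse:8123/?query=INSERT%20INTO%20raw.events%20FORMAT%20JSONEachRow"
        http_method => "post"
        format => "json"
      }
    }
```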

3.2 Logstash Deployment

Create logstash/logstash-deployment.yaml:
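A Deployment sketch mounting the pipeline config; the image tag is an assumption, and in practice the JDBC driver jar would be baked into a custom image or fetched by an initContainer:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logstash
  namespace: etl-pipeline
spec:
  replicas: 1
  selector:
    matchLabels: {app: logstash}
  template:
    metadata:
      labels: {app: logstash}
    spec:
      containers:
        - name: logstash
          image: docker.elastic.co/logstash/logstash:8.14.0
          env:
            - name: MYSQL_PASSWORD
              valueFrom:
                secretKeyRef: {name: mysql-secret, key: MYSQL_PASSWORD}
          ports:
            - containerPort: 9600   # monitoring API
          volumeMounts:
            - name: pipeline
              mountPath: /usr/share/logstash/pipeline
          resources:
            requests: {cpu: 500m, memory: 1Gi}
            limits: {cpu: "2", memory: 2Gi}
      volumes:
        - name: pipeline
          configMap: {name: logstash-config}
```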

3.3 Logstash Service

Create logstash/logstash-service.yaml:
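A Service exposing the monitoring API in-cluster:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: logstash
  namespace: etl-pipeline
spec:
  selector: {app: logstash}
  ports:
    - name: monitoring
      port: 9600
```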

Step 4: Apache Airflow Orchestration Deployment

4.1 Airflow Namespace and RBAC

Create airflow/namespace-rbac.yaml:
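A sketch combining the namespace, a ServiceAccount, and a namespaced Role allowing pod management (useful if Airflow later launches task pods); all names are illustrative:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: airflow
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: airflow
  namespace: airflow
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airflow-pod-manager
  namespace: airflow
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airflow-pod-manager
  namespace: airflow
subjects:
  - kind: ServiceAccount
    name: airflow
    namespace: airflow
roleRef:
  kind: Role
  name: airflow-pod-manager
  apiGroup: rbac.authorization.k8s.io
```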

4.2 Airflow PostgreSQL Database

Create airflow/postgres-deployment.yaml:
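Airflow needs a metadata database; a minimal PostgreSQL sketch follows (credentials are placeholders and belong in a Secret, and a postgres-pvc claim is assumed to exist):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  namespace: airflow
spec:
  replicas: 1
  selector:
    matchLabels: {app: postgres}
  template:
    metadata:
      labels: {app: postgres}
    spec:
      containers:
        - name: postgres
          image: postgres:15
          env:
            - name: POSTGRES_USER
              value: airflow
            - name: POSTGRES_PASSWORD
              value: airflow        # placeholder: use a Secret in production
            - name: POSTGRES_DB
              value: airflow
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: data
          persistentVolumeClaim: {claimName: postgres-pvc}
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: airflow
spec:
  selector: {app: postgres}
  ports:
    - port: 5432
```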

4.3 Airflow Redis

Create airflow/redis-deployment.yaml:
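Redis serves as the Celery broker; a minimal sketch:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: airflow
spec:
  replicas: 1
  selector:
    matchLabels: {app: redis}
  template:
    metadata:
      labels: {app: redis}
    spec:
      containers:
        - name: redis
          image: redis:7
          ports:
            - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: airflow
spec:
  selector: {app: redis}
  ports:
    - port: 6379
```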

4.4 Airflow Configuration

Create airflow/airflow-config.yaml:
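Airflow can be configured entirely through AIRFLOW__SECTION__KEY environment variables, so a ConfigMap consumed via envFrom works well. A sketch assuming the CeleryExecutor with the Postgres and Redis services above (the inline credentials are placeholders that should come from a Secret):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-config
  namespace: airflow
data:
  AIRFLOW__CORE__EXECUTOR: CeleryExecutor
  AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
  AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/0
  AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
  AIRFLOW__CORE__LOAD_EXAMPLES: "False"
```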

4.5 Airflow Webserver

Create airflow/airflow-webserver.yaml:
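A webserver sketch with a readiness probe on Airflow's /health endpoint and a LoadBalancer Service for external access (image tag and Service type are assumptions):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-webserver
  namespace: airflow
spec:
  replicas: 1
  selector:
    matchLabels: {app: airflow-webserver}
  template:
    metadata:
      labels: {app: airflow-webserver}
    spec:
      serviceAccountName: airflow
      containers:
        - name: webserver
          image: apache/airflow:2.9.3
          args: ["webserver"]
          envFrom:
            - configMapRef: {name: airflow-config}
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet: {path: /health, port: 8080}
            initialDelaySeconds: 30
            periodSeconds: 15
---
apiVersion: v1
kind: Service
metadata:
  name: airflow-webserver
  namespace: airflow
spec:
  type: LoadBalancer
  selector: {app: airflow-webserver}
  ports:
    - port: 8080
      targetPort: 8080
```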

4.6 Airflow Scheduler

Create airflow/airflow-scheduler.yaml:
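The scheduler runs the same image with a different command and mounts the DAGs ConfigMap; a sketch:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-scheduler
  namespace: airflow
spec:
  replicas: 1
  selector:
    matchLabels: {app: airflow-scheduler}
  template:
    metadata:
      labels: {app: airflow-scheduler}
    spec:
      serviceAccountName: airflow
      containers:
        - name: scheduler
          image: apache/airflow:2.9.3
          args: ["scheduler"]
          envFrom:
            - configMapRef: {name: airflow-config}
          volumeMounts:
            - name: dags
              mountPath: /opt/airflow/dags
      volumes:
        - name: dags
          configMap: {name: airflow-dags}
```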

4.7 Airflow DAGs ConfigMap

Create airflow/airflow-dags.yaml:
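Shipping DAGs in a ConfigMap works for small pipelines (a git-sync sidecar scales better). The DAG below is an illustrative placeholder, not the project's actual DAG; the dbt invocation and profiles path are assumptions:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-dags
  namespace: airflow
data:
  adventureworks_etl.py: |
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Illustrative DAG: run dbt transformations daily after ingestion
    with DAG(
        dag_id="adventureworks_etl",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command="dbt run --profiles-dir /opt/dbt",
        )
```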

Step 5: Monitoring with Prometheus and Grafana

5.1 Prometheus Deployment

Create monitoring/prometheus.yaml:
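A compact Prometheus sketch: a scrape config discovering pods, plus Deployment and Service. Pod discovery additionally needs a ServiceAccount with list/watch RBAC, which is assumed here and not shown:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels: {app: prometheus}
  template:
    metadata:
      labels: {app: prometheus}
    spec:
      serviceAccountName: prometheus   # assumed SA with pod-discovery RBAC
      containers:
        - name: prometheus
          image: prom/prometheus:v2.53.0
          args: ["--config.file=/etc/prometheus/prometheus.yml"]
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
      volumes:
        - name: config
          configMap: {name: prometheus-config}
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  selector: {app: prometheus}
  ports:
    - port: 9090
```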

5.2 Grafana Deployment

Create monitoring/grafana.yaml:
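A Grafana sketch with an externally reachable Service (the admin password is a placeholder that belongs in a Secret):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels: {app: grafana}
  template:
    metadata:
      labels: {app: grafana}
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:11.1.0
          ports:
            - containerPort: 3000
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: changeme   # placeholder: move to a Secret in production
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  type: LoadBalancer
  selector: {app: grafana}
  ports:
    - port: 3000
```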

Step 6: Deployment Scripts and Automation

6.1 Main Deployment Script

Create deploy-k8s.sh:
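The script applies the manifest directories in dependency order and waits for each tier to become ready before starting the next. This sketch assumes the directory layout used in the steps above:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Create namespaces idempotently
for ns in etl-pipeline airflow monitoring; do
  kubectl create namespace "$ns" --dry-run=client -o yaml | kubectl apply -f -
done

# Data tier first: ingestion depends on both databases being up
kubectl apply -f mysql/
kubectl rollout status deployment/mysql -n etl-pipeline --timeout=300s
kubectl apply -f clickhouse/
kubectl rollout status deployment/clickhouse -n etl-pipeline --timeout=300s

# Ingestion, orchestration, then monitoring
kubectl apply -f logstash/
kubectl apply -f airflow/
kubectl apply -f monitoring/

echo "All components applied."
```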

6.2 Cleanup Script

Create cleanup-k8s.sh:
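Deleting the namespaces removes every workload, Service, and PVC inside them; a sketch:

```bash
#!/usr/bin/env bash
set -euo pipefail

# WARNING: this also deletes persistent volumes claimed in these namespaces
for ns in etl-pipeline airflow monitoring; do
  kubectl delete namespace "$ns" --ignore-not-found
done
```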

6.3 Health Check Script

Create health-check.sh:
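A health check can combine pod-phase inspection with an in-cluster probe of a key endpoint; a sketch (the curl image and probed endpoint are assumptions):

```bash
#!/usr/bin/env bash
set -euo pipefail

for ns in etl-pipeline airflow monitoring; do
  echo "== $ns =="
  kubectl get pods -n "$ns" -o wide
  # Count pods that are neither Running nor Completed
  bad=$(kubectl get pods -n "$ns" --no-headers 2>/dev/null \
        | awk '$3 != "Running" && $3 != "Completed"' | wc -l)
  if [ "$bad" -ne 0 ]; then
    echo "Unhealthy pods in $ns"
    exit 1
  fi
done

# Probe ClickHouse from inside the cluster
kubectl run curl-check --rm -i --restart=Never --image=curlimages/curl \
  -n etl-pipeline -- curl -sf http://clickhouse:8123/ping
```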

Step 7: Production Best Practices

7.1 Resource Management

Create resource-quotas.yaml:
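A ResourceQuota caps aggregate namespace consumption while a LimitRange supplies per-container defaults; the figures below are illustrative and should be tuned to the cluster:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: etl-quota
  namespace: etl-pipeline
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    persistentvolumeclaims: "10"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: etl-defaults
  namespace: etl-pipeline
spec:
  limits:
    - type: Container
      default: {cpu: "1", memory: 1Gi}          # applied when no limit is set
      defaultRequest: {cpu: 250m, memory: 256Mi} # applied when no request is set
```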

7.2 Network Policies

Create network-policies.yaml:
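A common baseline is default-deny ingress plus explicit allows between pipeline components; a sketch permitting only Logstash to reach MySQL (label selectors are assumptions matching the earlier manifests):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: etl-pipeline
spec:
  podSelector: {}          # applies to all pods in the namespace
  policyTypes: [Ingress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-logstash-to-mysql
  namespace: etl-pipeline
spec:
  podSelector:
    matchLabels: {app: mysql}
  ingress:
    - from:
        - podSelector:
            matchLabels: {app: logstash}
      ports:
        - port: 3306
```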

7.3 Backup Strategy

Create backup-cronjob.yaml:
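A nightly mysqldump CronJob is one simple approach; the sketch below writes compressed dumps to an assumed backup-pvc claim (shipping them to object storage would be the next step):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mysql-backup
  namespace: etl-pipeline
spec:
  schedule: "0 2 * * *"   # daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mysqldump
              image: mysql:8.0
              command:
                - /bin/sh
                - -c
                - >
                  mysqldump -h mysql -u root -p"$MYSQL_ROOT_PASSWORD" adventureworks
                  | gzip > /backup/adventureworks-$(date +%F).sql.gz
              env:
                - name: MYSQL_ROOT_PASSWORD
                  valueFrom:
                    secretKeyRef: {name: mysql-secret, key: MYSQL_ROOT_PASSWORD}
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim: {claimName: backup-pvc}
```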

Step 8: Troubleshooting Guide

Common Issues and Solutions

  1. Pod Stuck in Pending State

  2. Service Connection Issues

  3. Logstash Connection Failures

  4. ClickHouse Performance Issues

  5. Airflow DAG Issues
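Each of these issues is usually diagnosed with a handful of kubectl commands; the following are generic starting points (resource names follow the examples in this guide and may differ in your deployment):

```bash
# 1. Pending pods: check Events for unschedulable resources or unbound PVCs
kubectl describe pod <pod-name> -n etl-pipeline
kubectl get pvc -n etl-pipeline

# 2. Service issues: verify endpoints exist and in-cluster DNS resolves
kubectl get endpoints -n etl-pipeline
kubectl run dns-test --rm -i --restart=Never --image=busybox \
  -n etl-pipeline -- nslookup mysql

# 3-5. Component failures: read the logs of the failing deployment
kubectl logs deployment/logstash -n etl-pipeline --tail=100
kubectl logs deployment/clickhouse -n etl-pipeline --tail=100
kubectl logs deployment/airflow-scheduler -n airflow --tail=100
```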

Step 9: Scaling and Performance Tuning

Horizontal Pod Autoscaler

Create hpa.yaml:
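An autoscaling/v2 HPA sketch targeting CPU utilization (requires metrics-server; the target deployment and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: logstash-hpa
  namespace: etl-pipeline
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: logstash
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```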

Vertical Pod Autoscaler

Create vpa.yaml:
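VPA is not built into Kubernetes; the autoscaler operator must be installed first. A sketch that lets VPA adjust ClickHouse's requests automatically:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: clickhouse-vpa
  namespace: etl-pipeline
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: clickhouse
  updatePolicy:
    updateMode: "Auto"   # VPA evicts and recreates pods to apply new requests
```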

Conclusion

This Kubernetes deployment provides a scalable, production-ready infrastructure for the AdventureWorks ETL pipeline with:

  • High Availability: Multi-replica deployments with health checks

  • Scalability: Horizontal and vertical autoscaling capabilities

  • Security: Network policies, RBAC, and secret management

  • Monitoring: Comprehensive observability with Prometheus and Grafana

  • Automation: GitOps-ready YAML manifests and deployment scripts

  • Reliability: Persistent storage, backup strategies, and disaster recovery

The infrastructure is designed to handle enterprise workloads while maintaining operational excellence and cost efficiency in Kubernetes environments.
