Advanced Setup (Individual Containers)

This setup method gives you complete control over each service configuration. Choose this approach when you need custom networking, specific versions, or advanced configuration options.

Why Choose Individual Containers?

✅ Advantages:

  • Complete Control: Configure each service exactly as needed

  • Custom Networking: Set up advanced network topologies

  • Resource Management: Allocate specific CPU/memory to each service

  • Version Control: Use different versions of each component

  • Production-Like: More similar to production deployment patterns

❌ Considerations:

  • More complex setup and management

  • Manual network configuration required

  • Requires deeper understanding of each component

  • More time-consuming for initial setup

Step 1: Create Docker Network

Create a custom network for service communication:

# Create dedicated network for DataSuite services
docker network create datasuite-network

# Verify network creation
docker network ls | grep datasuite

Expected Output:

NETWORK ID     NAME                DRIVER    SCOPE
a1b2c3d4e5f6   datasuite-network   bridge    local
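
Containers attached to this network can reach each other by container name (for example, mysql or clickhouse) through Docker's embedded DNS. Once services are running, you can list the attached containers and their addresses with:

docker network inspect datasuite-network -f '{{json .Containers}}'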

Step 2: Set Up MySQL Database

Install and configure MySQL with AdventureWorks data:

# Create the local config directory mounted below (otherwise Docker creates it as root)
mkdir -p mysql-config

# Run MySQL with persistent storage and custom configuration
docker run --name mysql \
  --network datasuite-network \
  -e MYSQL_ROOT_PASSWORD=password \
  -e MYSQL_DATABASE=adventureworks \
  -p 3306:3306 \
  -v mysql-data:/var/lib/mysql \
  -v $(pwd)/mysql-config:/etc/mysql/conf.d \
  --restart unless-stopped \
  -d mysql:8.0

Advanced MySQL Configuration (optional): Create mysql-config/custom.cnf:

[mysqld]
# Performance tuning
innodb_buffer_pool_size = 2G
innodb_log_file_size = 512M
max_connections = 200

# Binary logging for replication
log-bin = mysql-bin
binlog_format = ROW
# expire_logs_days is deprecated in MySQL 8.0; use the seconds-based setting (7 days)
binlog_expire_logs_seconds = 604800

# Character set
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
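
If you add or change this file while the container is already running, restart MySQL and confirm the settings were applied. A minimal check, reusing the container name and credentials from the run command above:

docker restart mysql
docker exec mysql mysql -uroot -ppassword -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; SHOW VARIABLES LIKE 'binlog_format';"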

Load Sample Data:

# Wait for MySQL to be ready
until docker exec mysql mysqladmin ping -h localhost -u root -ppassword --silent; do
  echo "Waiting for MySQL to be ready..."
  sleep 2
done

# Load Adventure Works data
docker exec -i mysql mysql -uroot -ppassword adventureworks < docs/data/AdventureWorks2019.sql

# Verify data loading
docker exec mysql mysql -uroot -ppassword adventureworks -e "SELECT COUNT(*) FROM Sales.SalesOrderHeader;"
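
To confirm that other containers on datasuite-network can reach MySQL by name, you can run a throwaway client container on the same network (a quick sketch reusing the credentials above):

docker run --rm --network datasuite-network mysql:8.0 \
  mysql -hmysql -uroot -ppassword -e "SHOW DATABASES;"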

Step 3: Install ClickHouse Data Warehouse

Set up ClickHouse with custom configuration:

# Create the local config directory mounted below
mkdir -p clickhouse-config

# Run ClickHouse with authentication and custom settings
docker run --name clickhouse \
  --network datasuite-network \
  -p 8123:8123 \
  -p 9000:9000 \
  -e CLICKHOUSE_USER=admin \
  -e CLICKHOUSE_PASSWORD=clickhouse123 \
  -e CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1 \
  -v clickhouse-data:/var/lib/clickhouse \
  -v $(pwd)/clickhouse-config:/etc/clickhouse-server/config.d \
  --restart unless-stopped \
  -d clickhouse/clickhouse-server:latest

Advanced ClickHouse Configuration (optional): Create clickhouse-config/custom.xml:

<yandex>
    <!-- Memory settings.
         Note: max_memory_usage and max_bytes_before_external_group_by are per-query (user profile)
         settings; if they do not take effect from config.d, define them in a profile under users.d instead. -->
    <max_memory_usage>8000000000</max_memory_usage>
    <max_bytes_before_external_group_by>4000000000</max_bytes_before_external_group_by>
    
    <!-- Network settings -->
    <max_connections>1000</max_connections>
    <keep_alive_timeout>10</keep_alive_timeout>
    
    <!-- Logging -->
    <logger>
        <level>information</level>
        <log>/var/log/clickhouse-server/clickhouse-server.log</log>
        <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
        <size>1000M</size>
        <count>3</count>
    </logger>
</yandex>
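
After mounting or editing these files, restart ClickHouse and verify the merged result. ClickHouse writes the merged configuration to a preprocessed file, so one way to check (paths as used by the official image) is:

docker restart clickhouse
docker exec clickhouse grep -E "max_connections|keep_alive_timeout" /var/lib/clickhouse/preprocessed_configs/config.xml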

Initialize ClickHouse Databases:

# Wait for ClickHouse to be ready
until curl -s -u admin:clickhouse123 http://localhost:8123/ping; do
  echo "Waiting for ClickHouse to be ready..."
  sleep 2
done

# Create databases
curl -u admin:clickhouse123 http://localhost:8123/ -d "CREATE DATABASE IF NOT EXISTS bronze_layer"
curl -u admin:clickhouse123 http://localhost:8123/ -d "CREATE DATABASE IF NOT EXISTS gold_layer"

# Create bronze layer table for sales orders
curl -u admin:clickhouse123 http://localhost:8123/ -d "
CREATE TABLE IF NOT EXISTS bronze_layer.sales_orders (
    salesorderid Int32,
    customerid Int32,
    territoryid Nullable(Int32),
    orderdate DateTime,
    duedate DateTime,
    shipdate Nullable(DateTime),
    status Int8,
    subtotal Decimal64(4),
    taxamt Decimal64(4),
    freight Decimal64(4),
    totaldue Decimal64(4),
    modifieddate DateTime,
    _ingested_at DateTime DEFAULT now()
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(orderdate)
ORDER BY (orderdate, salesorderid)
SETTINGS index_granularity = 8192
"

Step 4: Install LogStash with Custom Pipeline

Set up LogStash with MySQL connector and custom configuration:

# Create directories for LogStash configuration
mkdir -p logstash-drivers logstash-config logstash-patterns

# Download MySQL JDBC driver (pin a GA release of Connector/J rather than a release candidate)
docker run --rm -v $(pwd)/logstash-drivers:/drivers alpine:latest \
  sh -c "apk add --no-cache wget && \
         wget -O /drivers/mysql-connector-java.jar \
         https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar"

# Download ClickHouse JDBC driver (optional for alternative output)
docker run --rm -v $(pwd)/logstash-drivers:/drivers alpine:latest \
  sh -c "apk add --no-cache wget && \
         wget -O /drivers/clickhouse-jdbc.jar \
         https://repo1.maven.org/maven2/ru/yandex/clickhouse/clickhouse-jdbc/0.3.2/clickhouse-jdbc-0.3.2.jar"

Create Advanced LogStash Configuration: Create logstash-config/pipelines.yml:

- pipeline.id: sales-orders
  path.config: "/usr/share/logstash/pipeline/sales-orders.conf"
  pipeline.workers: 2
  
- pipeline.id: customers  
  path.config: "/usr/share/logstash/pipeline/customers.conf"
  pipeline.workers: 1
  
- pipeline.id: products
  path.config: "/usr/share/logstash/pipeline/products.conf" 
  pipeline.workers: 1

Create Pipeline-Specific Configurations: Create logstash-config/sales-orders.conf:

input {
  jdbc {
    jdbc_driver_library => "/usr/share/logstash/drivers/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://mysql:3306/adventureworks?useSSL=false"
    jdbc_user => "root"
    jdbc_password => "password"
    statement_filepath => "/usr/share/logstash/sql-queries/sales_orders.sql"
    use_column_value => true
    # the jdbc input lowercases column names by default, so track the lowercased name
    tracking_column => "modifieddate"
    tracking_column_type => "timestamp"
    last_run_metadata_path => "/usr/share/logstash/.logstash_jdbc_last_run_sales"
    clean_run => false
    schedule => "*/30 * * * * *"
    type => "sales_order"
  }
}

filter {
  if [type] == "sales_order" {
    # Convert timestamps to ISO format
    date {
      match => [ "orderdate", "yyyy-MM-dd HH:mm:ss" ]
      target => "orderdate_iso"
    }
    
    # Add metadata
    mutate {
      add_field => { "[@metadata][table]" => "sales_orders" }
      add_field => { "[@metadata][database]" => "bronze_layer" }
      remove_field => [ "@version", "@timestamp", "type" ]
    }
    
    # Data quality checks
    if [salesorderid] {
      mutate { add_tag => [ "valid_record" ] }
    } else {
      mutate { add_tag => [ "invalid_record" ] }
    }
  }
}

output {
  if "valid_record" in [tags] {
    http {
      url => "http://clickhouse:8123/"
      http_method => "post"
      format => "message"
      message => "INSERT INTO %{[@metadata][database]}.%{[@metadata][table]} FORMAT JSONEachRow %{message}"
      headers => {
        "Authorization" => "Basic YWRtaW46Y2xpY2tob3VzZTEyMw=="
        "Content-Type" => "application/json"
      }
      # Retry configuration (automatic_retries is the http output's client-level retry count)
      retry_failed => true
      automatic_retries => 3
    }
  }
  
  if "invalid_record" in [tags] {
    file {
      path => "/usr/share/logstash/logs/invalid_sales_orders.log"
      codec => json_lines
    }
  }
  
  # Optional: Send to multiple outputs
  # elasticsearch {
  #   hosts => ["elasticsearch:9200"]
  #   index => "sales-orders-%{+YYYY.MM.dd}"
  # }
}
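
Before starting the container, you can syntax-check a pipeline file with LogStash's --config.test_and_exit flag. A sketch using a temporary container and the host directories created above (this validates syntax and plugin options but does not connect to MySQL or ClickHouse):

docker run --rm \
  -v $(pwd)/logstash-config:/usr/share/logstash/pipeline \
  -v $(pwd)/logstash-drivers:/usr/share/logstash/drivers \
  -v $(pwd)/sql-queries:/usr/share/logstash/sql-queries \
  docker.elastic.co/logstash/logstash:8.11.0 \
  bin/logstash --config.test_and_exit -f /usr/share/logstash/pipeline/sales-orders.conf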

Create SQL Query Files: Create sql-queries/sales_orders.sql:

SELECT 
    salesorderid,
    customerid,
    territoryid,
    orderdate,
    duedate,
    shipdate,
    status,
    subtotal,
    taxamt,
    freight,
    totaldue,
    modifieddate
FROM Sales.SalesOrderHeader 
WHERE ModifiedDate > :sql_last_value 
ORDER BY ModifiedDate ASC
LIMIT 1000
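
At run time, the jdbc input substitutes :sql_last_value with the last tracked modifieddate (1970-01-01 on the first run). You can dry-run the query against MySQL by substituting that value manually, for example:

docker exec mysql mysql -uroot -ppassword adventureworks -e \
  "SELECT salesorderid, customerid, totaldue, modifieddate
   FROM Sales.SalesOrderHeader
   WHERE ModifiedDate > '1970-01-01 00:00:00'
   ORDER BY ModifiedDate ASC
   LIMIT 5;"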

Run LogStash Container:

# Run LogStash with multiple pipeline support
docker run --name logstash \
  --network datasuite-network \
  -p 5044:5044 \
  -p 9600:9600 \
  -v $(pwd)/logstash-config/pipelines.yml:/usr/share/logstash/config/pipelines.yml \
  -v $(pwd)/logstash-config:/usr/share/logstash/pipeline \
  -v $(pwd)/logstash-drivers:/usr/share/logstash/drivers \
  -v $(pwd)/sql-queries:/usr/share/logstash/sql-queries \
  -v logstash-data:/usr/share/logstash/data \
  -e "LS_JAVA_OPTS=-Xmx2g -Xms2g" \
  --restart unless-stopped \
  -d docker.elastic.co/logstash/logstash:8.11.0
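
Once the container is up, the monitoring API published on port 9600 reports per-pipeline status, which is a quick way to confirm that all three pipelines loaded:

# List loaded pipelines and their event counts
curl -s http://localhost:9600/_node/pipelines?pretty

# Tail LogStash logs for JDBC or output errors
docker logs -f logstash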

Step 5: Install Apache Airflow (Optional)

Set up Airflow for workflow orchestration. This example keeps the metadata database in SQLite, which only supports the SequentialExecutor; switch to PostgreSQL or MySQL if you need the LocalExecutor or parallel task execution:

# Create Airflow directories (airflow-data holds the shared SQLite metadata database)
mkdir -p airflow-dags airflow-logs airflow-plugins airflow-data

# Initialize Airflow database and create admin user
docker run --rm \
  -v $(pwd)/airflow-dags:/opt/airflow/dags \
  -v $(pwd)/airflow-logs:/opt/airflow/logs \
  -v $(pwd)/airflow-data:/opt/airflow/data \
  -e AIRFLOW__CORE__EXECUTOR=SequentialExecutor \
  -e AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=sqlite:////opt/airflow/data/airflow.db \
  apache/airflow:2.7.0 \
  bash -c "airflow db init && airflow users create --username admin --password admin --firstname Admin --lastname User --role Admin --email admin@example.com"

# Run Airflow webserver
docker run --name airflow-webserver \
  --network datasuite-network \
  -p 8080:8080 \
  -v $(pwd)/airflow-dags:/opt/airflow/dags \
  -v $(pwd)/airflow-logs:/opt/airflow/logs \
  -v $(pwd)/airflow-plugins:/opt/airflow/plugins \
  -v $(pwd)/airflow-data:/opt/airflow/data \
  -e AIRFLOW__CORE__EXECUTOR=SequentialExecutor \
  -e AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=sqlite:////opt/airflow/data/airflow.db \
  --restart unless-stopped \
  -d apache/airflow:2.7.0 \
  webserver

# Run Airflow scheduler (in a separate container, sharing the same metadata database)
docker run --name airflow-scheduler \
  --network datasuite-network \
  -v $(pwd)/airflow-dags:/opt/airflow/dags \
  -v $(pwd)/airflow-logs:/opt/airflow/logs \
  -v $(pwd)/airflow-plugins:/opt/airflow/plugins \
  -v $(pwd)/airflow-data:/opt/airflow/data \
  -e AIRFLOW__CORE__EXECUTOR=SequentialExecutor \
  -e AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=sqlite:////opt/airflow/data/airflow.db \
  --restart unless-stopped \
  -d apache/airflow:2.7.0 \
  scheduler
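
After both containers start (the webserver can take a minute to come up), you can confirm the installation from the host; for example:

# Web UI health endpoint
curl -s http://localhost:8080/health

# List DAGs parsed from the mounted airflow-dags/ directory
docker exec airflow-webserver airflow dags list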

Step 6: Configure Service Health Monitoring

Create monitoring scripts for service health:

Create scripts/health-check.sh:

#!/bin/bash

echo "=== DataSuite ETL Health Check ==="

# Check MySQL
echo -n "MySQL: "
if docker exec mysql mysqladmin ping -h localhost -u root -ppassword --silent 2>/dev/null; then
    echo "✅ Running"
else
    echo "❌ Down"
fi

# Check ClickHouse
echo -n "ClickHouse: "
if curl -s -u admin:clickhouse123 http://localhost:8123/ping | grep -q "Ok"; then
    echo "✅ Running"
else
    echo "❌ Down"
fi

# Check LogStash
echo -n "LogStash: "
if curl -s http://localhost:9600/_node/stats >/dev/null 2>&1; then
    echo "✅ Running"
else
    echo "❌ Down"
fi

# Check Airflow (if running)
if docker ps | grep -q airflow-webserver; then
    echo -n "Airflow: "
    if curl -s http://localhost:8080/health >/dev/null 2>&1; then
        echo "✅ Running"
    else
        echo "❌ Down"
    fi
fi

echo "=== Network Connectivity ==="
# Use bash's built-in /dev/tcp so the checks work even when nc is not installed in the images
docker exec mysql bash -c '</dev/tcp/clickhouse/8123' 2>/dev/null && echo "MySQL → ClickHouse: ✅" || echo "MySQL → ClickHouse: ❌"
docker exec logstash bash -c '</dev/tcp/mysql/3306' 2>/dev/null && echo "LogStash → MySQL: ✅" || echo "LogStash → MySQL: ❌"
docker exec logstash bash -c '</dev/tcp/clickhouse/8123' 2>/dev/null && echo "LogStash → ClickHouse: ✅" || echo "LogStash → ClickHouse: ❌"

# Make the script executable and run it
chmod +x scripts/health-check.sh
./scripts/health-check.sh

Step 7: Advanced Network Configuration

Custom Bridge Network

# Create network with custom CIDR
docker network create datasuite-network \
  --driver bridge \
  --subnet=172.20.0.0/16 \
  --ip-range=172.20.240.0/20

# Assign static IPs to services
docker run --name mysql \
  --network datasuite-network \
  --ip 172.20.240.10 \
  # ... other options
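
To confirm which address a container actually received, inspect it directly, for example:

docker inspect mysql -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}'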

Service Discovery Configuration

Create /etc/hosts entries for easier access:

# Add to /etc/hosts (requires sudo)
echo "127.0.0.1 mysql.datasuite.local" | sudo tee -a /etc/hosts
echo "127.0.0.1 clickhouse.datasuite.local" | sudo tee -a /etc/hosts
echo "127.0.0.1 logstash.datasuite.local" | sudo tee -a /etc/hosts

Management Scripts

Create scripts/manage-services.sh:

#!/bin/bash

case "$1" in
  start)
    echo "Starting DataSuite ETL services..."
    docker start mysql clickhouse logstash
    if docker ps -a | grep -q airflow; then
      docker start airflow-webserver airflow-scheduler
    fi
    ;;
  stop)
    echo "Stopping DataSuite ETL services..."
    docker stop mysql clickhouse logstash
    if docker ps | grep -q airflow; then
      docker stop airflow-webserver airflow-scheduler
    fi
    ;;
  restart)
    echo "Restarting DataSuite ETL services..."
    $0 stop
    sleep 5
    $0 start
    ;;
  status)
    ./scripts/health-check.sh
    ;;
  logs)
    docker logs -f ${2:-logstash}
    ;;
  *)
    echo "Usage: $0 {start|stop|restart|status|logs [service]}"
    exit 1
    ;;
esac

# Make the script executable
chmod +x scripts/manage-services.sh
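
Typical usage:

./scripts/manage-services.sh status
./scripts/manage-services.sh logs mysql
./scripts/manage-services.sh restart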

Resource Management

CPU and Memory Limits

# Run containers with resource limits
docker run --name mysql \
  --cpus="2.0" \
  --memory="4g" \
  --memory-swap="4g" \
  # ... other options

docker run --name clickhouse \
  --cpus="4.0" \
  --memory="8g" \
  --memory-swap="8g" \
  # ... other options
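
You can verify that the limits were applied and watch live usage with:

# One-off snapshot of CPU / memory usage against the configured limits
docker stats --no-stream mysql clickhouse

# Inspect the configured limits directly (Memory in bytes, CPUs in NanoCpus)
docker inspect mysql -f 'Memory={{.HostConfig.Memory}} NanoCpus={{.HostConfig.NanoCpus}}'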

Storage Optimization

# Create volumes with specific drivers
docker volume create mysql-data --driver local \
  --opt type=ext4 \
  --opt device=/dev/sdb1

docker volume create clickhouse-data --driver local \
  --opt type=ext4 \
  --opt device=/dev/sdb2

Backup and Recovery

Create scripts/backup.sh:

#!/bin/bash

BACKUP_DIR="./backups/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"

# Backup MySQL
echo "Backing up MySQL..."
docker exec mysql mysqldump -uroot -ppassword --all-databases > "$BACKUP_DIR/mysql_backup.sql"

# Backup ClickHouse (send the query as the POST body so it needs no URL encoding)
echo "Backing up ClickHouse..."
curl -u admin:clickhouse123 http://localhost:8123/ -d "SELECT * FROM bronze_layer.sales_orders FORMAT Native" > "$BACKUP_DIR/clickhouse_bronze.native"

echo "Backup completed in $BACKUP_DIR"

Next Steps

Your individual-container setup provides maximum flexibility. Proceed to:

  1. Service Verification - Confirm everything works correctly

  2. LogStash Configuration - Customize your data pipelines

  3. DBT Getting Started - Build dimensional models

Benefits of Individual Container Setup

  • 🎛️ Fine-Grained Control: Configure each service precisely for your needs

  • 🏗️ Production Similarity: More closely matches production deployment patterns

  • 📈 Scalability: Easy to scale individual services based on load

  • 🔧 Customization: Advanced configuration options for performance tuning

  • 🚀 Deployment Flexibility: Can be adapted for Kubernetes or other orchestrators

This setup method provides the foundation for understanding how to deploy and manage DataSuite ETL in production environments.
