Quick Start (Docker Compose)

This is the recommended setup method for getting started quickly with DataSuite ETL. Docker Compose orchestrates all services with a single command.

Why Choose Docker Compose?

✅ Advantages:

  • Fast Setup: Single command to start entire stack

  • Simplified Networking: Automatic service discovery and communication

  • Easy Management: Start, stop, and restart all services together

  • Consistent Environment: Same configuration across different machines

  • Beginner Friendly: Minimal configuration required

❌ Limitations:

  • Less control over individual service configuration

  • All services use default settings

  • Limited customization options

Step 1: Download MySQL JDBC Driver

First, create the directory structure and download the required MySQL connector:

# Create directory for LogStash drivers
mkdir -p logstash-drivers

# Download MySQL JDBC driver using a Docker container
docker run --rm -v $(pwd)/logstash-drivers:/drivers alpine:latest \
  sh -c "apk add --no-cache wget && \
         wget -O /drivers/mysql-connector-java.jar \
         https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.30/mysql-connector-java-8.0.30.jar"

Expected Output:

Connecting to repo1.maven.org...
HTTP request sent, awaiting response... 200 OK
Length: ... [application/java-archive]
mysql-connector-java.jar saved

Step 2: Create Docker Compose Configuration

Create docker-compose.yml in your project directory:

version: '3.8'

services:
  mysql:
    image: mysql:8.0
    container_name: mysql
    environment:
      MYSQL_ROOT_PASSWORD: password
      MYSQL_DATABASE: adventureworks  
    ports:
      - "3306:3306"
    volumes:
      - mysql-data:/var/lib/mysql
    networks:
      - datasuite-network

  clickhouse:
    image: clickhouse/clickhouse-server:latest
    container_name: clickhouse
    ports:
      - "8123:8123"
      - "9000:9000"
    environment:
      CLICKHOUSE_USER: admin
      CLICKHOUSE_PASSWORD: clickhouse123
      CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT: 1
    volumes:
      - clickhouse-data:/var/lib/clickhouse
    networks:
      - datasuite-network

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    container_name: logstash
    ports:
      - "5044:5044"
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
      - ./logstash-drivers:/usr/share/logstash/drivers
    depends_on:
      - mysql
      - clickhouse
    networks: 
      - datasuite-network

  airflow:
    image: apache/airflow:2.7.0
    container_name: airflow
    # "standalone" runs DB init, webserver, and scheduler in one container (development only)
    command: standalone
    ports:
      - "8080:8080"
    volumes:
      - ./dags:/opt/airflow/dags
    environment:
      AIRFLOW__CORE__EXECUTOR: SequentialExecutor
      _AIRFLOW_DB_MIGRATE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: admin
      _AIRFLOW_WWW_USER_PASSWORD: admin
    depends_on:
      - clickhouse
    networks:
      - datasuite-network

volumes:
  mysql-data:
  clickhouse-data:

networks:
  datasuite-network:
    driver: bridge
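The "default settings" limitation above is softer than it looks: docker-compose automatically merges a docker-compose.override.yml from the same directory, so individual services can be customized without editing the main file. A minimal sketch (the pinned tag and password here are illustrative; the later steps in this guide assume the defaults):

```yaml
# docker-compose.override.yml — merged automatically by docker-compose
services:
  clickhouse:
    # Pin a specific version instead of "latest"
    image: clickhouse/clickhouse-server:23.8
  mysql:
    environment:
      MYSQL_ROOT_PASSWORD: a-stronger-password
```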

Step 3: Create LogStash Configuration

Create logstash.conf in your project directory:

input {
  jdbc {
    jdbc_driver_library => "/usr/share/logstash/drivers/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://mysql:3306/adventureworks"
    jdbc_user => "root"
    jdbc_password => "password"
    statement => "SELECT SalesOrderID, CustomerID, TerritoryID, OrderDate, DueDate, ShipDate, Status, SubTotal, TaxAmt, Freight, TotalDue, ModifiedDate FROM Sales_SalesOrderHeader WHERE ModifiedDate > :sql_last_value ORDER BY ModifiedDate"
    use_column_value => true
    tracking_column => "modifieddate"
    tracking_column_type => "timestamp"
    last_run_metadata_path => "/usr/share/logstash/.logstash_jdbc_last_run"
    clean_run => false
    schedule => "*/30 * * * * *"
  }
}

filter {
  # Drop LogStash metadata fields, then serialize each row to JSON for ClickHouse
  mutate {
    remove_field => ["@version", "@timestamp"]
  }
  ruby {
    # The jdbc input emits each row as event fields; capture them as a JSON
    # string in @metadata so it never leaks into the serialized row itself
    code => "event.set('[@metadata][row_json]', event.to_json)"
  }
}

output {
  http {
    url => "http://clickhouse:8123/"
    http_method => "post"
    format => "message"
    message => "INSERT INTO bronze_layer.sales_orders FORMAT JSONEachRow %{[@metadata][row_json]}"
    headers => {
      "Authorization" => "Basic YWRtaW46Y2xpY2tob3VzZTEyMw=="  # admin:clickhouse123 base64
      "Content-Type" => "text/plain"
    }
  }

  # Debug output to console
  stdout {
    codec => rubydebug
  }
}
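The Authorization header above is just `admin:clickhouse123` base64-encoded. If you change the ClickHouse credentials, regenerate the value like so:

```shell
# Base64-encode user:password for the HTTP Basic auth header.
# printf avoids the trailing newline that echo would sneak into the encoding.
printf 'admin:clickhouse123' | base64
# prints: YWRtaW46Y2xpY2tob3VzZTEyMw==
```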

Step 4: Start All Services

Launch the entire stack with a single command:

# Start all services in detached mode
docker-compose up -d

Expected Output:

Creating network "datasuite-etl_datasuite-network" with driver "bridge"
Creating volume "datasuite-etl_mysql-data" with default driver
Creating volume "datasuite-etl_clickhouse-data" with default driver
Creating mysql ... done
Creating clickhouse ... done
Creating logstash ... done
Creating airflow ... done
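One caveat: depends_on only orders container startup; it does not wait for MySQL to actually accept connections, which is why the next step pings it first. If you want Compose itself to gate on readiness, a healthcheck sketch for the mysql service (add under `services.mysql` in docker-compose.yml):

```yaml
# Marks the mysql container healthy only once mysqld answers a ping
healthcheck:
  test: ["CMD", "mysqladmin", "ping", "-h", "localhost", "-ppassword"]
  interval: 10s
  timeout: 5s
  retries: 6
```

With the newer Compose Spec (docker compose v2), a dependent service can then declare `depends_on` in long form with `condition: service_healthy` instead of the plain list.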

Step 5: Load Sample Data

Once MySQL is running, load the AdventureWorks sample data:

# Wait for MySQL to be ready (may take 30-60 seconds)
docker-compose exec mysql mysqladmin ping -h localhost -u root -ppassword

# Load Adventure Works 2019 data
docker exec -i mysql mysql -uroot -ppassword adventureworks < docs/data/AdventureWorks2019.sql

# Verify data was loaded successfully
docker exec mysql mysql -uroot -ppassword adventureworks -e "SHOW TABLES;"
docker exec mysql mysql -uroot -ppassword adventureworks -e "SELECT COUNT(*) as table_count FROM information_schema.tables WHERE table_schema='adventureworks';"

Expected Output:

+--------------------------+
| Tables_in_adventureworks |
+--------------------------+
| Production_Product       |
| Sales_Customer           |
| Sales_SalesOrderDetail   |
| Sales_SalesOrderHeader   |
| Sales_SalesTerritory     |
+--------------------------+

+--------------+
| table_count  |
+--------------+
|           5  |
+--------------+
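A table listing doesn't prove the rows themselves arrived. From the MySQL shell (`docker exec -it mysql mysql -uroot -ppassword adventureworks`), two quick sanity queries on the table LogStash will poll — exact numbers depend on the dump you loaded:

```sql
-- Row count for the order header table
SELECT COUNT(*) FROM Sales_SalesOrderHeader;

-- Range of the incremental tracking column used by LogStash
SELECT MIN(ModifiedDate), MAX(ModifiedDate) FROM Sales_SalesOrderHeader;
```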

Step 6: Prepare ClickHouse

Create the required databases in ClickHouse:

# Create bronze layer database
curl -u admin:clickhouse123 http://localhost:8123/ -d "CREATE DATABASE IF NOT EXISTS bronze_layer"

# Create gold layer database  
curl -u admin:clickhouse123 http://localhost:8123/ -d "CREATE DATABASE IF NOT EXISTS gold_layer"

# Create sales_orders table in bronze layer
curl -u admin:clickhouse123 http://localhost:8123/ -d "
CREATE TABLE IF NOT EXISTS bronze_layer.sales_orders (
    salesorderid Int32,
    customerid Int32,
    territoryid Nullable(Int32),
    orderdate DateTime,
    duedate DateTime,
    shipdate Nullable(DateTime),
    status Int8,
    subtotal Decimal64(4),
    taxamt Decimal64(4),
    freight Decimal64(4),
    totaldue Decimal64(4),
    modifieddate DateTime
) ENGINE = MergeTree()
ORDER BY (orderdate, salesorderid)
"

Step 7: Monitor Service Status

Check that all services are running correctly:

# View service status
docker-compose ps

# Expected output should show all services as "Up"

Expected Output:

    Name                   Command               State                    Ports                  
------------------------------------------------------------------------------------------------
airflow        /usr/bin/dumb-init -- /ent ...   Up      0.0.0.0:8080->8080/tcp,:::8080->8080/tcp
clickhouse     /entrypoint.sh                   Up      0.0.0.0:8123->8123/tcp,:::8123->8123/tcp, 
                                                        0.0.0.0:9000->9000/tcp,:::9000->9000/tcp
logstash       /usr/local/bin/docker-entr ...   Up      0.0.0.0:5044->5044/tcp,:::5044->5044/tcp
mysql          docker-entrypoint.sh mysqld      Up      0.0.0.0:3306->3306/tcp,:::3306->3306/tcp, 33060/tcp

Step 8: Access Service UIs

Once everything is running, you can access the web interfaces:

ClickHouse Web Interface

  • URL: http://localhost:8123/play

  • Username: admin

  • Password: clickhouse123

Airflow Web Interface

  • URL: http://localhost:8080

  • Username: admin

  • Password: admin

MySQL (via command line)

# Connect to MySQL directly
docker exec -it mysql mysql -uroot -ppassword adventureworks

Useful Management Commands

Service Management

# Stop all services
docker-compose down

# Stop and remove volumes (WARNING: deletes all data)
docker-compose down -v

# Restart specific service
docker-compose restart logstash

# View service logs
docker-compose logs -f logstash

# Scale services (first remove the container_name field from the service — named containers can't be scaled)
docker-compose up -d --scale logstash=2

Data Management

# Backup MySQL data
docker exec mysql mysqldump -uroot -ppassword adventureworks > backup.sql

# Backup ClickHouse data  
curl -u admin:clickhouse123 "http://localhost:8123/?query=SELECT * FROM bronze_layer.sales_orders FORMAT CSV" > bronze_backup.csv

Resource Monitoring

# Monitor resource usage
docker stats

# View disk usage
docker system df

# Clean up unused resources
docker system prune

Troubleshooting

Services Won't Start

# Check for port conflicts
netstat -tulpn | grep -E ':(3306|8123|9000|5044|8080)'

# Check Docker daemon status
docker system info

# View detailed logs
docker-compose logs [service-name]

LogStash Connection Issues

# Test MySQL connectivity from LogStash container
docker exec logstash nc -zv mysql 3306

# Check LogStash pipeline status
curl http://localhost:9600/_node/stats/pipelines

ClickHouse Access Issues

# Test ClickHouse HTTP interface
curl -u admin:clickhouse123 http://localhost:8123/ping

# Check ClickHouse logs
docker logs clickhouse

Next Steps

Your Docker Compose setup is now complete! Proceed to:

  1. Service Verification - Test that everything is working correctly

  2. LogStash Configuration - Customize data ingestion

  3. DBT Getting Started - Build your first data models

Advantages of This Setup

  • 🚀 Fast Deployment: Entire stack running in under 5 minutes

  • 🔗 Automatic Networking: Services can communicate without manual configuration

  • 📦 Consistent Environment: Same setup works across different operating systems

  • 🛠️ Easy Maintenance: Single commands for start, stop, update operations

  • 📊 Complete Stack: All necessary services included and pre-configured

This Docker Compose setup provides a solid foundation for learning and development. When you're ready for more advanced configurations, consider the individual containers setup for greater flexibility.
