Advanced Setup (Individual Containers)
This setup method gives you complete control over each service configuration. Choose this approach when you need custom networking, specific versions, or advanced configuration options.
Why Choose Individual Containers?
✅ Advantages:
Complete Control: Configure each service exactly as needed
Custom Networking: Set up advanced network topologies
Resource Management: Allocate specific CPU/memory to each service
Version Control: Use different versions of each component
Production-Like: More similar to production deployment patterns
❌ Considerations:
More complex setup and management
Manual network configuration required
Requires deeper understanding of each component
More time-consuming for initial setup
Step 1: Create Docker Network
Create a custom network for service communication:
# Create dedicated network for DataSuite services
docker network create datasuite-network
# Verify network creation
docker network ls | grep datasuite
Expected Output:
NETWORK ID     NAME                DRIVER    SCOPE
a1b2c3d4e5f6   datasuite-network   bridge    local
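To see the subnet and gateway Docker assigned (useful later if you pin static IPs in Step 7), an inspect is enough; both commands below are standard Docker CLI:
# Show full network details, or extract just the subnet
docker network inspect datasuite-network
docker network inspect -f '{{range .IPAM.Config}}{{.Subnet}}{{end}}' datasuite-network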
Step 2: Set Up MySQL Database
Install and configure MySQL with AdventureWorks data:
# Run MySQL with persistent storage and custom configuration
docker run --name mysql \
--network datasuite-network \
-e MYSQL_ROOT_PASSWORD=password \
-e MYSQL_DATABASE=adventureworks \
-p 3306:3306 \
-v mysql-data:/var/lib/mysql \
-v $(pwd)/mysql-config:/etc/mysql/conf.d \
--restart unless-stopped \
-d mysql:8.0
Advanced MySQL Configuration (optional): Create mysql-config/custom.cnf:
[mysqld]
# Performance tuning
innodb_buffer_pool_size = 2G
innodb_log_file_size = 512M
max_connections = 200
# Binary logging for replication
log-bin = mysql-bin
binlog_format = ROW
binlog_expire_logs_seconds = 604800  # 7 days (expire_logs_days is deprecated in MySQL 8.0)
# Character set
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
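The conf.d directory is read at startup, so restart MySQL after adding custom.cnf; a quick check (not part of the original steps) confirms the settings took effect:
# Restart MySQL so it picks up mysql-config/custom.cnf, then verify a couple of settings
docker restart mysql
docker exec mysql mysql -uroot -ppassword -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; SHOW VARIABLES LIKE 'binlog_format';"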
Load Sample Data:
# Wait for MySQL to be ready
until docker exec mysql mysqladmin ping -h localhost -u root -ppassword --silent; do
echo "Waiting for MySQL to be ready..."
sleep 2
done
# Load Adventure Works data
docker exec -i mysql mysql -uroot -ppassword adventureworks < docs/data/AdventureWorks2019.sql
# Verify data loading
docker exec mysql mysql -uroot -ppassword adventureworks -e "SELECT COUNT(*) FROM Sales.SalesOrderHeader;"
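Database and table names vary between AdventureWorks ports for MySQL (some dumps create a separate Sales database, others prefix or lowercase the table names), so it is worth listing what the dump actually created and adjusting the queries in this guide to match:
# List the databases and tables created by the dump
docker exec mysql mysql -uroot -ppassword -e "SHOW DATABASES;"
docker exec mysql mysql -uroot -ppassword adventureworks -e "SHOW TABLES;"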
Step 3: Install ClickHouse Data Warehouse
Set up ClickHouse with custom configuration:
# Run ClickHouse with authentication and custom settings
docker run --name clickhouse \
--network datasuite-network \
-p 8123:8123 \
-p 9000:9000 \
-e CLICKHOUSE_USER=admin \
-e CLICKHOUSE_PASSWORD=clickhouse123 \
-e CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1 \
-v clickhouse-data:/var/lib/clickhouse \
-v $(pwd)/clickhouse-config:/etc/clickhouse-server/config.d \
--restart unless-stopped \
-d clickhouse/clickhouse-server:latest
Advanced ClickHouse Configuration (optional): Create clickhouse-config/custom.xml:
<yandex>
<!-- Note: query memory limits such as max_memory_usage and max_bytes_before_external_group_by
     are user/profile-level settings rather than server settings; see the users.d sketch after
     this file for where they belong -->
<!-- Network settings -->
<max_connections>1000</max_connections>
<keep_alive_timeout>10</keep_alive_timeout>
<!-- Logging -->
<logger>
<level>information</level>
<log>/var/log/clickhouse-server/clickhouse-server.log</log>
<errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
<size>1000M</size>
<count>3</count>
</logger>
</yandex>
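Per-query memory limits such as max_memory_usage belong in a user profile rather than the server config. Below is a minimal sketch of one way to supply them, assuming you add an extra -v $(pwd)/clickhouse-users:/etc/clickhouse-server/users.d mount to the docker run above (the directory and file names here are our own choice):
# Profile-level memory limits, applied to the default profile via users.d
mkdir -p clickhouse-users
cat > clickhouse-users/memory-limits.xml <<'EOF'
<yandex>
  <profiles>
    <default>
      <max_memory_usage>8000000000</max_memory_usage>
      <max_bytes_before_external_group_by>4000000000</max_bytes_before_external_group_by>
    </default>
  </profiles>
</yandex>
EOF
# Restart ClickHouse so the new config.d/users.d files are picked up
docker restart clickhouse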
Initialize ClickHouse Databases:
# Wait for ClickHouse to be ready
until curl -s -u admin:clickhouse123 http://localhost:8123/ping; do
echo "Waiting for ClickHouse to be ready..."
sleep 2
done
# Create databases
curl -u admin:clickhouse123 http://localhost:8123/ -d "CREATE DATABASE IF NOT EXISTS bronze_layer"
curl -u admin:clickhouse123 http://localhost:8123/ -d "CREATE DATABASE IF NOT EXISTS gold_layer"
# Create bronze layer table for sales orders
curl -u admin:clickhouse123 http://localhost:8123/ -d "
CREATE TABLE IF NOT EXISTS bronze_layer.sales_orders (
salesorderid Int32,
customerid Int32,
territoryid Nullable(Int32),
orderdate DateTime,
duedate DateTime,
shipdate Nullable(DateTime),
status Int8,
subtotal Decimal64(4),
taxamt Decimal64(4),
freight Decimal64(4),
totaldue Decimal64(4),
modifieddate DateTime,
_ingested_at DateTime DEFAULT now()
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(orderdate)
ORDER BY (orderdate, salesorderid)
SETTINGS index_granularity = 8192
"
Step 4: Install LogStash with Custom Pipeline
Set up LogStash with MySQL connector and custom configuration:
# Create directories for LogStash configuration and SQL queries
mkdir -p logstash-drivers logstash-config logstash-patterns sql-queries
# Download MySQL JDBC driver
docker run --rm -v $(pwd)/logstash-drivers:/drivers alpine:latest \
sh -c "apk add --no-cache wget && \
wget -O /drivers/mysql-connector-java.jar \
https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.33/mysql-connector-java-8.0.33.jar"
# Download ClickHouse JDBC driver (optional for alternative output)
docker run --rm -v $(pwd)/logstash-drivers:/drivers alpine:latest \
sh -c "apk add --no-cache wget && \
wget -O /drivers/clickhouse-jdbc.jar \
https://repo1.maven.org/maven2/ru/yandex/clickhouse/clickhouse-jdbc/0.3.2/clickhouse-jdbc-0.3.2.jar"
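A quick listing confirms both JARs landed where the LogStash container will look for them:
# Verify the JDBC drivers downloaded successfully
ls -lh logstash-drivers/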
Create Advanced LogStash Configuration: Create logstash-config/pipelines.yml:
- pipeline.id: sales-orders
  path.config: "/usr/share/logstash/pipeline/sales-orders.conf"
  pipeline.workers: 2
- pipeline.id: customers
  path.config: "/usr/share/logstash/pipeline/customers.conf"
  pipeline.workers: 1
- pipeline.id: products
  path.config: "/usr/share/logstash/pipeline/products.conf"
  pipeline.workers: 1
Create Pipeline-Specific Configurations: Create logstash-config/sales-orders.conf (repeat the same pattern for the customers.conf and products.conf files referenced in pipelines.yml):
input {
jdbc {
jdbc_driver_library => "/usr/share/logstash/drivers/mysql-connector-java.jar"
jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
jdbc_connection_string => "jdbc:mysql://mysql:3306/adventureworks?useSSL=false"
jdbc_user => "root"
jdbc_password => "password"
statement_filepath => "/usr/share/logstash/sql-queries/sales_orders.sql"
use_column_value => true
tracking_column => "modifieddate"  # the jdbc input lowercases column names by default
tracking_column_type => "timestamp"
last_run_metadata_path => "/usr/share/logstash/.logstash_jdbc_last_run_sales"
clean_run => false
schedule => "*/30 * * * * *"
type => "sales_order"
}
}
filter {
if [type] == "sales_order" {
# Convert timestamps to ISO format
date {
match => [ "orderdate", "yyyy-MM-dd HH:mm:ss" ]
target => "orderdate_iso"
}
# Add metadata
mutate {
add_field => { "[@metadata][table]" => "sales_orders" }
add_field => { "[@metadata][database]" => "bronze_layer" }
remove_field => [ "@version", "@timestamp", "type" ]
}
# Data quality checks
if [salesorderid] {
mutate { add_tag => [ "valid_record" ] }
} else {
mutate { add_tag => [ "invalid_record" ] }
}
}
}
output {
if "valid_record" in [tags] {
http {
url => "http://clickhouse:8123/"
http_method => "post"
format => "message"
message => "INSERT INTO %{[@metadata][database]}.%{[@metadata][table]} FORMAT JSONEachRow %{message}"
headers => {
"Authorization" => "Basic YWRtaW46Y2xpY2tob3VzZTEyMw=="
"Content-Type" => "application/json"
}
# Retry configuration
retry_failed => true
automatic_retries => 3
}
}
if "invalid_record" in [tags] {
file {
path => "/usr/share/logstash/logs/invalid_sales_orders.log"
codec => json_lines
}
}
# Optional: Send to multiple outputs
# elasticsearch {
# hosts => ["elasticsearch:9200"]
# index => "sales-orders-%{+YYYY.MM.dd}"
# }
}
Create SQL Query Files: Create sql-queries/sales_orders.sql:
SELECT
salesorderid,
customerid,
territoryid,
orderdate,
duedate,
shipdate,
status,
subtotal,
taxamt,
freight,
totaldue,
modifieddate
FROM Sales.SalesOrderHeader
WHERE ModifiedDate > :sql_last_value
ORDER BY ModifiedDate ASC
LIMIT 1000
Run LogStash Container:
# Run LogStash with multiple pipeline support
docker run --name logstash \
--network datasuite-network \
-p 5044:5044 \
-p 9600:9600 \
-v $(pwd)/logstash-config:/usr/share/logstash/pipeline \
-v $(pwd)/logstash-config/pipelines.yml:/usr/share/logstash/config/pipelines.yml \
-v $(pwd)/logstash-drivers:/usr/share/logstash/drivers \
-v $(pwd)/sql-queries:/usr/share/logstash/sql-queries \
-v logstash-data:/usr/share/logstash/data \
-e "LS_JAVA_OPTS=-Xmx2g -Xms2g" \
--restart unless-stopped \
-d docker.elastic.co/logstash/logstash:8.11.0
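Once the container is up, the monitoring API on port 9600 reports which pipelines loaded; checking the logs and that endpoint is a quick sanity test:
# Look for pipeline start-up errors, then list running pipelines
docker logs logstash | grep -i pipeline
curl -s "http://localhost:9600/_node/pipelines?pretty"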
Step 5: Install Apache Airflow (Optional)
Set up Airflow for workflow orchestration:
# Create Airflow directories (airflow-db holds the shared SQLite metadata database)
mkdir -p airflow-dags airflow-logs airflow-plugins airflow-db
# Note: Airflow's SQLite backend only supports the SequentialExecutor, and the metadata
# database must live on a mounted path so the init, webserver, and scheduler containers share it.
# The official image runs as UID 50000; chown the mounted directories if you hit permission errors.
# Initialize Airflow database and create admin user
docker run --rm \
-v $(pwd)/airflow-dags:/opt/airflow/dags \
-v $(pwd)/airflow-logs:/opt/airflow/logs \
-v $(pwd)/airflow-db:/opt/airflow/db \
-e AIRFLOW__CORE__EXECUTOR=SequentialExecutor \
-e AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=sqlite:////opt/airflow/db/airflow.db \
apache/airflow:2.7.0 \
bash -c "airflow db init && airflow users create --username admin --password admin --firstname Admin --lastname User --role Admin --email admin@example.com"
# Run Airflow webserver
docker run --name airflow-webserver \
--network datasuite-network \
-p 8080:8080 \
-v $(pwd)/airflow-dags:/opt/airflow/dags \
-v $(pwd)/airflow-logs:/opt/airflow/logs \
-v $(pwd)/airflow-plugins:/opt/airflow/plugins \
-v $(pwd)/airflow-db:/opt/airflow/db \
-e AIRFLOW__CORE__EXECUTOR=SequentialExecutor \
-e AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=sqlite:////opt/airflow/db/airflow.db \
--restart unless-stopped \
-d apache/airflow:2.7.0 \
webserver
# Run Airflow scheduler (in separate container)
docker run --name airflow-scheduler \
--network datasuite-network \
-v $(pwd)/airflow-dags:/opt/airflow/dags \
-v $(pwd)/airflow-logs:/opt/airflow/logs \
-v $(pwd)/airflow-plugins:/opt/airflow/plugins \
-v $(pwd)/airflow-db:/opt/airflow/db \
-e AIRFLOW__CORE__EXECUTOR=SequentialExecutor \
-e AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=sqlite:////opt/airflow/db/airflow.db \
--restart unless-stopped \
-d apache/airflow:2.7.0 \
scheduler
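After both containers start, the web UI is available at http://localhost:8080 (log in with the admin/admin user created above). A quick CLI check confirms the scheduler sees the shared metadata database and any DAGs placed in airflow-dags/:
# List DAGs from inside the scheduler container and hit the webserver health endpoint
docker exec airflow-scheduler airflow dags list
curl -s http://localhost:8080/health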
Step 6: Configure Service Health Monitoring
Create monitoring scripts for service health:
Create scripts/health-check.sh:
#!/bin/bash
echo "=== DataSuite ETL Health Check ==="
# Check MySQL
echo -n "MySQL: "
if docker exec mysql mysqladmin ping -h localhost -u root -ppassword --silent 2>/dev/null; then
echo "✅ Running"
else
echo "❌ Down"
fi
# Check ClickHouse
echo -n "ClickHouse: "
if curl -s -u admin:clickhouse123 http://localhost:8123/ping | grep -q "Ok"; then
echo "✅ Running"
else
echo "❌ Down"
fi
# Check LogStash
echo -n "LogStash: "
if curl -s http://localhost:9600/_node/stats >/dev/null 2>&1; then
echo "✅ Running"
else
echo "❌ Down"
fi
# Check Airflow (if running)
if docker ps | grep -q airflow-webserver; then
echo -n "Airflow: "
if curl -s http://localhost:8080/health >/dev/null 2>&1; then
echo "✅ Running"
else
echo "❌ Down"
fi
fi
echo "=== Network Connectivity ==="
# Use bash's /dev/tcp for the port checks (nc may not be installed in the mysql/logstash images)
docker exec mysql bash -c '</dev/tcp/clickhouse/8123' 2>/dev/null && echo "MySQL → ClickHouse: ✅" || echo "MySQL → ClickHouse: ❌"
docker exec logstash bash -c '</dev/tcp/mysql/3306' 2>/dev/null && echo "LogStash → MySQL: ✅" || echo "LogStash → MySQL: ❌"
docker exec logstash bash -c '</dev/tcp/clickhouse/8123' 2>/dev/null && echo "LogStash → ClickHouse: ✅" || echo "LogStash → ClickHouse: ❌"
# Make script executable and run it
chmod +x scripts/health-check.sh
./scripts/health-check.sh
Step 7: Advanced Network Configuration
Custom Bridge Network
# Create network with custom CIDR
# (if the network from Step 1 already exists, remove it first: docker network rm datasuite-network)
docker network create datasuite-network \
--driver bridge \
--subnet=172.20.0.0/16 \
--ip-range=172.20.240.0/20
# Assign static IPs to services
docker run --name mysql \
--network datasuite-network \
--ip 172.20.240.10 \
# ... other options
Service Discovery Configuration
Create /etc/hosts entries for easier access:
# Add to /etc/hosts (requires sudo)
echo "127.0.0.1 mysql.datasuite.local" | sudo tee -a /etc/hosts
echo "127.0.0.1 clickhouse.datasuite.local" | sudo tee -a /etc/hosts
echo "127.0.0.1 logstash.datasuite.local" | sudo tee -a /etc/hosts
Management Scripts
Create scripts/manage-services.sh:
#!/bin/bash
case "$1" in
start)
echo "Starting DataSuite ETL services..."
docker start mysql clickhouse logstash
if docker ps -a | grep -q airflow; then
docker start airflow-webserver airflow-scheduler
fi
;;
stop)
echo "Stopping DataSuite ETL services..."
docker stop mysql clickhouse logstash
if docker ps | grep -q airflow; then
docker stop airflow-webserver airflow-scheduler
fi
;;
restart)
echo "Restarting DataSuite ETL services..."
$0 stop
sleep 5
$0 start
;;
status)
./scripts/health-check.sh
;;
logs)
docker logs -f ${2:-logstash}
;;
*)
echo "Usage: $0 {start|stop|restart|status|logs [service]}"
exit 1
;;
esac
chmod +x scripts/manage-services.sh
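Typical usage once the script is executable:
# Check overall status, then follow MySQL logs
./scripts/manage-services.sh status
./scripts/manage-services.sh logs mysql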
Resource Management
CPU and Memory Limits
# Run containers with resource limits
docker run --name mysql \
--cpus="2.0" \
--memory="4g" \
--memory-swap="4g" \
# ... other options
docker run --name clickhouse \
--cpus="4.0" \
--memory="8g" \
--memory-swap="8g" \
# ... other options
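Limits can also be adjusted on already-running containers without recreating them, and docker stats shows whether they are being respected:
# Adjust MySQL's limits in place, then check current usage
docker update --cpus "2.0" --memory "4g" --memory-swap "4g" mysql
docker stats --no-stream mysql clickhouse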
Storage Optimization
# Create volumes with specific drivers
# (do this before the containers in Steps 2-3 first use them; an existing volume's options cannot be changed)
docker volume create mysql-data --driver local \
--opt type=ext4 \
--opt device=/dev/sdb1
docker volume create clickhouse-data --driver local \
--opt type=ext4 \
--opt device=/dev/sdb2
Backup and Recovery
Create scripts/backup.sh:
#!/bin/bash
BACKUP_DIR="./backups/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"
# Backup MySQL
echo "Backing up MySQL..."
docker exec mysql mysqldump -uroot -ppassword --all-databases > "$BACKUP_DIR/mysql_backup.sql"
# Backup ClickHouse
echo "Backing up ClickHouse..."
curl -s -u admin:clickhouse123 http://localhost:8123/ --data-binary "SELECT * FROM bronze_layer.sales_orders FORMAT Native" > "$BACKUP_DIR/clickhouse_bronze.native"
echo "Backup completed in $BACKUP_DIR"
Next Steps
Your individual-container setup provides maximum flexibility. Proceed to:
Service Verification - Confirm everything works correctly
LogStash Configuration - Customize your data pipelines
DBT Getting Started - Build dimensional models
Benefits of Individual Container Setup
🎛️ Fine-Grained Control: Configure each service precisely for your needs
🏗️ Production Similarity: More closely matches production deployment patterns
📈 Scalability: Easy to scale individual services based on load
🔧 Customization: Advanced configuration options for performance tuning
🚀 Deployment Flexibility: Can be adapted for Kubernetes or other orchestrators
This setup method provides the foundation for understanding how to deploy and manage DataSuite ETL in production environments.