LogStash Configuration

LogStash serves as the data ingestion engine in the DataSuite ETL pipeline, extracting data from MySQL and loading it into ClickHouse. This guide covers configuration, customization, and optimization of LogStash pipelines.

Understanding LogStash Architecture

Pipeline Components

Input → Filter → Output
  ↓       ↓       ↓
MySQL   Transform ClickHouse

Input Plugins: Extract data from various sources (JDBC, files, APIs)
Filter Plugins: Transform, enrich, and validate data
Output Plugins: Load data to destinations (HTTP, databases, files)

Configuration Structure

LogStash configurations use a declarative syntax:

input {
  # Data source configuration
}

filter {
  # Data transformation logic
}

output {
  # Destination configuration
}

Basic JDBC Input Configuration

MySQL Connection Setup
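
A minimal JDBC input for the MySQL source might look like the following. This is a sketch: the hostname, credentials, driver path, and table name are assumptions for this environment.

input {
  jdbc {
    # Connection details (hostname, credentials, and driver path are assumptions)
    jdbc_connection_string => "jdbc:mysql://mysql:3306/adventureworks"
    jdbc_user => "etl_user"
    jdbc_password => "${JDBC_PASSWORD}"
    jdbc_driver_library => "/usr/share/logstash/drivers/mysql-connector-j.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"

    # Incremental sync: fetch only rows modified since the last successful run.
    # The jdbc input lowercases column names by default, hence "modifieddate".
    statement => "SELECT * FROM SalesOrderHeader WHERE ModifiedDate > :sql_last_value ORDER BY ModifiedDate"
    use_column_value => true
    tracking_column => "modifieddate"
    tracking_column_type => "timestamp"
    last_run_metadata_path => "/usr/share/logstash/data/.logstash_jdbc_last_run"

    # Run every 30 seconds (six-field cron: seconds first)
    schedule => "*/30 * * * * *"
  }
}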

Key Configuration Parameters

| Parameter | Purpose | Example |
| --- | --- | --- |
| jdbc_connection_string | Database URL | jdbc:mysql://mysql:3306/adventureworks |
| statement | SQL query to execute | SELECT * FROM table WHERE id > :sql_last_value |
| tracking_column | Column for incremental sync | ModifiedDate, id |
| schedule | Execution frequency | */30 * * * * * (cron format) |
| jdbc_fetch_size | Rows per database fetch | 1000 |
| last_run_metadata_path | State persistence file | /path/to/.logstash_jdbc_last_run |

Advanced Input Configurations

Multiple Table Ingestion

Sales Orders Pipeline (sales-orders.conf):
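
A sketch of what this pipeline could contain: two jdbc inputs sharing one MySQL source, each tagged with a type for downstream routing. Table names, credentials, and paths are assumptions.

input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://mysql:3306/adventureworks"
    jdbc_user => "etl_user"
    jdbc_password => "${JDBC_PASSWORD}"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    statement_filepath => "/usr/share/logstash/sql/sales_orders.sql"
    use_column_value => true
    tracking_column => "modifieddate"
    tracking_column_type => "timestamp"
    # Each input needs its own state file so sync positions don't collide
    last_run_metadata_path => "/usr/share/logstash/data/.sales_orders_last_run"
    schedule => "*/30 * * * * *"
    type => "sales_orders"
  }

  jdbc {
    jdbc_connection_string => "jdbc:mysql://mysql:3306/adventureworks"
    jdbc_user => "etl_user"
    jdbc_password => "${JDBC_PASSWORD}"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    statement => "SELECT * FROM Customer WHERE ModifiedDate > :sql_last_value"
    use_column_value => true
    tracking_column => "modifieddate"
    tracking_column_type => "timestamp"
    last_run_metadata_path => "/usr/share/logstash/data/.customers_last_run"
    schedule => "*/30 * * * * *"
    type => "customers"
  }
}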

Complex SQL Queries

File: sql/sales_orders.sql
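
The file referenced by statement_filepath holds the query itself, keeping long SQL out of the pipeline config. A sketch against assumed AdventureWorks-style tables:

SELECT h.SalesOrderID,
       h.CustomerID,
       h.OrderDate,
       h.TotalDue,
       d.ProductID,
       d.OrderQty,
       d.LineTotal,
       h.ModifiedDate
FROM SalesOrderHeader h
JOIN SalesOrderDetail d ON d.SalesOrderID = h.SalesOrderID
WHERE h.ModifiedDate > :sql_last_value
ORDER BY h.ModifiedDate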

Filter Transformations

Data Type Conversions
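
Numeric and date columns often arrive as strings; the mutate and date filters normalize them before loading. A sketch with assumed field names:

filter {
  mutate {
    convert => {
      "totaldue" => "float"
      "orderqty" => "integer"
    }
  }

  # Parse the MySQL datetime into @timestamp for downstream ordering
  date {
    match => ["orderdate", "yyyy-MM-dd HH:mm:ss", "ISO8601"]
    target => "@timestamp"
    timezone => "UTC"
  }
}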

Data Enrichment
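
Enrichment adds context the source rows lack, such as provenance metadata or derived attributes. A sketch; the field names and bucketing rule are assumptions:

filter {
  mutate {
    add_field => {
      "source_system" => "mysql.adventureworks"
      "ingested_at"   => "%{+yyyy-MM-dd'T'HH:mm:ss}"
    }
  }

  # Derive an order-size bucket for downstream analytics
  ruby {
    code => 'event.set("order_size", event.get("totaldue").to_f > 1000 ? "large" : "small")'
  }
}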

Data Validation and Quality
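
Basic quality gates can run inline: drop rows missing required keys, and tag suspicious values for review rather than silently loading them. A sketch with assumed fields:

filter {
  # Reject events missing the primary key
  if ![salesorderid] {
    drop { }
  }

  # Flag rather than drop suspicious values, so they can be audited
  if [totaldue] and [totaldue] < 0 {
    mutate { add_tag => ["validation_failure"] }
  }
}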

Output Configurations

ClickHouse HTTP Output
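
One way to load ClickHouse is the standard http output aimed at its HTTP interface (port 8123) with an INSERT ... FORMAT JSONEachRow query in the URL. A sketch; the table name is an assumption, and a dedicated ClickHouse output plugin is a reasonable alternative:

output {
  http {
    url => "http://clickhouse:8123/?query=INSERT%20INTO%20raw.sales_orders%20FORMAT%20JSONEachRow"
    http_method => "post"
    # "json" posts each event as a single JSON object, which JSONEachRow accepts
    format => "json"
    retry_failed => true
  }
}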

Multiple Output Destinations
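
Outputs can fan out on the type assigned by each input, so every table lands in its own ClickHouse table, while tagged failures stay local. A sketch:

output {
  if [type] == "sales_orders" {
    http {
      url => "http://clickhouse:8123/?query=INSERT%20INTO%20raw.sales_orders%20FORMAT%20JSONEachRow"
      http_method => "post"
      format => "json"
    }
  } else if [type] == "customers" {
    http {
      url => "http://clickhouse:8123/?query=INSERT%20INTO%20raw.customers%20FORMAT%20JSONEachRow"
      http_method => "post"
      format => "json"
    }
  }

  # Keep a local copy of events flagged by the validation filters
  if "validation_failure" in [tags] {
    file {
      path => "/usr/share/logstash/data/quarantine-%{+yyyy-MM-dd}.log"
    }
  }
}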

Performance Optimization

JDBC Connection Tuning
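
For large tables, fetch size and paging bound memory use on both the database and Logstash sides. A sketch of the relevant jdbc input options; the values are starting points, not prescriptions:

input {
  jdbc {
    # ...connection options as in the examples above...
    jdbc_fetch_size => 1000         # rows streamed per database fetch
    jdbc_paging_enabled => true     # split large result sets into pages
    jdbc_page_size => 50000         # rows per page
    connection_retry_attempts => 3  # retry transient connection failures
  }
}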

Pipeline Worker Configuration

File: config/pipelines.yml
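
A sketch that gives each table its own pipeline, workers, and batch size; the ids and paths are assumptions:

- pipeline.id: sales-orders
  path.config: "/usr/share/logstash/pipeline/sales-orders.conf"
  pipeline.workers: 2
  pipeline.batch.size: 1000

- pipeline.id: customers
  path.config: "/usr/share/logstash/pipeline/customers.conf"
  pipeline.workers: 1
  pipeline.batch.size: 500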

JVM Tuning

Environment Variables:
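
Heap size is set through the LS_JAVA_OPTS environment variable. A sketch following the 50%-of-container guideline from the performance section, assuming a 4 GB container:

# Fixed 2 GB heap; setting min = max avoids resize pauses
LS_JAVA_OPTS="-Xms2g -Xmx2g"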

Monitoring and Debugging

Pipeline Monitoring
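
Logstash exposes a monitoring API on port 9600; per-pipeline event counts and plugin timings are the first things to check:

# Node-level status
curl -s "http://localhost:9600/_node/stats?pretty"

# Per-pipeline event throughput and filter/output timings
curl -s "http://localhost:9600/_node/stats/pipelines?pretty"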

Log Analysis
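
Plain-text logs can be mined with standard tools; the log directory below is an assumption for this environment:

# Surface recent errors and warnings
grep -E "ERROR|WARN" /usr/share/logstash/logs/logstash-plain.log | tail -50

# Follow one pipeline while reproducing an issue
tail -f /usr/share/logstash/logs/logstash-plain.log | grep sales-orders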

Error Handling Patterns
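
A common pattern is to tag parse failures in the filter stage and divert tagged events to a local file rather than the warehouse. A sketch with assumed field names:

filter {
  date {
    match => ["orderdate", "yyyy-MM-dd HH:mm:ss"]
    tag_on_failure => ["_dateparsefailure"]
  }
}

output {
  if "_dateparsefailure" in [tags] {
    # Park unparseable events for inspection instead of loading them
    file {
      path => "/usr/share/logstash/data/failed-%{+yyyy-MM-dd}.log"
    }
  } else {
    http {
      url => "http://clickhouse:8123/?query=INSERT%20INTO%20raw.sales_orders%20FORMAT%20JSONEachRow"
      http_method => "post"
      format => "json"
    }
  }
}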

Security Configuration

Credential Management
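
Secrets belong in the Logstash keystore rather than in config files; keystore entries resolve through the same ${VAR} syntax used for environment variables:

# Create the keystore once, then add each secret
bin/logstash-keystore create
bin/logstash-keystore add JDBC_PASSWORD

# In pipeline configs, reference it as:
#   jdbc_password => "${JDBC_PASSWORD}"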

SSL/TLS Configuration
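
TLS applies to both legs of the pipeline: JDBC parameters on the MySQL connection string, and HTTPS with a CA certificate on the ClickHouse output. A sketch; ports and certificate paths are assumptions:

input {
  jdbc {
    # Require an encrypted, verified connection to MySQL
    jdbc_connection_string => "jdbc:mysql://mysql:3306/adventureworks?useSSL=true&requireSSL=true&verifyServerCertificate=true"
    # ...remaining connection options as above...
  }
}

output {
  http {
    # ClickHouse HTTPS interface, with the CA used to verify its certificate
    url => "https://clickhouse:8443/?query=INSERT%20INTO%20raw.sales_orders%20FORMAT%20JSONEachRow"
    http_method => "post"
    format => "json"
    cacert => "/usr/share/logstash/certs/ca.pem"
  }
}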

Common Configuration Patterns

Conditional Processing
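
Conditionals branch on event attributes, so one pipeline can treat each table differently. A sketch keyed on the type set by the inputs:

filter {
  if [type] == "sales_orders" {
    mutate { convert => { "totaldue" => "float" } }
  } else if [type] == "customers" {
    mutate { lowercase => ["emailaddress"] }
  }
}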

Template-based Configuration
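
Logstash substitutes ${VAR:default} from the environment at startup, so one config file can serve every environment. A sketch; the variable names are assumptions:

input {
  jdbc {
    # Host, port, and database come from the environment, with local defaults
    jdbc_connection_string => "jdbc:mysql://${MYSQL_HOST:mysql}:${MYSQL_PORT:3306}/${MYSQL_DB:adventureworks}"
    schedule => "${SYNC_SCHEDULE:*/30 * * * * *}"
  }
}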

Testing LogStash Configurations

Configuration Validation
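
Syntax can be verified without starting the pipeline; the flag exits non-zero on errors, which suits CI checks:

bin/logstash -f pipeline/sales-orders.conf --config.test_and_exit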

Pipeline Testing
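
To exercise filter logic end to end, swap the real source and sink for a generator input and rubydebug output. A sketch with an assumed sample row:

input {
  generator {
    # One fake row shaped like a JDBC event
    message => '{"salesorderid": 1, "totaldue": "123.45", "orderdate": "2024-01-15 10:30:00"}'
    codec => "json"
    count => 1
  }
}

# ...paste the filter block under test here...

output {
  stdout { codec => rubydebug }   # print the transformed event for inspection
}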

Best Practices

Configuration Organization
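
One layout that keeps pipelines, queries, and node settings separable; directory names are assumptions:

logstash/
├── config/
│   ├── logstash.yml        # node-level settings
│   └── pipelines.yml       # pipeline registry
├── pipeline/
│   ├── sales-orders.conf   # one file per pipeline
│   └── customers.conf
└── sql/
    └── sales_orders.sql    # queries referenced via statement_filepath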

Error Recovery Strategies

  1. Persistent Queues: Enable for guaranteed delivery (see the logstash.yml sketch after this list)

  2. Dead Letter Queues: Capture failed events for analysis

  3. Circuit Breakers: Prevent cascade failures

  4. Retry Logic: Automatic retry with exponential backoff

  5. Monitoring: Comprehensive metrics and alerting
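
The first two strategies map to logstash.yml settings. A minimal sketch; sizes and paths are assumptions, and dead letter queue entries are written only by outputs that support the feature:

queue.type: persisted
queue.max_bytes: 1gb
dead_letter_queue.enable: true
path.dead_letter_queue: /usr/share/logstash/data/dlq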

Performance Guidelines

  1. Batch Size: Balance between latency and throughput (1000-5000 events)

  2. Workers: Match CPU cores (typically 2-4 workers per pipeline)

  3. Memory: Allocate sufficient heap space (50% of container memory)

  4. Network: Use persistent connections and connection pooling

  5. Disk I/O: Use SSD storage for persistent queues

Next Steps

With LogStash configured for data ingestion:

  1. DBT Getting Started - Transform the ingested data

  2. Testing & Validation - Ensure data quality

  3. Troubleshooting - Resolve pipeline issues

Your LogStash configuration forms the foundation of the data pipeline, ensuring reliable and efficient data ingestion from source systems to the data warehouse.
