CDC with Debezium on Real-Time theLook eCommerce Data
This project unlocks the power of the popular theLook eCommerce dataset for modern event-driven applications. It uses a re-engineered real-time data generator that transforms the original static dataset into a continuous stream of simulated user activity, writing directly to a PostgreSQL database.
This stream becomes an ideal source for building and testing Change Data Capture (CDC) pipelines using Debezium and Kafka—enabling developers and analysts to work with a familiar, realistic schema in a real-time context.
As a practical demonstration, this project includes deployment of a Debezium connector to stream database changes into Kafka topics.
Introduction to CDC with Debezium
Change Data Capture (CDC) is a set of design patterns used to track changes in a database so that other systems can react to those changes. Debezium is an open-source, distributed platform that turns your existing databases into event streams, allowing applications to respond to row-level changes in real time. Debezium provides a library of connectors for various databases that can monitor and record changes, publishing them to a streaming service like Apache Kafka.
This project utilizes the Debezium PostgreSQL connector to capture changes from a PostgreSQL database. Here’s a breakdown of the key components and configurations:
- Debezium PostgreSQL Connector: This connector is configured to monitor the `demo` schema within the PostgreSQL database. It reads the database's write-ahead log (WAL) to capture all `INSERT`, `UPDATE`, and `DELETE` operations committed to the tables within that schema. To enable this, the PostgreSQL server is started with `wal_level` set to `logical` in the `compose-flex.yml` file, which allows the WAL to contain the information needed for logical decoding.
- Logical Decoding Plugin (`pgoutput`): The Debezium PostgreSQL connector uses PostgreSQL's built-in `pgoutput` logical decoding plugin. This plugin is part of PostgreSQL's core and provides a way to stream a sequence of changes from the WAL.
- Database Initialization: Before Debezium can start capturing changes, the database is prepared with a specific schema and publication:
  - A new schema named `demo` is created to house the eCommerce dataset.
  - A `PUBLICATION` named `cdc_pub` is created for all tables within the `demo` schema. This publication acts as a logical grouping of tables whose changes should be made available to subscribers, in this case, the Debezium connector.
- Push-Based Model: Debezium operates on a push-based model. The PostgreSQL database actively sends changes to the Debezium connector as they occur in the transaction log. The connector then processes these changes and pushes them as events to Kafka topics. This approach ensures low-latency data streaming.
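For orientation, each change event Debezium publishes is an envelope with `before` and `after` row images plus source metadata. A simplified event for an insert into a `demo.users` row might look like the sketch below (column names and values are illustrative, not the dataset's exact schema):

```json
{
  "before": null,
  "after": {
    "id": 101,
    "first_name": "Jane",
    "email": "jane.doe@example.com"
  },
  "source": {
    "connector": "postgresql",
    "db": "fh_dev",
    "schema": "demo",
    "table": "users"
  },
  "op": "c",
  "ts_ms": 1718000000000
}
```

The `op` field distinguishes snapshot reads (`r`), inserts (`c`), updates (`u`), and deletes (`d`); for updates and deletes, `before` carries the prior row state, subject to the table's `REPLICA IDENTITY` setting.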
CDC Setup in Factor House Local
Database Service Configuration
```yaml
# factorhouse-local > compose-flex.yml
services:
  ...
  postgres:
    image: postgres:17
    container_name: postgres
    command: ["postgres", "-c", "wal_level=logical"]
    ports:
      - 5432:5432
    networks:
      - factorhouse
    volumes:
      - ./resources/postgres:/docker-entrypoint-initdb.d
    environment:
      POSTGRES_DB: fh_dev
      POSTGRES_USER: db_user
      POSTGRES_PASSWORD: db_password
      TZ: UTC
  ...
```
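Once the stack is up, you can confirm the WAL setting took effect with a quick check (this sketch assumes the `postgres` container name and the credentials shown above):

```bash
docker exec postgres psql -U db_user -d fh_dev -c "SHOW wal_level;"
#  wal_level
# -----------
#  logical
```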
Database Initialization Script
```sql
-- factorhouse-local > resources/postgres/01-init-databases.sql

-- Create schema
CREATE SCHEMA IF NOT EXISTS demo;

-- Grant privileges on schema to the application user
GRANT ALL ON SCHEMA demo TO db_user;

-- Set search_path at the DB level
ALTER DATABASE fh_dev SET search_path TO demo, public;

-- Set search_path for current session too
SET search_path TO demo, public;

-- Create CDC publication for Debezium
CREATE PUBLICATION cdc_pub FOR TABLES IN SCHEMA demo;
```
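To verify the publication was created and covers the expected tables, you can query PostgreSQL's catalog views, for example:

```sql
-- Confirm the publication exists
SELECT pubname FROM pg_publication WHERE pubname = 'cdc_pub';

-- List the tables it currently includes
SELECT schemaname, tablename FROM pg_publication_tables WHERE pubname = 'cdc_pub';
```

Because the publication is declared `FOR TABLES IN SCHEMA` (PostgreSQL 15+ syntax, supported by the `postgres:17` image), tables created later in the `demo` schema are included automatically.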
Set Up the Environment
Clone the Project
```bash
git clone https://github.com/factorhouse/examples.git
cd examples
```
Start Kafka and Flink
This project uses Factor House Local to spin up the Kafka and Flink environments, including Kpow and Flex for monitoring.
Before starting, make sure you have valid licenses for Kpow and Flex. See the license setup guide for instructions.
```bash
# Clone Factor House Local
git clone https://github.com/factorhouse/factorhouse-local.git

# Download necessary connectors and dependencies
./factorhouse-local/resources/setup-env.sh

# Configure edition and licenses
# Community:
# export KPOW_SUFFIX="-ce"
# export FLEX_SUFFIX="-ce"
# Or for Enterprise:
# unset KPOW_SUFFIX
# unset FLEX_SUFFIX
# Licenses:
# export KPOW_LICENSE=<path>
# export FLEX_LICENSE=<path>

# Start Kafka and Flink environments
docker compose -p kpow -f ./factorhouse-local/compose-kpow.yml up -d \
  && docker compose -p flex -f ./factorhouse-local/compose-flex.yml up -d
```
Launch the theLook eCommerce Data Generator
Start the containerized data generator to simulate real-time activity.
```bash
docker compose -f projects/thelook-ecomm-cdc/docker-compose.yml up -d
```
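To confirm the generator started cleanly, you can tail its logs before moving on (output will vary):

```bash
docker compose -f projects/thelook-ecomm-cdc/docker-compose.yml logs -f
```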
This will populate the following tables under the `demo` schema in the `fh_dev` PostgreSQL database:
- `users`
- `products`
- `dist_centers`
- `orders`
- `order_items`
- `events`
- `heartbeat` (used internally by Debezium)
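Once the generator has been running for a moment, you can watch rows accumulate with a query like the one below (`n_live_tup` is an approximate count maintained by PostgreSQL's statistics collector):

```sql
-- Approximate live row counts for the demo schema
SELECT relname AS table_name, n_live_tup AS approx_rows
FROM pg_stat_user_tables
WHERE schemaname = 'demo'
ORDER BY relname;
```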
Deploy the Debezium Connector (`thelook-ecomm`)
The `debezium.json` configuration defines a CDC pipeline using Debezium to stream changes from PostgreSQL into Kafka, capturing activity in the `demo` schema and serializing records in Avro format.
Key Features
- Connector: PostgreSQL CDC using Debezium with the `pgoutput` plugin
- Target Database: Connects to `fh_dev` on `postgres:5432`
- Monitored Tables: All tables under the `demo` schema
- Snapshot Mode: `"initial"` - performs a full snapshot on first run, then streams all `INSERT`, `UPDATE`, and `DELETE` operations
Serialization and Schema Management
- Format: Avro (`AvroConverter`)
- Schema Registry: Integrated with Confluent Schema Registry at `http://schema:8081`
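After the connector is deployed (next section), the registered Avro schemas can be listed through the Schema Registry REST API; this sketch assumes the registry's port 8081 is also mapped to localhost, which may differ in your setup:

```bash
curl -s http://localhost:8081/subjects
# e.g. ["ecomm.demo.users-key","ecomm.demo.users-value",...]
```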
Kafka Topic Management
- Naming Convention: Topics follow the pattern `ecomm.schema_name.table_name`
- Auto-Creation: Enabled (`"topic.creation.enable": "true"`)
- Cleanup Policy: Set to `compact`, retaining the latest value for each key
- Defaults: 3 partitions, replication factor of 1 (suitable for development only)
⚠️ In production, increase the replication factor to ensure fault tolerance.
Heartbeat Table
The connector uses the `demo.heartbeat` table to emit regular heartbeat events:
- Triggers an `INSERT` or `UPDATE` every 10 seconds (`heartbeat.interval.ms`)
- Keeps the connector’s offset up to date and allows safe WAL file cleanup
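Putting these pieces together, the connector configuration might look roughly like the sketch below. This is a reconstruction from the settings described above, not the verbatim `debezium.json`; in particular, the heartbeat action query is a placeholder, and the credentials are taken from the compose file:

```json
{
  "name": "thelook-ecomm",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "publication.name": "cdc_pub",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "db_user",
    "database.password": "db_password",
    "database.dbname": "fh_dev",
    "schema.include.list": "demo",
    "snapshot.mode": "initial",
    "topic.prefix": "ecomm",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema:8081",
    "heartbeat.interval.ms": "10000",
    "heartbeat.action.query": "<UPSERT against demo.heartbeat>",
    "topic.creation.enable": "true",
    "topic.creation.default.partitions": "3",
    "topic.creation.default.replication.factor": "1",
    "topic.creation.default.cleanup.policy": "compact"
  }
}
```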
Deploy via Kpow
Visit http://localhost:3000 to deploy the Debezium connector using Kpow’s UI. Once deployed, the following Kafka topics will be created and populated:
- `ecomm.demo.users`
- `ecomm.demo.products`
- `ecomm.demo.dist_centers`
- `ecomm.demo.orders`
- `ecomm.demo.order_items`
- `ecomm.demo.events`
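If you prefer the command line to Kpow's UI, the same configuration can typically be registered directly with the Kafka Connect REST API; the sketch below assumes Connect's REST port is exposed on localhost:8083, which may differ in this stack:

```bash
# Register the connector (adjust the path to wherever debezium.json lives)
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d @debezium.json

# Check that the thelook-ecomm connector and its task are RUNNING
curl -s http://localhost:8083/connectors/thelook-ecomm/status
```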
Shut Down
When you're done, shut down all containers and unset any environment variables:
```bash
# Stop the data generator
docker compose -f projects/thelook-ecomm-cdc/docker-compose.yml down

# Stop Factor House Local containers
docker compose -p flex -f ./factorhouse-local/compose-flex.yml down \
  && docker compose -p kpow -f ./factorhouse-local/compose-kpow.yml down

# Clear environment variables
unset KPOW_SUFFIX FLEX_SUFFIX KPOW_LICENSE FLEX_LICENSE
```
Conclusion
This project offers a practical, end-to-end environment for working with Change Data Capture using real-time eCommerce data. With a live stream of events feeding into Kafka, you can now:
- 🔍 Build real-time analytics with tools like Flink or Apache Pinot
- 🧊 Ingest data into a lakehouse with Spark, Flink, or Kafka Connect
- ⚙️ Develop event-driven microservices that respond to user or order changes
By combining a realistic dataset with open-source tooling, this project makes it easy to experiment, prototype, and build production-ready CDC pipelines.