Unified analytics platform with Flex, Flink, Spark, Iceberg & Hive Metastore
This stack builds a comprehensive analytics platform that erases the line between real-time stream analytics and large-scale batch processing. It achieves this by combining the power of Apache Flink, enhanced by Flex for enterprise-grade management and monitoring, with Apache Spark on a unified data lakehouse, enabling you to work with a single source of truth for all your data workloads.
Description
This architecture is designed around a modern data lakehouse that serves both streaming and batch jobs from the same data. At its foundation, data is stored in Apache Iceberg tables on MinIO, an S3-compatible object store. This provides powerful features like ACID transactions, schema evolution, and time travel for your data.
A central Hive Metastore serves as a unified metadata catalog for the entire data ecosystem, providing essential information about the structure and location of datasets. By using a robust PostgreSQL database as its backend, the metastore reliably tracks all table schemas and metadata. This central catalog allows both Apache Flink (for low-latency streaming) and Apache Spark (for batch ETL and interactive analytics) to discover, query, and write to the same tables seamlessly, eliminating data silos.
The role of PostgreSQL is twofold: in addition to providing a durable backend for the metastore, it is configured as a high-performance transactional database ready for Change Data Capture (CDC). This design allows you to stream every `INSERT`, `UPDATE`, and `DELETE` from your operational data directly into the lakehouse, keeping it synchronized in near real-time.
The platform is rounded out by enterprise-grade tooling: Flex simplifies Flink management and monitoring, a Flink SQL Gateway enables interactive queries on live data streams, and a single-node Spark cluster supports complex data transformations. This integrated environment is ideal for building sophisticated solutions for fraud detection, operational intelligence, and unified business analytics.
Key components
Flex (Flink management & monitoring toolkit)
- Container: `kpow` (from `factorhouse/flex:latest`, enterprise) or `kpow-ce` (from `factorhouse/flex-ce:latest`, community) - Provides an enterprise-ready tooling solution to streamline and simplify Apache Flink management. It gathers Flink resource information, offering custom telemetry, insights, and a rich data-oriented UI. Key features include:
- Comprehensive Flink Monitoring & Insights:
- Gathers Flink resource information minute-by-minute.
- Offers fully integrated metrics and telemetry.
- Provides access to long-term metrics and aggregated consumption/production data, from cluster-level down to individual job-level details.
- Simplified Management for All User Groups:
- User-friendly interface and intuitive controls.
- Aims to align business needs with Flink capabilities.
- Enterprise-Grade Security & Governance:
- Versatile Authentication: Supports DB, File, LDAP, SAML, OpenID, Okta, and Keycloak.
- Robust Authorization: Offers Simple or fine-grained Role-Based Access Controls (RBAC).
- Data Policies: Includes capabilities for masking and redaction of sensitive data (e.g., PII, Credit Card).
- Audit Logging: Captures all user actions for comprehensive data governance.
- Secure Deployments: Supports HTTPS and is designed for air-gapped environments (all data remains local).
- Powerful Flink Enhancements:
- Multi-tenancy: Advanced capabilities to manage Flink resources effectively with control over visibility and usage.
- Multi-Cluster Monitoring: Manage and monitor multiple Flink clusters from a single installation.
- Key Integrations:
- Prometheus: Exposes endpoints for integration with preferred metrics and alerting systems.
- Slack: Allows user actions to be sent to an operations channel in real-time.
- Exposes UI at `http://localhost:3001`
Flink cluster
- JobManager (`jobmanager`) coordinates all tasks, handling scheduling, checkpoints, and failover. The Flink UI is exposed at `http://localhost:8082`.
- TaskManagers (`taskmanager-1`, `-2`, `-3`) run user code and perform the actual stream processing.
- The cluster is configured to use the central Hive Metastore for catalog services.
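In Flink SQL, this wiring typically surfaces as a Hive catalog registration. A minimal sketch, where the catalog name `hive` and the `hive-conf-dir` path are assumptions for this stack (the directory is expected to contain a `hive-site.xml` pointing at `thrift://hive-metastore:9083`):

```python
# Flink SQL DDL that a job or SQL client would run to attach the shared
# catalog. "hive" is a hypothetical catalog name; adjust hive-conf-dir to
# wherever your image mounts hive-site.xml.
CREATE_HIVE_CATALOG = """
CREATE CATALOG hive WITH (
  'type' = 'hive',
  'hive-conf-dir' = '/opt/flink/conf'
);
"""

# Make the shared catalog the default for subsequent statements.
USE_CATALOG = "USE CATALOG hive;"
```

Once the catalog is registered, tables created by Spark become visible to Flink jobs (and vice versa) without any extra synchronization.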
SQL Gateway
- Container: `sql-gateway`
- A REST-accessible endpoint (`http://localhost:9090`) for interactive Flink SQL queries against the unified catalog.
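The gateway speaks a small REST protocol: open a session, submit a statement, then fetch the operation's results. A hedged Python sketch of that flow, assuming the `http://localhost:9090` endpoint above and the `/v1` paths of the Flink SQL Gateway REST API (verify the paths against your Flink version):

```python
import json
import urllib.request

# Assumed gateway address from this stack's port mapping.
GATEWAY = "http://localhost:9090"

def _post(url: str, body: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def open_session() -> str:
    """Open a gateway session and return its handle."""
    return _post(f"{GATEWAY}/v1/sessions", {})["sessionHandle"]

def submit_statement(session: str, sql: str) -> str:
    """Submit one SQL statement; returns an operation handle to poll."""
    body = {"statement": sql}
    return _post(f"{GATEWAY}/v1/sessions/{session}/statements", body)["operationHandle"]

def result_url(session: str, operation: str, token: int = 0) -> str:
    """URL to GET for a page of results once the operation has finished."""
    return f"{GATEWAY}/v1/sessions/{session}/operations/{operation}/result/{token}"
```

Against a running stack, `submit_statement(open_session(), "SHOW CATALOGS")` would return an operation handle whose results you page through via `result_url`.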
Spark compute engine (batch engine)
- Container: `spark-iceberg`
- Provides an Apache Spark environment pre-configured with Apache Iceberg support and connected to the central Hive Metastore.
- Spark Web UI for monitoring running jobs (`http://localhost:4040`).
- Spark History Server for reviewing completed jobs (`http://localhost:18080`).
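The Spark-side wiring usually amounts to a handful of Iceberg catalog properties. A sketch of the likely configuration, where the catalog name `lakehouse` is hypothetical and the metastore URI, MinIO endpoint, and `warehouse` bucket are taken from this stack's service names (adjust to your compose file):

```python
# Spark configuration pairs for an Iceberg catalog backed by the shared
# Hive Metastore, with table data stored in MinIO via S3FileIO.
# "lakehouse" is an assumed catalog name, not mandated by the stack.
iceberg_conf = {
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.lakehouse": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.lakehouse.type": "hive",
    "spark.sql.catalog.lakehouse.uri": "thrift://hive-metastore:9083",
    "spark.sql.catalog.lakehouse.warehouse": "s3a://warehouse/",
    "spark.sql.catalog.lakehouse.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.lakehouse.s3.endpoint": "http://minio:9000",
}
```

Each pair would be passed via `SparkSession.builder.config(key, value)` or baked into `spark-defaults.conf`; the `spark-iceberg` container ships with an equivalent configuration already applied.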
Hive Metastore (unified catalog)
- Container: `hive-metastore`
- A central metadata service for the entire data lakehouse. It allows both Flink and Spark to interact with the same Iceberg tables consistently.
- Uses the PostgreSQL database as its backend to durably store all metadata (schemas, partitions, table locations).
- Accessible internally at `thrift://hive-metastore:9083`.
PostgreSQL (transactional hub & metastore backend)
- Container: `postgres`
- This component is the transactional and metadata backbone of the entire platform, serving two distinct and critical functions:
  - Durable Metastore Backend: provides the persistent storage for the Hive Metastore. All schemas, table versions, and partition information for the entire Iceberg lakehouse are stored transactionally in PostgreSQL, making the lakehouse catalog robust, reliable, and recoverable (Database: `metastore`).
  - Transactional Workload & CDC Hub: functions as a full-fledged relational database for application workloads. It is purpose-built for Change Data Capture (CDC), with `wal_level=logical` enabled by design. This configuration prepares it for seamless integration with tools like Debezium, allowing every `INSERT`, `UPDATE`, and `DELETE` to be captured and streamed into the Flink/Iceberg pipeline.
- Accessible at `localhost:5432` (Database: `fh_dev`).
S3-compatible object storage (MinIO)
- MinIO provides S3-compatible object storage, acting as the data lake storage layer for Iceberg tables, Flink checkpoints, and other artifacts.
- MinIO API at `http://localhost:9000` | MinIO Console UI at `http://localhost:9001` (`admin` / `password`).
- MinIO Client: a utility container that initializes MinIO by creating the necessary buckets: `warehouse` (for Iceberg data), `fh-dev-bucket`, `flink-checkpoints`, and `flink-savepoints`.
Use cases
Unified data lakehouse
- Create, manage, and query Iceberg tables using both Flink SQL for real-time writes and Spark for batch updates, all through the central Hive Metastore.
- Perform ACID-compliant operations from either engine on the same datasets.
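As a concrete illustration of the shared catalog, both engines can address one table by the same identifier. A sketch with hypothetical names only (a `lakehouse` catalog, a `db.orders` Iceberg table, and a `kafka_orders` Flink source, none of which ship with the stack):

```python
# Both statements resolve db.orders through the same Hive Metastore entry,
# so Flink's streaming writes become visible to Spark's batch reads as
# soon as each Iceberg snapshot commits.
TABLE = "lakehouse.db.orders"

# Continuous write path, executed by a Flink job.
flink_streaming_insert = f"""
INSERT INTO {TABLE}
SELECT order_id, amount, order_ts
FROM kafka_orders;
"""

# Batch read path, executed in Spark against the identical identifier.
spark_batch_query = f"SELECT COUNT(*) AS n FROM {TABLE};"
```

No copying or export step sits between the two engines; the catalog entry is the single point of coordination.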
Real-time ingestion from transactional systems (CDC)
- The architecture is purpose-built to support CDC pipelines. The `wal_level=logical` setting in PostgreSQL is intentionally enabled, allowing a tool like Debezium to capture every row-level change (INSERT, UPDATE, DELETE) and stream it into the data lakehouse in near real-time. This keeps the Iceberg tables continuously synchronized with the operational database.
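With `wal_level=logical` already in place, wiring Debezium in is mostly connector configuration. A sketch of a Kafka Connect submission, where the connector name, credentials, and table list are placeholders; the property keys follow Debezium's PostgreSQL connector, and the hostname/database values come from this stack:

```python
# Kafka Connect payload for a Debezium Postgres source. The "..." values
# and the table list are placeholders to fill in for your environment.
debezium_config = {
    "name": "fh-dev-cdc",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",      # service name in this stack
        "database.port": "5432",
        "database.user": "...",
        "database.password": "...",
        "database.dbname": "fh_dev",
        # pgoutput is Postgres's built-in logical decoding plugin, so no
        # extra server-side extension is needed beyond wal_level=logical.
        "plugin.name": "pgoutput",
        "topic.prefix": "fh_dev",
        "table.include.list": "public.orders",  # hypothetical table
    },
}
```

POSTing this JSON to a Kafka Connect worker would start streaming row-level changes into topics that a Flink job can then sink into Iceberg.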
Batch ETL/ELT pipelines
- Use Spark to ingest data from various sources (including transactional data from PostgreSQL), perform large-scale transformations, and load it into Iceberg tables.
- Read from Iceberg tables for downstream processing, reporting, or machine learning.
Real-time ETL & stream enrichment
- Ingest data from Kafka or CDC streams with Flink.
- Join streaming data with lookup tables in real-time.
- Write enriched, structured data directly into Iceberg tables, making it immediately available for Spark to query.
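The enrichment step above is typically expressed as a Flink SQL lookup join. A sketch with hypothetical table names (`orders` stream, `customers` lookup table), using Flink's `FOR SYSTEM_TIME AS OF` processing-time join syntax:

```python
# Joins each streaming order against the current version of the customers
# lookup table at processing time; the enriched rows can then be written
# to an Iceberg table with a plain INSERT INTO.
ENRICH_ORDERS = """
SELECT o.order_id, o.amount, c.segment
FROM orders AS o
JOIN customers FOR SYSTEM_TIME AS OF o.proc_time AS c
  ON o.customer_id = c.id;
"""
```

Because the join resolves against the lookup table's state at event processing time, late updates to `customers` affect only subsequent orders, which is usually the desired semantics for enrichment.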
Interactive & self-service analytics
- Empower analysts to query live, streaming data via the Flink SQL Gateway and historical, large-scale data using Spark, all with familiar SQL pointed at the same tables.