
Unified analytics platform with Flex, Flink, Spark, Iceberg & Hive Metastore

This stack builds a comprehensive analytics platform that erases the line between real-time stream analytics and large-scale batch processing. It combines Apache Flink, enhanced by Flex for enterprise-grade management and monitoring, with Apache Spark on a unified data lakehouse, giving you a single source of truth for all your data workloads.

📌 Description

This architecture is designed around a modern data lakehouse that serves both streaming and batch jobs from the same data. At its foundation, data is stored in Apache Iceberg tables on MinIO, an S3-compatible object store. This provides powerful features like ACID transactions, schema evolution, and time travel for your data.
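As a quick illustration of what the Iceberg layer enables, here is a minimal PySpark sketch of a time-travel query. The catalog wiring uses endpoints documented below (thrift://hive-metastore:9083, the warehouse bucket); the catalog name, table name, credentials, and MinIO port are illustrative assumptions, not values defined by this stack.

```python
from pyspark.sql import SparkSession

# Minimal sketch: a Spark session wired to the stack's Hive Metastore and
# MinIO warehouse. The catalog name ("demo"), credentials, and the MinIO
# endpoint/port are assumptions for illustration.
spark = (
    SparkSession.builder.appName("iceberg-time-travel")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hive")
    .config("spark.sql.catalog.demo.uri", "thrift://hive-metastore:9083")
    .config("spark.sql.catalog.demo.warehouse", "s3a://warehouse/")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")  # assumed port
    .config("spark.hadoop.fs.s3a.access.key", "admin")            # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "password")         # placeholder
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Current state of a hypothetical table...
spark.sql("SELECT count(*) FROM demo.db.orders").show()

# ...and the same table as of an earlier point in time (Iceberg time travel,
# Spark 3.3+ syntax).
spark.sql(
    "SELECT count(*) FROM demo.db.orders TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()
```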

A central Hive Metastore serves as a unified metadata catalog for the entire data ecosystem. By using a robust PostgreSQL database as its backend, the metastore reliably tracks all table schemas and metadata. This central catalog allows both Apache Flink (for low-latency streaming) and Apache Spark (for batch ETL and interactive analytics) to discover, query, and write to the same tables seamlessly, eliminating data silos. The platform also includes Redis, an in-memory data store, to facilitate high-speed data lookups and caching for real-time enrichment tasks.

The role of PostgreSQL is twofold: in addition to providing a durable backend for the metastore, it is configured as a high-performance transactional database ready for Change Data Capture (CDC). This design allows you to stream every INSERT, UPDATE, and DELETE from your operational data directly into the lakehouse, keeping it perfectly synchronized in near real-time.
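To see the CDC hook in action without deploying a full connector, here is a hedged sketch using psycopg2's logical-replication support with PostgreSQL's built-in test_decoding output plugin. Connection parameters and the slot name are assumptions; in production, a tool like Debezium typically fills this role.

```python
import psycopg2
from psycopg2.extras import LogicalReplicationConnection

# Sketch only: host, database, and credentials are placeholders.
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="postgres",
    user="postgres", password="postgres",
    connection_factory=LogicalReplicationConnection,
)
cur = conn.cursor()

# This only works because the stack sets wal_level=logical.
cur.create_replication_slot("demo_slot", output_plugin="test_decoding")
cur.start_replication(slot_name="demo_slot", decode=True)

def consume(msg):
    # Each msg.payload describes one row-level INSERT/UPDATE/DELETE.
    print(msg.payload)
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(consume)  # blocks, streaming changes as they happen
```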

The platform is rounded out by enterprise-grade tooling: Flex simplifies Flink management, a Flink SQL Gateway enables interactive queries, and a single-node Spark cluster supports complex data transformations. This integrated environment is ideal for building sophisticated solutions for fraud detection, operational intelligence, and unified business analytics.


🔑 Key Components

  • Container: flex (factorhouse/flex:latest)
  • Provides an enterprise-ready tooling solution to streamline and simplify Apache Flink management. It gathers Flink resource information, offering custom telemetry, insights, and a rich data-oriented UI. Key features include:
    • Comprehensive Flink Monitoring & Insights: Offers fully integrated metrics, long-term history, and aggregated consumption/production data, from cluster-level down to individual job-level details.
    • Enterprise-Grade Security & Governance: Supports versatile authentication (LDAP, SAML, OpenID), robust Role-Based Access Control (RBAC), data masking policies, and a full audit log.
    • Powerful Flink Enhancements: Includes multi-tenancy and multi-cluster monitoring capabilities.
    • Key Integrations: Exposes Prometheus endpoints and allows for Slack notifications.
  • Exposes UI at http://localhost:3001
  • Containers: jobmanager, taskmanager-1, taskmanager-2, taskmanager-3 (the Flink cluster)
  • The JobManager (jobmanager) coordinates all tasks and handles scheduling, checkpoints, and failover. The Flink UI is exposed at http://localhost:8082.
  • TaskManagers (taskmanager-1, -2, -3) run user code and perform actual stream processing.
  • The cluster is configured to use the central Hive Metastore for catalog services and includes a rich set of connectors for Kafka, Iceberg, and more.
  • Container: sql-gateway
  • A REST-accessible endpoint (http://localhost:9090) for interactive Flink SQL queries against the unified catalog; a scripted example follows below.
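For instance, the gateway can be scripted over plain HTTP. The sketch below follows the shape of Flink's SQL Gateway REST API (v1): open a session, submit a statement, then poll for the result. The statement itself and the polling details are illustrative.

```python
import time
import requests

GATEWAY = "http://localhost:9090"

# Open a session; an empty body accepts the gateway's session defaults.
session = requests.post(f"{GATEWAY}/v1/sessions", json={}).json()["sessionHandle"]

# Submit one statement (any Flink SQL works against the unified catalog).
op = requests.post(
    f"{GATEWAY}/v1/sessions/{session}/statements",
    json={"statement": "SHOW TABLES"},
).json()["operationHandle"]

# Results are paginated; token 0 is the first page. Poll until ready.
url = f"{GATEWAY}/v1/sessions/{session}/operations/{op}/result/0"
while True:
    page = requests.get(url).json()
    if page.get("resultType") != "NOT_READY":
        break
    time.sleep(0.5)

print(page["results"]["data"])
```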

🚀 Spark Compute Engine (Batch Engine)

  • Container: spark-iceberg
  • Provides an Apache Spark environment pre-configured with Apache Iceberg and OpenLineage support, connected to the central Hive Metastore.
  • Spark Web UI for monitoring running jobs (http://localhost:4040).
  • Spark History Server for reviewing completed jobs (http://localhost:18080).

📚 Hive Metastore (Unified Catalog)

  • Container: hive-metastore
  • A central metadata service that allows both Flink and Spark to interact with the same Iceberg tables consistently.
  • Uses the PostgreSQL database as its backend to durably store all metadata.
  • Accessible internally at thrift://hive-metastore:9083.

🐘 PostgreSQL (Transactional Hub & Metastore Backend)

  • Container: postgres
  • Serves two critical functions:
    1. Durable Metastore Backend: Provides persistent storage for the Hive Metastore.
    2. Transactional Workload & CDC Hub: Functions as a relational database for application workloads, with wal_level=logical enabled for seamless Change Data Capture.
  • Accessible at localhost:5432.

💾 S3-Compatible Object Storage (MinIO)

  • MinIO provides S3-compatible object storage, acting as the data lake storage layer for Iceberg tables and Flink checkpoints.
  • MinIO Client (mc) initializes MinIO by creating necessary buckets (warehouse, flink-checkpoints, etc.).
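For a quick sanity check of the storage layer, the sketch below uses the MinIO Python SDK to list buckets and the Iceberg files inside the warehouse bucket. The API port (9000) and credentials are assumptions; only the bucket names come from this stack description.

```python
from minio import Minio

# Endpoint port and credentials are placeholders, not stack-defined values.
client = Minio(
    "localhost:9000",
    access_key="admin",
    secret_key="password",
    secure=False,
)

# Buckets created by the mc init container should appear here.
for bucket in client.list_buckets():
    print(bucket.name)

# Iceberg data and metadata files land under the warehouse bucket.
for obj in client.list_objects("warehouse", recursive=True):
    print(obj.object_name, obj.size)
```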

⚡ Redis (In-Memory Data Store)

  • Container: redis
  • A high-performance in-memory key-value store that adds a low-latency data access layer to the platform.
  • Purpose: Ideal for caching frequently accessed data, serving as a high-speed lookup table for stream enrichment in Flink, or acting as a serving layer for real-time application results.
  • Configuration: Configured for persistence (appendonly yes) and secured with a password.
  • Accessible at localhost:6379.
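A minimal redis-py sketch of the lookup-table pattern described above; the password is a placeholder, and the key naming scheme is purely illustrative.

```python
import redis

# Password is whatever the stack configures; "redis-password" is a placeholder.
r = redis.Redis(host="localhost", port=6379, password="redis-password",
                decode_responses=True)

# Populate a lookup table for stream enrichment, with a TTL so stale
# reference data expires on its own.
r.set("customer:42:segment", "premium", ex=3600)

# Millisecond-latency lookup, e.g. from a Flink enrichment function.
print(r.get("customer:42:segment"))  # -> "premium"
```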

🧰 Use Cases

Unified Data Lakehouse

  • Create, manage, and query Iceberg tables using both Flink SQL for real-time writes and Spark for batch updates, all through the central Hive Metastore.
  • Perform ACID-compliant operations from either engine on the same datasets.

Real-Time Ingestion from Transactional Systems (CDC)

  • The architecture is purpose-built to support CDC pipelines. The wal_level=logical setting in PostgreSQL allows tools like Debezium to capture every row-level change and stream it into the data lakehouse in near real-time.

Batch ETL/ELT Pipelines

  • Use Spark to perform large-scale transformations on data within the lakehouse for reporting or machine learning.
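As a sketch of what such a batch job can look like, the following reuses the SparkSession configured in the earlier sketch and runs an ACID MERGE plus a reporting aggregate; all table and column names are hypothetical.

```python
# Reuses the `spark` session from the earlier sketch; tables are hypothetical.
# An Iceberg MERGE gives ACID upsert semantics from the batch engine.
spark.sql("""
    MERGE INTO demo.db.orders AS t
    USING demo.db.orders_staging AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# A downstream reporting aggregate written back into the lakehouse.
spark.sql("""
    CREATE OR REPLACE TABLE demo.db.daily_revenue AS
    SELECT order_date, sum(amount) AS revenue
    FROM demo.db.orders
    GROUP BY order_date
""")
```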

Real-Time ETL & Stream Enrichment

  • Ingest data from Kafka or CDC streams with Flink.
  • Join streaming data with lookup tables in Redis for millisecond-latency enrichment.
  • Write enriched, structured data directly into Iceberg tables, making it immediately available for Spark to query.
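One hedged way to implement the Redis enrichment step is a PyFlink scalar UDF that opens one Redis connection per parallel task; the host, password, key scheme, and table names below are assumptions.

```python
import redis
from pyflink.table import DataTypes
from pyflink.table.udf import ScalarFunction, udf

class SegmentLookup(ScalarFunction):
    """Sketch of a Flink scalar UDF that enriches events from Redis."""

    def open(self, function_context):
        # One connection per parallel task; host and password are placeholders.
        self._redis = redis.Redis(host="redis", port=6379,
                                  password="redis-password",
                                  decode_responses=True)

    def eval(self, customer_id):
        return self._redis.get(f"customer:{customer_id}:segment") or "unknown"

segment_lookup = udf(SegmentLookup(), result_type=DataTypes.STRING())

# Once registered on a TableEnvironment, it can be used from SQL, e.g.:
#   t_env.create_temporary_function("segment_lookup", segment_lookup)
#   SELECT order_id, segment_lookup(customer_id) AS segment FROM orders
```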

Interactive & Self-Service Analytics

  • Empower analysts to query live, streaming data via the Flink SQL Gateway and historical, large-scale data using Spark, all with familiar SQL pointed at the same tables.