Real-time OLAP with Apache Pinot
This stack deploys a basic cluster of Apache Pinot, a real-time distributed OLAP (Online Analytical Processing) datastore designed for ultra-low-latency analytics at scale. It includes the core Pinot components: Controller, Broker, and Server.
📌 Description
This architecture provides the foundation for ingesting data from batch sources (e.g., HDFS, S3) or streaming sources (e.g., Kafka) and making it available for analytical queries, often with millisecond response times. Pinot is optimized for user-facing analytics, real-time dashboards, anomaly detection, and other scenarios requiring fast insights on fresh data.
Note: This configuration requires an external Apache Zookeeper instance running at zookeeper:2181 on the factorhouse network for cluster coordination. That instance is not defined within the Docker Compose file.
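Because Zookeeper lives outside this Compose file, it can help to confirm it is reachable before starting the stack. The following is a minimal sketch, assuming it runs from a container attached to the factorhouse network (the hostname zookeeper will not normally resolve from the host machine). The helper name can_connect is hypothetical.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hostnames.
        return False
```

Calling can_connect("zookeeper", 2181) should return True once Zookeeper is listening. A successful TCP connection only shows that the port is open; it does not prove the Zookeeper ensemble is healthy.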
🔑 Key components
👑 Pinot controller (pinot-controller)
- Container: apachepinot/pinot:1.2.0
- Role: Manages the overall cluster state, handles administration tasks (like adding tables, schema management), coordinates segment assignment, and monitors node health via Zookeeper.
- Admin UI/API: Exposed externally at http://localhost:19000 (maps to internal port 9000).
- Healthcheck verifies its readiness.
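As an illustration of the administration tasks the Controller handles, the sketch below builds a schema and prepares a POST to the Controller's /schemas REST endpoint. The schema name and column names used in the usage note (pageviews, country, views, ts) are hypothetical examples, not part of this stack, and the helper names are assumptions.

```python
import json
import urllib.request

CONTROLLER_URL = "http://localhost:19000"  # Controller port exposed by this stack


def build_schema(name, dimensions, metrics, time_column):
    """Build a Pinot schema dict. All names passed in are caller-supplied examples."""
    return {
        "schemaName": name,
        "dimensionFieldSpecs": [{"name": c, "dataType": "STRING"} for c in dimensions],
        "metricFieldSpecs": [{"name": c, "dataType": "LONG"} for c in metrics],
        "dateTimeFieldSpecs": [
            {
                "name": time_column,
                "dataType": "LONG",
                "format": "1:MILLISECONDS:EPOCH",
                "granularity": "1:MILLISECONDS",
            }
        ],
    }


def schema_request(schema, controller_url=CONTROLLER_URL):
    """Prepare (but do not send) a POST to the Controller's /schemas endpoint."""
    return urllib.request.Request(
        url=f"{controller_url}/schemas",
        data=json.dumps(schema).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the stack running, urllib.request.urlopen(schema_request(build_schema("pageviews", ["country"], ["views"], "ts"))) would submit the schema. The same can be done interactively through the Admin UI.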
📡 Pinot broker (pinot-broker)
- Container: apachepinot/pinot:1.2.0
- Role: Acts as the query gateway. Receives SQL queries from clients, determines which servers hold the relevant data segments, scatters the query to those servers, gathers the results, and returns the final consolidated response.
- Query Endpoint: Exposed externally at http://localhost:18099 (maps to internal port 8099).
- Depends on the Controller being healthy before starting.
- Healthcheck verifies its readiness.
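To make the Broker's gateway role concrete, here is a minimal sketch of a client that posts SQL to the Broker's /query/sql endpoint and turns the returned resultTable into a list of dicts. The helper names are hypothetical, and any table you query is assumed to exist already.

```python
import json
import urllib.request

BROKER_URL = "http://localhost:18099"  # Broker port exposed by this stack


def query_request(sql, broker_url=BROKER_URL):
    """Prepare (but do not send) a POST carrying the SQL to the Broker."""
    return urllib.request.Request(
        url=f"{broker_url}/query/sql",
        data=json.dumps({"sql": sql}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rows_as_dicts(response):
    """Convert a Broker JSON response into a list of {column: value} dicts."""
    exceptions = response.get("exceptions") or []
    if exceptions:
        raise RuntimeError(f"Pinot reported query errors: {exceptions}")
    table = response.get("resultTable") or {}
    columns = table.get("dataSchema", {}).get("columnNames", [])
    return [dict(zip(columns, row)) for row in table.get("rows", [])]


def run_query(sql, broker_url=BROKER_URL, timeout=10.0):
    """Send the query and return the consolidated rows gathered by the Broker."""
    with urllib.request.urlopen(query_request(sql, broker_url), timeout=timeout) as resp:
        return rows_as_dicts(json.load(resp))
```

For example, run_query("SELECT COUNT(*) FROM pageviews") would return a single-row list once a table named pageviews (a hypothetical name) has been created and loaded.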
💾 Pinot server (pinot-server)
- Container: apachepinot/pinot:1.2.0
- Role: Hosts data segments (shards) and executes query fragments against the data it stores. Can ingest data directly from streaming sources (Realtime Server) or load pre-built segments from deep storage (Offline Server). This configuration runs a generic Server capable of both roles depending on table setup.
- Internal API/Metrics: Exposed externally at http://localhost:18098 (maps to internal port 8098/8097 for health). Direct interaction is less common than with the Broker or Controller.
- Depends on the Broker being healthy before starting.
- Healthcheck verifies its readiness.
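Whether the Server acts in the offline or the realtime role is decided by each table's configuration, not by the Server itself. The sketch below contrasts the two table types as Python dicts that could be posted as JSON to the Controller's /tables endpoint. The table name, topic, and Kafka address in the usage note are hypothetical, and the stream settings are a starting point under those assumptions, not a complete production configuration.

```python
def offline_table(name, time_column):
    """Table config for segments pre-built elsewhere and loaded from deep storage."""
    return {
        "tableName": name,
        "tableType": "OFFLINE",
        "segmentsConfig": {
            "schemaName": name,
            "timeColumnName": time_column,
            "replication": "1",
        },
        "tenants": {},
        "tableIndexConfig": {"loadMode": "MMAP"},
        "metadata": {},
    }


def realtime_table(name, time_column, topic, kafka_brokers):
    """Table config for rows consumed directly from a Kafka topic."""
    config = offline_table(name, time_column)  # fresh dict, safe to modify
    config["tableType"] = "REALTIME"
    config["tableIndexConfig"]["streamConfigs"] = {
        "streamType": "kafka",
        "stream.kafka.topic.name": topic,
        "stream.kafka.broker.list": kafka_brokers,
        "stream.kafka.consumer.type": "lowlevel",
        "stream.kafka.consumer.factory.class.name":
            "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
        "stream.kafka.decoder.class.name":
            "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
    }
    return config
```

For instance, realtime_table("pageviews", "ts", "pageviews-events", "kafka:9092") describes a table the same Server would ingest from Kafka, while offline_table("pageviews", "ts") describes one it would serve from uploaded segments.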
🌐 Network & dependencies
- All components reside on the factorhouse network.
- Relies on an external Zookeeper instance at zookeeper:2181 for coordination.
- Startup order is enforced via depends_on and healthcheck conditions: Controller -> Broker -> Server.
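The startup chain above can be modelled as a small dependency graph. The following sketch is not how Docker Compose is implemented; it only mirrors the behaviour described here: derive the order from the dependencies, then wait for each service's health probe before moving to the next. The is_healthy callback is a hypothetical stand-in for the real healthchecks.

```python
import time
from graphlib import TopologicalSorter  # Python 3.9+

# Each service maps to the services that must be healthy before it starts.
DEPENDS_ON = {
    "pinot-controller": [],
    "pinot-broker": ["pinot-controller"],
    "pinot-server": ["pinot-broker"],
}


def startup_order(depends_on):
    """Return service names with every dependency listed before its dependants."""
    return list(TopologicalSorter(depends_on).static_order())


def wait_in_order(order, is_healthy, retries=30, delay=2.0, sleep=time.sleep):
    """Poll each service in turn; stop with TimeoutError if one never turns healthy."""
    started = []
    for service in order:
        for _ in range(retries):
            if is_healthy(service):
                started.append(service)
                break
            sleep(delay)
        else:
            # The retry loop finished without a successful probe.
            raise TimeoutError(f"{service} did not become healthy")
    return started
```

startup_order(DEPENDS_ON) yields the Controller, then the Broker, then the Server, matching the chain enforced by the Compose file.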
🧰 Use cases
Real-time dashboards
- Power interactive dashboards requiring millisecond query latency on potentially large, constantly updating datasets (e.g., operational monitoring, business intelligence).
User-facing analytics
- Embed analytics directly into applications where users can slice and dice data with immediate feedback (e.g., e-commerce site analytics, personalized recommendations).
Anomaly & threat detection
- Query streaming event data in near real-time to identify patterns, outliers, or anomalies quickly (e.g., fraud detection, system security monitoring).
A/B testing analysis
- Ingest experiment data and provide rapid aggregations and comparisons to evaluate A/B test performance.
Log analytics
- Provide fast, interactive querying over large volumes of log or event data for troubleshooting and analysis.