Centralized Observability & Data Lineage
This Docker Compose stack deploys an environment for data lineage and systems observability. It features Marquez, the reference implementation of the OpenLineage open standard for data lineage, alongside an observability suite built from Prometheus, Grafana, and Alertmanager.
📌 Description
This architecture is aimed at data professionals who need to trace how their data moves and to monitor the health of their systems. It pairs OpenLineage for standardized metadata collection with Marquez for visualization and analysis, and adds the Prometheus/Grafana stack for metrics and alerting.
The core of this stack is OpenLineage, a standardized API for collecting data lineage information. Marquez acts as the metadata server, collecting these OpenLineage events to build a living map of how datasets are produced and consumed. This is invaluable for impact and root cause analysis, data governance, and debugging complex data pipelines. The surrounding observability tools ensure the reliability and performance of the entire platform.
🔑 Key Components
🚀 OpenLineage & Marquez (Data Lineage & Metadata Service)
OpenLineage is an open standard for the collection and analysis of data lineage. It provides a consistent format for data pipeline tools to emit metadata about jobs, datasets, and runs.
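To make the "consistent format" concrete, here is a minimal Python sketch of a run event carrying the job, run, and dataset metadata described above. The namespace, job name, dataset names, and producer URL are hypothetical placeholders, and only a subset of fields is shown; the specification defines further fields (for example `schemaURL` and facets) that are omitted here.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs,
                    namespace="example-namespace", run_id=None):
    """Return a dict shaped like an OpenLineage run event (subset of fields)."""
    def dataset(name):
        # Datasets are identified by a namespace plus a name.
        return {"namespace": namespace, "name": name}

    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
        # Hypothetical producer URI identifying the emitting tool.
        "producer": "https://example.com/hypothetical-producer",
    }


event = build_run_event(
    "COMPLETE",
    "daily_orders_rollup",
    inputs=["raw.orders"],
    outputs=["analytics.orders_daily"],
)
print(json.dumps(event, indent=2))
```

In practice the OpenLineage integrations for tools such as Spark or Airflow emit these events for you; building one by hand is mainly useful for testing that a metadata server accepts them.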
- marquez-api (marquezproject/marquez:0.51.1): The core Marquez backend service. It provides a RESTful API that is compliant with the OpenLineage standard, allowing it to receive metadata from a wide range of integrated tools like Flink, Spark, Airflow, and dbt.
- marquez-web (marquezproject/marquez-web:0.51.1): The web interface for Marquez, which visualizes the collected OpenLineage data. It allows users to browse the metadata catalog, explore interactive data lineage graphs, and trace the journey of their data. The UI is exposed on port 3003.
- marquez-db (postgres:14): This PostgreSQL database serves as the backend for Marquez, storing all the metadata collected via OpenLineage events. It holds information on jobs, datasets, historical runs, and their relationships.
📊 Observability Stack (Prometheus, Grafana & Alertmanager)
This is a widely used open-source stack for monitoring and alerting.
- prometheus (prom/prometheus:v3.5.0): A time-series database that collects and stores metrics by scraping configured endpoints. It is configured via a prometheus.yml file and includes specific rule files for monitoring other services (e.g., Kpow), indicating its role in a larger ecosystem. It is accessible on port 19090.
- alertmanager (prom/alertmanager:v0.28.1): Manages alerts sent by Prometheus. It is responsible for deduplicating, grouping, and routing them to the correct notification channels like email or Slack. It exposes its UI on port 19093.
- grafana (grafana/grafana:12.1.1): A leading visualization platform for creating dashboards from the metrics stored in Prometheus. This service is pre-configured with an admin user and uses a provisioning folder to automatically set up datasources and dashboards on startup. It is available on port 3004.
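As a quick way to check that Prometheus is scraping its targets, the sketch below builds a request against the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`). It assumes the stack is reached from the Docker host at `localhost` on the port listed above; adjust the base URL if your setup differs. The helper names are illustrative only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Prometheus is reachable from the Docker host on the mapped port.
PROMETHEUS_BASE_URL = "http://localhost:19090"


def instant_query_url(promql, base_url=PROMETHEUS_BASE_URL):
    """Build the URL for a Prometheus instant query."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def run_instant_query(promql, base_url=PROMETHEUS_BASE_URL, timeout=5):
    """Execute the query and return the result list from the JSON response."""
    with urlopen(instant_query_url(promql, base_url), timeout=timeout) as resp:
        body = json.load(resp)
    return body["data"]["result"]


# Building the URL needs no running stack; `up` reports scrape target health.
print(instant_query_url("up"))
```

Calling `run_instant_query("up")` against a running stack should return one series per scrape target, which is a simple smoke test before building Grafana dashboards on top.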
🧰 Use Cases
Automated Data Lineage & Provenance Tracking
- Rely on OpenLineage integrations to capture lineage metadata automatically from your data pipelines, then use Marquez to visualize the origin, movement, and transformations of data across your entire ecosystem.
Impact and Root Cause Analysis
- When a data pipeline fails or data quality issues arise, use the lineage graph in Marquez to quickly identify the root cause upstream and assess the potential impact on downstream datasets and dashboards.
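The idea behind both analyses is a graph traversal: walk the lineage edges forward to find what is affected, or backward to find possible causes. The sketch below illustrates this on a small, hypothetical in-memory graph; it is not how Marquez implements lineage queries, and all node names are made up.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the nodes in its list.
DOWNSTREAM = {
    "raw.orders": ["job.clean_orders"],
    "job.clean_orders": ["staging.orders"],
    "staging.orders": ["job.daily_rollup", "job.export"],
    "job.daily_rollup": ["analytics.orders_daily"],
    "job.export": ["exports.orders_csv"],
}


def reachable(start, edges):
    """Breadth-first walk returning every node reachable from `start`."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def invert(edges):
    """Reverse every edge so the same walk can run upstream."""
    inverted = {}
    for src, targets in edges.items():
        for dst in targets:
            inverted.setdefault(dst, []).append(src)
    return inverted


# Impact analysis: what depends on staging.orders?
impact = reachable("staging.orders", DOWNSTREAM)
# Root cause analysis: what feeds analytics.orders_daily?
causes = reachable("analytics.orders_daily", invert(DOWNSTREAM))
print(impact)
print(causes)
```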
Data Governance and Compliance
- Maintain a detailed, historical record of dataset versions, schema changes, and job execution history. This is essential for auditing, ensuring data governance policies are met, and understanding the lifecycle of your data.
Centralized System Health Monitoring
- Use the Prometheus and Grafana stack to monitor the performance and health of the Marquez services and other integrated components. Create dashboards to track API latency, database connections, and resource utilization.
Proactive Alerting on Data & System Issues
- Configure alerts in Prometheus and Alertmanager to be notified of potential problems. This could include failed job runs reported in the OpenLineage metadata, or system-level issues like high CPU usage, before they impact your data consumers.
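To illustrate the kind of condition a "failed job runs" alert could encode, the sketch below counts `FAIL` events per job in a batch of OpenLineage-style events and flags jobs that reach a threshold. This is only an illustration of the logic in Python, not a Prometheus rule file; the events, job names, and the `failed_jobs` helper are hypothetical.

```python
from collections import Counter


def failed_jobs(events, threshold=1):
    """Return (namespace, name) pairs with at least `threshold` FAIL events."""
    failures = Counter(
        (e["job"]["namespace"], e["job"]["name"])
        for e in events
        if e.get("eventType") == "FAIL"
    )
    return sorted(job for job, count in failures.items() if count >= threshold)


# Hypothetical sample of events, reduced to the fields used here.
events = [
    {"eventType": "START", "job": {"namespace": "etl", "name": "load_orders"}},
    {"eventType": "FAIL", "job": {"namespace": "etl", "name": "load_orders"}},
    {"eventType": "COMPLETE", "job": {"namespace": "etl", "name": "daily_rollup"}},
    {"eventType": "FAIL", "job": {"namespace": "etl", "name": "load_orders"}},
    {"eventType": "FAIL", "job": {"namespace": "etl", "name": "export"}},
]

print(failed_jobs(events))
print(failed_jobs(events, threshold=2))
```

Raising the threshold is one way to avoid notifying on a single transient failure while still catching jobs that fail repeatedly.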