← Projects
Production Work

Enterprise Observability Stack

View on GitHub ↗
Prometheus + AlertmanagerLoki + PromtailGrafanaDocker Compose / KubernetesPromQLLogQLOpenTelemetry

About this project

Full-stack observability platform built on Prometheus, Loki, and Grafana for unified metrics, logs, and alerting across distributed infrastructure. Production-grade, deployed at enterprise scale.

Background

Before this platform existed, the monitoring story at Accent Group was fragmented: some teams had Datadog agents on some servers, others had nothing, and alerting was a mix of email thresholds and manual checks. The business operates 800+ retail stores with dependencies across point-of-sale systems, payment gateways, inventory management, and e-commerce. When something goes wrong, you need to know within minutes — not from a customer complaint.

I chose the open-source PLG stack (Prometheus, Loki, Grafana) deliberately. Vendor lock-in with observability tooling is expensive and limits your ability to evolve the platform. Prometheus gives you a pull-based metrics model that scales predictably, Loki indexes log streams by label without parsing the full text (which keeps storage costs rational), and Grafana provides a query and visualisation layer that anyone on the team can learn. The total cost is infrastructure — no per-seat or per-host licensing.

The alerting model was the hardest part to get right. Raw Prometheus alerts are easy to create and easy to make noisy. I built a severity tiering framework that distinguishes between symptoms (a metric threshold breached) and causes (a service genuinely degraded), and routed each tier differently — critical pages on-call via PagerDuty, high goes to Slack, medium to email digest. That discipline reduced alert fatigue significantly and improved the signal-to-noise ratio for the operations team.

The templated Grafana dashboards with multi-environment variable switching mean the same dashboard works for dev, staging, and production — engineers drill down by selecting the environment rather than maintaining separate dashboards. OpenTelemetry instrumentation was added progressively as applications were updated, giving us trace-level visibility in addition to metrics and logs.

Highlights

  • PromQL alerting rules with severity tiers routed via Alertmanager
  • Node Exporter, cAdvisor, and Blackbox Exporter integrations
  • Label-based log indexing via Loki for efficient storage without full-text overhead
  • Templated Grafana dashboards with multi-environment variable switching
  • Alert notification channels: email, Slack, PagerDuty
← All projects GitHub ↗
← Phenomenal infractl →