Friday, 20 February 2026

Grafana Observability Stack

Grafana uses these components together as an observability stack, but each has a clear role:


Loki – log database. It stores and indexes logs (especially from Kubernetes) in a cost‑efficient, label‑based way, similar to Prometheus but for logs.

Tempo – distributed tracing backend. It stores distributed traces (spans) from OpenTelemetry, Jaeger, Zipkin, etc., so you can see call flows across microservices and where latency comes from.

Mimir – Prometheus‑compatible metrics backend. It is a horizontally scalable, long‑term storage and query engine for Prometheus‑style metrics (time series).

Alloy – telemetry pipeline (collector). It is Grafana’s distribution of the OpenTelemetry Collector, covering the same ground as the Prometheus agent and Promtail, and is used to collect, process, and forward metrics, logs, traces, and profiles into Loki/Tempo/Mimir (or other backends).
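
To make the Alloy piece concrete, here is a minimal sketch (Python with the OpenTelemetry SDK) of an app sending a trace span over OTLP to an in‑cluster Alloy, which then forwards it on to Tempo. The service name and the alloy.monitoring:4317 endpoint are illustrative assumptions, not a prescribed setup.

```python
# Minimal sketch: an app emitting a trace span over OTLP to Alloy, which
# forwards it to Tempo. The service name and endpoint are assumptions.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
# Alloy is assumed to expose an OTLP/gRPC receiver on port 4317 in-cluster.
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://alloy.monitoring:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("charge-card"):
    pass  # business logic goes here; the span ends up in Tempo via Alloy
```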


How Grafana UI relates to them


Grafana UI itself is “just” the visualization and alerting layer:

  • It connects to Loki, Tempo, Mimir (and many others) as data sources.
  • For each backend you configure:
    • A Loki data source for logs.
    • A Tempo data source for traces.
    • A Prometheus/Mimir data source for metrics (Mimir exposes a Prometheus‑compatible API).
  • Grafana then lets you:
    • Build dashboards and alerts from Mimir metrics.
    • Explore logs from Loki.
    • Explore traces from Tempo and cross‑link them with logs/metrics (e.g., click from a log line to a trace, or from a metrics graph into logs/traces).

A useful mental model: Loki/Tempo/Mimir are databases, Alloy is the collector/router, and Grafana is the UI on top.
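
Under the hood those “databases” are just HTTP APIs, and a Grafana data source is essentially a saved way of calling them. A rough sketch of the equivalent queries from Python, assuming default in‑cluster service names, ports, and a tenant header (all of which will differ per install):

```python
# Sketch of the HTTP calls a Grafana data source makes. Hostnames, ports,
# and the tenant header are illustrative assumptions for a default
# in-cluster install; adjust to your environment.
import time

import requests

MIMIR = "http://mimir.monitoring:8080/prometheus"  # Prometheus-compatible API
LOKI = "http://loki.monitoring:3100"
HEADERS = {"X-Scope-OrgID": "dev"}  # tenant ID, needed in multi-tenant setups

# PromQL against Mimir, exactly what a Prometheus/Mimir data source sends
metrics = requests.get(
    f"{MIMIR}/api/v1/query",
    params={"query": "sum(rate(http_requests_total[5m])) by (namespace)"},
    headers=HEADERS,
).json()

# LogQL against Loki over the last hour (timestamps are in nanoseconds)
now = time.time_ns()
logs = requests.get(
    f"{LOKI}/loki/api/v1/query_range",
    params={
        "query": '{namespace="checkout"} |= "error"',
        "start": now - 3_600_000_000_000,
        "end": now,
    },
    headers=HEADERS,
).json()

print(metrics["status"], logs["status"])
```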


Are they deployed in the same Kubernetes cluster?


Common patterns:

  • Very common: deploy Loki, Tempo, Mimir, Alloy, and Grafana in the same Kubernetes cluster as your apps. This is the typical “in‑cluster LGTM” setup; all telemetry stays inside the cluster and traffic is simple.
  • Also common: run them in a separate observability cluster (or use Grafana Cloud backends), while Alloy/agents run in each workload cluster and ship data over the network. This improves isolation and makes it easier to share one observability stack across many clusters.
  • In smaller setups or dev environments, everything (apps + LGTM + Grafana) often lives in one cluster; in larger/regulated setups, people tend to separate “workload clusters” and an “observability cluster”.

So: they don’t have to be on the same cluster, but it’s perfectly normal (and often simplest) to run Grafana + Loki + Tempo + Mimir + Alloy together in a single Kubernetes cluster and point your apps’ telemetry to Alloy.


Why not use Elasticsearch instead of Loki, Tempo, and Mimir?


Elasticsearch can replace part of what Loki, Tempo, and Mimir do, but not all of it, and usually with higher cost/complexity for cloud‑native observability.

1. Scope: logs vs full observability


Elasticsearch is a general search and analytics engine that’s great at full‑text search, aggregations, and analytics over documents (including logs).

The LGTM stack is explicitly split by signal:
  • Loki → logs
  • Tempo → traces
  • Mimir → metrics

Each is optimized only for its signal type and integrates tightly with Grafana and modern telemetry standards.

You could plausibly replace Loki with Elasticsearch for logs, but Elasticsearch does not natively replace Tempo (distributed tracing backend) or Mimir (Prometheus‑compatible metrics backend).

2. Logs: Loki vs Elasticsearch


Elasticsearch strengths:
  • Very powerful full‑text search, fuzzy matching, relevance scoring, complex aggregations.
  • Good when you need deep forensic search and advanced analytics on log text.

Loki strengths:
  • Stores logs as compressed chunks plus a small label index, so storage and compute are much cheaper than Elasticsearch for typical Kubernetes logs.
  • Very tight integration with Grafana and the rest of LGTM, and simple, label‑based querying.

Trade‑off: Elasticsearch gives richer search at a higher infrastructure and ops cost, while Loki gives “good enough” search for day‑to‑day troubleshooting with much lower cost and operational burden.
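
To make the indexing difference concrete, here is a rough sketch of the same log line being written to each system. Only the stream labels hit Loki’s index, while Elasticsearch indexes the whole document up front. Hostnames, the index name, and the label/field names are assumptions for illustration:

```python
# Sketch: the same log line landing in Loki vs Elasticsearch. Loki indexes
# only the stream labels and stores the line in a compressed chunk;
# Elasticsearch analyzes and indexes the whole document. Hostnames, the
# index name, and label/field names are illustrative assumptions (no auth).
import time

import requests

line = 'GET /checkout 500 "payment provider timeout"'

# Loki push API: labels go in "stream", the raw line goes in "values"
requests.post(
    "http://loki.monitoring:3100/loki/api/v1/push",
    json={
        "streams": [
            {
                "stream": {"namespace": "checkout", "app": "payments"},
                "values": [[str(time.time_ns()), line]],
            }
        ]
    },
)

# Elasticsearch: the full document (including the message text) is indexed
requests.post(
    "http://elasticsearch:9200/app-logs/_doc",
    json={
        "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "kubernetes": {"namespace": "checkout", "labels": {"app": "payments"}},
        "message": line,
    },
)
```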

3. Traces and metrics: Tempo & Mimir vs “just ES”


Tempo:
  • Implements distributed tracing concepts (spans, traces, service graphs) and OpenTelemetry/Jaeger/Zipkin protocols; the data model and APIs are specialized for traces.
  • Elasticsearch can store trace‑like JSON documents, but you’d have to build/maintain all the trace stitching, UI navigation, and integrations yourself.

Mimir:
  • Is a horizontally scalable, Prometheus‑compatible time‑series database, with native remote‑write/read and PromQL semantics.
  • Elasticsearch can store time‑stamped metrics, but you lose Prometheus compatibility, PromQL semantics, and the whole ecosystem that expects a Prometheus‑style API.

So using only Elasticsearch means you’re giving up the standard metrics and tracing ecosystems and rebuilding a lot of tooling on top of a generic search engine.
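
To see what Prometheus compatibility buys in practice, here is a minimal sketch of the app side: the service exposes standard Prometheus metrics, a collector such as Alloy scrapes them and remote‑writes them to Mimir, and existing PromQL dashboards and alerts keep working unchanged; with Elasticsearch as the metrics store you would need different instrumentation or a translation layer. Port and metric names are illustrative assumptions:

```python
# Minimal sketch: Prometheus-style instrumentation that a collector such as
# Alloy can scrape and remote-write to Mimir (the scrape/remote-write config
# is assumed to live on the Alloy side and is not shown here).
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "HTTP requests", ["path", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["path"])

start_http_server(8000)  # serves /metrics in Prometheus exposition format

while True:
    # stand-in for real request handling
    with LATENCY.labels(path="/checkout").time():
        time.sleep(random.uniform(0.01, 0.2))
    REQUESTS.labels(path="/checkout", status="200").inc()
```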

4. Cost, complexity, and operational burden


Elasticsearch clusters generally need:
  • More RAM/CPU per node, careful shard and index management, and capacity planning.
  • Storage overhead from full‑text indexes (often 1.5–3× raw log size plus replicas).

Loki/Tempo/Mimir:
  • Are designed for object storage, compression, and label‑only indexing, which dramatically lowers storage and compute requirements for logs and metrics.
  • Have simpler, well‑documented reference architectures specifically for observability.

For a modern Kubernetes‑centric environment, that usually makes LGTM cheaper and easier to run than a single big Elasticsearch cluster for everything.

5. When Elasticsearch still makes sense


You might still choose Elasticsearch (often with Kibana/APM) if:
  • You already have a strong ELK stack and team expertise.
  • Your primary need is deep, flexible text search and analytics over logs, with less emphasis on Prometheus/OTel ecosystems.
  • You want Elasticsearch’s ML/anomaly‑detection features and are willing to pay the operational cost.

But if your goal is a Grafana‑centric, standards‑based (Prometheus + OpenTelemetry) observability platform, LGTM (Loki+Tempo+Mimir, plus Alloy as collector) is a better fit than trying to push everything into Elasticsearch.
