Stop Grepping Logs: Building an Observability Stack That Actually Tells You What's Broken
If your debugging process starts with 'grep' and ends with 'I have no idea what happened,' your monitoring is fundamentally broken. Here is how to build a Prometheus, Grafana, and Loki stack that pinpoints failures in seconds.
TL;DR
Logs without context are noise. Metrics without alerts are decoration. Traces without correlation are useless.
- The Stack: Prometheus (Metrics), Grafana (Visualization), Loki (Log Aggregation), Tempo (Distributed Tracing), and Alertmanager (Intelligent Alerting).
- The Verdict: A properly built observability stack reduces your Mean Time To Detection (MTTD) from hours to seconds. It is the difference between finding the bug and the bug finding your customers.
The "It Works on My Machine" Incident
It's Monday morning. Users are reporting intermittent 500 errors. Your backend team checks the application logs. Nothing. The database team checks RDS metrics. Everything looks normal. The network team checks the load balancer. All healthy.
Three hours later, a junior engineer discovers that a single Kubernetes pod has been OOMKilled (Out of Memory) and restarting every 90 seconds. The pod logs were lost on each restart because nobody configured persistent log shipping.
If your team is debugging production incidents by SSHing into servers and running 'grep,' you are operating at the speed of 2010.
Three hours of combined engineering time wasted because the infrastructure couldn't answer a simple question: "What broke, when, and why?"
The Three Pillars of Observability
Monitoring tells you that something is wrong. Observability tells you why.
Modern observability is built on three pillars:
| Pillar | What It Answers | Tool |
|---|---|---|
| Metrics | "Is the system healthy right now?" | Prometheus |
| Logs | "What happened leading up to the failure?" | Loki |
| Traces | "Which specific request failed and where in the chain?" | Tempo |
Most companies implement one or two of these pillars. The magic happens when you connect all three.
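Connecting them is mostly plumbing in Grafana. As a rough sketch (the datasource names, URLs, and the `trace_id=` log format here are assumptions, not something the stack prescribes), a provisioned Loki datasource can use a derived field to turn a trace ID found in a log line into a one-click jump to the matching Tempo trace:

```yaml
# Illustrative Grafana datasource provisioning file (names and URLs are placeholders)
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        # If a log line contains "trace_id=<id>", render it as a link
        # that opens that trace in the Tempo datasource above.
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'
          datasourceUid: tempo
          url: '$${__value.raw}'   # $$ escapes the $ from provisioning env-var expansion
```

The same idea works in the other direction: Tempo's trace-to-logs settings can jump from a span back to the Loki stream for the pod that produced it.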
Deep Dive 1: Metrics with Prometheus
Prometheus scrapes numerical time-series data from your applications and infrastructure. It answers questions like: "What is the current CPU usage? How many HTTP 500 errors per second? What is the p99 latency?"
The critical insight most teams miss: you must instrument your application, not just your infrastructure.
prometheus-servicemonitor.yaml
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-service
spec:
  selector:
    matchLabels:
      app: payment-service
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
```

Your application should expose a `/metrics` endpoint using a Prometheus client library. This gives you business-level metrics like:

- `payment_transactions_total` (How many payments processed?)
- `payment_duration_seconds` (How long does a payment take?)
- `payment_failures_total` (How many payments are failing?)
Infrastructure metrics tell you the server is on fire. Application metrics tell you the business is on fire.
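To make the "business is on fire" part concrete: once those counters are scraped, you can derive the number people actually page on. A minimal sketch using a Prometheus Operator PrometheusRule (the metric and rule names simply mirror the hypothetical payment service above):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-service-rules
spec:
  groups:
    - name: payment-service.business
      rules:
        # Precompute the fraction of failed payments over the last 5 minutes
        # so dashboards and alerts can reuse it without repeating the division.
        - record: payment:failure_ratio:rate5m
          expr: |
            sum(rate(payment_failures_total[5m]))
            /
            sum(rate(payment_transactions_total[5m]))
```

Dashboards and alert rules can then reference `payment:failure_ratio:rate5m` directly instead of recomputing the expression everywhere.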
Deep Dive 2: Logs with Loki (Not Elasticsearch)
For years, the default logging stack was the EFK stack (Elasticsearch, Fluentd, Kibana). It works, but Elasticsearch is a resource-hungry monster that requires dedicated engineers just to keep it alive.
Grafana Loki is the modern alternative. Instead of indexing the full text of every log line (like Elasticsearch), Loki only indexes the metadata labels (like namespace, pod, container). This makes it 10x cheaper to operate and massively faster to deploy.
loki-values.yaml
```yaml
loki:
  auth_enabled: false
  storage:
    type: s3
    s3:
      bucketnames: loki-logs-production
      region: eu-west-1
  limits_config:
    retention_period: 30d
```

With Loki deployed, every log line from every pod in your Kubernetes cluster is automatically shipped, labeled, and queryable from Grafana. When an alert fires, you click directly from the Prometheus metric to the exact log lines that occurred at the same timestamp. No more `kubectl logs`. No more guessing.
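One thing the values file above doesn't show: Loki only stores logs, it doesn't collect them. In Kubernetes that job usually falls to an agent such as Promtail (or Grafana Alloy) running as a DaemonSet. A trimmed-down sketch of a Promtail scrape config, assuming an in-cluster Loki push endpoint (the defaults shipped with the Promtail Helm chart are more complete):

```yaml
clients:
  # Assumed in-cluster Loki push endpoint; adjust to your Loki service name.
  - url: http://loki-gateway/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod               # discover every pod on the node
    relabel_configs:
      # Attach low-cardinality Kubernetes metadata as Loki labels.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
      # Point Promtail at the container log files on the node.
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log
```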
Deep Dive 3: The "Golden Signals" Alert Strategy
Most alerting configurations are broken because they alert on symptoms instead of signals. Getting an alert every time CPU crosses 70% is useless noise that trains your team to ignore alerts entirely.
Google's SRE handbook defines four Golden Signals that you should alert on:
- Latency — How long requests take (alert on p99 > threshold)
- Traffic — How many requests per second (alert on sudden drops)
- Errors — The rate of failed requests (alert on error rate > 1%)
- Saturation — How "full" a resource is (alert on memory > 90%)
alertmanager-rules.yaml
```yaml
groups:
  - name: golden-signals
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeds 1% for 5 minutes"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency exceeds 2 seconds"
```

With Golden Signals, your team only gets paged when users are actually impacted. No more alert fatigue. No more "the CPU alert fired again, just ignore it."
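Those `severity` labels only pay off if Alertmanager routes them differently. A sketch of a route tree that pages humans on critical alerts and quietly posts warnings to chat (receiver names, keys, and URLs are placeholders):

```yaml
route:
  receiver: slack-warnings        # default: anything not matched below goes to Slack
  group_by: [alertname, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Only user-impacting (critical) golden-signal alerts page the on-call engineer.
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>   # placeholder
  - name: slack-warnings
    slack_configs:
      - channel: "#alerts"
        api_url: <slack-webhook-url>               # placeholder
```

The grouping and `repeat_interval` keep a single incident from turning into a page storm.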
The Operational Reality (What Breaks)
Building observability is an ongoing investment, not a one-time project:
- Cardinality Explosion: If you add a `user_id` label to a Prometheus metric, you will create millions of unique time series and crash your Prometheus server. Labels must be low-cardinality (e.g., `status_code`, `method`, `service`); one mitigation is shown in the sketch after this list.
- Log Volume Costs: If your application logs every HTTP request body in production, your Loki S3 bucket will cost thousands per month. Log judiciously: log errors in full detail, log successful requests as summaries.
- Dashboard Rot: Grafana dashboards that nobody looks at are technical debt. Every dashboard must be tied to a runbook. If there's no action to take, delete the dashboard.
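On the cardinality point, the real fix is not emitting `user_id` as a label at all. If the metric comes from code you don't own, you can at least strip the label at scrape time; a sketch extending the earlier ServiceMonitor (it assumes the offending label really is called `user_id`):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-service
spec:
  selector:
    matchLabels:
      app: payment-service
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      metricRelabelings:
        # Drop the high-cardinality label before Prometheus stores the series.
        # Only safe if the remaining labels still uniquely identify each series.
        - action: labeldrop
          regex: user_id
```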
The Payoff
When observability is done right, incident response transforms completely.
Instead of a 3-hour investigation involving 5 engineers, a single on-call engineer opens Grafana, sees the error rate spike on the golden signals dashboard, clicks through to the correlated Loki logs, identifies the failing microservice, and rolls it back using ArgoCD—all within 5 minutes.
That is the difference between a $50,000 outage and a 5-minute blip.
Is your team flying blind in production? If your debugging strategy is "SSH in and grep," you are wasting hours of engineering time on every incident.
I build observability stacks that reduce your incident response time from hours to minutes.
Stop guessing. Start seeing. Book a Free Infrastructure Audit.