Stop Grepping Logs: Building an Observability Stack That Actually Tells You What's Broken
If your debugging process starts with 'grep' and ends with 'I have no idea what happened,' your monitoring is fundamentally broken. Here is how to build a Prometheus, Grafana, and Loki stack that pinpoints failures in seconds.
TL;DR
Logs without context are noise. Metrics without alerts are decoration. Traces without correlation are useless.
- The Stack: Prometheus (Metrics), Grafana (Visualization), Loki (Log Aggregation), Tempo (Distributed Tracing), and Alertmanager (Intelligent Alerting).
- The Verdict: A properly built observability stack reduces your Mean Time To Detection (MTTD) from hours to seconds. It is the difference between finding the bug and the bug finding your customers.
The "It Works on My Machine" Incident
It's Monday morning. Users are reporting intermittent 500 errors. Your backend team checks the application logs. Nothing. The database team checks RDS metrics. Everything looks normal. The network team checks the load balancer. All healthy.
Three hours later, a junior engineer discovers that a single Kubernetes pod has been OOMKilled (Out of Memory) and restarting every 90 seconds. The pod logs were lost on each restart because nobody configured persistent log shipping.
If your team is debugging production incidents by SSHing into servers and running 'grep,' you are operating at the speed of 2010.
Three hours of combined engineering time wasted because the infrastructure couldn't answer a simple question: "What broke, when, and why?"
The Three Pillars of Observability
Monitoring tells you that something is wrong. Observability tells you why.
Modern observability is built on three pillars:
| Pillar | What It Answers | Tool |
|---|---|---|
| Metrics | "Is the system healthy right now?" | Prometheus |
| Logs | "What happened leading up to the failure?" | Loki |
| Traces | "Which specific request failed and where in the chain?" | Tempo |
Most companies implement one or two of these pillars. The magic happens when you connect all three.
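Connecting them is mostly plumbing in Grafana. As a rough sketch (the datasource names, URLs, and the `trace_id=` log format here are assumptions, not something the stack prescribes), a provisioned Loki datasource can use a derived field to turn a trace ID found in a log line into a one-click jump to the matching Tempo trace:

```yaml
# Illustrative Grafana datasource provisioning file (names and URLs are placeholders)
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        # If a log line contains "trace_id=<id>", render it as a link
        # that opens that trace in the Tempo datasource above.
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'
          datasourceUid: tempo
          url: '$${__value.raw}'   # $$ escapes the $ from provisioning env-var expansion
```

The same idea works in the other direction: Tempo's trace-to-logs settings can jump from a span back to the Loki stream for the pod that produced it.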
Deep Dive 1: Metrics with Prometheus
Prometheus scrapes numerical time-series data from your applications and infrastructure. It answers questions like: "What is the current CPU usage? How many HTTP 500 errors per second? What is the p99 latency?"
The critical insight most teams miss: you must instrument your application, not just your infrastructure.
prometheus-servicemonitor.yaml
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-service
spec:
  selector:
    matchLabels:
      app: payment-service
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
```

Your application should expose a `/metrics` endpoint using a Prometheus client library. This gives you business-level metrics like:

- `payment_transactions_total` (How many payments processed?)
- `payment_duration_seconds` (How long does a payment take?)
- `payment_failures_total` (How many payments are failing?)
Infrastructure metrics tell you the server is on fire. Application metrics tell you the business is on fire.
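To make the "business is on fire" part concrete: once those counters are scraped, you can derive the number people actually page on. A minimal sketch using a Prometheus Operator PrometheusRule (the metric and rule names simply mirror the hypothetical payment service above):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-service-rules
spec:
  groups:
    - name: payment-service.business
      rules:
        # Precompute the fraction of failed payments over the last 5 minutes
        # so dashboards and alerts can reuse it without repeating the division.
        - record: payment:failure_ratio:rate5m
          expr: |
            sum(rate(payment_failures_total[5m]))
            /
            sum(rate(payment_transactions_total[5m]))
```

Dashboards and alert rules can then reference `payment:failure_ratio:rate5m` directly instead of recomputing the expression everywhere.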
Deep Dive 2: Logs with Loki (Not Elasticsearch)
For years, the default logging stack was the EFK stack (Elasticsearch, Fluentd, Kibana). It works, but Elasticsearch is a resource-hungry monster that requires dedicated engineers just to keep it alive.
Grafana Loki is the modern alternative. Instead of indexing the full text of every log line (like Elasticsearch), Loki only indexes the metadata labels (like namespace, pod, container). This makes it 10x cheaper to operate and massively faster to deploy.
loki-values.yaml
```yaml
loki:
  auth_enabled: false
  storage:
    type: s3
    s3:
      bucketnames: loki-logs-production
      region: eu-west-1
  limits_config:
    retention_period: 30d
```

With Loki deployed, every log line from every pod in your Kubernetes cluster is automatically shipped, labeled, and queryable from Grafana. When an alert fires, you click directly from the Prometheus metric to the exact log lines that occurred at the same timestamp. No more `kubectl logs`. No more guessing.
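One thing the values file above doesn't show: Loki only stores logs, it doesn't collect them. In Kubernetes that job usually falls to an agent such as Promtail (or Grafana Alloy) running as a DaemonSet. A trimmed-down sketch of a Promtail scrape config, assuming an in-cluster Loki push endpoint (the defaults shipped with the Promtail Helm chart are more complete):

```yaml
clients:
  # Assumed in-cluster Loki push endpoint; adjust to your Loki service name.
  - url: http://loki-gateway/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod               # discover every pod on the node
    relabel_configs:
      # Attach low-cardinality Kubernetes metadata as Loki labels.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
      # Point Promtail at the container log files on the node.
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log
```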
Deep Dive 3: The "Golden Signals" Alert Strategy
Most alerting configurations are broken because they alert on symptoms instead of signals. Getting an alert every time CPU crosses 70% is useless noise that trains your team to ignore alerts entirely.
Google's SRE handbook defines four Golden Signals that you should alert on:
- Latency — How long requests take (alert on p99 > threshold)
- Traffic — How many requests per second (alert on sudden drops)
- Errors — The rate of failed requests (alert on error rate > 1%)
- Saturation — How "full" a resource is (alert on memory > 90%)
alertmanager-rules.yaml
```yaml
groups:
  - name: golden-signals
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeds 1% for 5 minutes"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency exceeds 2 seconds"
```

With Golden Signals, your team only gets paged when users are actually impacted. No more alert fatigue. No more "the CPU alert fired again, just ignore it."
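Those `severity` labels only pay off if Alertmanager routes them differently. A sketch of a route tree that pages humans on critical alerts and quietly posts warnings to chat (receiver names, keys, and URLs are placeholders):

```yaml
route:
  receiver: slack-warnings        # default: anything not matched below goes to Slack
  group_by: [alertname, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Only user-impacting (critical) golden-signal alerts page the on-call engineer.
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>   # placeholder
  - name: slack-warnings
    slack_configs:
      - channel: "#alerts"
        api_url: <slack-webhook-url>               # placeholder
```

The grouping and `repeat_interval` keep a single incident from turning into a page storm.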
The Operational Reality (What Breaks)
Building observability is an ongoing investment, not a one-time project:
- Cardinality Explosion: If you add a `user_id` label to a Prometheus metric, you will create millions of unique time series and crash your Prometheus server. Labels must be low-cardinality (e.g., `status_code`, `method`, `service`); one mitigation is shown in the sketch after this list.
- Log Volume Costs: If your application logs every HTTP request body in production, your Loki S3 bucket will cost thousands per month. Log judiciously: log errors in full detail, log successful requests as summaries.
- Dashboard Rot: Grafana dashboards that nobody looks at are technical debt. Every dashboard must be tied to a runbook. If there's no action to take, delete the dashboard.
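On the cardinality point, the real fix is not emitting `user_id` as a label at all. If the metric comes from code you don't own, you can at least strip the label at scrape time; a sketch extending the earlier ServiceMonitor (it assumes the offending label really is called `user_id`):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-service
spec:
  selector:
    matchLabels:
      app: payment-service
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      metricRelabelings:
        # Drop the high-cardinality label before Prometheus stores the series.
        # Only safe if the remaining labels still uniquely identify each series.
        - action: labeldrop
          regex: user_id
```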
The Payoff
When observability is done right, incident response transforms completely.
Instead of a 3-hour investigation involving 5 engineers, a single on-call engineer opens Grafana, sees the error rate spike on the golden signals dashboard, clicks through to the correlated Loki logs, identifies the failing microservice, and rolls it back using ArgoCD—all within 5 minutes.
That is the difference between a $50,000 outage and a 5-minute blip.
Is your team flying blind in production? If your debugging strategy is "SSH in and grep," you are wasting hours of engineering time on every incident.
I build observability stacks that reduce your incident response time from hours to minutes.
Stop guessing. Start seeing. Book a Free Infrastructure Audit.