Why "just use CloudWatch" isn't enough
CloudWatch is useful. It's also expensive at volume, awkward to query, and produces dashboards that tell you something is wrong after the fact. For teams running production workloads, the defaults leave serious blind spots: no distributed tracing, no structured log correlation, and alert fatigue from metrics that lack context.
Here's the observability stack we've converged on after running a dozen cloud-native systems in production.
The three pillars, in practice
Logs
Structured JSON logs from day one. Every log line includes traceId, requestId, userId (where available), service, and level. We ship to either OpenSearch (self-hosted on AWS) or Axiom (managed, much cheaper at moderate volume) depending on the client's compliance requirements.
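As a rough illustration of that log shape, here is a minimal sketch of a formatter that always emits the correlation fields. The field names come from the list above; the helper name formatLog, the timestamp and message fields, and the sample values are assumptions made for the example, not a prescribed schema.

```typescript
type Level = "debug" | "info" | "warn" | "error";

interface LogContext {
  traceId: string;
  requestId: string;
  userId?: string; // only set where a user is known
  service: string;
}

// Hypothetical helper: builds one JSON log line from a context and a message.
function formatLog(
  ctx: LogContext,
  level: Level,
  message: string,
  extra: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  // Spread `extra` first so it can never overwrite the core fields below.
  return JSON.stringify({
    ...extra,
    timestamp: now.toISOString(),
    level,
    service: ctx.service,
    traceId: ctx.traceId,
    requestId: ctx.requestId,
    ...(ctx.userId !== undefined ? { userId: ctx.userId } : {}),
    message,
  });
}

// Example usage with made-up identifiers.
console.log(
  formatLog(
    { traceId: "4bf92f3577b34da6a3ce929d0e0e4736", requestId: "req-123", service: "checkout" },
    "info",
    "order created",
    { orderId: "ord-42" },
  ),
);
```

Keeping the core fields last in the object means a careless extra field such as level cannot silently corrupt the values you search on.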
The key rule: log at the boundary, not in the middle. Log incoming requests and outgoing responses. Don't log every function call — that's what traces are for.
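One way to enforce the boundary rule is to wrap each handler so that logging happens exactly twice per request, once on the way in and once on the way out. The sketch below is framework-free and hypothetical: the IncomingRequest and OutgoingResponse shapes, the event names, and withBoundaryLogging itself are assumptions for illustration, and a real service would adapt this to its HTTP framework's middleware API.

```typescript
type LogFn = (entry: Record<string, unknown>) => void;

interface IncomingRequest {
  method: string;
  path: string;
  requestId: string;
  traceId: string;
}

interface OutgoingResponse {
  status: number;
}

type Handler = (req: IncomingRequest) => Promise<OutgoingResponse>;

// Wraps a handler so the only log lines are at the request/response boundary.
function withBoundaryLogging(
  service: string,
  handler: Handler,
  log: LogFn = (entry) => console.log(JSON.stringify(entry)),
): Handler {
  return async (req) => {
    const base = {
      service,
      traceId: req.traceId,
      requestId: req.requestId,
      method: req.method,
      path: req.path,
    };
    const startedAt = Date.now();
    log({ ...base, level: "info", event: "request.received" });
    try {
      const res = await handler(req);
      log({
        ...base,
        level: res.status >= 500 ? "error" : "info",
        event: "response.sent",
        status: res.status,
        durationMs: Date.now() - startedAt,
      });
      return res;
    } catch (err) {
      // An unhandled failure is still a boundary event: log it once, then rethrow.
      log({
        ...base,
        level: "error",
        event: "request.failed",
        error: err instanceof Error ? err.message : String(err),
        durationMs: Date.now() - startedAt,
      });
      throw err;
    }
  };
}
```

The handler body stays free of log calls; anything finer-grained than these two lines belongs in a span.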
Traces
OpenTelemetry for instrumentation — it's vendor-neutral and has first-class support in Node.js, Python, and Go. We send traces to either AWS X-Ray (if the client is already AWS-native) or Jaeger on a small EC2 instance for cost-sensitive setups.
The single most valuable thing traces give you: the ability to find the slow span in a multi-service request. When a user reports that "the checkout is slow," a trace shows you exactly whether the bottleneck is the payment gateway call, the inventory check, or the database write.
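To make "find the slow span" concrete, here is a small sketch of the arithmetic a trace viewer does for you: each span's self time is its duration minus the time spent in its direct children, and the bottleneck is the span with the largest self time. The Span shape and the checkout trace are invented for the example, and the calculation assumes sibling spans do not overlap.

```typescript
interface Span {
  spanId: string;
  parentId: string | null;
  name: string;
  startMs: number;
  endMs: number;
}

// Returns the span with the largest self time (own duration minus direct children).
// Assumes children of the same parent run sequentially, not concurrently.
function slowestBySelfTime(spans: Span[]): { span: Span; selfMs: number } | null {
  const childTime = new Map<string, number>();
  for (const s of spans) {
    if (s.parentId !== null) {
      childTime.set(s.parentId, (childTime.get(s.parentId) ?? 0) + (s.endMs - s.startMs));
    }
  }
  let worst: { span: Span; selfMs: number } | null = null;
  for (const s of spans) {
    const selfMs = Math.max(0, s.endMs - s.startMs - (childTime.get(s.spanId) ?? 0));
    if (worst === null || selfMs > worst.selfMs) {
      worst = { span: s, selfMs };
    }
  }
  return worst;
}

// A made-up 900 ms checkout request with three downstream calls.
const checkoutTrace: Span[] = [
  { spanId: "a", parentId: null, name: "POST /checkout", startMs: 0, endMs: 900 },
  { spanId: "b", parentId: "a", name: "inventory-check", startMs: 10, endMs: 60 },
  { spanId: "c", parentId: "a", name: "payment-gateway", startMs: 70, endMs: 820 },
  { spanId: "d", parentId: "a", name: "db-write", startMs: 830, endMs: 890 },
];

const bottleneck = slowestBySelfTime(checkoutTrace);
if (bottleneck !== null) {
  console.log(`${bottleneck.span.name}: ${bottleneck.selfMs} ms`); // payment-gateway: 750 ms
}
```

The root span is 900 ms long but only 40 ms of that is its own work, which is why ranking by raw duration would point at the wrong place.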
Metrics and alerts
We use Prometheus + Grafana for service-level metrics: request rate, error rate, and p50/p95/p99 latency (the RED method: rate, errors, duration). Alerts are wired to PagerDuty or Slack depending on severity.
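For readers new to these numbers, the sketch below shows the arithmetic behind them on a window of raw request samples. It is illustrative only: in a Prometheus setup these values would normally come from counters and histograms queried in PromQL rather than from raw samples, and the RequestSample shape, the nearest-rank percentile method, and the choice to count only 5xx as errors are assumptions.

```typescript
interface RequestSample {
  durationMs: number;
  status: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return NaN;
  const rank = Math.ceil((p * sortedAsc.length) / 100);
  const index = Math.min(sortedAsc.length - 1, Math.max(0, rank - 1));
  return sortedAsc[index] ?? NaN;
}

// Rate, Errors, Duration for one observation window.
function redSummary(samples: RequestSample[], windowSeconds: number) {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  const errors = samples.filter((s) => s.status >= 500).length;
  return {
    ratePerSec: samples.length / windowSeconds,
    errorRate: samples.length === 0 ? 0 : errors / samples.length,
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  };
}

// Synthetic window: 100 requests in 60 s, latencies 1..100 ms, 5 server errors.
const windowSamples: RequestSample[] = Array.from({ length: 100 }, (_, i) => ({
  durationMs: i + 1,
  status: i < 5 ? 500 : 200,
}));
console.log(redSummary(windowSamples, 60));
```

The gap between p50 and p99 is the point of tracking all three: an average would hide the slow tail that users actually complain about.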
Critical rule: every alert must be actionable. If an alert fires and the on-call engineer has no clear first step, the alert is noise. We audit alert runbooks quarterly.
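Part of that audit can be automated. The sketch below checks that each alert definition carries a runbook link, a stated first step, and a recent review date. Everything here is hypothetical: the AlertRule shape is not any particular tool's format, and the 90-day threshold is just one reading of "quarterly".

```typescript
interface AlertRule {
  name: string;
  severity: "page" | "notify";
  runbookUrl?: string;
  firstStep?: string;
  runbookReviewedAt?: string; // ISO date, e.g. "2024-01-10"
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns one human-readable problem per missing or stale piece of runbook data.
function auditAlerts(rules: AlertRule[], today: Date, maxAgeDays: number = 90): string[] {
  const problems: string[] = [];
  for (const rule of rules) {
    if (!rule.runbookUrl) {
      problems.push(`${rule.name}: no runbook link`);
    }
    if (!rule.firstStep || rule.firstStep.trim() === "") {
      problems.push(`${rule.name}: no first step for the on-call engineer`);
    }
    if (rule.runbookReviewedAt === undefined) {
      problems.push(`${rule.name}: runbook never reviewed`);
    } else {
      const ageDays = (today.getTime() - new Date(rule.runbookReviewedAt).getTime()) / MS_PER_DAY;
      if (ageDays > maxAgeDays) {
        problems.push(`${rule.name}: runbook not reviewed in ${Math.floor(ageDays)} days`);
      }
    }
  }
  return problems;
}
```

Run as a CI check against the alert definitions, a non-empty result can block new alerts from shipping without a first step.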
The Cloudflare Workers gotcha
If you're running on Cloudflare Workers, the observability story is different. Workers run in short-lived isolates with no long-running process and no in-memory state you can rely on between requests, so you can't run a local Prometheus exporter. We use Cloudflare's built-in analytics for request/error rates and pipe structured logs to Axiom via Workers' ctx.waitUntil to avoid blocking the response path.
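A minimal sketch of the ctx.waitUntil pattern follows. Several details are assumptions to verify against current docs: the Axiom ingest URL, the AXIOM_TOKEN and AXIOM_DATASET binding names, the use of the cf-ray header as a request ID, and the presence of a W3C traceparent header from upstream. The ExecutionContextLike interface is a stand-in for the type normally supplied by Cloudflare's type definitions.

```typescript
// Stand-in for the Workers ExecutionContext type.
interface ExecutionContextLike {
  waitUntil(promise: Promise<unknown>): void;
}

// Hypothetical binding names; configure these as Worker secrets/vars.
interface Env {
  AXIOM_TOKEN: string;
  AXIOM_DATASET: string;
}

async function shipLog(env: Env, event: Record<string, unknown>): Promise<void> {
  // Assumed ingest endpoint; check the Axiom API docs before relying on it.
  const url = `https://api.axiom.co/v1/datasets/${env.AXIOM_DATASET}/ingest`;
  try {
    await fetch(url, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${env.AXIOM_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify([event]),
    });
  } catch {
    // Logging must never break the request; drop the event on failure.
  }
}

const worker = {
  async fetch(request: Request, env: Env, ctx: ExecutionContextLike): Promise<Response> {
    const startedAt = Date.now();
    const response = new Response("ok"); // placeholder for real request handling

    // traceparent format: version-traceId-parentId-flags
    const traceId = request.headers.get("traceparent")?.split("-")[1] ?? null;

    // waitUntil keeps the Worker alive until the log is sent, without delaying the response.
    ctx.waitUntil(
      shipLog(env, {
        level: "info",
        service: "edge-api",
        traceId,
        requestId: request.headers.get("cf-ray") ?? "unknown",
        method: request.method,
        path: new URL(request.url).pathname,
        status: response.status,
        durationMs: Date.now() - startedAt,
      }),
    );
    return response;
  },
};

// In a real Worker module this object would be the default export: export default worker;
```

The important property is that the response is returned without awaiting shipLog, so a slow or failing log backend adds nothing to user-facing latency.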
Start here if you're starting from zero
- Add structured JSON logging with a trace ID in week one.
- Set up uptime monitoring (Better Uptime or Checkly) before launch — it takes 30 minutes.
- Add p95 latency and error rate alerts before you have users. Baseline them, then alert on deviation.
- Add distributed tracing in month two, once you have enough load to make it meaningful.
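The "baseline, then alert on deviation" step from the list above can be sketched in a few lines. This is one simple approach under stated assumptions, not the only way to do it: the baseline is the mean and standard deviation of recent p95 readings, the alert fires when the current value exceeds the mean by more than three standard deviations, and the function names and sample numbers are made up.

```typescript
interface Baseline {
  mean: number;
  stdDev: number;
}

// Summarises historical readings (for example, hourly p95 latency in ms).
function computeBaseline(values: number[]): Baseline {
  if (values.length === 0) {
    throw new Error("need at least one baseline sample");
  }
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// True when the current reading is more than `sigmas` standard deviations above the mean.
function deviatesFromBaseline(current: number, baseline: Baseline, sigmas: number = 3): boolean {
  return current > baseline.mean + sigmas * baseline.stdDev;
}

// Made-up p95 readings from a quiet week: mean 200 ms, standard deviation about 7 ms.
const p95Baseline = computeBaseline([200, 210, 190, 205, 195]);
console.log(deviatesFromBaseline(215, p95Baseline)); // false
console.log(deviatesFromBaseline(260, p95Baseline)); // true
```

With a very flat baseline the threshold sits close to the mean, so in practice you may want a minimum absolute margin as well to avoid paging on trivial changes.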