This page covers where to look when you want to see what the platform is doing, and how alarms get from a metric to incident.io.
Metrics and dashboards
All metrics, dashboards, and alarms live in Amazon CloudWatch. Default AWS metrics are augmented with custom metrics emitted from application code where there’s something specific worth tracking.
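As an illustration of emitting a custom metric from application code, here is a minimal sketch using the CloudWatch Embedded Metric Format (EMF), where a specially shaped JSON log line is turned into a metric. This page does not say which mechanism the platform actually uses, so treat EMF as an assumption; the namespace, dimension, and metric names are hypothetical.

```python
import json
import time


def emf_record(namespace, service, metric_name, value, unit="Count"):
    """Build one CloudWatch Embedded Metric Format (EMF) log record."""
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [["Service"]],  # a single dimension set
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # Dimension and metric values sit at the top level of the record.
        "Service": service,
        metric_name: value,
    }


if __name__ == "__main__":
    # Hypothetical namespace, service, and metric names.
    print(json.dumps(emf_record("Platform/Custom", "Api", "CheckoutRetries", 3)))
```

Whether a printed record becomes a metric depends on how the runtime ships logs to CloudWatch, which is outside the scope of this sketch.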
SLOs are codified in Terraform. If you want to know what counts as
“production health degraded” for a service, look at its alarms in
infra/providers/aws/03-environment/03-application/.
Application logs land in CloudWatch Logs. Day-to-day querying is through CloudWatch Logs Insights. Logs are JSON-structured; the infrastructure annotates each line with the service name, correlation ID, and request ID automatically.
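Because every line carries a correlation ID, a typical Logs Insights query filters on that field. The helper below is a small sketch that builds such a query string; the field name `correlation_id` is an assumption, so check the actual key in a real log line before using it.

```python
def insights_query(field, value, limit=50):
    """Return a CloudWatch Logs Insights query filtering on one JSON field."""
    # Keep the sketch simple: refuse values that would need escaping.
    if '"' in value or "\\" in value:
        raise ValueError("value must not contain quotes or backslashes")
    return (
        "fields @timestamp, @message\n"
        f'| filter {field} = "{value}"\n'
        "| sort @timestamp asc\n"
        f"| limit {limit}"
    )


if __name__ == "__main__":
    # Hypothetical field name and correlation ID.
    print(insights_query("correlation_id", "abc-123"))
```

Paste the printed query into the Logs Insights console with the relevant log groups selected.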
What not to log
Don’t log personal data: no email addresses, no full names, no phone numbers, no addresses, no customer-uploaded content. Use identifiers (user ID, tenant ID, resource ID) instead. The same applies to credentials: no passwords, no API keys, no OAuth or session tokens, no AWS credentials.
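A minimal sketch of a scrubbing helper that could run over a structured log payload before it is written. It is a safety net under stated assumptions, not a replacement for logging identifiers in the first place: the key list and the email pattern are illustrative and will not catch every form of personal data or credential.

```python
import re

# Hypothetical set of field names whose values must never be logged.
SENSITIVE_KEYS = {"password", "api_key", "token", "authorization", "email", "phone"}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
REDACTED = "[REDACTED]"


def scrub(value, key=None):
    """Return a copy of value with sensitive fields and email-like text masked."""
    if isinstance(key, str) and key.lower() in SENSITIVE_KEYS:
        return REDACTED
    if isinstance(value, dict):
        return {k: scrub(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [scrub(v) for v in value]
    if isinstance(value, str):
        # Mask anything that looks like an email address inside free text.
        return EMAIL_RE.sub(REDACTED, value)
    return value


if __name__ == "__main__":
    event = {"user_id": "u_123", "password": "hunter2", "note": "mail bob@example.com"}
    print(scrub(event))
```

Identifiers such as `user_id` pass through unchanged, which matches the guidance to log IDs rather than the data they point to.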
If you spot a credential in logs, treat it as a leaked credential under Vulnerability management and rotate the affected credential first.
Tracing
AWS X-Ray is enabled for distributed tracing. When you’re tracking a request across services, X-Ray is the place to start.
Alarms
Alarms are defined in Terraform and routed through SNS topics into incident.io. The pattern:
- Each service has its own SNS topic subscription with an AlarmName prefix filter, so an alarm called ApiGateway5xxErrors routes to the API service’s incident.io feed automatically.
- Alarm names follow <Service><Description> (e.g. ApiGateway5xxErrors, DatabaseConnectionFailures).
When you add a new alarm, follow the naming convention. The existing service-prefix filters in the Terraform then route it to the right incident.io feed without further changes.
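The prefix routing described above can be sketched in a few lines. This is an illustration only: the real filters live in Terraform, the service prefixes and feed names here are hypothetical, and the filter policy shape assumes the SNS subscriptions filter on the message body, where CloudWatch places the AlarmName field.

```python
import re

# Hypothetical mapping of alarm-name prefixes to incident.io feeds.
ROUTES = {
    "ApiGateway": "api-service",
    "Database": "database-service",
}

# <Service><Description>: PascalCase letters and digits, no separators.
ALARM_NAME_RE = re.compile(r"^[A-Z][A-Za-z0-9]*$")


def filter_policy(prefix):
    """SNS subscription filter policy matching alarms whose name starts with prefix."""
    return {"AlarmName": [{"prefix": prefix}]}


def route(alarm_name):
    """Return the feeds whose prefix filter would match this alarm name."""
    if not ALARM_NAME_RE.match(alarm_name):
        raise ValueError(f"alarm name does not follow <Service><Description>: {alarm_name!r}")
    return sorted(feed for prefix, feed in ROUTES.items() if alarm_name.startswith(prefix))


if __name__ == "__main__":
    print(filter_policy("ApiGateway"))
    print(route("ApiGateway5xxErrors"))
```

An alarm whose name matches no prefix is delivered to no feed, which is why a new alarm should reuse an existing service prefix.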
Synthetic monitoring
Sentry runs uptime pings against the production public endpoints.