Monitoring

Key Concepts

AWS CloudWatch dashboards provide centralized monitoring of API Gateway, Lambda functions, and DynamoDB, with near-real-time insight into their performance and operational health. By aggregating metrics, logs, and alarms in one place, CloudWatch speeds up issue diagnosis and analysis across your serverless applications. Alarms add alerting on anomalous activity, so teams can mitigate issues proactively.

Service Architecture

[Figure: service architecture]

The goal is to monitor the service's API Gateway, Lambda function, and DynamoDB tables and confirm that each one is healthy.

In addition, we want to visualize service KPI metrics.

Monitoring Dashboards

We will define two dashboards:

  • High level
  • Low level

Each dashboard serves a different purpose and is tailored to different personas.

High Level Dashboard

[Figure: high-level dashboard]

This dashboard is designed to be an executive overview of the service.

Aggregate API Gateway metrics show the performance and error rate of the service.

KPI metrics appear at the bottom of the dashboard.

Personas that use this dashboard: SREs, developers, and product teams (KPIs).
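As a rough illustration of what the high-level dashboard holds, the sketch below builds a CloudWatch dashboard body (the JSON document that the `PutDashboard` API accepts) with three API Gateway widgets. The API name, region, and helper names are hypothetical, and the project itself defines its dashboards with CDK rather than by hand.

```python
import json

API_NAME = "orders-api"  # hypothetical API name, not taken from the project
REGION = "us-east-1"     # hypothetical region


def api_gateway_widget(title: str, metric: str, stat: str, x: int) -> dict:
    """Return one metric widget for a metric in the AWS/ApiGateway namespace."""
    return {
        "type": "metric",
        "x": x, "y": 0, "width": 8, "height": 6,
        "properties": {
            "title": title,
            "region": REGION,
            "period": 300,  # seconds per datapoint
            "stat": stat,
            "metrics": [["AWS/ApiGateway", metric, "ApiName", API_NAME]],
        },
    }


def high_level_dashboard_body() -> str:
    """Three widgets side by side on the 24-column dashboard grid."""
    widgets = [
        api_gateway_widget("Total requests", "Count", "Sum", x=0),
        api_gateway_widget("5xx errors", "5XXError", "Sum", x=8),
        api_gateway_widget("P90 latency (ms)", "Latency", "p90", x=16),
    ]
    return json.dumps({"widgets": widgets})
```

The returned string could be passed as `DashboardBody` to boto3's `put_dashboard` call. KPI widgets would follow the same shape, pointing at whatever custom namespace the service publishes its business metrics to.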

Low Level Dashboard

[Figure: low-level dashboard, DynamoDB section]

This dashboard is aimed at a deep dive into all of the service's resources. It requires an understanding of the service architecture and its moving parts.

The dashboard provides the Lambda function's metrics for latency, errors, throttles, provisioned concurrency, and total invocations.

In addition, a CloudWatch logs widget shows only 'error' logs from the Lambda function.

For DynamoDB, the dashboard covers both the primary table and the idempotency table, showing usage, operation latency, errors, and throttles.

Personas that use this dashboard: developers, SREs.
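To make the low-level widgets more concrete, here is a minimal sketch of three of them in CloudWatch dashboard-body form: a Lambda metrics widget, a Logs Insights widget that keeps only error logs, and a DynamoDB throttling widget. The function name, table names, and region are hypothetical. The log filter also assumes structured JSON logs with a `level` field, which may not match your logger.

```python
REGION = "us-east-1"                       # hypothetical region
FUNCTION_NAME = "orders-handler"           # hypothetical Lambda function name
TABLES = ["orders", "orders-idempotency"]  # hypothetical table names


def lambda_metrics_widget() -> dict:
    """Invocations, errors, and throttles for one function (a subset of the real dashboard)."""
    names = ["Invocations", "Errors", "Throttles"]
    return {
        "type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "title": "Lambda invocations, errors, throttles",
            "region": REGION, "period": 300, "stat": "Sum",
            "metrics": [["AWS/Lambda", n, "FunctionName", FUNCTION_NAME] for n in names],
        },
    }


def error_logs_widget() -> dict:
    """Logs Insights widget; assumes JSON logs that carry a `level` field."""
    query = (
        f"SOURCE '/aws/lambda/{FUNCTION_NAME}' "
        "| fields @timestamp, @message "
        "| filter level = 'ERROR' "
        "| sort @timestamp desc | limit 20"
    )
    return {
        "type": "log", "x": 12, "y": 0, "width": 12, "height": 6,
        "properties": {"title": "Error logs", "region": REGION, "query": query, "view": "table"},
    }


def dynamodb_throttle_widget() -> dict:
    """Read and write throttle events for the primary and idempotency tables."""
    metrics = [
        ["AWS/DynamoDB", name, "TableName", table]
        for table in TABLES
        for name in ("ReadThrottleEvents", "WriteThrottleEvents")
    ]
    return {
        "type": "metric", "x": 0, "y": 6, "width": 24, "height": 6,
        "properties": {"title": "DynamoDB throttles", "region": REGION,
                       "period": 300, "stat": "Sum", "metrics": metrics},
    }


def low_level_widgets() -> list:
    return [lambda_metrics_widget(), error_logs_widget(), dynamodb_throttle_widget()]
```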

Alarms

Having visibility and information is one thing, but being proactive and knowing beforehand that a significant error is looming is another. A CloudWatch Alarm is an automated notification tool within AWS CloudWatch that triggers alerts based on user-defined thresholds, enabling users to identify and respond to operational issues, breaches, or anomalies in AWS resources by monitoring specified metrics over a designated period.

In this dashboard, you will find an example of two types of alarms:

  • Alarm for performance threshold monitoring
  • Alarm for error rate threshold monitoring

For latency-related issues, we define the following alarm:

[Figure: P90 latency alarm]

For P90, P50 metrics, follow this explanation.
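To show why a latency alarm is usually defined on a percentile rather than on the average, here is a small sketch using the nearest-rank method on made-up latency samples. CloudWatch computes percentile statistics itself, and its exact method may differ; the helper and the data below are purely illustrative.

```python
import math


def percentile(samples: list, p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up request latencies in milliseconds, with one slow outlier.
latencies_ms = [120, 95, 110, 105, 2400, 130, 100, 115, 125, 90]

p50 = percentile(latencies_ms, 50)            # 110: the typical request
p90 = percentile(latencies_ms, 90)            # 130: 9 out of 10 requests are at least this fast
mean = sum(latencies_ms) / len(latencies_ms)  # 339.0: dragged up by the single outlier
```

In this sample, one slow request triples the average while P50 and P90 barely move, which is why a P90 threshold tends to reflect what most users experience.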

For the internal server error (5xx) rate, we define the following alarm:

[Figure: 5xx error rate alarm]
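An error-rate alarm of this kind can be expressed with CloudWatch metric math, dividing the 5xx count by the total request count. The sketch below only builds the keyword arguments for boto3's `put_metric_alarm`. The threshold, evaluation periods, and alarm name are hypothetical choices, not the values the project uses.

```python
def error_rate_alarm_params(api_name: str, topic_arn: str, threshold_percent: float = 1.0) -> dict:
    """Build put_metric_alarm arguments for a 5xx error-rate alarm (illustrative values)."""
    dimensions = [{"Name": "ApiName", "Value": api_name}]

    def summed(metric_id: str, metric_name: str) -> dict:
        # Input metric for the expression; ReturnData=False means it is not alarmed on directly.
        return {
            "Id": metric_id,
            "ReturnData": False,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApiGateway",
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": 300,
                "Stat": "Sum",
            },
        }

    return {
        "AlarmName": f"{api_name}-5xx-error-rate",  # hypothetical naming scheme
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold_percent,             # hypothetical threshold, in percent
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 2,
        "TreatMissingData": "notBreaching",         # no traffic is not treated as an error
        "AlarmActions": [topic_arn],
        "Metrics": [
            summed("errors", "5XXError"),
            summed("requests", "Count"),
            {
                "Id": "rate",
                "Expression": "100 * errors / requests",
                "Label": "5xx error rate (%)",
                "ReturnData": True,                 # the alarm evaluates this expression
            },
        ],
    }
```

With boto3 installed and credentials configured, the result could be applied with `boto3.client("cloudwatch").put_metric_alarm(**params)`.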

Actions

Alarms are of no use unless they have an action. We have configured the alarms to send an SNS notification to a new SNS topic. From there, you can connect any subscription, such as HTTPS, SMS, or email, to notify your teams with the alarm details.
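If one of the topic's subscribers is a Lambda function, it receives the alarm details as a JSON string inside the SNS record. The sketch below is a hypothetical subscriber that turns such an event into short, human-readable lines. The function name and output format are assumptions, and it reads only a few of the fields CloudWatch includes in the message.

```python
import json


def summarize_alarm(event: dict) -> list:
    """Turn an SNS-delivered CloudWatch alarm event into one summary line per record."""
    lines = []
    for record in event.get("Records", []):
        # CloudWatch publishes the alarm details as a JSON document in the SNS message body.
        alarm = json.loads(record["Sns"]["Message"])
        lines.append(
            f"[{alarm['NewStateValue']}] {alarm['AlarmName']}: {alarm['NewStateReason']}"
        )
    return lines


# Hand-written sample event, trimmed to the fields used above.
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "AlarmName": "orders-api-5xx-error-rate",
                        "NewStateValue": "ALARM",
                        "NewStateReason": "Threshold crossed",
                    }
                )
            }
        }
    ]
}
summary = summarize_alarm(sample_event)
```

The resulting lines could then be forwarded to a chat channel or a ticketing system of your choice.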

CDK Reference

We use the open-source cdk-monitoring-constructs.

You can find the monitoring CDK construct here.

Further Reading

If you wish to learn more about this concept and go over details on the CDK code, check out my blog post.