AWS CloudWatch Usage Guide

Intro to CloudWatch

Amazon CloudWatch monitors your Amazon Web Services (AWS) resources and the applications you run on AWS in real time. You can use CloudWatch to collect and track metrics, which are variables you can measure for your resources and applications [https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html].

Amazon CloudWatch console – https://eu-north-1.console.aws.amazon.com/cloudwatch/home?region=eu-north-1#

What can we monitor in AWS CloudWatch?

Metrics

AWS CloudWatch Metrics – https://eu-central-1.console.aws.amazon.com/cloudwatch/home?region=eu-central-1#metricsV2?graph=~()

Metrics are the fundamental concept in CloudWatch monitoring. A metric is a time-ordered set of data points that describes the behavior of a resource or application.

There are two types of metrics:

System Metrics: Collected automatically and sent to CloudWatch by AWS services such as EC2 and RDS. These are defined in “AWS namespaces” (for example, AWS/EC2).
Custom Metrics: Metrics that you define and publish to CloudWatch. These are defined in “Custom namespaces”.

These are the main concepts needed to understand how metric data is structured:

Namespace: A container for CloudWatch metrics, allowing differentiation between different services or applications.

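For example, EC2, EFS, ELB, and Events are all different namespaces.
Dimensions: Name/value pairs that are part of a metric's identity. The same metric name with different dimension values is a different metric.
For example, InstanceId is a dimension of EC2 metrics, and its value is the ID of one instance.
Units: The unit of measure of a metric (e.g., Seconds, Bytes, Count).
For example, the unit of a metric that counts requests is Count.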
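
The concepts above map directly onto the parameters of a custom metric. The following is a minimal sketch, assuming Python with boto3 and configured AWS credentials; the namespace, metric name, and dimension values are hypothetical and only illustrate the structure. The function that builds the data point runs without AWS access, and the call that actually publishes is left commented out.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit, dimensions):
    """Build one entry for the MetricData list of put_metric_data."""
    return {
        "MetricName": name,
        # Dimensions are name/value pairs; together with the namespace and
        # the metric name they identify the metric.
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


def publish(namespace, datum, region="eu-central-1"):
    """Send the data point to CloudWatch. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("cloudwatch", region_name=region)
    client.put_metric_data(Namespace=namespace, MetricData=[datum])


# Hypothetical names, for illustration only.
datum = build_metric_datum(
    name="OrdersProcessed",
    value=3,
    unit="Count",
    dimensions={"Service": "checkout", "Environment": "staging"},
)
print(datum["MetricName"], datum["Unit"], datum["Dimensions"])
# publish("MyApp/Checkout", datum)  # uncomment to actually send the data point
```

A metric published this way would appear under “Custom namespaces” in the console.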

After selecting a metric, you can change the following settings:

Label
Statistic – How the metric's data points are aggregated over a specified period of time. The primary statistics available in CloudWatch are:
1. Average: The sum of all data points divided by the number of data points.
2. Sum: The sum of all data points.
3. Minimum: The smallest data point value.
4. Maximum: The largest data point value.
5. Sample Count: The count of data points.
6. pXX (Percentiles): The value below which a given percentage of data points fall (for example, p90, p95, p99). Percentiles are used less often but are very useful for spotting outliers that an average hides.
Period – The length of time over which CloudWatch aggregates raw data points into a single data point (i.e., the statistic). Common periods are 1 minute, 5 minutes, 15 minutes, 1 hour, etc.
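
To make the relationship between statistic and period concrete, here is a small local simulation in plain Python. It is only a sketch of the idea, not how CloudWatch is implemented: the sample data is made up, and the percentile uses the simple nearest-rank method, which is an assumption and may differ from the values CloudWatch reports.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile (an assumption; CloudWatch may compute it differently)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def aggregate(points, period_seconds):
    """Group (timestamp_seconds, value) pairs into periods and compute statistics."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Each data point falls into the period that starts at or before it.
        start = ts - ts % period_seconds
        buckets[start].append(value)
    result = {}
    for start in sorted(buckets):
        vals = buckets[start]
        result[start] = {
            "Average": sum(vals) / len(vals),
            "Sum": sum(vals),
            "Minimum": min(vals),
            "Maximum": max(vals),
            "SampleCount": len(vals),
            "p90": percentile(vals, 90),
        }
    return result


# Made-up raw data points: (seconds since start, value).
points = [(0, 10), (20, 30), (40, 20), (60, 40), (90, 60)]
stats = aggregate(points, period_seconds=60)
for start, s in stats.items():
    print(start, s)
```

With a 60-second period, the five raw points collapse into two aggregated data points, one per period. Choosing a longer period smooths the graph; choosing a different statistic changes which aspect of the same raw data you see.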

Logs

AWS CloudWatch Log Groups – https://eu-central-1.console.aws.amazon.com/cloudwatch/home?region=eu-central-1#logsV2:log-groups

1 - Log group name: should be descriptive and ideally include the name of the service whose logs it stores.

2 - Retention period: how long log data is kept in the log group.

Log groups are used to group log streams that share the same retention, monitoring, and access control settings. For instance, you can create a log group for all logs from a particular application or system component.
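
The same two settings, name and retention, can also be applied from code. This is a minimal sketch assuming Python with boto3 and configured AWS credentials. The log group name is hypothetical, and the set of accepted retention values reflects the API at the time of writing, so check the PutRetentionPolicy reference before relying on it.

```python
# Retention values (in days) accepted by the API at the time of writing.
# Assumption: verify against the current PutRetentionPolicy reference.
VALID_RETENTION_DAYS = {
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731,
    1096, 1827, 2192, 2557, 2922, 3288, 3653,
}


def log_group_settings(name, retention_days):
    """Validate and return the parameters for put_retention_policy."""
    if retention_days not in VALID_RETENTION_DAYS:
        raise ValueError(f"unsupported retention period: {retention_days} days")
    return {"logGroupName": name, "retentionInDays": retention_days}


def apply_settings(settings, region="eu-central-1"):
    """Create the log group if needed and set its retention. Needs boto3."""
    import boto3  # imported lazily so the validation above works without it

    logs = boto3.client("logs", region_name=region)
    try:
        logs.create_log_group(logGroupName=settings["logGroupName"])
    except logs.exceptions.ResourceAlreadyExistsException:
        pass  # the log group already exists; only update its retention
    logs.put_retention_policy(**settings)


# Hypothetical log group name, for illustration only.
settings = log_group_settings("/myapp/checkout/application", 30)
print(settings)
# apply_settings(settings)  # uncomment to apply against your account
```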

How to Choose a Log Group?

By default, our log groups follow this naming convention:

fluent-bit-cloudwatch-{cluster_name} → containers' logs
fluent-bit-cloudwatch-{cluster_name}-kube → containers' logs from kube-system namespace
adot_log_group_name → ADOT logs (turned off by default)
/aws/eks/{cluster_name}/cluster → AWS EKS logs (api, scheduler, etc.)
/aws/rds/instance/{rds_name}/… → RDS instance logs

AWS CloudWatch Logs Insights – https://eu-central-1.console.aws.amazon.com/cloudwatch/home?region=eu-central-1#logsV2:logs-insights

CloudWatch Logs Insights lets you interactively query and analyze log data stored in CloudWatch Logs.

Querying Logs: You can write queries to extract fields from log data, calculate statistics, sort and filter results, and more.
Visualizing Data: Logs Insights can visualize query results, making it easier to analyze and interpret the data.
Interactive Analysis: You can interactively run queries on your log data, fine-tune them, and see results quickly.
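
As an illustration of the querying workflow, the sketch below builds the parameters for a Logs Insights query and, optionally, runs it. It assumes Python with boto3 and configured AWS credentials; the cluster name in the log group is hypothetical, and the query itself (the 20 most recent messages containing ERROR) is only an example.

```python
import time

# Example query: the 20 most recent log events whose message contains "ERROR".
QUERY = """
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
""".strip()


def build_query_params(log_group, query, minutes_back, now=None):
    """Build start_query parameters; times are epoch seconds."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - minutes_back * 60,
        "endTime": end,
        "queryString": query,
    }


def run_query(params, region="eu-central-1", poll_seconds=1.0):
    """Start the query and poll until it finishes. Needs boto3."""
    import boto3  # imported lazily so the builder above works without it

    logs = boto3.client("logs", region_name=region)
    query_id = logs.start_query(**params)["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            return response
        time.sleep(poll_seconds)


# "my-cluster" is a hypothetical cluster name.
params = build_query_params(
    "fluent-bit-cloudwatch-my-cluster", QUERY, minutes_back=15, now=1_700_000_000
)
print(params["startTime"], params["endTime"])
# results = run_query(params)  # uncomment to run against your account
```

Queries run asynchronously, which is why the sketch polls for the result instead of expecting it from the first call.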

AWS CloudWatch Live Tail – https://us-west-1.console.aws.amazon.com/cloudwatch/home?region=us-west-1#logsV2:live-tail

Live Tail in CloudWatch Logs allows you to view streaming log data in real-time as it is sent to CloudWatch. This feature is particularly useful for real-time application and system monitoring, troubleshooting, and quickly identifying issues as they occur.

Live Tail has these important features:

Real-Time Streaming: Live Tail streams log events as they are ingested into CloudWatch Logs, with minimal delay, providing immediate insight into your application's or system's behavior.
Search and Filter: You can apply a filter pattern to the stream so that only matching log events are shown, which helps in pinpointing specific issues or monitoring certain aspects of your system.

To start a session, select at least one log group; you can optionally set a filter pattern.
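
The same choice of log group and filter can be expressed through the API. The following is a hedged sketch assuming Python with a recent boto3 version that includes the start_live_tail operation, plus configured AWS credentials. The account ID and cluster name in the ARN are placeholders, and "ERROR" is just an example filter pattern.

```python
def build_live_tail_params(log_group_arn, filter_pattern=None):
    """Build start_live_tail parameters for a single log group."""
    # The API identifies log groups by ARN (without a trailing ":*").
    params = {"logGroupIdentifiers": [log_group_arn]}
    if filter_pattern:
        params["logEventFilterPattern"] = filter_pattern
    return params


def tail(params, region="eu-central-1", max_events=20):
    """Print streamed log events until max_events are seen. Needs boto3."""
    import boto3  # imported lazily so the builder above works without it

    logs = boto3.client("logs", region_name=region)
    response = logs.start_live_tail(**params)
    stream = response["responseStream"]
    seen = 0
    try:
        for event in stream:
            # sessionUpdate events carry the batches of matching log events.
            update = event.get("sessionUpdate", {})
            for entry in update.get("sessionResults", []):
                print(entry["timestamp"], entry["message"])
                seen += 1
            if seen >= max_events:
                break
    finally:
        stream.close()


# Placeholder ARN: account ID and cluster name are hypothetical.
params = build_live_tail_params(
    "arn:aws:logs:eu-central-1:123456789012:log-group:fluent-bit-cloudwatch-my-cluster",
    filter_pattern="ERROR",
)
print(params)
# tail(params)  # uncomment to stream from your account
```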

Alarms

AWS CloudWatch Alarms – https://eu-central-1.console.aws.amazon.com/cloudwatch/home?region=eu-central-1#alarmsV2:

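An alarm watches a single CloudWatch metric, or the result of a math expression based on CloudWatch metrics, and performs one or more actions based on the value relative to a given threshold over a number of time periods. To alarm on log content, you typically create a metric filter on the log group first and set the alarm on the resulting metric.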

Key concepts:

Metrics and Thresholds: Alarms are set on specific metrics. You define a threshold, and when the metric crosses this threshold, the alarm changes state.
Alarm States: A CloudWatch alarm has three states:
- OK: The metric is within the defined threshold.
- ALARM: The metric is outside of the defined threshold.
- INSUFFICIENT_DATA: The alarm has just started, the metric is not available, or not enough data is available for the metric to determine the alarm state.
Periods and Evaluation: The period is the length of time over which the metric is aggregated into each data point. You also set the number of evaluation periods (how many of the most recent periods are evaluated) and the datapoints to alarm (how many of those periods must breach the threshold to trigger the alarm).
Actions: When an alarm changes its state, it can trigger actions. Actions can include sending a message to an SNS topic, Auto Scaling actions, or EC2 actions.
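Composite Alarms: These allow you to combine multiple alarms into one alarm. You define a rule expression over the states of the underlying alarms using AND, OR, and NOT, and the composite alarm goes to ALARM state when that expression evaluates to true. With AND, all underlying alarms must be in the ALARM state; with OR, one is enough.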
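
The evaluation logic described in the key concepts can be sketched as a small local simulation in plain Python. This is a simplification under stated assumptions, not CloudWatch's actual algorithm: it takes one aggregated value per period, ignores the alarm's missing-data settings, and supports only a plain AND or OR over child alarms rather than full rule expressions. All metric values are made up.

```python
import operator

COMPARISONS = {
    "GreaterThanThreshold": operator.gt,
    "GreaterThanOrEqualToThreshold": operator.ge,
    "LessThanThreshold": operator.lt,
    "LessThanOrEqualToThreshold": operator.le,
}


def alarm_state(datapoints, threshold, comparison, evaluation_periods, datapoints_to_alarm):
    """Simplified alarm evaluation.

    datapoints holds one aggregated value per period, oldest first;
    None stands for a period without data.
    """
    window = datapoints[-evaluation_periods:]
    present = [v for v in window if v is not None]
    if not present:
        return "INSUFFICIENT_DATA"
    compare = COMPARISONS[comparison]
    breaching = sum(1 for v in present if compare(v, threshold))
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


def composite_state(rule, child_states):
    """Combine child alarm states with a plain 'AND' or 'OR' rule."""
    in_alarm = [state == "ALARM" for state in child_states.values()]
    fired = all(in_alarm) if rule == "AND" else any(in_alarm)
    return "ALARM" if fired else "OK"


# Made-up per-period values, oldest first.
cpu = alarm_state([40, 85, 90, 70, 95], 80, "GreaterThanThreshold", 3, 2)
errors = alarm_state([0, 0, 1, 0, 0], 5, "GreaterThanThreshold", 3, 2)
no_data = alarm_state([None, None, None], 80, "GreaterThanThreshold", 3, 2)
children = {"high-cpu": cpu, "many-errors": errors}

print("cpu:", cpu, "errors:", errors, "no data:", no_data)
print("AND:", composite_state("AND", children))
print("OR:", composite_state("OR", children))
```

In the CPU example, two of the last three periods breach the threshold of 80, which satisfies "2 out of 3" and puts the alarm into ALARM even though the middle period is fine.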

Dashboards

AWS CloudWatch Dashboards – https://eu-central-1.console.aws.amazon.com/cloudwatch/home?region=eu-central-1#dashboards/

Creating a dashboard in AWS CloudWatch provides a unified view of the metrics, logs, and alarms for your AWS resources and applications. Dashboards are customizable and can display data in various formats, such as graphs, metrics widgets, and text widgets.

Key concepts:

Widgets: The primary components of a CloudWatch Dashboard are widgets. Each widget can display different types of data or visualizations, like line charts, stacked area charts, numbers, text, and even query results from CloudWatch Logs Insights.
Metrics and Logs Visualization: You can add widgets to display metrics or logs. Metrics widgets can show data like CPU utilization, network traffic, disk I/O, while log widgets can display the output of a CloudWatch Logs Insights query.
Alarm Widgets: These widgets display the state of CloudWatch alarms and can be used to quickly understand the health status of your resources and applications.
Customization: Dashboards are highly customizable. You can resize and arrange widgets to suit your monitoring needs and preferences.
Real-Time and Historical Data: Dashboards can show both recent (near-real-time) and historical data, allowing you to analyze trends over time.
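
Dashboards can also be defined as JSON, with one entry per widget. The sketch below assembles a dashboard body with a text widget, a metric widget, and a log widget, assuming Python with boto3 and configured AWS credentials. The dashboard name, instance ID, and cluster name are hypothetical placeholders.

```python
import json


def build_dashboard_body(region, instance_id, log_group):
    """Return the dashboard definition as a JSON string."""
    widgets = [
        {   # Text widget spanning the full width of the 24-column grid.
            "type": "text", "x": 0, "y": 0, "width": 24, "height": 2,
            "properties": {"markdown": "# Service overview"},
        },
        {   # Metric widget: namespace, metric name, then dimension name/value.
            "type": "metric", "x": 0, "y": 2, "width": 12, "height": 6,
            "properties": {
                "title": "EC2 CPU utilization",
                "region": region,
                "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
                "stat": "Average",
                "period": 300,
                "view": "timeSeries",
            },
        },
        {   # Log widget: shows the output of a Logs Insights query.
            "type": "log", "x": 12, "y": 2, "width": 12, "height": 6,
            "properties": {
                "title": "Recent errors",
                "region": region,
                "query": (
                    f"SOURCE '{log_group}' | fields @timestamp, @message "
                    "| filter @message like /ERROR/ | sort @timestamp desc | limit 20"
                ),
                "view": "table",
            },
        },
    ]
    return json.dumps({"widgets": widgets})


def put_dashboard(name, body, region="eu-central-1"):
    """Create or overwrite the dashboard. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("cloudwatch", region_name=region)
    client.put_dashboard(DashboardName=name, DashboardBody=body)


# Hypothetical instance ID and cluster name.
body = build_dashboard_body(
    "eu-central-1", "i-0123456789abcdef0", "fluent-bit-cloudwatch-my-cluster"
)
print(len(json.loads(body)["widgets"]), "widgets")
# put_dashboard("service-overview", body)  # uncomment to create the dashboard
```

Keeping the definition in code makes it possible to review dashboard changes and recreate the same dashboard in another region or account.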