Thursday 30 May 2024

How to monitor EC2 instance metrics in Amazon CloudWatch



Amazon CloudWatch offers 2 types of monitoring for EC2 instances:
  • Basic
    • metrics available in 5-minute periods
    • enabled by default, at no additional charge
  • Detailed
    • all metrics available in 1-minute periods
    • must be enabled explicitly for the instance; it can be enabled at launch, or later while the instance is running or stopped; enabling it does not affect the monitoring of the EBS volumes attached to the instance
    • charged per metric

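Detailed monitoring can also be switched on or off from the command line. The sketch below wraps the AWS CLI commands aws ec2 monitor-instances and aws ec2 unmonitor-instances. It assumes the AWS CLI is installed and configured with credentials allowed to call them; the function name and the instance ID are hypothetical.

```shell
# Sketch: toggle detailed (1-minute) monitoring for one EC2 instance.
# Assumes AWS CLI credentials with ec2:MonitorInstances / ec2:UnmonitorInstances.
set_detailed_monitoring() {
  local instance_id="$1"   # e.g. i-0123456789abcdef0 (hypothetical ID)
  local mode="$2"          # "on" or "off"
  case "$mode" in
    on)  aws ec2 monitor-instances --instance-ids "$instance_id" ;;
    off) aws ec2 unmonitor-instances --instance-ids "$instance_id" ;;
    *)   echo "usage: set_detailed_monitoring <instance-id> on|off" >&2; return 2 ;;
  esac
}
```

To enable detailed monitoring at launch instead, aws ec2 run-instances accepts --monitoring Enabled=true.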
Amazon CloudWatch can monitor two types of EC2 instance metrics:
  • Basic (Default) Metrics
    • AWS/EC2 namespace includes the following metrics:
      • Instance metrics:
        • CPUUtilization
        • DiskReadOps
        • DiskWriteOps
        • DiskReadBytes
        • DiskWriteBytes
        • MetadataNoToken
        • MetadataNoTokenRejected
        • NetworkIn
        • NetworkOut
        • NetworkPacketsIn
        • NetworkPacketsOut
      • CPU credit metrics
        • CPUCreditUsage
        • CPUCreditBalance
        • CPUSurplusCreditBalance
        • CPUSurplusCreditsCharged
      • Dedicated Host metrics
        • DedicatedHostCPUUtilization
      • EBS metrics for Nitro-based instances
        • EBSReadOps
        • EBSWriteOps
        • EBSReadBytes
        • EBSWriteBytes
        • EBSIOBalance%
        • EBSByteBalance%
      • Status check metrics
        • StatusCheckFailed
        • StatusCheckFailed_Instance
        • StatusCheckFailed_System
        • StatusCheckFailed_AttachedEBS
    • AWS/EBS namespace includes the following status check metric:
      • VolumeStalledIOCheck
    • By default, Amazon EC2 sends metric data to CloudWatch in 5-minute periods
  • Additional (Custom) Metrics
    • internal, OS-level metrics that are not visible from outside the instance
    • e.g. memory (RAM) usage, swap usage, disk space utilization of EBS volumes etc.
    • require one of these tools:
      • CloudWatch agent (recommended way). It needs to be installed on our EC2 instances and then configured to emit the selected metrics.
      • CloudWatch monitoring scripts (legacy way)
    • Metrics collected by the CloudWatch agent are billed as custom metrics.

The CloudWatch monitoring scripts are deprecated; the recommended way to collect logs and metrics is the CloudWatch agent.
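The basic metrics listed above can also be read from the command line. Here is a minimal sketch that fetches the average CPUUtilization (AWS/EC2 namespace) of one instance over the last hour via aws cloudwatch get-metric-statistics. It assumes the AWS CLI and GNU date (as shipped with Amazon Linux); the function name and instance ID are hypothetical.

```shell
# Sketch: average CPUUtilization of one instance over the last hour.
# Assumptions: AWS CLI installed and configured; GNU date for "1 hour ago".
cpu_utilization_last_hour() {
  local instance_id="$1"      # e.g. i-0123456789abcdef0 (hypothetical ID)
  local period="${2:-300}"    # 300 s matches basic monitoring; 60 s needs detailed monitoring
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value="$instance_id" \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period "$period" \
    --statistics Average
}
```

With basic monitoring, asking for a 60-second period will not produce more datapoints than the 5-minute data that EC2 actually sends.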


How to use CloudWatch Agent for EC2 instance monitoring?


First, the CloudWatch agent needs to be installed on the EC2 instance. It is available as a package in the Amazon Linux 2023 and Amazon Linux 2 repositories:


# yum install amazon-cloudwatch-agent

Amazon Linux 2023 repository                                31 kB/s | 3.6 kB     00:00
Amazon Linux 2023 Kernel Livepatch repository               38 kB/s | 2.9 kB     00:00
Dependencies resolved.
===============================================================================================================
 Package                   Architecture    Version                  Repository         Size
===============================================================================================================
Installing:
 amazon-cloudwatch-agent   x86_64          1.300033.0-1.amzn2023    amazonlinux        95 M

Transaction Summary
===============================================================================================================
Install  1 Package

Total download size: 95 M
Installed size: 360 M
Is this ok [y/N]: y
Downloading Packages:
amazon-cloudwatch-agent-1.300033.0-1.amzn2023.x86_64.rpm    71 MB/s |  95 MB     00:01
---------------------------------------------------------------------------------------------------------------
Total                                                       67 MB/s |  95 MB     00:01
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                        1/1
  Running scriptlet: amazon-cloudwatch-agent-1.300033.0-1.amzn2023.x86_64   1/1
create group cwagent, result: 0
create user cwagent, result: 0

  Installing       : amazon-cloudwatch-agent-1.300033.0-1.amzn2023.x86_64   1/1
  Running scriptlet: amazon-cloudwatch-agent-1.300033.0-1.amzn2023.x86_64   1/1
  Verifying        : amazon-cloudwatch-agent-1.300033.0-1.amzn2023.x86_64   1/1

Installed:
  amazon-cloudwatch-agent-1.300033.0-1.amzn2023.x86_64
Complete!

This was done on an EC2 instance running this OS:

# cat /etc/os-release

NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023.4.20240429"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/"
DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/"
SUPPORT_URL="https://aws.amazon.com/premiumsupport/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
VENDOR_NAME="AWS"
VENDOR_URL="https://aws.amazon.com/"
SUPPORT_END="2028-03-15"
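Before installing, a provisioning script can check the ID and VERSION_ID fields shown above to decide whether the package can come from the distribution repository. This is a minimal sketch under the assumption that only Amazon Linux 2 and Amazon Linux 2023 are treated as supported; the function name is hypothetical.

```shell
# Sketch: return 0 if /etc/os-release (or the given file) describes
# Amazon Linux 2 or Amazon Linux 2023, where the agent is a repo package.
agent_package_available() {
  local os_release="${1:-/etc/os-release}"
  local id version
  # Source the file inside command substitutions so its variables
  # do not leak into the caller's shell.
  id=$(. "$os_release" && echo "$ID")
  version=$(. "$os_release" && echo "$VERSION_ID")
  if [ "$id" = "amzn" ] && { [ "$version" = "2" ] || [ "$version" = "2023" ]; }; then
    return 0
  fi
  return 1
}
```

Usage: if agent_package_available; then yum install -y amazon-cloudwatch-agent; fi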

Next, we need to make sure that the IAM role attached to the EC2 instance has the AWS managed policy CloudWatchAgentServerPolicy attached to it.
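If the policy is missing, it can be attached with the AWS CLI. A minimal sketch, assuming the CLI runs with credentials allowed to call iam:AttachRolePolicy; the role name my-ec2-cw-role and the function name are hypothetical.

```shell
# Sketch: attach the AWS managed policy CloudWatchAgentServerPolicy
# to the IAM role used by the instance profile.
attach_cw_agent_policy() {
  local role_name="$1"   # e.g. my-ec2-cw-role (hypothetical)
  aws iam attach-role-policy \
    --role-name "$role_name" \
    --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
}
```

The result can be verified with aws iam list-attached-role-policies --role-name my-ec2-cw-role.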



How to check if CloudWatch agent is installed on EC2 instance?



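We need to SSH into the EC2 instance and then run the command below as root. For an agent that is installed but not yet configured, the output looks like this: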

# /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status

{
  "status": "stopped",
  "starttime": "",
  "configstatus": "not configured",
  "version": "1.300033.0"
}

If the agent is configured and running, the output looks like this:

{
  "status": "running",
  "starttime": "2024-05-29T11:27:25+00:00",
  "configstatus": "configured",
  "version": "1.300033.0"
}
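For scripted health checks, the status JSON can be parsed without extra tools. The sketch below uses sed rather than jq (which may not be installed on a minimal AMI) and assumes one key per line, as in the outputs above; the function names are hypothetical.

```shell
# Sketch: extract a string field from the JSON printed by
#   amazon-cloudwatch-agent-ctl -m ec2 -a status
# Reads the JSON on stdin; assumes one "key": "value" pair per line.
cw_agent_field() {
  local field="$1"
  sed -n "s/^[[:space:]]*\"$field\":[[:space:]]*\"\([^\"]*\)\".*/\1/p"
}

# Succeeds only if the "status" field is "running".
cw_agent_is_running() {
  [ "$(cw_agent_field status)" = "running" ]
}
```

Usage: /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status | cw_agent_is_running && echo "agent is running"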


How to configure CloudWatch Agent?


Before running the CloudWatch agent on any server, we must create one or more CloudWatch agent configuration files on that server. The configuration defines which metrics we want to collect, from which resources, and which descriptors (dimensions) to attach to them so that they can be grouped for visual representation.

Dimensions:
  • name/value pairs that provide context for metrics by categorizing them according to specific criteria (e.g. InstanceId, path)
  • they describe and categorize the data
Metrics:
  • they quantify, i.e. provide the numerical values
  • each metric is identified by its namespace, its name and its set of dimensions

The configuration file can have any name, e.g. amazon-cloudwatch-agent.json.


Example: Collecting metrics of EBS disks mounted on EC2 instance

We first need to find the mount points of the root and data disks:

# lsblk

NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme0n1       259:0    0   50G  0 disk 
├─nvme0n1p1   259:1    0   50G  0 part /
├─nvme0n1p127 259:2    0    1M  0 part 
└─nvme0n1p128 259:3    0   10M  0 part /boot/efi
nvme1n1       259:4    0  300G  0 disk /home/my-user/ebs-volume-1-mount-point-dir
nvme2n1       259:5    0  130G  0 disk /home/my-user/ebs-volume-2-mount-point-dir
nvme3n1       259:6    0   35G  0 disk /home/my-user/ebs-volume-3-mount-point-dir
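On instances with many volumes, the mount points can be picked out of lsblk output instead of being copied by hand. This minimal sketch reads the output of lsblk -rn -o TYPE,MOUNTPOINT (raw, no header) and keeps "/" plus every whole disk that is mounted directly. It assumes, as in the listing above, that data volumes are mounted without partitions; the function name is hypothetical.

```shell
# Sketch: print the mount points to monitor, one per line, from
#   lsblk -rn -o TYPE,MOUNTPOINT
# on stdin: the root filesystem "/" and every directly mounted whole disk.
# /boot/efi (a partition) and unmounted entries are skipped.
disk_mount_points() {
  awk '($1 == "disk" && $2 != "") || $2 == "/" { print $2 }'
}
```

Usage: lsblk -rn -o TYPE,MOUNTPOINT | disk_mount_points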

We can then specify which metrics we want to collect, on which disks (identified by their mount points), and which dimension to attach to them, in this case InstanceId:

# vi /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "disk": {
        "measurement": [
          "free",
          "total",
          "used",
          "used_percent"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "/",
          "/home/my-user/ebs-volume-1-mount-point-dir",
          "/home/my-user/ebs-volume-2-mount-point-dir",
          "/home/my-user/ebs-volume-3-mount-point-dir"
        ]
      }
    }
  }
}
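When the set of volumes changes between instances, the same configuration can be generated from a list of mount points rather than edited by hand. A minimal sketch, assuming the mount points contain no double quotes or backslashes (they are not JSON-escaped); the function name is hypothetical.

```shell
# Sketch: print the agent configuration above, with the disk "resources"
# list built from the mount points passed as arguments.
make_cw_agent_config() {
  local resources="" mp
  for mp in "$@"; do
    # Separate entries with a comma and a newline
    if [ -n "$resources" ]; then
      resources="$resources,
"
    fi
    resources="$resources          \"$mp\""
  done
  # Unquoted heredoc: $resources expands, \${aws:InstanceId} stays literal
  cat <<EOF
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "append_dimensions": {
      "InstanceId": "\${aws:InstanceId}"
    },
    "metrics_collected": {
      "disk": {
        "measurement": ["free", "total", "used", "used_percent"],
        "metrics_collection_interval": 60,
        "resources": [
$resources
        ]
      }
    }
  }
}
EOF
}
```

Usage: make_cw_agent_config / /home/my-user/ebs-volume-1-mount-point-dir > /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json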


The agent can run (run_as_user) as a dedicated user, e.g. cwagent, which the package installation created, or as one of the existing users on the instance, e.g. my-user.

We can also specify a namespace (the namespace key in the metrics section); if it isn't specified, the default namespace CWAgent is used. That namespace appears in the AWS Console under CloudWatch >> Metrics >> All metrics >> Custom namespaces.

If we use InstanceId as a dimension and don't specify any aggregation dimensions, each disk metric is published with the dimensions InstanceId, path, fstype and device. For the example above, there will be 4 graphs per metric: InstanceId is always the same and there are 4 unique combinations of [path, fstype, device], one per monitored disk.
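One way to see these combinations is to list the metrics the agent has published: each entry returned by aws cloudwatch list-metrics is one metric, with its full set of dimensions. A minimal sketch, assuming the default CWAgent namespace and the disk_used_percent metric that the used_percent measurement produces; the function name and instance ID are hypothetical.

```shell
# Sketch: list the disk_used_percent metrics published for one instance.
# Expect one entry per monitored disk, each carrying InstanceId plus
# its own path, device and fstype dimensions.
list_disk_metrics() {
  local instance_id="$1"   # e.g. i-0123456789abcdef0 (hypothetical ID)
  aws cloudwatch list-metrics \
    --namespace CWAgent \
    --metric-name disk_used_percent \
    --dimensions Name=InstanceId,Value="$instance_id"
}
```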

If our instances are managed by an Auto Scaling group (ASG) and we want to monitor their disks, we need to specify aggregation dimensions: AutoScalingGroupName on its own can't uniquely identify a disk, and CloudWatch needs to draw one graph per disk. If the ASG has at most 1 instance, we can additionally specify e.g. path:

    "metrics": {
      "append_dimensions": {
        "AutoScalingGroupName":"${aws:AutoScalingGroupName}"
      },
      "aggregation_dimensions": [
        [ "AutoScalingGroupName", "path" ]
      ],
      ...
    }


If the ASG has more than 1 instance, we need to include both InstanceId and path, as only the combination of InstanceId and path uniquely identifies a disk:

    "metrics": {
      "append_dimensions": {
        "AutoScalingGroupName":"${aws:AutoScalingGroupName}",
        "InstanceId": "${aws:InstanceId}"
      },
      "aggregation_dimensions": [
        [ "AutoScalingGroupName", "InstanceId", "path" ]
      ],
      ...
    }
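To read one of these aggregated series from the command line, the dimensions passed to aws cloudwatch get-metric-statistics have to match the aggregation dimensions exactly. A minimal sketch, assuming the AWS CLI, GNU date, the default CWAgent namespace and the disk_used_percent metric; the ASG name, instance ID and function name are hypothetical.

```shell
# Sketch: maximum disk_used_percent over the last hour for one disk of one
# ASG instance, addressed by the aggregation dimensions
# [ "AutoScalingGroupName", "InstanceId", "path" ].
asg_disk_used_percent() {
  local asg="$1" instance_id="$2" path="$3"
  aws cloudwatch get-metric-statistics \
    --namespace CWAgent \
    --metric-name disk_used_percent \
    --dimensions Name=AutoScalingGroupName,Value="$asg" Name=InstanceId,Value="$instance_id" Name=path,Value="$path" \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 60 \
    --statistics Maximum
}
```

Usage: asg_disk_used_percent my-asg i-0123456789abcdef0 /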



After we change the agent configuration file, we must restart the agent to have the changes take effect. To restart the agent service:

# systemctl restart amazon-cloudwatch-agent
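An alternative is to let the agent's control script load the JSON file and restart the agent in one step, using its fetch-config action. This is a minimal sketch, assuming the default install paths used above. The CW_AGENT_CTL override and the function name are hypothetical and exist only so that the function can be tried without the agent installed.

```shell
# Sketch: load a JSON configuration into the agent and restart it.
#   -a fetch-config : load the given JSON configuration
#   -m ec2          : the agent runs on an EC2 instance
#   -s              : (re)start the agent after loading the config
#   -c file:<path>  : location of the JSON configuration file
reload_cw_agent_config() {
  local ctl="${CW_AGENT_CTL:-/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl}"
  local config="${1:-/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json}"
  "$ctl" -a fetch-config -m ec2 -s -c "file:$config"
}
```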

After the restart, we can check the agent's local log on that instance:

# tail -f /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log

2024-05-13T16:26:13Z I! {"caller":"service@v0.89.0/telemetry.go:77","msg":"Skipping telemetry setup.","address":"","level":"None"}
2024-05-13T16:26:13Z I! {"caller":"service@v0.89.0/service.go:143","msg":"Starting CWAgent...","Version":"1.300033.0","NumCPU":2}
2024-05-13T16:26:13Z I! {"caller":"extensions/extensions.go:34","msg":"Starting extensions..."}
2024-05-13T16:26:13Z I! {"caller":"extensions/extensions.go:37","msg":"Extension is starting...","kind":"extension","name":"agenthealth/metrics"}
2024-05-13T16:26:13Z I! {"caller":"extensions/extensions.go:45","msg":"Extension started.","kind":"extension","name":"agenthealth/metrics"}
2024-05-13T16:26:13Z I! cloudwatch: get unique roll up list []
2024-05-13T16:26:13Z I! {"caller":"ec2tagger/ec2tagger.go:435","msg":"ec2tagger: Check EC2 Metadata.","kind":"processor","name":"ec2tagger","pipeline":"metrics/host"}
2024-05-13T16:26:13Z I! cloudwatch: publish with ForceFlushInterval: 1m0s, Publish Jitter: 15.443407438s
2024-05-13T16:26:13Z I! {"caller":"ec2tagger/ec2tagger.go:411","msg":"ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes","kind":"processor","name":"ec2tagger","pipeline":"metrics/host"}
2024-05-13T16:26:13Z I! {"caller":"service@v0.89.0/service.go:169","msg":"Everything is ready. Begin running and processing data."}
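Instead of watching the log with tail -f, a script can look for the "Everything is ready" message shown above. A minimal sketch follows. It assumes that errors are tagged " E! " in the same way that informational lines are tagged " I! " in the excerpt; that tag is an assumption, because no error lines appear above. The function name is hypothetical. It scans the whole file, so errors logged by earlier runs are also reported.

```shell
# Sketch: summarize the agent log.
# Returns 0 if ready, 1 if error lines (assumed " E! " tag) exist, 2 otherwise.
cw_agent_log_check() {
  local log="${1:-/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log}"
  if grep -q ' E! ' "$log"; then
    echo "errors found:"
    grep ' E! ' "$log" | tail -n 5   # show the most recent error lines
    return 1
  fi
  if grep -q 'Everything is ready' "$log"; then
    echo "agent ready"
    return 0
  fi
  echo "agent not ready yet"
  return 2
}
```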


For the case where I used [ "AutoScalingGroupName", "path" ] as aggregation dimensions, the CloudWatch graphs look like this:



And this is what we see if we click on the AutoScalingGroupName, path aggregation:

