How to add CloudWatch monitors to auto-scale your Amazon Web Service

Nick Hardiman details the steps involved in adding CloudWatch monitors to keep tabs on your AWS performance so that auto-scaling will work properly.

I am setting up Amazon auto-scaling to satisfy my 12 principles of operational readiness. I've installed the tools, created an AMI, and defined my launch configuration, auto-scaling group, and policies, all covered in the previous posts of this series.

Next, I'm choosing a CloudWatch metric to monitor.

Add CloudWatch monitors

Choose something to check. I have about 140 different metrics to choose from.

The mon-put-metric-alarm command is the last component needed to make auto scaling work. It will continually check what's happening, compare values and, if necessary, kick off auto scaling according to my policies cap01-add and cap01-del.

Find a suitable metric

All the available metrics can be listed using the mon-list-metrics command, shown here. The three columns are Metric Name, Namespace and Dimensions -- these are needed when defining the monitor.

My-MacBook-Pro:~ nick$ mon-list-metrics
CPUUtilization        AWS/EC2
CPUUtilization        AWS/EC2  {InstanceId=i-58bd8b11}
CPUUtilization        AWS/EC2  {InstanceId=i-d91e9f91}
...
VolumeWriteOps        AWS/EBS  {VolumeId=vol-a9a63ec1}
VolumeWriteOps        AWS/EBS  {VolumeId=vol-e3851d8b}
VolumeWriteOps        AWS/EBS  {VolumeId=vol-d48e41bc}
My-MacBook-Pro:~ nick$
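With around 140 rows, the listing is easier to scan after a little filtering. The following is only a sketch of how the output might be post-processed, not part of the CloudWatch tools: it assumes the three-column layout shown above and prints each metric name and namespace with the number of rows it has. The function name summarize_metrics is made up for illustration.

```shell
# Summarize mon-list-metrics style output: one line per metric/namespace pair.
# Assumes whitespace-separated columns: MetricName Namespace [Dimensions].
# Lines with fewer than two columns (such as "...") are ignored.
summarize_metrics() {
  awk 'NF >= 2 { count[$1 " " $2]++ }
       END { for (k in count) print k, count[k] }' | sort
}

# Sample rows copied from the listing above. In practice the real output
# would be piped in instead:  mon-list-metrics | summarize_metrics
summarize_metrics <<'EOF'
CPUUtilization        AWS/EC2
CPUUtilization        AWS/EC2  {InstanceId=i-58bd8b11}
CPUUtilization        AWS/EC2  {InstanceId=i-d91e9f91}
VolumeWriteOps        AWS/EBS  {VolumeId=vol-a9a63ec1}
VolumeWriteOps        AWS/EBS  {VolumeId=vol-e3851d8b}
VolumeWriteOps        AWS/EBS  {VolumeId=vol-d48e41bc}
EOF
```

For the six sample rows this prints two lines, one for CPUUtilization in AWS/EC2 and one for VolumeWriteOps in AWS/EBS, each with a count of 3.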

I'm going to scale depending on how busy the CPU is. This is a little risky because processes sometimes go berserk and permanently ramp up utilization to 100%. I can minimize that risk by choosing the CPU average, across all auto-scaling servers.

View the metric's statistics, to make sure useful measurements are available. Use the mon-get-stats command.

My-MacBook-Pro:~ nick$ mon-get-stats CPUUtilization --statistics "Average" --namespace "AWS/EC2" --dimensions "AutoScalingGroupName=cag01"
2012-06-12 21:53:00  100.0  Percent
2012-06-12 21:54:00  95.24  Percent
2012-06-12 21:55:00  5.07   Percent
2012-06-12 21:56:00  3.23   Percent
2012-06-12 21:57:00  2.78   Percent
...

This command displays one measurement every minute. If nothing appears, then there is nothing to measure and auto-scaling won't work.
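A quick way to confirm that measurements exist, and to see their range at a glance, is to summarize the output. This is a rough sketch that assumes the four-column layout shown above (date, time, value, unit); the function name stats_summary is hypothetical.

```shell
# Summarize mon-get-stats style output: "DATE TIME VALUE UNIT" per line.
# Prints the number of datapoints, then the min, max and mean of the values.
# Prints "no datapoints" and returns non-zero when nothing was measured.
stats_summary() {
  awk 'NF >= 4 {
         v = $3 + 0
         if (n == 0 || v < min) min = v
         if (n == 0 || v > max) max = v
         sum += v
         n++
       }
       END {
         if (n == 0) { print "no datapoints"; exit 1 }
         printf "n=%d min=%.2f max=%.2f mean=%.2f\n", n, min, max, sum / n
       }'
}

# Sample datapoints copied from the output above.
stats_summary <<'EOF'
2012-06-12 21:53:00  100.0  Percent
2012-06-12 21:54:00  95.24  Percent
2012-06-12 21:55:00  5.07   Percent
2012-06-12 21:56:00  3.23   Percent
2012-06-12 21:57:00  2.78   Percent
EOF
```

For these five datapoints the summary is n=5 min=2.78 max=100.00 mean=41.26. The mean is dragged up by the first two minutes, which is why the individual values still deserve a look.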

The usage in this example is tiny except for the first two minutes - the two high values of 100% and 95%. Yes, these numbers have to be examined (it's not enough to accept that there are some). This is the level of detail a good sysadmin has to sift through to ensure fine customer service.

Auto scaling commands include cool-down periods, to stop blips like these two high values creating unnecessary capacity. High measurements like these happen all the time, and especially in the Amazon cloud. Maybe an application restarted, the Xen hypervisor stole all the CPU, or another customer suffered a traffic spike.

Thank your sysadmin for staring at numbers like these by remembering him or her once a year on System Administrator Appreciation Day (July 27). No more than that, though - you don't want to spoil them.

Create the monitor that will add servers

mon-put-metric-alarm cma01-add \
--alarm-actions "arn:aws:autoscaling:eu-west-1:123494605340:scalingPolicy:45e9ed16-9462-4905-b9d4-cc99fcff0d43:autoScalingGroupName/cag01:policyName/cap01-add" \
--comparison-operator  GreaterThanThreshold \
--dimensions     "AutoScalingGroupName=cag01" \
--evaluation-periods 3 \
--metric-name    CPUUtilization \
--namespace      "AWS/EC2" \
--period         60 \
--statistic      Average \
--threshold      90 \
--unit           Percent

What do those options mean?

  • alarm-actions arn:aws:autoscaling:eu-... - Amazon's way of triggering the scale-out policy cap01-add
  • comparison-operator GreaterThanThreshold - Is the measurement greater than the threshold below?
  • dimensions "AutoScalingGroupName=cag01" - There may be many CPU utilization measurements, for many machines and groups of machines. This lets the monitor know which measurement to use.
  • evaluation-periods 3 - Don't trigger the policy on the first scary number. Check this many before acting.
  • metric-name CPUUtilization - This is the thing to measure.
  • namespace "AWS/EC2" - Where to look for numbers. An EBS volume, an SNS topic, and an EC2 instance can all be measured.
  • period 60 - The length of each measurement interval, in seconds. 60 matches the one-minute measurements mon-get-stats displayed earlier.
  • statistic Average - Don't get fooled by one crazy box.
  • threshold 90 - The value to check against (90%).
  • unit Percent - The type of measurement. Make sure a ratio, like a percentage, is not compared to a different kind of unit, like seconds or a count.
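To see how the comparison operator, threshold and evaluation periods combine, here is a deliberately simplified model of the check. It is an assumption about the general idea (fire only after several consecutive breaches), not CloudWatch's actual implementation, and the function name would_alarm is invented for this sketch.

```shell
# would_alarm OPERATOR THRESHOLD PERIODS
# Reads one value per line (oldest first). Prints ALARM if, at any point,
# PERIODS consecutive values breach THRESHOLD, otherwise prints OK.
# OPERATOR is GreaterThanThreshold; anything else is treated as LessThanThreshold.
would_alarm() {
  awk -v op="$1" -v threshold="$2" -v periods="$3" '
    {
      v = $1 + 0
      if (op == "GreaterThanThreshold") breach = (v > threshold + 0)
      else breach = (v < threshold + 0)
      if (breach) run++; else run = 0
      if (run >= periods + 0) fired = 1
    }
    END { if (fired) print "ALARM"; else print "OK" }'
}

# The five CPU averages from the mon-get-stats output earlier in the article.
printf '100.0\n95.24\n5.07\n3.23\n2.78\n' | would_alarm GreaterThanThreshold 90 3
printf '100.0\n95.24\n5.07\n3.23\n2.78\n' | would_alarm LessThanThreshold 30 3
```

The first call prints OK, because only two values in a row are above 90. The second prints ALARM, because the last three values are all below 30.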

Create the monitor that will delete servers

The layout is pretty similar, but the check is for a minimum threshold.

mon-put-metric-alarm \
--alarm-name     cma01-del \
--alarm-actions "arn:aws:autoscaling:eu-west-1:123494605340:scalingPolicy:8bf0832a-1234-446b-93bc-3b7a94609943:autoScalingGroupName/cag01:policyName/cap01-del" \
--comparison-operator  LessThanThreshold \
--dimensions     "AutoScalingGroupName=cag01"  \
--evaluation-periods 3 \
--metric-name    CPUUtilization \
--namespace      "AWS/EC2" \
--period         60      \
--statistic      Average \
--threshold      30  \
--unit           Percent
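The two alarms differ only in name, policy, comparison operator and threshold. One way to keep them in step is a small helper that prints the command for review before anything is run. This is a sketch only: print_alarm_cmd is a made-up name, the ARN values are placeholders, and the fixed options are copied from the commands in this article.

```shell
# print_alarm_cmd NAME POLICY_ARN OPERATOR THRESHOLD
# Prints (does not run) a mon-put-metric-alarm command line. The group name
# cag01 and the other fixed options are the ones used in this article.
print_alarm_cmd() {
  printf 'mon-put-metric-alarm --alarm-name %s' "$1"
  printf ' --alarm-actions "%s"' "$2"
  printf ' --comparison-operator %s' "$3"
  printf ' --dimensions "AutoScalingGroupName=cag01"'
  printf ' --evaluation-periods 3 --metric-name CPUUtilization'
  printf ' --namespace "AWS/EC2" --period 60 --statistic Average'
  printf ' --threshold %s --unit Percent\n' "$4"
}

# Placeholders stand in for the real policy ARNs.
print_alarm_cmd cma01-add "ARN_OF_cap01-add" GreaterThanThreshold 90
print_alarm_cmd cma01-del "ARN_OF_cap01-del" LessThanThreshold 30
```

Printing rather than executing keeps the sketch harmless; the output can be checked by eye and pasted into the terminal once the real ARNs are filled in.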

Check your work

My-MacBook-Pro:~ nick$ mon-describe-alarms --headers
ALARM      STATE  ALARM_ACTIONS                             NAMESPACE  METRIC_NAME     PERIOD  STATISTIC  EVAL_PERIODS  COMPARISON            THRESHOLD
cma01-add  OK     arn:aws:autoscalin...olicyName/cap01-add  AWS/EC2    CPUUtilization  60      Average    3             GreaterThanThreshold  90.0
cma01-del  ALARM  arn:aws:autoscalin...olicyName/cap01-del  AWS/EC2    CPUUtilization  60      Average    3             LessThanThreshold     30.0
My-MacBook-Pro:~ nick$
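In this listing cma01-del is in the ALARM state, presumably because the CPU average seen earlier (a few percent) is well below its threshold of 30. When there are many alarms, a filter helps pick out the ones that are firing. The sketch below assumes the column layout shown above, with the alarm name first and the state second; firing_alarms is a hypothetical helper name.

```shell
# firing_alarms: read mon-describe-alarms output and print the names of
# alarms whose STATE column is ALARM. The header row is skipped naturally
# because its second column is the word STATE.
firing_alarms() {
  awk '$2 == "ALARM" { print $1 }'
}

# Sample rows copied from the listing above.
firing_alarms <<'EOF'
ALARM      STATE  ALARM_ACTIONS                             NAMESPACE  METRIC_NAME     PERIOD  STATISTIC  EVAL_PERIODS  COMPARISON            THRESHOLD
cma01-add  OK     arn:aws:autoscalin...olicyName/cap01-add  AWS/EC2    CPUUtilization  60      Average    3             GreaterThanThreshold  90.0
cma01-del  ALARM  arn:aws:autoscalin...olicyName/cap01-del  AWS/EC2    CPUUtilization  60      Average    3             LessThanThreshold     30.0
EOF
```

For the sample rows this prints cma01-del.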

Troubleshoot

I have a talent for breaking things, so for a complex procedure like this I am bound to cause myself grief. These AWS commands are case sensitive. I kept getting the error:

java.lang.IllegalArgumentException: seconds

It was buried in the Java stack trace, and I missed it for a good half an hour. The command worked once I changed the --unit value from seconds to Seconds.

Time: 2012-06-12 08:58:52
program error
Configuration error: Failed to get constructor() for type {com.amazonaws.monitoring.StandardUnit} : null:null
null
java.lang.reflect.InvocationTargetException
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
...
       at com.amazon.webservices.Cli.main(Cli.java:43)
Caused by: java.lang.IllegalArgumentException: seconds
       at com.amazonaws.monitoring.StandardUnit.fromValue(StandardUnit.java:121)
       ... 17 more
mon-put-metric-alarm:  program error
My-MacBook-Pro:~ nick$
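A small pre-flight check can catch this kind of capitalization slip before the Java tools do. This is a sketch, not part of the AWS tools: check_unit is an invented name, and the list covers only some of the CloudWatch unit names.

```shell
# check_unit UNIT
# Prints "ok" if UNIT exactly matches a known unit name, a suggestion if only
# the capitalization differs, or "unknown". Returns non-zero unless it is ok.
# The list below is a partial set of CloudWatch unit names.
check_unit() {
  lower_in=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  for known in Seconds Microseconds Milliseconds Bytes Kilobytes Megabytes \
               Gigabytes Bits Percent Count None; do
    if [ "$1" = "$known" ]; then
      echo "ok"
      return 0
    fi
    lower_known=$(printf '%s' "$known" | tr '[:upper:]' '[:lower:]')
    if [ "$lower_in" = "$lower_known" ]; then
      echo "did you mean $known?"
      return 1
    fi
  done
  echo "unknown"
  return 1
}

check_unit Percent
check_unit seconds || true   # wrong case: prints a suggestion
```

The first call prints ok; the second prints "did you mean Seconds?", which is the fix that took half an hour to find by hand.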

One good thing about using a service like AWS is its popularity makes troubleshooting easier. Whatever your problem, someone has probably run into it before. Stick the error in Google and see what appears.

Next: Add auto-scaling notification. If auto-scaling happens, I will receive an e-mail.

About

Nick Hardiman builds and maintains the infrastructure required to run Internet services. Nick deals with the lower layers of the Internet - the machines, networks, operating systems, and applications. Nick's job stops there, and he hands over to the ...
