I am setting up Amazon auto-scaling to satisfy my 12 principles of operational readiness. I’ve installed the tools, created an AMI, and defined my launch configuration, auto-scaling group, and policies, all covered in the previous posts of this series:
- Getting to know Amazon’s auto-scaling command line tools
- Autoscaling an EC2 service: Creating a new AMI
- Installing and checking Amazon Auto Scaling and CloudWatch tools
- Add an auto-scaling group and policy for Amazon EC2 machines
Next, I’m choosing a CloudWatch metric to monitor.
Add CloudWatch monitors
Choose something to check. I have about 140 different metrics to choose from.
The mon-put-metric-alarm command is the last component needed to make auto scaling work. It will continually check what’s happening, compare values and, if necessary, kick off auto scaling according to my policies cap01-add and cap01-del.
Find a suitable metric
Every available metric can be listed with the mon-list-metrics command, shown here. The three columns are Metric Name, Namespace and Dimensions; all three are needed when defining the monitor.
My-MacBook-Pro:~ nick$ mon-list-metrics
CPUUtilization AWS/EC2
CPUUtilization AWS/EC2 {InstanceId=i-58bd8b11}
CPUUtilization AWS/EC2 {InstanceId=i-d91e9f91}
...
VolumeWriteOps AWS/EBS {VolumeId=vol-a9a63ec1}
VolumeWriteOps AWS/EBS {VolumeId=vol-e3851d8b}
VolumeWriteOps AWS/EBS {VolumeId=vol-d48e41bc}
My-MacBook-Pro:~ nick$
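With 140-odd metrics, the full listing is a lot to scroll through. Here is a minimal sketch of narrowing it down, assuming the three-column layout shown above. The function name filter_metrics is my own invention, not part of the CloudWatch tools.

```shell
# filter_metrics NAME [NAMESPACE]   (hypothetical helper, not an AWS command)
# Reads mon-list-metrics output on stdin and keeps the rows whose first
# column is NAME and, if NAMESPACE is given, whose second column matches it.
filter_metrics() {
  awk -v name="$1" -v ns="${2:-}" \
    '$1 == name && (ns == "" || $2 == ns)'
}

# Example (assumes the CloudWatch tools are installed and configured):
#   mon-list-metrics | filter_metrics CPUUtilization AWS/EC2
```

The match is exact and case sensitive, which suits these tools.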
I’m going to scale depending on how busy the CPU is. This is a little risky because processes sometimes go berserk and permanently ramp up utilization to 100%. I can minimize that risk by choosing the CPU average, across all auto-scaling servers.
View the metric’s statistics, to make sure useful measurements are available. Use the mon-get-stats command.
My-MacBook-Pro:~ nick$ mon-get-stats CPUUtilization --statistics "Average" --namespace "AWS/EC2" --dimensions "AutoScalingGroupName=cag01"
2012-06-12 21:53:00 100.0 Percent
2012-06-12 21:54:00 95.24 Percent
2012-06-12 21:55:00 5.07 Percent
2012-06-12 21:56:00 3.23 Percent
2012-06-12 21:57:00 2.78 Percent
...
This command displays one measurement every minute. If nothing appears, then there is nothing to measure and auto-scaling won’t work.
The usage in this example is tiny except for the first two minutes – the two high values of 100% and 95%. Yes, these numbers have to be examined (it’s not enough to accept that there are some). This is the level of detail a good sysadmin has to sift through to ensure fine customer service.
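Staring at a column of numbers gets old quickly, so here is a rough sketch of summarizing them. It assumes the four-column layout shown above (date, time, value, unit). The helper name summarize_stats is hypothetical.

```shell
# summarize_stats   (hypothetical helper, not an AWS command)
# Reads mon-get-stats output on stdin (date, time, value, unit) and prints
# the number of samples, the highest value and the mean value.
# Fails with "no samples" if there is nothing to measure.
summarize_stats() {
  awk '$3 ~ /^[0-9.]+$/ {
         v = $3 + 0
         n++; sum += v
         if (n == 1 || v > max) max = v
       }
       END {
         if (n == 0) { print "no samples"; exit 1 }
         printf "samples=%d max=%.2f mean=%.2f\n", n, max, sum / n
       }'
}
```

Fed the five readings above, this prints samples=5 max=100.00 mean=41.26. The mean is dragged up by the two early spikes, which is exactly why the raw numbers still deserve a look.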
Auto scaling commands include cool-down periods, to stop blips like these two high values creating unnecessary capacity. High measurements like these happen all the time, especially in the Amazon cloud. Maybe an application restarted, the Xen hypervisor stole all the CPU, or another customer suffered a traffic spike.
Thank your sysadmin for staring at numbers like these by remembering him or her once a year on System Administrator Appreciation Day (July 27). No more than that, though – you don’t want to spoil them.
Create the monitor that will add servers
mon-put-metric-alarm cma01-add \
--alarm-actions arn:aws:autoscaling:eu-west-1:123494605340:scalingPolicy:45e9ed16-9462-4905-b9d4-cc99fcff0d43:autoScalingGroupName/cag01:policyName/cap01-add \
--comparison-operator GreaterThanThreshold \
--dimensions "AutoScalingGroupName=cag01" \
--evaluation-periods 3 \
--metric-name CPUUtilization \
--namespace "AWS/EC2" \
--period 60 \
--statistic Average \
--threshold 90 \
--unit Percent
What do those options mean?
- alarm-actions arn:aws:autoscaling:eu-… – Amazon’s way of triggering the scale-out policy cap01-add
- comparison-operator GreaterThanThreshold – Is the measurement greater than the threshold below?
- dimensions "AutoScalingGroupName=cag01" – There may be many CPU utilization measurements, for many machines and groups of machines. This lets the monitor know which measurement to use.
- evaluation-periods 3 – Don’t trigger the policy on the first scary number. Check this many before acting.
- metric-name CPUUtilization – This is the thing to measure.
- namespace "AWS/EC2" – Where to look for numbers. An EBS volume, an SNS topic, and an EC2 instance can all be measured.
- period 60 – The length of each measurement interval, in seconds. 60 matches the one-minute readings shown earlier.
- statistic Average – Don’t get fooled by one crazy box.
- threshold 90 – The value to check against (90%).
- unit Percent – The type of measurement. Make sure like is compared with like: a percentage should not be checked against a duration in seconds.
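To see how threshold and evaluation-periods work together, here is a simplified model I can run against mon-get-stats output. It is only a sketch of the idea (so many consecutive readings above the threshold), not CloudWatch’s actual evaluation logic, and the name would_alarm is hypothetical.

```shell
# would_alarm THRESHOLD PERIODS   (hypothetical helper, simplified model)
# Reads mon-get-stats output on stdin (date, time, value, unit).
# Succeeds (exit 0) if at least PERIODS consecutive samples are strictly
# greater than THRESHOLD, otherwise fails (exit 1).
would_alarm() {
  awk -v threshold="$1" -v periods="$2" '
    $3 ~ /^[0-9.]+$/ {
      if ($3 + 0 > threshold + 0) run++; else run = 0
      if (run >= periods + 0) found = 1
    }
    END { exit !found }'
}

# Example (assumes the CloudWatch tools are installed and configured):
#   mon-get-stats CPUUtilization --statistics "Average" --namespace "AWS/EC2" \
#     --dimensions "AutoScalingGroupName=cag01" | would_alarm 90 3 && echo "would scale out"
```

With the readings shown earlier, only two consecutive samples exceed 90, so would_alarm 90 3 fails while would_alarm 90 2 succeeds. That is the point of checking three periods before acting.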
Create the monitor that will delete servers
The layout is pretty similar, but the check is for a minimum threshold.
mon-put-metric-alarm \
--alarm-name cma01-del \
--alarm-actions "arn:aws:autoscaling:eu-west-1:123494605340:scalingPolicy:8bf0832a-1234-446b-93bc-3b7a94609943:autoScalingGroupName/cag01:policyName/cap01-del" \
--comparison-operator LessThanThreshold \
--dimensions "AutoScalingGroupName=cag01" \
--evaluation-periods 3 \
--metric-name CPUUtilization \
--namespace "AWS/EC2" \
--period 60 \
--statistic Average \
--threshold 30 \
--unit Percent
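The two alarms differ in only four values: the name, the comparison operator, the threshold and the policy ARN. As a sketch, a small wrapper can make that explicit. The function alarm_command is hypothetical, it only prints the command rather than running it, and it hard-codes my group name cag01.

```shell
# alarm_command NAME OPERATOR THRESHOLD POLICY_ARN   (hypothetical helper)
# Prints, but does not run, a mon-put-metric-alarm command for group cag01.
# Only the four arguments vary; every other option is the same for both alarms.
alarm_command() {
  cat <<EOF
mon-put-metric-alarm \\
  --alarm-name $1 \\
  --alarm-actions "$4" \\
  --comparison-operator $2 \\
  --dimensions "AutoScalingGroupName=cag01" \\
  --evaluation-periods 3 \\
  --metric-name CPUUtilization \\
  --namespace "AWS/EC2" \\
  --period 60 \\
  --statistic Average \\
  --threshold $3 \\
  --unit Percent
EOF
}

# Example, with the policy ARN held in a variable of my own choosing:
#   alarm_command cma01-del LessThanThreshold 30 "$DEL_POLICY_ARN"
```

Printing first and pasting second gives me a chance to read the command before it touches anything.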
Check your work
My-MacBook-Pro:~ nick$ mon-describe-alarms --headers
ALARM STATE ALARM_ACTIONS NAMESPACE METRIC_NAME PERIOD STATISTIC EVAL_PERIODS COMPARISON THRESHOLD
cma01-add OK arn:aws:autoscalin...olicyName/cap01-add AWS/EC2 CPUUtilization 60 Average 3 GreaterThanThreshold 90.0
cma01-del ALARM arn:aws:autoscalin...olicyName/cap01-del AWS/EC2 CPUUtilization 60 Average 3 LessThanThreshold 30.0
My-MacBook-Pro:~ nick$
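Once there are more than a couple of alarms, it helps to pull out just the ones in a given state. A minimal sketch, assuming the column layout shown above (name first, state second); the function name alarms_in_state is hypothetical.

```shell
# alarms_in_state STATE   (hypothetical helper, not an AWS command)
# Reads mon-describe-alarms output on stdin and prints the name of every
# alarm whose second column equals STATE (for example OK or ALARM).
alarms_in_state() {
  awk -v state="$1" '$2 == state { print $1 }'
}

# Example (assumes the CloudWatch tools are installed and configured):
#   mon-describe-alarms --headers | alarms_in_state ALARM
```

For the output above this prints cma01-del, which fits the low CPU readings seen earlier: the group is well under the 30% threshold.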
Troubleshoot
I have a talent for breaking things, so for a complex procedure like this I am bound to cause myself grief. These AWS commands are case sensitive. I kept getting the error:
java.lang.IllegalArgumentException: seconds
buried in the Java stack trace. I missed it for a good half an hour. The command worked once I changed the --unit value from seconds to Seconds. The full output looked like this:
Time: 2012-06-12 08:58:52
program error
Configuration error: Failed to get constructor() for type {com.amazonaws.monitoring.StandardUnit} : null:null
null
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
...
at com.amazon.webservices.Cli.main(Cli.java:43)
Caused by: java.lang.IllegalArgumentException: seconds
at com.amazonaws.monitoring.StandardUnit.fromValue(StandardUnit.java:121)
... 17 more
mon-put-metric-alarm: program error
My-MacBook-Pro:~ nick$
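To save myself that half hour next time, a quick check of the unit before running the command would help. This is a sketch only: the function check_unit is hypothetical, and the list of unit names is a subset I am confident of, not the complete set CloudWatch accepts.

```shell
# check_unit VALUE   (hypothetical helper, not an AWS command)
# Succeeds if VALUE exactly matches one of the listed unit names.
# If only the capitalization differs, prints the correct spelling to stderr
# and fails; otherwise reports the value as unknown and fails.
check_unit() {
  # Partial list of CloudWatch unit names; extend as needed.
  for unit in Seconds Microseconds Milliseconds Bytes Kilobytes Megabytes \
              Gigabytes Terabytes Bits Kilobits Megabits Gigabits Terabits \
              Percent Count None; do
    if [ "$1" = "$unit" ]; then
      return 0
    fi
    given=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    known=$(printf '%s' "$unit" | tr '[:upper:]' '[:lower:]')
    if [ "$given" = "$known" ]; then
      echo "unit '$1' should be written '$unit'" >&2
      return 1
    fi
  done
  echo "unknown unit '$1'" >&2
  return 1
}
```

Running check_unit seconds fails with the hint that the unit should be written 'Seconds', which is a lot easier to spot than one line in a stack trace.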
One good thing about using a service like AWS is that its popularity makes troubleshooting easier. Whatever your problem, someone has probably run into it before. Stick the error in Google and see what appears.
Next: Add auto-scaling notification. If auto-scaling happens, I will receive an e-mail.