
AWS Autoscaling Best Practices

Welcome to Auto Scaling in the Amazon Cloud

If the fundamental premise of the "cloud" is to

use only the resources you need, for as long as you need them

then for a website you save money when traffic is low by scaling dynamically with traffic. Enterprise SaaS is a particularly good fit for AWS: customers use your product during typical business hours, so traffic drops to very little overnight.

So what does it take to make the switch?

Ideally, you want to be dealing with a stateless application, where terminating one node won’t produce side effects on user experience. Typical stateless apps don’t rely on keeping session state in memory on the same server that processes the web request.

Unfortunately not all software is created equal.

One consequence of dynamically terminating server instances is that you can't rely on any particular instance being accessible at any given time, so two main considerations have to be made: provisioning an instance must be fully automated (for example through simple CloudFormation templates), and logs need to be forwarded to a remote host.

Consider what would happen if an instance were to terminate while holding active sessions: you'd be impacting user experience. There is a solution, though, in the Elastic Load Balancer (ELB), which we'll come back to shortly.

Architecting the app around Auto Scaling

First and foremost, you'll need the app to properly report your chosen autoscaling metric to CloudWatch. There are built-in metrics such as CPU utilisation and network throughput (memory is not reported out of the box), but I assure you these measurements alone will work for only a very limited few apps. After just a few days of running in an autoscaled environment, you will need to identify and report your own custom metrics.

I chose to develop an algorithm that gathers metrics from all running instances in the Auto Scale group and reports how efficiently the group would operate under the current load if one instance were removed. We compare this custom metric against the scale-up threshold (the point at which we would need to add an instance); if removing one instance would not trigger the scale-up policy, we consider a scale-down safe and remove an instance. The metric adjusts to the new count every time an instance is added or removed.
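
For illustration, here is a minimal sketch of reporting such a custom metric with boto3. The namespace, metric name, and the simple spread-the-load projection are my own illustrative assumptions, not the exact algorithm described above.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def report_scale_down_headroom(instance_count: int, avg_load_percent: float) -> None:
    """Report the projected per-instance load if one instance were removed."""
    # Illustrative projection only: spread the group's total load across
    # one fewer instance. The real algorithm described above is more involved.
    projected = avg_load_percent * instance_count / max(instance_count - 1, 1)
    cloudwatch.put_metric_data(
        Namespace="MyApp/AutoScaling",              # hypothetical namespace
        MetricData=[{
            "MetricName": "ProjectedLoadMinusOne",  # hypothetical metric name
            "Value": projected,
            "Unit": "Percent",
        }],
    )
```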

On EC2 Auto Scale groups

The EC2 Auto Scale group detects "unhealthy" instances and terminates them, then adjusts (sometimes taking no action) depending on the current "desired" instance count. The desired count is determined by the current state of the group, which is driven by the CloudWatch metric measurements defined in your auto scale policies.
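
For completeness, here is a minimal sketch of wiring a scale-up policy to a CloudWatch alarm with boto3; the group name, alarm name, thresholds, and the ProjectedLoadMinusOne metric carried over from the earlier sketch are all hypothetical.

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# A simple scale-up policy: add one instance each time the alarm fires.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-web-asg",       # hypothetical group name
    PolicyName="scale-up-by-one",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)

# A CloudWatch alarm on the custom metric drives the policy.
cloudwatch.put_metric_alarm(
    AlarmName="projected-load-high",         # hypothetical alarm name
    Namespace="MyApp/AutoScaling",           # matches the earlier sketch
    MetricName="ProjectedLoadMinusOne",
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```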

Some consultants and developer bloggers will tell you that the Auto Scale group needs the instance to respond with a 200 HTTP status code when polled, or it will be terminated. This is simply untrue. No HTTP request is being made; that claim implies a running web server must be installed on the instance, and in turn that a running web server is the definition of "healthy", which is preposterous. EC2 instances are not exclusively for web server use.

In fact, these people are confusing it with the Elastic Load Balancer (ELB) health check, which is not concerned with Auto Scale groups or CloudWatch metrics at all; it deals only with routing incoming HTTP requests to a healthy EC2 instance running a web server. The ELB health check does indeed expect the instance to run a web server of some sort, on a specific port, returning that HTTP 200 status for a specified file name. But if this health check fails, the instance is not terminated as some would suggest; the ELB simply marks it unhealthy and stops routing new incoming HTTP requests to it until it recovers.
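
To make the distinction concrete, here is a minimal sketch of configuring that ELB health check with boto3 against the classic ELB API; the load balancer name and target file are hypothetical.

```python
import boto3

elb = boto3.client("elb")  # the classic ELB API discussed here

# Poll HTTP port 80 for /health.html every 30s; two consecutive failures
# mark the instance unhealthy, two consecutive successes mark it healthy.
elb.configure_health_check(
    LoadBalancerName="my-web-elb",  # hypothetical name
    HealthCheck={
        "Target": "HTTP:80/health.html",
        "Interval": 30,
        "Timeout": 5,
        "UnhealthyThreshold": 2,
        "HealthyThreshold": 2,
    },
)
```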

So, back to how the Auto Scale group's instance health check works and terminates unhealthy hosts. Unfortunately this is not well documented by AWS, which is what leads to the misunderstandings among many developers and consultants. From experience, I've found an instance is considered unhealthy if one of the following occurs:

  • An action performed by the underlying EC2 system is not carried out successfully, for example if a scaling event starts but is unable to complete.

  • EC2's predefined CloudWatch metrics fail to report, i.e. INSUFFICIENT_DATA on network, CPU, and the like.

  • The instance becomes unresponsive, which is the most interesting of the three.

For an example of what I mean by unresponsive: the EC2 console makes requests to the EC2 backend when you do things like 'refresh' a view, and this in turn requires the EC2 backend to communicate with the instance, which is where it may be determined unresponsive. I encountered this when the internal AWS networks in the Sydney region became irregular during floods. I could SSH into my instance from the office in Melbourne using the public IP, but I could not communicate over the internal AWS network using a private IP from another instance in the same availability zone, and the Auto Scale group eventually terminated all of the running, publicly accessible instances!
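
Worth noting: although the two health checks are separate by default, an Auto Scale group can optionally be told to honour the ELB health check as well, in which case ELB-unhealthy instances are replaced. A minimal sketch with boto3, assuming a hypothetical group name:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# The default HealthCheckType is "EC2" (status checks only). Opting in to
# "ELB" makes the group also replace instances failing the ELB health check.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="my-web-asg",  # hypothetical group name
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,         # seconds after launch before checking
)
```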

Capacity planning

Let's look at what needs autoscaling, what can sit behind ELBs, and what can operate without them.

Memory optimised: auto scaled instances that do not process web requests. The best example that comes to mind is on-demand third-party data processing.
For example, you may have a requirement to gather information about flights and translate multiple data sources into a standard your application can recognise and process. The resource requirements would likely fluctuate based on estimated departure or arrival times. You might need to check terminal and gate information from a static data source less often in the days leading up to arrival/departure, but far more frequently in the final hour, when the data is likely to change suddenly. This priority queueing would be in high demand during the airport's peak periods and almost negligible during night curfew hours (in some cities).

You will likely have a lot of stateful, long-running processes in this scenario which require little in terms of CPU or disk space but need loads of RAM. Choosing a memory optimised instance type works really well here.

General purpose: great for a web stack behind an ELB. This is fairly self-explanatory so I'll be brief. High demand usually tracks daylight hours, and depending on your domain you might also need to react to specific events, for example the airing of an interview or advertisement on free-to-air TV, where you'll go from superficial traffic to a flood of interested, engaged traffic.

There are arguments over memory optimised vs CPU optimised instances; this generally comes down to a decision between being Highly Available (HA) and granular, or not HA and provisioned for heavy load.

I should make it clear that any web server should have a minimum of 2 CPUs, even when implementing granular scaling methodologies. The web server (Apache or Nginx) should be able to operate effectively while your code is executing. With one CPU, the ELB quickly ends up queueing incoming requests while that single CPU manages routing, processing, and serving web requests, even under minimal traffic. The result is sporadic, exaggerated metric reporting that triggers far too many conflicting scale actions. With 2 CPUs the instance can route and respond while requests are processed asynchronously, which is far more stable.

Compute optimised: speaking generically, these instances are not recommended for autoscale, with a few exceptions. I would generally advise using compute optimised for a batch processing server, where you deal with extremely short scripts that do one job without consideration for resource usage: sending emails, running ETLs, doing backups, push notifications, etc. Basically, anything that runs for a single purpose on a schedule (e.g. cron) and needs to perform its task immediately.
If you are looking at using compute optimised instances in an auto scale group, you're outside what I can help you with here. That's DNA processing, machine learning, or decision tree territory (from personal experience), and it needs more analysis than I feel is appropriate for this article.

On being granular: this idea is about getting the most out of the resources you operate, and using (paying for) only what you need to operate efficiently.
Personally I support the idea of being granular, but I doubt anyone can truly be prepared for all eventualities while remaining purely granular. In my experience, high traffic events on a web stack cannot always be predicted, so you rely almost entirely on having the right auto scale rules in place to deal with the unknown and inevitable high traffic events. If I suggest you might need to scale from 100 concurrent users to 1,000 within a minute, you can prepare for that, because it has been addressed in advance. What I am talking about is staying online and granular during unforeseeable events we cannot predict and address in advance. This is where your senior management needs to acknowledge that the unforeseeable is an acceptable risk, or you implement HA architecture alongside granularity in preparation for those events.

On being Highly Available
The considerations for being Highly Available could be:

  • Physical faults in one availability zone (AZ) fall back to an alternate AZ.
  • Major network problems in one region fall back to an alternate region.
  • High traffic events on your ELB are distributed evenly across instances in multiple AZs.

The single obvious difference between granular and HA: granular is localised to a single Auto Scale group, while HA means having redundancies, which basically means capacity needs to be 100% more than current demand at any given time.
Or more simply: granular intends to save money at the cost of failure resistance, while HA aims to ensure you still operate at 100% during unforeseeable disasters.

You should not choose between these, but rather strive to achieve a combination of both!
Being HA simply means you operate 2x what you would have set up granularly or otherwise; being granular just ensures you aren't wasting any resources.

Load Balancing

How do you normally manage SSL certificates for HTTPS? An ELB can manage your SSL certificates and ciphers for you. I have experience using the often misunderstood multi-site (UCC) SSL certificate, and while I won't go into the details of how beneficial a UCC cert will be for you, I strongly recommend using one on your ELB; you will almost certainly need it as you scale your business.

Basically, the ELB handles all your incoming HTTPS requests on port 443 and routes them over the encrypted internal AWS private network to the EC2 instance, where your web server handles and responds to them. Managing SSL certificates on the web server itself is typically unintuitive and error-prone, and certificates tend to expire suddenly, having been forgotten.

With the ELB this archaic pattern can be left to the data centre geezers. Because we are already communicating across the AWS private network, the ELB can route all of its incoming HTTPS port 443 traffic to your EC2 instances as HTTP on port 80, meaning your web servers no longer need to manage SSL certs while external connections remain HTTPS. This centralises one cert on the ELB instead of managing SSL on every instance. If your security alarms are ringing, I'd suggest only one thing: if you already use EC2 you must trust AWS, and configuring the ELB this way is no less secure than running the same workload on a single EC2 instance.
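
A minimal sketch of creating such a listener with boto3 (classic ELB API, matching the era of this article); the names, certificate ARN, and availability zones are hypothetical.

```python
import boto3

elb = boto3.client("elb")

# Terminate HTTPS at the load balancer: port 443 in, plain HTTP out to
# port 80 on the instances over AWS's internal network.
elb.create_load_balancer(
    LoadBalancerName="my-web-elb",  # hypothetical name
    Listeners=[{
        "Protocol": "HTTPS",
        "LoadBalancerPort": 443,
        "InstanceProtocol": "HTTP",
        "InstancePort": 80,
        # Hypothetical IAM server certificate ARN (your UCC cert).
        "SSLCertificateId": "arn:aws:iam::123456789012:server-certificate/my-ucc-cert",
    }],
    AvailabilityZones=["ap-southeast-2a", "ap-southeast-2b"],
)
```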

The ELB also provides a feature called connection draining. This is particularly important for an auto scale group because when a scale event terminates an EC2 instance, the ELB only finds out when it is too late; to a user, the dropped requests look like you are offline. With connection draining enabled, the instance remains responsive for as long as it takes to serve its currently open connections from the ELB, while the ELB sends no new requests to it.
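
Enabling it is a one-liner against the classic ELB API; a minimal sketch with boto3, assuming a hypothetical load balancer name and a 300-second drain timeout:

```python
import boto3

elb = boto3.client("elb")

# Give in-flight requests up to 300 seconds to complete before an instance
# is deregistered; the ELB sends it no new requests in the meantime.
elb.modify_load_balancer_attributes(
    LoadBalancerName="my-web-elb",  # hypothetical name
    LoadBalancerAttributes={
        "ConnectionDraining": {"Enabled": True, "Timeout": 300},
    },
)
```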

Mitigating the bottleneck (database)

When we talk about auto scaling and being efficient, we must address the elephant in the room. All businesses have a bottleneck and it is likely that yours is database I/O.

I have written about database optimisations previously, so I won't go into specifics here. Basically, we should reduce the time our auto-scaled instances spend talking to RDS (or any other database) so they can do their job as efficiently as possible and scale less.

AWS has tools specifically designed for this: CloudFront, Elasticache, and DynamoDB.

CloudFront (CF) is a CDN, but it can be more than that:

  • CF is most commonly used to serve static assets (CSS, JS, and images) stored in S3 buckets. S3 is not distributed; your assets live in a single region. CF caches the asset from S3 at edge locations around the world, closest to the requesting user, for faster delivery to the browser, and only contacts the S3 bucket once the cached resource's expiry has passed.
  • You might need to deliver streaming video transcoded for consumption on the web or using Apple's HLS for iOS apps.
  • In addition to static assets, consider having CF cache configuration files for your apps, or even the structured data (JSON, SOAP, XML) responses from your APIs. Your servers will then be hit less often to deliver content you deem cacheable (for a time).
  • Similarly, you might wish to cache certain web pages. Examples include templates typically requested by service workers, or dumped into the DOM for an AJAX call to hook up with data at a later stage; these are excellent candidates. Other possibilities are pages that attract superficial traffic, like about us, contact us, and careers: generally, any page without on-demand dynamic content should be cached for at least 7 hours (per Facebook's recommendation; see the sketch after this list).
  • More on superficial traffic: if most of your website's dynamic features are available only to logged-in users, on pages that can also be accessed while logged out with a less dynamic experience, it is extremely beneficial to have CF cache those pages for logged-out users for at least 2 minutes (again a Facebook recommendation). Superficial traffic may get a slightly stale page, whereas logged-in users holding a session pass through CF and get on-demand access to the dynamic features. That's great news for your Auto Scale group, which will receive far fewer hits from these superficial users.
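
As flagged above, a minimal sketch of the simplest lever for this: uploading an S3 asset with a Cache-Control header, which CloudFront honours (assuming the distribution is configured to respect origin headers). The bucket, key, and the 7-hour (25,200 second) max-age from the earlier bullet are illustrative.

```python
import boto3

s3 = boto3.client("s3")

# Upload an asset with a Cache-Control header; CloudFront caches it at the
# edge and will not return to the origin until the 7 hours have elapsed.
s3.put_object(
    Bucket="my-assets-bucket",              # hypothetical bucket
    Key="css/site.css",                     # hypothetical key
    Body=b"body { margin: 0; }",
    ContentType="text/css",
    CacheControl="public, max-age=25200",
)
```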

Elasticache is a service that abstracts Memcache or Redis for scalable in-memory caching.
If you've ever set up Memcache or Redis, you may have the impression it is best used for simple data: storing sessions for a clustered Node.js app, say, or simple stateful objects for a user who is load balanced across multiple servers.

Although these are common uses of the technology, they barely scratch the surface. You can utilise Elasticache in a way that makes your database utilisation drop to the point that any single piece of data is requested directly from the DB only as often as it changes, with one exception: Elasticache's optimisation algorithms sometimes purge infrequently accessed data from memory, at which point your database will be hit for the requested data even if it is unchanged since last access.

The key to implementing this technique has two major focuses:

a) Well-structured cache keys: store the relevant lookup data in the key itself, so the data can be targeted for easy fetching and invalidation.

b) Cache stampedes: these occur when highly depended-upon objects in memory suddenly expire and a stampede of requests for that data arrives at virtually the same moment. The result is the following pattern:

  1. Request the cache > cache expired.
  2. Request the data from the DB and wait for it, during which time more requests have come in and are doing the same.
  3. Tell Elasticache to store the new data; then all the other requests that came in also tell Elasticache to store their version of the data from the DB.
  4. Subsequent requests get served from the cache.

This is obviously inefficient, and in fact can generate unwanted read IOPS on your DB and crippling write IOPS on your Elasticache nodes.

To fix this, we tell Elasticache to store our objects a little longer than the application needs. For instance, when we intend to store an object for 7 hours, our application keeps that 7-hour value, but the call to Elasticache takes the 7 hours and adds 10 seconds. After the 7 hours have passed and our application considers the object expired, we can still access the stored copy for another 10 seconds. It'll make more sense in a moment.
For this to work, we actually store two pieces of data in Elasticache: our object, and when it was stored. The application already knows the object is to be stored for 7 hours, because that is coded into the request, so when it encounters an expired cache it can fetch and store fresh data for the appropriate time. By storing the created time alongside the object, we can instantly return the requested object to the requester and spawn another thread to update the object from the DB with fresh data, without interfering with the requester.

This mitigates the cache stampede because the object is still returned; during the few moments the first requester takes to update the object, we can keep serving it from cache while it is being refreshed.
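
Here is a minimal sketch of that soft-TTL pattern in Python, assuming a memcache-style client with get/set(expire=...) semantics (e.g. pymemcache) and a hypothetical load_from_db() loader; the 10-second grace period follows the description above.

```python
import json
import threading
import time

GRACE = 10  # extra seconds the object outlives its soft TTL

def cached_fetch(cache, key, ttl, load_from_db):
    """Serve from cache; on soft expiry, return stale data and refresh async."""
    raw = cache.get(key)
    if raw is not None:
        entry = json.loads(raw)
        if time.time() - entry["stored_at"] < ttl:
            return entry["data"]              # still fresh
        # Soft-expired: serve the stale copy immediately and refresh in the
        # background, so concurrent requests never stampede the database.
        threading.Thread(
            target=refresh, args=(cache, key, ttl, load_from_db)
        ).start()
        return entry["data"]
    # Nothing cached at all (e.g. purged by Elasticache): hit the DB inline.
    return refresh(cache, key, ttl, load_from_db)

def refresh(cache, key, ttl, load_from_db):
    data = load_from_db()
    entry = json.dumps({"data": data, "stored_at": time.time()})
    cache.set(key, entry, expire=ttl + GRACE)  # hard TTL = soft TTL + grace
    return data
```

A production version would also set a short "refresh in progress" flag so only the first soft-expired request spawns the background update; the shape of the pattern is otherwise the same.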

I use this not only to reduce I/O from my auto scaled instances to RDS, but to mitigate the IOPS limitations AWS imposes on the RDS product. This Elasticache layer is far cheaper than PIOPS and considerably more performant than hitting RDS directly.

I mentioned earlier that session data is a typical candidate for self-managed Memcache or Redis. I'd like to make it clear that although sessions are commonly saved to these technologies, they are not appropriate for Elasticache, for one reason alone: AWS controls the purging policy that keeps the service optimised, whereas your own installation of Redis or Memcache can be de-optimised so as not to aggressively purge important session data.
The last thing you want is your logged-in users being kicked out sporadically.

Elasticache is a powerful tool and should be respected for its use cases; sessions are better stored in a more static, centralised way, for example a document storage DB.

DynamoDB is Amazon's managed document storage database, ready for scalability through abstraction, but nonetheless still an amazing document store.

I have experience using DynamoDB for high-demand, real-time, and reliable data. DynamoDB costs can easily get out of hand, so you must be mindful that the data stored in DynamoDB actually belongs there.

Session data is a great candidate, since sessions usually last no longer than a few hours a day; another great candidate is auction data.

The best candidates for DynamoDB are data that is only meaningful for a defined timeframe, and that may be aggressively updated while being just as aggressively requested.

Think about the data for an eBay auction: its current highest bid and all of the participating users' max bids are only meaningful before the auction closes. After the auction this data becomes static; it won't change any more and can be stored more permanently in RDS.

DynamoDB suits this task perfectly: it is fast and handles concurrent changes well without blocking requests. And cost is a non-issue if you keep all your changes on a single object and remove the data as soon as it is no longer meaningful.
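
A minimal sketch of that non-blocking pattern with boto3's DynamoDB resource API: a conditional update that records a bid only if it beats the current highest, so concurrent bidders never take locks. The table and attribute names are hypothetical.

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
auctions = dynamodb.Table("Auctions")  # hypothetical table

def place_bid(auction_id: str, user_id: str, amount: int) -> bool:
    """Record a bid only if it beats the current highest; never blocks."""
    try:
        auctions.update_item(
            Key={"auction_id": auction_id},
            UpdateExpression="SET highest_bid = :amt, highest_bidder = :user",
            ConditionExpression="highest_bid < :amt",
            ExpressionAttributeValues={":amt": amount, ":user": user_id},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # a concurrent higher bid won; nothing was blocked
        raise
```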

Using DynamoDB this way relieves tremendous pressure from your RDS and the entire web stack, so I urge you to identify data like this in your own applications and use document storage the right way. Technology like MongoDB is not best suited to storing all of your data; your application and business will be better off using relational databases over document storage for long-term data needs.

To sum up

I've covered a huge number of topics and given you so many techniques to consider that your head is likely spinning, so let me bullet-point it for you:

  • Ensure your EC2 instances can be thrown away (terminated) by centralising logs in S3 and delegating responsibility for data and state storage elsewhere.

  • Craft your auto scale actions and CloudWatch metrics carefully. Be proactive in reporting engineered data for your metrics; scaling should be intimately tied to your particular business case.

  • Choose the right EC2 instance for the task it performs. Understand how being granular might affect an instance's ability to do its job, i.e. ensure web server instances have at least 2 CPUs.

  • Have the ELB manage your SSL cert to reduce complexity on each instance.

  • Implement HA methodology; it is actually more important for granular setups, contrary to what you may have been told.

  • Use CloudFront; it is very cheap and provides the most significant performance bang for buck.

  • Identify cacheable data and store it in Elasticache; bake it into the way you communicate with your relational database(s), which are always your bottleneck. Your IOPS will improve significantly.

  • Don't be afraid of DynamoDB's cost; in fact, it can be cheaper than relational or in-memory alternatives for the right data, once you consider the impact that data would have on your RDS throughput, and the reliability and performance benefits for users are second to none.

Understanding how to make the right decisions on AWS will save you immeasurably when you are able to scale without incident.
Choosing MVPs and outsourced development over proper AWS planning and architecture could be the difference between success and failure for a startup.