Strategy To Manage Cache At Scale

One of the major challenges we face as modern developers is how to cache effectively, so that we don't have to perform heavy I/O to serve identical requests.

An application is only as fast as its slowest component, and usually that is the relational database

With a solid caching strategy the database is not only relieved of immense pressure, we also increase performance, because the slowest cog may in some cases never be contacted at all.

I'll explain two main strategies you might follow to manage cache at scale in a technology-agnostic way, but first let's quickly explore what a cache is and what our expectations are when we use the word "cache".

So what exactly is a distributed cache?

The difference between a cache and a distributed cache is that the distributed version takes the traditional cache and provides its functions at scale. That is to say, the technology is designed to scale indefinitely based on demand, whereas a traditional cache exists only on the same server as the application using it, sharing its memory and limited memory space.

What functions does a cache provide?

Caches are pretty simple in terms of the functionality they expose; the common functions (sketched in code after this list) are;

  • store: takes a datum and a key to identify it
    • The datum is usually a string
    • When a previously used key is given, the new datum will overwrite the existing datum for that key
    • The cache has no constraints on datum uniqueness; it will happily store the same datum for as many keys as it is sent
    • TTL: some caches offer an expiry time for the given key; if a request is received after the set expiry the datum is purged
  • retrieve: takes a key identifier and attempts to return the stored datum
    • If no datum was ever stored for the given key, or if the TTL has passed, the retrieve will fail
  • delete: takes a key identifier and attempts to purge the stored datum
    • If no datum was ever stored for the given key this function still succeeds in most caches
  • LRU: Least Recently Used algorithm
    • This is common because most caches have a fixed amount of memory; space must be freed so that newer or prioritised datums can be served
  • Locking: a function that deals with race conditions in a cache that is not sequential/transactional by design
    • This is not common, but it exists; if your goal is to be scalable you might want to choose a cache that supports record locking so it scales reliably
  • O(1): constant time, commonly expressed in Big-O notation
    • This is a scalability dream: every operation completes in consistent, near-identical time, as opposed to different operations taking time that grows with the size of the data (n seconds)
    • A caveat: when you ask the cache to delete something that was never set, or attempt to retrieve data that never existed, it takes the same time as if it had done the work; performance- and scale-wise it would have been better never to make that request in the first place
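
To ground those functions, here is a minimal sketch assuming the PHP Memcached extension and a memcached server on localhost; the key names and values are purely illustrative.

$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

// store: a key, a datum (string) and an optional TTL in seconds
$cache->set('user:123', '{"name":"stoff"}', 300);

// retrieve: fails (returns false) if the key was never set or the TTL has passed
$datum = $cache->get('user:123');

// delete: purges the datum for the key
$cache->delete('user:123');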

Node discovery

If the cache doesn't provide node discovery, there are simple ways to implement it yourself in code;

  • Simple maths utilising modulus
  • Consistent hashing (e.g. using crc32), sometimes called a continuum because nodes are plotted around a ring, much like the face of an analogue clock.

This is essential for a distributed cache; without it the cache cannot effectively scale.

Many go straight to the modulus method because developers are generally comfortable with maths, but the engineer in me saw the flaw in its use and needed to master consistent hashing.

Turns out there is already a PHP implementation of Consistent hashing, a Python one, and one in pure JavaScript too, so I never did get to write one myself.

Using modulus you would put your N cache node server IPs in an array and pick one using hash(key) % N.
The problem with the modulus method, which consistent hashing doesn't have, is that as soon as you add or remove a node (server), i.e. change N, most of your cache becomes unreachable, and any key that continues to resolve to the right node is pure coincidence.
This leads to some pretty crazy race conditions as your data sources get smashed while the cache re-hydrates; this is what we refer to in the cache world as a cache stampede.
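
A quick, hypothetical PHP illustration of that remapping problem; the node IPs are made up, and crc32 simply stands in for whatever hash turns the key into a number.

$nodes = ['10.0.0.1', '10.0.0.2', '10.0.0.3'];

function pickNode($key, array $nodes) {
    $hash = (int) sprintf('%u', crc32($key)); // unsigned hash of the key
    return $nodes[$hash % count($nodes)];     // the modulus method: hash(key) % N
}

// Count how many keys resolve to a different node once a 4th node is added.
$grown = array_merge($nodes, ['10.0.0.4']);
$moved = 0;
for ($i = 0; $i < 10000; $i++) {
    if (pickNode("user:$i", $nodes) !== pickNode("user:$i", $grown)) {
        $moved++;
    }
}
echo "$moved of 10000 keys now live on a different node\n"; // roughly 75% when going from 3 to 4 nodes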

With consistent hashing the only data that cannot be retrieved is the data that was stored on the removed node.
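
For contrast, here is a bare-bones, hypothetical sketch of a consistent-hashing continuum (not one of the libraries mentioned above): each node is plotted many times around a ring, and a key belongs to the first node found walking clockwise from the key's position.

class HashRing {
    private $ring = [];      // position on the ring => node
    private $positions = []; // sorted ring positions

    public function __construct(array $nodes, $replicas = 100) {
        foreach ($nodes as $node) {
            for ($i = 0; $i < $replicas; $i++) {
                $pos = (int) sprintf('%u', crc32("$node#$i"));
                $this->ring[$pos] = $node;
            }
        }
        $this->positions = array_keys($this->ring);
        sort($this->positions);
    }

    public function pickNode($key) {
        $pos = (int) sprintf('%u', crc32($key));
        foreach ($this->positions as $p) {       // a binary search would be used in practice
            if ($p >= $pos) {
                return $this->ring[$p];
            }
        }
        return $this->ring[$this->positions[0]]; // wrap around the ring
    }
}

// Adding or removing a node only moves the keys that fell on that node's arcs.
$ring = new HashRing(['10.0.0.1', '10.0.0.2', '10.0.0.3']);
$node = $ring->pickNode('user:123');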

Cache keys

As mentioned previously, the cache will not prevent you storing non-unique data; it only cares about the key that identifies a relationship to an arbitrary datum.

However, we use caches to store our data conveniently and uniquely. Let's focus on "uniqueness", because what the cache understands as unique is the key, while what we expect to be unique is the datum.

The responsibility for uniqueness is placed entirely on the application; the cache doesn't care

There are many ways for an application to generate a key, and every application defines its expectation of uniqueness based on its own use cases. However, there is a way to ensure uniqueness: it requires a key identifier store which manages these keys, plus application logic bound to the cache, which might include strategies to invalidate associated data (discussed in more detail in the invalidation section).

Cache key generation

Comparing the uniqueness of the options you have for a cache key;

  • String
    • Generated by the app, usually inline for every item you cache; this article proposes a standardised way to create it below
    • Only as unique as your app logic defines it
    • Character limits are rarely an issue, and write performance is consistent regardless of key size
    • memcached: max key size is 250 bytes (the max value size is 1MB by default)
    • redis: max 512MB for cache keys (interesting because the max value size is also 512MB)
  • hash:
    • At best as unique as the app-generated string above, and potentially less so due to collisions
    • Hashing functions map arbitrary inputs to a fixed-size output, so collisions are always theoretically possible
  • UUID:
    • Unique across hosts, even when generated concurrently, thanks to microsecond-or-better entropy
    • Basically, with the computing power of the foreseeable future, your chances of a collision are negligible

The go-to is usually the worst choice: a hash function like md5 or sha1.
I strongly argue that a hash is problematic and redundant for use as a cache key.

Since your hash's input (the seed) must already be the way you define your expectation of uniqueness, the hashing itself serves no additional purpose to the cache. Using the seed as the key is all the cache actually needs; hashing is simply unnecessary overhead, and in the context of scalability (the reason we use caches) any unnecessary overhead should be eliminated.

Hashing functions are problematic and redundant for use as a cache key

An example in PHP would be sha1(json_encode($_REQUEST)), and anyone who uses this would argue it is not manual, yet you have not actually done what you think you have done. Your goal was to ensure the cache key is unique for this use case and that by using this key you will fetch the expected data. My question is: do you know the unique key to use to fetch the data saved with the above method? You might think you do, but I assure you, you do not really know.

When you save data with this method you open the way for things like cache-busting to populate the cache with N duplicates.

Let's say you have ["userId"=>123] as your input; the cache key would be 5b9dfbd8045853c992f66e9caa1ef3c7cd76cf4a, and if you used that key you would get the expected data. However, the data you got was not unique to that key; there can be multiple keys and duplications of that data in the cache. Hashing in this way gives you no control to prevent N more copies of the same data being stored under different keys generated from unhandled input. Your data was created from the handled input, but your key creation method allows any unhandled input to generate more "unique" keys storing the same data.
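
A hypothetical illustration of that cache-busting duplication: the handled input is identical in all three requests, yet every extra, unhandled parameter produces a brand-new key for the same data.

$handled = ["userId" => 123];
$busted  = ["userId" => 123, "utm_source" => "newsletter"]; // unhandled tracking param
$busted2 = ["userId" => 123, "cb" => "1712345678"];         // unhandled cache-busting param

echo sha1(json_encode($handled)), "\n"; // one key
echo sha1(json_encode($busted)), "\n";  // a second key, the same data cached again
echo sha1(json_encode($busted2)), "\n"; // a third key, the same data cached yet again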

This problem isn't noticeable and may go undetected because the application is doing what it was designed to do.

This is actually a massive performance problem: your cache is now full of data you will never request again. It might lead to nodes being spawned prematurely to handle the extra demand, and it might even lead to unmanageable IOPS while this extra storing happens and while the LRU runs to clear out the bad data over time.

Needless to say, if you've not created and enforced a cache key generation strategy you are open to a lot of unwanted problems.

Back to uniqueness: consider the way a data set is considered unique. I mentioned earlier that uniqueness is an expectation that differs with every application. If you have sessions, there is a high chance you will want to serve a user with an active session different data from an anonymous user. Both users have the same entry point to the data, the only difference being that one has an active session and the other doesn't. You will have to handle this properly in your cache key generation, or you will encounter the issue of anonymous users seeing the cached version stored by a user with an active session, or worse, one active session seeing another active session's data as their own.
There are many more complexities to consider if you must manually generate your cache keys: they could be time-based for things like auctions or events, or there could even be geolocation reasons for one resource to differ slightly between two users.

We haven't yet questioned whether you should ever even need to care about manually creating a unique cache key when writing code.

Let me introduce you to UUID.
A UUID takes no input from the application and guarantees uniqueness across hosts simultaneously, down to the microsecond or better.
Using a UUID for the cache key, there is no need to introduce complex rules and assumptions into your application at all; the cache technology you choose gets the uniqueness IT cares about.
But the uniqueness our application cares about still has to be handled. To enforce it we need a cache key strategy: a lookup that lets us retrieve the keys we used (the UUIDs) with logic our application understands.
A cache key store not only prevents mistakes in our code (making bad keys, or not handling keys correctly for the expected outcome), it also gives us a programmatic way to access associated keys when we need to invalidate related data, something the hashing method cannot achieve.

Cache key store

To preserve O(1) behaviour, caches do not expose a way to iterate through the keys they are using; doing so would be a resource drain and time consuming, and with LRU and TTL factors constantly changing state, an iterable list would be extremely difficult to implement. It is far better for your application logic to maintain its own store which, if the need arises, provides the ability to iterate through keys.

Quickly on iteration: when you think you need it, you are more likely in need of some regex, like preg_match_all in PHP or re.findall in Python, to extract only the keys you actually wish to iterate rather than all known keys, as sketched below.
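
A small, hypothetical example of that idea: run the regex over the key store's own lookup index (not the cache) to pull out only the entries you care about.

$index = "User|getUsers|all\nUser|getUsers|role/Std\nOrder|getOrders|all";

preg_match_all('/^User\|getUsers\|.*$/m', $index, $matches);
// $matches[0] is ["User|getUsers|all", "User|getUsers|role/Std"]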

This idea of a cache key store is likely new to you, and given the cache is using UUIDs you might not yet grasp how an application could possibly operate with them, so I'll provide an OO-esque interface to ease the learning curve;

protocol CacheKeyStore
  prop storeName = 'my cache key store name'
  prop delimiter = '|'
  func <void> 
       set => <str> cacheKey,
              <str> json
  func <str|cacheKey> 
       makeKey => <str> __CLASS__, 
                  <str> __METHOD__, 
                  <str> handledInput
  func <bool>
       exists => <str> cacheKey
  func <str>
       fetch => <str> cacheKey  
  func <void>
       delete => <str> cacheKey  
  func <void>
       deleteAssoc => <str> cacheKeyPart

The only part of that self-documenting abstract that needs clarification is the delimiter, which is used to create logical ways to identify keys by their parts.
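
To make the abstract concrete, here is a minimal, hypothetical PHP sketch of such a store. The in-memory arrays stand in for wherever you would really persist the lookup index and the cached values (the cache itself, a small table, etc.), and the uuid4 helper is my own; nothing here is a particular library's API.

class CacheKeyStore {
  const DELIMITER = '|';

  private static $index = [];   // "Class|method|handledInput" => UUID (the real cache key)
  private static $backing = []; // stand-in for the real cache: UUID => json

  // Build the lookup string and return its UUID, minting one on first use.
  public static function makeKey($class, $method, $handledInput) {
    $lookup = implode(self::DELIMITER, [$class, $method, $handledInput]);
    if (!isset(self::$index[$lookup])) {
      self::$index[$lookup] = self::uuid4();
    }
    return self::$index[$lookup];
  }

  public static function set($cacheKey, $json)   { self::$backing[$cacheKey] = $json; }
  public static function exists($cacheKey)       { return isset(self::$backing[$cacheKey]); }
  public static function fetch($cacheKey)        { return self::$backing[$cacheKey]; }
  public static function delete($cacheKey)       { unset(self::$backing[$cacheKey]); }

  // Invalidate every entry whose lookup string contains the given key part.
  public static function deleteAssoc($cacheKeyPart) {
    foreach (self::$index as $lookup => $uuid) {
      if (strpos($lookup, $cacheKeyPart) !== false) {
        unset(self::$backing[$uuid], self::$index[$lookup]);
      }
    }
  }

  // RFC 4122 version 4 UUID built from random bytes.
  private static function uuid4() {
    $b = random_bytes(16);
    $b[6] = chr((ord($b[6]) & 0x0f) | 0x40);
    $b[8] = chr((ord($b[8]) & 0x3f) | 0x80);
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($b), 4));
  }
}

Notice the lookup string is built with the delimiter; that is what later lets deleteAssoc find every associated key by a key part.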

This is a visual aid of where that code sits in your stack;

Cache Key Strategy

A general purpose use case in PHP would be;

class User {
  function getUserById($id) {
    $cacheKey = CacheKeyStore::makeKey(__CLASS__, __FUNCTION__, "userId/$id");
    if (CacheKeyStore::exists($cacheKey)) {
      // Cache hit: serve straight from the cache.
      $data = CacheKeyStore::fetch($cacheKey);
    } else {
      // Cache miss: hit the data source, then hydrate the cache.
      $data = $someModel->someGetter();
      CacheKeyStore::set($cacheKey, $data);
    }
    return $data;
  }
}

You would add additional uniqueness rules to the makeKey function based on application-specific goals such as time of day, geolocation and sessions, to name a few, but how that is done is up to you. The important thing is that there is no unnecessary hashing, no implied knowledge of how to craft a hash seed to ensure uniqueness of the data, and no unwanted duplication of data being stored accidentally.

All of these benefits come from a small, simple, reusable bit of code, and without a hash. You could go as far as building it directly into $someModel->someGetter() to remain DRY.

Using such a technique instead of straight-up hashing not only handles the majority of edge cases, it also removes any need to know how the cache works at all; caching is now completely enclosed within data fetching, something your application should already do well.

This may seem too good to be true, and in practice it is not quite this easy. You still need to consider how you define TTLs and, most importantly, how to implement an invalidation strategy that works as seamlessly as the DRY example above.

To answer the question of expiry I'll refer to the policy used at Facebook, which AFAIK breaks down into 2 use cases;

  1. 7 hours for data that can be invalidated programmatically;
  2. 2 minutes for data that cannot.

For data we cannot invalidate we are likely thinking of time-based data, for example a list of the last x items sold. Invalidating this programmatically is not practical because we cannot predict the input, which is an as-yet-unknown timestamp.
So we either cache it for 2 minutes and let the TTL naturally keep it fresh, or we don't cache it at all and put pressure on the data source for every request. Not caching something should typically be reserved for existence checks, therefore this theoretical example would be cached for 2 minutes.

All other data should be cached for 7 hours, because the theory is: if we can invalidate it, we trust our application to keep it fresh through our implementation; if this is ever untrue, the problem extends beyond any single use case.
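
A small sketch of how that policy could be encoded; CacheTtl is a hypothetical helper of this article, and the storage call simply uses the standard Memcached extension for illustration.

class CacheTtl {
  const INVALIDATABLE = 7 * 3600; // 7 hours: our application is trusted to invalidate it sooner
  const TIME_BASED    = 2 * 60;   // 2 minutes: cannot be invalidated, let the TTL keep it fresh
}

// e.g. a "last x items sold" list we cannot invalidate programmatically:
$recentlySold = ['item-9', 'item-12', 'item-7'];
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);
$cache->set('recently-sold', json_encode($recentlySold), CacheTtl::TIME_BASED);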

As for cache invalidation strategies, one can only theorise about the numerous ways this might be done, because there are limitless possibilities for what your application's uniqueness parameters will be, and any invalidation strategy is directly bound to what is considered unique.

A simple use case would be a users list, assuming there are no other uniqueness factors like sessions. One would implement invalidation something like this;

Route: POST /user/:unique_constraint {arbitrary user data}
Route: GET /user/:unique_constraint {user data}

  • POST /user/chris {"name":"stoff","role":"Admin"}
  • POST /user/ben {"name":"ben","role":"Std"}
  • POST /user/rodney {"name":"mckay","role":"Std"}

getting data;

  • GET /user/chris

returns;
{"name":"stoff"}

Route: GET /users {list of users}
returns;
[{"name":"stoff"}, {"name":"ben"}, {"name":"mckay"}]

Route: GET /users/standard {list of users with role Std}
returns;
[{"name":"ben"}, {"name":"mckay"}]

DELETE /user/mckay

GET /users
[{"name":"stoff"}, {"name":"ben"}]

GET /users/standard
[{"name":"ben"}]

To implement invalidation of the collections you need to be able to find the right keys to invalidate. This is true for both the hashing technique and the cache key store technique, with one very important difference: with hashing you need to find all the collections on your own (perhaps all related collections can be grepped, if the devs were strict enough to name them a certain way), whereas with the cache key store you only need to ask it to invalidate all the keys for associated collections (you don't even care about the UUIDs themselves, they are simply invalidated). That is far easier, and it will not leave any collection missed by a dev building their own hashes to send to the cache's delete API.
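
A hypothetical sketch of the DELETE handler above leaning on the cache key store; the method and input names follow the earlier getUserById example and are illustrative only.

class User {
  function deleteUser($username) {
    // ... remove the user from the data source here ...

    // Invalidate the single-user entry for this username.
    CacheKeyStore::delete(
      CacheKeyStore::makeKey(__CLASS__, 'getUserByUsername', "username/$username")
    );

    // Invalidate every cached collection the user may appear in, matched by the
    // "Class|method" key part built with the store's delimiter.
    CacheKeyStore::deleteAssoc('User|getUsers');
  }
}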

Conclusion

Not only are hashes redundant as cache keys, even in the most zealous use cases arguing for them, the alternative UUID method can be completely invisible in practice if implemented correctly, and it has none of the side effects or edge cases that come with hashes.

The concepts above come from experience with extremely high traffic and demand on the cache, combined with growing teams and applications. You simply cannot afford a bad cache strategy at scale, and having the right one in place before heavy load arrives can save you immeasurable amounts of cold hard cash when scale inevitably becomes necessary.

The biggest benefit of using a cache key store is that you can change from one cache technology to another without the application itself needing a refactor; all you do is implement the wrapping abstraction methods for the chosen technology. This can save you if, say, you chose memcached early on for its ease of use but now need to scale and have decided redis or AWS ElastiCache is better for the business.