Using the Django Per-Site Cache with the Nginx HTTP Memcached Module

For a long time I thought that the most interesting problems in my field were in scalability. Some people may be more interested in scaling, and others might be more into slick interfaces and fast animations. But for me, scalability has continued to be my passion. For awhile though, it was a unicorn. That unattainable thing that I wanted to work on but couldn’t find anywhere to do it at. That is, until I started work at Future US.

Future is a media company. Originally they started in old media focusing heavily on gaming and tech magazines. Eventually the internet became prominent in everyday life, so more of their old media properties made the transition to the web. The one that really matters to me though is PC Gamer. I’ve been a huge fan of PC Gamer since I was about 7 years old. I still have fond memories getting demo disks in the mail with my subscription.

When I was hired at Future it was to help facilitate the move of PC Gamer from its existing platform (WordPress) to Django. Future had experienced success moving other properties to Django, so it made sense to do it with PC Gamer. When it eventually came time to implement our caching layer, we thought about a lot of different ways that it could be done. Varnish came up as an option, but we decided against it since nobody on the team had experience configuring it (and people elsewhere in the organization had experienced issues with it). Eventually we settled on having Nginx serve pages directly from Memcache. For us, this method works great because PC Gamer doesn’t have a lot of interaction (its almost completely consumption from the user end). Anything that does require back-and-forth between the server is handled via javascript, which makes full page caching super easy to do.

The high level architecture for pc gamer.
The high level architecture for pc gamer.

So how does it all work? The image above describes PC Gamer’s server architecture from a high level. Its pretty basic and works quite well for us. We end up having two types of requests: cache hits & cache misses. The flow for a cache hit is: request -> load balancer -> nginx -> memcache -> your browser. The flow for a cache miss is: request -> load balancer -> nginx -> application server (django) -> (store page in cache) -> your browser.

Since we’re basically running a static site, deciding what content to cache is easy: EVERYTHING!

Cache all the things!
Cache all the things!

Luckily for us Django already has a nice way of doing this: The per-site cache. But it is not without its issues. First of all, the cache keys it creates are insane. We needed something a little simpler for our setup so Nginx could build the cache key of the current request on the fly.

How It Works

The meat and potatoes of overriding Django’s per-site cache key comes in the `_generate_cache_key` function.

def _generate_cache_key(request, method, headerlist, key_prefix):
    if key_prefix is None:
        key_prefix = settings.CACHE_MIDDLEWARE_KEY_PREFIX
    cache_key = key_prefix + get_absolute_uri(request)
    return hashlib.md5(cache_key).hexdigest()

To make things easier for Nginx to understand we just take the url and md5 it. Simple!

On the Nginx side of things, the setup is equally simple.

        set            $combined_string "$host$request_uri";
        set_by_lua     $memcached_key "return ngx.md5(ngx.arg[1])" $combined_string;
 
        # 404 for cache miss
        # 502 for memcached down
        error_page     404 502 504 = @fallback;
 
        memcached_pass {{ cache.private_ip }}:11211;

All this setup does is take the MD5 of the host + request URI and then check to see if that cache key exists in memcache. If it does then we serve the content at that cache key, if it doesn’t we fall back to our Django application servers and they generate the page.

Thats it. Seriously. It’s simple, extremely fast, and works for us. Your mileage may vary, but if you have relatively simple caching requirements I highly suggest looking into this method before looking at something like Varnish. It could help you remove quite a bit of complexity from your setup.

Getting around memory limitations with Django and multi-processing

I’ve spent the last few weeks writing a data migration for a large high traffic website and have had a lot of fun trying to squeeze every bit of processing power out of my machine. While playing around locally I can cluster the migration so it executes on fractions of the queryset. For instance.

./manage.py run_my_migration --cluster=1/10
./manage.py run_my_migration --cluster=2/10
./manage.py run_my_migration --cluster=3/10
./manage.py run_my_migration --cluster=4/10

All this does is take the queryset that is generated in the migration and chop it up into tenths. No big deal. The part that is a big deal is that the queryset contains 30,000 rows. In itself that isn’t a bad thing, but there are a lot of memory and cpu heavy operations that happen on each row. I was finding that when I tried to run the migration on our Rackspace Cloud servers the machine would exhaust its memory and terminate my processes. This was a bit frustrating because presumably the operating system should be able to make use of the swap and just deal with it. I tried to make the clusters smaller, but was still running into issues. Even more frustrating was that this happened at irregular intervals. Sometimes it took 20 minutes and sometimes it took 4 hours.

Threading & Multi-processing

My solution to the problem utilized the clustering ability I already had built into the program. If I could break the migration down into 10,000 small migrations, then I should be able to get around any memory limitations. My plan was as follows:

  1. Break down the migration into 10,000 clusters of roughly 3 rows a piece.
  2. Execute 3 clustered migrations concurrently.
  3. Start the next migration after one has finished.
  4. Log the state of the migration so we know where to start if things go poorly.

One of the issues with doing concurrency work with Python is the global interpreter lock (GIL). It makes writing code a lot easier, but doesn’t allow Python to spawn proper threads. However, its easy to skirt around if you just spawn new processes like I did.

Borrowing some thread pooling code here, I was able to get pretty sweet script running in no time at all.

import sys
import os.path
 
from util import ThreadPool
 
def launch_import(cluster_start, cluster_size, python_path, command_path):
    import subprocess
 
    command = python_path
    command += " " + command_path
    command += "{0}/{1}".format(cluster_start, cluster_size)
 
    # Open completed list.
    completed = []
    with open("clusterlog.txt") as f:
        completed = f.readlines()
 
    # Check to see if we should be running this command.
    if command+"\n" in completed:
        print "lowmem.py ==> Skipping {0}".format(command)
    else:
        print "lowmem.py ==> Executing {0}".format(command)
        proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        output = proc.stdout.read() # Capture the output, don't print it.
 
        # Log completed cluster
        logfile = open('clusterlog.txt', 'a+')
        logfile.write("{0}\n".format(command))
        logfile.close()
 
 
if __name__ == '__main__':
 
    # Simple command line args checking
    try:
        lowmem, clusters, pool_size, python_path, command_path = sys.argv
    except:
        print "Usage: python lowmem.py <clusters> <pool_size> <path/to/python> <path/to/manage.py>"
        sys.exit(1)
 
    # Initiate log file.
    if not os.path.isfile("clusterlog.txt"):
        logfile = open('clusterlog.txt', 'w+')
        logfile.close()
 
    # Build in some extra space.
    print "\n\n"
 
    # Initiate the thread pool
    pool = ThreadPool(int(pool_size))
 
    # Start adding tasks
    for i in range(1, int(clusters)):
        pool.add_task(launch_import, i, clusters, python_path, command_path)
 
    pool.wait_completion()

Utilizing the code above, I can now run a command like:

python lowmem.py 10000 3 /srv/www/project/bin/python "/srv/www/project/src/manage.py import --cluster=" &

Which breaks the queryset up into 10,000 parts and runs the import 3 sets at a time. This has done a great job of keeping the memory footprint of the import low, while still getting some concurrency so it doesn’t take forever.