0 to 1 Million: Scaling my side project to 1 million requests a day

In the Beginning

In late 2014 I decided that I needed a side project.  There were some technologies that I wanted to learn, and in my experience building an actual project was the best way to do that.  As I sat on my couch trying to figure out what to build, I remembered an idea I had back when I was still a junior dev doing WordPress development.  The idea was that people building commercial plugins and themes should be able to use the automated update system that WordPress provides.  There were a few self-managed solutions out there for this, but I thought building a SaaS product would be a good way to learn some new tech.

Getting Started

My programming history in 2014 looked something like: LAMP (PHP, MySQL, Apache) -> Ruby on Rails -> Django.  In 2014 Node.js was becoming extremely popular and MongoDB had started to become mature.  Both of these technologies interested me, so I decided to use them on this new project.  As to not get too overwhelmed with learning things, I decided to use Angular for fronted since I was already familiar with it.

A few months after getting started, I finally deployed https://kernl.us for the world to see.  To give you an idea of the expectations I had for this project, I deployed it to a $5/month Digital Ocean droplet.  That means everything (Mongo, Nginx, Node) was on a single $5 machine.  For the next month or two, this sufficed since my traffic was very low.

The First Wave

In December of 2014 things started to get interesting with Kernl.  I had moved Kernl out of a closed alpha and into beta, which led to a rise in sign ups.  Traffic steadily started to climb, but not so high that it couldn’t be handled by a single $5 droplet.

Around December 5th I had a customer with a large install base start to use Kernl.  As you can see the graph scale completely changes.  Kernl went from ~2500 requests per day, to over 2000 requests per hour.  That seems like a lot (or it did at the time), but it was still well within what a single $5 droplet could handle.  After all, thats less that 1 request per second.

Scaling Up

Through the first 3 months of 2015 Kernl experienced steady growth.  I started charging for it in February, which helped fuel further growth as it made customers feel more comfortable trusting it with something as important as updates.  Starting in March, I noticed that resource consumption on my $5 droplet was getting a bit out of hand.  Wanting to keep costs low (both in my development time and actual money) I opted to scale Kernl vertically to a $20 per month droplet.  It had 2GB of RAM and 2 cores, which seemed like plenty.  I knew that this wasn’t a permanent solution, but it was the lowest friction one at the time.

During the ‘Scaling Up’ period that Kernl went through, I also ran into issues with Apache.  I started out by using Apache as a reverse proxy because I was familiar with it, but it started to fall over on me when I would occasionally receive requests rates of about 20/s.  Instead of tweaking Apache, I switched to using Nginx and have yet to run in to any issues with it.  I’m sure Apache can handle far more that 20 requests/s, but I simply don’t know enough about tweaking it’s settings to make that happen.

SCaling Out & Increasing Availability

For the rest of 2015 Kernl saw continued steady growth.  As Kernl grew and customers started to rely on it for more than just updates (Bitbucket / Github push-to-build), I knew that it was time to make things far more reliable and resilient than they currently were.  Over the course of 6 months, I made the following changes:

  • Moved file storage to AWS S3 – One thing that occasionally brought Kernl down or resulted in dropped connections was when a large customer would push an update out.  Lots of connections would stay open while the files were being download, which made it hard for other requests to get through without timing out.  Moving uploaded files to S3 was a no-brainer, as it makes scaling file downloads stupid-simple.
  • Moved Mongo to Compose.io – One thing I learned about Mongo was that managing a cluster is a huge pain in the ass.  I tried to run my own Mongo cluster for a month, but it was just too much work to do correctly.  In the end, paying Compose.io $18/month was the best choice.  They’re also awesome at what they do and I highly recommend them.
  • Moved Nginx to it’s own server – In the very beginning, Nginx lived on the same box as the Node application.  For better scaling (and separation of concerns) I moved Nginx to it’s own $5 droplet.  Eventually I would end up with 2 Nginx servers when I implemented a floating ip address.
  • Added more Node servers – With Nginx living on it’s own server, Mongo living on Compose.io, and files being served off of S3, I was able to finally scale out the Node side of things.  Kernl currently has 3 Node app servers, which handle requests rates of up to 170/second.

Final Thoughts

Over the past year I’ve wondered if taking the time to build things right the first time through would have been worth it.  I’ve come to the conclusion that optimizing for simplicity is probably what kept me interested in Kernl long enough to make it profitable.  I deal with enough complication in my day job, so having to deal with it in a “fun” side project feels like a great way to kill passion.

Using the Django Per-Site Cache with the Nginx HTTP Memcached Module

For a long time I thought that the most interesting problems in my field were in scalability. Some people may be more interested in scaling, and others might be more into slick interfaces and fast animations. But for me, scalability has continued to be my passion. For awhile though, it was a unicorn. That unattainable thing that I wanted to work on but couldn’t find anywhere to do it at. That is, until I started work at Future US.

Future is a media company. Originally they started in old media focusing heavily on gaming and tech magazines. Eventually the internet became prominent in everyday life, so more of their old media properties made the transition to the web. The one that really matters to me though is PC Gamer. I’ve been a huge fan of PC Gamer since I was about 7 years old. I still have fond memories getting demo disks in the mail with my subscription.

When I was hired at Future it was to help facilitate the move of PC Gamer from its existing platform (WordPress) to Django. Future had experienced success moving other properties to Django, so it made sense to do it with PC Gamer. When it eventually came time to implement our caching layer, we thought about a lot of different ways that it could be done. Varnish came up as an option, but we decided against it since nobody on the team had experience configuring it (and people elsewhere in the organization had experienced issues with it). Eventually we settled on having Nginx serve pages directly from Memcache. For us, this method works great because PC Gamer doesn’t have a lot of interaction (its almost completely consumption from the user end). Anything that does require back-and-forth between the server is handled via javascript, which makes full page caching super easy to do.

The high level architecture for pc gamer.
The high level architecture for pc gamer.

So how does it all work? The image above describes PC Gamer’s server architecture from a high level. Its pretty basic and works quite well for us. We end up having two types of requests: cache hits & cache misses. The flow for a cache hit is: request -> load balancer -> nginx -> memcache -> your browser. The flow for a cache miss is: request -> load balancer -> nginx -> application server (django) -> (store page in cache) -> your browser.

Since we’re basically running a static site, deciding what content to cache is easy: EVERYTHING!

Cache all the things!
Cache all the things!

Luckily for us Django already has a nice way of doing this: The per-site cache. But it is not without its issues. First of all, the cache keys it creates are insane. We needed something a little simpler for our setup so Nginx could build the cache key of the current request on the fly.

How It Works

The meat and potatoes of overriding Django’s per-site cache key comes in the `_generate_cache_key` function.

def _generate_cache_key(request, method, headerlist, key_prefix):
    if key_prefix is None:
        key_prefix = settings.CACHE_MIDDLEWARE_KEY_PREFIX
    cache_key = key_prefix + get_absolute_uri(request)
    return hashlib.md5(cache_key).hexdigest()

To make things easier for Nginx to understand we just take the url and md5 it. Simple!

On the Nginx side of things, the setup is equally simple.

        set            $combined_string "$host$request_uri";
        set_by_lua     $memcached_key "return ngx.md5(ngx.arg[1])" $combined_string;
 
        # 404 for cache miss
        # 502 for memcached down
        error_page     404 502 504 = @fallback;
 
        memcached_pass {{ cache.private_ip }}:11211;

All this setup does is take the MD5 of the host + request URI and then check to see if that cache key exists in memcache. If it does then we serve the content at that cache key, if it doesn’t we fall back to our Django application servers and they generate the page.

Thats it. Seriously. It’s simple, extremely fast, and works for us. Your mileage may vary, but if you have relatively simple caching requirements I highly suggest looking into this method before looking at something like Varnish. It could help you remove quite a bit of complexity from your setup.