Uncategorized | Re-Cycled Air

I’ve spent the last few weeks writing a data migration for a large, high-traffic website and have had a lot of fun trying to squeeze every bit of processing power out of my machine. While playing around locally, I can cluster the migration so that each run executes on a fraction of the queryset. For instance:

./manage.py run_my_migration --cluster=1/10
./manage.py run_my_migration --cluster=2/10
./manage.py run_my_migration --cluster=3/10
./manage.py run_my_migration --cluster=4/10

All this does is take the queryset that is generated in the migration and chop it up into tenths. No big deal. The part that is a big deal is that the queryset contains 30,000 rows. In itself that isn’t a bad thing, but there are a lot of memory- and CPU-heavy operations that happen on each row. I was finding that when I tried to run the migration on our Rackspace Cloud servers, the machine would exhaust its memory and terminate my processes. This was a bit frustrating because presumably the operating system should be able to make use of swap and just deal with it. I tried to make the clusters smaller, but was still running into issues. Even more frustrating was that this happened at irregular intervals. Sometimes it took 20 minutes and sometimes it took 4 hours.
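The post doesn't show how the migration turns a value like 2/10 into a slice of the queryset, so the following is only a minimal sketch of one way to do it. The function name cluster_bounds is hypothetical, and a plain list stands in for the real queryset.

```python
def cluster_bounds(spec, total_rows):
    """Return (start, stop) row offsets for a cluster spec such as "2/10".

    Hypothetical helper: the real migration's splitting logic is not shown.
    """
    index, count = (int(part) for part in spec.split("/"))
    if not 1 <= index <= count:
        raise ValueError("cluster index must be between 1 and {0}".format(count))
    # Integer arithmetic spreads any remainder across the clusters, so every
    # row lands in exactly one cluster even when the split is uneven.
    start = (index - 1) * total_rows // count
    stop = index * total_rows // count
    return start, stop


rows = list(range(30000))  # stand-in for the real queryset
start, stop = cluster_bounds("2/10", len(rows))
chunk = rows[start:stop]   # the second tenth of the rows
```

A Django queryset can be sliced the same way, provided it has a stable ordering so that the clusters don't overlap between runs.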

Threading & Multi-processing

My solution to the problem utilized the clustering ability I already had built into the program. If I could break the migration down into 10,000 small migrations, then I should be able to get around any memory limitations. My plan was as follows:

Break down the migration into 10,000 clusters of roughly 3 rows apiece.
Execute 3 clustered migrations concurrently.
Start the next migration after one has finished.
Log the state of the migration so we know where to start if things go poorly.

One of the issues with doing concurrency work with Python is the global interpreter lock (GIL). It makes writing code a lot easier, but it only lets one thread execute Python bytecode at a time, so threads can’t run CPU-heavy work in parallel. However, it’s easy to skirt around if you just spawn new processes like I did.
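To illustrate the idea, here is a small sketch in which each thread does nothing but wait on a child process, so the heavy lifting happens outside the parent interpreter and its GIL. It uses concurrent.futures from the Python 3 standard library rather than the borrowed pool code, and the child command is a placeholder that only reports its process ID.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Placeholder workload: the child just prints its own process ID.
CHILD_CODE = "import os; print(os.getpid())"


def run_child(_):
    # The calling thread blocks here while a separate process does the work.
    output = subprocess.check_output([sys.executable, "-c", CHILD_CODE])
    return int(output.decode().strip())


# Three worker threads, each waiting on its own child process.
with ThreadPoolExecutor(max_workers=3) as executor:
    child_pids = list(executor.map(run_child, range(3)))

print("parent pid:", os.getpid())
print("child pids:", child_pids)
```

Each child has its own interpreter and its own memory, and that memory is handed back to the operating system when the child exits.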

Borrowing some thread pooling code, I was able to get a pretty sweet script running in no time at all.

import sys
import os.path
 
from util import ThreadPool
 
def launch_import(cluster_start, cluster_size, python_path, command_path):
    import subprocess
 
    command = python_path
    command += " " + command_path
    command += "{0}/{1}".format(cluster_start, cluster_size)
 
    # Open completed list.
    completed = []
    with open("clusterlog.txt") as f:
        completed = f.readlines()
 
    # Check to see if we should be running this command.
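    if command+"\n" in completed:
        print("lowmem.py ==> Skipping {0}".format(command))
    else:
        print("lowmem.py ==> Executing {0}".format(command))
        proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        # Capture the output, don't print it. communicate() drains both pipes
        # and waits for exit, so the child can't stall on a full stderr pipe.
        output, errors = proc.communicate()
 
        # Log the cluster only if it succeeded, so failures are retried.
        if proc.returncode == 0:
            with open('clusterlog.txt', 'a') as logfile:
                logfile.write("{0}\n".format(command))
        else:
            print("lowmem.py ==> Failed {0}".format(command))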
 
 
if __name__ == '__main__':
 
    # Simple command line args checking
    try:
        lowmem, clusters, pool_size, python_path, command_path = sys.argv
    except ValueError:
        # Unpacking fails when the wrong number of arguments is given.
        print("Usage: python lowmem.py <clusters> <pool_size> <path/to/python> <path/to/manage.py>")
        sys.exit(1)
 
    # Initiate log file.
    if not os.path.isfile("clusterlog.txt"):
        logfile = open('clusterlog.txt', 'w+')
        logfile.close()
 
    # Build in some extra space.
    print("\n\n")
 
    # Initiate the thread pool
    pool = ThreadPool(int(pool_size))
 
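    # Start adding tasks. Clusters are numbered 1 through <clusters>,
    # inclusive, so the range must include the last one.
    for i in range(1, int(clusters) + 1):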
        pool.add_task(launch_import, i, clusters, python_path, command_path)
 
    pool.wait_completion()

Utilizing the code above, I can now run a command like:

python lowmem.py 10000 3 /srv/www/project/bin/python "/srv/www/project/src/manage.py import --cluster=" &

This breaks the queryset up into 10,000 parts and runs the import 3 clusters at a time. It has done a great job of keeping the memory footprint of the import low, while still getting some concurrency so it doesn’t take forever.
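The script imports ThreadPool from a util module that isn't reproduced in the post. For readers who want to run it, here is a minimal sketch of a pool exposing the same two methods the script calls, add_task and wait_completion. It is an assumption about what the borrowed code looked like, not a copy of it, and it is written for Python 3 (on Python 2 the queue module is named Queue).

```python
import queue
import threading


class ThreadPool(object):
    """Run queued tasks on a fixed number of worker threads (sketch)."""

    def __init__(self, num_threads):
        self.tasks = queue.Queue()
        for _ in range(num_threads):
            worker = threading.Thread(target=self._work)
            worker.daemon = True  # don't keep the interpreter alive on exit
            worker.start()

    def _work(self):
        while True:
            func, args, kwargs = self.tasks.get()
            try:
                func(*args, **kwargs)
            except Exception as exc:
                # A failing task should not kill the worker thread.
                print("task failed: {0!r}".format(exc))
            finally:
                self.tasks.task_done()

    def add_task(self, func, *args, **kwargs):
        """Queue func(*args, **kwargs) to run on the next free worker."""
        self.tasks.put((func, args, kwargs))

    def wait_completion(self):
        """Block until every queued task has finished."""
        self.tasks.join()


# Usage: ten small tasks, at most three running at once.
results = []
pool = ThreadPool(3)
for i in range(1, 11):
    pool.add_task(results.append, i * i)
pool.wait_completion()
print(sorted(results))
```

Because a worker picks up the next task as soon as it finishes the current one, this gives the "start the next migration after one has finished" behaviour from the plan without any extra bookkeeping.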

“The grey rain-curtain turned all to silver glass and was rolled back, and he beheld white shores and beyond them a far green country under a swift sunrise.”

Over the last couple of weeks, I’ve had somewhat of a crisis with grad school. I’ve found that I hate 1 of my classes, despise C++ (which is required for compilers), and just simply don’t have enough time to breathe. This pretty much describes a typical computer science student’s life; however, I was pushed over the edge 2 weeks ago.

Two weeks ago, I was hit really hard, all at once. I had 3 exams in one week, and 2 programs due. If you’ve ever tried to write a compiler or a remote procedure program and tried to study for a test at the same time, you’ll know what I mean. To make matters worse, CMU told me I owe them $2,350. This is all because they over-refunded me earlier in the semester, yet failed to realize it until 2 months later. Not only that, but I had 150 papers to grade.

It all stacked up, and made me realize that I don’t really enjoy grad school all that much. So, I started applying for jobs. So far I have been rather successful with getting callbacks, but CMU has decided that they may be willing to work with me.

The grass is always greener on the other side. If I quit grad school, I’ll wish I didn’t. If I don’t quit grad school, I’ll wish I had.