In part 1 of my “Learning Clojure” series, I created a simple program to calculate salary based on how many years someone worked. For this post, I’m going to be attempting something a bit more complicated.
Project Gutenberg
One of my favorite websites in the entire world is Project Gutenberg (PG). PG is an archive of books that have passed into the public domain, which makes it a great resource for text-mining data. I use it almost every time I need some words to parse, and you should too! So why does this matter right now? I’m glad that you asked.
Outline
Given how simple the last program was, I decided that I should probably take this one up a notch. It’s going to involve fetching a file, writing it to disk, reading the file, and processing command line args. In order, here’s what the program needs to do:
- Validate command line args – We’re going to accept two arguments: the word that should be counted and a URL that points to a .txt file on my web host (Project Gutenberg doesn’t like crawlers, apparently) for processing.
- Download the file – It could be large and might fail. We’ll need to be careful here.
- Split the file into a vector – Split the file up on ” ” and load it into a vector.
- Print – Print to standard output how many words were found. If none, make it known.
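The steps above can be sketched roughly as follows. This is a minimal sketch, not the code from the repository: the helper names (valid-args?, download!, split-words, report) are hypothetical, and slurp/spit stand in for whatever third-party HTTP library the real program uses. It splits on runs of whitespace rather than a single space.

```clojure
(require '[clojure.string :as cljstr])

;; Hypothetical helper: accept exactly two args, a word and a URL ending in .txt.
(defn valid-args? [args]
  (and (= 2 (count args))
       (cljstr/ends-with? (second args) ".txt")))

;; Assumption: slurp/spit replace the third-party download library.
;; Returns true on success, false if the fetch or write fails.
(defn download! [url path]
  (try
    (spit path (slurp url))
    true
    (catch java.io.IOException e
      (println "Download failed:" (.getMessage e))
      false)))

;; Split the text into a vector of words on runs of whitespace.
(defn split-words [text]
  (cljstr/split text #"\s+"))

;; Build the message to print, making it explicit when nothing was found.
(defn report [word n]
  (if (zero? n)
    (str "No occurrences of '" word "' found")
    (str n " occurrences of '" word "'")))
```

A call such as (report "the" (count (filter #{"the"} (split-words "the cat and the hat")))) would then produce the message for two matches.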
The Program
The full source code can be found at https://github.com/vital101/learn-clojure-wordcount. It looks pretty simple, but I did expand my Clojure knowledge quite a bit with this one. Some of the things I did:
- Used a 3rd party library.
- Messed around with vectors (split word data) and sequences (args).
- Wrote to a file.
- Refactored constantly.
I do want to highlight one bit of code that I wrote, because it’s pretty straightforward but does a lot of stuff.
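    ;; write-file, get-source-file, log, and filename are defined elsewhere in the source.
    (defn process [url word]
      ;; Fetch the source text and save it to disk.
      (write-file (get-source-file url))
      (log "Info" "Processing File...")
      ;; Read the file back and split it on runs of whitespace.
      (let [data (cljstr/split (slurp filename) #"\s+")]
        ;; A set used as a predicate keeps only the elements equal to word.
        (log "Result" (str (count (filter #{word} data)) " occurrences of '" word "'"))))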
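The (filter #{word} data) part is worth a closer look. In Clojure a set can be called as a function: it returns the element if it is in the set and nil otherwise, so it works as a predicate. Here is a small standalone sketch of that idea; the count-occurrences name and the sample words are made up for illustration.

```clojure
;; Hypothetical helper showing a set used as a filter predicate.
(defn count-occurrences [word words]
  (count (filter #{word} words)))

;; Sample data, invented for this example.
(def sample-words ["the" "cat" "and" "the" "hat" "The" "the,"])

;; Calling a set directly: the element comes back if present, nil if not.
(def hit (#{"the"} "the"))
(def miss (#{"the"} "cat"))

;; Equivalent count written with an explicit equality check.
(def explicit-count (count (filter #(= "the" %) sample-words)))
```

Note that the match is exact: "The" and "the," in the sample are not counted as "the", so capitalized words and words with punctuation attached are missed unless the text is normalized first.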
My next program needs to be more complicated from a data perspective, so that I’m forced to use things like “map”, “reduce”, and other functional elements on data sets.
