Exploring data with Clojure, Incanter, and Leiningen
I’m working through Machine Learning in Action at the moment, and it’s done in Python. I don’t really know Python, but I’d prefer to learn Clojure, so I’m redoing the code samples.
This blog post shows how to read a CSV file, manipulate it, then graph it. It turns out Clojure is pretty good for this in combination with the Incanter library (think R for the JVM). It took me a while to get an environment set up, since I’m unfamiliar with basically everything.
Install Clojure
I already had it installed so can’t remember if there were any crazy steps to get it working. Hopefully this is all you need:
    brew install clojure
Install Leiningen
Leiningen is a build tool which does many things, but most importantly for me, it manages the classpath. I was jumping through all sorts of hoops trying to get Incanter running without it.
There are easy-to-follow instructions in the README.
*UPDATE:* As suggested in the comments, you can probably just `brew install lein` here and that will get you Leiningen and Clojure in one command.
Create a new project
    lein new hooray-data && cd hooray-data
Add Incanter as a dependency to the `project.clj` file, and also a main target:
    (defproject hooray-data "1.0.0-SNAPSHOT"
      :description "FIXME: write"
      :dependencies [[org.clojure/clojure "1.2.0"]
                     [org.clojure/clojure-contrib "1.2.0"]
                     [incanter "1.2.3-SNAPSHOT"]]
      :main hooray_data.core)
Add some Incanter code to `src/hooray_data/core.clj`:

    (ns hooray_data.core
      (:gen-class)
      (:use (incanter core stats charts io datasets)))

    (defn -main [& args]
      (view (histogram (sample-normal 1000))))
Then fire it up:
    lein deps
    lein run
If everything goes to plan you’ll see a pretty graph: a histogram of 1,000 samples drawn from a normal distribution.
Code
First, a simple categorized scatter plot. `read-dataset` works with both URLs and files, which is pretty handy.
    (ns hooray_data.core
      (:use (incanter core stats charts io)))

    ; Sample data set provided by Incanter
    (def plotData (read-dataset
                    "https://raw.github.com/liebke/incanter/master/data/iris.dat"
                    :delim \space
                    :header true))

    (def plot (scatter-plot
                (sel plotData :cols 0)
                (sel plotData :cols 1)
                :x-label "Sepal Length"
                :y-label "Sepal Width"
                :group-by (sel plotData :cols 4)))

    (defn -main [& args]
      (view plot))
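To make the `:delim` and `:header` options a bit more concrete, here is a rough plain-Clojure sketch of that kind of parsing. It is not how Incanter implements `read-dataset`; the `parse-delimited` helper and the inline sample text are made up for illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, not part of Incanter: splits the text into lines,
;; treats the first line as the header, and returns one map per data row,
;; keyed by the header strings. Values are left as strings.
(defn parse-delimited [text delim]
  (let [[header & rows] (map #(str/split % delim) (str/split-lines text))]
    (map #(zipmap header %) rows)))

;; Made-up two-row sample, space-delimited with a header line.
(def sample
  "Sepal.Length Sepal.Width Species\n5.1 3.5 setosa\n7.0 3.2 versicolor")

(def rows (parse-delimited sample #" "))

;; Look up a column by its header name.
(println (map #(get % "Species") rows))
```

Unlike this sketch, a real dataset library also has to turn the numeric columns into numbers, which is a good reason to let `read-dataset` do the work.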
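Second, the same data but normalized: each column is shifted so its minimum is 0, then divided by its new maximum so every value lands between 0 and 1. The graph will look the same apart from the axis scales, but the underlying data is now ready for some more math.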
    (ns hooray_data.core
      (:use (incanter core stats charts io)))

    ; Sample data set provided by Incanter
    (def data (read-dataset
                "https://raw.github.com/liebke/incanter/master/data/iris.dat"
                :delim \space
                :header true))

    (defn extract [f]
      (fn [data]
        (map #(apply f (sel data :cols %)) (range 0 (ncol data)))))

    (defn fill [n row] (map (fn [x] row) (range 0 n)))

    (defn matrix-row-operation [operand row matrix]
      (operand matrix
        (fill (nrow matrix) row)))

    ; Probably could be much nicer using `reduce`
    (defn normalize [matrix]
      (let [shifted (matrix-row-operation minus ((extract min) matrix) matrix)]
        (matrix-row-operation div ((extract max) shifted) shifted)))

    (def normalized-data (normalize (to-matrix (sel data :cols [0 1]))))

    (def normalized-plot (scatter-plot
                           (sel normalized-data :cols 0)
                           (sel normalized-data :cols 1)
                           :x-label "Sepal Length"
                           :y-label "Sepal Width"
                           :group-by (sel data :cols 4)))

    (defn -main [& args]
      (view normalized-plot))
I was kind of hoping the `normalize` function would have already been written for me in a standard library, but I couldn’t find it.
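For comparison, here is a minimal sketch of the same min-max normalization in plain Clojure, working on a sequence of numeric rows rather than an Incanter matrix. The name `normalize-columns` is hypothetical, and the sketch uses `map` over transposed columns rather than `reduce`; it assumes every column contains at least two distinct values, so the range is never zero.

```clojure
;; Hypothetical helper, not an Incanter function.
;; Rescales each column of `rows` to the range 0..1 using (x - min) / (max - min).
;; Assumes a non-empty collection of equal-length rows and no constant columns.
(defn normalize-columns [rows]
  (let [cols   (apply map vector rows)                       ; transpose: one vector per column
        mins   (map #(apply min %) cols)                     ; per-column minimum
        ranges (map #(- (apply max %) (apply min %)) cols)]  ; per-column max - min
    (for [row rows]
      (map (fn [x lo r] (/ (- x lo) r)) row mins ranges))))

;; Made-up sample: three rows, two columns.
(def scaled (normalize-columns [[1.0 10.0] [2.0 20.0] [3.0 40.0]]))

;; First column after scaling: the minimum maps to 0.0, the maximum to 1.0.
(println (map first scaled))
```

Computing the minimum and range up front means each row is rescaled in a single pass, instead of shifting the whole matrix and then scanning it again for the new maximum.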
I’ll report back if anything else of interest comes up as I’m working through the book.