Functional Programming Without Feeling Stupid, Part 5: Project

In the last four installments of Functional Programming Without Feeling Stupid I’ve slowly built up a small utility called ucdump with Clojure. Experimenting and developing with the Clojure REPL is fun, but now it’s time to give some structure to the utility. I’ll package it up as a Leiningen project and create a standalone JAR for executing with the Java runtime.

Creating a new project with Leiningen

You can use Leiningen to create a skeleton project quickly. In my project’s root directory, I’ll say:

    lein new app ucdump

Leiningen will respond with:

    Generating a project called ucdump based on the 'app' template.

The result is a directory called ucdump, which contains:

    .gitignore   README.md    project.clj  src/
    LICENSE      doc/         resources/   test/

For now I’m most interested in the project file, project.clj, which is actually a Clojure source file, and the src directory, which is intended for the app’s actual source files.

Leiningen creates a directory called src/ucdump and seeds it with a core.clj file, but that’s not actually what I want, for two reasons:

I want ucdump to be a good Clojure citizen, so I’m going to put it in a namespace called com.coniferproductions.ucdump.
My Git repository for ucdump also contains the original Python version of the application, which is in projectroot/python, and I want the Clojure version to live in projectroot/clojure.

So first I’ll rename the ucdump directory created by Leiningen to clojure:

    mv ucdump clojure

Then I’ll make the namespace directories, move core.clj into place as ucdump.clj, and do the same for the generated test file:

    mkdir -p clojure/src/com/coniferproductions
    mv clojure/src/ucdump/core.clj clojure/src/com/coniferproductions/ucdump.clj
    rmdir clojure/src/ucdump
    mkdir -p clojure/test/com/coniferproductions
    mv clojure/test/ucdump/core_test.clj clojure/test/com/coniferproductions/ucdump_test.clj
    rmdir clojure/test/ucdump

This method of having each namespace in a separate file was suggested in the book Clojure Programming. The result looks like this:

    clojure
    ├── LICENSE
    ├── README.md
    ├── doc
    │   └── intro.md
    ├── project.clj
    ├── resources
    ├── src
    │   └── com
    │       └── coniferproductions
    │           └── ucdump.clj
    └── test
        └── com
            └── coniferproductions
                └── ucdump_test.clj
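
The layout follows the convention Clojure uses to find a namespace on the classpath: dots in the namespace name become directory separators, and hyphens become underscores in file names. Here is a minimal sketch of that mapping; ns->path is a hypothetical helper written only for illustration, not something Clojure or Leiningen provides.

```clojure
(require '[clojure.string :as str])

(defn ns->path
  "Hypothetical helper: turns a namespace symbol into the relative
  path of the source file where Clojure would look for it."
  [ns-sym]
  (str (-> (str ns-sym)
           (str/replace "-" "_")    ; hyphens become underscores
           (str/replace "." "/"))   ; dots become directories
       ".clj"))

(ns->path 'com.coniferproductions.ucdump)
;; => "com/coniferproductions/ucdump.clj"

(ns->path 'com.coniferproductions.ucdump-test)
;; => "com/coniferproductions/ucdump_test.clj"
```

This is why the test file is called ucdump_test.clj, even though its namespace would end in ucdump-test.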

There are some namespace references in the source files created by Leiningen which are now obsolete, so I’ll fix them eventually, but I’ll focus first on the project file. At this point it looks like this:

    (defproject ucdump "0.1.0-SNAPSHOT"
      :description "FIXME: write description"
      :url "http://example.com/FIXME"
      :license {:name "Eclipse Public License"
                :url "http://www.eclipse.org/legal/epl-v10.html"}
      :dependencies [[org.clojure/clojure "1.6.0"]]
      :main ^:skip-aot ucdump.core
      :target-path "target/%s"
      :profiles {:uberjar {:aot :all}})

You can read up on the settings in the Leiningen tutorial. These are suitable for a standalone application, but the actual values still need to be fixed. When I’m done with the project file, it looks like this:

    (defproject ucdump "0.1.0-SNAPSHOT"
      :description "Unicode character dump for UTF-8 encoded files"
      :url "https://github.com/jerekapyaho/ucdump"
      :license {:name "MIT License"
                :url "http://opensource.org/licenses/MIT"}
      :dependencies [[org.clojure/clojure "1.6.0"]]
      :main ^:skip-aot com.coniferproductions.ucdump
      :target-path "target/%s"
      :profiles {:uberjar {:aot :all}})

Putting the source code in its place

The source file created by Leiningen, which we moved to src/com/coniferproductions/ucdump.clj, initially looks like this:

    (ns ucdump.core
      (:gen-class))

    (defn -main
      "I don't do a whole lot ... yet."
      [& args]
      (println "Hello, World!"))

I won’t bother running that now (but I’ve done that with other projects before — it’s a useful smoke test). Instead it’s time to pour all the code we wrote in the earlier parts of this series into the ucdump.clj source file. I’ll also fix the namespace definition at the top of the file, and add some comments to the functions:

    (ns com.coniferproductions.ucdump
      (:gen-class))

    (def test-str "Na\u00EFve r\u00E9sum\u00E9s... for 0 \u20AC? Not bad!")
    (def test-ch { :offset 0 :character \u20ac })
    (def short-test-str "Na\u00EFve")

    (defn character-name [x]
      (java.lang.Character/getName (int x)))

    (defn character-line [pair]
      (let [ch (:character pair)]
        (format "%08d: U+%06X %s" (:offset pair) (int ch) (character-name ch))))

    (defn octet-count
      "Determines the length of a Unicode codepoint when encoded in UTF-8.
      See RFC 3629 for the details."
      [cp]
      (cond
        (and (>= cp 0x000000) (<= cp 0x00007F)) 1
        (and (>= cp 0x000080) (<= cp 0x0007FF)) 2
        (and (>= cp 0x000800) (<= cp 0x00FFFF)) 3
        (and (>= cp 0x010000) (<= cp 0x10FFFF)) 4
        :else 0))

    (defn octet-counts [s]
      (map octet-count (map int s)))

    (defn character-lines [s]
      (let [offsets (butlast (cons 0 (reductions + (octet-counts s))))
            pairs (map (fn [offset ch] {:offset offset :character ch}) offsets s)]
        (map character-line pairs)))

    (defn -main
      [& args]
      (doseq [line (character-lines test-str)] (println line)))

The main program creates a line for each character in test-str, and prints them to the standard output.

Leiningen knows from the project file’s :main setting that the function to call when starting the program is in the com.coniferproductions.ucdump namespace, so the -main function from there is the one to use.
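
Before running the whole program, it can help to see how the offsets come about. The sketch below repeats octet-count so that it can be pasted into a fresh REPL on its own, and walks through the short test string; the names s, counts and offsets are only for this illustration.

```clojure
(defn octet-count
  "Length in octets of a Unicode codepoint when encoded in UTF-8."
  [cp]
  (cond
    (and (>= cp 0x000000) (<= cp 0x00007F)) 1
    (and (>= cp 0x000080) (<= cp 0x0007FF)) 2
    (and (>= cp 0x000800) (<= cp 0x00FFFF)) 3
    (and (>= cp 0x010000) (<= cp 0x10FFFF)) 4
    :else 0))

(def s "Na\u00EFve")

;; One octet count per character: the i with diaeresis takes two octets.
(def counts (map (comp octet-count int) s))
;; => (1 1 2 1 1)

;; Running totals give the offset *after* each character, so prepend 0
;; and drop the last total to get the offset where each character starts.
(def offsets (butlast (cons 0 (reductions + counts))))
;; => (0 1 2 4 5)

;; Cross-check: the counts add up to the real UTF-8 length of the string.
(= (reduce + counts) (count (.getBytes s "UTF-8")))
;; => true
```

The jump from offset 2 to offset 4 is the two-octet character, which is the same gap that shows up in the program output.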

Time for a test run!

The application can be tested by changing to the Leiningen project directory (the clojure directory, where project.clj lives) and saying:

    lein run

The result should be:

    00000000: U+00004E LATIN CAPITAL LETTER N
    00000001: U+000061 LATIN SMALL LETTER A
    00000002: U+0000EF LATIN SMALL LETTER I WITH DIAERESIS
    00000004: U+000076 LATIN SMALL LETTER V
    00000005: U+000065 LATIN SMALL LETTER E
    00000006: U+000020 SPACE
    00000007: U+000072 LATIN SMALL LETTER R
    00000008: U+0000E9 LATIN SMALL LETTER E WITH ACUTE
    00000010: U+000073 LATIN SMALL LETTER S
    00000011: U+000075 LATIN SMALL LETTER U
    00000012: U+00006D LATIN SMALL LETTER M
    00000013: U+0000E9 LATIN SMALL LETTER E WITH ACUTE
    00000015: U+000073 LATIN SMALL LETTER S
    00000016: U+00002E FULL STOP
    00000017: U+00002E FULL STOP
    00000018: U+00002E FULL STOP
    00000019: U+000020 SPACE
    00000020: U+000066 LATIN SMALL LETTER F
    00000021: U+00006F LATIN SMALL LETTER O
    00000022: U+000072 LATIN SMALL LETTER R
    00000023: U+000020 SPACE
    00000024: U+000030 DIGIT ZERO
    00000025: U+000020 SPACE
    00000026: U+0020AC EURO SIGN
    00000029: U+00003F QUESTION MARK
    00000030: U+000020 SPACE
    00000031: U+00004E LATIN CAPITAL LETTER N
    00000032: U+00006F LATIN SMALL LETTER O
    00000033: U+000074 LATIN SMALL LETTER T
    00000034: U+000020 SPACE
    00000035: U+000062 LATIN SMALL LETTER B
    00000036: U+000061 LATIN SMALL LETTER A
    00000037: U+000064 LATIN SMALL LETTER D
    00000038: U+000021 EXCLAMATION MARK

However, I want to read the text from a UTF-8 encoded file, so let’s make the -main function do just that:

    (defn -main
      [& args]
      (let [characters (slurp (nth args 0) :encoding "UTF-8")]
        (doseq [line (character-lines characters)] (println line))))

The slurp function reads the contents of the file, and here I specify the encoding of the file as “UTF-8”. (See the slurp documentation for details.)
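
To try the :encoding option without preparing a test file by hand, here is a small sketch that writes a string to a temporary file with spit and reads it back with slurp. The round-trip function is a hypothetical helper for this example only.

```clojure
(defn round-trip
  "Hypothetical helper: writes text to a temporary file as UTF-8, reads
  it back, and reports the file size in bytes and the character count."
  [text]
  (let [tmp (java.io.File/createTempFile "ucdump-demo" ".txt")]
    (try
      (spit tmp text :encoding "UTF-8")
      {:bytes (.length tmp)
       :chars (count (slurp tmp :encoding "UTF-8"))}
      (finally
        (.delete tmp)))))

(round-trip "Na\u00EFve")
;; => {:bytes 6, :chars 5}
```

Five characters take six bytes on disk, because the i with diaeresis is encoded as two octets.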

The args sequence contains the command-line arguments supplied to the application, so I take the first argument with (nth args 0) (the index of the first argument is zero) and use it as the filename.

For a very detailed look at running Clojure applications with Leiningen, see How Clojure Babies Are Made: Understanding lein run by Flying Machine Studios.

If I now specify the filename:

    lein run ~/tmp/testfile-utf8.txt

then the application will produce the same output as above, because my testfile-utf8.txt contains the same text as test-str in the code.

Put it in a JAR

Leiningen has already equipped the project file with the means to make a standalone application. That is done by creating an “uberjar”, which packages up the application and all its dependencies so that it can be run using the Java VM. So if, in the project directory, I say:

    lein uberjar

Leiningen responds with:

    Compiling com.coniferproductions.ucdump
    Created /Users/Jere/Projects/ucdump/clojure/target/uberjar/ucdump-0.1.0-SNAPSHOT.jar
    Created /Users/Jere/Projects/ucdump/clojure/target/uberjar/ucdump-0.1.0-SNAPSHOT-standalone.jar

Now I can take this JAR and run it as a normal Java application:

    cp target/uberjar/ucdump-0.1.0-SNAPSHOT-standalone.jar ~/tmp
    java -jar ~/tmp/ucdump-0.1.0-SNAPSHOT-standalone.jar ~/tmp/testfile-utf8.txt

The output is the same as above. However, if you neglect to provide the filename when you run the application, you will get an ugly error message:

    Exception in thread "main" java.lang.IllegalArgumentException: No implementation of method: :make-reader of protocol: #'clojure.java.io/IOFactory found for class: nil

and a stack trace, which might make no sense at all. There is no need to add extensive command-line argument handling to the application (if you need that, take a look at the tools.cli library), but it’s good to do a quick check for the missing argument. This requires one little change in the -main function:

    (defn -main
      [& args]
      (when (not= (count args) 0)
        (let [characters (slurp (nth args 0) :encoding "UTF-8")]
          (doseq [line (character-lines characters)] (println line)))))

If the argument count is not zero, read from the file specified in the first argument; otherwise do nothing.
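
The reason for the original error can be reproduced at the REPL. This is a small sketch; rest-args and try-slurp are hypothetical helpers that exist only to show the behaviour.

```clojure
;; With no arguments, the rest parameter is nil, not an empty list.
(defn rest-args [& args] args)

(rest-args)           ;; => nil
(count (rest-args))   ;; => 0
(nth (rest-args) 0)   ;; => nil, and no exception is thrown here

;; slurp cannot make a reader out of nil, which is where the
;; IllegalArgumentException in the error message comes from.
(defn try-slurp [x]
  (try
    (slurp x)
    (catch IllegalArgumentException e
      :no-reader)))

(try-slurp nil)       ;; => :no-reader
```

So the missing filename quietly becomes nil and only fails once it reaches slurp. Checking (seq args) and taking (first args) would be another way to write the same guard.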

To make ucdump a proper UNIX-style tool, it should read from standard input if there is no filename. Maybe I’ll update it to do so when I find out how. For the latest version of the source, see the ucdump GitHub repository.
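
One possible way to do that, offered as a sketch and not as what the repository actually contains: slurp also accepts a reader, and Clojure binds standard input to *in*. The read-input function below is a hypothetical helper. Note that *in* decodes bytes with the platform default charset as far as I know, so (slurp System/in :encoding "UTF-8") may be the safer choice when the default is not UTF-8.

```clojure
(defn read-input
  "Hypothetical helper: returns the contents of the file named by the
  first argument, or everything on standard input if there is none."
  [args]
  (if-let [filename (first args)]
    (slurp filename :encoding "UTF-8")
    (slurp *in*)))

;; with-in-str rebinds *in* to a string, which makes this easy to try:
(with-in-str "Na\u00EFve"
  (read-input nil))
;; => "Naïve"
```

The -main function could then pass the result of (read-input args) to character-lines in both cases.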

Onwards

This concludes the series. I realise I have perhaps irrevocably managed to combine the words “functional”, “programming” and “stupid”, but the real intent is in the “without feeling” part. I’ve sometimes felt that I would need to be some sort of genius programmer to understand Clojure, and certainly some proponents make Clojure sound so obvious that I can’t help wondering whether there’s really something wrong with me. There must be something in the air (and not just Clojure/conj coinciding, which I honestly didn’t know about), since I just found out that Adam Bard had published Clojure is not for geniuses on 18 November 2014, a day after I started this series. That’s parallel evolution at work!

I wanted to tease out some practical aspects of Clojure without theory or condescension, and hope that this series helps you learn a little more about Clojure programming.

UPDATE 2014-12-14: After nearly a month, I’m closing the comments on all parts of this series, because nothing but SPAM appeared.

UPDATE 2020-02-23: Corrected typos and formatting.

UPDATE 2021-09-19: Corrected an embarrassing logic error in combining offsets and characters.