In my recent post Functional Programming Without Feeling Stupid I took a quick look at how functional programming can be a little off-putting for the non-initiated. I promised to provide some examples of my own first steps with FP, and now I would like to present some to you.
If you are thinking about getting intimate with Clojure, you will need to get to know the REPL. It is your playground, and will always be, even if you later start packaging and organizing your code.
Clojure depends on the Java Virtual Machine (JVM) and is actually distributed as a normal JAR file (Java ARchive), like most Java libraries are. You can start Clojure from the JAR, but you will save yourself some trouble and prepare for the future if you install Leiningen, the dependency management tool for Clojure. It is simple to install and run, and I will assume that you will follow the instructions on the Leiningen web site sooner or later. Now would be a good time.
When you’re done with the installation, you only need to say
lein repl
to start a Clojure REPL. I’m using OS X, so what I describe here was done from Terminal. You don’t need to create a project with Leiningen if you just want to play around in the REPL.
Of course, if you don’t have Java installed, you need to get it first. Refer to the Java web site of Oracle for details as necessary. Furthermore, some of the things I will describe require Java version 7 or later.
What I want to do with Clojure is an example of a real-world problem, at least for me. A couple of years ago I wrote a very small tool called udump for listing all the Unicode characters and their names in a UTF-8 encoded text file. Because Python had the character names as part of the platform libraries, and also supports the UTF-8 character encoding, I was able to quickly whip up a tool which gave me the results I wanted, and it has saved me a lot of trouble on several occasions.
Java also supports Unicode extremely well, and Clojure inherits all that goodness from its host, so it should be almost trivial to create a version of udump in Clojure. To learn Clojure better I wanted to do just that, and put the result on GitHub.
So for the rest of this post, and hopefully future posts as well, you can watch me as I build parts of the utility — now called ucdump for “Unicode character dump” — in the Clojure REPL, slowly progressing towards something I can package up and use myself, and maybe others can also benefit from it.
After you have started the Clojure REPL in Leiningen, you’ll see something like this:
nREPL server started on port 52478 on host 127.0.0.1 - nrepl://127.0.0.1:52478
Docs: (doc function-name-here)
Source: (source function-name-here)
Javadoc: (javadoc java-object-or-class-here)
Exit: Control+D or (exit) or (quit)
Results: Stored in vars *1, *2, *3, an exception in *e
user=> is the prompt from the REPL.
ACT 1: “YOU ARE IN A MAZE OF TWISTY LITTLE PASSAGES, ALL ALIKE.”
OK, what now?
Many of the Clojure tutorials on the web and in books start with an exploration of values. This is not really a tutorial, so I’m going to defer to those other sources (any of them will do, but I would start with Clojure from the ground up), and cut to the chase.
I’m interested in strings which consist of Unicode characters. I know that if I have a UTF-8 encoded text file, I most likely can read it into memory as one big string, or read each line into a separate string. Either way will do, and if the files are less than a few megabytes, that will not be a problem. I won’t get to actually reading the file in a long time, but it pays to think ahead just a little bit. At this point, however, the problem pretty much boils down to this:
Given a string, for each character in the string print out the Unicode code point and the official name of the character (if any). Precede that with an offset that points to the location in the file where the character appears.
No matter if I have just one string, or I have several lines stored in separate strings, I can do the same for all of them.
By the way, if you don’t know (much) about Unicode, now would be a good time to find out (more). I recommend Joel Spolsky’s classic The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). If that piques your interest, hop on over to the web pages of The Unicode Consortium to find out more about the Unicode standard.
So, let’s say my file contains only the following text:
Naïve résumés... for 0 €? Not bad!
If you look carefully, there are some characters which you may not use every day. Then again, you may not think of them as being in any way special, and that is as it should be. Either no characters are special, or they all are. A character in Unicode is a character, that’s all.
What does it take to make this a string in Clojure? Well, you could just enclose it in quotes and enter it in the REPL, and Clojure will echo it back to you:
user=> "Naïve résumés... for 0 €? Not bad!"
"Naïve résumés... for 0 €? Not bad!"
On my Finnish Apple keyboard I need to use a couple of “dead keys” to get the diacritics above some of the characters. The euro symbol is even easier: just Shift-4 will do. This direct entry works because the Terminal in OS X uses UTF-8 as its character encoding, so I can type in any character and expect it to be preserved. However, if you want to be extra safe, you could use Unicode escapes, as in Java, to designate any non-ASCII characters, though you may need to look up the character codes in some reference:
"Na\u00EFve r\u00E9sum\u00E9s... for 0 \u20AC? Not bad!"
As you can see, all the non-ASCII characters have been escaped, using their hexadecimal Unicode character code. The result is still the same in the REPL:
user=> "Na\u00EFve r\u00E9sum\u00E9s... for 0 \u20AC? Not bad!"
"Naïve résumés... for 0 €? Not bad!"
You can ask Clojure about the type of this value:
user=> (type "Naïve résumés... for 0 €? Not bad!")
java.lang.String
So, that is just a plain old Java String.
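To pull the last few REPL interactions together, here is a minimal sketch comparing the directly typed string with its escaped form. The names literal-text and escaped-text are made up for this illustration, and the comparison assumes the source is read as UTF-8, as in the Terminal session described above.

```clojure
;; The same text written two ways: typed directly, and with Unicode escapes.
;; Assumption: this source is read as UTF-8.
(def literal-text "Naïve résumés... for 0 €? Not bad!")
(def escaped-text "Na\u00EFve r\u00E9sum\u00E9s... for 0 \u20AC? Not bad!")

;; Both forms should produce the same java.lang.String value.
(println (= literal-text escaped-text))
(println (type escaped-text))

;; count reports the number of characters in the string.
(println (count escaped-text))
```

If the two strings ever compare as different, the likely culprit is the encoding of the terminal or the source file, not Clojure itself.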
What about just one character, like the capital N that starts the string? For that you will need to use a backslash:
user=> \N
\N
Note that this value is not a string:
user=> (type \N)
java.lang.Character
If you’re coming from basically any language somehow derived from C, you might think that \n is an escape sequence for a newline character, but oh no it isn’t:
user=> (type \n)
java.lang.Character
user=> (print \n)
nnil
What was that last response from the REPL? The lowercase character n, followed by the value returned by whatever printed it: in this case, nil, as in no value or nothing (a nod to the Lisp tradition).
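So where did the newline go? As a small sketch of the distinction: Clojure spells the newline character literal \newline, while the familiar "\n" escape still works inside a string literal.

```clojure
;; \n is the letter n; the newline character is written \newline.
(println (int \n))        ; character code of the letter n
(println (int \newline))  ; character code of the line feed

;; Inside a string literal, "\n" is still the usual escape sequence:
;; a one-character string whose only character is the newline.
(println (count "\n"))
(println (= \newline (first "\n")))
```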
ACT 2: “WE DEMAND RIGIDLY DEFINED AREAS OF DOUBT AND UNCERTAINTY!”
You may notice I’ve sneaked something past you without explaining it, namely forms like these:
(type \N)
(print \n)
What are they? First of all, they are lists. Both of them have two items. But, more importantly, they are also function applications. In Clojure, like in all Lisps, the item in the first position is a function. We say that the item is in the “function position”. The rest of the items are arguments to the function. It depends on the function how many arguments it takes.
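One way to see that a function application really is a list is to quote it, which tells Clojure not to evaluate it. Quoting has not been introduced in this post, so treat the following as a preview sketch; the name form is made up for the illustration.

```clojure
;; Quoting a list stops Clojure from evaluating it, so it can be inspected as data.
(def form '(type \N))

(println (list? form))   ; is it a list?
(println (count form))   ; how many items does it have?
(println (first form))   ; the item in the function position
(println (rest form))    ; the arguments

;; Without the quote, the same list is evaluated as a function application.
(println (type \N))

;; How many arguments are accepted depends on the function; str takes any number.
(println (str \N \o \t))
```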
In the Clojure REPL, you can get a quick overview of some symbol with the doc function, like this:
user=> (doc print)
-------------------------
clojure.core/print
([& more])
Prints the object(s) to the output stream that is the current value
of *out*. print and println produce output for human consumption.
nil
If you read the documentation carefully, you can determine that println will probably add a newline. Let’s try it:
user=> (println \N)
N
nil
Notice that you got some documentation from doc, and you got output followed by a newline from println, but you also got nil from both. They don’t produce a meaningful value, but they both have a side-effect (more on that in another post).
By contrast, (type \N) did not return nil. Instead it returned the fully-qualified type of the value, namely java.lang.Character.
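The difference between what a function prints and what it returns can be made visible by capturing each separately. This is only a sketch: with-out-str is a standard clojure.core macro, but the names printed and returned are invented for the example.

```clojure
;; Capture the text that print writes to *out* (the side effect) ...
(def printed (with-out-str (print \N)))

;; ... and, separately, the value that the call itself returns.
;; Note that evaluating this line also prints N.
(def returned (print \N))

(println)
(println (pr-str printed))   ; the captured output, shown as a string
(println (pr-str returned))  ; the return value
(println (type \N))          ; type, by contrast, returns a real value
```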
As it turns out, Clojure has a couple of functions we can use to examine characters and their character codes:
user=> (int \N)
78
user=> (char 78)
\N
Works both ways! Or, you could use a Unicode escape sequence:
user=> (int \u004E)
78
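The same two functions also work across a whole string, because a string can be treated as a sequence of characters. The sketch below uses map, apply and format, none of which have been covered in this post, so take it as a preview under that assumption; the name codes is made up.

```clojure
;; A string can be treated as a sequence of characters, so int can be mapped over it.
(def codes (map int "Na\u00EFve"))
(println codes)

;; char turns the codes back into characters; apply str joins them into a string.
(println (apply str (map char codes)))

;; format (Java's String.format underneath) can render a code in hexadecimal.
(println (format "U+%06X" (int \N)))
```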
How about this, then?
user=> (= (int \N) (int (char 78)))
true
What is =? It is a function to test equality. What is being tested? The value of (int \N) and the value of (int (char 78)), that’s what. The result is true, because as we saw above, (int \N) is 78, and (char 78) is \N, so by substitution (int (char 78)) must be 78, no question about it. And since 78 is 78, the = function returns true.
Note that in Clojure, = is not an operator. It is a function. (There are many others like that. If you feel up to it, try (doc =).)
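Because = is an ordinary function, it takes a variable number of arguments and can itself be passed to other functions. A short sketch, using apply and map as a preview of things not yet covered here:

```clojure
;; = accepts more than two arguments.
(println (= 78 (int \N) (int \u004E)))

;; Being an ordinary function, = can be handed to other functions.
(println (apply = [78 78 78]))
(println (map = [1 2 3] [1 0 3]))

;; The same goes for arithmetic: + is a function too.
(println (+ 1 2 3))
```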
As you just saw, you can have nested lists, and if you can have nested lists, you can have nested functions, too. However, the notation is what most people find irritating and even silly about Lisp and the like: there are a lot of parentheses. (I will not repeat what some wags have said Lisp to be an acronym for.) Then again, there are both a lot of parentheses and a lot of curly braces in Java.
Can you println a string? Yep!
user=> (println "Yep!")
Yep!
nil
But can you apply int to a string?
user=> (int "Nope.")
ClassCastException java.lang.String cannot be cast to java.lang.Character clojure.lang.RT.intCast (RT.java:1087)
No, you can’t, and in trying, we also got a glimpse of the ugly side of Clojure: since it is built in Java, on top of the JVM, we also get Java-style error messages when something goes wrong, and it can be a little difficult to determine what is going on. This is a simple case: the int function is expecting a character, but we gave it a string, and it can’t handle that.
Try the char function with a string, and see what happens. (Hint: more of the same.)
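If you would rather not see the stack trace, the exception can be caught, or the problem avoided by pulling a single character out of the string first. The helper safe-int below is hypothetical, written only for this sketch, and try/catch has not been introduced in this post.

```clojure
;; int rejects a string, but accepts a single character taken from that string.
(println (int (first "Nope.")))

;; Hypothetical helper: return nil instead of throwing when the cast fails.
(defn safe-int [x]
  (try
    (int x)
    (catch ClassCastException _
      nil)))

(println (safe-int \N))
(println (safe-int "Nope."))
```

Swallowing exceptions like this is rarely a good idea in real code; here it only serves to show that the error is an ordinary Java exception.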
ACT 3: “THANK YOU MARIO! BUT OUR PRINCESS IS IN ANOTHER CASTLE!”
So, back to the original problem. Given this:
Naïve résumés... for 0 €? Not bad!
I want to see this:
00000000: U+00004E LATIN CAPITAL LETTER N
00000001: U+000061 LATIN SMALL LETTER A
00000002: U+0000EF LATIN SMALL LETTER I WITH DIAERESIS
00000004: U+000076 LATIN SMALL LETTER V
00000005: U+000065 LATIN SMALL LETTER E
00000006: U+000020 SPACE
00000007: U+000072 LATIN SMALL LETTER R
00000008: U+0000E9 LATIN SMALL LETTER E WITH ACUTE
00000010: U+000073 LATIN SMALL LETTER S
00000011: U+000075 LATIN SMALL LETTER U
00000012: U+00006D LATIN SMALL LETTER M
00000013: U+0000E9 LATIN SMALL LETTER E WITH ACUTE
00000015: U+000073 LATIN SMALL LETTER S
00000016: U+00002E FULL STOP
00000017: U+00002E FULL STOP
00000018: U+00002E FULL STOP
00000019: U+000020 SPACE
00000020: U+000066 LATIN SMALL LETTER F
00000021: U+00006F LATIN SMALL LETTER O
00000022: U+000072 LATIN SMALL LETTER R
00000023: U+000020 SPACE
00000024: U+000030 DIGIT ZERO
00000025: U+000020 SPACE
00000026: U+0020AC EURO SIGN
00000029: U+00003F QUESTION MARK
00000030: U+000020 SPACE
00000031: U+00004E LATIN CAPITAL LETTER N
00000032: U+00006F LATIN SMALL LETTER O
00000033: U+000074 LATIN SMALL LETTER T
00000034: U+000020 SPACE
00000035: U+000062 LATIN SMALL LETTER B
00000036: U+000061 LATIN SMALL LETTER A
00000037: U+000064 LATIN SMALL LETTER D
00000038: U+000021 EXCLAMATION MARK
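The offsets in this listing skip a number after ï and é, and two numbers after €. That is consistent with the offsets counting bytes in the UTF-8 encoded file, not characters, although that reading is my inference from the numbers. A rough sketch of the arithmetic, using a hypothetical helper utf8-length and assuming each character fits in a single Java char:

```clojure
;; Hypothetical helper: how many bytes one character occupies when encoded as UTF-8.
(defn utf8-length [c]
  (count (.getBytes ^String (str c) "UTF-8")))

;; Byte lengths of the first five characters, and of the euro sign.
(println (map utf8-length "Na\u00EFve"))
(println (utf8-length \u20AC))

;; A running total of the lengths gives the byte offset of each character;
;; the last number is the offset of whatever comes next.
(println (reductions + 0 (map utf8-length "Na\u00EFve")))
```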
From what we know now about Clojure, that’s still a long way away, and this post is getting a little long already, so let’s continue in a future post.
In the meantime, play around in the REPL, and when you’ve had enough for one session, it’s time for this:
Bye for now!
If you want to know more about Unicode, read Unicode Explained* by Jukka K. Korpela.
For an overview of how you can do practical stuff in Clojure, in bite-sized pieces, get Clojure Cookbook* by Luke VanderHart and Ryan Neufeld.
Next time around we’ll look at how Clojure interacts with Java to provide a crucial piece of the ucdump utility.
The comments are open, which is unusual. We’ll see how long it will stay that way (spam, spam, spam…). In the meantime, tell me what you think, or just call out the computer games I refer to in the headings.
UPDATE 2014-12-14: After nearly a month, I’m closing the comments on all parts of this series, because nothing but SPAM appeared.
UPDATE: Part 2: Definitions is now also available.
* = O’Reilly web store link