In the previous parts of “Functional programming without feeling stupid” we have slowly been building ucdump, a utility program for listing the Unicode codepoints and character names of characters in a string. In actual use, the string will be read from a UTF-8 encoded text file.
We don’t know yet how to read a text file in Clojure (well, you may know, but I only have a foggy idea), so we have been working with a single string. This is what we have so far:
(def test-str "Na\u00EFve r\u00E9sum\u00E9s... for 0 \u20AC? Not bad!")
(def test-ch { :offset 0 :character \u20ac })
(def short-test-str "Na\u00EFve")

(defn character-name [x]
  (java.lang.Character/getName (int x)))

(defn character-line [pair]
  (let [ch (:character pair)]
    (format "%08d: U+%06X %s"
            (:offset pair)
            (int ch)
            (character-name ch))))

(defn character-lines [s]
  (let [offsets (repeat (count s) 0)
        pairs (map #(into {} {:offset %1 :character %2}) offsets s)]
    (map character-line pairs)))
I’ve reformatted the code a bit to keep the lines short. You can copy and paste all of that in the Clojure REPL, and start looking at some strings in a new way:
user=> (character-lines "résumé")
("00000000: U+000072 LATIN SMALL LETTER R"
 "00000000: U+0000E9 LATIN SMALL LETTER E WITH ACUTE"
 "00000000: U+000073 LATIN SMALL LETTER S"
 "00000000: U+000075 LATIN SMALL LETTER U"
 "00000000: U+00006D LATIN SMALL LETTER M"
 "00000000: U+0000E9 LATIN SMALL LETTER E WITH ACUTE")
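Every line shows offset zero because character-lines fills the offsets with repeat. Here is a small sketch of the intermediate values, in case it helps to see them on their own. The names s, offsets and pairs at the top level are just for this illustration, and I've used a plain map literal instead of the into call, which should give the same result:

```clojure
;; A short string to experiment with: N, a, i-with-diaeresis, v, e.
(def s "Na\u00EFve")

;; repeat gives the same value n times, so every offset is 0.
(def offsets (repeat (count s) 0))
;; offsets => (0 0 0 0 0)

;; map with two collections walks them in step,
;; pairing the items up by position.
(def pairs
  (map (fn [offset ch] {:offset offset :character ch})
       offsets
       s))

(first pairs)
;; => {:offset 0, :character \N}
```

So the pairing itself works; it is only the sequence of offsets that is a placeholder.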
But we are still missing the actual offsets. Let’s fix that now.