Functional Programming Without Feeling Stupid, Part 4: Logic

In the previous parts of Functional programming without feeling stupid we have slowly been building ucdump, a utility program for listing the Unicode codepoints and character names of characters in a string. In actual use, the string will be read from a UTF-8 encoded text file.

We don’t know yet how to read a text file in Clojure (well, you may know, but I only have a foggy idea), so we have been working with a single string. This is what we have so far:

    (def test-str
      "Na\u00EFve r\u00E9sum\u00E9s... for 0 \u20AC? Not bad!")
    (def test-ch { :offset 0 :character \u20ac })
    (def short-test-str "Na\u00EFve")

    (defn character-name [x]
      (java.lang.Character/getName (int x)))

    (defn character-line [pair]
      (let [ch (:character pair)]
        (format "%08d: U+%06X %s"
          (:offset pair) (int ch)
          (character-name ch))))

    (defn character-lines [s]
      (let [offsets (repeat (count s) 0)
            pairs (map #(into {} {:offset %1 :character %2})
              offsets s)]
        (map character-line pairs)))

I’ve reformatted the code a bit to keep the lines short. You can copy and paste all of that in the Clojure REPL, and start looking at some strings in a new way:

    user=> (character-lines "résumé")
    ("00000000: U+000072 LATIN SMALL LETTER R"
    "00000000: U+0000E9 LATIN SMALL LETTER E WITH ACUTE"
    "00000000: U+000073 LATIN SMALL LETTER S"
    "00000000: U+000075 LATIN SMALL LETTER U"
    "00000000: U+00006D LATIN SMALL LETTER M"
    "00000000: U+0000E9 LATIN SMALL LETTER E WITH ACUTE")

But we are still missing the actual offsets. Let’s fix that now.

What should the offsets actually even be? Since the idea is to read a UTF-8 encoded text file, the offsets should reflect the position of each character in the file.

UTF-8 is a variable-length encoding. Some characters, like all the ASCII characters, can be encoded with one byte. Others, like those in the Latin-1 supplement, Greek characters, Cyrillic characters, and many more, will require two bytes. As the codepoint increases in value, the more bytes are needed, up to four bytes in total.

What we need is a way to determine how many bytes a character needs when it is encoded in UTF-8. We don’t actually need to encode the character, although that is not as hard. All we need to do is look at the codepoint value and determine the number of bytes from it.

The UTF-8 encoding rules state the following about the characters and bytes:

From	To	Bytes
U+000000	U+00007F	1
U+000080	U+0007FF	2
U+000800	U+00FFFF	3
U+010000	U+10FFFF	4

This means about the same as these statements:

If the codepoint of the character is between 0 and 0x7F inclusive, the byte count is 1.
If the codepoint is between 0x80 and 0x7FF inclusive, the byte count is 2.
If the codepoint is between 0x800 and 0xFFFF inclusive, the byte count is 3.
If the codepoint is between 0x10000 and 0x10FFFF inclusive, the byte count is 4.

How to express this in Clojure? Let’s see. The language has functions for comparing values, and we need two of those functions: >= and <=. Hexadecimal number literals are the obvious choice in this context, and in Clojure they are the same as in Java. So how about if we define a test code point called cp and widen all the hexadecimal literals to six digits for neatness' sake:

    user=> (def cp 0x0020AC)
    #'user/cp
    user=> (>= cp 0x000000)
    true
    user=> (<= cp 0x00007F)
    false

Seems solid. However, we wanted to know whether the codepoint was inside a certain inclusive range. For that we need the logical and function, called and:

    user=> (and (>= cp 0x000080) (<= cp 0x0007FF))
    false
    user=> (and (> cp 0x000800) (<= cp 0x00FFFF))
    true

All good, since 0x20AC definitely falls between 0x800 and 0xFFFF.

Now, if we have a codepoint, and it can fall into one of four ranges, then how can we decide which range, and how do we return the right byte count? For that, we need the cond macro. It takes a set of test and expression pairs. The test is evaluated, and if it returns logical true, the expression paired with the test is evaluated, and its value is returned.

What's "logical true" in Clojure? Everything else except nil and false, that's what. Those two are "logical false". To better understand the concepts of logical true and logical false it pays to take a detour via if, which is a special form. There are a couple of them in Lisp-style languages, and actually def, defn and let are also special forms.

    user=> (if (<= cp 0x7f) "one byte" "more than one byte")
    "more than one byte"

The if special form has two or three expressions: test, then and else (which is optional). If test is logical true, the then expression is evaluated. If it is logical false, the else expression is evaluated (otherwise nil is returned). This particular if returns `"more than one byte"` because (<= cp 0x7f) is logical false, at least if cp is still 0x20AC when you run this.

The cond macro is like a multi-way if. I wrote about the if special form, and told you that it has test/expression pairs, but it can also have an :else part. All of this is lovingly wrapped in beautiful parentheses, like this:

    (cond
      (and (>= cp 0x000000) (<= cp 0x00007F)) 1
      (and (>= cp 0x000080) (<= cp 0x0007FF)) 2
      (and (>= cp 0x000800) (<= cp 0x00FFFF)) 3
      (and (>= cp 0x010000) (<= cp 0x10ffff)) 4
      :else 0)

This cond has four test/expression pairs and an :else part. The tests are exactly those we saw earlier when we first combined the and function and the comparison functions.

It is easier to play with this cond if it is promoted to a function:

    (defn octet-count [cp]
      (cond
        (and (>= cp 0x000000) (<= cp 0x00007F)) 1
        (and (>= cp 0x000080) (<= cp 0x0007FF)) 2
        (and (>= cp 0x000800) (<= cp 0x00FFFF)) 3
        (and (>= cp 0x010000) (<= cp 0x10FFFF)) 4
        :else 0))

The function is called octet-count because in the UTF-8 standard the 8-bit units are called octets.

Copy and paste the definition of the octet-count function into the REPL. Then try it out with a few values:

    user=> (octet-count 0x44)
    1
    user=> (octet-count 0x444)
    2
    user=> (octet-count 0x4444)
    3
    user=> (octet-count 0x104444)
    4

Looks about right!

Now, given a string, how would you get the octet counts for all the characters in the string? Two nested map functions, maybe?

    user=> test-str
    "Naïve résumés... for 0 €? Not bad!"
    user=> (map #(format "0x%06X" (int %1)) test-str)
    ("0x00004E" "0x000061" "0x0000EF" "0x000076" "0x000065"
    "0x000020" "0x000072" "0x0000E9" "0x000073" "0x000075"
    "0x00006D" "0x0000E9" "0x000073" "0x00002E" "0x00002E"
    "0x00002E" "0x000020" "0x000066" "0x00006F" "0x000072"
    "0x000020" "0x000030" "0x000020" "0x0020AC" "0x00003F"
    "0x000020" "0x00004E" "0x00006F" "0x000074" "0x000020"
    "0x000062" "0x000061" "0x000064" "0x000021")
    user=> (map octet-count (map int test-str))
    (1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 3 1 1
    1 1 1 1 1 1 1 1)

If you look at test-str and compare it with the codepoints, you can see that most of them are one-byters, but the ï and é characters take two bytes, and the euro symbol must be the sole three-byter.

Let's promote that last map thingy into a function for easy use:

    (defn octet-counts [s]
      (map octet-count (map int s)))

Are we done yet? Dig out the character-lines function again:

    (defn character-lines [s]
      (let [offsets (repeat (count s) 0)
            pairs (map #(into {} {:offset %1 :character %2})
              offsets s)]
        (map character-line pairs)))

There they are still, those dummy zero offsets. If I just replace them with the octet counts, I'm surely done:

    (defn character-lines [s]
      (let [offsets (octet-counts s)
            pairs (map #(into {} {:offset %1 :character %2})
              offsets s)]
        (map character-line pairs)))

Copy and paste into the REPL, then test:

    user=> (character-lines "Naïve")
    ("00000001: U+00004E LATIN CAPITAL LETTER N"
    "00000001: U+000061 LATIN SMALL LETTER A"
    "00000002: U+0000EF LATIN SMALL LETTER I WITH DIAERESIS"
    "00000001: U+000076 LATIN SMALL LETTER V"
    "00000001: U+000065 LATIN SMALL LETTER E")

Hey, what? All the offsets are ones, with one two. These are not the droids we are looking for. They are the octet counts of the characters, not the offsets.

This is not the time to give up. I've heard about something interesting called a "reduce" in the context of functional programming. For example, you can sum up a bunch of numbers by reducing them into one number. You accumulate as you go, and the result is your... well, result. Like this summation:

    user=> (reduce + 0 [1 2 3 4 5])
    15

Magic! Well, not quite; only 0 + 1 + 2 + 3 + 4 + 5. It's a well-defined and often used paradigm, which can also be used with other functions, like multiplication:

    user=> (reduce * 1 [1 2 3 4 5])
    120

In this case the result will be 1 * 1 * 2 * 3 * 4 * 5.

By the way, this paradigm is called a fold in most other functional programming languages, like Haskell. There are also left folds and right folds. Think for a minute why the reduce operation (or a fold) with the + operator starts accumulating from zero, but the reduce operation with the * operator starts accumulating from one.

But this still won't do! We are not looking for a single number, but a collection of cumulative numbers. You could spend hours writing a function which assembles it for you, but as luck would have it, Clojure already has one available. It's called reductions.

    user=> (reductions + [1 2 3 4 5])
    (1 3 6 10 15)

All right! That would be the sequence 1, 1+2, 1+2+3, 1+2+3+4 and 1+2+3+4+5, and you can do the math to convince yourself. Let's try the same trick with the actual octet counts:

    user=> (reductions + (octet-counts test-str))
    (1 2 4 5 6 7 8 10 11 12 13 15 16 17 18 19 20 21 22 23 24 25
    26 29 30 31 32 33 34 35 36 37 38 39)

Well, I'll buy that -- just look at what offsets come after 2, 8, 13 and 26. The first character is encoded with one byte, the second needs two bytes, the third again one byte, and so on. The 26 is the three-byte euro symbol. The byte count of each character is added to the previously accumulated count.

UPDATE 2021-09-19: There is still the problem of connecting the right offsets to the right characters, and that's what tripped me originally. I'll write more about this in another blog post, but right now let's just try to get it right.

The offset of the first character in the file is zero, so let's call it offset. The offset of the second character is offset + c, where c is the octet count of that character. If we just naïvely combine the reductions and the characters, we end up with the wrong offsets! That's what happened to me the first time around.

The correct way to do this is to insert the initial offset zero to the head of the list of the redcutions, and then discard the last element of the list, because there won't be a character for that offset. In Clojure it's easy to construct a new list using cons:

    user=> (cons 0 (reductions + (octet-counts test-str)))
    (0 1 2 4 5 6 7 8 10 11 12 13 15 16 17 18 19 20 21 22 23 24
    25 26 29 30 31 32 33 34 35 36 37 38 39)

Then we take all but the last element of the list, using the handy and appropriately named butlast function in Clojure:

    user=> (butlast (cons 0 (reductions + (octet-counts test-str))))
    (0 1 2 4 5 6 7 8 10 11 12 13 15 16 17 18 19 20 21 22 23 24
    25 26 29 30 31 32 33 34 35 36 37 38)

So, now we have inserted the first offset of zero at the head of the list and discarded the last one. (In the original edition I mistakenly just decremented each offset by one, which was totally wrong, and I honestly don't know where that came from.)

This is the moment we've all been waiting for; everything will now fall into place. Just a few more edits to the character-lines function... We'll drop in the reductions code for getting the offsets, insert the zero offset and discard the last one, with the code shown above:

    (defn character-lines [s]
      (let [offsets (butlast (cons 0 (reductions + (octet-counts s))))
            pairs (map #(into {} {:offset %1 :character %2})
              offsets s)]
        (map character-line pairs)))

See if your hands are not trembling a bit with excitement as you copy and paste that into the REPL... and test:

    user=> (character-lines test-str)
    ("00000000: U+00004E LATIN CAPITAL LETTER N"
    "00000001: U+000061 LATIN SMALL LETTER A"
    "00000002: U+0000EF LATIN SMALL LETTER I WITH DIAERESIS"
    "00000004: U+000076 LATIN SMALL LETTER V"
    "00000005: U+000065 LATIN SMALL LETTER E"
    "00000006: U+000020 SPACE"
    "00000007: U+000072 LATIN SMALL LETTER R"
    "00000008: U+0000E9 LATIN SMALL LETTER E WITH ACUTE"
    "00000010: U+000073 LATIN SMALL LETTER S"
    "00000011: U+000075 LATIN SMALL LETTER U"
    "00000012: U+00006D LATIN SMALL LETTER M"
    "00000013: U+0000E9 LATIN SMALL LETTER E WITH ACUTE"
    "00000015: U+000073 LATIN SMALL LETTER S"
    "00000016: U+00002E FULL STOP"
    "00000017: U+00002E FULL STOP"
    "00000018: U+00002E FULL STOP"
    "00000019: U+000020 SPACE"
    "00000020: U+000066 LATIN SMALL LETTER F"
    "00000021: U+00006F LATIN SMALL LETTER O"
    "00000022: U+000072 LATIN SMALL LETTER R"
    "00000023: U+000020 SPACE"
    "00000024: U+000030 DIGIT ZERO"
    "00000025: U+000020 SPACE"
    "00000026: U+0020AC EURO SIGN"
    "00000029: U+00003F QUESTION MARK"
    "00000030: U+000020 SPACE"
    "00000031: U+00004E LATIN CAPITAL LETTER N"
    "00000032: U+00006F LATIN SMALL LETTER O"
    "00000033: U+000074 LATIN SMALL LETTER T"
    "00000034: U+000020 SPACE"
    "00000035: U+000062 LATIN SMALL LETTER B"
    "00000036: U+000061 LATIN SMALL LETTER A"
    "00000037: U+000064 LATIN SMALL LETTER D"
    "00000038: U+000021 EXCLAMATION MARK")

Yes! Offsets and codepoints and character names, oh my! This is too much for one post, I need to have a lie down. When I have recovered, we'll print those lines properly and package this baby up.

In the meantime, you might be interested to read A Year of Functional Programming by David Barri.

UPDATE 2014-12-14: After nearly a month, I'm closing the comments on all parts of this series, because nothing but SPAM appeared.

UPDATE 2021-09-19: Corrected an embarrassing logic error in combining offsets and characters.