Unicode character dump in Python
Sometimes you just need to see what characters are lurking inside a Unicode encoded text file.
Your garden-variety dump utility (like the venerable od on UNIX systems, or whatever the standard Windows hex dump might be, though I don't think there is one) only shows you the plain bytes,
so you have to head over to the Unicode Consortium website
to find out what they mean. But first you need to decode UTF-8 to get the actual code points,
or grok UTF-16 LE or BE, and so on. It's fun, but it's not for everyone.
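For instance, looking up a single character by hand in the Python interpreter goes roughly like this (the three bytes below happen to be the UTF-8 encoding of the euro sign; this is just an illustration of the manual process, not part of ucdump):

>>> import unicodedata
>>> raw = b'\xe2\x82\xac'        # bytes as read from a UTF-8 encoded file
>>> ch = raw.decode('utf-8')     # decode to recover the actual code point
>>> hex(ord(ch)), unicodedata.name(ch)
('0x20ac', 'EURO SIGN')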
The ucdump utility shows you a nice list of character names, together with their offsets in the file.
offsets in the file. Currently it only handles UTF-8,
so the offset is calculated based on the UTF-8 length of the character.
Here is an example of the ucdump display:
$ python3 ucdump.py testfile2
00000000: U+000030 DIGIT ZERO
00000001: U+000020 SPACE
00000002: U+0020AC EURO SIGN
00000005: U+00003A COLON
00000006: U+000020 SPACE
00000007: U+00006E LATIN SMALL LETTER N
00000008: U+00006F LATIN SMALL LETTER O
00000009: U+000074 LATIN SMALL LETTER T
0000000A: U+000020 SPACE
0000000B: U+000062 LATIN SMALL LETTER B
0000000C: U+000061 LATIN SMALL LETTER A
0000000D: U+000064 LATIN SMALL LETTER D
0000000E: U+000021 EXCLAMATION MARK
0000000F: U+000020 SPACE
00000010: U+00000A (unnamed character)
That also serves as a usage example. As you can see, ucdump is a Python script; you need Python 3.0 or later to use it.
The file in the example had this content:
0 €: not bad!
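Note how the offset jumps from 00000002 to 00000005 in the listing above: the euro sign takes three bytes in UTF-8, and ucdump counts offsets in bytes, not in characters. A quick check in the Python interpreter confirms the length:

>>> '€'.encode('utf-8')
b'\xe2\x82\xac'
>>> len('€'.encode('utf-8'))
3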
Here is the complete source code for ucdump:
import sys
import codecs
import unicodedata

# Read and decode the whole input file as UTF-8.
input_file = codecs.open(sys.argv[1], 'rb', 'utf-8')
data = input_file.read()
input_file.close()

offset = 0
for ch in data:
    # Re-encode the character to find out how many bytes it occupies.
    utf8_ch = ch.encode('utf-8')
    print('%08X: U+%06X %s' % (offset, ord(ch),
                               unicodedata.name(ch, '(unnamed character)')))
    offset = offset + len(utf8_ch)
Currently ucdump only handles UTF-8, and does not know about surrogate characters,
because Python didn't support them back in 2006 (it might now). Here is a list of improvement
ideas that quickly come to mind:
- Show names of control characters (see the sketch after this list).
- Support other Unicode encodings besides UTF-8.
- Support surrogate characters too.
- Better error handling.
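As a rough illustration of the first idea, the name lookup could fall back on the character's general category when unicodedata.name() has nothing to offer. The fallback string below is my own invention, not anything standard:

import unicodedata

def char_name(ch):
    # unicodedata.name() returns the default for characters that have
    # no formal name, such as control characters.
    name = unicodedata.name(ch, '')
    if name:
        return name
    # Fall back on the general category, e.g. 'Cc' for control characters.
    return '(unnamed character, category %s)' % unicodedata.category(ch)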
More on the subject matter:
- The Python unicodedata module
- Evan Jones, How to Use UTF-8 with Python
Also, while looking for something else entirely, I discovered John Walker's unum utility. It is a handy Unicode and HTML entity lookup tool, highly recommended.
ucdump is available on GitHub under the MIT License. It is provided with no warranty for any purpose whatsoever. Share and enjoy. Unicode is a trademark of Unicode, Inc.
(This article was originally published in a slightly different format on the author's personal website in 2006.)
UPDATE 2013-02-21: udump just saved my day.
The following code was lifted from a PDF e-book:
if (sqlite3_prepare_v2(database, sqlStatement, −1, &compiledStatement, NULL) == SQLITE_OK) {
The clang compiler reports:
Parse issue: Expected expression.
After much head-scratching, the -1 part seemed to be the only problem left.
Copy and paste it into a file and run udump on it:
% python3 ucdump.py minusone.txt
00000000: U+002212 MINUS SIGN
00000003: U+000031 DIGIT ONE
00000004: U+00000A (unnamed character)
Cue minor epiphany.
Explanation: the publishing system for the e-book (or some other step in the production pipeline) has transformed the ordinary minus sign expected by the C compiler into a character good for publishing but confusing for the compiler. With a monospaced font in a programming editor you couldn't really tell them apart by visual inspection.
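If you run into this kind of thing often, a small filter along these lines (a rough sketch of my own, not part of ucdump) can flag every non-ASCII character in a pasted snippet before the compiler ever sees it:

import sys
import unicodedata

# Report every non-ASCII character in the file given on the command line,
# together with its line number and Unicode name.
with open(sys.argv[1], encoding='utf-8') as f:
    for lineno, line in enumerate(f, start=1):
        for ch in line:
            if ord(ch) > 0x7F:
                print('line %d: U+%06X %s' % (
                    lineno, ord(ch),
                    unicodedata.name(ch, '(unnamed character)')))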
UPDATE 2014-11-17: udump is now called ucdump, and can be found on GitHub.