Unicode character dump in Python
Sometimes you just need to see what characters are lurking inside a Unicode encoded text file.
Your garden-variety dump utility (like the venerable od on UNIX systems, or whatever the standard Windows hex dump might be, though I don't think there is one) only shows you the plain bytes,
so you have to head over to the Unicode Consortium website
to find out what they mean. But first you need to decode UTF-8 to get the actual code points,
or grok UTF-16 LE or BE, and so on. It's fun, but it's not for everyone.
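For instance, looking up a single character by hand in the Python interpreter goes roughly like this (the three bytes below happen to be the UTF-8 encoding of the euro sign; this is just an illustration of the manual process, not part of ucdump):

>>> import unicodedata
>>> raw = b'\xe2\x82\xac'        # bytes as read from a UTF-8 encoded file
>>> ch = raw.decode('utf-8')     # decode to recover the actual code point
>>> hex(ord(ch)), unicodedata.name(ch)
('0x20ac', 'EURO SIGN')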
The ucdump utility shows you a nice list of character names, together with their offsets in the file.
offsets in the file. Currently it only handles UTF-8,
so the offset is calculated based on the UTF-8 length of the character.
Here is an example of the ucdump display:
$ python3 ucdump.py testfile2
00000000: U+000030 DIGIT ZERO
00000001: U+000020 SPACE
00000002: U+0020AC EURO SIGN
00000005: U+00003A COLON
00000006: U+000020 SPACE
00000007: U+00006E LATIN SMALL LETTER N
00000008: U+00006F LATIN SMALL LETTER O
00000009: U+000074 LATIN SMALL LETTER T
0000000A: U+000020 SPACE
0000000B: U+000062 LATIN SMALL LETTER B
0000000C: U+000061 LATIN SMALL LETTER A
0000000D: U+000064 LATIN SMALL LETTER D
0000000E: U+000021 EXCLAMATION MARK
0000000F: U+000020 SPACE
00000010: U+00000A (unnamed character)
That also serves as a usage example. As you can see, ucdump is a Python script; you need Python 3.0 or later to use it.
The file in the example had this content:
0 €: not bad!
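Note how the offset jumps from 00000002 to 00000005 in the listing above: the euro sign takes three bytes in UTF-8, and ucdump counts offsets in bytes, not in characters. A quick check in the Python interpreter confirms the length:

>>> '€'.encode('utf-8')
b'\xe2\x82\xac'
>>> len('€'.encode('utf-8'))
3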
Here is the complete source code for ucdump:
import sys
import codecs
import unicodedata

# Read and decode the whole input file as UTF-8.
input_file = codecs.open(sys.argv[1], 'rb', 'utf-8')
data = input_file.read()
input_file.close()

offset = 0
for ch in data:
    # Re-encode the character to find out how many bytes it occupies.
    utf8_ch = ch.encode('utf-8')
    print('%08X: U+%06X %s' % (offset, ord(ch),
                               unicodedata.name(ch, '(unnamed character)')))
    offset = offset + len(utf8_ch)
Currently ucdump only handles UTF-8, and does not know about surrogate characters,
because Python didn't support them back in 2006 (it might now). Here is a list of improvement
ideas that quickly come to mind:
- Show names of control characters (see the sketch after this list).
- Support other Unicode encodings besides UTF-8.
- Support surrogate characters too.
- Better error handling.
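As a rough illustration of the first idea, the name lookup could fall back on the character's general category when unicodedata.name() has nothing to offer. The fallback string below is my own invention, not anything standard:

import unicodedata

def char_name(ch):
    # unicodedata.name() returns the default for characters that have
    # no formal name, such as control characters.
    name = unicodedata.name(ch, '')
    if name:
        return name
    # Fall back on the general category, e.g. 'Cc' for control characters.
    return '(unnamed character, category %s)' % unicodedata.category(ch)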
More on the subject matter:
- The Python unicodedata module
- Evan Jones, How to Use UTF-8 with Python
Also, while looking for something else entirely, I discovered John Walker's unum utility. It is a handy Unicode and HTML entity lookup tool, highly recommended.
ucdump is available on GitHub under the MIT License. It is provided with no warranty for any purpose whatsoever. Share and enjoy. Unicode is a trademark of Unicode, Inc.
(This article was originally published in a slightly different format on the author's personal website in 2006.)
UPDATE 2013-02-21: udump just saved my day.
The following code was lifted from a PDF e-book:
if (sqlite3_prepare_v2(database, sqlStatement, −1, &compiledStatement, NULL) == SQLITE_OK) {
The clang compiler reports:
Parse issue: Expected expression.
After much head-scratching, the -1 part seemed to be the only problem left.
Copy and paste it into a file and run udump on it:
% python3 ucdump.py minusone.txt
00000000: U+002212 MINUS SIGN
00000003: U+000031 DIGIT ONE
00000004: U+00000A (unnamed character)
Cue minor epiphany.
Explanation: the publishing system for the e-book (or some other step in the production pipeline) has transformed the ordinary minus sign expected by the C compiler into a character good for publishing but confusing for the compiler. With a monospaced font in a programming editor you couldn't really tell them apart by visual inspection.
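If you run into this kind of thing often, a small filter along these lines (a rough sketch of my own, not part of ucdump) can flag every non-ASCII character in a pasted snippet before the compiler ever sees it:

import sys
import unicodedata

# Report every non-ASCII character in the file given on the command line,
# together with its line number and Unicode name.
with open(sys.argv[1], encoding='utf-8') as f:
    for lineno, line in enumerate(f, start=1):
        for ch in line:
            if ord(ch) > 0x7F:
                print('line %d: U+%06X %s' % (
                    lineno, ord(ch),
                    unicodedata.name(ch, '(unnamed character)')))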
UPDATE 2014-11-17: udump is now called ucdump, and can be found on GitHub.