Quick tutorial on using the unicodedata module?

Wed Jan 1 02:00:37 EST 2003

Skip Montanaro <skip at pobox.com> writes:

> The unicodedata module docs come with no examples, so I'm left bumping
> around more-or-less in the dark.  I just encountered a web page with no
> encoding information which contains an octal 205 byte.  It seems to display
> as an ellipsis, and my heuristic decoder function expresses it as u'\x85'
> and says the encoding is utf-8.

U+0085 is a control character, with the usual name NEXT LINE (NEL). I
say "usual name", as it appears that the control characters don't have
any official consortium-assigned names - at least the Unicode database
lists them only as "control character". So the Python unicodedata
module does not have names for the control characters, either. This
may need fixing, but I'm uncertain of what the correct fix would be.
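
To see this from the interpreter, here is a minimal sketch, assuming a
current Python 3 release (where every str is a Unicode string). The
helper name describe is made up for illustration; category() and name()
are the module's own functions.

```python
import unicodedata

def describe(ch):
    """Return (code point, general category, name or placeholder) for one character."""
    # unicodedata.name() raises ValueError for a character without a name,
    # unless a default is passed as the second argument.
    return ("U+%04X" % ord(ch),
            unicodedata.category(ch),
            unicodedata.name(ch, "<no name>"))

print(describe("\x85"))    # the control character discussed above
print(describe("\u2026"))  # HORIZONTAL ELLIPSIS, for comparison
```

The general category "Cc" returned for U+0085 is the database's way of
saying "control character"; the placeholder shows that no name is
recorded for it.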

As for this specific example: If you have guessed correctly, and the
bytes you got really are U+0085, then the character is a line-breaking
character, not a graphical one. NEL is used as a line break (instead
of CR or LF) on big iron machines, so if the data you got are likely
to originate from an OS/390 system, your interpretation might be
correct.
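
The line-breaking nature of NEL can be checked with a short sketch,
again assuming Python 3. The variable names are illustrative only.

```python
import unicodedata

NEL = "\x85"

# str.splitlines() treats NEL as a line boundary, alongside CR and LF.
lines = ("first" + NEL + "second").splitlines()
print(lines)

# Bidirectional class "B" is the database's paragraph-separator class.
print(unicodedata.bidirectional(NEL))
```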

If the browser displays it as an ellipsis, the browser has guessed
that it is windows-1252, which has the assignment

<U2026>     /x85         HORIZONTAL ELLIPSIS

Browsers always assume that bytes in the range 0x80..0x9f indicate a
Windows encoding, because the control characters in that range are
rarely used, and because Windows users don't know how to properly
declare character sets :-)

'\x85' is the horizontal ellipsis in all of the windows encodings, so
that it is windows-1252 (and not, say, windows-1251) is a guess,
again.
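
The competing interpretations of the single byte can be compared with
the following sketch, assuming Python 3 and its standard codecs. The
helper decode_byte and the list of candidate encodings are chosen for
illustration and are not a complete detection method.

```python
def decode_byte(raw, encodings):
    """Map each encoding name to the decoded text, or None if decoding fails."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results

results = decode_byte(b"\x85", ["cp1251", "cp1252", "latin-1", "utf-8"])
for enc, text in results.items():
    print(enc, None if text is None else "U+%04X" % ord(text))
```

Both Windows code pages give U+2026, while latin-1 gives U+0085. A lone
0x85 byte is not valid UTF-8 at all (U+0085 would be encoded as the two
bytes C2 85), so a decoder that reports utf-8 for it deserves some
suspicion.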

If you wanted to use the unicodedata module to find a character by
guessing its name, I suggest you use an entirely different tool:
downloading the Unicode database, and searching for a character with
grep(1) is much more appropriate.
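
For those who would rather stay inside Python, a rough equivalent of
the grep approach is sketched below, assuming Python 3. The function
find_by_name is a hypothetical helper, not part of the module. It only
sees the names known to the interpreter's bundled copy of the database,
so it will not find the unnamed control characters either.

```python
import sys
import unicodedata

def find_by_name(fragment):
    """Return (code point, name) pairs whose name contains the given fragment."""
    fragment = fragment.upper()
    matches = []
    for cp in range(sys.maxunicode + 1):
        # An empty default avoids ValueError for unnamed code points.
        name = unicodedata.name(chr(cp), "")
        if fragment in name:
            matches.append((cp, name))
    return matches

for cp, name in find_by_name("ellipsis"):
    print("U+%04X %s" % (cp, name))

# When the exact name is already known, lookup() goes the other way.
print(repr(unicodedata.lookup("HORIZONTAL ELLIPSIS")))
```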

As for a tutorial on the unicodedata module: people normally would not
use it directly, but through some higher-layer functions (e.g. the
.upper method of a unicode object). Knowledge of the Unicode database
is required to effectively use the module; people who do know how this
database is structured should have no problems using the module. See

http://www.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html
http://www.unicode.org/ucd/

Regards,
Martin