textwrap and combining diacritical marks

Berteun Damman berteun at NO_SPAMdds.nl
Thu Jun 28 14:52:45 EDT 2007


On Thu, 28 Jun 2007 09:19:20 +0000 (UTC), Berteun Damman
<berteun at NO_SPAMdds.nl> wrote:
> And that leads to another question, does Python have a function akin to
> wcwidth() which gives the number of column positions a unicode character
> needs?

After playing around a bit with unicodedata.normalize, and seeing that
it fails when there is no precomposed form, I decided to take
Markus Kuhn's implementation [1] and make a Python version [2].
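
To illustrate why normalization alone is not enough, here is a minimal
sketch (written for Python 3; the sample strings are only illustrative
choices). NFC composition merges 'e' plus a combining acute accent into a
single code point, but leaves 'q' plus the same accent as two code points,
since no precomposed form exists for that pair:

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT has a precomposed form.
composed = unicodedata.normalize("NFC", "e\u0301")

# 'q' followed by the same accent has no precomposed form, so NFC leaves
# the sequence as two code points and len() still overcounts columns.
uncomposed = unicodedata.normalize("NFC", "q\u0301")

print(len(composed), len(uncomposed))
```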

This tries to guess the column width of a character. Non-printable
characters report a width of -1 (this includes '\n' and '\t', for
example), except for \0, which has width 0. Combining characters
report 0, normal Latin characters 1, and full-width forms, for example,
2.
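
The rules above can be approximated with the standard unicodedata module.
The following is only a rough sketch for Python 3, not the code in [2]:
the names guess_width and guess_string_width are hypothetical, and it
relies on Unicode categories and East Asian width properties instead of
the lookup tables in Markus Kuhn's implementation, so results may differ
for some characters.

```python
import unicodedata


def guess_width(ch):
    """Guess the column width of one character (hypothetical helper)."""
    code = ord(ch)
    if code == 0:
        return 0  # \0 takes no columns
    if code < 32 or 0x7F <= code < 0xA0:
        return -1  # C0/C1 control characters, including '\n' and '\t'
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0  # non-spacing and enclosing combining marks
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and full-width forms
    return 1


def guess_string_width(text):
    """Sum the guessed widths; return -1 if any character is non-printable."""
    total = 0
    for ch in text:
        width = guess_width(ch)
        if width < 0:
            return -1
        total += width
    return total
```

With this sketch, a string such as 'q\u0301' is counted as one column
even though it cannot be normalized to a single code point.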

Of course, the real output depends on the capabilities of the display
device. xterm can handle combining characters, whereas OS X's
Terminal.app cannot do so for Greek or Russian characters, for
example.

All in all, I think it is a reasonable start. There is one issue, though,
involving Plane 1 characters. On 64-bit systems, so it seems, these
are stored as one character; on 32-bit systems, as a surrogate pair. I
don't know exactly how this works, but the code will in effect ignore
Plane 1 characters on 32-bit systems (i.e. always report display width
1, even though they're combining or full-width).
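
For reference, a surrogate pair is how UTF-16 represents a code point
above U+FFFF as two 16-bit units. The sketch below (Python 3, with
hypothetical helper names) shows the arithmetic, assuming a width
function that receives the two halves separately would first recombine
them into one code point before looking up its width:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E MUSICAL SYMBOL G CLEF lives in Plane 1.
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))
```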

Berteun

[1] http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
[2] http://berteun.nl/tmp/wcwidth.py
