textwrap and combining diacritical marks

Berteun Damman berteun at NO_SPAMdds.nl
Thu Jun 28 05:19:20 EDT 2007


Hello,

When using the textwrap module, the wrap will always use len() to
determine the length of the string being wrapped. This might be a
sensible thing to do in many circumstances, but I think there are
circumstances where this does not lead to the desired result.

I assume this module is mostly used where text is formatted to be
presented to a user, e.g. in a console application. The number of
characters in the string, as determined by len(), might not be the
number of columns occupied. Some of the characters might be combining
diacritical marks, which go on top of the previous character, e.g. the
string de'ge'ne're' (where the ' indicate combining acute accents) has a
len() of 12 but will only display with a width of 8 columns.
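
A minimal sketch of the mismatch, assuming Python 3 str objects and
that the terminal renders each combining mark on top of the preceding
character. The helper name display_columns is made up for this
illustration and is only a rough estimate of the column count.

```python
import unicodedata

# "degenere" with a combining acute accent (U+0301) after each "e"
s = "de\u0301ge\u0301ne\u0301re\u0301"

def display_columns(text):
    """Count code points that are not combining marks (rough estimate)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

print(len(s))              # 12 code points
print(display_columns(s))  # 8 columns
```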

The string might also include character sequences that switch the
console to bold or underline mode, which have zero display width. If
this happens a lot, the resulting text might look very badly formatted
because of all these zero-width sequences.
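
As an illustration, assuming the console understands ANSI SGR escape
sequences (ESC[1m for bold, ESC[4m for underline, ESC[0m to reset), the
visible length could be estimated by stripping those sequences before
measuring. The names SGR_RE and visible_len are invented for this
sketch, and other kinds of escape sequences are not handled.

```python
import re

# Matches SGR sequences such as ESC[1m, ESC[4m, ESC[0m or ESC[1;4m
SGR_RE = re.compile(r"\x1b\[[0-9;]*m")

def visible_len(text):
    """Length of the text once SGR escape sequences are removed."""
    return len(SGR_RE.sub("", text))

s = "\x1b[1mbold\x1b[0m text"
print(len(s))          # 17, counting both escape sequences
print(visible_len(s))  # 9, the length of "bold text"
```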

It is of course impossible to handle all these scenarios in which some
characters might influence the width of the displayed string, but
wouldn't it be convenient to have a 'chunk_width' method or something
which can be overridden in a derived class, so that a user might give a
custom implementation? The default of this chunk_width might just be
'len()'.
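
To make the idea concrete, here is a rough sketch of what such a hook
could look like. It is not textwrap's actual API: WidthAwareWrapper is a
hypothetical, much simplified greedy wrapper written only for this
example, and chunk_width is the proposed overridable method. It assumes
Python 3 and splits on whitespace only.

```python
import unicodedata

class WidthAwareWrapper:
    """Hypothetical greedy word wrapper; not part of textwrap."""

    def __init__(self, width=70):
        self.width = width

    def chunk_width(self, chunk):
        # Default: plain len(), as textwrap measures chunks today.
        return len(chunk)

    def wrap(self, text):
        lines, current, current_width = [], [], 0
        for word in text.split():
            w = self.chunk_width(word)
            # Add one column for the space if the line already has words.
            needed = w if not current else current_width + 1 + w
            if current and needed > self.width:
                lines.append(" ".join(current))
                current, current_width = [word], w
            else:
                current.append(word)
                current_width = needed
        if current:
            lines.append(" ".join(current))
        return lines

class CombiningAwareWrapper(WidthAwareWrapper):
    def chunk_width(self, chunk):
        # Combining marks are assumed to occupy no column of their own.
        return sum(1 for ch in chunk if unicodedata.combining(ch) == 0)

text = "de\u0301ge\u0301ne\u0301re\u0301 abc"
print(WidthAwareWrapper(width=12).wrap(text))      # two lines
print(CombiningAwareWrapper(width=12).wrap(text))  # one line
```

With width=12 the default measures the first word as 12 characters and
pushes "abc" to a second line, while the derived class measures it as 8
columns and keeps both words on one line.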

And that leads to another question: does Python have a function akin to
wcwidth() which gives the number of column positions a Unicode character
needs?
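
For reference, something in that direction can be approximated with the
unicodedata module. This is only a sketch, assuming Python 3: the names
approx_wcwidth and approx_wcswidth are invented here, control characters
are not treated specially (wcwidth() returns -1 for those), and the
result will not match a real wcwidth() in every case.

```python
import unicodedata

def approx_wcwidth(ch):
    """Rough column width of a single character (assumption-laden)."""
    # Nonspacing and enclosing marks sit on the previous character.
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    # East Asian Wide and Fullwidth characters take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def approx_wcswidth(text):
    """Sum of the approximate widths of all characters in text."""
    return sum(approx_wcwidth(ch) for ch in text)

print(approx_wcwidth("a"))       # 1
print(approx_wcwidth("\u0301"))  # 0, combining acute accent
print(approx_wcwidth("\u4e2d"))  # 2, a CJK ideograph
```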

Berteun


