[Python-Dev] textwrap and unicode

M.-A. Lemburg mal@lemburg.com
Wed, 23 Oct 2002 10:12:48 +0200


Greg Ward wrote:
> On 22 October 2002, Martin v. Loewis said:
> 
> OK, then it's an implementation problem rather than a "you can't get
> there from here" problem.  Good.  The reason I need a list of
> "whitespace chars" is to convert all whitespace to spaces; I use
> string.maketrans() and s.translate() to do this efficiently:

Use the trick Fredrik posted: u' '.join(x.split()) (.split() defaults
to splitting on whitespace, Unicode whitespace if x is Unicode).
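
A minimal sketch of that trick, written for current Python, where every
str is Unicode. The function name normalize_whitespace is only an
illustrative choice, not anything in textwrap:

```python
def normalize_whitespace(text):
    # split() with no arguments splits on runs of whitespace, including
    # Unicode whitespace such as U+00A0 and U+2003, and drops empty strings.
    return ' '.join(text.split())


sample = 'one\ttwo\u2003three\n four'
print(normalize_whitespace(sample))
```

Note that this collapses each run of whitespace into a single space and
strips leading and trailing whitespace, so it is not a one-for-one
character replacement the way translate() is.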

> Ahh, OK, I'm starting to see the problem: there's nothing wrong with the
> translate() method of strings or unicode strings, but string.maketrans()
> doesn't generate a mapping that u''.translate() likes.  Hmmmm.

Unicode uses a different API for this: u''.translate() takes a mapping
from code points to replacements, since it wouldn't make sense to pass
a Unicode string of sys.maxunicode characters to translate just to map
a few of them.
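
For illustration, here is a sketch of the mapping-based form, written
for current Python. The WHITESPACE string is an assumed, deliberately
incomplete sample of whitespace characters, and the names SPACE_MAP and
whitespace_to_spaces are made up for this example:

```python
# Assumed sample of whitespace characters; not a complete Unicode list.
WHITESPACE = '\t\n\x0b\x0c\r \u00a0\u2003'

# translate() on Unicode strings takes a mapping keyed by code point,
# so only the characters of interest need an entry.
SPACE_MAP = {ord(ch): ' ' for ch in WHITESPACE}


def whitespace_to_spaces(text):
    # Each whitespace character becomes exactly one space; the length
    # of the string is unchanged.
    return text.translate(SPACE_MAP)


print(repr(whitespace_to_spaces('a\tb\u00a0c\n')))
```

Unlike the split/join trick, this keeps one space per original
whitespace character, which is closer to what maketrans() plus
translate() does for 8-bit strings.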

> The other bit of ASCII/English prejudice hardcoded into textwrap.py is
> this regex:
> 
>     sentence_end_re = re.compile(r'[%s]'              # lowercase letter
>                                  r'[\.\!\?]'          # sentence-ending punct.
>                                  r'[\"\']?'           # optional end-of-quote
>                                  % string.lowercase)
> 
> You may recall this from the kerfuffle over whether there should be two
> spaces after a sentence in fixed-width fonts.  The feature is there, and
> off by default, in TextWrapper.  I'm not so concerned about this -- I
> mean, this doesn't even work with German or French, never mind Hebrew or
> Chinese or Hindi.  Apart from the narrow definition of "lowercase
> letter", it has English punctuation conventions hardcoded into it.  But
> still, it seems *awfully* dumb in this day and age to hardcode
> string.lowercase into a regex that's meant to detect "lowercase
> letters".  But I couldn't find a better way to do it when I wrote this
> code last spring.  Is there one?

There are far too many lowercase characters in Unicode to make
this approach usable. It would be better if there were a
way to use Unicode character categories in re sets. Since
that's not available, why not search for all potential sentence
ends and then test each of them using .islower() in a for-loop?
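
A rough sketch of that idea, written for current Python. The names
punct_re and sentence_end_positions are hypothetical, and the
punctuation set is simply the one from the quoted regex, so the English
punctuation conventions are still hardcoded:

```python
import re

# Candidate sentence ends: sentence-ending punctuation followed by an
# optional closing quote. The lowercase test is done in Python instead
# of inside the character set.
punct_re = re.compile(r'[.!?]["\']?')


def sentence_end_positions(text):
    ends = []
    for match in punct_re.finditer(text):
        start = match.start()
        # Accept the candidate only if the character just before the
        # punctuation is lowercase according to the Unicode database.
        if start > 0 and text[start - 1].islower():
            ends.append(match.end())
    return ends


print(sentence_end_positions('Tr\u00e8s bien. OK. fertig!'))
```

Here .islower() handles accented and non-Latin lowercase letters
without string.lowercase having to enumerate them.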

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/