schizophrenic view of what is white space

Hrvoje Niksic hniksic at xemacs.org
Thu Dec 4 16:40:46 EST 2008


MRAB <google at mrabarnett.plus.com> writes:

> I'm not sure why the Unicode flag is needed in the API. I reckon
> that it should just look at the text that the regular expression is
> being applied to: if it's Unicode then follow the Unicode rules, if
> not then don't.

It might be that using Unicode tables for lookup of character classes
slows things down considerably because the tables are huge.  It is
useful to be able to treat Unicode strings the same way ASCII strings
are treated, but the question is what the default should be.

Whitespace is probably not controversial, but many parsers tend to
expect things like \d to match [0-9], not any Unicode character marked
as "digit".  For example, I'm not sure if this behavior would be a
good default:

>>> import re
>>> re.match(r'\d', u'\u0660', re.UNICODE)
<_sre.SRE_Match object at 0xb7da0250>

What digit is \u0660, out of 0-9?  Hard to say without consulting the
Unicode tables.  If re.UNICODE were the default for Unicode strings,
code that expected \d to match only an ASCII digit would have a
problem on its hands -- especially so in Python 3, where that would
apply to *all* strings.
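
To make the concern concrete, here is a minimal sketch, assuming
Python 3, where str patterns match by Unicode rules unless re.ASCII is
passed.  The helper names match_digit and naive_digit_value are made
up for illustration; the second one stands in for the kind of
arithmetic a parser might do if it assumed \d means [0-9].

```python
import re
import unicodedata

ARABIC_INDIC_ZERO = "\u0660"  # ARABIC-INDIC DIGIT ZERO


def match_digit(text, ascii_only=False):
    # Hypothetical helper: try to match a single \d at the start of text.
    # re.ASCII restricts \d to [0-9] for str patterns in Python 3.
    flags = re.ASCII if ascii_only else 0
    return re.match(r"\d", text, flags)


def naive_digit_value(ch):
    # Hypothetical parser logic that assumes ch is one of '0'..'9'.
    return ord(ch) - ord("0")


unicode_match = match_digit(ARABIC_INDIC_ZERO)
ascii_match = match_digit(ARABIC_INDIC_ZERO, ascii_only=True)

print(unicode_match is not None)              # True: Unicode \d accepts it
print(ascii_match is not None)                # False: ASCII-only \d rejects it
print(unicodedata.digit(ARABIC_INDIC_ZERO))   # 0, according to the Unicode tables
print(naive_digit_value(ARABIC_INDIC_ZERO))   # 1584, not a value in 0-9
```

The character does have a digit value, but only code that asks the
Unicode database (or calls int()) gets it right; code written with
[0-9] in mind silently computes nonsense.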



More information about the Python-list mailing list