schizophrenic view of what is white space
MRAB
google at mrabarnett.plus.com
Thu Dec 4 15:30:29 EST 2008
Terry Reedy wrote:
> MRAB wrote:
>> Robin Becker wrote:
>>> Jean-Paul Calderone wrote:
>>> .........
>>>>
>>>> You have to give the re module an additional hint that you care about
>>>> unicode:
>>>>
>>>> exarkun at charm:~$ python
>>>> Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52) [GCC 4.2.3
>>>> (Ubuntu 4.2.3-2ubuntu7)] on linux2
>>>> Type "help", "copyright", "credits" or "license" for more information.
>>>> >>> import re
>>>> >>> print re.compile(r'\s').search(u'a\xa0b')
>>>> None
>>>> >>> print re.compile(r'\s', re.U).search(u'a\xa0b')
>>>> <_sre.SRE_Match object at 0xb7dbb3a0>
>>>> >>>
>>>>
>>>> Jean-Paul
>>> .......
>>>
>>> so the default behaviour differs for unicode and re working on
>>> unicode. I suppose that won't be true in Python 3.
>>
>> I'm not sure why the Unicode flag is needed in the API. I reckon that
>> it should just look at the text that the regular expression is being
>> applied to: if it's Unicode then follow the Unicode rules, if not then
>> don't.
>
> I presume because \b is interpreted and replaced when the re is compiled
> into internal state machine form.
>
The regular expression is compiled to codes which are then interpreted.
There are two versions of the matcher, one for bytestrings and another
for Unicode, so I don't think that making it agnostic would be too
difficult to achieve.
Interestingly, the matcher treats every bytestring character as just a
Unicode codepoint, so re.match(chr(0x80), unichr(0x80)) succeeds! I
suppose it should complain if only one of the regex and the text is
Unicode and the regex contains a literal or a literal character set (if
the regex is, say, just \w, then it doesn't matter).
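
For comparison, here is a minimal sketch of how the two points above play
out in Python 3. It assumes a Python 3 interpreter, not the 2.5.2 session
quoted earlier, and the variable names are only illustrative. There, str
patterns follow the Unicode rules by default, re.ASCII opts out of them,
and mixing a bytes pattern with str text is rejected outright:

```python
import re

# U+00A0 (NO-BREAK SPACE) sits between 'a' and 'b'.
text = "a\xa0b"

# In Python 3 a str pattern uses Unicode rules by default,
# so \s matches the no-break space without any flag.
unicode_match = re.compile(r"\s").search(text)

# re.ASCII restricts \s to ASCII whitespace only, which is
# closest to the flagless Python 2 behaviour quoted above.
ascii_match = re.compile(r"\s", re.ASCII).search(text)

# Applying a bytes pattern to str text raises TypeError instead
# of silently comparing byte values with codepoints.
try:
    re.match(b"\x80", "\x80")
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

print(unicode_match)  # a match object covering the no-break space
print(ascii_match)    # None
print(mixed_error)    # the TypeError raised for the bytes/str mix
```

So the default differs by the type of the pattern rather than by a flag,
and the mixed case raises an error even when the pattern is a plain
literal.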