schizophrenic view of what is white space

Thu Dec 4 15:30:29 EST 2008

Terry Reedy wrote:
> MRAB wrote:
>> Robin Becker wrote:
>>> Jean-Paul Calderone wrote:
>>> .........
>>>>
>>>> You have to give the re module an additional hint that you care about
>>>> unicode:
>>>>
>>>>  exarkun@charm:~$ python
>>>>  Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52)
>>>>  [GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
>>>>  Type "help", "copyright", "credits" or "license" for more information.
>>>>  >>> import re
>>>>  >>> print re.compile(r'\s').search(u'a\xa0b')
>>>>  None
>>>>  >>> print re.compile(r'\s', re.U).search(u'a\xa0b')
>>>>  <_sre.SRE_Match object at 0xb7dbb3a0>
>>>>  >>>
>>>>
>>>> Jean-Paul
>>> .......
>>>
>>> so the default behaviour differs for unicode and re working on 
>>> unicode. I suppose that won't be true in Python 3.
>>
>> I'm not sure why the Unicode flag is needed in the API. I reckon that 
>> it should just look at the text that the regular expression is being 
>> applied to: if it's Unicode then follow the Unicode rules, if not then 
>> don't.
> 
> I presume because \b is interpreted and replaced when the re is compiled 
> into internal state machine form.
> 
The regular expression is compiled to opcodes, which are then interpreted.
There are two versions of the matcher, one for bytestrings and another for
Unicode, so I don't think it would be too difficult to have it choose the
rules from the type of the text being matched.
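
For comparison, here is a rough sketch of type-driven behaviour as I
understand it to work in Python 3's re module (not the 2.5 session quoted
above, so treat the version as an assumption): str patterns follow the
Unicode rules by default, re.ASCII opts out, and bytes patterns only ever
use the ASCII rules. The variable names are just for illustration.

```python
import re

text = 'a\xa0b'  # U+00A0 NO-BREAK SPACE between 'a' and 'b'

# str pattern: Unicode rules by default, so \s matches U+00A0.
unicode_match = re.search(r'\s', text)

# re.ASCII restricts \s to ASCII whitespace only.
ascii_match = re.search(r'\s', text, re.ASCII)

# bytes pattern on bytes text: byte 0xA0 is not ASCII whitespace.
bytes_match = re.search(rb'\s', b'a\xa0b')

print(unicode_match is not None, ascii_match is None, bytes_match is None)
```

So there the extra hint is only needed to turn the Unicode rules off, not
to turn them on.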

Interestingly, it treats every bytestring character as the Unicode
codepoint with the same value, so re.match(chr(0x80), unichr(0x80))
succeeds! I suppose it should complain if only one of the regex and the
text is Unicode and the regex contains a literal or a literal character
set (if the regex is, say, just \w then it doesn't matter).
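
A minimal sketch of the "complain" behaviour, assuming Python 3, where as
far as I know the re module raises TypeError whenever a bytes pattern
meets str text or vice versa (stricter than what I suggest above, since
it does not look at whether the pattern contains literals).
mixing_is_rejected is a hypothetical helper, not part of re.

```python
import re

def mixing_is_rejected(pattern, text):
    """Return True if re refuses to apply pattern to text (TypeError)."""
    try:
        re.match(pattern, text)
    except TypeError:
        return True
    return False

# bytes pattern against str text, and the reverse: both are refused.
print(mixing_is_rejected(b'\x80', '\x80'))
print(mixing_is_rejected('\x80', b'\x80'))

# Same type on both sides is accepted.
print(mixing_is_rejected('\x80', '\x80'))
```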