python3 raw strings and \u escapes

Thu May 31 01:43:10 EDT 2012

On 30 May, 08:52, "ru... at yahoo.com" <ru... at yahoo.com> wrote:
> In python2, "\u" escapes are processed in raw unicode
> strings.  That is, ur'\u3000' is a string of length 1
> consisting of the IDEOGRAPHIC SPACE unicode character.
>
> In python3, "\u" escapes are not processed in raw strings.
> r'\u3000' is a string of length 6 consisting of a backslash,
> 'u', '3' and three '0' characters.
>
> This breaks a lot of my code because in python 2
>       re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
> but in python 3 (the result of running 2to3),
>       re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']
>
> I can remove the "r" prefix from the regex string but then
> if I have other regex backslash symbols in it, I have to
> double all the other backslashes -- the very thing that
> the r-prefix was invented to avoid.
>
> Or I can leave the "r" prefix and replace something like
> r'[ \u3000]' with r'[ 　]'.  But that is confusing because
> one can't distinguish between the space character and
> the ideographic space character.  It is also a problem if a
> reader of the code doesn't have a font that can display
> the character.
>
> Was there a reason for dropping the lexical processing of
> \u escapes in strings in python3 (other than to add another
> annoyance in a long list of python3 annoyances?)
>
> And is there no choice for me but to choose between the two
> poor choices I mention above to deal with this problem?

I suggest looking at the problem differently. Python 3
succeeded in bringing order to the mismatch in the "coding
of characters" that Python 2 offered.

The 'IDEOGRAPHIC SPACE' and 'REVERSE SOLIDUS' (backslash)
"characters" (in fact unicode code points) are just (normal)
"characters". The backslash, used as an escape character in
normal (non-raw) string literals, keeps its function.
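
To make the difference concrete, here is a minimal sketch comparing a
normal literal with a raw one in Python 3. The names plain and raw are
only illustrative.

```python
# In a normal literal the parser processes the \u escape: one code point.
plain = '\u3000'
# In a raw literal nothing is processed: backslash, 'u', '3', '0', '0', '0'.
raw = r'\u3000'

print(len(plain), hex(ord(plain)))   # 1 0x3000
print(len(raw), list(raw))           # 6 ['\\', 'u', '3', '0', '0', '0']
```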

Note the absence of the r'...' prefix:

>>> s = 'a\u3000é\u3000€'
>>> s.split('\u3000')
['a', 'é', '€']
>>>
>>> import re
>>> re.split('\u3000', s)
['a', 'é', '€']

>>> s = 'a\\b\\c'
>>> print(s)
a\b\c
>>> s.split('\\')
['a', 'b', 'c']
>>> re.split('\\\\', s)
['a', 'b', 'c']

>>> hex(ord('\\'))
'0x5c'
>>> re.split('\u005c\u005c', s)
['a', 'b', 'c']
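
If the regex also contains other backslash sequences, one possible way to
avoid doubling them is to keep those parts raw and write only the \u escape
as a normal literal, relying on the compile-time concatenation of adjacent
string literals. The sketch below is only an illustration (the names pattern
and text are made up for it). Its last line additionally assumes Python 3.3
or later, where the re module itself understands \uXXXX inside a pattern.

```python
import re

# Adjacent string literals are joined by the compiler, so the regex parts
# stay raw and only the middle piece has its \u escape processed.
# The resulting class matches IDEOGRAPHIC SPACE or a literal backslash.
pattern = r'[' '\u3000' r'\\]'
text = 'A\u3000B\\C'

print(re.split(pattern, text))             # ['A', 'B', 'C']

# Assumption: Python 3.3+. Here re interprets \u3000 on its own,
# so a fully raw pattern splits on the ideographic space as well.
print(re.split(r'[\u3000]', 'A\u3000A'))   # ['A', 'A']
```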

jmf