python3 raw strings and \u escapes

jmfauth wxjmfauth at gmail.com
Wed May 30 13:58:53 EDT 2012


On 30 May, 13:54, Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa... at spamschutz.glglgl.de> wrote:
> On 30.05.2012 08:52, ru... at yahoo.com wrote:
>
>
>
> > This breaks a lot of my code because in python 2
> >        re.split (ur'[\u3000]', u'A\u3000A') ==>  [u'A', u'A']
> > but in python 3 (the result of running 2to3),
> >        re.split (r'[\u3000]', 'A\u3000A' ) ==>  ['A\u3000A']
>
> > I can remove the "r" prefix from the regex string but then
> > if I have other regex backslash symbols in it, I have to
> > double all the other backslashes -- the very thing that
> > the r-prefix was invented to avoid.
>
> > Or I can leave the "r" prefix and replace something like
> > r'[ \u3000]' with r'[  ]'.  But that is confusing because
> > one can't distinguish between the space character and
> > the ideographic space character.  It is also a problem if a
> > reader of the code doesn't have a font that can display
> > the character.
>
> > Was there a reason for dropping the lexical processing of
> > \u escapes in strings in python3 (other than to add another
> > annoyance in a long list of python3 annoyances?)
>
> Probably it is more consistent. Alas, it makes the whole thing
> incompatible with Py2.
>
> But if you think about it: why allow for \u if \r, \n etc. are
> disallowed as well?
>
> > And is there no choice for me but to choose between the two
> > poor choices I mention above to deal with this problem?
>
> There is a 3rd one: use   r'[ ' + '\u3000' + ']'. Not very nice to read,
> but should do the trick...
>
> Thomas

I suggest approaching the problem differently. Python 3
succeeded in bringing order to the mismatch in the "coding
of the characters" that Python 2 offered.

In your case, the

>>> import unicodedata as ud
>>> ud.name('\u3000')
'IDEOGRAPHIC SPACE'

"character" (in fact a Unicode code point) is just
a "character" like

>>> ud.name('a')
'LATIN SMALL LETTER A'

The code point / Unicode logic that Python 3 proposes and follows
then becomes straightforward.

>>> s = 'a\u3000é\u3000€'
>>> s.split('\u3000')
['a', 'é', '€']
>>>
>>> import re
>>> re.split('\u3000', s)
['a', 'é', '€']
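
To tie this back to the character class in the original question,
here is a minimal sketch (not the only way to do it) that builds the
pattern from raw pieces plus a normal string literal carrying the
\u escape. The names IDEOGRAPHIC_SPACE, SPACES and split_on_spaces
are my own illustrations, not something from the thread.

```python
import re

# Normal (non-raw) literal: the \u escape is turned into one code point here.
IDEOGRAPHIC_SPACE = '\u3000'

# Raw pieces keep regex backslashes single; the code point is spliced in.
SPACES = re.compile(r'[ ' + IDEOGRAPHIC_SPACE + r']+')

def split_on_spaces(text):
    """Hypothetical helper: split on runs of ' ' or IDEOGRAPHIC SPACE."""
    return SPACES.split(text)

print(split_on_spaces('A\u3000A'))     # ['A', 'A']
print(split_on_spaces('a \u3000b c'))  # ['a', 'b', 'c']

# Since Python 3.3 the re module also interprets \uXXXX itself,
# so the raw pattern from the original question should work there too.
print(re.split(r'[\u3000]', 'A\u3000A'))  # ['A', 'A'] on 3.3+
```

Naming the code point once keeps the pattern readable even when a
font cannot display the character.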


The backslash, used as a "real backslash", remains what it
was in Python 2. Note the absence of r'...'.

>>> s = 'a\\b\\c'
>>> print(s)
a\b\c
>>> s.split('\\')
['a', 'b', 'c']
>>> re.split('\\\\', s)
['a', 'b', 'c']

>>> hex(ord('\\'))
'0x5c'
>>> re.split('\u005c\u005c', s)
['a', 'b', 'c']
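
If counting backslashes by hand feels error-prone, a small sketch
along the same lines: check the lengths explicitly and let
re.escape() produce the pattern for a literal backslash. The
variable names are only for illustration.

```python
import re

s = 'a\\b\\c'                     # three letters separated by single backslashes

assert len('\\') == 1             # one character, code point 0x5c
assert '\\\\' == r'\\'            # regex for one literal backslash, written two ways
assert re.escape('\\') == r'\\'   # re.escape() builds the same two-character pattern

parts = re.split(re.escape('\\'), s)
print(parts)                      # ['a', 'b', 'c']
```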

jmf



