python3 raw strings and \u escapes

rurpy at yahoo.com
Thu May 31 16:28:46 EDT 2012


On 05/30/2012 09:07 AM, rurpy at yahoo.com wrote:
> On 05/30/2012 05:54 AM, Thomas Rachel wrote:
>> On 30.05.2012 08:52, rurpy at yahoo.com wrote:
>>
>>> This breaks a lot of my code because in python 2
>>>        re.split (ur'[\u3000]', u'A\u3000A') ==>  [u'A', u'A']
>>> but in python 3 (the result of running 2to3),
>>>        re.split (r'[\u3000]', 'A\u3000A' ) ==>  ['A\u3000A']
>>>
>>> I can remove the "r" prefix from the regex string but then
>>> if I have other regex backslash symbols in it, I have to
>>> double all the other backslashes -- the very thing that
>>> the r-prefix was invented to avoid.
>>>
>>> Or I can leave the "r" prefix and replace something like
>>> r'[ \u3000]' with r'[  ]'.  But that is confusing because
>>> one can't distinguish between the space character and
>>> the ideographic space character.  It also a problem if a
>>> reader of the code doesn't have a font that can display
>>> the character.
>>>
>>> Was there a reason for dropping the lexical processing of
>>> \u escapes in raw strings in python3 (other than to add
>>> another annoyance to a long list of python3 annoyances)?
>>
>> Probably it is more consistent.  Alas, it makes the whole
>> thing incompatible with Py2.
>>
>> But if you think about it: why allow \u when \r, \n etc.
>> are not processed either?
>
> Maybe the blame is elsewhere then...  If the re module
> interprets (in a regex string) the 2-character sequence of
> a backslash followed by 'n' as a single newline character,
> then why wasn't re changed for Python 3 to interpret the
> 6-character string r'\u3000' as a single unicode character,
> to correspond with Python's lexer no longer doing that (as
> it did in Python 2)?
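
To make the asymmetry concrete (behavior as of Python 3.2,
with made-up data; 3.3 is slated to teach re about \u
escapes itself):

    import re

    # re itself interprets the 2-character sequence
    # backslash + 'n' as a newline and splits on it:
    re.split(r'\n', 'a\nb')            # ['a', 'b']

    # but the unknown escape \u just matches a literal 'u',
    # so the class below is [u30] and nothing in 'A\u3000A'
    # matches:
    re.split(r'[\u3000]', 'A\u3000A')  # ['A\u3000A']
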
>
>>> And is there no choice for me but to choose between the two
>>> poor choices I mention above to deal with this problem?
>>
>> There is a third one: use r'[ ' + '\u3000' + ']'.  Not very
>> nice to read, but should do the trick...
>
> I guess the "+"s could be left out, allowing something
> like:
>
>   '[ \u3000]' r'\w+ \d{3}'
>
> but I'll have to try it a little; maybe just doubling
> backslashes won't be much worse.  I did that for years
> in Perl and lived through it.
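
For what it's worth, the prefix-mixing trick does work, since
adjacent literals are concatenated at compile time, raw and
non-raw alike.  A small sketch (pattern and data made up for
illustration):

    import re

    # the non-raw piece supplies the \u3000 character, the
    # raw piece keeps its regex backslashes intact:
    pat = '[ \u3000]' r'\d{3}'         # == '[ \u3000]\\d{3}'
    re.findall(pat, 'x\u3000123 456')  # ['\u3000123', ' 456']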

Just for some closure: there are many places in my code
that I have had (and still have) to track down and change.
But the biggest problem so far is a lexer module that is
structured as many dozens of little functions, each with a
docstring that is a regex string.

The only way I found to change these and maintain sanity was
to go through them and remove the "r" prefix from any strings
that contain "\unnnn" literals, and then double any other
backslashes in the string.
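
For example, a hypothetical Python 2 pattern converted that
way:

    # Python 2:  ur'\u3000\s+\d{3}'  (ur-strings processed \u3000)
    # Python 3, "r" dropped and the other backslashes doubled:
    pat = '\u3000\\s+\\d{3}'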

Since these are docstrings, creating them with executable
code was awkward, and using adjacent string concatenation
led to a very confusing mix of string styles.  Strings that
used concatenation often had a single logical regex structure
(e.g. a character class "[...]") split between two strings.
The extra quote characters were as visually confusing as
doubled backslashes in many cases.
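
A hypothetical token function, in the style of lexer
generators that read a token's pattern from its docstring
(e.g. PLY), shows the trade-off between the two spellings:

    def t_ws_concat(t):
        '[ \u3000' r'\t]+'   # class split across two literals
        return t

    def t_ws_doubled(t):
        '[ \u3000\\t]+'      # same pattern, backslashes doubled
        return t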

Strings with doubled backslashes, although harder to read,
were much easier to edit reliably and, in their way, more
regular.  It does make this module look very Perlish
though... :-)


