[Python-ideas] Make non-meaningful backslashes illegal in string literals

Fri Aug 7 09:15:34 CEST 2015

On Fri, Aug 7, 2015 at 3:12 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Thu, Aug 06, 2015 at 12:26:14PM -0400, random832 at fastmail.us wrote:
>> On Wed, Aug 5, 2015, at 14:56, Eric V. Smith wrote:
>> > Because strings containing \{ are currently valid
>>
>> Which raises the question of why.
>
> Because \C is currently valid, for all values of C. The idea is that if
> you typo an escape, say \d for \f, you get an obvious backslash in your
> string which is easy to spot.
>
> Personally, I think that's a mistake. It leads to errors like this:
>
> filename = 'C:\some\path\something.txt'
>
> silently doing the wrong thing. If we're going to change the way escapes
> work, it's time to deprecate the misfeature that \C is a literal
> backslash followed by C. Outside of raw strings, a backslash should
> *only* be allowed in an escape sequence.

I agree; plus, it means there's yet another thing for people to
complain about when they switch to Unicode strings:

path = "c:\users", "C:\Users" # OK on Py2
path = u"c:\users", u"C:\Users" # SyntaxError: truncated \uXXXX / \UXXXXXXXX escape

Or equivalently, moving to Py3, where those strings quietly become
Unicode strings and \U and \u suddenly start escape sequences.
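
A minimal sketch of that failure on Python 3, compiling the literal's
source text at run time so the example file itself stays valid. The
helper name evaluate_literal is made up for illustration.

```python
import warnings

def evaluate_literal(source):
    """Compile and evaluate source text that holds one string literal.

    Returns the resulting str, or a "SyntaxError: ..." message if the
    compiler rejects the literal.
    """
    with warnings.catch_warnings():
        # Some Python 3 versions warn about unrecognized escapes; silence that.
        warnings.simplefilter("ignore")
        try:
            return eval(compile(source, "<literal>", "eval"))
        except SyntaxError as exc:
            return "SyntaxError: " + str(exc.msg)

# Raw strings below hold the *source text* of each literal under test.
rejected = evaluate_literal(r'"c:\users"')   # \u must be followed by four hex digits
doubled = evaluate_literal(r'"c:\\users"')   # doubled backslash: one real backslash
tabbed = evaluate_literal(r'"c:\temp\new"')  # \t and \n are real escapes: silent damage
```

Here rejected holds a SyntaxError message, doubled is the intended
path, and tabbed contains a tab and a newline instead of backslashes.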

That said, though: It's now too late to change Python 2, which means
that this is going to be yet another hurdle when people move
(potentially large) Windows codebases to Python 3. IMO it's a good
thing to trip people up immediately, rather than silently doing the
wrong thing - but it is going to be another thing that people moan
about when Python 3 starts complaining. First they have to add
parentheses to print, then it's all those pointless (in their eyes)
encode/decode calls, and now they have to go through and double all
their backslashes as well! But the alternative is that some future
version of Python adds a new escape code, and all their code starts
silently doing weird stuff - or they change the path name and it goes
haywire (changing from "c:\users\demo" to "c:\users\all users" will be
a fun one to diagnose) - so IMO it's better to know about it early.
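
As a rough sketch of "knowing about it early", a checker along these
lines could flag backslashes that don't start a recognized escape. The
function name and the RECOGNIZED set are my own illustration, based on
the Python 3 str escapes; it only looks at the character right after
the backslash and doesn't validate the digits that follow \x, \u or \U.

```python
import re

# Characters that may follow a backslash in a Python 3 str literal:
# newline, backslash, quotes, the single-letter escapes, octal digits,
# and the starts of \x, \N{...}, \u and \U.
RECOGNIZED = set("\n\\'\"abfnrtv01234567xNuU")

def unrecognized_escapes(literal_body):
    """Return every backslash pair in the text that is not a known escape."""
    return [
        match.group(0)
        for match in re.finditer(r"\\(.)", literal_body, re.DOTALL)
        if match.group(1) not in RECOGNIZED
    ]

# Raw strings stand in for the text between the quotes of a literal.
flagged = unrecognized_escapes(r"C:\some\path\something.txt")
missed = unrecognized_escapes(r"c:\users\all users")
```

flagged lists \s, \p and \s, while missed is empty: \u and \a are both
recognized escapes, so that path needs a different kind of check.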

> If we're going to make major changes to the way escapes work, I'd rather
> add new escapes, not take them away:
>
>
> \e escape \x1B, as supported by gcc and clang;

Please, yes! It's also supported by a number of other languages and
commands: Pike, GNU echo, and some others that I don't recall (though
not bind9, which has its own peculiarities).
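
For reference, this is what you write today without \e. A small
sketch; ESC and bold are illustrative names, and the sequences are the
usual ANSI ones for bold and reset.

```python
ESC = "\x1b"  # the character a \e escape would produce; same as "\033"

def bold(text):
    # Wrap text in the ANSI SGR sequences for "bold on" and "reset".
    return ESC + "[1m" + text + ESC + "[0m"
```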

> the escaping rules from Haskell:
>
> http://book.realworldhaskell.org/read/characters-strings-and-escaping-rules.html
>
> \P platform-specific newline (e.g. \r\n on Windows, \n on POSIX)

Hmm. Not sure how useful this would be. Personally, I consider this to
be a platform-specific encoding, on par with expecting b"\xc2\xa1" to
display "¡", and as such, it should be handled at the I/O boundaries.
Work with "\n" internally, have input routines convert to that, and
have output routines optionally add "\r" before each one.
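
A minimal sketch of that boundary approach using the standard io
module, with an in-memory buffer standing in for a file or socket (the
variable names are just for illustration):

```python
import io

# Output boundary: newline="\r\n" makes the wrapper translate each "\n"
# written into "\r\n" on the way out.
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="utf-8", newline="\r\n")
out.write("one\ntwo\n")
out.flush()
wire = raw.getvalue()

# Input boundary: newline=None enables universal newlines, so "\r\n"
# comes back as plain "\n".
inp = io.TextIOWrapper(io.BytesIO(wire), encoding="utf-8", newline=None)
text = inp.read()

# The same idea for the encoding analogy: bytes become text only by decoding.
inverted_bang = b"\xc2\xa1".decode("utf-8")
```

Everything between the two wrappers only ever sees "\n".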

> \U+xxxx Unicode code point U+xxxx (with four to six hex digits)
>
> It's much nicer to be able to write Unicode code points that (apart from
> the backslash) look like the standard Unicode notation U+0000 to
> U+10FFFF, rather than needing to pad to a full eight digits as the
> \U00xxxxxx syntax requires.

The problem is the ambiguity. How do you specify that "\U+101010" be a
two-character string? "\U000101010" forces it, because the escape always
consumes exactly eight digits and the trailing "0" is left as a separate
character, but as soon as you allow variable numbers of digits, you run
into problems. I suppose you could always pad to six for that:
"\U+0101010" could know that it doesn't need a seventh digit. (Though
what would ever happen if the Unicode consortium decides to drop
support for UTF-16 and push for a true 32-bit character set, I don't
know.) It is tempting, though - it both removes the need for two
pointless zeroes, and broadly unifies the syntax for Unicode escapes,
instead of having a massive boundary from "\u1234" to "\U00012345".
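
To make the ambiguity concrete, here is a toy expander for the
proposed syntax. \U+ is not valid Python, so this works on raw text;
expand_u_plus is a made-up name, and the greedy four-to-six-digit rule
is just my assumption about how such an escape might be read.

```python
import re

# Hypothetical \U+ escape: backslash, "U", "+", then four to six hex
# digits, matched greedily.
_U_PLUS = re.compile(r"\\U\+([0-9A-Fa-f]{4,6})")

def expand_u_plus(text):
    # Replace each hypothetical escape with the code point it names.
    # Values above 10FFFF would make chr() raise ValueError; not handled here.
    return _U_PLUS.sub(lambda match: chr(int(match.group(1), 16)), text)

one_char = expand_u_plus(r"\U+101010")    # all six digits taken: U+101010
two_chars = expand_u_plus(r"\U+0101010")  # padded to six: U+10101 then "0"
swallowed = expand_u_plus(r"\U+00E9a")    # meant U+00E9 then "a"; read as U+0E9A
```

The last case is the nasty one: any hex digit following a short escape
gets pulled into it unless you pad.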

ChrisA