Unrecognized escape sequences in string literals

Mon Aug 10 18:17:24 EDT 2009

From: Steven D'Aprano <ste... at REMOVE.THIS.cybersource.com.au> wrote:

> On Mon, 10 Aug 2009 00:32:30 -0700, Douglas Alan wrote:

> > In C++, if I know that the code I'm looking at compiles,
> > then I never need worry that I've misinterpreted what a
> > string literal means.

> If you don't know what your string literals are, you don't
> know what your program does. You can't expect the compiler
> to save you from semantic errors. Adding escape codes into
> the string literal doesn't change this basic truth.

I grow weary of these semantic debates. The bottom line is
that C++'s strategy here catches bugs early on that Python's
approach doesn't. It does so at no additional cost.

>From a purely practical point of view, why would any
language not want to adopt a zero-cost approach to catching
bugs, even if they are relatively rare, as early as
possible?

(Other than the reason that adopting it *now* is sadly too
late.)

Furthermore, Python's strategy here is SPECIFICALLY
DESIGNED, according to the reference manual to catch bugs.
I.e., from the original posting on this issue:

     Unlike Standard C, all unrecognized escape sequences
     are left in the string unchanged, i.e., the backslash
     is left in the string. (This behavior is useful when
     debugging: if an escape sequence is mistyped, the
     resulting output is more easily recognized as broken.)

If this "feature" is designed to catch bugs, why be
half-assed about it? Especially since there seems to be
little valid use case for allowing programmers to be lazy in
their typing here.

> The compiler can't save you from typing 1234 instead of
> 11234, or 31.45 instead of 3.145, or "My darling Ho"
> instead of "My darling Jo", so why do you expect it to
> save you from typing "abc\d" instead of "abc\\d"?

Because in the former cases it can't catch the the bug, and
in the latter case, it can.

> Perhaps it can catch *some* errors of that type, but only
> at the cost of extra effort required to defeat the
> compiler (forcing the programmer to type \\d to prevent
> the compiler complaining about \d). I don't think the
> benefit is worth the cost. You and your friend do. Who is
> to say you're right?

Well, Bjarne Stroustrup, for one.

All of these are value judgments, of course, but I truly
doubt that anyone would have been bothered if Python from
day one had behaved the way that C++ does. Additionally, I
expect that if Python had always behaved the way that C++
does, and then today someone came along and proposed the
behavior that Python currently implements, so that the
programmer could sometimes get away with typing a bit less,
such a person would be chided for not understanding the Zen
of Python.

> > You don't have to go running for the manual every time
> > you see code with backslashes, where the upshot might be
> > that the programmer was merely saving themselves some
> > typing.

> Why do you care if there are "funny characters"?

Because, of course, "funny characters" often have
interesting consequences when output. Furthermore, their
consequences aren't always immediately obvious from looking
at the source code, unless you are intimately familiar with
the function of the special characters in question.

For instance, sometimes in the wrong combination, they wedge
your xterm. Etc.

I'm surprised that this needs to be spelled out.

> In C++, if you see an escape you don't recognize, do you
> care?

Yes, of course I do. If I need to know what the program
does.

> Do you go running for the manual? If the answer is No,
> then why do it in Python?

The answer is that I do in both cases.

> No. \z *is* a legal escape sequence, it just happens to map to \z.

> If you stop thinking of \z as an illegal escape sequence
> that Python refuses to raise an error for, the problem
> goes away. It's a legal escape sequence that maps to
> backslash + z.

(1) I already used that argument on my friend, and he wasn't
buying it. (Personally, I find the argument technically
valid, but commonsensically invalid. It's a language-lawyer
kind of argument, rather than one that appeals to any notion
of real aesthetics.)

(2) That argument disagrees with the Python reference
manual, which explicitly states that "unrecognized escape
sequences are left in the string unchanged", and that the
purpose for doing so is because it "is useful when
debugging".

> > "\x" is not a legal escape sequence. Shouldn't it also
> > get left as "\\x"?
>
> No, because it actually is an illegal escape sequence.

What makes it "illegal". As far as I can tell, it's just
another "unrecognized escape sequence". JavaScript treats it
that way. Are you going to be the one to tell all the
JavaScript programmers that their language can't tell a
legal escape sequence from an illegal one?

> > Well, I think he's more annoyed that if Python is going
> > to be so helpful as to put in the missing "\" for you in
> > "foo\zbar", then it should put in the missing "\" for
> > you in "\". He considers this to be an inconsistency.
>
> (1) There is no missing \ in "foo\zbar".
>
> (2) The problem with "\" isn't a missing backslash, but a
> missing end- quote.

Says who? All of this really depends on your point of
view. The whole morass goes away completely if one adopts
C++'s approach here.

> Python isn't DWIMing here. The rules are simple and straightforward,
> there's no mind-reading or guessing required.

It may not be a complex form of DWIMing, but it's still
DWIMing a bit. Python is figuring that if I typed "\z", then
either I must have really meant to type "\\z", or that I
want to see the backslash when I'm debugging because I made
a mistake, or that I'm just too lazy to type "\\z".

> Is it "a form of DWIMing" to consider 1.234e1 and 12.34
> synonymous?

That's a very different issue, as (1) there are very
significant use cases for both kinds of numerical
representations, and (2) there's often only one obvious way
way that the number should be entered, depending on the
coding situation.

> What about 86 and 0x44? Is that DWIMing?

See previous comment.

> I'm sure both you and your friend are excellent
> programmers, but you're tossing around DWIM as a
> meaningless term of opprobrium without any apparent
> understand of what DWIM actually is.

I don't know if my friend even knows the term DWIM, other
than me paraphrasing him, but I certainly understand all
about the term. It comes from InterLisp. When DWIM was
enabled, your program would run until it hit an error, and
for certain kinds of errors, it would wait a few seconds for
the user to notice the error message, and if the user didn't
tell the program to stop, it would try to figure out what
the user most likely meant, and then continue running using
the computer-generated "fix".

I.e., more or less like continuing on in the face of what
the Python Reference manual refers to as an "unrecognized
escape sequence".

|>ouglas