Unrecognized escape sequences in string literals

Mon Aug 10 05:40:24 EDT 2009

On Mon, 10 Aug 2009 00:32:30 -0700, Douglas Alan wrote:

> In C++, if I know that the code I'm looking at compiles, then I never
> need worry that I've misinterpreted what a string literal means.

If you don't know what your string literals are, you don't know what your 
program does. You can't expect the compiler to save you from semantic 
errors. Adding escape codes into the string literal doesn't change this 
basic truth.

Semantics matters, and unlike syntax, the compiler can't check it. 
There's a difference between a program that does the equivalent of:

    os.system("cp myfile myfile~")

and one which does this:

    os.system("rm myfile myfile~")

The compiler can't save you from typing 1234 instead of 11234, or 31.45 
instead of 3.145, or "My darling Ho" instead of "My darling Jo", so why 
do you expect it to save you from typing "abc\d" instead of "abc\\d"?

Perhaps it can catch *some* errors of that type, but only at the cost of 
extra effort required to defeat the compiler (forcing the programmer to 
type \\d to prevent the compiler complaining about \d). I don't think the 
benefit is worth the cost. You and your friend do. Who is to say you're 
right?

> At
> least not if it doesn't have any escape characters in it that I'm not
> familiar with. But in Python, if I see, "\f\o\o\b\a\z", I'm not really
> sure what I'm seeing, as I surely don't have committed to memory some of
> the more obscure escape sequences. If I saw this in C++, and I knew that
> it was in code that compiled, then I'd at least know that there are some
> strange escape codes that I have to look up. 

And if you saw that in Python, you'd also know that there are some 
strange escape codes that you have to look up. Fortunately, in Python, 
that's really simple:

>>> "\f\o\o\b\a\z"
'\x0c\\o\\o\x08\x07\\z'

Immediately you can see that the \o and \z sequences resolve to 
themselves, and the \f \b and \a don't.
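
If you'd rather not eyeball the repr, the same check can be scripted. This is only a rough sketch: resolves_to_itself is a name I made up for illustration, and it assumes a Python version that still accepts unrecognized escapes (recent versions emit a warning for them, which is silenced here).

```python
import warnings

def resolves_to_itself(letter):
    """Return True if backslash+letter in a string literal stays backslash+letter."""
    source = "\\" + letter  # two characters: a backslash and the letter
    with warnings.catch_warnings():
        # Newer Pythons warn about unrecognized escapes; ignore that here.
        warnings.simplefilter("ignore")
        value = eval('"' + source + '"')
    return value == source

# The letters from "\f\o\o\b\a\z" whose escapes map to themselves.
kept = [c for c in "foobaz" if resolves_to_itself(c)]
print(kept)
```

For the string above, that leaves only the o's and the z, which agrees with the repr.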

> Unlike with Python, it
> would never be the case in C++ code that the programmer who wrote the
> code was just too lazy to type in "\\f\\o\\o\\b\\a\\z" instead.

But if you see "abc\n", you can't be sure whether the lazy programmer 
intended "abc"+newline, or "abc"+backslash+"n". Either way, the compiler 
won't complain.
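
To make the two readings concrete, here is a small sketch (the variable names are mine). Both literals are accepted without complaint, and only the lengths give away which one you got:

```python
newline_version = "abc\n"      # a, b, c, newline
backslash_version = "abc\\n"   # a, b, c, backslash, n

print(len(newline_version), len(backslash_version))

# The raw-string spelling is the same string as the doubled-backslash one.
print(backslash_version == r"abc\n")
```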

>> You just have to memorize it. If you don't know what a backslash escape
>> is going to do, why would you use it?
> 
> (1) You're looking at code that someone else wrote, or (2) you forget to
> type "\\" instead of "\" in your code (or get lazy sometimes), as that
> is okay most of the time, and you inadvertently get a subtle bug.

The same error can occur in C++, if you intend \\n but type \n by 
mistake. Or vice versa. The compiler won't save you from that.

>> This is especially important when reading (as opposed to writing) code.
>> You read somebody else's code, and see "foo\xbar\n". Let's say you know
>> it compiles without warning. Big deal -- you don't know what the escape
>> codes do unless you've memorized them. What does \n resolve to? chr(13)
>> or chr(97) or chr(0)? Who knows?
> 
> It *is* a big deal. Or at least a non-trivial deal. It means that you
> can tell just by looking at the code that there are funny characters in
> the string, and not just a backslashes. 

I'm not entirely sure why you think that's a big deal. Strictly speaking, 
there are no "funny characters", not even \0, in Python. They're all just 
characters. Perhaps the closest is newline (which is pretty obvious).

> You don't have to go running for
> the manual every time you see code with backslashes, where the upshot
> might be that the programmer was merely saving themselves some typing.

Why do you care if there are "funny characters"?

In C++, if you see an escape you don't recognize, do you care? Do you go 
running for the manual? If the answer is No, then why do it in Python?

And if the answer is Yes, then how is Python worse than C++?

[...]
> Also, it seems that Python is being inconsistent here. Python knows that
> the string "\x" doesn't contain a full escape sequence, so why doesn't
> it treat the string "\x" the same way that it treats the string "\z"?
[...]
> I.e., "\z" is not a legal escape sequence, so it gets left as "\\z".

No. \z *is* a legal escape sequence, it just happens to map to \z.

If you stop thinking of \z as an illegal escape sequence that Python 
refuses to raise an error for, the problem goes away. It's a legal escape 
sequence that maps to backslash + z.

> "\x" is not a legal escape sequence. Shouldn't it also get left as
> "\\x"?

No, because it actually is an illegal escape sequence.
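
A quick way to see the difference for yourself, as a sketch only: compiles is a helper name I invented, and the outcome for "\z" assumes a Python version where unrecognized escapes are still accepted (the warning newer versions give is silenced).

```python
import warnings

def compiles(source):
    """Return True if source compiles as a Python expression."""
    try:
        with warnings.catch_warnings():
            # Ignore the warning newer Pythons give for escapes like \z.
            warnings.simplefilter("ignore")
            compile(source, "<example>", "eval")
    except (SyntaxError, ValueError):
        return False
    return True

z_ok = compiles(r'"\z"')      # the literal "\z"
x_ok = compiles(r'"\x"')      # the literal "\x": \x with no hex digits
x41_ok = compiles(r'"\x41"')  # the literal "\x41": a complete \x escape
print(z_ok, x_ok, x41_ok)
```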

>> > He's particularly annoyed too, that if he types "foo\xbar" at the
>> > REPL, it echoes back as "foo\\xbar". He finds that to be some sort of
>> > annoying DWIM feature, and if Python is going to have DWIM features,
>> > then it should, for example, figure out what he means by "\" and not
>> > bother him with a syntax error in that case.
>>
>> Now your friend is confused. This is a good thing. Any backslash you
>> see in Python's default string output is *always* an escape:
> 
> Well, I think he's more annoyed that if Python is going to be so helpful
> as to put in the missing "\" for you in "foo\zbar", then it should put
> in the missing "\" for you in "\". He considers this to be an
> inconsistency.

(1) There is no missing \ in "foo\zbar".

(2) The problem with "\" isn't a missing backslash, but a missing end-
quote.
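
You can check point (2) directly. In this sketch (try_eval is my own helper name), the only difference between the two source texts is one extra quote character at the end:

```python
def try_eval(source):
    """Evaluate source, returning the SyntaxError instead of raising it."""
    try:
        return eval(source)
    except SyntaxError as exc:
        return exc

unterminated = try_eval(r'"\"')   # source text is: "\"
terminated = try_eval(r'"\""')    # source text is: "\""

print(type(unterminated).__name__)
print(repr(terminated))
```

The first is a string that never ends, because \" is an escaped quote. Supply the end-quote and you get a perfectly good one-character string.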

> Me, I'd never, ever, EVER want a language to special-case something at
> the end of a string, but I can see that from his new-to-Python
> perspective, Python seems to be DWIMing in one place and not the other,
> and he thinks that it should either do no DWIMing at all, or
> consistently DWIM. To not be consistent in this regard is "inelegant",
> says he.

Python isn't DWIMing here. The rules are simple and straightforward: 
there's no mind-reading or guessing required. There is no heuristic 
trying to predict what the user intends. It's a simple rule:

When parsing a string literal (apart from raw strings), if you see a 
backslash, then grab the next token (usually a single character, but for 
\x and \0 it could be multiple characters). If there is a mapping 
available for that token, insert that in the string being built, and if 
not, insert the backslash and the token.

(As I said earlier, this may not be precisely how it is implemented, but 
functionally, it is what Python does.)
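
For the curious, the rule can be sketched in a few lines of Python. This is an illustration, not how the interpreter is actually written: decode_escapes and SIMPLE_ESCAPES are names I made up, and the sketch deliberately leaves out \N{...}, \u, \U and backslash-newline.

```python
import string

# Single-character escapes that have a mapping.
SIMPLE_ESCAPES = {
    "\\": "\\", "'": "'", '"': '"',
    "a": "\a", "b": "\b", "f": "\f",
    "n": "\n", "r": "\r", "t": "\t", "v": "\v",
}

def decode_escapes(body):
    """Apply the backslash rule to the text between the quotes."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("backslash at end of literal body")
        nxt = body[i + 1]
        if nxt in SIMPLE_ESCAPES:
            # A mapping exists: insert the mapped character.
            out.append(SIMPLE_ESCAPES[nxt])
            i += 2
        elif nxt == "x":
            # \x must be followed by exactly two hex digits.
            digits = body[i + 2:i + 4]
            if len(digits) != 2 or any(d not in string.hexdigits for d in digits):
                raise ValueError("truncated \\xXX escape")
            out.append(chr(int(digits, 16)))
            i += 4
        elif nxt in "01234567":
            # Octal escape: one to three octal digits.
            j = i + 1
            while j < len(body) and j < i + 4 and body[j] in "01234567":
                j += 1
            out.append(chr(int(body[i + 1:j], 8)))
            i = j
        else:
            # No mapping: insert the backslash and the character unchanged.
            out.append("\\" + nxt)
            i += 2
    return "".join(out)

print(repr(decode_escapes(r"\f\o\o\b\a\z")))
```

Note there is no guessing anywhere in there: it is a table lookup with a fall-through case.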

> And I can see his point that allowing "foo\zbar" and "foo\\zbar" to be
> synonymous is a form of DWIMing.

Is it "a form of DWIMing" to consider 1.234e1 and 12.34  synonymous?

What about 68 and 0x44? Is that DWIMing?
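
Different spellings of the same value are nothing special to strings, as this small sketch shows (the variable names are mine):

```python
# Two spellings of the same float.
same_float = (1.234e1 == 12.34)

# Hex, octal and decimal spellings of the same integer.
same_int = (0x44 == 0o104 == 68)

print(same_float, same_int)
```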

I'm sure both you and your friend are excellent programmers, but you're 
tossing around DWIM as a meaningless term of opprobrium without any 
apparent understanding of what DWIM actually is.

-- 
Steven