Problem with -3 switch

Carl Banks pavlovevidence at gmail.com
Mon Jan 12 15:39:38 EST 2009


On Jan 12, 5:26 am, John Machin <sjmac... at lexicon.net> wrote:
> On Jan 12, 7:29 pm, Carl Banks <pavlovevide... at gmail.com> wrote:
>
>
>
> > On Jan 12, 12:32 am, John Machin <sjmac... at lexicon.net> wrote:
>
> > > On Jan 12, 12:23 pm, Carl Banks <pavlovevide... at gmail.com> wrote:
>
> > > > On Jan 9, 6:11 pm, John Machin <sjmac... at lexicon.net> wrote:
>
> > > > > On Jan 10, 6:58 am, Carl Banks <pavlovevide... at gmail.com> wrote:
> > > > > > I expect that it'd be a PITA in some cases to use the transitional
> > > > > > dialect (like getting all your Us in place), but that doesn't mean the
> > > > > > language is crippled.
>
> > > > > What is this "transitional dialect"? What does "getting all your Us in
> > > > > place" mean?
>
> > > > Transitional dialect is the subset of Python 2.6 that can be
> > > > translated to Python3 with 2to3 tool.
>
> > > I'd never seen it called "transitional dialect" before.
>
> > I had hoped the context would make it clear what I was talking about.
>
> In vain.

You were the one who was mistaken about what Steve and Cliff were
talking about, chief.  Maybe if you'd paid better attention you would
have gotten it?


> > > >  Getting all your Us in place
> > > > refers to prepending a u to strings to make them unicode objects,
> > > > which is something 2to3 users are highly advised to do to keep hassles
> > > > to a minimum.  (Getting Bs in place would be a good idea too.)
>
> > > Ummm ... I'm not understanding something. 2to3 changes u"foo" to
> > > "foo", doesn't it? What's the point of going through the code and
> > > changing all non-binary "foo" to u"foo" only so that 2to3 can rip the
> > > u off again?
>
> > It does a bit more than that.
>
> Like what?

Never mind; I was confusing it with a different tool.  (Someone had a
source code processing tool that replaced strings with their reprs a
while back.)  My bad.


> > > What hassles? Who's doing the highly-advising where and
> > > with what supporting argument?
>
> > You add the u so that the constant will be the same data type in 2.6 as
> > it becomes in 3.0 after applying 2to3.  str and unicode objects don't
> > always play smoothly with each other, and you have a much better chance
> > of getting the same behavior in 2.6 and 3.0 if you use an actual
> > unicode string in both.
>
> (1) Why specifically 2.6? Do you mean 2.X, or is this related to the
> "port to 2.6 first" theory?

It's not a theory.  2to3 was designed to translate a subset of 2.6
code to 3.0.  It's not designed to translate arbitrary 2.6 code, nor
any 2.5 or lower code.  It might work well enough from 2.5, but it
wasn't designed for it.

> (2) We do assume we are starting off with working 2.X code, don't we?
> If we change "foo" to u"foo" and get a different answer from the 2.X
> code, is that still "working"?

Of course it's not "working" in 2.6, and that's the point: you want it
to work in 2.6 with Unicode strings because it has to run in 3.0 with
Unicode strings.
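
Quick illustration (a made-up 2.6 snippet, just for the sake of
argument):

# coding: utf-8
data = "café"
print len(data)       # 5 in 2.6: the length of the UTF-8 bytes

data = u"café"
print len(data)       # 4 in 2.6, and 4 in 3.0 after 2to3 strips the u

The plain-str version "works" in 2.6, but 2to3 leaves the literal alone
and 3.0 answers 4, so the behavior changes silently.  The u"" version
forces 2.6 to give the 3.0 answer up front.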


> > A example of this, though not with string constants,
>
> And therefore irrelevant.

Well, it wasn't irrelevant from my viewpoint, which was "make sure you
are using only unicode and bytes objects, never str objects".  But if
you want to talk about string constants specifically, ok.


> I would like to hear from someone who has actually started with
> working 2.x code and changed all their text-like "foo" to
> u"foo" [except maybe unlikely suspects like open()'s mode arg]:
> * how many places where the 2.x code broke and so did the 3.x code
> [i.e. the problem would have been detected without prepending u]

I think you're missing the point.  This isn't merely about detecting
errors; it's about making the code in 2.6 behave as similarly to 3.0
as possible, and that includes internal behavior.  When you have mixed
str and unicode objects, 2.6 has to do a lot of encoding and decoding
under the covers; in 3.0 that won't be happening.  That increases the
risk of divergent behavior, and is something you want to avoid.
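
For instance (a minimal 2.6 sketch; the literals are made up purely for
illustration):

# coding: utf-8
# In 2.6, concatenating unicode with a str makes the interpreter decode
# the str behind your back with the default (ASCII) codec:
try:
    print u"Hello, " + "müller"     # the str operand holds UTF-8 bytes
except UnicodeDecodeError as e:
    print "implicit decode blew up:", e

# All-unicode code never needs the hidden decode, and it matches what
# the same line will do in 3.0 once 2to3 strips the u prefixes:
print u"Hello, " + u"müller"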

If you think your test suite is invincible and can catch every
possible edge case where some encoding or decoding mishap occurs, be
my guest and don't do it.

Also, I'm not sure why you think it's preferable to run tests on 3.0
and then, to fix a failure, have to go back to the 2.6 codebase, run
2to3 again, apply the patch again, and retest.  I don't know, maybe it
makes sense for people with a timemachine.py module, but I don't think
it'll make sense for most people.


> * how many places where the 2.x code broke but the 3.x code didn't
> [i.e. prepending u did find the problem]

If you think this was the main benefit of doing that you are REALLY
missing the point.  The point isn't to find problems in 2.6, it's to
modify 2.6 to behave as similarly to 3.0 as possible.


> * whether they thought it was worth the effort
>
> In the meantime I would be interested to hear from anybody with a made-
> up example of code where the problem would be detected (sooner |
> better | only) by prepending u to text-like string constants.

Here's one for starters.  The mistake was using a multibyte character
in a str object in 2.6.  2to3 would have converted this to a script
with different behavior.  If the u"" had been present on the string,
it would have had the same behavior in both 2.6 and 3.0.  (Well, the
repr is different, but it's a repr of the same object in both.)

# coding: utf-8
print repr("abcd¥")
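
For comparison, here's roughly what the fixed version would look like;
the outputs in the comments are what I'd expect (typed from memory, so
check them yourself):

# coding: utf-8
print repr(u"abcd¥")    # 2.6: u'abcd\xa5'
                        # 3.0, after 2to3 strips the u: 'abcd¥'

The reprs look different, but it's the same five-character text object
in both versions, whereas the str version above is six bytes in 2.6 and
five characters in 3.0.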

Out of curiosity, do


> > 2to3 can only do so
> > much; it can't always guess whether your string usage is supposed to
> > be character or binary.
>
> AFAICT it *always* guesses text rather than binary; do you have any
> examples where it guesses binary (rightly or wrongly)?

Again, not the point.  It's not whether 2to3 guesses correctly, but
whether the runtime does different things in the two versions.


Carl Banks


