[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Sat Apr 25 00:05:04 CEST 2009

2009/4/24 Stephen J. Turnbull <stephen at xemacs.org>:
> Paul Moore writes:
>
>  > The pros for Martin's proposal are a uniform cross-platform interface,
>  > and a user-friendly API for the common case.
>
> A more accurate phrasing would be "... a user-friendly API for those
> who feel very lucky today."  Which is the common case, of course, but
> spins a little differently.

Sorry, but I think you're misrepresenting things. I'd have probably
let you off if you'd missed out the "very" - but I do think that it's
the common case. Consider:

- Windows systems where broken Unicode (lone surrogates or whatever)
isn't involved
- Unix systems where the user's stated filesystem encoding is correct

Can you honestly say that this isn't the vast majority of real-world
environments? (IIRC, you are based in Japan, so it may well be true
that the likelihood of problems is a lot higher where you are than
where I am - the UK - but I suspect that averaging out, things are
generally as above).

>  > [1] Actually, all the PEP says is "With this PEP, a uniform
>  > treatment of these data as characters becomes possible." An
>  > argument as to why this is a good thing would be a useful addition
>  > to the PEP. At the moment it's more or less treated as self-evident
>  > - which I agree with, but which clearly the Unix people here are
>  > not as certain of.
>
> Well, the problem is that both parts are false.

I can't work out which "parts" you are referring to here.

> If you didn't start
> with a valid string in a known encoding, you shouldn't treat it as
> characters because it's not.

Again, that's the purist argument. If you have a string (of bytes, I
guess) and a 99% certain guess as to the correct encoding, then I'd
argue that, as long as (a) it's not mission-critical (lives or backups
depend on it) and (b) you have a means of failing relatively
gracefully, you have every reason to make the assumption about
encoding.

After all, what's the alternative? Ultimately, you have a byte string
and no encoding. You make some assumption, or you can do hardly
anything. What use is "Processing file \x66\x6f\x6f" as a progress
indicator for a program that scans a directory? (That was "foo" for
people who can't read latin-1 written in hex :-))
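
As a rough illustration of "make an assumption, but fail gracefully", here
is a minimal Python sketch. The helper name display_name and the default
guess of UTF-8 are assumptions made for this example, not anything taken
from the PEP; decoding with errors="backslashreplace" needs Python 3.5 or
later.

```python
def display_name(raw: bytes, guessed_encoding: str = "utf-8") -> str:
    """Decode a filename for display using a best-guess encoding.

    Hypothetical helper: bytes that do not fit the guess are shown as
    \\xNN escapes instead of raising, so a progress message still works.
    """
    return raw.decode(guessed_encoding, errors="backslashreplace")


good = display_name(b"\x66\x6f\x6f")   # decodes cleanly to 'foo'
bad = display_name(b"caf\xe9")         # 0xE9 alone is not valid UTF-8
print("Processing file", good)
print("Processing file", bad)
```

When the guess is right the user sees a readable name, and when it is wrong
the output degrades to escapes rather than stopping the scan.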

> Hand it to a careful API, and you'll get
> an Exception raised in your face.  And that's precisely why it's not
> obviously a good thing.  Careful clients will have to treat it as
> "transcoded bytes", and so the people who develop those clients get no
> benefit.  OTOH, at least some of those who feel lucky and use it
> naively are going to turn out to be wrong.

But 99% of the time, "it" is a perfectly acceptable string.
(Percentage invented out of thin air, admittedly :-)) Remember, a
technically invalid string would be generated only when the system
encounters an undecodable byte sequence - and as far as I can tell,
the main case when that would happen is on Unix, if the user specifies
UTF-8 as the encoding, and the actual filesystem uses something else,
*and* there's a file with a name whose byte sequence is invalid UTF-8.
I'm *really* struggling to see that as a common scenario.
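
The scenario above can be sketched in a few lines of Python. This assumes
the "surrogateescape" error handler that PEP 383 specifies (available from
Python 3.1); the file name is made up for the example.

```python
# A Latin-1 encoded name on a system whose declared encoding is UTF-8.
raw = b"caf\xe9.txt"

# The undecodable byte 0xE9 is smuggled into the string as the lone
# surrogate U+DCE9 instead of raising UnicodeDecodeError.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = name.encode("utf-8", errors="surrogateescape")

# A strict encoder is the "careful API": it rejects the lone surrogate.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

A name that is valid UTF-8 passes through unchanged, so only the unusual
name ends up as a technically invalid string.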

Admittedly, there are other, possibly more common, cases where the
string translation is valid, but semantically not what the user
expects - user says CP1252, but filesystem is CP850, say. As a UK
Windows user, I'm used to seeing CP850 vs CP1252 confusions like this
- "£" replaced with "ú" is the common case. It happens occasionally, and
occasionally causes code to behave unexpectedly. But it doesn't
reformat my hard drive and the alternative (having to be extra-careful
to tell every program precisely which encoding I'm using in every
situation) would make programs effectively unusable.
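
A small Python sketch of that kind of confusion, assuming the Windows code
page involved is CP1252 (Western European), where byte 0xA3 is "£", while
CP850 maps the same byte to "ú":

```python
# The writer encodes with the Windows ANSI code page...
pound_bytes = "£".encode("cp1252")

# ...and the reader decodes with the OEM (console) code page.
shown = pound_bytes.decode("cp850")

print(pound_bytes)  # b'\xa3'
print(shown)        # ú
```

The decode succeeds without any error, which is why this sort of mismatch
shows up as a wrong character rather than as an exception.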

> That said, I'm +0 on the PEP as is.

So I'm largely preaching to the converted here. After all, lukewarm
acceptance from someone with experience of Asian encoding issues is
pretty much the equivalent of resounding support from someone who only
ever works in English! :-)

Paul.