Magic UTF-8/Windows-1252 encodings

Chris Angelico rosuav at gmail.com
Tue Aug 30 06:03:13 EDT 2016


On Tue, Aug 30, 2016 at 7:36 PM, Johannes Bauer <dfnsonfsduifb at gmx.de> wrote:
> On 29.08.2016 17:59, Chris Angelico wrote:
>
>> Fair enough. If this were something that a lot of programs wanted,
>> then yeah, there'd be good value in stdlibbing it. Character encodings
>> ARE hard to get right, and this kind of thing does warrant some help.
>> But I think it's best not done in core - at least, not until we see a
>> lot more people doing the same :)
>
> I hope this kind of botchery never makes it in the stdlib. It directly
> contradicts "In the face of ambiguity, refuse the temptation to guess."
>
> If you don't know what the charset is, don't guess. It'll introduce
> subtle ambiguities and ugly corner cases and will make life for the
> rest of us -- who are trying to get our charsets straight and correct
> -- a living hell.
>
> Having such silly "magic" guessing stuff is actually detrimental to the
> whole concept of properly identifying and using character sets.
> Everything about the thought makes me shiver.

In the clinical purity of theoretical work, I absolutely agree with
you, and for that reason, this definitely doesn't belong in the
stdlib. But designers need to leave their wonderlands - the real world
is not so wonderful. (Nan Sharpe, to Alice Liddell.) If every program
in the world understood character encodings and correctly decoded
bytes using a known encoding and encoded text using the same encoding
(preferably UTF-8), then sure, it'd be easy. But when your program has
to cope with other people's bytes-that-ought-to-represent-text,
sometimes guessing IS better than choking. This example is a perfect
one: a naive byte-oriented server accepts ASCII-compatible text from a
variety of clients and sends it out to all clients. (Since all the
parts that the server actually parses are ASCII, this works.) Very
commonly, naive Windows clients send text in their native encoding,
e.g. CP-1252, but smarter clients generally send UTF-8. I want my client to
interoperate perfectly with other UTF-8 clients, which is generally
easy (the only breakage is if the server attempts to letter-wrap a
massively long word, and ends up breaking a UTF-8 sequence across
lines), but I also want to have a decent fallback for the eight-bit
clients. Obviously I can't *know* the encoding used - if they were
smart enough to send encoding info, they'd most likely use UTF-8 - so
it's either guess, or choke on any non-ASCII bytes.
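
For the curious, the decode step amounts to roughly this (an untested
sketch -- the function name is invented on the spot, and the fallback
encoding is whatever your eight-bit clients actually tend to use):

def decode_lenient(raw):
    """Try strict UTF-8 first; fall back to Windows-1252 for naive
    eight-bit clients. It's a guess, not a certainty."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Windows-1252 defines nearly every byte value, and
        # errors="replace" papers over the few it leaves undefined.
        return raw.decode("windows-1252", errors="replace")

Trying UTF-8 first is safe, because non-trivial CP-1252 text is very
unlikely to also happen to be valid UTF-8, so the fallback only fires
on input that genuinely isn't UTF-8.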

Another place where guessing is VERY useful is when I'm leafing
through 300 subtitle files for "Tangled" and want to know whether
they're accurate transcriptions or not. (Not hypothetical. Been doing
exactly that for a lot of this weekend. It seemed logical, since I've
done the same for "Frozen", and both movies are excellent.) All I have
is a file - a sequence of bytes. I know it's an ASCII-compatible
encoding because the numeric positioning info looks correct. If my
program "avoided the temptation to guess", I would have to manually
test a dozen encodings until one of them looked right to me, the
human; but instead, I use chardet plus some other heuristics, and
generally the program's right on either the first or second guess.
That means just two encodings for me to look at, often just one, and
only going to the full dozen or so if it gets it completely wrong.
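
The heuristic pass boils down to roughly this (a sketch only -- chardet
is a third-party module, and the "other heuristics" I mentioned aren't
shown):

import chardet  # third-party: pip install chardet

def guess_encoding(path):
    # Rank the likely encodings so the human only has to eyeball one
    # or two candidates instead of a dozen.
    with open(path, "rb") as f:
        raw = f.read()
    guess = chardet.detect(raw)  # dict: best-guess encoding + confidence
    encoding = guess["encoding"] or "windows-1252"  # chardet may return None
    return encoding, raw.decode(encoding, errors="replace")

Print the guessed encoding and confidence next to each file name and
you can skim the whole directory in one pass.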

The principle "refuse the temptation to guess" applies to core data
types and such (and not even universally there), but NOT to
applications, where you need domain knowledge to make that kind of
call.

ChrisA


