Magic UTF-8/Windows-1252 encodings

Chris Angelico rosuav at gmail.com
Mon Aug 29 11:14:00 EDT 2016


On Tue, Aug 30, 2016 at 12:38 AM, Random832 <random832 at fastmail.com> wrote:
> Directing this to python-list because it's really not on the topic of
> the idea being discussed.
>
> On Mon, Aug 29, 2016, at 05:37, Chris Angelico wrote:
>> Suppose I come to python-ideas and say "Hey, the MUD community would
>> really benefit from a magic decoder that would use UTF-8 where
>> possible, ISO-8859-1 as fall-back, and Windows-1252 for characters not
>> in 8859-1". Apart from responding that 8859-1 is a complete subset of
>> 1252,
>
> ISO-8859-1, with a dash in between "ISO" and "8859" is not a complete
> subset of 1252. In fact, ISO-8859-1-with-a-dash incorporates ISO 6429
> for 0x80-0x9F, and thereby has no bytes that do not map to characters.
> Incidentally, many Windows encodings, including 1252, as they are
> actually used do use ISO 6429 for bytes that do not map to characters,
> even when best fit mappings are not accepted. It is unclear why they
> published tables that define these bytes as undefined, which have been
> picked up by independent implementations of these encodings such as the
> ones in Python. The only reason I can think of is to reserve the ability
> to add new mappings later, as they did for 0x80 to U+20AC.
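
For reference, the difference between the two tables can be checked
from Python 3 itself. This is a small sketch, not anything from the
original thread; the variable name is made up for illustration. It
lists the bytes in 0x80-0x9F that CPython's "cp1252" codec refuses to
decode, and confirms that "latin-1" maps every byte value.

```python
# Collect the bytes in the C1 range that CPython's cp1252 codec rejects.
undefined_in_cp1252 = []
for value in range(0x80, 0xA0):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined_in_cp1252.append(value)

# latin-1 (ISO-8859-1 as Python implements it) maps every byte to the
# code point with the same number, so this never raises.
all_latin1 = bytes(range(256)).decode("latin-1")

# 0x80 is one of the bytes cp1252 does define: it decodes to U+20AC.
euro = b"\x80".decode("cp1252")

print([hex(v) for v in undefined_in_cp1252])
```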

Huh, okay. Anyway, point is that it's a magical decoder that tries
UTF-8, and if that fails, uses an eight-bit encoding.
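
As a rough illustration of that two-step decode, here is a minimal
Python 3 sketch. The function name and the errors="replace" choice are
assumptions made for the example, not the actual code from the client
or server. Python's cp1252 codec leaves a few bytes undefined, so the
fallback step needs some error policy of its own.

```python
def magic_decode(data: bytes, fallback: str = "cp1252") -> str:
    """Decode as UTF-8 if possible, else as an eight-bit encoding."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Hypothetical policy: bytes the fallback codec does not define
        # become U+FFFD instead of raising a second error.
        return data.decode(fallback, errors="replace")


print(magic_decode("café".encode("utf-8")))  # valid UTF-8
print(magic_decode(b"caf\xe9"))              # not UTF-8, read as 1252
```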

>> there's not really a lot that you could discuss about that
>> proposal, unless I were to show you some of my code. I can tell you
>> about the number of MUDs that I play, the number of MUD clients that
>> I've written, and some stats from my MUD server, and say "The MUD
>> community needs this support", but it's of little value compared to
>> actual code.
>>
>> (For the record, a two-step decode of "UTF-8, fall back on 1252" is
>> exactly what I do... in half a dozen lines of code. So this does NOT
>> need to be implemented.)
>
> And what level is the fallback done at? Per line? Per character? Per
> read result? Does encountering an invalid-for-UTF-8 byte put it
> permanently in Windows-1252 mode? Does it "retroactively" affect earlier
> bytes? Can it be used as a stream encoding, or does it require you to
> use bytes-based I/O and a separate .decode step?

Currently? UTF-8 is attempted on an entire read result, and if that
fails, the data is split into individual lines and each line is
retried: UTF-8 first, then the eight-bit fallback. So in effect, it's
per line. I
basically assume that a naive byte-oriented server is usually going to
be spitting out data from one client at a time, and each client is
either emitting UTF-8 or its native encoding. (Since I have no way of
knowing what native encoding a given client was using, I just pick
Western Europe as the most likely codepage and run with it. The
algorithm would work just the same if I picked, say, Windows-1250 as
the eight-bit encoding.)
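
The per-read-then-per-line behaviour described above could look
something like this in Python 3. It is a sketch under assumptions: the
function name, the default fallback and errors="replace" are all
invented for the example.

```python
def decode_read_result(data: bytes, fallback: str = "cp1252") -> str:
    """Try UTF-8 on the whole chunk; on failure, retry line by line."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass

    decoded = []
    for line in data.split(b"\n"):
        try:
            decoded.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            # Only lines that are not valid UTF-8 get the fallback.
            decoded.append(line.decode(fallback, errors="replace"))
    return "\n".join(decoded)


# One client sending UTF-8, another sending Windows-1252, in one read.
mixed = "naïve\n".encode("utf-8") + b"caf\xe9\n"
print(decode_read_result(mixed))
```

Swapping the fallback for, say, "cp1250" changes nothing structurally,
which matches the point about the choice of codepage being a guess.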

> I assume a MUD server isn't blocking on each client socket waiting for a
> newline character, so how does such a decoding step mesh with whatever
> such a server does to handle I/O asynchronously? Are there any
> frameworks that you could be using that you can't if it's not an
> encoding?

This magic started out in my MUD client, where it's connecting to a
naive server that echoes whatever it's given. The same logic is now in
my MUD server, too. It's pretty simple in both cases; the client is
built around asynchronous I/O, the server is threaded, but both of
them have a single point in the code where new bytes come in. There's
one function that converts bytes to text, and it operates on the above
algorithm.

> What happens if it's being used as an incremental decoder, encounters a
> valid UTF-8 lead byte on a buffer boundary, and then must "reject" (i.e.
> decode as the fallback encoding) it afterwards because an invalid trail
> byte follows it in the next buffer? What happens if a buffer consists
> only of a valid partial UTF-8 character?

Hmm, I don't remember if there's any actual handling of this. If
there's a problem, my solution is simple: split on 0x0A first, and
then decode, which means I'm decoding one line at a time. Both server
and client are already fundamentally line-based anyway, and byte
value 0x0A always and only represents U+000A in all of the encodings
that I'm willing to accept, so splitting before decoding is safe.
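
A sketch of the split-on-0x0A-first approach in Python 3, to show why
it sidesteps the buffer-boundary question: nothing is decoded until a
complete line has arrived, so a UTF-8 sequence cut in half by a read
simply waits in the buffer. The class and attribute names are
hypothetical, and this is not the code from either program.

```python
class LineDecoder:
    """Buffer raw bytes and decode only complete lines."""

    def __init__(self, fallback: str = "cp1252"):
        self.fallback = fallback
        self.pending = b""  # bytes after the last newline seen so far

    def feed(self, data: bytes) -> list:
        # Everything before the final b"\n" is complete; the remainder
        # (possibly empty) is kept for the next call.
        *complete, self.pending = (self.pending + data).split(b"\n")
        return [self._decode(line) for line in complete]

    def _decode(self, line: bytes) -> str:
        try:
            return line.decode("utf-8")
        except UnicodeDecodeError:
            return line.decode(self.fallback, errors="replace")


decoder = LineDecoder()
print(decoder.feed(b"caf\xc3"))     # lead byte only, no newline yet
print(decoder.feed(b"\xa9\nna"))    # trail byte arrives, line completes
print(decoder.feed(b"\xefve\n"))    # this line is not valid UTF-8
```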

> I can probably implement the fallback as an error handler in half a
> dozen lines, but it's not obvious and I suspect it's not what a lot of
> people do. It would probably take a bit more than half a dozen lines to
> implement it as an encoding.

Please don't. :) This is something that belongs in the application;
it's somewhat hacky, and I don't see any benefit to it going into the
language. For one thing, I could well imagine making the fallback
encoding configurable (it isn't currently, but it could easily be),
and that doesn't really fit into Python's notion of an error handler.
For another, this is a fairly rare concept - I don't see dozens of
programs out there using the exact same strange logic, and even if
there were, there'd be small differences (e.g. whether or not the
fallback is applied line-by-line). This was intended as an example of
something that does NOT belong in the core language, and while I
appreciate the offer of help, it's not something I'd support polluting
the language with :)
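
For comparison, the error-handler variant mentioned in the quoted text
can be done at application level with codecs.register_error. This is
only a sketch: the handler name is made up, the fallback encoding is
hard-coded, and it behaves differently from the per-line scheme,
because it falls back per invalid byte sequence rather than per line.

```python
import codecs

FALLBACK = "cp1252"  # assumption: fixed here, not configurable


def cp1252_fallback(exc):
    """Decode only the offending bytes with the eight-bit encoding."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position to resume from.
    return bad.decode(FALLBACK, errors="replace"), exc.end


codecs.register_error("cp1252_fallback", cp1252_fallback)

print(b"caf\xe9".decode("utf-8", errors="cp1252_fallback"))
# UTF-8 and 1252 mixed within a single line both come through:
print((b"\xc3\xa9" + b"\xe9").decode("utf-8", errors="cp1252_fallback"))
```

Making the fallback configurable would mean registering one handler
name per encoding, which is part of why this fits awkwardly.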

(Plus, my server's not written in Python. Nor is the client that this
started in, although I have considered writing a version of it in
Python, which would in theory benefit from this.)

ChrisA


