Magic UTF-8/Windows-1252 encodings

Mon Aug 29 10:38:26 EDT 2016

Directing this to python-list because it's really not on the topic of
the idea being discussed.

On Mon, Aug 29, 2016, at 05:37, Chris Angelico wrote:
> Suppose I come to python-ideas and say "Hey, the MUD community would
> really benefit from a magic decoder that would use UTF-8 where
> possible, ISO-8859-1 as fall-back, and Windows-1252 for characters not
> in 8859-1". Apart from responding that 8859-1 is a complete subset of
> 1252,

ISO-8859-1, with a dash in between "ISO" and "8859", is not a complete
subset of 1252. In fact, ISO-8859-1-with-a-dash incorporates the ISO 6429
C1 control codes for 0x80-0x9F, and thereby has no bytes that do not map
to characters. The magic encoding that people often ask for or use tries
UTF-8 first, Windows-1252 as a fallback, and ISO 6429 as the final
fallback (and may or may not involve a "side trip" through Windows-1252
for UTF-8 sequences that purport to encode code points between U+0080
and U+009F).
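
As a rough sketch of the last two tiers and the optional "side trip"
(the function names here are made up for illustration, and this is only
one way to read the description above):

```python
def decode_fallback_byte(value: int) -> str:
    """Decode one byte that was not valid UTF-8.

    Tries Windows-1252 first; if Python's cp1252 table leaves the byte
    undefined, falls back to latin-1, which maps every byte to the code
    point of the same value (so 0x80-0x9F become C1 controls).
    """
    raw = bytes([value])
    try:
        return raw.decode("cp1252")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def c1_side_trip(text: str) -> str:
    """Reinterpret U+0080..U+009F in already-decoded text via Windows-1252."""
    return "".join(
        decode_fallback_byte(ord(ch)) if "\x80" <= ch <= "\x9f" else ch
        for ch in text
    )


# 0x93 is a left curly quote in Windows-1252; 0x81 has no 1252 mapping.
quote = decode_fallback_byte(0x93)
control = decode_fallback_byte(0x81)
# b"\xc2\x93" is well-formed UTF-8 for U+0093, which the side trip remaps.
remapped = c1_side_trip(b"\xc2\x93".decode("utf-8"))
```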

Incidentally, many Windows encodings, including 1252, as they are
actually used on Windows do map bytes that have no assigned character to
the corresponding ISO 6429 controls, even when best fit mappings are not
accepted. It is unclear why Microsoft published tables that leave these
bytes undefined; those tables have been picked up by independent
implementations of these encodings, such as the ones in Python. The only
reason I can think of is to reserve the ability to add new mappings
later, as they did when 0x80 was assigned to U+20AC.
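
For what it's worth, you can see which bytes Python's own cp1252 codec
treats as undefined with a quick loop (a small check, nothing more):

```python
# Collect every byte value that Python's cp1252 codec refuses to decode.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(value)

print([hex(v) for v in undefined])

# 0x80 is the later addition mentioned above: it decodes to the euro sign.
euro = b"\x80".decode("cp1252")
print(hex(ord(euro)))
```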

> there's not really a lot that you could discuss about that
> proposal, unless I were to show you some of my code. I can tell you
> about the number of MUDs that I play, the number of MUD clients that
> I've written, and some stats from my MUD server, and say "The MUD
> community needs this support", but it's of little value compared to
> actual code.
> 
> (For the record, a two-step decode of "UTF-8, fall back on 1252" is
> exactly what I do... in half a dozen lines of code. So this does NOT
> need to be implemented.)

And what level is the fallback done at? Per line? Per character? Per
read result? Does encountering an invalid-for-UTF-8 byte put it
permanently in Windows-1252 mode? Does it "retroactively" affect earlier
bytes? Can it be used as a stream encoding, or does it require you to
use bytes-based I/O and a separate .decode step?
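
To show why the granularity matters, here is a sketch of a two-step
decode applied per read result versus per line. I don't know what your
code actually does; decode_whole and decode_per_line are hypothetical
names, and errors="replace" on the cp1252 step is my own choice to avoid
failing on the bytes that table leaves undefined.

```python
def decode_whole(data: bytes) -> str:
    """Two-step decode of one chunk: UTF-8, else Windows-1252 for all of it."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("cp1252", errors="replace")


def decode_per_line(data: bytes) -> str:
    """Same two-step decode, but the fallback decision is made line by line."""
    return "".join(
        decode_whole(line) for line in data.splitlines(keepends=True)
    )


# One line of UTF-8 followed by one line of Windows-1252 in the same buffer.
mixed = "caf\u00e9\n".encode("utf-8") + b"na\xefve\n"

whole = decode_whole(mixed)        # the UTF-8 line is decoded as 1252 too
per_line = decode_per_line(mixed)  # each line gets its own decision
```

With the fallback applied to the whole buffer, the invalid byte in the
second line "retroactively" turns the first line into mojibake; applied
per line, it doesn't.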

I assume a MUD server isn't blocking on each client socket waiting for a
newline character, so how does such a decoding step mesh with whatever
such a server does to handle I/O asynchronously? Are there any
frameworks that you could otherwise be using but can't, because this
isn't available as an encoding?

What happens if it's being used as an incremental decoder, encounters a
valid UTF-8 lead byte on a buffer boundary, and then must "reject" (i.e.
decode as the fallback encoding) it afterwards because an invalid trail
byte follows it in the next buffer? What happens if a buffer consists
only of a valid partial UTF-8 character?
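
For comparison, this is how Python's stock incremental UTF-8 decoder
behaves in those cases (strict error handling, no fallback at all); a
fallback decoder would have to decide what to do at each of these points:

```python
import codecs

make_decoder = codecs.getincrementaldecoder("utf-8")

# Case 1: a lead byte ends one buffer, an invalid trail byte starts the next.
dec = make_decoder()
first = dec.decode(b"caf\xe9")  # 0xE9 would begin a 3-byte sequence
try:
    dec.decode(b" au lait")
    rejected = None
except UnicodeDecodeError as exc:
    # The error only surfaces now, and the held-back lead byte is part of
    # the data the error refers to.
    rejected = (exc.object, exc.start)

# Case 2: a buffer that is nothing but a valid partial character.
dec = make_decoder()
partial = dec.decode(b"\xe2\x82")  # nothing can be emitted yet
completed = dec.decode(b"\xac")    # the sequence completes in the next buffer

# Case 3: the stream ends while a partial character is still pending.
dec = make_decoder()
try:
    dec.decode(b"\xe2\x82", final=True)
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```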

I can probably implement the fallback as an error handler in half a
dozen lines, but it's not obvious and I suspect it's not what a lot of
people do. It would probably take a bit more than half a dozen lines to
implement it as an encoding.
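
Something along these lines is what I have in mind for the error
handler. Treat it as a sketch: the handler name "cp1252-fallback" is
made up, and it decodes the offending bytes one at a time so that a
1252-undefined byte falls through to latin-1 (i.e. the C1 control)
without affecting its neighbours.

```python
import codecs


def cp1252_fallback(exc):
    """Decode-error handler: Windows-1252, then latin-1, per offending byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this handler only makes sense for decoding
    chars = []
    for value in exc.object[exc.start:exc.end]:
        raw = bytes([value])
        try:
            chars.append(raw.decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(raw.decode("latin-1"))
    # Resume decoding as UTF-8 right after the bytes we just handled.
    return "".join(chars), exc.end


codecs.register_error("cp1252-fallback", cp1252_fallback)

sample = b"na\xefve \x93hi\x94 \x81"
decoded = sample.decode("utf-8", errors="cp1252-fallback")
```

Since it is just an error handler on the utf-8 codec, it can also be
passed as errors= to open() or to an incremental decoder, though what
happens at buffer boundaries is then whatever the UTF-8 decoder does.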