Processing text data with different encodings

Tue Jun 28 14:03:32 EDT 2016

On Wed, Jun 29, 2016 at 1:52 AM, Random832 <random832 at fastmail.com> wrote:
> On Tue, Jun 28, 2016, at 06:25, Chris Angelico wrote:
>> For the OP's situation, frankly, I doubt there'll be anything other
>> than UTF-8, Latin-1, and CP-1252. The chances that someone casually
>> mixes CP-1252 with (say) CP-1254 would be vanishingly small. So the
>> simple decode of "UTF-8, or failing that, 1252" is probably going to
>> give correct results for most of the content. The trick is figuring
>> out a correct boundary for the check; line-by-line may be sufficient,
>> or it may not.
>
> For completeness, this can be done character-by-character (i.e. try to
> decode a UTF-8 character, if it fails decode the offending byte as 1252)
> with an error handler:
>
> import codecs
>
> def cp1252_errors(exception):
>     input, idx = exception.object, exception.start
>     byte = input[idx:idx+1]
>     try:
>         return byte.decode('windows-1252'), idx+1
>     except UnicodeDecodeError:
>         # python's cp1252 doesn't accept 0x81, etc
>         return byte.decode('latin1'), idx+1

Yeah, and the decision as to where that boundary should be placed is
thus completely up to the application. I don't know of any situation
where you'd need the byte-by-byte version, but it's there if you want
it.
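
For anyone who does want the byte-by-byte version, here is a minimal
sketch of how the quoted handler could be registered and used. It
assumes Python 3. The handler name 'cp1252_fallback' and the sample
bytes are made up for illustration and are not from the original post.

```python
import codecs

def cp1252_errors(exception):
    # Only decode errors carry the bytes we need; re-raise anything else.
    if not isinstance(exception, UnicodeDecodeError):
        raise exception
    data, idx = exception.object, exception.start
    byte = data[idx:idx + 1]
    try:
        return byte.decode('windows-1252'), idx + 1
    except UnicodeDecodeError:
        # Python's cp1252 leaves a few bytes (0x81, etc) undefined,
        # so fall back to Latin-1, which maps every byte.
        return byte.decode('latin-1'), idx + 1

# 'cp1252_fallback' is an arbitrary name chosen for this sketch.
codecs.register_error('cp1252_fallback', cp1252_errors)

# Hypothetical input: UTF-8 text followed by CP-1252 text.
mixed = 'caf\u00e9 '.encode('utf-8') + 'na\u00efve'.encode('cp1252')
text = mixed.decode('utf-8', errors='cp1252_fallback')
print(text)
```

Note that this only catches bytes that are invalid as UTF-8. A run of
CP-1252 bytes that happens to form a valid UTF-8 sequence will still
be decoded as UTF-8.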

The reason I chose line-by-line in my MUD client is because of the
nature of MUDs. The server I primarily use is a naive eight-bit one -
whatever bytes it gets, it retransmits. (All the commands that it
responds to are ASCII, so we can assume that all encodings are
ASCII-compatible or the user will have major difficulties.) Some
clients (including mine) send UTF-8. If I send the command "trivia
This is a piece of text\n", the text will be encoded as UTF-8; the
server receives those bytes and then transmits (to everyone who's tuned
to the [trivia] channel) this text: "MyName [trivia] This is a piece of
text\r\n". Simplistic clients (usually on Windows) will do the same
thing, only they'll use their default encoding - usually 1252 - rather
than UTF-8. But the entire command will be encoded the same way, which
means the server will send an entire line (or several lines, if it
wraps) in the same encoding. It's safe to assume that any given line
will be in one single encoding, but consecutive lines could be in
different encodings.
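
As a rough illustration of the line-by-line approach (this is not the
actual code from my client, just a sketch assuming Python 3), each
line is tried as UTF-8 and falls back to 1252 if that fails. The
function names and sample bytes are invented for the example, and
using errors='replace' for the handful of bytes that Python's cp1252
does not define is one possible choice among several.

```python
def decode_line(raw):
    """Decode one line: UTF-8 if it is valid, otherwise CP-1252."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Bytes undefined in Python's cp1252 (0x81, etc) become U+FFFD.
        return raw.decode('cp1252', errors='replace')

def decode_lines(data):
    """Split on line boundaries and decode each line independently."""
    return [decode_line(line) for line in data.splitlines()]

# Hypothetical server output: one line from a UTF-8 client,
# one from a CP-1252 client.
stream = ('Alice [trivia] caf\u00e9\r\n'.encode('utf-8')
          + 'Bob [trivia] caf\u00e9\r\n'.encode('cp1252'))
lines = decode_lines(stream)
for line in lines:
    print(line)
```

Because each line is checked on its own, a 1252 line never affects how
the UTF-8 lines around it are decoded.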

For emails, it might be possible to use a larger section than a single
line, but line-by-line would most likely be safe there too.

ChrisA