Assignment Versus Equality

Steven D'Aprano steve at pearwood.info
Tue Jun 28 22:27:50 EDT 2016


On Wed, 29 Jun 2016 12:13 am, Random832 wrote:

> On Tue, Jun 28, 2016, at 00:31, Rustom Mody wrote:
>> GG downgrades posts containing unicode if it can, thereby increasing
>> reach to recipients with unicode-broken clients
> 
> That'd be entirely reasonable, except for the excessively broad
> application of "if it can".
> 
> Certainly it _can_ do it all the time. Just replace anything that
> doesn't fit with question marks or hex notation or \N{NAME} or some
> human readable pseudo-representation a la unidecode. It could have done
> any of those with the Hindi that you threw in to try to confound it, (or
> it could have chosen ISCII, which likewise lacks arrow characters, as
> the encoding to downgrade to).

Are you suggesting that email clients and newsreaders should silently mangle
the text of your message behind your back? Because that's what it sounds
like you're saying.

I understand technical limitations. If I'm using a client that can't cope
with anything but (say) ISCII or Latin-1, then I'm all out of luck if I
want to write an email containing Greek or Cyrillic. I get that.

But if the client allows me to type Greek or Cyrillic into the editor, and
then accepts that message for sending, and it mangles it into "question
marks or hex notation or \N{NAME}" (for example), that's a disgrace and
completely unacceptable.

Yes, software *is capable of doing so*, in the same way that software is
capable of deleting all the vowels from your post, or replacing the
word "medieval" with "medireview":

http://northernplanets.blogspot.com.au/2007/01/medireview.html

This is not a good idea.
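
For concreteness, the three kinds of mangling quoted above ("question marks or hex notation or \N{NAME}") roughly correspond to standard error handlers of Python's str.encode. This is only a sketch, assuming Python 3.5 or later (where the 'namereplace' handler exists); the sample string is made up for illustration, and nothing here claims to be what any particular mail client does.

```python
# Sketch: three lossy ways to force non-ASCII text into US-ASCII.
# The sample text is an illustrative assumption, not taken from the thread.
text = "x → y"

# Unencodable characters become question marks.
question_marks = text.encode("ascii", errors="replace")

# Unencodable characters become backslash hex escapes.
hex_notation = text.encode("ascii", errors="backslashreplace")

# Unencodable characters become \N{NAME} escapes (Python 3.5+).
named = text.encode("ascii", errors="namereplace")

print(question_marks)
print(hex_notation)
print(named)

# None of the three results decodes back to what the user typed.
for mangled in (question_marks, hex_notation, named):
    assert mangled.decode("ascii") != text
```

In every case the recipient gets something other than what the sender wrote, which is the whole objection.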


> It should pick an encoding which it expects recipients to support and
> which contains *all* of the characters in the message, 

That would be UTF-8. That's a no-brainer. Why would you use any other
encoding?

If you use UTF-8, it just works. It supports the *entire* Unicode character
set, which is a superset of virtually all code pages and encodings you are
likely to encounter in practice. (No, your software probably isn't running
on a 1980s vintage Atari, and if you're in Japan using TRON you've got your
own software.) And your text widget or editor surely supports Unicode,
because if it didn't, the user couldn't type those Hindi or Greek letters.

So there's an obvious, sensible algorithm:

- take the user's Unicode text, and encode it to UTF-8

In pseudo-code:

    content = text.encode('utf-8')


And there's the actual algorithm used by mail clients and newsreaders:

- take the user's Unicode text, and try encoding it as a variety of
different encodings (US-ASCII, Latin-1, maybe a few others); if they fail,
then fall back to UTF-8

Or in pseudo-code:

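    # legacy encodings are tried first, in order
    list_of_encodings = ['US-ASCII', 'Latin-1']  # ..., maybe a few others
    for encoding in list_of_encodings:
        try:
            content = text.encode(encoding)
            break
        except UnicodeEncodeError:
            # this encoding cannot represent the text; try the next one
            pass
    else:
        # no legacy encoding worked, so fall back to UTF-8
        content = text.encode('utf-8')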
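
To make the comparison easier to try out, here is a runnable sketch of both strategies side by side. The function names, the candidate list and the sample strings are assumptions chosen for illustration; real clients may try different encodings in a different order.

```python
def encode_simple(text):
    """The first algorithm: always use UTF-8."""
    return "utf-8", text.encode("utf-8")


def encode_with_fallback(text, candidates=("us-ascii", "latin-1")):
    """The second algorithm: try legacy encodings, UTF-8 as last resort.

    The candidate list is a hypothetical stand-in for whatever a given
    client actually tries.
    """
    for charset in candidates:
        try:
            return charset, text.encode(charset)
        except UnicodeEncodeError:
            continue
    return "utf-8", text.encode("utf-8")


for sample in ["hello", "café", "α → β"]:
    simple_charset, simple_bytes = encode_simple(sample)
    fallback_charset, fallback_bytes = encode_with_fallback(sample)
    # Both strategies are lossless; they differ only in the charset label.
    assert simple_bytes.decode(simple_charset) == sample
    assert fallback_bytes.decode(fallback_charset) == sample
    print(repr(sample), simple_charset, fallback_charset)
```

Both versions round-trip the text. The second one just makes the declared charset depend on which characters happen to appear in the message.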


Why would you write the second instead of the first? It's just *dumb code*.
Maybe 20-year-old applications could be excused for thinking that this
newfangled Unicode thing should be the last resort instead of the code page
system, but it's 2016 now and code pages are just holding us back.




This is *especially* egregious since UTF-8 text containing only ASCII
characters is (by design) indistinguishable from US-ASCII, so even if there
is some application out there from 1980 that can only cope with ASCII, your
UTF-8 email will be perfectly readable to the degree that it only
uses "plain text".
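
A quick way to check the ASCII-compatibility claim, as a sketch with made-up sample strings: for ASCII-only text the UTF-8 bytes and the US-ASCII bytes are identical, and in mixed text only the non-ASCII characters are affected when an ASCII-only reader decodes the bytes.

```python
# Sample strings are illustrative assumptions, not taken from the thread.
plain = "Plain text, no accents."

utf8_bytes = plain.encode("utf-8")
ascii_bytes = plain.encode("ascii")

# For ASCII-only text the two encodings produce the same bytes.
assert utf8_bytes == ascii_bytes

# So an ASCII-only reader can decode the UTF-8 message unchanged.
assert utf8_bytes.decode("ascii") == plain

# With mixed text, an ASCII-only reader loses just the non-ASCII
# characters; the plain ASCII parts survive.
mixed = "naïve café"
seen_by_ascii_reader = mixed.encode("utf-8").decode("ascii", errors="replace")
print(seen_by_ascii_reader)
```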


> as proper 
> characters and not as pseudo-representations, and downgrade to that if
> and only if such an encoding can be found. For most messages, it can use
> US-ASCII. For most of the remainder it can use some ISO-8859 or
> Windows-125x encoding.

There's never any need to downgrade to a non-Unicode encoding, at least not
by default. Well, maybe in Asia; I don't know how well Asian software
supports Unicode.




-- 
Steven
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



