Python Unicode handling wins again -- mostly

Sat Nov 30 02:11:59 EST 2013

On Fri, 29 Nov 2013 23:00:27 -0700, Ian Kelly wrote:

> On Fri, Nov 29, 2013 at 10:37 PM, Roy Smith <roy at panix.com> wrote:
>> I was speaking specifically of "ligatures like fi" (or, if you prefer,
>> "ligatures like ό".  By which I mean those things printers invented
>> because some letter combinations look funny when typeset as two
>> distinct letters.
> 
> I think the encoding of your email is incorrect, because GREEK SMALL
> LETTER OMICRON WITH TONOS is not a ligature.

Roy's post, which is sent via Usenet not email, doesn't have an encoding 
set. Since he's sending from a Mac, his software may believe that the 
entire universe understands the Mac Roman encoding, which makes a certain 
amount of sense since if I recall correctly the fi and fl ligatures 
originally appeared in early Mac fonts. 

I'm going to give Roy the benefit of the doubt and assume he actually 
entered the fi ligature at his end. If his software was using Mac Roman, 
it would insert a single byte DE into the message:

py> '\N{LATIN SMALL LIGATURE FI}'.encode('macroman')
b'\xde'

But that's not what his post includes. The message actually includes two 
bytes CF8C, in other words:

'\N{LATIN SMALL LIGATURE FI}'.encode('who the hell knows')
=> b'\xCF\x8C'

Since nearly all of his post is in single bytes, it's some variable-width 
encoding, but not UTF-8.
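
For anyone who wants to check that claim, here is a rough sketch. It 
shows what UTF-8 really produces for the ligature, then brute-forces 
Python's own codec table looking for anything that yields a given byte 
string. The helper name codecs_producing is made up for this post, and 
the search only covers codecs that ship with Python, so an empty result 
proves nothing about what Roy's software actually did.

```python
import encodings.aliases

FI = '\N{LATIN SMALL LIGATURE FI}'

# UTF-8 needs three bytes for the ligature, so CF 8C cannot be UTF-8.
utf8_bytes = FI.encode('utf-8')
print(utf8_bytes)  # b'\xef\xac\x81'


def codecs_producing(char, target):
    """Return the names of Python codecs that encode char as target.

    Hypothetical helper for illustration only.
    """
    matches = []
    for name in sorted(set(encodings.aliases.aliases.values())):
        try:
            if char.encode(name) == target:
                matches.append(name)
        except (UnicodeError, LookupError):
            # The codec cannot represent char, is not a text encoding,
            # or is not available on this platform.
            continue
    return matches


print(codecs_producing(FI, b'\xde'))      # list includes 'mac_roman'
print(codecs_producing(FI, b'\xcf\x8c'))  # candidates, if there are any
```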

With no encoding set, our newsreader software starts off assuming that 
the post uses UTF-8 ('cos that's the only sensible default), and those 
two bytes happen to decode to ό GREEK SMALL LETTER OMICRON WITH TONOS.
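
That decoding step is easy to reproduce. A minimal sketch, assuming the 
two bytes are exactly the CF 8C quoted above:

```python
import unicodedata

raw = b'\xcf\x8c'               # the two bytes found in the post
decoded = raw.decode('utf-8')   # what a UTF-8-by-default newsreader does

print(len(decoded), hex(ord(decoded)))  # 1 0x3cc
print(unicodedata.name(decoded))        # GREEK SMALL LETTER OMICRON WITH TONOS
```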

I'm not surprised that Roy has a somewhat jaundiced view of Unicode, when 
the tools he uses are apparently so broken. But it isn't Unicode's fault, 
it's the tools.

The really bizarre thing is that apparently Roy's software, 
MT-NewsWatcher, knows enough Unicode to normalise ﬄ LATIN SMALL LIGATURE 
FFL (sent in UTF-8 and therefore appearing as bytes b'\xef\xac\x84') to 
the ASCII letters "ffl". That's astonishingly weird.

I suppose it is not entirely impossible that the software is actually 
being clever rather than dumb. Having correctly decoded the UTF-8 bytes, 
perhaps it realised that it had no glyph for the ligature, and rather 
than display a MISSING CHAR glyph (usually one of those empty boxes you 
sometimes see), it normalised the ligature to ASCII. But if it's that 
clever, why the hell doesn't it set an encoding line in posts?????
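
For what it's worth, the "clever" behaviour is simple to imitate in 
Python. I have no idea what MT-NewsWatcher really does internally, so 
treat the use of NFKC compatibility normalisation below as my guess at 
one way to get from the ligature to plain "ffl":

```python
import unicodedata

wire_bytes = b'\xef\xac\x84'            # the ligature as sent in UTF-8
ligature = wire_bytes.decode('utf-8')
print(unicodedata.name(ligature))       # LATIN SMALL LIGATURE FFL

# Compatibility normalisation (NFKC) replaces the ligature with the
# three letters it stands for. This is an assumed mechanism.
folded = unicodedata.normalize('NFKC', ligature)
print(folded)                           # ffl
print(folded.encode('ascii'))           # b'ffl'

# Canonical normalisation (NFC) leaves the ligature untouched.
print(unicodedata.normalize('NFC', ligature) == ligature)  # True
```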

-- 
Steven