Trouble saving unicode text to file

Thomas Bellman bellman at lysator.liu.se
Wed May 11 04:02:06 EDT 2005


"Martin v. Löwis" <martin at v.loewis.de> wrote:

>Thomas Bellman wrote:
>> Fixed-width characters *do* have advantages, even in the external
>> representation.  With fixed-width characters you don't have to
>> parse the entire file or stream in order to read the Nth character;
>> instead you can skip or seek to an octet position that can be
>> calculated directly from N.

> OTOH, encodings that are free of null bytes and ASCII compatible
> also have advantages.

Indeed, indeed.  But that's no reason to choose UTF-16 over UTF-32,
since UTF-16 gives you neither of those advantages: it contains
null bytes and is not ASCII compatible either.
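
As a quick illustration of that point (a sketch only, assuming
Python 3 and its standard codecs; the sample text is arbitrary):

```python
# Sketch: which encodings of plain ASCII text contain null bytes,
# and which produce exactly the same octets as ASCII does.
text = "plain ASCII"  # arbitrary sample, chosen for illustration

encoded = {
    "utf-8": text.encode("utf-8"),
    "utf-16-le": text.encode("utf-16-le"),
    "utf-32-le": text.encode("utf-32-le"),
}

# True if the encoded form contains at least one null byte.
has_null = {name: b"\x00" in data for name, data in encoded.items()}

# True if the encoded form is byte-for-byte identical to the ASCII form.
ascii_compatible = {
    name: data == text.encode("ascii") for name, data in encoded.items()
}

for name in encoded:
    print(name, "null bytes:", has_null[name],
          "ASCII compatible:", ascii_compatible[name])
```

Only UTF-8 comes out free of null bytes and identical to ASCII;
UTF-16 and UTF-32 behave the same way in both respects.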

>> And not the least, UTF-32 is *beautiful* compared to UTF-16.

> But ugly compared to UTF-8. Not only does it have the null byte
> and the ASCII incompatibility problem, but it also has the
> endianness problem. So for exchanging Unicode between systems,
> I can see no reason to use anything but UTF-8 (unless, of course,
> one end, or the protocol, already dictates a different encoding).

UTF-8 beats UTF-32 in the practicality department, due to its
compatibility with legacy software, but in my opinion UTF-32 wins
over UTF-8 for sheer beauty, even with the endianness problem.

I do wish they had standardized on one single endianness for UTF-32
(and UTF-16), instead of allowing both to exist.  In the mid 1990s
I had to work with files in the TIFF format, which allows both
endiannesses.  The specification *requires* you to read both, but it
was a rare sight to find MS Windows software that didn't barf on
big endian TIFF files. :-(  Unix software tended to be better at
reading both byte orders, but generally wrote in the native format,
meaning big endian on Sun Sparc.  Luckily I could convert files
using tiffcp on our Unix machines, but it was irritating to have to
introduce that extra step.  I fully expect the same problem to
happen with UTF-16 and UTF-32 too.
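
To make the endianness problem concrete, here is a small sketch
(assuming Python 3 and its standard "utf-32", "utf-32-be" and
"utf-32-le" codecs; the one-character sample is arbitrary):

```python
import codecs

text = "A"  # arbitrary sample character

big = text.encode("utf-32-be")     # most significant octet first
little = text.encode("utf-32-le")  # least significant octet first

# Same character, two different octet sequences.
print(big)
print(little)

# With a byte order mark in front, the generic "utf-32" codec can
# work out which of the two it is looking at.
from_big = (codecs.BOM_UTF32_BE + big).decode("utf-32")
from_little = (codecs.BOM_UTF32_LE + little).decode("utf-32")

# Without a BOM, a reader that assumes the wrong byte order fails:
# the value it reads is far outside the Unicode code point range.
try:
    big.decode("utf-32-le")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True
```

A reader that only handles its native byte order is in the same
position as the TIFF software described above.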

Anyway, back to UTF, my complaint is that UTF-16 doesn't give you
the advantages of *either* UTF-8 or UTF-32, so if you have the
choice, UTF-16 is always the worst alternative of those three.  I
see no reason to recommend UTF-16 at all.
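
The fixed-width argument can be sketched like this (assuming
Python 3; the helper names nth_char_utf32 and code_unit_utf16 are
made up for this example, as is the sample string):

```python
def nth_char_utf32(data: bytes, n: int) -> str:
    """Return character n (0-based) of UTF-32-LE data by seeking to octet 4*n."""
    return data[4 * n : 4 * n + 4].decode("utf-32-le")


def code_unit_utf16(data: bytes, n: int) -> int:
    """Return the 16-bit code unit found at octet 2*n of UTF-16-LE data."""
    return int.from_bytes(data[2 * n : 2 * n + 2], "little")


# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP), 'b'
text = "a\U0001D11Eb"

utf32 = text.encode("utf-32-le")
utf16 = text.encode("utf-16-le")

# UTF-32: the third character really is at octet offset 4 * 2.
third_char = nth_char_utf32(utf32, 2)

# UTF-16: offset 2 * 2 lands in the middle of a surrogate pair,
# so the code unit found there is not the third character at all.
unit = code_unit_utf16(utf16, 2)
is_low_surrogate = 0xDC00 <= unit <= 0xDFFF

print(repr(third_char), hex(unit), is_low_surrogate)
```

As soon as a character outside the BMP appears, the UTF-16 offset
calculation breaks and you are back to parsing from the start, just
as with UTF-8, but without UTF-8's ASCII compatibility.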


-- 
Thomas Bellman,   Lysator Computer Club,   Linköping University,  Sweden
"God is real, but Jesus is an integer."      !  bellman @ lysator.liu.se
                                             !  Make Love -- Nicht Wahr!


