Trouble saving unicode text to file

Thomas Bellman bellman at lysator.liu.se
Tue May 10 20:08:55 EDT 2005


John Machin <sjmachin at lexicon.net> wrote:

> UTF-32 is yet another encoding.
[...]
> Once you have done codecs.open('inputfile', 'rb', 'utf_32') or
> receivedstring.decode('utf_32'), what do you care whether your
> *external representation* has fixed-width characters or not?

> Putting it another way, any advantage of fixed-width characters is to
> be found in *internal* storage, not *external* transmission or
> storage. 

> At the other end, if you don't have to squeeze your data through an
> 8-bit-wide non-binary channel, and you have no need for legibility to
> humans, then the remaining considerations are efficiency and (if you
> have no control over what's used at the other end) whether the
> necessary codec is widely implemented. 

So, are you saying that any encoding that handles all the needed
characters is an equally good choice?  Then why not choose UTF-7?
Or Punycode?

Should you never care what the black box you are using looks like
on the inside?  Wouldn't it have mattered if X.400 had won over SMTP?
Both protocols are somewhat capable of sending emails, after all;
X.400 is just a bit more complicated on the inside, where normal
users don't see.


Fixed-width characters *do* have advantages, even in the external
representation.  With fixed-width characters you don't have to
parse the entire file or stream in order to read the Nth character;
instead you can skip or seek to an octet position that can be
calculated directly from N.
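
As a rough illustration of that point, here is a minimal sketch in
modern Python 3.  It assumes the data is little-endian UTF-32 without
a byte order mark, so that character N starts at octet 4*N; the
helper name read_nth_char is made up for this example.

```python
import io

def read_nth_char(stream, n):
    """Return character n (0-based) from a BOM-less UTF-32-LE binary stream."""
    # Every character occupies exactly 4 octets, so the offset is 4 * n.
    stream.seek(4 * n)
    raw = stream.read(4)
    if len(raw) != 4:
        raise IndexError("character index out of range")
    return raw.decode("utf-32-le")

# An in-memory stream stands in for a large file here.
text = "Link\u00f6ping \U0001F600 university"
data = io.BytesIO(text.encode("utf-32-le"))
fifth = read_nth_char(data, 4)     # the o-umlaut, no scanning needed
eleventh = read_nth_char(data, 10) # a character outside the BMP
```

With UTF-8 or UTF-16 the same lookup would need to scan from the
start (or from some known synchronisation point), since the octet
offset of character N depends on the characters before it.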

In-place editing of single characters in large files becomes more
efficient.
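
A sketch of such an in-place edit, again assuming BOM-less
little-endian UTF-32 and Python 3.  The function name
replace_nth_char and the demo file name are hypothetical.

```python
import os
import tempfile

def replace_nth_char(path, n, new_char):
    """Overwrite character n (0-based) of a BOM-less UTF-32-LE file in place."""
    encoded = new_char.encode("utf-32-le")
    if len(encoded) != 4:
        raise ValueError("new_char must be exactly one code point")
    if 4 * n + 4 > os.path.getsize(path):
        raise IndexError("character index out of range")
    # Open for update without truncating, then overwrite just 4 octets.
    with open(path, "r+b") as f:
        f.seek(4 * n)
        f.write(encoded)

# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.utf32")
    with open(demo_path, "wb") as f:
        f.write("sm\u00f6rg\u00e5sbord".encode("utf-32-le"))
    replace_nth_char(demo_path, 2, "o")
    with open(demo_path, "rb") as f:
        edited = f.read().decode("utf-32-le")
```

Because the replacement is always the same width as the original,
nothing after the edited position has to be moved or rewritten.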

The codec for UTF-32 is extremely simple.  There are no illegal
sequences to care about, like there are in UTF-8 and UTF-16, just
illegal single 32-bit values (those that are larger than 0x10ffff,
and the surrogate code points 0xd800-0xdfff).
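
To show how little there is to it, here is a hand-written decoder
sketch in Python 3.  It is not how Python's own utf_32 codec is
implemented; it assumes little-endian input without a byte order
mark, and the name decode_utf32_le is invented for the example.  It
also rejects the surrogate range, which the Unicode standard excludes
from UTF-32.

```python
import struct

def decode_utf32_le(data):
    """Decode BOM-less UTF-32-LE bytes, validating each 32-bit value alone."""
    if len(data) % 4:
        raise ValueError("truncated input: length is not a multiple of 4")
    chars = []
    for (value,) in struct.iter_unpack("<I", data):
        # Validity depends only on the value itself, never on its neighbours.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError("illegal UTF-32 value: 0x%x" % value)
        chars.append(chr(value))
    return "".join(chars)

decoded = decode_utf32_le("Link\u00f6ping".encode("utf-32-le"))
```

Each unit is checked on its own; there is no state to carry from one
unit to the next, unlike the lead/continuation bytes of UTF-8 or the
surrogate pairs of UTF-16.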

And, not least, UTF-32 is *beautiful* compared to UTF-16.


-- 
Thomas Bellman,   Lysator Computer Club,   Linköping University,  Sweden
"Adde parvum parvo magnus acervus erit"       ! bellman @ lysator.liu.se
          (From The Mythical Man-Month)       ! Make Love -- Nicht Wahr!


