Trouble saving unicode text to file

John Machin sjmachin at lexicon.net
Tue May 10 16:55:19 EDT 2005


On Tue, 10 May 2005 07:59:31 +0000 (UTC), Thomas Bellman
<bellman at lysator.liu.se> wrote:

>John Machin <sjmachin at lexicon.net> writes:
>
>> Which raises a question: who or what is going to read your file? If a
>> Unicode-aware application, and never a human, you might like to
>> consider encoding the text as utf-16.
>
>Why would one want to use an encoding that is neither semi-compatible
>with ASCII (the way UTF-8 is), nor uses fixed-width characters (like
>UTF-32 does)?

UTF-32 is yet another encoding. You still need to decode it into the
internal form supported by your processing software. With UTF-32xE
(x being L or B, for little- or big-endian), you can skip the decoding
step only when the file's x == the software's x and your software uses
32 bits internally.

Python (2.4.1) doesn't have a utf_32 codec. Perhaps that's because
there isn't much call for it (yet). Let's pretend there is such a
codec in Python.
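
The byte-order point above can be sketched as follows. This is only an
illustration, and it assumes a current Python 3 interpreter, which does
ship utf_32_le and utf_32_be codecs (unlike the 2.4.1 discussed here).
The sample text and the variable names are made up for the example.

```python
import sys

# Sample text (an assumption for illustration): 'A' followed by the euro sign.
text = "A\u20ac"

# The same two characters, serialised in each byte order.
le_bytes = text.encode("utf_32_le")
be_bytes = text.encode("utf_32_be")

# Only the codec whose byte order matches the machine produces bytes that
# could, in principle, be used as 32-bit code units without byte swapping.
native_codec = "utf_32_le" if sys.byteorder == "little" else "utf_32_be"
native_bytes = text.encode(native_codec)

print(le_bytes.hex())
print(be_bytes.hex())
print(native_codec)
```

Either way the reader still has to know which x the file uses, so some
decoding decision is always made.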

Once you have done codecs.open('inputfile', 'rb', 'utf_32') or
receivedstring.decode('utf_32'), what do you care whether your
*external representation* has fixed-width characters or not?

Putting it another way, any advantage of fixed-width characters is to
be found in *internal* storage, not *external* transmission or
storage. 
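
A rough sketch of that point, assuming Python 3 (where str is the
internal form and a utf_32 codec exists). The sample text is invented
for the example. The external sizes differ, but once decoded the
program sees the same string whichever encoding was on the wire.

```python
# Sample text (an assumption for illustration); all characters are in the BMP.
text = "na\u00efve \u20ac test"

# Encode the same text three ways for external storage or transmission.
encoded = {name: text.encode(name) for name in ("utf_8", "utf_16", "utf_32")}

# The external representations have different lengths in bytes...
sizes = {name: len(data) for name, data in encoded.items()}

# ...but decoding each one gives back the identical internal string.
decoded = {name: data.decode(name) for name, data in encoded.items()}
all_equal = all(value == text for value in decoded.values())

print(sizes)
print(all_equal)
```

Note that the plain utf_16 and utf_32 codecs prepend a byte order mark
when encoding, so their sizes include one extra code unit.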

At the other end, if you don't have to squeeze your data through an
8-bit-wide non-binary channel, and you have no need for legibility to
humans, then the remaining considerations are efficiency and (if you
have no control over what's used at the other end) whether the
necessary codec is widely implemented. 

So rather than utf-16, perhaps I should have written something like:
"""
Consider utf-8 or utf-16. Consider following this by compression using
a widely-implemented protocol (gzip/zip/bzip2).
"""

Cheers,
John


