WTF? Printing unicode strings

Thu May 18 19:48:21 EDT 2006

Ron Garret wrote:
>
> But what about this:
>
> >>> f2=open('foo','w')
> >>> f2.write(u'\xFF')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in
> position 0: ordinal not in range(128)
> >>>
>
> That should have nothing to do with my terminal, right?

Correct. But first try to answer this: given that you want to write the
Unicode character value 255 to a file, how is that character to be
represented in the file?

For example, one might think that one could just get a byte whose value
is 255 and write that to a file, but what happens if one chooses a
Unicode character whose value is greater than 255? One could use two
bytes or three bytes or as many as one needs, but what if the lowest 8
bits of that value are all set? How would one know, on reading a file
back and getting a byte whose value is 255, whether it represents a
character all by itself or is part of another character's
representation? It gets complicated!

The solution is to choose an encoding which allows you to store the
characters in the file, thus answering the question above indirectly:
an encoding determines how the characters are represented in the file
and allows you to read the file and get back the characters you put
into it. One of the most common encodings suitable for the storage of
Unicode character values is UTF-8, which has been designed with the
above complications in mind. As long as you remember to choose an
encoding, you don't have to think about the details: Python takes care
of the difficult stuff on your behalf. In the above code you haven't
made that choice, so Python falls back on its default 'ascii' codec,
which cannot represent the value 255; hence the error.
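
To make that concrete, here is a minimal sketch of how the choice of
encoding changes the bytes you get for the same character. The variable
names are purely illustrative, and the u'' literals are written so that
the snippet should run on Python 2 as well as Python 3.3 or later.

```python
# U+00FF is the character from the failing example; U+01FF is a larger
# value whose lowest 8 bits are also all set.
small = u'\xff'
large = u'\u01ff'

latin1_small = small.encode('latin-1')    # one byte: 0xFF
utf8_small = small.encode('utf-8')        # two bytes: 0xC3 0xBF
utf16_large = large.encode('utf-16-be')   # two bytes: 0x01 0xFF
utf8_large = large.encode('utf-8')        # two bytes: 0xC7 0xBF

# Decoding with the same encoding gives back the original character.
round_trip = utf8_large.decode('utf-8')

# The 'ascii' codec has no representation for values above 127, which
# is the failure shown in the traceback.
try:
    small.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

Note that the byte 0xFF turns up both as the whole of U+00FF in Latin-1
and as half of U+01FF in big-endian UTF-16, so a reader can only tell
them apart by knowing which encoding was used.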

So, to answer the above question, you can do one of the following:

  * Use the encode method on Unicode objects to turn them into plain
    strings, then write them to a file - at that point, you are
    writing specific byte values.
  * Use the codecs.open function and other codecs module features to
    write Unicode objects directly to files and streams - here, the
    module's infrastructure deals with byte-level issues.
  * If you're using something like an XML library, you can often pass a
    normal file or stream object to some function or method whilst
    stating the output encoding.
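
The first two options might look roughly like this. It is only a sketch:
UTF-8 is picked arbitrarily as the encoding, the temporary directory and
file names are made up for the example, and the with statement needs
Python 2.6 or later. (On Python 3 the built-in open function accepts an
encoding argument and can be used in place of codecs.open.)

```python
import codecs
import os
import tempfile

text = u'\xff'
tmpdir = tempfile.mkdtemp()   # hypothetical scratch location

# Option 1: encode explicitly, then write the resulting bytes to a
# file opened in binary mode.
path_encoded = os.path.join(tmpdir, 'foo_encoded')
with open(path_encoded, 'wb') as f:
    f.write(text.encode('utf-8'))

# Option 2: let codecs.open do the encoding as the text is written.
path_codecs = os.path.join(tmpdir, 'foo_codecs')
with codecs.open(path_codecs, 'w', encoding='utf-8') as f:
    f.write(text)

# Both files now hold the same bytes.
with open(path_encoded, 'rb') as f:
    raw_encoded = f.read()
with open(path_codecs, 'rb') as f:
    raw_codecs = f.read()

# Reading back through codecs.open with the same encoding recovers
# the original character.
with codecs.open(path_codecs, 'r', encoding='utf-8') as f:
    read_back = f.read()
```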

There is no universally correct answer to which encoding should be used
when writing Unicode character values to files. Some people believe
otherwise and, for example, pretend that everything is in UTF-8 in
order to appease legacy applications with the minimum of tweaks
necessary to stop them from breaking completely. Thus, Python doesn't
make a decision for you here.

Paul