How to write a unicode string to a file?

Thu Oct 15 19:59:43 EDT 2009

On Thu, Oct 15, 2009 at 4:43 PM, Stef Mientki <stef.mientki at gmail.com> wrote:

> hello,
>
> By writing the following unicode string (I hope it can be send on this
> mailing list)
>
>   Bücken
>
> to a file
>
>    fh.write ( line )
>
> I get the following error:
>
>  UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in
> position 9: ordinal not in range(128)
>
> How should I write such a string to a file ?
>
>
First, you have to understand that a file never really contains unicode, at
least not in the way it exists in memory in Python when you type
line = u'Bücken'. A file contains a series of bytes that are an encoded form
of that abstract unicode data.

There are various encodings you can use. UTF-8 and UTF-16 are, in my
experience, the most common. UTF-8 is an ASCII superset, and it's the one I
see most often.
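
To make the difference between characters and bytes concrete, here is a small
sketch that encodes the same string both ways. The variable names are just for
illustration, and the \xfc escape stands in for "ü" so the snippet does not
depend on how the source file itself is encoded:

```python
# "B\xfccken" is "Bücken"; \xfc is the code point of "ü".
line = u'B\xfccken'

utf8_bytes = line.encode('utf-8')
utf16_bytes = line.encode('utf-16')

print(len(line))         # 6 characters
print(len(utf8_bytes))   # 7 bytes: "ü" takes two bytes in UTF-8
print(len(utf16_bytes))  # 14 bytes: a 2-byte BOM plus 2 bytes per character
print(repr(utf8_bytes))
```

The unicode string has one length, while each encoded form has its own.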

So, you can do:

  import codecs
  f = codecs.open('filepath', 'w', 'utf-8')
  f.write(line)
  f.close()

To read such a file, you'd use codecs.open as well, just with an 'r' mode
instead of a 'w' mode.

Now, that uses a file object created with the "codecs" module, which operates
on unicode strings rather than on bytes. It will automatically take any
unicode strings passed in, encode them in the specified encoding (UTF-8 here),
and write the resulting bytes out.
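
Putting the write and read halves together, a round trip through codecs.open
might look roughly like this. The temporary path and the sample string are
assumptions made for the sake of a runnable sketch, not part of the original
question:

```python
import codecs
import os
import tempfile

line = u'B\xfccken'  # "Bücken", written with an escape for "ü"

# A temporary path is used so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Writing: unicode goes in, UTF-8 bytes end up on disk.
f = codecs.open(path, 'w', 'utf-8')
f.write(line)
f.close()

# Reading: UTF-8 bytes come off disk, unicode comes out.
f = codecs.open(path, 'r', 'utf-8')
read_back = f.read()
f.close()

print(read_back == line)
```

Neither side of the round trip calls encode or decode explicitly; the file
object does both.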

You can also do that manually with a regular file object, via:

  f.write(line.encode("utf8"))
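
As a slightly fuller sketch of the manual approach, assuming the file is
opened in binary mode ('wb') so that it clearly receives bytes (on Python 3 a
text-mode file would reject them), and using a made-up temporary path:

```python
import os
import tempfile

line = u'B\xfccken'  # "Bücken", written with an escape for "ü"
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Encode by hand, then hand the resulting bytes to a regular file object.
f = open(path, 'wb')
f.write(line.encode('utf8'))
f.close()

# The 6-character string occupies 7 bytes on disk in UTF-8.
size_on_disk = os.path.getsize(path)
print(size_on_disk)
```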

If you are reading such a file later with a normal file object (i.e., not
one created with codecs.open), you would do:

  f = open('filepath', 'rb')
  byte_data = f.read()
  uni_data = byte_data.decode("utf8")

That will convert the byte-encoded data back to a real unicode string. If the
file contains encoded unicode data (something you can only know from the
documentation of whatever produced the file), be sure to do this even when it
doesn't seem necessary. For example, a UTF-8 encoded file might look and work
like a completely normal ASCII file, but if it's really UTF-8, your code will
eventually break the first time someone puts in a non-ASCII character. Since
UTF-8 is an ASCII superset, it's indistinguishable from ASCII until it
contains a non-ASCII character.
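
A quick way to see that pitfall for yourself is sketched below. The two sample
strings are invented for the demonstration; the point is only that the bytes
agree for pure ASCII text and stop agreeing as soon as an "ü" shows up:

```python
ascii_only = u'Bucken'
with_umlaut = u'B\xfccken'  # "Bücken"

# Pure-ASCII text produces identical bytes in both encodings.
same_bytes = ascii_only.encode('utf8') == ascii_only.encode('ascii')
print(same_bytes)

data = with_umlaut.encode('utf8')

# Decoding as UTF-8 recovers the original text.
round_trip_ok = data.decode('utf8') == with_umlaut
print(round_trip_ok)

# Treating the same bytes as ASCII fails on the non-ASCII character.
try:
    data.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)
```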

HTH,

--S