Trouble saving unicode text to file

John Machin sjmachin at lexicon.net
Sat May 7 20:02:05 EDT 2005


On 7 May 2005 14:22:56 -0700, "Svennglenn" <Danielnord15 at yahoo.se>
wrote:

>I'm working on a program that is supposed to save
>different information to text files.
>
>Because the program is in swedish i have to use
>unicode text for ÅÄÖ letters.

"program is in Swedish": to the extent that this means "names of
variables are in Swedish", this is quite irrelevant. The variable
names could be in some other language, like Slovak, Slovenian, Swahili
or Strine. Your problem(s) (PLURAL) arise from the fact that your text
data is in Swedish, the representation of which uses a few non-ASCII
characters. Problem 1 is the representation of Swedish in text
constants in your program; this is causing the exception you show
below but curiously didn't ask for help with.

>
>When I run the following testscript I get an error message.
>
># -*- coding: cp1252 -*-
>
>titel = "åäö"
>titel = unicode(titel)

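You should use titel = u"åäö" instead.
It works because a u"" literal is decoded using the cp1252 coding
declaration at the top of your file, and it saves wear & tear on your
typing fingers.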
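
To see why your original line blows up: unicode(titel) with no
encoding argument falls back to the default 'ascii' codec, which
rejects byte 0xe5. A minimal sketch of that, assuming nothing beyond
the built-in codecs; it uses byte escapes instead of a literal åäö so
it does not depend on the source encoding, and the names raw and
ascii_failed are mine, not from your script:

```python
# The three bytes that cp1252 uses for "åäö"; this is what a plain
# "..." literal holds in a cp1252-encoded Python 2 source file.
raw = b"\xe5\xe4\xf6"

# Decoding with the ASCII codec fails, as in the traceback.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Naming the real encoding works and gives a Unicode string.
titel = raw.decode("cp1252")

print(ascii_failed)
print(titel == u"\xe5\xe4\xf6")
```

So unicode(titel, "cp1252") would also have worked, but the u""
literal is shorter.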

>
>print "Titel type", type(titel)
>
>fil = open("testfil.txt", "w")
>fil.write(titel)
>fil.close()
>
>
>Traceback (most recent call last):
>  File "D:\Documents and
>Settings\Daniel\Desktop\Programmering\aaotest\aaotest2\aaotest2.pyw",
>line 5, in ?
>    titel = unicode(titel)
>UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0:
>ordinal not in range(128)
>
>
>I need to have the titel variable in unicode format because when I
>write
>åäö in a entry box in Tkinkter it makes the value to a unicode
>format
>automaticly.

The general rule in working with Unicode can be expressed something
like "work in Unicode all the time, i.e. decode legacy text as early
as possible; encode into legacy text (if absolutely required) as late
as possible (corollary: if forced to communicate with another
Unicode-aware system over an 8-bit wide channel, encode as utf-8, not
cp666)".
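
As a rough illustration of that rule (a sketch only; the function
name shout and its arguments are made up for this example), the
bytes are decoded on the way in, all the work is done on Unicode,
and encoding happens only on the way out:

```python
def shout(data, encoding):
    """Upper-case legacy-encoded bytes and return utf-8 bytes.

    Hypothetical helper, only here to illustrate the rule.
    """
    text = data.decode(encoding)   # decode as early as possible
    text = text.upper()            # all real work happens on Unicode
    return text.encode("utf-8")    # encode as late as possible


legacy = b"\xe5\xe4\xf6"           # "åäö" as cp1252 bytes
result = shout(legacy, "cp1252")   # "ÅÄÖ" as utf-8 bytes
print(repr(result))
```

The upper-casing step is only a stand-in for whatever your program
really does with the text.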

Applying this to Problem 1 is, as you've seen, trivial: To the extent
that you have text constants at all in your program, they should be in
Unicode.

Now after all that, Problem 2: how to save Unicode text to a file?

Which raises a question: who or what is going to read your file? If
the reader is a Unicode-aware application, and never a human, you
might like to consider encoding the text as utf-16. If it is a
Unicode-aware app plus an occasional human developer, or the text is
not CJK and you want to save space, try utf-8. For general use on
Windows boxes in the Latin1 subset of the universe, you'll no doubt
want to encode as cp1252.
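
To get a feel for the space trade-off, here is a small sketch that
encodes your three letters, plus a three-character CJK sample of my
own choosing, and counts the bytes. Note that Python's utf_16 codec
puts a 2-byte byte-order mark at the front, which is included in the
counts:

```python
swedish = u"\xe5\xe4\xf6"       # åäö
cjk = u"\u65e5\u672c\u8a9e"     # sample CJK text, not from your program

sizes = {}
for name in ("utf_16", "utf_8", "cp1252"):
    sizes[name] = len(swedish.encode(name))
    print("%-7s %d bytes" % (name, sizes[name]))

# For CJK text, utf-8 needs 3 bytes per character against 2 for
# utf-16, and cp1252 cannot represent it at all.
cjk_utf16 = len(cjk.encode("utf_16"))
cjk_utf8 = len(cjk.encode("utf_8"))
print("CJK: utf_16 %d bytes, utf_8 %d bytes" % (cjk_utf16, cjk_utf8))
```

For åäö that comes to 8 bytes as utf-16, 6 as utf-8 and 3 as cp1252;
for the CJK sample, 8 as utf-16 and 9 as utf-8.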

>
>Are there anyone who knows an easy way to save this unicode format text
>to a file?

Read the docs of the codecs module -- skipping over how to register
codecs, just concentrate on using them.

Try this:

# -*- coding: cp1252 -*-
import codecs
titel = u"åäö"
print "Titel type", type(titel)
f1 = codecs.open('titel.u16', 'wb', 'utf_16')
f2 = codecs.open('titel.u8', 'w', 'utf_8')
f3 = codecs.open('titel.txt', 'w', 'cp1252')
# much later, maybe in a different function
# maybe even in a different module
f1.write(titel)
f2.write(titel)
f3.write(titel)
# much later
f1.close()
f2.close()
f3.close()

Note: doing it this way follows the "encode as late as possible" rule
and documents the encoding for the whole file, in one place. Other
approaches which might use the .encode() method of Unicode strings and
then write the 8-bit-string results at different times and in
different functions/modules are somewhat less clean and more prone to
mistakes.
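
If you want to convince yourself that the files contain what you
think they do, read them back with the same encodings. A minimal
sketch, assuming the three file names used above; the temporary
directory and the results dictionary are just scaffolding for the
example:

```python
import codecs
import os
import shutil
import tempfile

titel = u"\xe5\xe4\xf6"         # åäö, written with escapes
workdir = tempfile.mkdtemp()    # scratch directory for the example
results = {}

for filename, encoding in (("titel.u16", "utf_16"),
                           ("titel.u8", "utf_8"),
                           ("titel.txt", "cp1252")):
    path = os.path.join(workdir, filename)

    # Write: the codec chosen at open time does the encoding.
    f = codecs.open(path, "w", encoding)
    f.write(titel)
    f.close()

    # Read back: the same codec decodes the bytes into Unicode again.
    f = codecs.open(path, "r", encoding)
    results[filename] = f.read()
    f.close()

shutil.rmtree(workdir)          # clean up the scratch directory

for filename in sorted(results):
    print(filename, results[filename] == titel)
```

Each file should give back the original three characters, whatever
the bytes on disk look like.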

HTH,
John


