Encoding and Norwegian (non-ASCII) characters.
Peter Otten
__peter__ at web.de
Sat Oct 7 17:59:01 EDT 2006
joakim.hove at gmail.com wrote:
> Hello,
>
> I am having great problems writing Norwegian characters æøå to file
> from a python application. My (simplified) scenario is as follows:
>
> 1. I have a web form where the user can enter his name.
>
> 2. I use the cgi module to get the input from the user:
> ....
> name = form["name"].value
>
> 3. The name is stored in a file
>
> fileH = open(namefile , "a")
> fileH.write("name:%s \n" % name)
> fileH.close()
>
> Now, this works very well indeed as long as the users have 'ascii' names,
> however when someone enters a name with one of the Norwegian characters
> æøå - it breaks at the write() statement.
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position
> ....
>
> Now - I understand that the ascii codec can't be used to decode the
> particular characters, however my attempts at specifying an alternative
> encoding have all failed.
>
> I have tried variants along the line:
>
> fileH = codecs.open(namefile , "a" , "latin-1") / fileH =
> open(namefile , "a")
> fileH.write(name) / fileH.write(name.encode("latin-1"))
>
> It seems *whatever* I do the Python interpreter fails to see my plea
> for an alternative encoding, and fails with the dreaded
> UnicodeDecodeError.
>
> Any tips on this would be *highly* appreciated.
The approach with codecs.open() should succeed
>>> import codecs
>>> out = codecs.open("tmp.txt", "a", "latin1")
>>> out.write(u"æøå")
>>> out.write("abc")
>>> out.write("æøå")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.4/codecs.py", line 501, in write
    return self.writer.write(data)
  File "/usr/local/lib/python2.4/codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)
provided that you write only unicode strings with characters in the range
unichr(0)...unichr(255) and normal strs in the range chr(0)...chr(127).
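To illustrate why the third write() fails: when the codec is handed a plain str, Python 2 first converts it to unicode implicitly, using the ASCII codec. The sketch below performs that decode explicitly. The byte values assume a UTF-8 terminal (which would explain the 0xc3 in the traceback), the variable names are made up for the example, and the "except ... as" syntax needs a Python newer than the 2.4 shown here.

```python
# UTF-8 bytes for "æøå", as a UTF-8 terminal would hand them to Python.
# This is an assumption; it is consistent with the 0xc3 byte reported above.
data = b"\xc3\xa6\xc3\xb8\xc3\xa5"

# Plain ASCII bytes decode without trouble, which is why "abc" worked.
ascii_text = b"abc".decode("ascii")

# This is the step Python 2 attempts implicitly for a non-unicode argument.
try:
    data.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # first offending byte is at position 0
```

The failure is therefore a decode error even though the file is being written: the bytes never get as far as the latin1 encoder.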
You have to decode non-ASCII strs with the appropriate encoding (which only
you know) before feeding them to write():
>>> out.write(unicode("\xe6\xf8\xe5", "latin1"))
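Applied to the CGI scenario from the question, that might look roughly like the sketch below. It is only an illustration under assumptions: raw_name stands in for form["name"].value, the form data is assumed to arrive as Latin-1 bytes (the real encoding depends on how the page is served), and a temporary file replaces namefile so the example is self-contained.

```python
import codecs
import os
import tempfile

# Hypothetical stand-in for form["name"].value: "æøå" as Latin-1 bytes.
raw_name = b"\xe6\xf8\xe5"

# Decode once, at the boundary, with the encoding the bytes really use.
name = raw_name.decode("latin-1")

# Temporary file instead of the real namefile (assumption for the sketch).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# From here on only unicode is passed to write(); the codec encodes it.
out = codecs.open(path, "a", "latin-1")
try:
    out.write(u"name:%s \n" % name)
finally:
    out.close()

# Read the raw bytes back to see what actually landed in the file.
f = open(path, "rb")
try:
    written = f.read()
finally:
    f.close()
os.remove(path)
```

The point of the design is that the byte-to-unicode conversion happens in exactly one place, so nothing downstream has to guess an encoding.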
If there are unicode code points beyond unichr(255) you have to change the
encoding in codecs.open(), typically to UTF-8.
# raises UnicodeEncodeError
codecs.open("tmp.txt", "a", "latin1").write(u"\u1234")
# works
codecs.open("tmp.txt", "a", "utf8").write(u"\u1234")
Peter