Encoding and Norwegian (non-ASCII) characters.

Sat Oct 7 17:59:01 EDT 2006

joakim.hove at gmail.com wrote:

> Hello,
> 
> I am having great problems writing Norwegian characters æøå to a file
> from a python application. My (simplified) scenario is as follows:
> 
> 1. I have a web form where the user can enter his name.
> 
> 2. I use the cgi module to get the input from the user:
>     ....
>     name = form["name"].value
> 
> 3. The name is stored in a file
> 
>     fileH = open(namefile , "a")
>     fileH.write("name:%s \n" % name)
>     fileH.close()
> 
> Now, this works very well indeed as long as the users have 'ascii' names,
> however, when someone enters a name with one of the Norwegian characters
> æøå, it breaks at the write() statement.
> 
>    UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position
> ....
> 
> Now - I understand that the ascii codec can't be used to decode the
> particular characters, however my attempts at specifying an alternative
> encoding have all failed.
> 
> I have tried variants along these lines:
> 
>    fileH = codecs.open(namefile , "a" , "latin-1")   /   fileH = open(namefile , "a")
>    fileH.write(name)   /    fileH.write(name.encode("latin-1"))
> 
> It seems *whatever* I do the Python interpreter fails to see my plea
> for an alternative encoding, and fails with the dreaded
> UnicodeDecodeError.
> 
> Any tips on this would be *highly* appreciated.

The approach with codecs.open() should succeed:

>>> import codecs
>>> out = codecs.open("tmp.txt", "a", "latin1")
>>> out.write(u"æøå")
>>> out.write("abc")
>>> out.write("æøå")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.4/codecs.py", line 501, in write
    return self.writer.write(data)
  File "/usr/local/lib/python2.4/codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)

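It works provided that you write only unicode strings with characters in
the range unichr(0)...unichr(255) and normal strs in the range
chr(0)...chr(127). A str with non-ascii bytes, like the one in the last
write() above, is first decoded implicitly with the ascii codec, and that
is the step that fails.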
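
To look at that decoding step in isolation, here is a small sketch that
does the ascii decode by hand. The byte string is an assumption: it is the
UTF-8 encoding of æøå, which would match the 0xc3 in the traceback above.
It uses b"" literals, so it needs Python 2.6 or later (it also runs on
Python 3).

```python
# Assumed input: the UTF-8 bytes a terminal would send for "æøå".
raw = b"\xc3\xa6\xc3\xb8\xc3\xa5"

# Decoding with the ascii codec fails on the first byte above 0x7f.
try:
    raw.decode("ascii")
    failed = False
    message = ""
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

# Decoding with the encoding the bytes are really in gives proper text.
text = raw.decode("utf-8")
```

After this runs, message holds the same complaint about byte 0xc3 as the
traceback, and text is the three-character unicode string.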

You have to decode non-ascii strs with the appropriate encoding (that only
you know) before feeding them to write():

>>> out.write(unicode("\xe6\xf8\xe5", "latin1"))
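
Applied to the scenario in the question, that could look roughly like the
sketch below. The helper name append_name, its input_encoding parameter and
the temporary file are made up for illustration, and latin-1 as the
encoding of the form data is only an assumption; substitute whatever your
form really delivers. It uses b"" literals, so it needs Python 2.6 or later
(it also runs on Python 3).

```python
import codecs
import os
import tempfile

def append_name(path, raw_name, input_encoding="latin-1"):
    """Decode the raw bytes from the form, then append them as latin-1."""
    # bytes -> unicode, using the encoding the bytes actually arrived in
    name = raw_name.decode(input_encoding)
    out = codecs.open(path, "a", "latin-1")
    try:
        # only unicode goes into write(), so no implicit ascii decode happens
        out.write(u"name:%s \n" % name)
    finally:
        out.close()

# Stand-in for form["name"].value: the latin-1 bytes for "æøå".
namefile = os.path.join(tempfile.mkdtemp(), "names.txt")
append_name(namefile, b"\xe6\xf8\xe5")

# Read the file back as raw bytes to see what was stored.
fh = open(namefile, "rb")
try:
    data = fh.read()
finally:
    fh.close()
```

The point of the design is that the str is converted to unicode exactly
once, at the place where you know its encoding, and everything after that
deals with unicode only.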

If there are unicode code points beyond unichr(255) you have to change the
encoding in codecs.open(), typically to UTF-8.

# raises UnicodeEncodeError
codecs.open("tmp.txt", "a", "latin1").write(u"\u1234") 

# works
codecs.open("tmp.txt", "a", "utf8").write(u"\u1234")
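
The same difference shows up without any file involved if you call
encode() directly. This is only a sketch of what the two codecs do with
u"\u1234"; the variable names are made up.

```python
text = u"\u1234"

# latin-1 only covers code points 0..255, so this raises.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 can represent every code point; this one takes three bytes.
utf8_bytes = text.encode("utf-8")
```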

Peter