codecs, Swedish characters, and XML...don't mix? (repost)

Michael Hammill mike at pdc.kth.se
Thu May 10 16:13:54 EDT 2001


Hi,  (this is a re-post of my previous message that exmh seemed to munch)

A Web search shows that there used to be some problems with Python, 
Swedish characters (two with umlauts, ö and ä, and one with a ring, å; all 
are part of ISO8859-1), and XML.  I'm still having the problem with Python 
2.1.  Does anyone know what's wrong?

Brief description of the problem:
(1) read in an XML file containing Swedish characters using Python's LATIN1 
codec.
(2) write the same file out using Python's UTF-8 codec
(3) read the file with xml.dom.minidom, but get an error when writing the 
result of dom.toxml():

"UnicodeError: ASCII encoding error: ordinal not in range(128)"
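
For concreteness, steps (1) and (2) amount to the transcoding below. This is 
a minimal sketch, assuming a current Python 3 interpreter rather than the 2.1 
used in this post; the helper name transcode and the demo file names are 
made up for illustration and are not part of the original script.

```python
import os
import tempfile

def transcode(src_path, dst_path, src_enc='iso-8859-1', dst_enc='utf-8'):
    """Read src_path as src_enc and write the same text to dst_path as dst_enc."""
    # decode the source bytes into a Unicode string
    with open(src_path, 'r', encoding=src_enc) as src:
        text = src.read()
    # encode the same string with the target codec
    with open(dst_path, 'w', encoding=dst_enc) as dst:
        dst.write(text)
    return text

# Hypothetical demo data: a short string containing o-umlaut, stored as Latin-1
workdir = tempfile.mkdtemp()
latin1_path = os.path.join(workdir, 'demo.latin1.txt')
utf8_path = os.path.join(workdir, 'demo.utf8.txt')
with open(latin1_path, 'wb') as f:
    f.write(b'Demo slidesh\xf6w')

text = transcode(latin1_path, utf8_path)
with open(utf8_path, 'rb') as f:
    utf8_bytes = f.read()
```

The single Latin-1 byte 0xF6 comes out as the two UTF-8 bytes 0xC3 0xB6, so 
the output file is one byte longer than the input.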

More detailed:
(1) Python version:
Python 2.1 (#1, Apr 17 2001, 20:20:54)
[GCC 2.96 20000731 (Red Hat Linux 7.0)] on linux2

(2) Code:
#!/usr/bin/python2.1
import codecs
import xml.dom.minidom
import string

infile = 'doc.swedish.xml'
out = 'doc.utf8.xml'
out2 = 'del.me'
outout = 'doc.utf8.after.xml'

def main():
     # codecs.lookup returns (encoder, decoder, stream reader, stream writer)
     (LATIN1_encode, LATIN1_decode,
      LATIN1_streamreader, LATIN1_streamwriter) = codecs.lookup('ISO8859-1')
     (UTF8_encode, UTF8_decode,
      UTF8_streamreader, UTF8_streamwriter) = codecs.lookup('UTF-8')

     # read the Latin-1 file into a Unicode string
     input = LATIN1_streamreader(open(infile, 'r'))
     s = input.read()
     input.close()

     output = UTF8_streamwriter( open(out, 'w') )
     output.write(s)
     output.close()

     f = open(out, 'r')
     g = open(out2, 'w')
     f_list = f.readlines()
     f.close()
     del f_list[0]
     f_list.insert(0,'<?xml version="1.0" encoding="UTF-8"?>')
     g.writelines(f_list)
     g.close()
     ff = open(out2, 'r')
     dom = xml.dom.minidom.parse(ff)
     gg = open(outout, 'w')
     gg.write(dom.toxml())

if __name__ == '__main__':
     main()

(3) File "doc.swedish.xml":
<?xml version="1.0" encoding="iso-8859-1"?>
<slideshow>
<title>Demo slideshöw
</title>
</slideshow>

(4) Traceback:
Traceback (most recent call last):
   File "./q_mini4.py", line 49, in ?
     main()
   File "./q_mini4.py", line 46, in main
     gg.write(dom.toxml())
UnicodeError: ASCII encoding error: ordinal not in range(128)
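
A plausible reading of this traceback, offered as an assumption rather than 
something confirmed on 2.1: dom.toxml() returns a Unicode string, and writing 
that to a plain file object makes Python encode it with the default ASCII 
codec, which cannot represent ö. The sketch below reproduces the same failure 
mode explicitly and shows that encoding to UTF-8 by hand avoids it. It assumes 
a current Python 3 interpreter, where the conversion has to be spelled out 
with .encode(); the variable names are illustrative only.

```python
import xml.dom.minidom

# The sample document from this post, as UTF-8 bytes
doc_bytes = ('<?xml version="1.0" encoding="UTF-8"?>\n'
             '<slideshow>\n<title>Demo slidesh\u00f6w\n</title>\n</slideshow>\n'
             ).encode('utf-8')

dom = xml.dom.minidom.parseString(doc_bytes)
xml_text = dom.toxml()   # a Unicode string that contains U+00F6

# Encoding to ASCII fails, mirroring the error in the traceback
try:
    xml_text.encode('ascii')
    ascii_ok = True
except UnicodeError:
    ascii_ok = False

# Encoding explicitly to UTF-8 succeeds and yields bytes that can be written
xml_bytes = xml_text.encode('utf-8')
```

If this reading is right, the equivalent change in the script would be to 
write the serialized document through the UTF-8 stream writer (or encode it 
first) instead of passing it straight to gg.write().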

(5) Observations:
(a) The characters look fine in the input file.  A simple round trip, 
reading and writing with the LATIN1 codec, produces a file identical to 
the original.
(b) In the UTF-8 output file, the umlauted o appears as a pair of 
different-looking characters.
(c) If the Swedish character is replaced by a regular "o" in the input 
file, everything works fine.
(d) Yikes!  This shouldn't be that hard!
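
Observation (b) is what UTF-8 is expected to do: ö is a single byte in 
ISO8859-1 but two bytes in UTF-8, and a viewer that assumes Latin-1 shows 
those two bytes as two separate characters. A small sketch, assuming a 
current Python 3 interpreter (the variable names are illustrative):

```python
o_umlaut = '\u00f6'                            # LATIN SMALL LETTER O WITH DIAERESIS

latin1_bytes = o_umlaut.encode('iso-8859-1')   # one byte: 0xF6
utf8_bytes = o_umlaut.encode('utf-8')          # two bytes: 0xC3 0xB6

# What a Latin-1 viewer displays when handed the UTF-8 bytes
misread = utf8_bytes.decode('iso-8859-1')      # U+00C3 followed by U+00B6
```

Observation (c) fits the same picture, since a plain "o" is ASCII and is 
encoded identically by the ASCII, Latin-1 and UTF-8 codecs.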





