XML and UnicodeError

Pinke Panke dev at null.oo
Mon Oct 4 11:08:20 EDT 2004


Dear people,

I wrote a Python script to create HTML files. The structure was stored in 
a nested array. To make it easier to maintain, I now describe the 
structure in XML, using the minidom parser and a small function to 
convert the XML structure into the array structure. So far so good. 

Then the mess started. The XML document is declared as utf-8 and stored as 
utf-8. Using iso-8859-1 instead makes no difference in this case.

When a non-ASCII character (code 128 or above), e.g. an umlaut, occurs, 
my string operations raise errors. An example:

headline = structure[0]
pagetext = structure[1]
foo = headline + "bar" + pagetext
>>> UnicodeError

My script contains many such operations. The simple example is easily 
solved by appending .encode('iso-8859-1') to the structure statements. 
Not very nice, but OK. I hope there is a simpler solution.
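
The workaround just described can be sketched roughly as follows. This is 
only an illustration under assumptions: the XML snippet, the tag names and 
the way structure is filled are made up, and the code is written for 
Python 3, where text is str, encoded data is bytes, and mixing the two 
raises TypeError rather than UnicodeError.

```python
from xml.dom import minidom

# Hypothetical stand-in for the real XML file: a UTF-8 document with umlauts.
xml_text = '<page><headline>Gr\u00fc\u00dfe</headline><text>K\u00e4se</text></page>'
doc = minidom.parseString(xml_text.encode('utf-8'))

# minidom hands back text (unicode) strings, whatever encoding the file used.
structure = [
    doc.getElementsByTagName('headline')[0].firstChild.data,
    doc.getElementsByTagName('text')[0].firstChild.data,
]

# The workaround: encode each piece so that only byte strings are joined.
# b'bar' is an assumption; in the original script it is a plain "bar".
headline = structure[0].encode('iso-8859-1')
pagetext = structure[1].encode('iso-8859-1')
foo = headline + b'bar' + pagetext
```

The point of the sketch is only that every operand of the concatenation 
has the same type once the encode calls are in place.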

But there are also string replacements via regexes. An example to 
illustrate:
import re
pat = re.compile('<putithere>')
foo = 'def'
bar = 'abc<putithere>ghi'
htmlcode = pat.sub(foo,bar)

Appending .encode(...) to foo and bar does not fix the UnicodeError.
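
For comparison, here is a sketch of the same substitution with pattern, 
replacement and subject all kept as text, and a single encode at the very 
end. It is not a diagnosis of the error above, only one arrangement that 
avoids mixing string types. It assumes Python 3, and the umlaut in foo and 
the name html_bytes are invented for illustration.

```python
import re

pat = re.compile('<putithere>')   # text pattern
foo = 'd\u00fcf'                  # replacement containing an umlaut (assumed)
bar = 'abc<putithere>ghi'         # subject string, also text

# All three operands are text, so the substitution never mixes types.
htmlcode = pat.sub(foo, bar)

# Encode once, at the point where the HTML would be written out.
html_bytes = htmlcode.encode('iso-8859-1')
```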

Is there a solution, something I forgot, or something I could do better? 
Is there any logic behind it? ;-)

TIA.
Martin


