XML and UnicodeError

Paul Boddie paul at boddie.org.uk
Tue Oct 5 04:13:27 EDT 2004


Pinke Panke <dev at null.oo> wrote in message news:<Xns9578AE59045A7devnulloo at 130.133.1.19>...
> Dear people,
> 
> I wrote a python script to create html files. The structure was stored in 
> a nested array. For easier maintaining at any time I describe the 
> structure in XML, using the minidom parser and a small function to 
> convert the XML structure into the array structure. So far so good.

Note that any access to textual data in your DOM (XML) document will
yield Unicode values, not strings - this is relevant below.
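
A small sketch of this, written for Python 3, where the Unicode type
is `str` and the old plain string corresponds roughly to `bytes` (in
the Python 2 of this thread they were `unicode` and `str`). The sample
document and variable names are made up for illustration:

```python
from xml.dom import minidom

# Hypothetical sample document; the encoded form is UTF-8.
xml_bytes = '<page title="Grüße"><headline>Über uns</headline></page>'.encode("utf-8")

doc = minidom.parseString(xml_bytes)
headline_node = doc.getElementsByTagName("headline")[0]

headline = headline_node.firstChild.data            # text node content
title = doc.documentElement.getAttribute("title")   # attribute value

# Both values are Unicode text, not encoded bytes.
print(type(headline).__name__, type(title).__name__)
```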

> Then the mess started. The XML document is described as utf-8, stored as 
> utf-8. iso-8859-1 makes no difference in this case.

After you've parsed the XML document, none of the encodings are
relevant - until you serialise the document, everything should be
Unicode (although I'm sure I've seen some XML libraries use plain
strings to represent values which consist only of ASCII characters).
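
To illustrate, here is a sketch (Python 3, with an invented one-element
document and a hypothetical helper `parsed_text`) that parses the same
content from two differently encoded byte sequences and compares the
results:

```python
from xml.dom import minidom

text = "Grüße"

# The same (made-up) document serialised in two different encodings.
as_utf8 = ('<?xml version="1.0" encoding="utf-8"?><p>%s</p>' % text).encode("utf-8")
as_latin1 = ('<?xml version="1.0" encoding="iso-8859-1"?><p>%s</p>' % text).encode("iso-8859-1")


def parsed_text(raw):
    """Return the text content of the root element of a tiny document."""
    return minidom.parseString(raw).documentElement.firstChild.data


# The byte sequences differ, but the parsed values are identical.
same_bytes = as_utf8 == as_latin1
same_text = parsed_text(as_utf8) == parsed_text(as_latin1) == text
```

Once parsing is done, the encoding declared in the file plays no
further part in how the values behave.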

> When a character > 128, e.g. an umlaut, occurs my string raises errors. 
> An example:
> 
> headline = structure[0]
> pagetext = structure[1]
> foo = headline + "bar" + pagetext
> >>> UnicodeError

Are you not adding strings to Unicode values here? I can imagine that
at some point you've decided to change headline or pagetext to
something other than that extracted from the DOM document. However, if
you've used plain Python strings with non-ASCII characters, Python has
no way of knowing how to combine such strings with Unicode values,
since the encoding used in your strings is never made explicit.
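
A sketch of that failure and its fix, in Python 3 terms. The values
are invented, and the assumption is that `pagetext` arrived as
ISO-8859-1 encoded bytes from somewhere other than the DOM. Python 3
refuses the mixture with a TypeError, whereas Python 2 attempted an
implicit ASCII decode and raised UnicodeDecodeError:

```python
headline = "Überschrift"                   # Unicode text, as minidom would return it
pagetext = "Grüße".encode("iso-8859-1")    # encoded bytes from some other source

error = None
try:
    combined = headline + "bar" + pagetext
except TypeError as exc:
    # Python has no way to know which encoding the bytes use.
    error = exc

# Stating the encoding explicitly removes the ambiguity.
combined = headline + "bar" + pagetext.decode("iso-8859-1")
```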

> In my script there are many of such operations. The simple example is 
> solved easily with appending .encode('iso-8859-1') at the structure 
> statements. So far not so nice but ok. I hope there would be a simpler 
> solution.

The solution is to use Unicode throughout.

> But there are also string replacements via regexes. An example to make a 
> picture of it:
> pat = re.compile('<putithere>')
> foo = 'def'
> bar = 'abc<putithere>ghi'
> htmlcode = pat.sub(foo,bar)
> 
> Appending .encode(...) to foo and bar does not fix the UnicodeError.
> 
> Is there any solution, something I forgot or I could make better? Is 
> there any logic behind it? ;-)

Yes, but it's complicated, so my advice is to...

  1. Let minidom provide you with Unicode values.
  2. Convert any other text to Unicode as soon as possible.
  3. Manipulate only Unicode values - don't mix them up with
     plain strings.
  4. Serialise to your chosen encoding only when preparing
     output.
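
The four steps above could be sketched as follows (Python 3). The
document, the template text and the choice of ISO-8859-1 as the
template's encoding are assumptions for illustration; only the
`<putithere>` placeholder comes from the question:

```python
import re
from xml.dom import minidom

# 1. minidom hands back Unicode text.
doc = minidom.parseString("<page><headline>Käse</headline></page>".encode("utf-8"))
headline = doc.getElementsByTagName("headline")[0].firstChild.data

# 2. Decode any other input as soon as it arrives (encoding assumed known).
template_bytes = "<h1><putithere></h1><p>Grüße</p>".encode("iso-8859-1")
template = template_bytes.decode("iso-8859-1")

# 3. Work only with Unicode values, including in regex substitutions.
pat = re.compile("<putithere>")
htmlcode = pat.sub(headline, template)

# 4. Encode once, when producing output.
output_bytes = htmlcode.encode("utf-8")
```

Because the pattern, the replacement and the subject are all Unicode,
the substitution never has to guess an encoding. Note that `sub()`
interprets backslashes in the replacement string, so replacement text
containing them would need a function as the first argument instead.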

Paul


