[xml] convert funny chars to char entites?

Paul Boddie paul at boddie.net
Mon Sep 16 04:48:13 EDT 2002


":B nerdy" <thoa0025 at mail.usyd.edu.au> wrote in message news:<vNah9.10101$Ee4.23987 at news-server.bigpond.net.au>...
> ive got a string "Il Est Né" and i want to put it into a xml document
> 
> is there a quick and easy way to conver the funny characters to char
> entities so i can store them?
> or is there existing libraries?

Characters like "é" aren't generally converted to entities in XML,
unlike in various versions of HTML which had things like "&eacute;"
and which seemed overly ASCII-friendly. What I would recommend is that
you obtain a Unicode string from the string that you've been given, if
it isn't already Unicode (but I would guess by your enquiry that it
isn't), and then use the various XML APIs to write it into the
document.
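As a rough sketch of what "use the XML APIs" can look like in current Python (Python 3, where ordinary strings are already Unicode), here is one way to do it with the standard xml.etree.ElementTree module. The element name "title" is made up for the example. Note that if you serialise to ASCII, the library writes a numeric character reference for "é" by itself, which is probably the closest thing XML has to the entities asked about.

```python
import xml.etree.ElementTree as ET

# Build a one-element document; the element name "title" is hypothetical.
root = ET.Element("title")
root.text = "Il Est Né"  # an ordinary (Unicode) string in Python 3

# Serialised as UTF-8, the character is written as itself.
utf8_bytes = ET.tostring(root, encoding="utf-8")

# Serialised as ASCII, anything outside ASCII becomes a numeric
# character reference, which an XML parser reads back as the character.
ascii_bytes = ET.tostring(root, encoding="us-ascii")

print(utf8_bytes)
print(ascii_bytes)

# Both forms parse back to the same text.
print(ET.fromstring(ascii_bytes).text == ET.fromstring(utf8_bytes).text)
```

Either way, the application only ever handles the Unicode string; the choice of bytes is left to the serialiser.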

Of course, if the string isn't Unicode, you have to know what encoding
was used in the creation of the string. For example:

  >>> s = "Il est né!"
  >>> s
  'Il est n\x82!'
  >>> u = unicode(s)

At this point, some people naively assume that a Unicode string will
be created, but Python's default encoding is ASCII, which doesn't
define "\x82" at all, and since character sets disagree about what
that byte means, Python won't guess that you want "é" in the
resulting Unicode string.

  Traceback (most recent call last):
    File "<stdin>", line 1, in ?
  UnicodeError: ASCII decoding error: ordinal not in range(128)
  >>> u = unicode(s, "cp850")

Here, we've supplied the character encoding that produced the bytes:
"\x82" is "é" in DOS code pages such as cp850, whereas iso-8859-1
would put "é" at "\xe9". Now, Python knows what to do with that "\x82"
character.

  >>> print u
  Traceback (most recent call last):
    File "<stdin>", line 1, in ?
  UnicodeError: ASCII encoding error: ordinal not in range(128)
  >>> print u.encode("cp850")
  Il est né!

To print things out, you need to convert the Unicode string to the
appropriate native encoding, which here is the console's own code page.
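For anyone following along in Python 3, the same two steps are spelled bytes.decode() and str.encode(). This is only a sketch, and it assumes the "\x82" byte came from a DOS code page such as cp850 (where that byte is "é"); if your bytes came from somewhere else, the codec name will differ.

```python
raw = b"Il est n\x82!"  # the bytes shown in the session above

# ASCII has no meaning for 0x82, so decoding as ASCII fails.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print("ascii:", exc.reason)

# Assumption: the bytes came from a DOS code page, where 0x82 is "é".
text = raw.decode("cp850")
print(ascii(text))

# The wrong codec does not fail; it quietly yields U+0082, a control
# character, instead of "é".
wrong = raw.decode("iso-8859-1")
print(ascii(wrong))

# Encoding goes the other way; the bytes depend on the target encoding.
print(text.encode("iso-8859-1"))
print(text.encode("utf-8"))
```

The middle case is the one to watch for: naming the wrong encoding often produces no error at all, just the wrong character.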

Of course, there are various shortcuts that can be taken, especially
if you're not using the XML APIs, but are instead writing strings
directly to the XML file. In such cases, you can use an encoding
declaration at the start of the document...

  <?xml version="1.0" encoding="iso-8859-1"?>

...and then not bother converting the strings to Unicode, provided
the bytes you write really are in the declared encoding. Even so, an
awareness of Unicode is probably invaluable if you're going to be
doing a fair amount of XML work.
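To illustrate the shortcut of writing the document by hand, here is a small Python 3 sketch. The helper build_document and the element name "title" are invented for the example; the point is only that the declaration and the bytes must agree, and that markup characters still need escaping when you bypass the XML APIs.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_document(text, encoding="iso-8859-1"):
    """Return a tiny XML document as bytes in the given encoding.

    Hypothetical helper: the declaration names the same encoding
    that is used to turn the string into bytes.
    """
    declaration = '<?xml version="1.0" encoding="%s"?>\n' % encoding
    # escape() handles &, < and >; the encode() call handles the rest.
    body = "<title>%s</title>" % escape(text)
    return (declaration + body).encode(encoding)


document = build_document("Il est né!")
print(document)

# A parser trusts the declaration, so the text survives the round trip.
parsed = ET.fromstring(document)
print(ascii(parsed.text))
```

If the bytes were actually in some other encoding than the one declared, a parser would still accept the file in many cases and simply hand back the wrong characters.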

Paul


