minidom and encoding problem

Fredrik Lundh fredrik at pythonware.com
Thu Jun 6 18:57:31 EDT 2002


Ehab Teima wrote:

> > This is a bug in your code. You must not insert (byte) strings in a DOM
> > tree; always use Unicode objects.
>
> I do not have control over the sent text. The issue started when some
> bullets were copied from a word document and pasted into a file and
> the whole file was passed to my classes.

if you don't know what encoding the file is using, what
makes you think Python can figure it out?
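
to illustrate why guessing is not possible, here is a minimal sketch (written for current Python, where 8-bit strings are "bytes" objects). the sample byte is hypothetical, chosen because 0x95 is the bullet character in Windows-1252; the same byte means something different, or nothing valid at all, under other encodings:

```python
# hypothetical sample: a single byte of the kind pasted Word text may contain
raw = b"\x95"

# the same byte decodes to different characters depending on the encoding
as_cp1252 = raw.decode("cp1252")      # U+2022 BULLET
as_latin1 = raw.decode("iso-8859-1")  # U+0095, a C1 control character

# ...and under some encodings it is not valid input at all
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # a lone 0x95 byte is not valid UTF-8
```

nothing in the bytes themselves says which reading is the intended one; that information has to come from whoever produced the file.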

> I tried to encode the string using different encodings but I could
> not.

the string is already encoded.  you need to *decode* it.

> Here is what I got when I tried .encode("UTF-8"):
> UnicodeError: ASCII decoding error: ordinal not in range(128)

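this means that your 8-bit string contains non-ASCII
characters: calling encode on a byte string makes Python
first decode it with the default ASCII codec, and that
step fails.  to convert it to a unicode string, use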

    u = s.decode(encoding)

where "encoding" is the source encoding (if you haven't
the slightest idea, try "iso-8859-1")
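
a rough sketch of the whole round trip, written for current Python (where the 8-bit string is a "bytes" object and decode returns a "str"). the sample bytes and the element name "item" are made up for illustration, and "cp1252" is only an assumption about what a Word paste is likely to contain; substitute the real source encoding if you know it:

```python
from xml.dom import minidom

# hypothetical input: bytes as they might arrive from a file pasted out of Word
raw = b"\x95 first item"

# decode once, at the boundary, so only Unicode text reaches the DOM
# (assumption: the source encoding is Windows-1252)
text = raw.decode("cp1252")

doc = minidom.Document()
root = doc.createElement("item")  # illustrative element name
root.appendChild(doc.createTextNode(text))
doc.appendChild(root)

# encode again only on the way out; with an encoding given, toxml returns bytes
xml_bytes = doc.toxml(encoding="utf-8")
```

"iso-8859-1" never raises an error because every byte value maps to some character, but for text that is really Windows-1252 it turns the bullet byte into a control character rather than a bullet.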

also see:

    http://effbot.org/guides/unicode-objects.htm

</F>




