python & xml question

Martin v. Loewis martin at v.loewis.de
Sat Aug 3 17:20:58 EDT 2002


"jano" <jnana4 at DELETEhotmailCAPS.com> writes:

> > Do you have a DOCTYPE declaration in the document? That might be the
> > easiest approach: add a DOCTYPE that declares mdash; the parser should
> > then replace it automatically.
> 
> Are you asking if there is an associated DTD?  There is, and it does declare
> the mdash entity and what it should be replaced with, like so:
> 
> <!ENTITY mdash "&#151;">

I'm really asking whether this declaration is in the internal or in
the external DTD subset.

However, I'm also surprised that you declare mdash as &#151;. This
character (U+0097) is a control character, END OF GUARDED AREA (EPA),
and I don't know why you would associate that with the name mdash...

That your operating system uses byte 151 to represent EM DASH in a
certain code page is irrelevant for XML. XML is based on Unicode, not
on code page 1252.
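
To make the difference between byte 151 and character 151 concrete,
here is a small sketch. It assumes a Python 3 interpreter (the
thread itself is about Python 2.1), and the variable names are only
illustrative.

```python
import unicodedata

# Byte 151 (0x97) interpreted as code page 1252 is EM DASH, U+2014.
from_cp1252 = b"\x97".decode("cp1252")
print(ord(from_cp1252), unicodedata.name(from_cp1252))

# The character reference &#151; names code point 151 (U+0097) instead,
# which Unicode classifies as a control character (category "Cc").
code_point_151 = chr(151)
print(unicodedata.category(code_point_151))
```

So the byte value only means EM DASH after decoding with that one code
page, while a character reference always counts in Unicode code points.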

>   File "quoteHandler.py", line 17, in characters
>     print characters
> UnicodeError: ASCII encoding error: ordinal not in range(128)
> 
> Is this saying that &#8212; is outside the UTF-8 range?

No. 8212 *is* the Unicode number for EM DASH. The error message just
means that you are trying to convert a Unicode string into ASCII (as a
side effect of the print statement), and that ASCII does not support
the EM DASH. Try

    print characters.encode("cp1252")

instead, if your terminal uses that character set.
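
The same failure and the same remedy can be reproduced without a
parser. This is a sketch for Python 3, where print is a function and
the implicit ASCII conversion of Python 2.1 no longer happens, so the
conversion is done explicitly here. The names text, ascii_error and
encoded are made up for the example.

```python
# EM DASH, the character the entity reference should expand to.
text = "\u2014"

# ASCII only covers code points 0..127, so this conversion fails.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)
print(ascii_error)

# Code page 1252 does have a byte for EM DASH: 0x97, decimal 151.
encoded = text.encode("cp1252")
print(encoded)
```

Whether cp1252 is the right choice still depends on what the terminal
actually expects.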

> Ah, I'm using PyXML 0.6.5 under Cygwin, because I couldn't get the later
> versions to work under cygwin.  Could this be a source of my problems?

I'd say there are several problems at work. The traceback you report says

/usr/local/lib/python2.1/xml/sax/expatreader.py

so I would say that you are *not* using PyXML at all (first problem).

With that version (the expatreader bundled with Python 2.1), you will
have problems processing entity references in the SAX application,
unless the entities are declared in the internal subset (second
problem).
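
As an illustration of the internal-subset case, here is a minimal
sketch. It assumes the xml.sax module of a current Python 3 standard
library, whose behaviour may differ in detail from the Python 2.1
expatreader discussed here. The document, the quote element and the
TextCollector class are invented for the example.

```python
import xml.sax

# The entity is declared in the internal DTD subset, with the
# Unicode code point of EM DASH (8212) as its replacement text.
DOC = b"""<?xml version="1.0"?>
<!DOCTYPE quote [
  <!ENTITY mdash "&#8212;">
]>
<quote>wait&mdash;what</quote>"""


class TextCollector(xml.sax.ContentHandler):
    """Collects character data; the parser may deliver it in pieces."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


handler = TextCollector()
xml.sax.parseString(DOC, handler)
text = "".join(handler.chunks)
print(ascii(text))
```

The handler only ever sees the replacement character, never the
reference itself, because the parser expands the entity first.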

You seem to have a misunderstanding of how character references work
in XML, and how they are (not) related to your operating system's
encoding (third problem).

HTH,
Martin



