python & xml question

jano jnana4 at DELETEhotmailCAPS.com
Sat Aug 3 18:07:16 EDT 2002


"Martin v. Loewis" <martin at v.loewis.de> wrote in message
news:m3u1mbtz9h.fsf at mira.informatik.hu-berlin.de...
> "jano" <jnana4 at DELETEhotmailCAPS.com> writes:
>
> > > Do you have a DOCTYPE declaration in the document? That might be the
> > > easiest approach: add a DOCTYPE that declares mdash; the parser should
> > > then replace it automatically.
> >
> > Are you asking if there is an associated DTD?  There is, and it does declare
> > the mdash entity and what it should be replaced with, like so:
> >
> > <!ENTITY mdash "&#151;">
>
> I'm really asking whether this declaration is in the internal or in
> the external DTD subset.

The declaration is in the external DTD subset.

> However, I'm also surprised that you declare mdash as &#151;: This
> character is a control character, END OF GUARDED AREA (EPA), and
> I don't know why you would associate that with the name mdash...
>
> That your operating system uses byte 151 to represent EM DASH in a
> certain code page is irrelevant for XML, XML is based on Unicode, not
> code page 1252.

I used &#151; because the XML is destined to be HTML, and &#151;, as far as I
know, is the only representation that works in all browsers. I see now,
though, that I should be using the Unicode representation and translating for
a browser at some later point, if necessary.
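
To check the point about 151 for myself, here is a minimal sketch. It assumes
a current Python 3 interpreter rather than the 2.1 one from the traceback, and
the variable names are only for illustration. Code point 151 is a control
character; byte 0x97 only becomes EM DASH when it is decoded as cp1252.

```python
import unicodedata

# Code point 151 (U+0097) is a C1 control character, not a dash.
c1_control = chr(151)
print(unicodedata.category(c1_control))  # general category of U+0097

# Byte 0x97 maps to EM DASH only when interpreted as code page 1252.
em_dash = b"\x97".decode("cp1252")
print(unicodedata.name(em_dash), ord(em_dash))
```

So the numeric character reference for the dash should be built from the
Unicode number, not from the cp1252 byte value.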

> >   File "quoteHandler.py", line 17, in characters
> >     print characters
> > UnicodeError: ASCII encoding error: ordinal not in range(128)
> >
> > Is this saying that &#8212; is outside the UTF-8 range?
>
> No. 8212 *is* the Unicode number for EM DASH. The error message just
> means that you are trying to convert a Unicode string into ASCII (as a
> side effect of the print statement), and that ASCII does not support
> the EM DASH. Try
>
>     print characters.encode("cp1252")
>
> instead, if your terminal uses that character set.

Great.  This works now.  I am just trying to parse an existing XML file that
I have and print it to the console, which I thought would be pretty simple,
but I should have done more preparatory work first.  Anyway, it works now.
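
For anyone reading later, the difference can be sketched as follows. This is
written for Python 3, where the exception is named UnicodeEncodeError, and the
sample string is made up for illustration; it is not taken from my file.

```python
# Hypothetical character data containing EM DASH (U+2014).
text = "wait\u2014what"

# ASCII has no EM DASH, so encoding to ASCII fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# cp1252 does have one, stored as the single byte 0x97.
cp1252_bytes = text.encode("cp1252")
print(ascii_ok, cp1252_bytes)
```

Encoding explicitly, as suggested, just picks a character set that actually
contains the dash instead of relying on the ASCII default.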

> > Ah, I'm using PyXML 0.6.5 under Cygwin, because I couldn't get the later
> > versions to work under cygwin.  Could this be a source of my problems?
>
> I'd say there are several problems at work. The traceback you report says
>
> /usr/local/lib/python2.1/xml/sax/expatreader.py
>
> so I would say that you are *not* using PyXML at all (first problem).
>
> With that version, you will have problems processing entity references
> in the SAX application, unless they are in the internal subset (second
> problem).

I will move the DTD into the instance, for now. I thought that I was using
PyXML (I thought the expatreader was called from something in PyXML), but I
am pretty new to Python and XML and, as you can see, I am quite confused.
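
Roughly what I have in mind for moving the declaration into the instance is
sketched below. It assumes the standard library xml.sax under Python 3; the
document, the quote element and the QuoteHandler class are made up for
illustration and are not my real file or handler.

```python
import xml.sax

# Hypothetical document: mdash is declared in the internal DTD subset,
# using the Unicode number for EM DASH.
DOC = b"""<?xml version="1.0"?>
<!DOCTYPE quote [
  <!ENTITY mdash "&#8212;">
]>
<quote>wait&mdash;what</quote>"""


class QuoteHandler(xml.sax.ContentHandler):
    """Collects character data; the parser may deliver it in several chunks."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


handler = QuoteHandler()
xml.sax.parseString(DOC, handler)
text = "".join(handler.chunks)
print(ascii(text))  # ascii() keeps the output printable on any terminal
```

With the declaration in the internal subset, the parser replaces the entity
reference itself and the handler only ever sees the resulting character.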

> You seem to have a misunderstanding of how character references work
> in XML, and how they are (not) related to your operating system's
> encoding (third problem).
>
> HTH,
> Martin
>
I was using &#151; not because of my operating system's encoding, but
because I was foolishly encoding special characters in the way that I
thought they would ultimately end up in a browser.
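
One way to do that browser translation at output time instead is sketched
below. It assumes Python 3 and the built-in "xmlcharrefreplace" error handler
for str.encode; the sample string is again made up.

```python
# Hypothetical parsed text containing a real EM DASH (U+2014).
text = "wait\u2014what"

# Characters outside ASCII become numeric character references on output.
html_bytes = text.encode("ascii", "xmlcharrefreplace")
print(html_bytes)
```

That keeps the XML itself in Unicode and leaves the browser-specific form to
the last step.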

Anyway, thanks a million.  Your help has been invaluable.  What I have now
is working, and I see several areas that I need to do some research on (such
as character encodings and Unicode).

thanks again,

jano




