[XML-SIG] Error 404 and xml.dom.ext.reader

Mon, 12 Aug 2002 15:32:34 +0200

Hello, I'm the current Debian maintainer for python-xml and
python-4suite, and Jérôme has forwarded your mail to me.

> From: "J. Imlay" <jimlay@u.washington.edu>
> To: jerome@debian.org
> cc: jimlay@u.washington.edu
> Date: Sat, 27 Jul 2002 00:35:40 -0700 (PDT)
> Subject: python2.1-xml but with xml.dom.ext.reader.PyExpat?
> 
> Hello, I know this isn't your department but I can't figure out who this
> developer for this actually is. It looks like it's 4suite but I don't
> think it is because I thought PyExpat was done by the PyExpat people who
> are not 4Suite. If you could forward this to the appropriate party, (and
> keep me in the cc if you will) I'd appreciate it.

Actually, it's the PyXML code you are using (4DOM, to which xml.dom.ext
belongs, was donated by the 4Suite team to the PyXML project). I'm
cc'ing the PyXML mailing list for further discussion. 

> 
> from xml.dom.ext.reader import PyExpat
> reader = PyExpat.Reader()
> doc = reader.fromUri(uri)
> 
> If the uri contains a #sign (as uri's with references to an anchor tag
> do), the # sign should be ignored no? Instead if
> uri="http://purl.org/file#" and you ask for the file, the webserver
> (depending on how smart it is, apache figures it out, but not all web
> servers do) will return a 404. And the url handler does not realize it's
> a 404 and proceeds to choke on the non-xml output. So 2 things.
> 
> 1. It should (I think, you of course can disagree if you think I am
> ignorant) pick off the # before making the GET request.
> 
> 2. If there is a http error returned in the GET request it should return
> that rather than trying to parse the 404 page as XML and dying with a
> line 1 column 54 error. (the error baffled more than 1 Programmer beyond
> solvability, it took some haxoring to figure out it was the # at the end
> of the URL that was bombing it)

This is certainly a bug, but after looking at the code in PyXML, I'd
say that it is most likely a bug in the urllib module from the Python
standard library, which doesn't raise an exception when an HTTP error
is encountered and returns the body of the error page instead.

>>> from urllib import urlopen
>>> urlopen('http://purl.org/file#').read()
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML
2.0//EN">\n<HTML><HEAD>\n<TITLE>404 Not
Found</TITLE>\n</HEAD><BODY>\n<H1>Not Found</H1>\nThe requested URL
/file was not found on this server.<P>\n</BODY></HTML>\n'
>>> urlopen('http://purl.org/file').read()
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML
2.0//EN">\n<HTML><HEAD>\n<TITLE>404 Not
Found</TITLE>\n</HEAD><BODY>\n<H1>Not Found</H1>\nThe requested URL
/file was not found on this server.<P>\n</BODY></HTML>\n'

Now, this has been fixed in urllib2: 

>>> from urllib2 import urlopen
>>> urlopen('http://purl.org/file').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
<...>
  File "/usr/lib/python2.1/urllib2.py", line 425, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
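
For illustration, here is a minimal sketch of how a reader could surface
the HTTP error itself rather than handing the error page to the parser.
It assumes Python 3, where the urllib2 behaviour shown above lives in
urllib.request and urllib.error. The names DocumentFetchError,
read_document and the opener parameter are hypothetical and are not part
of PyXML.

```python
import urllib.error
import urllib.request


class DocumentFetchError(Exception):
    """Hypothetical error type: the document could not be retrieved."""


def read_document(url, opener=urllib.request.urlopen):
    """Return the raw bytes at url, or raise DocumentFetchError on an HTTP error.

    opener is injectable (an assumption made for this sketch) so the
    function can be exercised without network access.
    """
    try:
        response = opener(url)
    except urllib.error.HTTPError as err:
        # Report the status instead of letting the error page reach the parser.
        raise DocumentFetchError(
            "cannot fetch %s: HTTP Error %d: %s" % (url, err.code, err.msg)
        )
    try:
        return response.read()
    finally:
        response.close()
```

With something along these lines, a 404 would show up to the caller as
"cannot fetch <url>: HTTP Error 404: Not Found" and the parser would
never see the HTML error page.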

Since support for Python 1.5 has been dropped from PyXML, perhaps we
should consider using urllib2 instead of urllib. I don't know whether
this module is available in Python 2.0, though.
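
On the reporter's first point, dropping the fragment before the request
looks easy to do on the client side. A minimal sketch, assuming Python 3,
where urldefrag lives in urllib.parse (in Python 2 it is in the urlparse
module); strip_fragment is a hypothetical helper name, not an existing
PyXML function:

```python
from urllib.parse import urldefrag


def strip_fragment(uri):
    """Return uri without its '#fragment' part, which is meant for the client."""
    url, _fragment = urldefrag(uri)
    return url


print(strip_fragment("http://purl.org/file#"))
print(strip_fragment("http://example.org/doc.xml#section-2"))
```

A URI without a '#' passes through unchanged, so a helper like this could
be applied unconditionally before the GET request is made.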

Any opinion?

Alexandre Fayolle
-- 
LOGILAB, Paris (France).
http://www.logilab.com   http://www.logilab.fr  http://www.logilab.org
Narval, the first software agent available as free software (GPL).