[ python-Bugs-505747 ] markupbase handling of HTML declarations

Tue Nov 9 17:20:32 CET 2004

Bugs item #505747, was opened at 2002-01-19 09:37
Message generated for change (Comment added) made by ezust
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=505747&group_id=5470

Category: Python Library
Group: Not a Bug
Status: Closed
Resolution: Fixed
Priority: 6
Submitted By: Greg Chapman (glchapman)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: markupbase handling of HTML declarations

Initial Comment:
Using Python 2.2, I tried to use websucker.py on this 
page:

http://magix.fri.uni-lj.si/orange/start/

This resulted in an exception in ParserBase._scan_name 
because _declname_match failed.  Examining the source 
for the page above, I see there are several tags that 
look like: "<![endif]>" where the first character 
after "<!" is a '[', not an alpha as mandated by 
_declname_match.  Perhaps this is badly formed HTML (I 
see it was produced by FrontPage), but if not, it 
appears that _scan_name may have to be modified.  FYI, 
here's the traceback from the exception:

Traceback (most recent call last):
  File "C:\Python22\Tools\webchecker\websucker.py", line 126, in ?
    sys.exit(main() or 0)
  File "C:\Python22\Tools\webchecker\websucker.py", line 43, in main
    c.run()
  File "C:\Python22\Tools\webchecker\webchecker.py", line 349, in run
    self.dopage(url)
  File "C:\Python22\Tools\webchecker\webchecker.py", line 403, in dopage
    page = self.getpage(url_pair)
  File "C:\Python22\Tools\webchecker\webchecker.py", line 507, in getpage
    return Page(text, url, maxpage=self.maxpage, checker=self)
  File "C:\Python22\Tools\webchecker\webchecker.py", line 671, in __init__
    self.parser.feed(self.text)
  File "c:\Python22\lib\sgmllib.py", line 95, in feed
    self.goahead(0)
  File "c:\Python22\lib\sgmllib.py", line 161, in goahead
    k = self.parse_declaration(i)
  File "c:\Python22\lib\markupbase.py", line 66, in parse_declaration
    decltype, j = self._scan_name(j, i)
  File "c:\Python22\lib\markupbase.py", line 313, in _scan_name
    self.error("expected name token")
  File "c:\Python22\lib\sgmllib.py", line 102, in error
    raise SGMLParseError(message)
sgmllib.SGMLParseError: expected name token
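
For illustration only, here is a minimal sketch of why a name scan
rejects "<![endif]>".  The regular expression is an assumption that
approximates the kind of pattern _declname_match uses (a letter followed
by name characters); the exact expression in a given Python release may
differ, and scan_name below is a hypothetical stand-in, not the
library's _scan_name.

```python
import re

# Assumption: approximates the declaration-name pattern; the real
# expression in markupbase may differ between releases.
declname_match = re.compile(r'[a-zA-Z][-_.a-zA-Z0-9]*\s*').match


def scan_name(rawdata, i):
    """Return (name, end_index) for a name starting at index i, else None."""
    m = declname_match(rawdata, i)
    if m is None:
        # The character at index i is not a letter, so no name can start here.
        return None
    return m.group().strip().lower(), m.end()


# Index 2 is the first character after "<!".
print(scan_name('<!DOCTYPE html>', 2))   # a letter follows "<!", so a name is found
print(scan_name('<![endif]>', 2))        # '[' follows "<!", so the scan fails
```

In this sketch the first call finds the name "doctype", while the second
returns None, which corresponds to the point where the real parser
raises "expected name token".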

----------------------------------------------------------------------

Comment By: Alan Ezust (ezust)
Date: 2004-11-09 11:20

Message:
Logged In: YES 
user_id=935841

I am running into this problem too. It seems quite common to
have invalid HTML in real-world web pages, and if you are
running a scraper program, I guess it's to be expected that
one will encounter invalid HTML from time to time. 

So in answer to your question about how to respond, I think
what's most important is that you output a better error
message. Then it won't be considered a bug in the library.

The error should indicate where in the document it
encountered this parse error. 

Second, I don't understand what getpos() returns, or how it
relates to the parse error. It returns 1,2, when in fact, on
the particular page where I encountered the error, the
problem was on line 12 (see http://www.cs.uvic.ca/~gshoja/
as an example). How do I get this information from the object?
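
As a rough sketch of the kind of location reporting asked for here, the
helper below converts a character offset in the page text into a line
and column.  The function name offset_to_linecol, the sample page, and
the convention of 1-based lines with 0-based columns are all assumptions
made for the example; it does not show how the parser tracks positions
internally.

```python
def offset_to_linecol(text, offset):
    """Convert a 0-based character offset into (line, column).

    Lines are 1-based and columns are 0-based (a convention assumed
    for this sketch).
    """
    if not 0 <= offset <= len(text):
        raise ValueError("offset out of range")
    # Every newline before the offset moves us down one line.
    line = text.count("\n", 0, offset) + 1
    # rfind returns -1 when there is no earlier newline, which makes
    # the column equal to the offset on the first line.
    last_newline = text.rfind("\n", 0, offset)
    column = offset - (last_newline + 1)
    return line, column


# Hypothetical page text containing the construct from this report.
page = "<html>\n<body>\n<![endif]>\n</body>\n</html>\n"
bad = page.find("<![endif]>")
line, column = offset_to_linecol(page, bad)
print("parse error at line %d, column %d" % (line, column))
```

A scraper that knows the offset at which parsing stopped could use
something like this to include the location in its own error message.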

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2003-03-30 09:52

Message:
Logged In: YES 
user_id=21627

This has now been fixed with patch 545300, on grounds of
conformance with SGML.

----------------------------------------------------------------------

Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-06-13 21:39

Message:
Logged In: YES 
user_id=3066

Ok, here's what I think.

This is not an actual bug in the interpretation of HTML, and
there has not been a recurring pattern of complaints about
this.  Given that we do not want to encourage the creation
of broken HTML, this edge case will not be allowed to
further complicate the code.

----------------------------------------------------------------------

Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-02-15 01:13

Message:
Logged In: YES 
user_id=3066

Ugh!  I don't think that's legal HTML at all.  I'll have to
think about the right way to deal with it.

----------------------------------------------------------------------
