[Tutor] HTMLParser problem unable to find all the IMG tags....

Chris Barnhart mlist-python at dideas.com
Thu Oct 28 22:22:33 CEST 2004


At 01:49 PM 10/28/2004, Lloyd Kvam wrote:
>On Thu, 2004-10-28 at 08:34, Chris Barnhart wrote:
> >
> > The problem is that using the HTMLParser I'm not getting all the IMG
> > tags.  I know this as I have another program that just uses string
> > processing that gets 2.5 times more IMG SRC tag.  I also know this because
> > HTMLParser starttag is never called with the IMG that I'm after!


The reason I'm not getting all the IMG tags from CNN is a missing space 
between a closing quote and the start of the next attribute in at least one 
of their IMG SRC tags.

So it's a problem with CNN's markup.  But once HTMLParser hits it, it stops 
reporting tags for the rest of the input, and close() raises an HTMLParseError.

For example, this tag breaks HTMLParser:

<IMG SRC = "abc.jpg"WIDTH=5>

Following is a demo program and its output.  It parses two strings, each 
with 3 IMG SRC tags.  The 2nd string is missing the space in its 2nd 
tag.

I guess at this point I'm supposed to fix the Parser since CNN can't be fixed?

from HTMLParser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)

    # Called once for every start tag the parser recognizes.
    def handle_starttag(self, tag, attr):
        print tag, attr

    # Processes any buffered input; the error surfaces here.
    def close(self):
        HTMLParser.close(self)


s_bug = '<IMG SRC = "http://abc.com" width=10><IMG SRC = "http://abc.com"width=10><IMG SRC = "http://abc.com" width=10>"'

s_ok = '<IMG SRC = "http://abc.com" width=10><IMG SRC = "http://abc.com" width=10><IMG SRC = "http://abc.com" width=10>"'

print "Working output"

html = s_ok
h = MyParser()
h.feed(html)
h.close()

print "\nBroken output"

html = s_bug
h = MyParser()
h.feed(html)
h.close()

print "Finished"



 >>> Working output
img [('src', 'http://abc.com'), ('width', '10')]
img [('src', 'http://abc.com'), ('width', '10')]
img [('src', 'http://abc.com'), ('width', '10')]
Broken output
img [('src', 'http://abc.com'), ('width', '10')]
Traceback (most recent call last):
   File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
     exec codeObject in __main__.__dict__
   File "C:\src\python\url_bug.py", line 35, in ?
     h.close()
   File "C:\src\python\url_bug.py", line 15, in close
     HTMLParser.close(self)
   File "C:\Python23\lib\HTMLParser.py", line 112, in close
     self.goahead(1)
   File "C:\Python23\lib\HTMLParser.py", line 164, in goahead
     self.error("EOF in middle of construct")
   File "C:\Python23\lib\HTMLParser.py", line 115, in error
     raise HTMLParseError(message, self.getpos())
HTMLParseError: EOF in middle of construct, at line 1, column 38
 >>>
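
One workaround that avoids patching HTMLParser itself would be to repair the 
markup before feeding it to the parser.  Below is a minimal sketch, assuming 
the only defect is a missing space after a double-quoted attribute value.  
The name add_missing_space is made up for illustration; it is not part of 
HTMLParser.

```python
import re

# Hypothetical helper (not part of HTMLParser).
# Matches: '=', optional whitespace, a double-quoted value, and then
# requires (without consuming it) a letter right after the closing quote.
_MISSING_SPACE = re.compile(r'(=\s*"[^"<>]*")(?=[A-Za-z])')

def add_missing_space(html):
    # Insert a single space between the closing quote and the next name.
    return _MISSING_SPACE.sub(r'\1 ', html)

fixed = add_missing_space('<IMG SRC = "abc.jpg"WIDTH=5>')
print(fixed)
```

The repaired string could then be passed to h.feed() as in the demo above.  
This is only a heuristic: it ignores single-quoted values, and it will also 
change matching text that appears outside of tags.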





