[Tutor] HTMLParser problem unable to find all the IMG tags....
Chris Barnhart
mlist-python at dideas.com
Thu Oct 28 22:22:33 CEST 2004
At 01:49 PM 10/28/2004, Lloyd Kvam wrote:
>On Thu, 2004-10-28 at 08:34, Chris Barnhart wrote:
> >
> > The problem is that using the HTMLParser I'm not getting all the IMG
> > tags. I know this as I have another program that just uses string
> > processing that gets 2.5 times more IMG SRC tag. I also know this because
> > HTMLParser starttag is never called with the IMG that I'm after!
The reason I'm not getting all the IMG tags from CNN is a missing space
between a closing quote and the start of the next attribute in at least one
of their IMG SRC statements.
So it's a problem with CNN's markup. But it causes HTMLParser to fail from
that point through to the end of the document.
For example this statement breaks HTMLParser:
<IMG SRC = "abc.jpg"WIDTH=5>
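To see which tags on a page have this problem, one option is to scan the raw HTML for a closing quote that is immediately followed by a letter inside a tag. This is only a sketch: find_jammed_tags and both regular expressions are hypothetical names made up for illustration, not part of HTMLParser, and the tag pattern is a rough approximation that ignores comments and script blocks.

```python
import re

# Rough start-tag pattern (an assumption, not HTMLParser's own grammar):
# "<", a letter, then quoted strings or any characters other than quotes and ">".
TAG_RE = re.compile(r"""<[A-Za-z](?:"[^"]*"|'[^']*'|[^'">])*>""")

# A quoted value, optionally followed directly by a letter.
# Walking the tag one quoted string at a time keeps opening and
# closing quotes paired up correctly.
QUOTED_RE = re.compile(r"""("[^"]*"|'[^']*')([A-Za-z])?""")


def find_jammed_tags(html):
    """Return (offset, tag_text) for each tag where a closing quote
    is immediately followed by a letter."""
    hits = []
    for tag in TAG_RE.finditer(html):
        for quoted in QUOTED_RE.finditer(tag.group(0)):
            if quoted.group(2) is not None:
                hits.append((tag.start(), tag.group(0)))
                break  # one report per tag is enough
    return hits
```

Calling find_jammed_tags on the page source gives the character offset and text of each suspect tag, which can then be compared with where the parser stops.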
Below is a demo program and its output. It parses two strings, each with
three IMG SRC tags. In the second string, the second tag is missing the
space.
I guess at this point I'm supposed to fix the parser, since CNN can't be fixed?
import urllib2
from HTMLParser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.cnt = 0

    def handle_starttag(self, tag, attr):
        # Called once for every start tag the parser recognizes.
        print tag, attr

    def close(self):
        HTMLParser.close(self)

# The second tag has no space between the closing quote and "width".
s_bug = '<IMG SRC = "http://abc.com" width=10><IMG SRC = "http://abc.com"width=10><IMG SRC = "http://abc.com" width=10>"'
# Same markup, with the space present in all three tags.
s_ok = '<IMG SRC = "http://abc.com" width=10><IMG SRC = "http://abc.com" width=10><IMG SRC = "http://abc.com" width=10>"'

print "Working output"
html = s_ok
h = MyParser()
h.feed(html)
h.close()

print "\nBroken output"
html = s_bug
h = MyParser()
h.feed(html)
h.close()
print "Finished"
>>> Working output
img [('src', 'http://abc.com'), ('width', '10')]
img [('src', 'http://abc.com'), ('width', '10')]
img [('src', 'http://abc.com'), ('width', '10')]
Broken output
img [('src', 'http://abc.com'), ('width', '10')]
Traceback (most recent call last):
  File "C:\Python23\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\src\python\url_bug.py", line 35, in ?
    h.close()
  File "C:\src\python\url_bug.py", line 15, in close
    HTMLParser.close(self)
  File "C:\Python23\lib\HTMLParser.py", line 112, in close
    self.goahead(1)
  File "C:\Python23\lib\HTMLParser.py", line 164, in goahead
    self.error("EOF in middle of construct")
  File "C:\Python23\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: EOF in middle of construct, at line 1, column 38
>>>
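Short of patching HTMLParser itself, one possible workaround is to repair the markup before feeding it to the parser, by inserting the missing space wherever a quoted attribute value runs straight into the next attribute name. A minimal sketch, assuming that pattern is the only defect in the page; repair_html and its helpers are hypothetical names, not part of HTMLParser, and this has not been tried against the real CNN page.

```python
import re

# Rough start-tag pattern (an assumption, not HTMLParser's own grammar):
# "<", a letter, then quoted strings or any characters other than quotes and ">".
TAG_RE = re.compile(r"""<[A-Za-z](?:"[^"]*"|'[^']*'|[^'">])*>""")

# A quoted value, optionally followed directly by a letter.
QUOTED_RE = re.compile(r"""("[^"]*"|'[^']*')([A-Za-z])?""")


def _space_after_quote(match):
    """Put a space between a quoted value and a letter jammed against it."""
    value, next_char = match.group(1), match.group(2)
    if next_char is None:
        return value  # already followed by a space, ">" or similar
    return value + " " + next_char


def _repair_tag(match):
    """Repair every jammed quote inside one start tag."""
    return QUOTED_RE.sub(_space_after_quote, match.group(0))


def repair_html(html):
    """Return html with a space inserted after any quoted attribute value
    that is immediately followed by a letter. Text outside tags is untouched."""
    return TAG_RE.sub(_repair_tag, html)


print(repair_html('<IMG SRC = "abc.jpg"WIDTH=5>'))
# prints: <IMG SRC = "abc.jpg" WIDTH=5>
```

With this in place, the demo program would call h.feed(repair_html(html)) instead of h.feed(html). Only tags are rewritten, so quoted text in the page body is left alone.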