HTMLParser not parsing whole html file

josh logan dear.jay.logan at gmail.com
Sun Oct 24 16:38:23 EDT 2010


On Oct 24, 4:36 pm, josh logan <dear.jay.lo... at gmail.com> wrote:
> Hello,
>
> I wanted to use python to scrub an html file for score data, but I'm
> having trouble.
> I'm using HTMLParser, and the parsing seems to fizzle out around line
> 192 or so. None of the event functions are being called anymore
> (handle_starttag, handle_endtag, etc.) and I don't understand why,
> because it is an HTML page of over 1000 lines.
>
> Could someone tell me if this is a bug or simply a misunderstanding on
> how HTMLParser works? I'd really appreciate some help in
> understanding.
>
> I am using Python 3.1.2 on Windows 7 (hopefully shouldn't matter).
>
> I put the HTML file on pastebin, because I couldn't think of anywhere
> better to put it: http://pastebin.com/wu6Pky2W
>
> The source code has been pared down to the simplest form to exhibit
> the problem. It is displayed below, and is also on pastebin for
> download (http://pastebin.com/HxwRTqrr):
>
> import sys
> import re
> import os.path
> import itertools as it
> import urllib.request
> from html.parser import HTMLParser
> import operator as op
>
> base_url = 'http://www.dci.org'
>
> class TestParser(HTMLParser):
>
>     def handle_starttag(self, tag, attrs):
>         print('position {}, starting tag {} with attrs {}'.format(
>             self.getpos(), tag, attrs))
>
>     def handle_endtag(self, tag):
>         print('ending tag {}'.format(tag))
>
> def do_parsing_from_file_stream(fname):
>     parser = TestParser()
>
>     with open(fname) as f:
>         for num, line in enumerate(f, start=1):
>             # print('Sending line {} through parser'.format(num))
>             parser.feed(line)
>
> if __name__ == '__main__':
>     do_parsing_from_file_stream(sys.argv[1])

Sorry, the group doesn't like how I surrounded the Python code's
pastebin URL with parentheses:

http://pastebin.com/HxwRTqrr
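
One way to narrow this down might be to feed the whole document in a
single call and then call close(), which asks the parser to process
any text it is still holding in its buffer. The sketch below is only a
diagnostic, not a fix, and it does not identify the cause of the
stall: the names TagCounter and count_tags are hypothetical, and it
counts tags instead of printing them so that two runs are easy to
compare. The position it returns is whatever getpos() reports once
parsing has finished.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical helper: counts start and end tags instead of printing them."""

    def __init__(self):
        super().__init__()
        self.starts = 0
        self.ends = 0

    def handle_starttag(self, tag, attrs):
        self.starts += 1

    def handle_endtag(self, tag):
        self.ends += 1


def count_tags(html_text):
    """Feed the whole document at once and return (starts, ends, last position)."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()  # ask the parser to process any text still buffered
    return parser.starts, parser.ends, parser.getpos()
```

Calling count_tags(open(fname).read()) and comparing the counts and
the final position against the line-by-line run should show whether
the handlers stop at the same place in both cases.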


