HTMLParser not parsing whole html file

josh logan dear.jay.logan at gmail.com
Sun Oct 24 16:36:31 EDT 2010


Hello,

I wanted to use Python to scrape an HTML file for score data, but I'm
having trouble.
I'm using HTMLParser, and the parsing seems to stop around line 192
or so. After that point none of the handler methods (handle_starttag,
handle_endtag, etc.) are called anymore, and I don't understand why,
because the HTML page is over 1000 lines long.
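
For context, HTMLParser.feed() buffers input that ends in the middle of a
construct and calls no handler for it until later input completes it. The
sketch below only illustrates that buffering with a made-up two-chunk
input. EventRecorder and the sample markup are hypothetical names for this
example, and this is not a confirmed explanation of what happens in the
file above.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper that records tag events instead of printing them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


recorder = EventRecorder()

# The first chunk ends in the middle of the <a ...> start tag, so the parser
# holds that fragment in its buffer and reports only the <p> element.
recorder.feed('<p>one</p><a hr')
events_after_first_feed = list(recorder.events)

# The second chunk completes the tag; close() then flushes any remaining data.
recorder.feed('ef="scores.html">two</a>')
recorder.close()
events_after_second_feed = list(recorder.events)

print(events_after_first_feed)
print(events_after_second_feed)
```

If a construct is never completed by later input, the handlers for
everything after it may simply not fire until close() is called.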

Could someone tell me if this is a bug or simply a misunderstanding of
how HTMLParser works? I'd really appreciate some help in
understanding.

I am using Python 3.1.2 on Windows 7 (hopefully that shouldn't matter).

I put the HTML file on pastebin, because I couldn't think of anywhere
better to put it:
http://pastebin.com/wu6Pky2W

The source code has been pared down to the simplest form to exhibit
the problem. It is displayed below, and is also on pastebin for
download (http://pastebin.com/HxwRTqrr):


import sys
import re
import os.path
import itertools as it
import urllib.request
from html.parser import HTMLParser
import operator as op


base_url = 'http://www.dci.org'

class TestParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        print('position {}, starting tag {} with attrs {}'.format(
            self.getpos(), tag, attrs))

    def handle_endtag(self, tag):
        print('ending tag {}'.format(tag))


def do_parsing_from_file_stream(fname):
    parser = TestParser()

    with open(fname) as f:
        for num, line in enumerate(f, start=1):
            # print('Sending line {} through parser'.format(num))
            parser.feed(line)



if __name__ == '__main__':
    do_parsing_from_file_stream(sys.argv[1])
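
One way to narrow this down might be to feed the whole text in one call,
call close() afterwards, and remember the position of the last start tag
the parser reported. The sketch below assumes nothing about the real page.
PositionTracker, parse_text and the sample markup are hypothetical names
used only for illustration.

```python
from html.parser import HTMLParser


class PositionTracker(HTMLParser):
    """Hypothetical helper that remembers where the last start tag was seen."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.last_pos = None

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        # getpos() returns a (line number, column offset) tuple.
        self.last_pos = self.getpos()


def parse_text(text):
    """Feed the whole text at once, then close() to flush buffered input."""
    parser = PositionTracker()
    parser.feed(text)
    parser.close()
    return parser.tags, parser.last_pos


# Made-up sample input; a real run would pass the contents of the HTML file.
sample = '<html>\n<body>\n<p>score</p>\n</body>\n</html>\n'
tags, last_pos = parse_text(sample)
print(tags, last_pos)
```

Comparing last_pos with the length of the file shows how far the parser
got before the handlers stopped firing.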


