Finding Line numbers of HTML file

Thu Dec 13 10:51:18 EST 2007

On Dec 13, 9:01 am, Ramdas <ram... at gmail.com> wrote:
> Hi Paul,
>
> I am cross posting the same to grab your attention at pyparsing forums
> too. 1000 apologies on the same count!
>
> I am a complete newbie to parsing and totally new to pyparsing.
>
> I have adapted your code to store the line numbers as below.
> Surprisingly, the line numbers printed, when I scrap some of the URLs,
> is not accurate and is kind of way off.
>
<snip>

Ramdas -

You will have to send me that URL off-list by e-mail; Google Groups
masks it, so I can't pull it up.  In my example, I used the Yahoo home
page.  What is the URL you used, and which tags' results were off?

Just some comments:
- I did a quasi-verification of my results, using a quick-and-dirty re
match.  This did not give me the line numbers, but did give me counts
of tag names (if anyone knows how to get the string location of an re
match, this would be the missing link for an alternative solution to
this problem).  I added this code after the code I posted earlier:

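print "Quick-and-dirty verify using re's"
import re
from collections import defaultdict

# capture the tag name that follows "<", skipping closing tags and <!...>
openTagRe = re.compile(r"<([^ >/!]+)")

# count occurrences of each tag name found by the regex
tally2 = defaultdict(int)
for match in openTagRe.findall(html):
    tally2[match] += 1

# flag with "<<<" any tag whose regex count differs from the pyparsing tally
for t in tally2.keys():
    print t,tally2[t],
    if tally2[t] != len(tagLocs[t]):
        print "<<<"
    else:
        print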

This crude verifier turned up no mismatches when parsing the Yahoo
home page.
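
On the "missing link" mentioned above: re match objects do report their
position through start(), so an re-only version that yields line numbers
is possible.  What follows is only a minimal sketch in Python 3, not the
pyparsing approach from this thread.  The names tag_line_numbers,
OPEN_TAG_RE and sample are made up for illustration, and the pattern is
the same crude one used in the verifier.

```python
import re
from bisect import bisect_right
from collections import defaultdict

# Same crude pattern as the verifier: the tag name that follows "<".
OPEN_TAG_RE = re.compile(r"<([^ >/!]+)")

def tag_line_numbers(html):
    """Map each tag name to the sorted, de-duplicated lines where it opens."""
    # Offsets at which each line begins: 0, plus one past every newline.
    line_starts = [0] + [m.end() for m in re.finditer(r"\n", html)]
    locs = defaultdict(set)
    for m in OPEN_TAG_RE.finditer(html):
        # m.start() is the offset of "<"; bisect turns it into a
        # 1-based line number.
        locs[m.group(1)].add(bisect_right(line_starts, m.start()))
    return {tag: sorted(lines) for tag, lines in locs.items()}

# Hypothetical sample input, not one of the pages discussed here.
sample = "<html>\n<body>\n<p>one</p><p>two</p>\n</body>\n</html>\n"
print(tag_line_numbers(sample))
# {'html': [1], 'body': [2], 'p': [3]}
```

Comparing these line numbers against tagLocs for a page whose results
look off would show whether the two approaches disagree, and on which
tags.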

- Could the culprit be your unique function?  You did not post the
code for this, so I had to make up my own:

def unique(lst):
    return sorted(set(lst))

This does trim some of the line numbers (a line that contains the same
tag more than once is reported only once), but I did not try to
validate the result.

- In your getlinenos function, it is not necessary to call
setParseAction every time.  You only need to do this once, probably
right after you define the tallyTagLineNumber function.

- Here is an abbreviated form of getlinenos:

def getlinenos(page):
    # clear out tally dict, so as not to get crossover data from
    # a previously-parsed page
    tagLocs.clear()
    # run the parser only for its side effect: the parse action
    # fills tagLocs
    anyOpenTag.searchString(page)
    return dict((k,unique(v)) for k,v in tagLocs.items())

If you wanted, you could even inline the unique logic, without too
much obfuscation.
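
As a rough illustration of that inlining, here is a sketch in Python 3.
The dict tag_locs is a hypothetical stand-in for tagLocs after a parse,
and dedupe_locs is a made-up name covering only the return expression
of getlinenos.

```python
# Hypothetical stand-in for tagLocs after a parse: tag name -> line
# numbers, one entry per occurrence, so a line number can repeat.
tag_locs = {"a": [12, 12, 40, 7], "div": [3, 3, 3], "p": [5]}

def dedupe_locs(locs):
    """Same result as dict((k, unique(v)) ...), with unique() inlined."""
    return {tag: sorted(set(lines)) for tag, lines in locs.items()}

print(dedupe_locs(tag_locs))
# {'a': [7, 12, 40], 'div': [3], 'p': [5]}
```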

-- Paul