Finding Line numbers of HTML file

Ramdas ramdaz at gmail.com
Thu Dec 13 02:07:20 EST 2007


Hey Paul,

Thanks a ton!

I had never heard of the pyparsing module.

This is more of a hobby than homework.

This is exactly what I wanted.

I am scraping a few web pages for data.

I am using Beautiful Soup for tag extraction. However, for some quirky
reasons, I need to refer back to the exact line number of each tag.
Beautiful Soup does not support line numbers right now, and Richardson
says he will add them in the future.
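
For the record, the standard-library route I first asked about might look
roughly like the sketch below. It assumes Python 3, where the HTMLParser
module lives at html.parser, and relies on the parser's getpos() method,
which reports the current (line, column). The class name TagLineTally and
the sample HTML are made up for illustration; this is not tested against
real-world pages.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagLineTally(HTMLParser):
    """Hypothetical helper: record the line number of every opening tag."""

    def __init__(self):
        super().__init__()
        # maps tag name -> list of line numbers where that tag opens
        self.tag_lines = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        # getpos() returns (line, column); line numbers start at 1
        line, _column = self.getpos()
        self.tag_lines[tag].append(line)


# small made-up document standing in for a scraped page
sample_html = """<html>
<head><title>Demo</title></head>
<body>
<p>First <b>bold</b></p>
<br/>
<p>Second</p>
</body>
</html>
"""

parser = TagLineTally()
parser.feed(sample_html)
parser.close()

# print each tag, how often it occurs, and the lines it occurs on
for tag in sorted(parser.tag_lines):
    print(tag, len(parser.tag_lines[tag]), parser.tag_lines[tag])
```

Self-closing tags such as <br/> are counted too, because the default
handle_startendtag() calls handle_starttag(). The tally is kept in a
defaultdict, the same bookkeeping idea as in Paul's script below.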




On Dec 13, 5:27 am, Paul McGuire <pt... at austin.rr.com> wrote:
> On Dec 12, 3:56 pm, Ramdas <ram... at gmail.com> wrote:
> > I am doing some HTML scraping for a side project.
> >
> > I need a method using sgmllib or HTMLParser to parse an HTML file and
> > get the line numbers of all the tags.
>
> Homework, perhaps?  Well, I don't think your instructor will give many
> points for a pyparsing solution, but it was an interesting 10-minute
> exercise anyway.  Once you use pyparsing's built-in expression for
> anyOpenTag, and scan through the input html, the rest is just
> bookkeeping in a defaultdict.
>
> -- Paul
>
> import urllib
> from collections import defaultdict
> from pyparsing import anyOpenTag, lineno
>
> # read in a random html page
> pg = urllib.urlopen("http://www.yahoo.com")
> html = pg.read()
> pg.close()
>
> # print out what we got
> print html
> print
>
> # create a defaultdict to tally up list of line numbers for each tag
> tagLocs = defaultdict(list)
>
> # use a parse action to update the tally whenever a tag is found
> def tallyTagLineNumber(strg, locn, tagTokens):
>     line = lineno(locn,strg)
>     tagLocs[tagTokens[0]].append(line)
> anyOpenTag.setParseAction(tallyTagLineNumber)
>
> # scan the input html, and add tag line numbers to the tally dict
> anyOpenTag.searchString(html)
>
> # print out the results
> tagnames = sorted(tagLocs.keys())
> for t in tagnames:
>     print t, len(tagLocs[t])
>     print tagLocs[t]
>     print
>
> -------
> Prints:
>
> <... extracted HTML not shown...>
>
> a 46
> [54, 68, 96, 97, 98, 99, 110, 111, 112, 113, 114, 115, 116, 117, 120,
> 121, 122, 123, 124, 125, 126, 127, 130, 131, 132, 133, 134, 135, 136,
> 139, 140, 141, 142, 143, 150, 159, 160, 161, 162, 163, 164, 165, 166,
> 168, 169, 170]
>
> b 5
> [91, 109, 119, 129, 138]
>
> base 1
> [6]
>
> body 1
> [17]
>
> br 34
> [91, 93, 93, 94, 110, 111, 112, 113, 114, 115, 116, 117, 120, 121,
> 122, 123, 124, 125, 126, 127, 130, 131, 132, 133, 134, 135, 136, 139,
> 140, 141, 142, 143, 167, 167]
>
> center 1
> [18]
>
> font 15
> [30, 36, 54, 68, 90, 96, 97, 98, 99, 109, 119, 129, 138, 158, 168]
>
> form 1
> [31]
>
> head 1
> [2]
>
> html 1
> [1]
>
> img 2
> [26, 150]
>
> input 5
> [32, 33, 34, 36, 37]
>
> meta 2
> [4, 5]
>
> spacer 26
> [21, 24, 47, 50, 52, 53, 55, 56, 58, 61, 64, 66, 67, 69, 70, 72, 79,
> 82, 83, 89, 101, 104, 146, 148, 156, 157]
>
> span 1
> [39]
>
> style 2
> [7, 12]
>
> table 26
> [19, 21, 29, 30, 35, 46, 49, 54, 63, 68, 79, 82, 83, 84, 87, 88, 95,
> 106, 107, 108, 146, 148, 149, 156, 157, 158]
>
> td 58
> [20, 21, 24, 25, 28, 29, 30, 36, 37, 45, 47, 48, 50, 52, 53, 54, 54,
> 55, 56, 58, 61, 62, 64, 66, 67, 68, 68, 69, 70, 72, 78, 79, 82, 83,
> 86, 87, 89, 90, 96, 97, 98, 99, 101, 104, 105, 106, 107, 109, 119,
> 129, 138, 146, 148, 149, 156, 157, 157, 158]
>
> title 1
> [3]
>
> tr 37
> [20, 21, 23, 29, 30, 35, 44, 46, 49, 51, 54, 57, 63, 65, 68, 71, 78,
> 79, 82, 83, 85, 87, 88, 96, 97, 98, 99, 106, 107, 108, 146, 148, 149,
> 156, 157, 157, 158]



