[Web-SIG] HTML parsing - get text position and font size

Girish Redekar girish.redekar at gmail.com
Mon Jan 12 13:07:37 CET 2009


Thanks Noah. Beautiful Soup does give a usable tree, but getting from that
tree to the result I want is still a long way.

I'm using lxml (for speed), and it returns a tree similar to Beautiful
Soup's. I have even got as far as parsing the CSS and getting the attributes
for each text element. However, getting from here to a simple list of the form:
[ (word1, fontsize1, position1), (word2, fontsize2, position2), (word3,
fontsize3, position3) ... ]
is still tedious, as font sizes in HTML/CSS can be expressed in multiple
ways (<FONT> tags, sizes in pixels, relative sizes, larger default sizes
for headers, etc.). One could sit down and code each of these cases, but I
was hoping someone has already worked on this, and reliably.
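
To make the target concrete, here is a minimal sketch of the kind of normalization described above. It is not an existing library and not my current code: it uses the standard-library html.parser rather than lxml so that it is self-contained, it only looks at <FONT size>, inline style="font-size: ..." and header tags (no external stylesheets), and every pixel value in the tables (16px base, the <FONT> and header mappings) is an assumption modelled on common browser defaults. All names (css_size_to_px, font_tag_to_px, WordSizeParser, extract_words) are made up for illustration.

```python
import re
from html.parser import HTMLParser

BASE_PX = 16.0  # assumed default body font size
# Assumed mappings, loosely following common browser defaults.
FONT_TAG_PX = {1: 10, 2: 13, 3: 16, 4: 18, 5: 24, 6: 32, 7: 48}
HEADER_EM = {"h1": 2.0, "h2": 1.5, "h3": 1.17, "h4": 1.0, "h5": 0.83, "h6": 0.67}
KEYWORD_PX = {"xx-small": 9, "x-small": 10, "small": 13, "medium": 16,
              "large": 18, "x-large": 24, "xx-large": 32}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input", "area", "base", "col"}

SIZE_RE = re.compile(r"font-size\s*:\s*([^;]+)", re.I)
LENGTH_RE = re.compile(r"^([0-9]*\.?[0-9]+)(px|pt|em|%)$")


def css_size_to_px(value, parent_px):
    """Convert one CSS font-size value to pixels; unknown values inherit."""
    value = value.strip().lower()
    if value in KEYWORD_PX:
        return float(KEYWORD_PX[value])
    if value == "larger":
        return parent_px * 1.2
    if value == "smaller":
        return parent_px / 1.2
    match = LENGTH_RE.match(value)
    if not match:
        return parent_px
    number, unit = float(match.group(1)), match.group(2)
    if unit == "px":
        return number
    if unit == "pt":
        return number * 96.0 / 72.0  # 1pt = 1/72 inch, 96 CSS px per inch
    if unit == "em":
        return number * parent_px
    return number / 100.0 * parent_px  # percentage of the parent size


def font_tag_to_px(size, parent_px):
    """Convert a <FONT size="..."> value (absolute 1-7, or +N / -N)."""
    size = size.strip()
    try:
        n = int(size)
    except ValueError:
        return parent_px
    if size[0] in "+-":
        n += 3  # relative sizes are taken against an assumed base size of 3
    return float(FONT_TAG_PX[max(1, min(7, n))])


class WordSizeParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = [("", BASE_PX)]  # (open tag, effective font size in px)
        self.words = []               # (word, font size in px, word index)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        attrs = dict(attrs)
        parent_px = self.stack[-1][1]
        px = parent_px
        if tag in HEADER_EM:
            px = HEADER_EM[tag] * parent_px
        if tag == "font" and attrs.get("size"):
            px = font_tag_to_px(attrs["size"], parent_px)
        match = SIZE_RE.search(attrs.get("style") or "")
        if match:  # an inline style overrides the tag defaults above
            px = css_size_to_px(match.group(1), parent_px)
        self.stack.append((tag, px))

    def handle_endtag(self, tag):
        # Tolerate sloppy HTML: pop back to the nearest matching open tag,
        # and ignore closing tags that were never opened.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        tag, px = self.stack[-1]
        if tag in ("script", "style"):
            return
        for word in data.split():
            self.words.append((word, px, len(self.words)))


def extract_words(html):
    parser = WordSizeParser()
    parser.feed(html)
    parser.close()
    return parser.words
```

Here "position" is simply the running word index, which is enough for phrase searches. The same stack-of-inherited-sizes idea should carry over to a recursive walk of an lxml tree, with the parsed stylesheet rules consulted where this sketch reads the inline style attribute.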

Thanks,
Girish


On Mon, Jan 12, 2009 at 4:59 PM, Noah Gift <noah.gift at gmail.com> wrote:

> 2009/1/13 Girish Redekar <girish.redekar at gmail.com>:
> > I'm trying to build a search engine in Python and am stuck at the place
> > where I parse HTML to get useful text. One should ideally be able to
> > parse the text (out of HTML tags) along with its position (for phrase
> > searches) and font-size (to weigh words appropriately).
> >
> > However, this part gets very tedious (especially with bad HTML and CSS)
> > and my code is already unwieldy. It seems to me that this task should've
> > been a part of any Python-based semi-sophisticated screen scraper and
> > that it would be a commonly solved problem. Yet, no amount of googling
> > has returned anything useful.
> >
> > Any ideas?
>
> I wrote this article a while back:
>
> http://www.ibm.com/developerworks/aix/library/au-threadingpython/
>
> I didn't fully explore it, but it seems like thread pools and
> Beautiful Soup could work...