[Web-SIG] HTML parsing - get text position and font size

Noah Gift noah.gift at gmail.com
Mon Jan 12 12:29:11 CET 2009


2009/1/13 Girish Redekar <girish.redekar at gmail.com>:
> I'm trying to build a search engine in Python and am stuck at the place where I
> parse HTML to get useful text. One should ideally be able to parse the text
> (out of HTML tags) along with its position (for phrase searches) and
> font-size (to weigh words appropriately).
>
> However, this part gets very tedious (especially with bad html and css) and
> my code is already unwieldy. It seems to me that this task should've been a
> part of any Python-based semi-sophisticated screen scraper and that it would
> be a commonly solved problem. Yet, no amount of googling has returned
> anything useful.
>
> Any ideas?

I wrote this article a while back:

http://www.ibm.com/developerworks/aix/library/au-threadingpython/

I didn't fully explore the idea, but it seems like thread pools combined
with Beautiful Soup could work...
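
To make the idea a bit more concrete, here is a rough sketch of the
extraction step, not a tested solution. It is written for a current
Python 3 and swaps in the standard library's html.parser for Beautiful
Soup so that it runs with no extra installs. The names WordExtractor,
extract_words, extract_many and the TAG_WEIGHTS table are made up for
illustration. The weights are only a guess at "font size" based on tag
names; real CSS is ignored entirely.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

# Assumed weights: a crude stand-in for font size, keyed on tag name only.
TAG_WEIGHTS = {"h1": 6, "h2": 5, "h3": 4, "h4": 3, "b": 2, "strong": 2}
# Text inside these tags is never visible, so it is not indexed.
SKIP_TAGS = {"script", "style"}
# Void elements never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class WordExtractor(HTMLParser):
    """Collect (position, word, weight) triples from an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # tags that are currently open
        self.words = []   # (position, word, weight) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate bad HTML: pop back to the nearest matching open tag,
        # and ignore a closing tag that was never opened.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIP_TAGS for t in self.stack):
            return
        # The heaviest enclosing tag wins; plain text gets weight 1.
        weight = max([TAG_WEIGHTS.get(t, 1) for t in self.stack], default=1)
        for word in data.split():
            self.words.append((len(self.words), word.lower(), weight))


def extract_words(html):
    """Parse one document and return its (position, word, weight) list."""
    parser = WordExtractor()
    parser.feed(html)
    parser.close()
    return parser.words


def extract_many(pages, workers=4):
    """Run extract_words over several documents using a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_words, pages))


if __name__ == "__main__":
    sample = "<html><h1>Big Title</h1><p>some <b>bold</b> text</p></html>"
    for position, word, weight in extract_words(sample):
        print(position, word, weight)
```

The word positions are what a phrase search needs (adjacent positions
mean adjacent words), and the weight can feed the ranking. Parsing is
CPU-bound, so the thread pool mostly pays off once each worker also
fetches its page over the network.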

