[Web-SIG] HTML parsing - get text position and font size

Girish Redekar girish.redekar at gmail.com
Mon Jan 12 12:26:35 CET 2009


I'm trying to build a search engine in Python and am stuck at the point where
I parse HTML to get useful text. Ideally, one should be able to extract the
text (outside the HTML tags) along with its position (for phrase searches)
and font size (to weight words appropriately).
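
To make the goal concrete, here is a minimal sketch of the kind of output I
have in mind. It assumes the standard-library html.parser module; the names
WordExtractor, extract_words, TAG_WEIGHTS, SKIPPED_TAGS and VOID_TAGS are
made up for illustration, and the tag-to-weight table is only a crude
stand-in for real font sizes (CSS is not consulted at all).

```python
from html.parser import HTMLParser

# Hypothetical weights: a crude proxy for font size based on tag names only.
# Real font sizes would need CSS resolution, which this sketch ignores.
TAG_WEIGHTS = {"h1": 3.0, "h2": 2.5, "h3": 2.0, "b": 1.5, "strong": 1.5}
SKIPPED_TAGS = {"script", "style"}          # text here is not page content
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # never closed


class WordExtractor(HTMLParser):
    """Collect (position, word, weight) tuples from HTML text nodes."""

    def __init__(self):
        super().__init__()
        self.words = []    # (word index, lowercased word, weight)
        self._stack = []   # tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate bad HTML: ignore stray closing tags, and pop back to
        # the most recent matching open tag when inner tags were left open.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self._stack):
            return
        # The heaviest enclosing tag decides the weight; plain text gets 1.0.
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self._stack),
                     default=1.0)
        for word in data.split():
            self.words.append((len(self.words), word.lower(), weight))


def extract_words(html):
    """Return a list of (position, word, weight) for the given HTML string."""
    parser = WordExtractor()
    parser.feed(html)
    parser.close()
    return parser.words
```

Calling extract_words("<h1>Search Engine</h1><p>plain <b>bold</b> text</p>")
would give each word a running index for phrase matching and a weight taken
from the heaviest enclosing tag. Handling inline styles and stylesheets is
the part that gets tedious and is left out here.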

However, this part gets very tedious (especially with bad HTML and CSS) and
my code is already unwieldy. It seems to me that this task should be part of
any semi-sophisticated Python-based screen scraper, and that it would be a
commonly solved problem. Yet no amount of googling has returned anything
useful.

Any ideas?