[Web-SIG] HTML parsing - get text position and font size

Mon Jan 12 14:51:01 CET 2009

2009/1/12 Girish Redekar:
> I'm trying to build a search engine in Python and am stuck at the place where I
> parse HTML to get useful text. One should ideally be able to parse the text
> (out of HTML tags) along with its position (for phrase searches) and
> font-size (to weigh words appropriately).

Have a look at html5lib for HTML parsing: http://code.google.com/p/html5lib
It builds on the HTML5 parsing rules, which are compatible with how
the four most used browsers (IE, Firefox, Safari and Opera) actually
parse HTML as of now (where those browsers do not parse HTML exactly
the same way, the algorithm generally picks the "least illogical"
behavior).
The result can be either an html5lib-specific tree (SimpleTree) or a
BeautifulSoup, ElementTree/lxml or minidom tree. This means that, for
instance, you can replace your BeautifulSoup parsing code with
html5lib and keep the processing code as-is.
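
To illustrate the "keep the processing code as-is" point, here is a
minimal sketch, not a definitive implementation. It assumes a recent
html5lib release where html5lib.parse() accepts the treebuilder and
namespaceHTMLElements arguments (older releases spell this differently).
The helper names parse_with_html5lib, iter_visible_text and
word_positions are hypothetical, made up for this example. The
processing code only sees an ElementTree element, so it does not care
which parser produced it:

```python
import xml.etree.ElementTree as ET

# Assumption: text inside these elements is not useful for search.
SKIPPED_TAGS = {"script", "style"}


def parse_with_html5lib(markup):
    """Parse (possibly broken) HTML into an ElementTree element."""
    # Third-party dependency, imported lazily so that the processing
    # code below can be used and tested without it.
    import html5lib
    return html5lib.parse(markup, treebuilder="etree",
                          namespaceHTMLElements=False)


def iter_visible_text(elem):
    """Yield text chunks in document order, skipping script/style."""
    # Comments and processing instructions have a non-string tag.
    if not isinstance(elem.tag, str) or elem.tag in SKIPPED_TAGS:
        return
    if elem.text:
        yield elem.text
    for child in elem:
        yield from iter_visible_text(child)
        # Text following a child belongs to the parent's content.
        if child.tail:
            yield child.tail


def word_positions(root):
    """Return (position, word) pairs, usable for phrase searches."""
    words = []
    for chunk in iter_visible_text(root):
        words.extend(chunk.split())
    return list(enumerate(words))
```

With html5lib installed, word_positions(parse_with_html5lib(markup))
should give each word along with its index in the document. Any other
parser that produces an ElementTree element can be swapped in without
touching word_positions().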

For font-size, however, you'd have to parse and "apply" CSS, and I
have no solution at hand for that (though I don't really understand
the use case either, actually...)

-- 
Thomas Broyer