(htmllib) How to capture text that includes tags?

Wed Nov 5 15:42:04 EST 2003

jennyw wrote:

> On Wed, Nov 05, 2003 at 11:23:36AM +0100, Peter Otten wrote:
>> I've found the parser in the HTMLParser module to be a lot easier to use.
>> Below is the rough equivalent of your posted code. In the general case
>> you will want to keep a stack of tags instead of the simple infont flag.
> 
> Thanks! What are the main advantages of HTMLParser over htmllib?

Basically htmllib.HTMLParser feeds a formatter that I don't need with
information that I would rather disregard.
HTMLParser.HTMLParser, on the other hand, has a simple interface (you've
pretty much seen it all in my tiny example).
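
A minimal sketch of that interface, applied to the question in the subject
line (capturing text together with the tags inside it). This is not the code
from the earlier post. It assumes Python 3, where the HTMLParser module is
named html.parser; the class name FontCapture and the depth counter, used here
in place of the simple infont flag, are illustrative choices.

```python
from html.parser import HTMLParser


class FontCapture(HTMLParser):
    """Collect the markup found inside each outermost <font> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0       # number of <font> elements currently open
        self.buffer = []     # pieces of the capture in progress
        self.captured = []   # finished captures, one string per element

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            self.depth += 1
            if self.depth == 1:
                self.buffer = []
                return       # the outermost <font> itself is not recorded
        if self.depth:
            # get_starttag_text() returns the start tag as it was written
            self.buffer.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "font" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.captured.append("".join(self.buffer))
                return
        if self.depth:
            self.buffer.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


parser = FontCapture()
parser.feed('<p>skip</p><font size="2">a <b>bold</b> '
            '<font color="red">x</font> z</font>')
parser.close()
print(parser.captured)
```

Because the counter only returns to zero when every <font> has been closed, a
missing </font> makes the capture run to the end of the input, so this sketch
still needs reasonably well-formed HTML.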

> The code gives me something to think about ... it doesn't work right now
> because it turns out there are nested font tags (which means the asserts
> fail, and if I comment them out, it generates a 53 MB file from a < 1 MB
> source file). I'll try playing with it and seeing if I can get it to do
> what I want.

I would suspect that there are <font> tags without a corresponding </font>.
You could fix that by preprocessing the HTML source with a tool like tidy.
As an aside, font tags are about the worst search criterion you can pick,
because they describe appearance rather than structure. Try to find something
more specific, e.g. the "second column in every row of the first table". If
this gets too complex for HTMLParser, you can instead convert the HTML into
XML (again via tidy) and then read it into a DOM tree.
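
As an illustration of a more specific criterion, the "second column in every
row of the first table" could be sketched roughly as follows. It assumes
Python 3's html.parser, input that is already well formed (for example after
running tidy), and no tables nested inside the first one; the class name
SecondColumn is hypothetical.

```python
from html.parser import HTMLParser


class SecondColumn(HTMLParser):
    """Collect the text of the second cell in each row of the first table."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.table_no = 0      # number of <table> start tags seen so far
        self.in_table = False  # True only while inside the first table
        self.column = 0        # 1-based index of the current cell in its row
        self.in_cell = False   # True while inside a second-column cell
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_no += 1
            self.in_table = self.table_no == 1
        elif self.in_table and tag == "tr":
            self.column = 0
        elif self.in_table and tag in ("td", "th"):
            self.column += 1
            if self.column == 2:
                self.in_cell = True
                self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        # Text inside inline tags such as <b> is kept; the tags are dropped.
        if self.in_cell:
            self.cells[-1] += data


parser = SecondColumn()
parser.feed("<table><tr><td>1</td><td>Widget <b>A</b></td></tr>"
            "<tr><td>2</td><td>Gadget</td></tr></table>"
            "<table><tr><td>x</td><td>y</td></tr></table>")
parser.close()
print(parser.cells)
```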

> It would be easier if I could find a way to view the HTML as a tree ...
> as a side note, are there any good utils to do this?

I've never applied this primitive data extraction technique to large, complex
HTML files, so for me a text editor has been sufficient so far.
(If you are on Linux, you could give Quanta Plus a try.)
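
Failing a dedicated tool, a crude tree view can be had from the parser itself.
The following is only a sketch, assuming Python 3's html.parser: it prints
each start tag indented by its nesting depth. The class name Outline and the
(incomplete) VOID set of tags that never take an end tag are illustrative.

```python
from html.parser import HTMLParser

# Elements that have no end tag in HTML, so they must not open a new level.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class Outline(HTMLParser):
    """Record one indented line per start tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the elements currently open
        self.lines = []   # the outline, one entry per start tag

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * len(self.stack) + tag)
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching start tag, which also closes elements
        # such as <p> that were left open; stray end tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


outline = Outline()
outline.feed("<html><body><p>one<br>two</body></html>")
outline.close()
print("\n".join(outline.lines))
```

A nesting depth that keeps growing in the output is a quick hint at start tags
that are never closed.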

Peter

PS: You could ask the company supplying the catalog for a copy in a more
accessible format, assuming you are a customer rather than a competitor.