[Tutor] How do I get text from an HTML document.

Wed, 14 Aug 2002 14:18:58 -0500

On 8/14/02 12:04 PM, "Magnus Lycka" <magnus@thinkware.se> wrote:

> At 08:16 2002-08-14 -0500, SA wrote:
>> Hi Everyone-
>> 
>> I have HTML docs that have text between the comment tags:
>> <!--Story-->
>> Some text here
>> <!--Story-->
>> 
>> What would be the simplest way to get this text? The text will also have
>> some common HTML tags mixed in, like <p>, so I also want to strip all the
>> HTML tags from the text I get.
> 
> import htmllib, formatter, StringIO
> 
> def getTextFromHTML(html, startPattern, endPattern):
>    # make a file-like string object (data) to which
>    # we'll write the output from HTML parsing.
>    data = StringIO.StringIO()
> 
>    # Find where the relevant text starts and ends.
>    # As used below, the starting tag will be included
>    # in the data which is fed to the HTML parser,
>    # while the ending tag won't, but that hardly
>    # matters since tags are to be removed anyway.
>    # If a tag isn't found, find() returns -1. As a
>    # slice index, -1 means the last character, so
>    # with no start tag the slice "html[-1:stop]" is
>    # normally empty and nothing comes out of the
>    # function. With no end tag, "html[start:-1]"
>    # extracts all text after the start tag except
>    # the final character.
>    start = html.find(startPattern)
>    # Start searching for end tag _after_ the start tag
>    # in case they look the same.
>    stop = html.find(endPattern, start + 1)
> 
>    # Parsers like HTMLParser send their parsed data
>    # to a Formatter, which in turn sends data to a
>    # Writer. Very flexible, but a bit difficult to
>    # learn... DumbWriter is just smart enough to do
>    # word-wrapping after 72 columns.
>    fmt = formatter.AbstractFormatter(formatter.DumbWriter(data))
>    parser = htmllib.HTMLParser(fmt)
> 
>    # Feed the relevant part of the HTML through the parser.
>    parser.feed(html[start:stop])
> 
>    # Return the entire string in "data"
>    return data.getvalue()
> 
> html = """This should be left out
> <!--Story-->
> This should be found
> <!--Story-->
> But not this"""
> 
> tag = '<!--Story-->'
> 
> print getTextFromHTML(html, tag, tag)
> 
This almost works.

It does a great job of extracting the text between the two tags. But for
some reason the output also contains a lot of extraneous material after
that text. It was not between the two tags and looks like it may have come
from the HTML after the end tag. Did I miss something?

>>> def getTextFromHTML(html, startPattern, endPattern):
...     data = StringIO.StringIO()
...     start = html.find(startPattern)
...     stop = html.find(endPattern, start + 1)
...     fmt = formatter.AbstractFormatter(formatter.DumbWriter(data))
...     parser = htmllib.HTMLParser(fmt)
...     parser.feed(html[start:stop])
...     return data
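
For comparison, here is a minimal sketch of the same idea for Python 3, where
the htmllib, formatter and StringIO modules used above are no longer available.
It is not the function from the quoted message: the names TextCollector and
get_text_between are made up for illustration, and it assumes the markers
appear literally in the input. Unlike the version above, it skips past the
start marker and treats a missing marker explicitly instead of slicing
with -1.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps text nodes, ignores tags and comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags and comments never reach here.
        self.chunks.append(data)


def get_text_between(html, start_pattern, end_pattern):
    """Return the tag-free text between the first start and end markers."""
    start = html.find(start_pattern)
    if start == -1:
        # No start marker: return nothing rather than slicing from -1.
        return ""
    start += len(start_pattern)

    stop = html.find(end_pattern, start)
    if stop == -1:
        # No end marker: take everything up to the end of the document.
        stop = len(html)

    collector = TextCollector()
    collector.feed(html[start:stop])
    collector.close()  # flush any text still buffered by the parser

    # Collapse runs of whitespace so the result is a single clean line.
    return " ".join("".join(collector.chunks).split())


sample = """This should be left out
<!--Story-->
<p>This should be <b>found</b></p>
<!--Story-->
But not this"""

print(get_text_between(sample, "<!--Story-->", "<!--Story-->"))
# prints: This should be found
```

Printing the values of start and stop for a real document is a quick way to
check whether the end marker was found where it was expected.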

Thanks.
SA

-- 
"I can do everything on my Mac I used to on my PC. Plus a lot more ..."
-Me