[Tutor] How do I get text from an HTML document.
SA
sarmstrong13@mac.com
Wed, 14 Aug 2002 14:18:58 -0500
On 8/14/02 12:04 PM, "Magnus Lycka" <magnus@thinkware.se> wrote:
> At 08:16 2002-08-14 -0500, SA wrote:
>> Hi Everyone-
>>
>> I have HTML docs that have text between the comment tags:
>> <!--Story-->
>> Some text here
>> <!--Story-->
>>
>> What would be the simplest way to get this text? The text will also have
>> some common html tags mixed in like <p>. So I want to strip all the html
>> tags from the text I get also.
>
> import htmllib, formatter, StringIO
>
> def getTextFromHTML(html, startPattern, endPattern):
>     # Make a file-like string object (data) to which
>     # we'll write the output from HTML parsing.
>     data = StringIO.StringIO()
>
>     # Find where the relevant text starts and ends.
>     # As used below, the starting tag will be included
>     # in the data which is fed to the HTML parser,
>     # while the ending tag won't, but that hardly
>     # matters since tags are to be removed anyway.
>     # If a tag isn't found, find() returns -1. This
>     # means that if the start tag isn't found, nothing
>     # will come out of the function ("html[-1:stop]").
>     # If no end tag is found, nearly all text after the
>     # start tag will be extracted ("html[start:-1]"),
>     # since a slice end of -1 stops just before the
>     # last character.
>     start = html.find(startPattern)
>     # Start searching for the end tag _after_ the start
>     # tag in case they look the same.
>     stop = html.find(endPattern, start + 1)
>
>     # Parsers like HTMLParser send their parsed data
>     # to a Formatter, which in turn sends data to a
>     # Writer. Very flexible, but a bit difficult to
>     # learn... DumbWriter is just smart enough to do
>     # word-wrapping after 72 columns.
>     fmt = formatter.AbstractFormatter(formatter.DumbWriter(data))
>     parser = htmllib.HTMLParser(fmt)
>
>     # Feed the relevant part of the HTML through the parser.
>     parser.feed(html[start:stop])
>
>     # Return the entire string in "data".
>     return data.getvalue()
>
> html = """This should be left out
> <!--Story-->
> This should be found
> <!--Story-->
> But not this"""
>
> tag = '<!--Story-->'
>
> print getTextFromHTML(html, tag, tag)
>
This almost works.
It does a great job of extracting the text between the two tags. But for
some reason I have a lot of extraneous material after the text that was not
between the two tags and looks like it may have come from the html code
after the end tag. Did I miss something?
>>> def getTextFromHTML(html, startPattern, endPattern):
...     data = StringIO.StringIO()
...     start = html.find(startPattern)
...     stop = html.find(endPattern, start + 1)
...     fmt = formatter.AbstractFormatter(formatter.DumbWriter(data))
...     parser = htmllib.HTMLParser(fmt)
...     parser.feed(html[start:stop])
...     return data
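
A possible explanation, going by the comments in the quoted code: if
endPattern never matches after the start tag (for example, if the closing
marker in the real documents is spelled differently from the opening one),
find() returns -1 and the slice html[start:-1] takes in almost everything
after the start tag. The session above also returns data itself rather than
data.getvalue(). The following is only a sketch under stated assumptions, not
the code from the quoted message: it assumes Python 3, where htmllib and
formatter are no longer available, so it substitutes html.parser from the
standard library, and the names TextExtractor and get_text_between are made
up for illustration. It reports a missing end marker instead of silently
extracting the rest of the document.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, ignoring tags and comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; comments never reach this method.
        self.chunks.append(data)


def get_text_between(html, start_pattern, end_pattern):
    """Return the tag-free text between the first start/end marker pair."""
    start = html.find(start_pattern)
    if start == -1:
        # No start marker: nothing to extract.
        return ""
    # Skip past the start marker so an identical end marker is found after it.
    start += len(start_pattern)
    stop = html.find(end_pattern, start)
    if stop == -1:
        # Assumption: a missing end marker should be reported, not ignored.
        raise ValueError("end pattern not found: %r" % end_pattern)

    parser = TextExtractor()
    parser.feed(html[start:stop])
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join("".join(parser.chunks).split())


sample = """This should be left out
<!--Story-->
<p>This should be <b>found</b></p>
<!--Story-->
But not this"""

print(get_text_between(sample, "<!--Story-->", "<!--Story-->"))
# prints: This should be found
```

Raising ValueError is a design choice here; returning an empty string when
the end marker is missing would be another reasonable option.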
Thanks.
SA
--
"I can do everything on my Mac I used to on my PC. Plus a lot more ..."
-Me