[Tutor] How do I get text from an HTML document.

Magnus Lycka magnus@thinkware.se
Wed, 14 Aug 2002 22:39:45 +0200


At 14:18 2002-08-14 -0500, SA wrote:
>It does a great job of extracting the text between the two tags. But for
>some reason I have a lot of extraneous material after the text that was not
>between the two tags and looks like it may have come from the html code
>after the end tag. Did I miss something?
>
> >>> def getTextFromHTML(html, startPattern, endPattern):
>...     data = StringIO.StringIO()
>...     start = html.find(startPattern)
>...     stop = html.find(endPattern, start + 1)
>...     fmt = formatter.AbstractFormatter(formatter.DumbWriter(data))
>...     parser = htmllib.HTMLParser(fmt)
>...     parser.feed(html[start:stop])
>...     return data

That suggests that "stop = html.find(endPattern, start + 1)"
didn't work as intended? Does stop come out as -1?
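
Why -1 would matter: str.find returns -1 when the pattern is missing,
and a slice that ends at -1 runs to just before the last character of
the string, so nearly everything after the start tag gets fed to the
parser. A minimal sketch of that effect, assuming a made-up sample page
(the HTML and the marker names below are hypothetical, not SA's data):

```python
# Hypothetical sample page; the marker names are made up for illustration.
html = "<p>intro</p><!--Story-->Hello world<!--/Story--><p>footer</p>"

start = html.find("<!--Story-->")
# This end pattern does not occur in the page, so find() returns -1.
stop = html.find("<!--Storytext-->", start + 1)

# html[start:-1] runs from the start marker to one character before the
# end of the string, so the footer markup is included as well.
snippet = html[start:stop]
print(stop)
print(snippet)
```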

 >>> def getTextFromHTML(html, startPattern, endPattern):
...     data = StringIO.StringIO()
...     start = html.find(startPattern)
...     stop = html.find(endPattern, start + 1)
Here we could put in
         print stop
         print html[start:stop]
to see what that looks like...

...     fmt = formatter.AbstractFormatter(formatter.DumbWriter(data))
...     parser = htmllib.HTMLParser(fmt)
...     parser.feed(html[start:stop])
...     return data

Hm, I think I see what it was: did you supply "<!--/Storytext-->"
as the "endPattern", not "<!--Storytext-->"? In your question you
had "<!--Story-->" both before and after, so I called the function
as "getTextFromHTML(html, tag, tag)", with the same tag for before
and after.
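
One way to make a wrong end pattern show up immediately is to check
both find() results before slicing. The following is only a sketch
under some assumptions: it uses html.parser from current Python instead
of the htmllib/formatter pair in the function above, the names
TextCollector and get_text_between are made up here, and it slices
after the start marker rather than at it, so it is not a drop-in
replacement.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps text nodes, ignores tags and comments."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def get_text_between(html, start_pattern, end_pattern):
    start = html.find(start_pattern)
    if start == -1:
        raise ValueError("start pattern not found: %r" % start_pattern)
    # Search after the start marker, so identical markers work too.
    body_start = start + len(start_pattern)
    stop = html.find(end_pattern, body_start)
    if stop == -1:
        # Fail loudly instead of letting -1 leak into the slice.
        raise ValueError("end pattern not found: %r" % end_pattern)
    parser = TextCollector()
    parser.feed(html[body_start:stop])
    parser.close()
    return "".join(parser.chunks)


# Hypothetical sample page with the same marker before and after.
sample = "<!--Story--><p>Hello <b>world</b></p><!--Story--><p>footer</p>"
print(get_text_between(sample, "<!--Story-->", "<!--Story-->"))
```

With a mistyped end pattern this raises ValueError rather than quietly
returning the rest of the page.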




-- 
Magnus Lyckå, Thinkware AB
Älvans väg 99, SE-907 50 UMEÅ
tel: 070-582 80 65, fax: 070-612 80 65
http://www.thinkware.se/  mailto:magnus@thinkware.se