how to get rid of html tags

koko kokohh at hotmail.com
Thu Oct 3 18:07:34 EDT 2002


Thanks a lot. It works well when a tag is entirely on one line.
But if a tag is broken across several lines, it does not work.
eg. <!--abcd
        eeeee
            fff>
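
For a tag split across lines like the one above, one likely cause is that
'.' in a regular expression does not match a newline by default. A minimal
sketch of a possible fix, assuming the same pattern is kept and re.DOTALL is
added (the sample page string and the strip_tags name are made up for
illustration):

```python
import re

# Hypothetical sample page containing a tag that spans several lines.
page = """<p>Hello</p>
<!--abcd
        eeeee
            fff>
world"""

# re.DOTALL lets '.' match newlines as well, so the non-greedy
# '<.*?>' can run from '<' on one line to '>' on a later line.
TAG_RE = re.compile(r'<.*?>', re.DOTALL)

def strip_tags(text):
    """Remove everything from each '<' up to the next '>'."""
    return TAG_RE.sub('', text)

print(strip_tags(page))
```

This still treats anything between '<' and the next '>' as a tag, so pages
with unquoted < or > characters in the text may lose content.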


"Cameron Laird" <claird at lairds.org> wrote in message
news:anhj3t$mg9$1 at lairds.org...
> In article <mailman.1033619587.32128.python-list at python.org>,
> Ian Bicking  <ianb at colorstudy.com> wrote:
> >The easy answer:
> >
> >page = re.sub(r'<.*?>', '', page)
> >
> >There may be more Correct answers, though.  (Some HTML has unquoted <>
> >characters, which browsers accept even though it's super annoying to
> >parse -- but I don't know that htmllib parses improper HTML either)
> >
> >On Wed, 2002-10-02 at 20:04, koko wrote:
> >> I am trying to retrieve a web page.
> >> But I only want to keep the content of the webpage without the html
> >> tags.
> >> How can I parse the webpage to get rid of the tags?
> .
> .
> .
> People answer this question in *dozens* of different
> ways.  Perhaps the most satisfying to koko will be
> dialectically.  Does, for example, command-line
>   lynx -dump $URL > $RESULT
> meet all your requirements?
> --
>
> Cameron Laird <Cameron at Lairds.com>
> Business:  http://www.Phaseit.net
> Personal:  http://starbase.neosoft.com/~claird/home.html




