how to get rid of html tags
Cameron Laird
claird at lairds.org
Thu Oct 3 10:10:05 EDT 2002
In article <mailman.1033619587.32128.python-list at python.org>,
Ian Bicking <ianb at colorstudy.com> wrote:
>The easy answer:
>
>page = re.sub(r'<.*?>', '', page)
>
>There may be more Correct answers, though. (Some HTML has unquoted <>
>characters, which browsers accept even though it's super annoying to
>parse -- but I don't know that htmllib parses improper HTML either)
>
>On Wed, 2002-10-02 at 20:04, koko wrote:
>> I am trying to retrieve a web page.
>> But I only want to keep the content of the webpage without the html tags.
>> How can I parse the webpage to get rid of the tags?
.
.
.
People answer this question in *dozens* of different
ways. Perhaps the most satisfying for koko will be to
answer dialectically. Does, for example, the command line
lynx -dump $URL > $RESULT
meet all your requirements?
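
If it has to be done inside Python instead, here is a minimal sketch of two routes, assuming a modern Python 3 and only the standard library. The first is the regular expression quoted above, with re.DOTALL added so that a tag split across lines still matches. The second uses html.parser to collect text nodes and skip the contents of script and style elements. The names strip_tags_regex, strip_tags_parser and _TextOnly are made up for illustration, and neither route promises to cope with badly broken HTML.

```python
import re
from html.parser import HTMLParser


def strip_tags_regex(page):
    # The quoted one-liner, plus re.DOTALL so "." also matches newlines
    # and a tag that spans several lines is removed in one piece.
    return re.sub(r'<.*?>', '', page, flags=re.DOTALL)


class _TextOnly(HTMLParser):
    """Hypothetical helper: keeps text nodes, drops script/style bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a script or style element.
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags_parser(page):
    parser = _TextOnly()
    parser.feed(page)
    parser.close()
    return ''.join(parser.parts)


if __name__ == '__main__':
    sample = '<html><body><p>Hello <b>world</b></p></body></html>'
    print(strip_tags_regex(sample))
    print(strip_tags_parser(sample))
```

The regex route is shorter but treats any "<...>" run as a tag and leaves entities such as &amp; untouched; the parser route decodes entities and is somewhat more forgiving, at the cost of a few more lines.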
--
Cameron Laird <Cameron at Lairds.com>
Business: http://www.Phaseit.net
Personal: http://starbase.neosoft.com/~claird/home.html