how to get rid of html tags

Cameron Laird claird at lairds.org
Thu Oct 3 10:10:05 EDT 2002


In article <mailman.1033619587.32128.python-list at python.org>,
Ian Bicking  <ianb at colorstudy.com> wrote:
>The easy answer:
>
>page = re.sub(r'<.*?>', '', page)
>
>There may be more Correct answers, though.  (Some HTML has unquoted <>
>characters, which browsers accept even though it's super annoying to
>parse -- but I don't know that htmllib parses improper HTML either)
>
>On Wed, 2002-10-02 at 20:04, koko wrote:
>> I am trying to retrieve a web page.
>> But I only want to keep the content of the webpage without the html tags.
>> How can I  parse the webpage to get rid of the tags?
			.
			.
			.
People answer this question in *dozens* of different
ways.  Perhaps the most satisfying approach for koko will be
a dialectical one, narrowing the answer down by asking
questions.  For example, does the command line
  lynx -dump $URL > $RESULT
meet all your requirements?
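
If the work has to stay inside Python, here is a rough sketch that compares the regex one-liner quoted above with a parser-based approach. It assumes a current Python 3, where the standard-library html.parser module stands in for the old htmllib. The names strip_tags_regex, TextExtractor and strip_tags_parser are made up for illustration, and skipping the contents of script and style elements is my own assumption about what "content" should mean.

```python
import re
from html.parser import HTMLParser


def strip_tags_regex(page):
    # The quoted one-liner: delete anything that looks like <...>.
    # Quick, but it knows nothing about HTML structure.
    return re.sub(r'<.*?>', '', page)


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring script and style bodies (an assumption)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.chunks.append(data)


def strip_tags_parser(page):
    parser = TextExtractor()
    parser.feed(page)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


if __name__ == "__main__":
    sample = '<p>Hello <b>world</b></p>'
    print(strip_tags_regex(sample))
    print(strip_tags_parser(sample))
```

On simple markup the two agree.  They part ways on pages with embedded scripts, where the regex keeps the script source as if it were page text, which is one reason the "dozens of ways" exist.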
-- 

Cameron Laird <Cameron at Lairds.com>
Business:  http://www.Phaseit.net
Personal:  http://starbase.neosoft.com/~claird/home.html


