strip not well formed html tags...

Shagshag13 shagshag13 at yahoo.fr
Tue Oct 22 13:24:20 EDT 2002


"Mark McEahern" <marklists at mceahern.com> wrote in message news: mailman.1035288928.25184.python-list at python.org...
> > I've seen many posts about how to strip HTML tags from a string;
> > some use sgmllib, others regular expressions. My problem is this:
> > I would like to strip HTML (or even XML) tags, but I have to work
> > on incomplete strings, so they may not be well formed. What
> > should I use? Regexps? sgmllib with lots of exception handling?
>
> 1.  Try mxTidy.

Thanks, I'll check...

> 2.  Consider providing an example of the data you're talking about.

Sorry, here are two examples:

"""<tag1> <tag2>a title</tag2><tag3>this is an example <tag3>and closing tags are missing """
"""but this one is possible too </tag1><tag3> <script>here is javascript code</script>"""
(The strings may also contain real HTML tags.)
I want the result to be human readable. For now I use a regexp to strip the tags, but I have trouble with JavaScript: for
<script>here is javascript code</script> I get the script content back as text. So I was wondering whether I shouldn't use a real parser instead... but then my
problem is that I don't know in advance all the tags I will have to handle.
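
To make the "real parser" option concrete, here is a minimal sketch. It assumes a current Python 3 and its standard-library html.parser module, which is not the sgmllib or mxTidy mentioned above. The names TagStripper, SKIP and strip_tags are made up for this illustration. The parser does not need to know the tags in advance and does not raise on missing or stray closing tags; only the elements whose content should be dropped (here script and style) are listed.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text content, ignoring tags and the content of SKIP elements."""

    # Hypothetical choice for this sketch: elements whose content is not
    # meant for human readers.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # text fragments seen so far
        self.skipping = False  # True while inside a SKIP element

    def handle_starttag(self, tag, attrs):
        # Unknown tags such as <tag1> arrive here too; they are just dropped.
        if tag in self.SKIP:
            self.skipping = True

    def handle_endtag(self, tag):
        # Stray closing tags (no matching opening tag) are harmless here.
        if tag in self.SKIP:
            self.skipping = False

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)


def strip_tags(text):
    """Return the text of a possibly malformed HTML/XML fragment."""
    parser = TagStripper()
    parser.feed(text)
    parser.close()
    # Join fragments with a space so text from adjacent elements does not
    # run together, then collapse whitespace. This also splits a word that
    # is broken up by inline tags, which is a trade-off of this sketch.
    return " ".join(" ".join(parser.parts).split())


example1 = """<tag1> <tag2>a title</tag2><tag3>this is an example <tag3>and closing tags are missing """
example2 = """but this one is possible too </tag1><tag3> <script>here is javascript code</script>"""

print(strip_tags(example1))
print(strip_tags(example2))
```

A tag that is cut off in the middle of its name or attributes at the end of the input is a separate case that this sketch does not address.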

Thanks,

s13.




