Removing an attribute from html with Regex

Stefan Behnel stefan_ml at behnel.de
Thu Dec 30 03:53:53 EST 2010


Selvam, 30.12.2010 08:30:
> I have some HTML string which I would like to feed to BeautifulSoup.
>
> But, One malformed attribute breaks BeautifulSoup.
>
>      <p style='terp_header' wrong_tag=' text1 ' text2 ' and 'para'  '
>   class='terp_header'>  My String</p>

I didn't try it with BeautifulSoup (and you forgot to say what "breaks"
means exactly in your case), but lxml parses it in a somewhat reasonable way:

   Python 3.2b2 (py3k:87572, Dec 29 2010, 21:25:38)
   [GCC 4.4.3] on linux2
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import lxml.html as H
   >>> doc = H.fromstring('''
   ... <p style='terp_header' wrong_tag=' text1 ' text2 ' and 'para'  '
   ...  class='terp_header'> My String</p>
   ... ''')
   >>> H.tostring(doc)
   b'<p style="terp_header" wrong_tag=" text1 " text2 and \
     class="terp_header"> My String</p>'
   >>> doc.attrib
   {'text2': '', 'and': '', 'style': 'terp_header', \
    'wrong_tag': ' text1 ', 'class': 'terp_header'}


> I would like it to replace all the occurrences of that attribute with an
> empty string.
>
> I am unable to figure out the exact regex, which can do this job.
>
> This is what, I have managed so far,
>
> m = re.compile("rml_except='([^']*)")

I assume "rml_except" is the real name of the attribute?

You may be able to do this with a look-ahead expression, e.g.:

   import re

   # Raw string so that \s and \w reach the regex engine unchanged.
   replace = re.compile(r'(wrong_tag\s*=\s*[^>=]*)(?=>|\s+\w+\s*=)').sub

   html_data = replace('', html_data)

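The trick is to match everything up to the next thing that looks reasonable
again, i.e. the ">" that ends the tag or the start of another attribute
(whitespace, a name, then "="). The look-ahead only checks for that point
without consuming it, so the following attribute stays in place.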
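
As a minimal sketch of the look-ahead idea, here it is applied to the
paragraph from the original question. The names strip_wrong_tag and cleaned
are made up for illustration, and "wrong_tag" stands in for whatever the
real attribute is called.

```python
import re

# Match the attribute name, "=", and then everything that is neither ">" nor
# "=", stopping just before either the ">" that ends the tag or the next
# "name=" pair. The look-ahead (?=...) checks for that point without
# consuming it.
strip_wrong_tag = re.compile(r"wrong_tag\s*=\s*[^>=]*(?=>|\s+\w+\s*=)").sub

# The malformed paragraph from the question, including its line break.
html_data = (
    "<p style='terp_header' wrong_tag=' text1 ' text2 ' and 'para'  '\n"
    "  class='terp_header'>  My String</p>"
)

cleaned = strip_wrong_tag("", html_data)
print(cleaned)
```

This keeps the style and class attributes and drops only the broken one.
Note that it relies on the broken value not containing "=" or ">" itself;
if it does, the match stops early and leaves debris behind.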

Stefan



