Mutating an HTML file with BeautifulSoup

Sat Aug 20 17:51:41 EDT 2022

On 2022-08-20, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
> Jon Ribbens <jon+usenet at unequivocal.eu> writes:
>>... or you could avoid all that faff and just do re.sub()?
>
> import bs4
> import re
>
> source = '<a name="b" href="http" accesskey="c"></a>'
>
> # Use Python to change the source, keeping the order of attributes.
>
> result = re.sub( r'href\s*=\s*"http"', r'href="https"', source )
> result = re.sub( r"href\s*=\s*'http'", r"href='https'", result )

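You could make the regexp a bit more robust, of course, so that it only
touches href inside <a> tags, tolerates stray whitespace and case
differences, and keeps whichever quote character the source used, e.g.: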

  result = re.sub(
      r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",
      r"\1\2NEW\2",
      source,
      flags=re.IGNORECASE
  )
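
To make that concrete, here is a minimal sketch with OLD and NEW filled in
from the example source. The helper name replace_href and its parameters are
my own invention for illustration, not anything from bs4 or re. It assumes
the old value should be matched literally (hence re.escape) and uses a
function replacement so that backslashes in the new value are not
interpreted as group references.

```python
import re

def replace_href(source, old, new):
    """Replace href values equal to `old` in <a> tags, keeping the quote style.

    Hypothetical helper: the name and signature are illustrative only.
    """
    pattern = re.compile(
        r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*"""
        + re.escape(old)          # match the old value literally
        + r"""\s*\2""",           # closing quote must match the opening one
        flags=re.IGNORECASE,
    )
    # Group 1 is everything up to the opening quote, group 2 is the quote.
    return pattern.sub(
        lambda m: m.group(1) + m.group(2) + new + m.group(2),
        source,
    )

source = '<a name="b" href="http" accesskey="c"></a>'
result = replace_href(source, "http", "https")
print(result)
```

Note that re.IGNORECASE applies to the old value as well as to the tag and
attribute names, so href="HTTP" would also be rewritten. Drop the flag and
spell out the case-insensitive parts if that matters.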

> # Now use BeautifulSoup only for the verification of the result.
>
> reference = bs4.BeautifulSoup( source, features="html.parser" )
> for a in reference.find_all( "a" ):
>     if a[ 'href' ]== 'http': a[ 'href' ]='https'
>
> print( bs4.BeautifulSoup( result, features="html.parser" )== reference )

Hmm, yes, that seems like a pretty good idea.
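
If bs4 isn't to hand, the same "edit with re, verify with a parser" check
can be sketched with the standard library's html.parser instead. This is a
swapped-in technique, not what the quoted code does: it compares the stream
of parse events rather than BeautifulSoup trees, and the names TagCollector,
parse_events and expected_events are made up for this example.

```python
import re
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record start tags, end tags and text so two documents can be compared."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

def parse_events(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.events

def expected_events(markup, old, new):
    """Apply the href change on the parsed side, to serve as the reference."""
    events = []
    for event in parse_events(markup):
        if event[0] == "start" and event[1] == "a":
            attrs = [(k, new if k == "href" and v == old else v)
                     for k, v in event[2]]
            event = ("start", "a", attrs)
        events.append(event)
    return events

source = '<a name="b" href="http" accesskey="c"></a>'
result = re.sub(r'href\s*=\s*"http"', 'href="https"', source)

# True when the regexp edit changed exactly what the parser-side edit changed.
print(parse_events(result) == expected_events(source, "http", "https"))
```

Like the bs4 version, this only tells you the parsed content matches; it
says nothing about whether the untouched bytes of the source survived,
which is what doing the edit with re.sub() is meant to take care of.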