Mutating an HTML file with BeautifulSoup

Jon Ribbens jon+usenet at unequivocal.eu
Sun Aug 21 20:09:01 EDT 2022


On 2022-08-21, Peter J. Holzer <hjp-python at hjp.at> wrote:
> On 2022-08-20 21:51:41 -0000, Jon Ribbens via Python-list wrote:
>> On 2022-08-20, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
>> > Jon Ribbens <jon+usenet at unequivocal.eu> writes:
>> >>... or you could avoid all that faff and just do re.sub()?
>
>> > source = '<a name="b" href="http" accesskey="c"></a>'
>> >
>> > # Use Python to change the source, keeping the order of attributes.
>> >
>> > result = re.sub( r'href\s*=\s*"http"', r'href="https"', source )
>> > result = re.sub( r"href\s*=\s*'http'", r"href='https'", result )
>
> Depending on the content of the site, this might replace some stuff
> which is not a link.
>
>> You could go a bit harder with the regexp of course, e.g.:
>> 
>>   result = re.sub(
>>       r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",
>
> This will fail on:
>     <a alt="42 > 23" href="the.answer.html">

I've seen *a lot* of bad/broken/weird HTML over the years, and I don't
believe I've ever seen anyone do that. (Wrongly putting an 'alt'
attribute on an 'a' element is very common, on the other hand ;-) )
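
For anyone who wants to check that failure case, here is a minimal
sketch. The pattern is the one quoted above with OLD swapped for a
sample URL. The replacement string is an assumption, since the quoted
call is cut off before its remaining arguments.

```python
import re

# Pattern as quoted above, with the literal OLD replaced by a sample URL.
PATTERN = re.compile(
    r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*the\.answer\.html\s*\2"""
)

# Assumed replacement (not in the quoted text): keep the prefix and the
# original quote character, swap only the URL.
REPLACEMENT = r"\g<1>\g<2>new.html\g<2>"

plain = '<a name="b" href="the.answer.html">'
tricky = '<a alt="42 > 23" href="the.answer.html">'

# [^>]* cannot cross the '>' inside the alt value, so the second tag
# never reaches its href and is left untouched.
plain_result = PATTERN.sub(REPLACEMENT, plain)
tricky_result = PATTERN.sub(REPLACEMENT, tricky)

print(plain_result)
print(tricky_result)
```

The first tag is rewritten; the second comes back unchanged.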

> The problem can be solved with regular expressions (and given the
> constraints I think I would prefer that to using Beautiful Soup), but
> getting the regexps right is not trivial, at least in the general case.

I would like to see the regular expression that could fully parse
general HTML...
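
For comparison, a rough sketch of what a real parser does with the same
tag, using the standard library's html.parser rather than Beautiful
Soup. The HrefCollector class is a made-up name for illustration. It
only reads the href values and makes no attempt to rewrite the file or
preserve attribute order.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs in source order.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


collector = HrefCollector()
collector.feed('<a alt="42 > 23" href="the.answer.html">')
collector.close()

print(collector.hrefs)
```

The quoted '>' inside the alt value does not end the tag here, so the
href is still found.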

