Mutating an HTML file with BeautifulSoup

Chris Angelico rosuav at gmail.com
Sun Aug 21 23:30:58 EDT 2022


On Mon, 22 Aug 2022 at 10:04, Buck Evan <buck.2019 at gmail.com> wrote:
>
> I've had much success doing round trips through the lxml.html parser.
>
> https://lxml.de/lxmlhtml.html
>
> I ditched bs for lxml long ago and never regretted it.
>
> If you find that you have a bunch of invalid HTML that lxml inadvertently "fixes", I would recommend adding a stutter-step to your project: perform a no-op round trip through lxml on all files. I'd then analyze any diff by progressively excluding changes via `grep -vP`.
> Unless I'm mistaken, all such changes should fall into no more than a dozen groups.
>

Will this round-trip mutate every single file and reorder the tag
attributes? Because I really don't want to manually eyeball all those
changes.
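
One way to find out without eyeballing anything would be to script the
no-op round trip and let a diff report which files actually change. A
rough sketch, assuming lxml is installed; `lxml_roundtrip` and
`roundtrip_diff` are names made up for illustration, not part of lxml,
and I haven't checked what lxml does to attribute order:

```python
import difflib


def lxml_roundtrip(text: str) -> str:
    """Parse and re-serialize HTML with lxml.html, making no deliberate changes."""
    # Third-party import kept local so the diff helper works without lxml.
    import lxml.html

    # document_fromstring parses a whole document; a bare fragment
    # will come back wrapped in <html>/<body>.
    doc = lxml.html.document_fromstring(text)
    return lxml.html.tostring(doc, encoding="unicode")


def roundtrip_diff(text: str, roundtrip=lxml_roundtrip, name: str = "page.html") -> list[str]:
    """Return unified-diff lines between the original text and its round trip."""
    before = text.splitlines(keepends=True)
    after = roundtrip(text).splitlines(keepends=True)
    return list(difflib.unified_diff(before, after, name, name + " (round trip)"))
```

An empty list from `roundtrip_diff` means the file survived untouched.
Anything else is a diff that can be filtered pattern by pattern, in
the spirit of the `grep -vP` approach, instead of read by hand.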

ChrisA

