Mutating an HTML file with BeautifulSoup

Chris Angelico rosuav at gmail.com
Fri Aug 19 17:01:09 EDT 2022


On Sat, 20 Aug 2022 at 05:12, Barry <barry at barrys-emacs.org> wrote:
>
> > On 19 Aug 2022, at 19:33, Chris Angelico <rosuav at gmail.com> wrote:
> >
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
>
> As I recall, bs4 parses the input into an object tree and discards the
> formatting details of the original. I recently ported from a very old bs
> to bs4 and hit the same issue. So no, it will not output exactly what went in.
>
> If you can trust the input to parse as XML, meaning all the rules for closing
> tags have been followed, then I think you can parse and unparse through XML to
> do what you want.
>


Yeah, no, I can't; this is HTML 4 with a ton of inconsistencies. Oh
well. Thanks for trying, anyhow.

So I'm left with a few options:

1) Give up on validation, give up on verification, and just run this
thing on the production site with my fingers crossed
2) Instead of doing an intelligent reconstruction, just str.replace()
one URL with another within the file
3) Split the file into lines, find the Nth line (elem.sourceline), and
apply str.replace() to that line only
4) Attempt to use elem.sourceline and elem.sourcepos to find the start
of the tag, manually find the end, and replace one tag with the
reconstructed form.
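
As a rough sketch of options 3 and 4: this assumes the tree was built
with the html.parser backend, where (as far as I know) elem.sourceline
is 1-based and elem.sourcepos is a 0-based column pointing at the tag's
opening '<'. The helper names are hypothetical, and the end-of-tag scan
is naive: it only understands quoted attribute values, so a stray quote
character in an unquoted value would confuse it.

```python
def replace_on_line(html, sourceline, old, new):
    """Option 3: str.replace() restricted to one 1-based line.

    Only helps if the text to change sits on the same line as the
    start of the tag; a tag wrapped across lines would be missed.
    """
    lines = html.split("\n")
    lines[sourceline - 1] = lines[sourceline - 1].replace(old, new)
    return "\n".join(lines)


def find_tag_end(html, start):
    """Return the index just past the '>' that closes the tag at start.

    A '>' inside a quoted attribute value is skipped.
    """
    quote = None
    for i in range(start, len(html)):
        ch = html[i]
        if quote:
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == ">":
            return i + 1
    raise ValueError("no closing '>' found for tag at offset %d" % start)


def replace_tag_at(html, sourceline, sourcepos, new_tag):
    """Option 4: swap the start tag at (sourceline, sourcepos) for new_tag.

    Everything outside that one tag is left byte-for-byte untouched.
    """
    # Turn (line, column) into an absolute offset by walking newlines.
    start = 0
    for _ in range(sourceline - 1):
        start = html.index("\n", start) + 1
    start += sourcepos
    if html[start] != "<":
        raise ValueError("position does not point at the start of a tag")
    end = find_tag_end(html, start)
    return html[:start] + new_tag + html[end:]
```

With soup = BeautifulSoup(html, "html.parser"), each matching elem would
be fed in as replace_tag_at(html, elem.sourceline, elem.sourcepos,
new_tag), where new_tag is whatever reconstructed start tag you build.
When patching several tags in one file, going from the last match to the
first keeps the earlier positions valid, since each replacement only
shifts the text after it.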

I'm inclined to the first option, honestly. The others just seem like
hard work, and I became a programmer so I could be lazy...

ChrisA

