Mutating an HTML file with BeautifulSoup

Chris Angelico rosuav at gmail.com
Sun Aug 21 04:09:32 EDT 2022


On Sun, 21 Aug 2022 at 17:26, Barry <barry at barrys-emacs.org> wrote:
>
>
>
> > On 19 Aug 2022, at 22:04, Chris Angelico <rosuav at gmail.com> wrote:
> >
> > On Sat, 20 Aug 2022 at 05:12, Barry <barry at barrys-emacs.org> wrote:
> >>
> >>
> >>
> >>>> On 19 Aug 2022, at 19:33, Chris Angelico <rosuav at gmail.com> wrote:
> >>>
> >>> What's the best way to precisely reconstruct an HTML file after
> >>> parsing it with BeautifulSoup?
> >>
> >> I recall that in bs4 it parses into an object tree and loses the detail of the input.
> >> I recently ported from very old bs to bs4 and hit the same issue.
> >> So no it will not output the same as went in.
> >>
> >> If you can trust the input to be parsed as xml, meaning all the rules of closing
> >> tags have been followed, then I think you can parse and unparse thru xml to
> >> do what you want.
> >>
> >
> >
> > Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
> > well. Thanks for trying, anyhow.
> >
> > So I'm left with a few options:
> >
> > 1) Give up on validation, give up on verification, and just run this
> > thing on the production site with my fingers crossed
>
> Can you build a beta site with the original intact?

In a naive way, a full copy would be quite a few gigabytes. I could
cut that down a good bit by taking only HTML files and the things they
reference, but then we run into the same problem of broken links,
which is what we're here to solve in the first place.

But I would certainly not want to run two copies of the site and then
manually compare.

> Also wonder if using selenium to walk the site may work as a verification step?
> I cannot recall if you can get an image of the browser window to do image compares with to look for rendering differences.

Image recognition won't necessarily even be valid; some of the changes
will have visual consequences (eg a broken image reference now
becoming correct), and as soon as that happens, the whole document can
reflow.
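
A structural comparison is one way to sidestep rendering differences:
parse the file before and after the change and check that the only
things that moved are the link attributes. What follows is a rough
sketch, not a tested tool. It assumes that only href and src values
are expected to change, it uses the standard library's html.parser
rather than BeautifulSoup, and the names EventCollector, html_events
and first_difference are made up for illustration.

```python
from html.parser import HTMLParser

# Assumption for this sketch: these are the only attributes whose
# values the rewrite is allowed to change.
LINK_ATTRS = {"href", "src"}


class EventCollector(HTMLParser):
    """Record the parse as a flat list of events, masking link values."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Keep attribute order, but hide the values we expect to differ.
        masked = tuple(
            (name, "*" if name in LINK_ATTRS else value)
            for name, value in attrs
        )
        self.events.append(("start", tag, masked))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Merge adjacent text chunks so chunk boundaries don't matter.
        if self.events and self.events[-1][0] == "text":
            self.events[-1] = ("text", self.events[-1][1] + data)
        else:
            self.events.append(("text", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))


def html_events(source):
    """Return the event list for one HTML document (as a str)."""
    collector = EventCollector()
    collector.feed(source)
    collector.close()  # flush any buffered trailing text
    return collector.events


def first_difference(before, after):
    """Return (index, event_before, event_after) or None if equivalent."""
    a, b = html_events(before), html_events(after)
    for index, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return index, x, y
    if len(a) != len(b):
        index = min(len(a), len(b))
        return (
            index,
            a[index] if index < len(a) else None,
            b[index] if index < len(b) else None,
        )
    return None
```

This compares what the tokenizer sees, not what a browser would build
in quirks mode, so a None result only says that the tags, attributes
and text line up apart from the masked link values.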

> From my one task using bs4 I did not see it produce any bad results.
> In my case the problems were in the code that built on bs1 using bad assumptions.

Did that get run on perfect HTML, or on messy real-world stuff that
uses quirks mode?

ChrisA

