Mutating an HTML file with BeautifulSoup

Barry barry at barrys-emacs.org
Sun Aug 21 03:25:58 EDT 2022



> On 19 Aug 2022, at 22:04, Chris Angelico <rosuav at gmail.com> wrote:
> 
> On Sat, 20 Aug 2022 at 05:12, Barry <barry at barrys-emacs.org> wrote:
>> 
>> 
>> 
>>>> On 19 Aug 2022, at 19:33, Chris Angelico <rosuav at gmail.com> wrote:
>>> 
>>> What's the best way to precisely reconstruct an HTML file after
>>> parsing it with BeautifulSoup?
>> 
>> I recall that bs4 parses into an object tree and loses the detail of the input.
>> I recently ported from very old bs to bs4 and hit the same issue.
>> So no, it will not output the same as went in.
>> 
>> If you can trust the input to parse as XML, meaning all the rules for closing
>> tags have been followed, then I think you can parse and unparse through XML to
>> do what you want.
>> 
> 
> 
> Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
> well. Thanks for trying, anyhow.
> 
> So I'm left with a few options:
> 
> 1) Give up on validation, give up on verification, and just run this
> thing on the production site with my fingers crossed

Can you build a beta site, leaving the original intact?

I also wonder whether using Selenium to walk the site might work as a verification step.
I cannot recall whether you can capture an image of the browser window to compare against, to look for rendering differences.
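
For what it is worth, Selenium's Python bindings do expose a screenshot call (get_screenshot_as_png() on the driver). A rough sketch of the verification idea follows. It assumes a working Firefox driver setup, and the base URLs and the helper names capture_png and changed_pages are made up for illustration. Comparing the raw PNG bytes is a deliberately crude test: any dynamic content on a page will show up as a difference, and a pixel-level diff with an imaging library would be more forgiving.

```python
def capture_png(url: str) -> bytes:
    """Load `url` in a real browser and return a PNG of the visible window."""
    # Imported lazily so the comparison helper below can be used
    # (and tested) on a machine without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        driver.get(url)
        return driver.get_screenshot_as_png()
    finally:
        driver.quit()


def changed_pages(paths, live_base, beta_base, capture=capture_png):
    """Return the paths whose screenshots differ between the two sites.

    `capture` is any callable mapping a URL to bytes; it is a parameter
    so the comparison logic can be exercised without a browser.
    """
    changed = []
    for path in paths:
        # Crude check: the two screenshots must be byte-identical.
        if capture(live_base + path) != capture(beta_base + path):
            changed.append(path)
    return changed
```

Calling changed_pages(["/", "/about.html"], "https://www.example.com", "https://beta.example.com") would then list the pages worth looking at by eye.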

In the one task where I used bs4, I did not see it produce any bad results.
In my case the problems were in the code built on bs1, which relied on bad assumptions.



> 2) Instead of doing an intelligent reconstruction, just str.replace()
> one URL with another within the file
> 3) Split the file into lines, find the Nth line (elem.sourceline) and
> str.replace that line only
> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
> of the tag, manually find the end, and replace one tag with the
> reconstructed form.
> 
> I'm inclined to the first option, honestly. The others just seem like
> hard work, and I became a programmer so I could be lazy...
> 
> ChrisA
> -- 
> https://mail.python.org/mailman/listinfo/python-list
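
Option 3 in the quoted list need not be much work. A minimal sketch, assuming the URL sits on the same line where the tag starts (which is the line elem.sourceline reports, counting from 1) and that the file has already been read into a string. The name replace_on_line is hypothetical. Splitting on "\n" only, rather than using splitlines(), keeps any "\r" characters in place so every untouched line comes back exactly as it went in.

```python
def replace_on_line(text: str, lineno: int, old: str, new: str) -> str:
    """Replace `old` with `new` on the 1-based line `lineno` of `text` only.

    All other lines are returned unchanged, including their line endings.
    """
    # Split on "\n" alone so that joining with "\n" reproduces the input exactly.
    lines = text.split("\n")
    if not 1 <= lineno <= len(lines):
        raise ValueError(f"line {lineno} is out of range (1..{len(lines)})")
    if old not in lines[lineno - 1]:
        # Fail loudly rather than silently leaving the file untouched.
        raise ValueError(f"{old!r} not found on line {lineno}")
    lines[lineno - 1] = lines[lineno - 1].replace(old, new)
    return "\n".join(lines)
```

With a tag found by BeautifulSoup, the call would be along the lines of replace_on_line(html_text, elem.sourceline, old_url, new_url). The ValueError is there so a tag that spans several lines gets noticed instead of skipped.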


