Mutating an HTML file with BeautifulSoup

Chris Angelico rosuav at gmail.com
Sat Aug 20 14:06:17 EDT 2022


On Sun, 21 Aug 2022 at 03:27, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
>
> 2QdxY4RzWzUUiLuE at potatochowder.com writes:
> >textual representations.  That way, the following two elements are the
> >same (and similar with a collection of sub-elements in a different order
> >in another document):
>
>   The /elements/ differ. They have the /same/ infoset.

That's the bit that's hard to prove.

>   The OP could edit the files with regexps to create a new version.

To you and Jon, who also suggested this: how would that be beneficial?
With Beautiful Soup, I have the line number and position within the
line where the tag starts; what does a regex give me that I don't have
that way?
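
For concreteness, here is a rough sketch of what that position information allows. It uses the standard library's html.parser directly rather than Beautiful Soup, so that it is self-contained; as far as I know, Beautiful Soup's html.parser builder exposes the same numbers as the `sourceline` and `sourcepos` attributes on each tag. The names `TagLocator` and `replace_start_tags` are made up for illustration, and the sketch assumes the whole document is fed to the parser in one call.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Record where each start tag of one kind begins in the source."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.hits = []  # (line, column, raw start tag text)

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted:
            # getpos() gives a 1-based line and a 0-based column.
            line, col = self.getpos()
            self.hits.append((line, col, self.get_starttag_text()))


def line_starts(source):
    """Absolute offset of the first character of every line."""
    starts = [0]
    for i, ch in enumerate(source):
        if ch == "\n":  # the parser counts only "\n" as a line break
            starts.append(i + 1)
    return starts


def replace_start_tags(source, wanted, new_tag_text):
    """Swap matching start tags, leaving every other byte untouched."""
    locator = TagLocator(wanted)
    locator.feed(source)
    locator.close()
    starts = line_starts(source)
    # Work backwards so that earlier offsets remain valid.
    for line, col, raw in reversed(locator.hits):
        begin = starts[line - 1] + col
        end = begin + len(raw)
        assert source[begin:end] == raw, "position does not match source"
        source = source[:begin] + new_tag_text + source[end:]
    return source


page = '<p>Hello\n  <IMG SRC=a.png  ><br>\n<P>unclosed\n'
edited = replace_start_tags(page, "img", '<img src="b.png">')
```

The point of splicing by offset is that the rest of the file, including the unclosed paragraph, is never re-serialized, so nothing outside the edited tag can change.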

>   Soup := BeautifulSoup.
>
>   Then have Soup read both the new version and the old version.
>
>   Then have Soup also edit the old version read in, the same way as
>   the regexps did and verify that now the old version edited by
>   Soup and the new version created using regexps agree.
>
>   Or just use Soup as a tool to show the diffs for visual inspection
>   by having Soup read both the original version and the version edited
>   with regexps. Now both are normalized by Soup and Soup can show the
>   diffs (such a diff feature might not be a part of Soup, but it should
>   not be too much effort to write one using Soup).
>

But as mentioned, the entire problem *is* the normalization, as I have
no proof that it has had no impact on the rendering of the page.
Comparing two normalized versions is no better than my original option
1, whereby I simply ignore the normalization and write out the
reconstructed content.
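
To illustrate why, here is a deliberately crude sketch. The `normalize` function below is a toy of my own built on the standard library's html.parser, NOT what Beautiful Soup does (Beautiful Soup does not collapse whitespace, for one). It only shows the general shape of the problem: whatever a normalizer throws away is invisible to any comparison made after it.

```python
from html.parser import HTMLParser


class EventNormalizer(HTMLParser):
    """Toy normalizer: reduce markup to a list of parse events."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; sort the attributes.
        self.events.append(("start", tag, tuple(sorted(attrs))))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Collapse runs of whitespace and drop whitespace-only text.
        if data.strip():
            self.events.append(("data", " ".join(data.split())))


def normalize(source):
    parser = EventNormalizer()
    parser.feed(source)
    parser.close()
    return parser.events


# Harmless differences: case, quoting, attribute order.
harmless = normalize('<A HREF="x"   title=y>link</A>') == normalize(
    "<a title='y' href='x'>link</a>"
)

# Not harmless: whitespace inside <pre> is significant when rendered,
# yet the toy normalizer makes these two compare equal as well.
hidden = normalize("<pre>a    b</pre>") == normalize("<pre>a b</pre>")
```

Both comparisons report "equal", but only the first pair renders the same. The comparison alone cannot tell those two cases apart.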

It's easy if you know for certain that the page is well-formed. It's
much harder if you do not, or, as in some cases, if you know the page
is badly formed.

ChrisA

