Mutating an HTML file with BeautifulSoup
Chris Angelico
rosuav at gmail.com
Sat Aug 20 14:06:17 EDT 2022
On Sun, 21 Aug 2022 at 03:27, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
>
> 2QdxY4RzWzUUiLuE at potatochowder.com writes:
> >textual representations. That way, the following two elements are the
> >same (and similar with a collection of sub-elements in a different order
> >in another document):
>
> The /elements/ differ. They have the /same/ infoset.

That's the bit that's hard to prove.

> The OP could edit the files with regexps to create a new version.

To you and Jon, who also suggested this: how would that be beneficial?
With Beautiful Soup, I have the line number and position within the
line where the tag starts; what does a regex give me that I don't have
that way?
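
For anyone following along, here is a minimal sketch of the kind of position information I mean. As far as I know, Beautiful Soup with the html.parser builder exposes it as the sourceline and sourcepos attributes on each Tag; to avoid depending on a particular bs4 version, the sketch below uses the standard library's HTMLParser.getpos() directly. The TagLocator class and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Hypothetical helper: record (tag, line, column) for every start tag."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() reports where the "<" of the current tag sits:
        # the line number is 1-based, the column offset is 0-based.
        line, col = self.getpos()
        self.positions.append((tag, line, col))


# Sample input, deliberately sloppy: the <p> and <html> are never closed.
html = "<html>\n<body>\n  <p>one<a href='x'>link</a>\n</body>\n"

locator = TagLocator()
locator.feed(html)
locator.close()

for tag, line, col in locator.positions:
    print(f"<{tag}> starts at line {line}, column {col}")
```

A regex match object would give me a character offset, which is the same information in a different shape.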
> Soup := BeautifulSoup.
>
> Then have Soup read both the new version and the old version.
>
> Then have Soup also edit the old version read in, the same way as
> the regexps did and verify that now the old version edited by
> Soup and the new version created using regexps agree.
>
> Or just use Soup as a tool to show the diffs for visual inspection
> by having Soup read both the original version and the version edited
> with regexps. Now both are normalized by Soup and Soup can show the
> diffs (such a diff feature might not be a part of Soup, but it should
> not be too much effort to write one using Soup).
>
But as mentioned, the entire problem *is* the normalization, as I have
no proof that it has had no impact on the rendering of the page.
Comparing two normalized versions is no better than my original option
1, whereby I simply ignore the normalization and write out the
reconstructed content.

It's easy if you know for certain that the page is well-formed. Much
harder if you do not - or, as in some cases, if you know the page is
badly-formed.
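
One way to keep the normalization out of the picture, sketched here under assumptions rather than offered as a finished tool: use the parser only to locate a tag, then splice the replacement into the original text, so every other byte of the file (including the badly-formed parts) is written back untouched. The names FirstTagFinder and replace_start_tag are hypothetical, and this does nothing to prove the parser found the right tag in a malformed page; it only limits the change to one span.

```python
from html.parser import HTMLParser


class FirstTagFinder(HTMLParser):
    """Hypothetical helper: remember where the first matching start tag begins."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.found = None  # becomes (line, col, raw_tag_text) once located

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted and self.found is None:
            line, col = self.getpos()
            # get_starttag_text() returns the tag exactly as written in the source.
            self.found = (line, col, self.get_starttag_text())


def replace_start_tag(source, wanted, new_tag):
    """Swap the first <wanted ...> start tag for new_tag, leaving the rest as-is."""
    finder = FirstTagFinder(wanted)
    finder.feed(source)
    finder.close()
    if finder.found is None:
        return source
    line, col, raw = finder.found
    # Convert (line, col) to a character offset; the parser counts "\n" only.
    start = 0
    for _ in range(line - 1):
        start = source.index("\n", start) + 1
    start += col
    return source[:start] + new_tag + source[start + len(raw):]


# Badly-formed sample: unclosed <p> tags and misnested <b>/<i>.
original = "<p>one<p>two\n<b><i>bad nesting</b></i>\n<a href=old>link</a>\n"
edited = replace_start_tag(original, "a", '<a href="new">')
print(edited)
```

Everything before and after the replaced tag is identical to the input, so whatever the browser made of the broken markup before, it still gets the same broken markup.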
ChrisA