Mutating an HTML file with BeautifulSoup

Peter J. Holzer hjp-python at hjp.at
Mon Aug 22 13:56:52 EDT 2022


On 2022-08-22 00:09:01 -0000, Jon Ribbens via Python-list wrote:
> On 2022-08-21, Peter J. Holzer <hjp-python at hjp.at> wrote:
> > On 2022-08-20 21:51:41 -0000, Jon Ribbens via Python-list wrote:
> >>   result = re.sub(
> >>       r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",
> >
> > This will fail on:
> >     <a alt="42 > 23" href="the.answer.html">
> 
> I've seen *a lot* of bad/broken/weird HTML over the years, and I don't
> believe I've ever seen anyone do that. (Wrongly putting an 'alt'
> attribute on an 'a' element is very common, on the other hand ;-) )

My bad. I meant title, not alt, of course. The unescaped > is completely
standards-conforming HTML, however (both HTML 4.01 Strict and HTML5).
You almost never have to escape > (in fact I can't think of any case
right now), and I generally don't. Sometimes I do for symmetry with <,
but that's an aesthetic choice, not a technical one.
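
To make the failure concrete, here is a minimal sketch. Only the pattern
is taken from the quoted code; the rest of the original re.sub() call is
not shown above, so the replacement function, the names OLD, NEW and
replace_href, and the file names are assumptions for illustration.

```python
import re

OLD = "the.answer.html"  # hypothetical old link target
NEW = "new.html"         # hypothetical new link target

# The quoted pattern, with OLD filled in via re.escape().
PATTERN = re.compile(
    r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*""" + re.escape(OLD) + r"""\s*\2"""
)


def replace_href(html):
    """Rewrite href=OLD to href=NEW, keeping the original quote character."""
    return PATTERN.sub(
        lambda m: m.group(1) + m.group(2) + NEW + m.group(2), html
    )


simple = '<a href="the.answer.html">'
tricky = '<a title="42 > 23" href="the.answer.html">'

print(replace_href(simple))  # the href is rewritten
print(replace_href(tricky))  # unchanged: [^>]* cannot cross the > in title
```

The second tag is left alone because [^>]* stops at the first >, which
here sits inside the quoted title value, before href is ever reached.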


> > The problem can be solved with regular expressions (and given the
> > constraints I think I would prefer that to using Beautiful Soup), but
> > getting the regexps right is not trivial, at least in the general case.
> 
> I would like to see the regular expression that could fully parse
> general HTML...

That depends on what you mean by "parse".

If you mean "construct a DOM tree", you can't, since regular expressions
(in the mathematical sense, not what's implemented by some programming
languages) by definition describe finite automata, and those don't
support recursion.
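
As a rough illustration of where the limit lies: a regular expression
can find the individual tags, but keeping track of how they nest needs
state outside the expression, for example a stack. The sketch below is
only that; the names TAG and is_properly_nested are made up here, and it
deliberately ignores void elements such as <br>, comments and optional
end tags.

```python
import re

# Matches one start or end tag; quoted attribute values may contain ">".
TAG = re.compile(r"""<(/?)([A-Za-z][A-Za-z0-9]*)(?:[^>"']|"[^"]*"|'[^']*')*>""")


def is_properly_nested(html):
    """Check tag nesting with an explicit stack (the non-regular part)."""
    stack = []
    for m in TAG.finditer(html):
        closing, name = m.group(1), m.group(2).lower()
        if not closing:
            stack.append(name)          # start tag: remember it
        elif not stack or stack.pop() != name:
            return False                # end tag without matching start tag
    return not stack                    # anything left open is an error


print(is_properly_nested("<div><div><p>x</p></div></div>"))
print(is_properly_nested("<div><p></div></p>"))
```

The regex does the tokenizing; the unbounded nesting depth is handled by
the stack, which is exactly the piece a finite automaton lacks.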

But if you mean "split into a sequence of tags and runs of PCDATA (and
then split each tag further into its attributes)", that's absolutely
possible, and that's all that is needed here. I don't think I have ever
implemented a complete solution (if only because stuff like
<![CDATA[...]]> is extremely rare), but I should have some Perl code
lying around which worked on a wide variety of HTML. I just have to find
it again ...
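
For illustration, a sketch of that kind of splitting in Python. This is
not the Perl code mentioned above, and it is far from complete: the
names TOKEN, ATTR, parse_attrs and tokenize are invented for this
example, and doctype declarations, CDATA sections and the special rules
for script and style content are not handled.

```python
import re

# A token is a comment, a tag, or a run of text. Inside a tag, quoted
# attribute values are matched as a unit, so they may contain ">".
TOKEN = re.compile(
    r"""
    (?P<comment><!--.*?-->)
  | (?P<tag><(?P<close>/?)(?P<name>[A-Za-z][A-Za-z0-9:-]*)
        (?P<attrs>(?:[^>"']|"[^"]*"|'[^']*')*)>)
  | (?P<text>[^<]+|<)
    """,
    re.VERBOSE | re.DOTALL,
)

# One attribute: a name, optionally followed by = and a double-quoted,
# single-quoted or unquoted value.
ATTR = re.compile(
    r"""([^\s"'=<>/]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))?"""
)


def parse_attrs(attr_text):
    """Split the inside of a tag into a dict of attribute name -> value."""
    attrs = {}
    for m in ATTR.finditer(attr_text):
        value = next((g for g in m.group(2, 3, 4) if g is not None), "")
        attrs.setdefault(m.group(1).lower(), value)  # first occurrence wins
    return attrs


def tokenize(html):
    """Yield (kind, data, attrs) tuples in document order."""
    for m in TOKEN.finditer(html):
        if m.group("tag") is not None:
            kind = "end" if m.group("close") else "start"
            yield kind, m.group("name").lower(), parse_attrs(m.group("attrs"))
        elif m.group("comment") is not None:
            yield "comment", m.group("comment"), {}
        else:
            yield "text", m.group("text"), {}


sample = '<p>See <a title="42 > 23" href="the.answer.html">this</a></p>'
for token in tokenize(sample):
    print(token)
```

With the tags split this way, changing one href means rewriting a single
token and writing everything else back out unchanged.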

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp at hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"