HTML extraction

Tue Dec 7 06:53:42 EST 2021

Hey,

Could anyone please comment on the cleanest way to strip HTML tags and
keep only the text they surround?

I know Beautiful Soup is a convenient tool, but I’m interested to know what
the most minimal way to do it would be.

People say you usually shouldn’t use regular expressions on a nested,
non-regular language like HTML, so I was thinking about using XPath or
lxml, which seem like very pure, general-purpose tools for the job.

I did find an example for doing this with the re module, though.
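
To make that concrete, the regex approach I have in mind looks roughly like
the sketch below. This is my own minimal reconstruction, not the exact
example I found; the names TAG_RE and strip_tags are made up for
illustration. It only uses re and html.unescape from the standard library.

```python
import re
from html import unescape

# Anything from "<" up to the next ">" is treated as a tag.
# Assumption: the markup is simple and reasonably well formed.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup: str) -> str:
    """Delete everything that looks like a tag, then decode entities."""
    return unescape(TAG_RE.sub("", markup))


print(strip_tags("<p>Hello, <b>world</b> &amp; friends</p>"))
# prints: Hello, world & friends
```

As far as I can tell, this keeps the contents of <script> and <style>
elements as if they were ordinary text, and it can be fooled by a ">"
inside an attribute value or a comment, which is presumably why people
warn against it.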

Would it be fair to say that a regex is fine if you just want to strip the
tags, but you need to build a tree-like object if you want to select which
nodes to keep and which to discard?
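
By "select which nodes to discard" I mean something like the sketch below.
Note that it is not XPath and does not build a tree: it uses the standard
library's event-based html.parser.HTMLParser, which was the smallest thing
I could get to work. The names TextExtractor and extract_text are my own,
and the default skip list is just an assumption about what one would
usually want to drop.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, ignoring everything inside the 'skip' elements."""

    def __init__(self, skip=("script", "style")):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        # 'skip' is meant for container elements that have a closing tag.
        self.skip = set(skip)
        self.depth = 0   # greater than 0 while inside a skipped element
        self.parts = []  # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.skip:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.parts.append(data)


def extract_text(markup, skip=("script", "style")):
    parser = TextExtractor(skip)
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(extract_text("<p>Hi</p><script>x=1</script><p>there</p>"))
# prints: Hithere
```

That handles "drop these elements entirely", but anything more selective
(say, keeping only the text under one particular element) seems to need a
real tree, which is what prompts the next question.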

Can XPath / lxml do that kind of selection?

What are the chief differences between XPath / lxml and Beautiful Soup?

Thanks,
Julius