HTML extraction

Chris Angelico rosuav at gmail.com
Tue Dec 7 13:07:36 EST 2021


On Wed, Dec 8, 2021 at 4:55 AM Julius Hamilton
<juliushamilton100 at gmail.com> wrote:
>
> Hey,
>
> Could anyone please comment on the purest way simply to strip HTML tags
> from the internal text they surround?
>
> I know Beautiful Soup is a convenient tool, but I’m interested to know what
> the most minimal way to do it would be.

Beautiful Soup is definitely the best and most general way, and it would
still be my first thought most of the time.

> People say you usually don’t use Regex for a second order language like
> HTML, so I was thinking about using xpath or lxml, which seem like very
> pure, universal tools for the job.
>
> I did find an example for doing this with the re module, though.
>
> Would it be fair to say that to just strip the tags, Regex is fine, but you
> need to build a tree-like object if you want the ability to select which
> nodes to keep and which to discard?

Obligatory reference:

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

> Can xpath / lxml do that?
>
> What are the chief differences between xpath / lxml and Beautiful Soup?
>

I've never directly used lxml, mainly because bs4 offers all the same
advantages and more, with about the same costs. However, if you're
looking for a no-external-deps option, Python *does* include an HTML
parser in the standard library:

https://docs.python.org/3/library/html.parser.html

If your purpose is extremely simple (like "strip tags, search for
text"), then it should be easy enough to whip up something using that
module. No external deps, not a lot of code, pretty straightforward.
On the other hand, if you're trying to do an "HTML to text"
conversion, you'd probably need to be aware of which tags are
block-level and which are inline content, so that (for instance)
"<div>Hello</div> <div>world</div>" would come out as two separate
paragraphs of text, whereas the same thing with <b> tags would become
just "Hello world". But for the most part, overriding the parser's
handle_data method will probably do everything you need.
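
As a rough sketch of the html.parser approach described above (the names
TextExtractor, strip_tags and BLOCK_TAGS are made up for illustration, and
the list of block-level tags is an assumption, nowhere near complete),
something like this would probably be enough for simple cases:

```python
from html.parser import HTMLParser

# Hypothetical, deliberately incomplete set of block-level tags.
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "tr"}


class TextExtractor(HTMLParser):
    """Collect the text between tags and throw the tags away."""

    def __init__(self, block_tags=()):
        # convert_charrefs defaults to True, so entities such as &amp;
        # reach handle_data already decoded.
        super().__init__()
        self.block_tags = set(block_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Treat the start of a block-level tag as a line break.
        if tag in self.block_tags:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # Likewise for the end of a block-level tag.
        if tag in self.block_tags:
            self.parts.append("\n")

    def handle_data(self, data):
        # Called for every run of text outside of tags.
        self.parts.append(data)


def strip_tags(html, block_tags=()):
    """Return the text content of an HTML string, without the tags."""
    parser = TextExtractor(block_tags)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags("<b>Hello</b> <b>world</b>"))  # prints: Hello world
```

Passing BLOCK_TAGS as the second argument puts "Hello" and "world" from
the <div> example on separate lines instead of the same one. Note that
handle_data also receives the contents of <script> and <style> elements,
so a real converter would likely want to skip those.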

ChrisA

