HTML extraction

Tue Dec 7 15:55:26 EST 2021

Hello,

On Tue, Dec 7, 2021 at 20:08, Chris Angelico (rosuav at gmail.com) wrote:

> On Wed, Dec 8, 2021 at 4:55 AM Julius Hamilton
> <juliushamilton100 at gmail.com> wrote:
> >
> > Hey,
> >
> > Could anyone please comment on the purest way simply to strip HTML tags
> > from the internal text they surround?
> >
> > I know Beautiful Soup is a convenient tool, but I’m interested to know
> > what the most minimal way to do it would be.
>
> That's definitely the best and most general way, and would still be my
> first thought most of the time.
>
> > People say you usually don’t use Regex for a second order language like
> > HTML, so I was thinking about using xpath or lxml, which seem like very
> > pure, universal tools for the job.
> >
> > I did find an example for doing this with the re module, though.
> >
> > Would it be fair to say that to just strip the tags, Regex is fine, but
> > you need to build a tree-like object if you want the ability to select
> > which nodes to keep and which to discard?
>
> Obligatory reference:
>
>
> https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
>
> > Can xpath / lxml do that?
> >
> > What are the chief differences between xpath / lxml and Beautiful Soup?
> >
>
> I've never directly used lxml, mainly because bs4 offers all the same
> advantages and more, with about the same costs. However, if you're
> looking for a no-external-deps option, Python *does* include an HTML
> parser in the standard library:
>
>
But isn't bs4 only for SOAP content?
Can bs4 or lxml cope with HTML that is not well-formed XML, such as the
following fragment?

<p>A
<p>B
<hr>

BR,
Roland
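
For what it's worth, the fragment above can at least be fed to the standard
library parser mentioned in the quoted text below. This is a minimal sketch,
not a statement about how bs4 or lxml behave internally: `EventLogger` and
`log_events` are hypothetical names made up for illustration, and the code
only records which callbacks `html.parser.HTMLParser` fires for unclosed
<p> and <hr> tags.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records the parser callbacks in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text such as the newlines between tags.
        if data.strip():
            self.events.append(("data", data.strip()))


def log_events(html):
    """Feed html to the parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events


fragment = "<p>A\n<p>B\n<hr>\n"
for event in log_events(fragment):
    print(event)
```

The parser does not raise on the missing end tags; it simply never reports
an "end" event for them. Deciding where each <p> implicitly ends is left to
the caller, or to a tree builder layered on top.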

> https://docs.python.org/3/library/html.parser.html
>
> If your purpose is extremely simple (like "strip tags, search for
> text"), then it should be easy enough to whip up something using that
> module. No external deps, not a lot of code, pretty straightforward.
> On the other hand, if you're trying to do an "HTML to text"
> conversion, you'd probably need to be aware of which tags are
> block-level and which are inline content, so that (for instance)
> "<div>Hello</div> <div>world</div>" would come out as two separate
> paragraphs of text, whereas the same thing with <b> tags would become
> just "Hello world". But for the most part, handle_data will probably
> do everything you need.
>
> ChrisA
>
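
The handle_data approach described above might look roughly like the sketch
below. It is only one possible reading of that suggestion: `TextExtractor`,
`html_to_text`, and the `BLOCK_TAGS` set are hypothetical names introduced
here, and the list of block-level tags is a small, incomplete assumption
rather than anything taken from the HTML specification.

```python
from html.parser import HTMLParser

# Assumption: a short, incomplete list of tags treated as block-level.
BLOCK_TAGS = {"div", "p", "br", "hr", "li", "h1", "h2", "h3",
              "pre", "blockquote"}


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text, breaking lines at block tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # All text between tags arrives here, tags already stripped.
        self.parts.append(data)


def html_to_text(html):
    """Strip tags; put the content of each block-level element on its own line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


print(html_to_text("<div>Hello</div> <div>world</div>"))
print(html_to_text("<b>Hello</b> <b>world</b>"))
```

With the <div> input the two words end up on separate lines, while the <b>
input stays on one line as "Hello world". Note that this sketch does not
skip the contents of <script> or <style> elements, which a real HTML-to-text
conversion would probably want to do.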