HTML extraction

Tue Dec 7 16:57:06 EST 2021

On Wed, Dec 8, 2021 at 7:55 AM Roland Mueller
<roland.em0001 at googlemail.com> wrote:
>
> Hello,
>
> On Tue, Dec 7, 2021 at 20:08, Chris Angelico (rosuav at gmail.com) wrote:
>>
>> On Wed, Dec 8, 2021 at 4:55 AM Julius Hamilton
>> <juliushamilton100 at gmail.com> wrote:
>> >
>> > Hey,
>> >
>> > Could anyone please comment on the purest way simply to strip HTML tags
>> > from the internal text they surround?
>> >
>> > I know Beautiful Soup is a convenient tool, but I’m interested to know what
>> > the most minimal way to do it would be.
>>
>> That's definitely the best and most general way, and would still be my
>> first thought most of the time.
>>
>> > People say you usually don’t use Regex for a second order language like
>> > HTML, so I was thinking about using xpath or lxml, which seem like very
>> > pure, universal tools for the job.
>> >
>> > I did find an example for doing this with the re module, though.
>> >
>> > Would it be fair to say that to just strip the tags, Regex is fine, but you
>> > need to build a tree-like object if you want the ability to select which
>> > nodes to keep and which to discard?
>>
>> Obligatory reference:
>>
>> https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
>>
>> > Can xpath / lxml do that?
>> >
>> > What are the chief differences between xpath / lxml and Beautiful Soup?
>> >
>>
>> I've never directly used lxml, mainly because bs4 offers all the same
>> advantages and more, with about the same costs. However, if you're
>> looking for a no-external-deps option, Python *does* include an HTML
>> parser in the standard library:
>>
>
> But isn't bs4 only for SOAP content?
> Can bs4 or lxml cope with HTML code that does not comply with XML, such as the following fragment?
>
> <p>A
> <p>B
> <hr>
>
> BR,
> Roland
>

Check out the bs4 docs for some of the things you can do with it :)
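
For the no-external-deps route mentioned above, here is a rough sketch of
tag stripping with the standard library's html.parser, run on the fragment
from the question. The names TextExtractor and strip_tags are made up for
illustration and are not part of any library. Treat it as one possible
approach under those assumptions, not a drop-in replacement for bs4.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text nodes and ignore every tag."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of markup with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# The fragment from the question: unclosed <p> elements and a bare <hr>
fragment = "<p>A\n<p>B\n<hr>\n"
print(strip_tags(fragment).split())  # ['A', 'B']
```

The parser is event-based and does not build a tree, so unclosed tags are
not an error here. Note that handle_data also receives the contents of
<script> and <style> elements, so a real page would need extra filtering.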

ChrisA