Question about how to do something in BeautifulSoup?

Cody Piersall cody.piersall at gmail.com
Sat Jan 23 15:49:48 EST 2016


On Fri, Jan 22, 2016 at 8:01 AM, inhahe <inhahe at gmail.com> wrote:
> Say I have the following HTML (I hope this shows up as plain text here
> rather than formatting):
>
> <div style="font-size: 20pt;"><span style="color: #000000;"><em><strong>"Is
> today the day?"</strong></em></span></div>
>
> And I want to extract the "Is today the day?" part. There are other places
> in the document with <em> and <strong>, but this is the only place that
> uses color #000000, so I want to extract anything that's within a color
> #000000 style, even if it's nested multiple levels deep within that.
>
> - Sometimes the color is defined as RGB(0, 0, 0) and sometimes it's defined
> as #000000
> - Sometimes the <strong> is within the <em> and sometimes the <em> is
> within the <strong>.
> - There may be other discrepancies I haven't noticed yet
>
> How can I do this in BeautifulSoup (or is this better done in lxml.html)?

I hope this helps you get started:

from bs4 import BeautifulSoup
from itertools import chain

# Name the parser explicitly; otherwise bs4 warns and picks whichever parser
# happens to be installed.
soup = BeautifulSoup('''\
<div style="font-size: 20pt;"><span style="color: #000000;"><em><strong>"Is
today the day?"</strong></em></span></div>
<div style="font-size: 20pt;"><span style="color: RGB(0, 0, 0);"><strong><em>"Is
tomorrow the day?"</em></strong></span></div>''', 'html.parser')

# We're going to get all the tags that specify the color, either using hex
# or RGB. Note that a string passed as style= has to match the whole
# attribute value exactly, spacing and trailing semicolon included.
# If you only want to get the span tags, just give the positional argument
# 'span' to find_all:
# for tag in chain(soup.find_all('span', style='color: #000000;'),
#                  soup.find_all('span', style='color: RGB(0, 0, 0);')):

for tag in chain(soup.find_all(style='color: #000000;'),
                 soup.find_all(style='color: RGB(0, 0, 0);')):
    try:
        print(tag.em.strong.text)
    except AttributeError:
        try:
            print(tag.strong.em.text)
        except AttributeError:
            print('ooooooh nooooo no text')
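
If the spacing in the style attribute varies, or the nesting goes deeper than
em/strong, a more tolerant variant might look like the sketch below. It is only
a sketch under assumptions: the BLACK pattern and the black_texts() helper are
made-up names, and the pattern only covers #000000, #000 and rgb(0, 0, 0), so
extend it for whatever other discrepancies turn up. It relies on two documented
bs4 behaviours: a compiled regex given as an attribute filter is matched with
search(), and get_text() collects the text of all descendants regardless of
how the tags are nested.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical pattern: black written as #000000, #000 or rgb(0, 0, 0),
# in any letter case and with any whitespace. The lookbehind keeps
# properties such as "background-color" from matching.
BLACK = re.compile(
    r'(?<![-\w])color\s*:\s*'
    r'(?:#000000|#000|rgb\(\s*0\s*,\s*0\s*,\s*0\s*\))\s*(?:;|$)',
    re.IGNORECASE,
)

def black_texts(html):
    """Return the text inside every tag whose style sets color to black."""
    soup = BeautifulSoup(html, 'html.parser')
    # get_text() gathers text from all descendants, so it does not matter
    # whether <em> wraps <strong> or the other way round. The split/join
    # collapses line breaks and runs of spaces into single spaces.
    return [' '.join(tag.get_text().split())
            for tag in soup.find_all(style=BLACK)]
```

Calling black_texts() on the two-div sample above should give one string per
black span. If one black-styled tag sits inside another, the inner text will
show up twice, so you may want to de-duplicate the result.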

Cody


