Delete h2 until you reach the next h2 in beautifulsoup

rosefox911 at gmail.com rosefox911 at gmail.com
Sun Nov 6 18:24:02 EST 2016


On Sunday, November 6, 2016 at 1:27:48 AM UTC-4, rosef... at gmail.com wrote:
> Considering the following html:
> 
>     <h2 id="example">cool stuff</h2> <ul> <li>hi</li> </ul> <div> <h2 id="cool"></h2> <ul><li>zz</li> </ul> </div>
> 
> and the following list:
> 
>     ignore_list = ['example','lalala']
> 
> My goal: while going through the HTML with BeautifulSoup, whenever I find an h2 whose id is in my list (ignore_list), I should delete all the ul and li elements under it until I reach another h2. I would then check whether that next h2 is in my ignore list; if it is, delete all the ul and li elements until the next h2 (or, if there are no h2s left, delete the ul and li elements under the current one and stop).
> 
> How I see the process going: read all the h2s from top to bottom in the DOM. If the id of any of them is in ignore_list, delete all the ul and li elements under that h2 until you reach the NEXT h2. If there is no next h2, delete the ul and li elements and then stop.
> 
> Here is the full HTML I am trying to work with: http://pastebin.com/Z3ev9c8N
> 
> I am trying to delete all the ul and li elements after "See_also". How would I accomplish this in Python?


I got it working with the following solution:

    # Remove content I don't want.
    # This fragment runs inside an outer loop (hence the `continue` below).
    try:
        for element in body.find_all('h2'):
            # Heading text with the '[edit]' link label stripped
            current_h2 = element.get_text().replace('[edit]', '')
            if current_h2 in ignore_list:
                # Drop the first following sibling <div> and <ul>, if present
                next_div = element.find_next_sibling('div')
                if next_div is not None:
                    next_div.decompose()
                next_ul = element.find_next_sibling('ul')
                if next_ul is not None:
                    next_ul.decompose()
    except (AttributeError, TypeError):
        continue
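
For comparison, here is a rough sketch of the behaviour described in the original question. It is not the code I actually ran. It matches on the h2 id rather than the heading text, and it removes every ul that follows the heading in document order until the next h2, not only the first sibling. The names strip_ignored_sections and ignore_ids are made up for illustration, and the sample HTML is the short snippet from the question with the second h2 closed properly.

```python
from bs4 import BeautifulSoup


def strip_ignored_sections(html, ignore_ids):
    """Remove every <ul> that follows an ignored <h2>, up to the next <h2>.

    Hypothetical helper: an <h2> counts as ignored when its id is in ignore_ids.
    """
    soup = BeautifulSoup(html, "html.parser")
    for heading in soup.find_all("h2"):
        if heading.get("id") not in ignore_ids:
            continue
        while True:
            # First <h2> or <ul> after this heading, in document order
            # (this also sees tags nested inside a following <div>).
            nxt = heading.find_next(["h2", "ul"])
            if nxt is None or nxt.name == "h2":
                break  # reached the next section or the end of the document
            nxt.decompose()  # removes the <ul> together with its <li> children
    return soup


html = (
    '<h2 id="example">cool stuff</h2> <ul> <li>hi</li> </ul> '
    '<div> <h2 id="cool"></h2> <ul><li>zz</li> </ul> </div>'
)
cleaned = strip_ignored_sections(html, {"example", "lalala"})
print(cleaned)
```

Looking the next tag up again after each decompose() avoids walking a list of elements that are being destroyed at the same time. Only ul tags are removed, so a div that happens to contain the next h2 is left alone.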


