A little complex usage of Beautiful Soup Parsing Help!

SAKTHEESH s.a.saktheesh at gmail.com
Wed Jul 20 14:18:02 EDT 2011


I am using Beautiful Soup to parse an HTML document and find all text
that is not contained inside any anchor (`<a>`) element.

I came up with the code below, which prints the href of every link, but
I need the opposite: the text that lies outside the links.

How can I modify this code to get only plain text using Beautiful
Soup, so that I can do some find and replace and modify the soup?

    for a in soup.findAll('a',href=True):
        print a['href']


Example:

    <html><body>
     <div> <a href="www.test1.com/identify">test1</a> </div>
     <div><br></div>
     <div><a href="www.test2.com/identify">test2</a></div>
     <div><br></div><div><br></div>
     <div>
       This should be identified

       Identify me 1

       Identify me 2
       <p id="firstpara" align="center"> This paragraph should be<b>
identified </b>.</p>
     </div>
    </body></html>

Output:

    This should be identified
    Identify me 1
    Identify me 2
    This paragraph should be identified.
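
A minimal sketch of the filtering step, for reference. It assumes the
newer `bs4` package on Python 3 rather than the BeautifulSoup 3 API used
in the snippet above (`find_all` and `find_parent` correspond to
`findAll` and `findParent` there). The helper name
`text_outside_anchors` is made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document from this post.
html = """<html><body>
 <div> <a href="www.test1.com/identify">test1</a> </div>
 <div><br></div>
 <div>
   This should be identified

   Identify me 1

   Identify me 2
   <p id="firstpara" align="center"> This paragraph should be<b>
identified </b>.</p>
 </div>
</body></html>"""


def text_outside_anchors(soup):
    """Collect non-empty lines of text that have no <a> ancestor.

    Hypothetical helper, not part of Beautiful Soup itself.
    """
    found = []
    # string=True matches every text node in the tree, in document order.
    for node in soup.find_all(string=True):
        # Skip text that sits anywhere inside an anchor element.
        if node.find_parent("a") is not None:
            continue
        # One text node may span several lines; keep the non-blank ones.
        for line in node.splitlines():
            line = line.strip()
            if line:
                found.append(line)
    return found


soup = BeautifulSoup(html, "html.parser")
texts = text_outside_anchors(soup)
for line in texts:
    print(line)
```

Note that this yields one entry per text node, so the paragraph comes
back in pieces ("This paragraph should be", "identified", ".") because
the `<b>` tag splits it. Joining the pieces per block element would be a
separate step.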

The goal is to find the text that is not within `<a></a>`, then search
that text for "Identify" and replace it with "Replaced".

So the final output will be like this:

    <html><body>
     <div> <a href="www.test1.com/identify">test1</a> </div>
     <div><br></div>
     <div><a href="www.test2.com/identify">test2</a></div>
     <div><br></div><div><br></div>
     <div>
       This should be identified

       Replaced me 1

       Replaced me 2
       <p id="firstpara" align="center"> This paragraph should be<b>
identified </b>.</p>
     </div>
    </body></html>
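
The replace step could be sketched roughly as follows, again assuming
`bs4` on Python 3 (in BeautifulSoup 3 the method is spelled
`replaceWith`). The function name `replace_outside_anchors` is
hypothetical, and the match is a plain case-sensitive substring test, so
"identified" and the lowercase "identify" inside the href values are
left alone.

```python
from bs4 import BeautifulSoup

# Small test document; the second link's text is an added assumption
# used to show that text inside <a> is not modified.
html = """<html><body>
 <div><a href="www.test1.com/identify">test1</a></div>
 <div><a href="www.test2.com/identify">Identify link</a></div>
 <div>
   This should be identified
   Identify me 1
   Identify me 2
 </div>
</body></html>"""


def replace_outside_anchors(soup, old, new):
    """Replace `old` with `new` in every text node that has no <a> ancestor.

    Modifies the soup in place and returns the number of text nodes changed.
    """
    changed = 0
    # Snapshot the text nodes first, since the tree is edited in the loop.
    for node in list(soup.find_all(string=True)):
        if old in node and node.find_parent("a") is None:
            # replace_with swaps this text node for the new string.
            node.replace_with(node.replace(old, new))
            changed += 1
    return changed


soup = BeautifulSoup(html, "html.parser")
changed = replace_outside_anchors(soup, "Identify", "Replaced")
result = str(soup)
print(result)
```

Only text nodes are touched here, so attribute values such as the href
are never candidates for replacement.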

Thanks for your time and help!


