Beautiful Soup: why does "string" not give me the string?

Gabriel Rossetti gabriel.rossetti at arimaz.com
Wed Apr 1 08:15:58 EDT 2009


Jeremiah Dodds wrote:
>
>
> On Wed, Apr 1, 2009 at 8:25 AM, Gabriel Rossetti
> <gabriel.rossetti at arimaz.com> wrote:
>
>     Hello everyone,
>
>     I am using beautiful soup to parse some HTML and I came across
>     something strange.
>     Here is an illustration:
>
>     >>> soup = BeautifulSoup(u'<div class="text">hello ça boume<br /></div>')
>     >>> soup
>     <div class="text">hello ça boume<br /></div>
>     >>> soup.find("div", "text")
>     <div class="text">hello ça boume<br /></div>
>     >>> soup.find("div", "text").string
>     >>> soup.find("div", "text").next
>     u'hello \xe7a boume'
>
>     why does soup.find("div", "text").string not give me the string?
>     Is it because there is a <br/>?
>
>
> IIRC, yes it is, and there's not much you can do about it other than
> using .next.string or .contents[0], or stripping out the <br /> tags. See
> http://www.crummy.com/software/BeautifulSoup/documentation.html ,
> particularly the "Removing Elements" and "string" sections.
>
>
OK, thanks. I also found that I can do this:

    soup.find(text=lambda t: isinstance(t, basestring))

or this:

    soup.find(text=True)

It seems faster than doing this:

    for br in soup.findAll("br"):
        br.extract()
    soup.string

but I may be wrong.
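
For what it's worth, the rule behind the empty .string can be modelled in a few lines. This is only a toy sketch in plain Python 3, not Beautiful Soup's actual code: ToyTag and first_text are hypothetical names made up for the illustration, and text nodes are plain str objects. It assumes the rule described in the "string" section of the documentation: .string is only set when a tag has exactly one child and that child is a string.

```python
# Toy model of the ".string" rule. NOT Beautiful Soup's implementation;
# ToyTag and first_text are hypothetical names used only for this sketch.

class ToyTag(object):
    def __init__(self, name, contents=None):
        self.name = name
        self.contents = list(contents or [])

    @property
    def string(self):
        # Only defined when the tag has exactly one child and it is text.
        if len(self.contents) == 1 and isinstance(self.contents[0], str):
            return self.contents[0]
        return None


def first_text(node):
    """Return the first text node in document order, or None."""
    if isinstance(node, str):
        return node
    for child in node.contents:
        found = first_text(child)
        if found is not None:
            return found
    return None


# Models <div class="text">hello \xe7a boume<br /></div>
div = ToyTag("div", [u"hello \xe7a boume", ToyTag("br")])

before = div.string          # None: two children (a string and a <br />)
first = div.contents[0]      # the text, like .contents[0] in the thread
searched = first_text(div)   # the text, roughly like find(text=True)

# Dropping the <br /> leaves a single string child, so .string is set again.
div.contents = [c for c in div.contents if isinstance(c, str)]
after = div.string           # the text
```

In this model the search stops at the first text node without modifying the tree, whereas the extract approach has to visit and remove every <br /> first. That is consistent with the impression above, but it is not a measurement.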

Thanks again!
Gabriel


