BeautifulSoup to get string inner 'p' and 'a' tags

Nick Vatamaniuc vatamane at gmail.com
Mon Jul 24 06:41:59 EDT 2006


Quick-n-dirty way:
Once you have your whole p string, e.g. <p class="contentBody">FOO <a
name="f"></a> </p>, remove anything delimited by '<' and '>' with a
regex. Your short example _doesn't_ show anything between the <a> and
</a> tags, so I assume that either nothing will ever be there or, if
something is, you want it included in the final text as well. As in
'<p class="contentBody">FOO <a name="f">URLNAME</a> </p>' ==> 'FOO
URLNAME'

For the regex, start with something simple like <.*?>, see if it
works, then improve it. Kiki or Kodos (visual regex helpers for
Python) make it easier to experiment.
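
A minimal sketch of the regex approach described above, using only the
standard re module. The helper name strip_tags and the whitespace
cleanup step are my own additions for illustration, not something from
the original question:

```python
import re

# Naive pattern from the suggestion above: '<', then as few characters
# as possible, then '>'. This is a starting point, not a real HTML parser.
TAG_RE = re.compile(r'<.*?>')

def strip_tags(html):
    """Drop anything that looks like a tag, then collapse leftover whitespace."""
    text = TAG_RE.sub('', html)
    return ' '.join(text.split())

p_empty_a = '<p class="contentBody">FOO <a name="f"></a> </p>'
p_named_a = '<p class="contentBody">FOO <a name="f">URLNAME</a> </p>'

print(strip_tags(p_empty_a))  # FOO
print(strip_tags(p_named_a))  # FOO URLNAME
```

Known limits of this pattern: '.' does not match a newline by default,
so a tag that wraps across lines is left in place unless you compile
with re.DOTALL, and a '>' inside an attribute value ends the match too
early.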

Hope this helps,
Nick V.


GinTon wrote:
> I'm trying to get the 'FOO' string but the problem is that inner 'P'
> tag there is another tag, 'a'. So:
>
> > from BeautifulSoup import BeautifulSoup
> > s = '<td width="88%" valign="TOP"> <p class="contentBody">FOO <a name="f"></a> </p></td>'
> > tree = BeautifulSoup(s)
>
> > print tree.first('p')
> <p class="contentBody">FOO <a name="f"></a> </p>
>
> So if I run 'print tree.first('p').string' to get the 'FOO' string it
> shows Null value because it's the 'a' tag:
> 
> > print tree.first('p').string
> Null
> 
> Any solution?



