Help with using findAll() in BeautifulSoup

Stefan Behnel stefan_ml at behnel.de
Sat Jul 12 01:55:12 EDT 2008


Alexnb wrote:
> Okay, I am not sure if there is a better way of doing this than findAll() but
> that is how I am doing it right now.

Consider using lxml.html and lxml.cssselect.

http://codespeak.net/lxml/


> I am making an app that screen scrapes
> dictionary.com for definitions.

Do they have a policy for doing that?


>  noun
> <table blah>
> <table blah>
> verb
> <table blah>
> 
> I want to be able to do like findAll('span', {'class': 'pg'}), but tell me
> how many <table> things are after it, or before the next  so I know how many
> definitions it has.

You didn't say where the "span" is in the HTML code, but lxml.cssselect should
get you pretty close to what you want. If your tables are descendants of the
"span"s, a selector like:

    "span.pg table"

might work. CSS also has sibling combinators: "+" matches the immediately
following sibling, and "~" matches any later sibling of the same parent.

Stefan


