BeautifulSoup: problems with parsing a website

Stefan Behnel stefan_ml at behnel.de
Wed May 28 16:04:23 EDT 2008


Marco Hornung wrote:
> Hi guys,

... and girls?


> I'm using the Python framework BeautifulSoup (BS) to parse some
> information out of a German soccer website.

Consider using lxml.

http://codespeak.net/lxml

    >>> from lxml import html


> I want to parse the article shown on the website.

    >>> tree = html.parse(
    ...     "http://www.bundesliga.de/de/liga/news/2007/index.php?f=94820.php")

> To do so I want to
> use the Tag " <div class="txt_fliesstext">" as a starting-point.

    >>> # xpath() returns a list of matches; take the first one
    >>> div = tree.xpath('//div[@class = "txt_fliesstext"]')[0]
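
A minimal sketch of that lookup on an inline document, in case you want to try it without hitting the site. The markup in SAMPLE is made up for illustration (I have not copied it from the real page), and the second expression is just one common way to match a class token when the attribute holds several classes:

```python
from lxml import html

# Hypothetical markup standing in for the real page.
SAMPLE = """
<html><body>
  <div class="nav">menu</div>
  <div class="txt_fliesstext">Article text</div>
  <div class="txt_fliesstext wide">Other text</div>
</body></html>
"""

root = html.fromstring(SAMPLE)

# xpath() always returns a list, even when there is only one match.
# '@class = "..."' compares the whole attribute value.
exact = root.xpath('//div[@class = "txt_fliesstext"]')

# Token-based match: also finds class="txt_fliesstext wide".
by_token = root.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "),'
    ' " txt_fliesstext ")]'
)

# Guard against an empty result before indexing.
div = exact[0] if exact else None
```

If the real page puts more than one class on the div, the exact comparison will find nothing and the token-based form is probably what you want.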


> When
> I have found the Tag I somehow want to get all following "br"-Tags

Following? Meaning: after the div?

    >>> br_list = div.xpath("following-sibling::br")

Or within the div?

    >>> br_list = div.xpath(".//br")
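
To make the difference between the two readings concrete, here is a small sketch on invented markup (SAMPLE is an assumption, not the real page). Note that a "br" element has no content of its own; in lxml the text that follows it is stored in its .tail:

```python
from lxml import html

# Hypothetical markup: two <br> inside the div, two more after it.
SAMPLE = """
<html><body>
  <div class="txt_fliesstext">first<br>second<br>third</div>
  <br>
  <p>footer</p>
  <br>
</body></html>
"""

root = html.fromstring(SAMPLE)
div = root.xpath('//div[@class = "txt_fliesstext"]')[0]

# <br> elements that come after the div on the same level.
after_div = div.xpath("following-sibling::br")

# <br> elements anywhere inside the div.
inside_div = div.xpath(".//br")

# The text after each <br> lives in its .tail attribute.
inside_texts = [br.tail for br in inside_div]
```

The text before the first "br" ("first" in this sample) is not a tail of anything; it is div.text.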


> until a new CSS class style comes up.

Ok, that's different.

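    >>> # iterdescendants() skips the div itself, which carries a class
    >>> for el in div.iterdescendants(): # or div.itersiblings():
    ...     if el.tag == "br":
    ...         print(el.tail) # text following the <br>, or whatever
    ...     elif el.tag == "span" or el.get("class"):
    ...         break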
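
The same loop, wrapped in a function so it can be tried on a string. This is only a sketch under my reading of the question: the helper name lines_until_new_class, the sample markup and the "quelle" class are all made up, and I am assuming the text of interest is whatever follows each "br":

```python
from lxml import html

# Hypothetical markup; the span with a class marks the "new style".
SAMPLE = """
<html><body>
  <div class="txt_fliesstext">intro<br>line one<br>line two<span class="quelle">source</span><br>ignored</div>
</body></html>
"""


def lines_until_new_class(div):
    """Collect the text following each <br> inside div, stopping at the
    first descendant that is a <span> or carries a class attribute."""
    lines = []
    # iterdescendants() walks in document order and skips div itself.
    for el in div.iterdescendants():
        if el.tag == "br":
            if el.tail and el.tail.strip():
                lines.append(el.tail.strip())
        elif el.tag == "span" or el.get("class"):
            break
    return lines


root = html.fromstring(SAMPLE)
div = root.xpath('//div[@class = "txt_fliesstext"]')[0]
result = lines_until_new_class(div)
```

With this sample, the "br" after the span is never reached, so "ignored" stays out of the result. The text before the first "br" is div.text and would have to be added separately if you need it.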

Hope it helps.

Stefan


