BeautifulSoup: problems with parsing a website

Marco Hornung Marcohornung at hotmail.com
Wed May 28 07:25:30 EDT 2008


Hi guys,

I'm using the Python library BeautifulSoup (BS) to parse some
information out of a German soccer website.
I spent some quality time with the BS docs, but I couldn't really
figure out how to get what I was looking for.

Here's the deal:
I want to parse the article shown on the website. To do so, I want to
use the tag <div class="txt_fliesstext"> as a starting point. Once I
have found that tag, I want to get all the following "br" tags until
a new CSS class comes up.
I tried several options for findAll(), but nothing seems to work:

  soup.findAll('br', attrs={'class':'txt_fliesstext'}, text=True)
      This one comes back with a thousand additional tags that I
      don't want.

  soup.findAll(attrs={'class':'txt_fliesstext'})
      This gives me a much better result, but I only get a few tags
      instead of all the tags I want.

Any suggestions?
Thanks in advance!

Website:
http://www.bundesliga.de/de/liga/news/2007/index.php?f=94820.php
Some HTML code from the website:
<div id="area_headline">
      <div class="txt_headline_red">Erst Höhenflug, dann Absturz</div>
    </div>
    <div id="area_fliesstext">
      <div class="txt_fliesstext_bold">Mit 28 Punkten stand der KSC
nach der Hinrunde sensationell auf Platz 6.</div>
      <br><br>
      <div class="txt_fliesstext">Doch in der Rückrunde brachen
die Badener regelrecht ein und holten nur noch 15 Zähler.<br />
<br />
43 Punkte reichten am Ende für den 11. Tabellenplatz, ein mehr
als respektables Ergebnis für einen Aufsteiger.<br />
<br />


