Extract information from HTML table

Ulysse maxime.p at gmail.com
Mon Apr 2 17:07:24 EDT 2007


On Apr 2, 9:28 pm, cla... at lairds.us (Cameron Laird) wrote:
> In article <1175503135.234560.51... at n59g2000hsh.googlegroups.com>,
>
> anjesh <anjeshtulad... at gmail.com> wrote:
> >On Apr 2, 12:54 am, "Dotan Cohen" <dotanco... at gmail.com> wrote:
> >> On 1 Apr 2007 07:56:04 -0700, Ulysse <maxim... at gmail.com> wrote:
>
> >> > I have seen the Beautiful Soup online help and tried to apply that to
> >> > my problem. But it seems to be a little bit hard. I would rather try to
> >> > do this with regular expressions...
>
> >> If you think that Beautiful Soup is difficult, then wait till you try
> >> to do this with regexes. Granted, knowing the exact format of the HTML
> >> you are scraping will help, but if you ever need to parse HTML from an
> >> unknown source, then Beautiful Soup is the only way to go. Not all HTML
> >> authors close their td and tr tags, and sometimes there are attributes
> >> on those tags. If you plan on ever reusing the code, or the format of
> >> the HTML may change, then you are best off sticking with Beautiful
> >> Soup.
>
> >> Dotan Cohen
>
> >> http://lyricslist.com/ http://what-is-what.com/
>
> >Have you tried HTMLParser? It can do the task you want to perform:
> >http://docs.python.org/lib/module-HTMLParser.html
>
> >-anjesh
>
> Yes, except that these last two follow-ups UNDERstate the difficulty--in
> fact, the impossibility--of achieving adequate results on this problem
> with regular expressions.  We'll help with the documentation for HTMLParser
> and BeautifulSoup.  REs are an invitation to madness.
>
> <URL:http://www.unixreview.com/documents/s=10121/ur0702e/> might amuse
> those who want to think more about REs.

r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2})</td>\W*?<td class="tdn">\W*?<a href="(.*?)">(.*?)</a>.*?</td>'

r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2}).*?player\.php.*?>(.*?)</a>.*?<textarea.*?>(.*?)</textarea>'

r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2})</td>\W*?<td class="tdn">\W*?Message au clan de :([a-zA-Z0-9_\-]+?)\W*<br>(.*?)</th>'

These three REs extract all the data I need, although they do not apply
exactly to the given string.
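
For reference, a minimal sketch of how the first pattern could be applied
with re.findall in Python 3. The sample markup, the names ROW_RE, SAMPLE
and extract_events, and the re.DOTALL flag are assumptions made for
illustration; the real page is not shown in this thread.

```python
import re

# First pattern from this message, written on one logical line.
ROW_RE = re.compile(
    r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2})</td>\W*?<td class="tdn">'
    r'\W*?<a href="(.*?)">(.*?)</a>.*?</td>',
    re.DOTALL,  # assumption: let ".*?" cross line breaks inside a cell
)

# Hypothetical sample row; the actual HTML may differ.
SAMPLE = """<tr>
<td class="tdn">02.04.2007 - 17:07:24</td>
<td class="tdn">
<a href="player.php?id=42">Ulysse</a> attacked</td>
</tr>"""


def extract_events(html):
    """Return a list of (timestamp, link target, link text) tuples."""
    return ROW_RE.findall(html)


for when, href, name in extract_events(SAMPLE):
    print(when, href, name)
```

The pattern only matches while the markup keeps this exact shape: a
closing </td> after the timestamp and class="tdn" written exactly this way.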
I read the article, but I didn't understand why REs are an invitation to
madness...
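
One reading of the concern raised in the quoted replies is that td and tr
tags are not always closed and may carry varying attributes. Below is a
rough sketch of the HTMLParser approach that tolerates both, written for
Python 3, where the module lives at html.parser (it was named HTMLParser
in Python 2). The class name TableExtractor, the helper extract_rows and
the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the open cell, or None

    def _end_cell(self):
        if self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()          # a new <tr> closes an unclosed row
            self._row = []
        elif tag in ("td", "th"):
            if self._row is None:    # cell without an enclosing <tr>
                self._row = []
            self._end_cell()         # a new cell closes an unclosed one
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def extract_rows(html):
    """Return the table contents as a list of rows of cell text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    parser._end_row()  # flush a row left open at the end of the input
    return parser.rows


# Hypothetical markup with unclosed <td>/<tr> and an unquoted attribute.
MESSY = ('<table><tr><td class="tdn">02.04.2007 - 17:07:24'
         '<td class=tdn><a href="player.php?id=42">Ulysse</a> attacked'
         '<tr><td>a</td><td>b</td></tr></table>')

for row in extract_rows(MESSY):
    print(row)
```

Because a cell or row is closed whenever the next one starts, the missing
end tags in the sample do not lose any data, and the class attribute is
never inspected, so changes to it do not matter.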



