html parsing? Or just simple regex'ing?

Dan Stromberg strombrg at dcs.nac.uci.edu
Mon Nov 15 11:03:34 EST 2004


On Thu, 11 Nov 2004 15:05:27 -0500, Michael J. Fromberger wrote:

> In article <pan.2004.11.10.01.37.41.879705 at dcs.nac.uci.edu>,
>  Dan Stromberg <strombrg at dcs.nac.uci.edu> wrote:
> 
>> I'm working on writing a program that will synchronize one database with
>> another.  For the source database, we can just use the Python Sybase API;
>> that's nice and normal.
>> 
>> [...]
>> 
>> 1) Would I be better off just regex'ing the HTML I'm getting back?  (I
>> suppose this depends on the complexity of the HTML received, eh?)
>> 
>> 2) Would I be better off feeding the HTML into an HTML parser, and then
>> traversing that data structure (is that really how it works?)?
> 
> I recommend you look at BeautifulSoup:
> 
>   http://www.crummy.com/software/BeautifulSoup/
> 
> It is very forgiving of the typical affronts HTML writers put into their 
> code.

BeautifulSoup looks great.

Regrettably, I'm getting:

Traceback (most recent call last):
  File "./netreo.py", line 130, in ?
    soup.feed(html)
  File "/Dcs/staff/strombrg/netreo/lib/BeautifulSoup.py", line 308, in feed
    SGMLParser.feed(self, text)
  File "/Web/lib/python2.3/sgmllib.py", line 94, in feed
    self.rawdata = self.rawdata + data
TypeError: cannot concatenate 'str' and 'list' objects

...upon feeding some HTML into the "feed" method.
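For reference, here's a stripped-down sketch of roughly what I'm doing
(the file name is just a placeholder for wherever the HTML really comes
from).  Seeing "list" in that TypeError makes me wonder whether I'm
handing feed() a list of lines (say, from readlines()) rather than one
big string:

from BeautifulSoup import BeautifulSoup

# What I suspect I'm effectively doing: readlines() returns a list of
# strings, which sgmllib can't concatenate onto its rawdata buffer.
# html = open('page.html').readlines()

# read() returns the whole document as a single string, which is what
# feed() appears to expect:
html = open('page.html').read()

soup = BeautifulSoup()
soup.feed(html)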

This is with Python 2.3.4, on an RHEL 3 system.

Am I perhaps using a version of Python that is too recent?




