html parsing? Or just simple regex'ing?

Mon Nov 15 19:35:26 EST 2004

In article <pan.2004.11.15.16.04.02.657003 at dcs.nac.uci.edu>,
 Dan Stromberg <strombrg at dcs.nac.uci.edu> wrote:

> On Thu, 11 Nov 2004 15:05:27 -0500, Michael J. Fromberger wrote:
> 
> > In article <pan.2004.11.10.01.37.41.879705 at dcs.nac.uci.edu>,
> >  Dan Stromberg <strombrg at dcs.nac.uci.edu> wrote:
> >> [...]
> >> 
> >> 1) Would I be better off just regex'ing the html I'm getting back?  (I
> >> suppose this depends on the complexity of the html received, eh?)
> >> 
> >> 2) Would I be better off feeding the HTML into an HTML parser, and then
> >> traversing that datastructure (is that really how it works?)?
> > 
> > I recommend you look at BeautifulSoup:
> > 
> >   http://www.crummy.com/software/BeautifulSoup/
> > 
> > It is very forgiving of the typical affronts HTML writers put into their 
> > code.
> 
> BeautifulSoup looks great.
> 
> Regrettably, I'm getting:
> 
> Traceback (most recent call last):
>   File "./netreo.py", line 130, in ?
>     soup.feed(html)
>   File "/Dcs/staff/strombrg/netreo/lib/BeautifulSoup.py", line 308, in feed
>     SGMLParser.feed(self, text)
>   File "/Web/lib/python2.3/sgmllib.py", line 94, in feed
>     self.rawdata = self.rawdata + data
> TypeError: cannot concatenate 'str' and 'list' objects
> 
> ...upon feeding some html into the "feed" method.
> 
> This is with python 2.3.4, on an RHEL 3 system.
> 
> Am I perhaps using a version of python that is too recent?

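Are you feeding a string to your parser, or a list?  The traceback shows
feed() appending its argument to self.rawdata, which is a string, so it
wants a string; it looks like you're giving it a list.  If so, the Python
version is probably not the problem.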
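
One common way to end up with a list is to read the page with something
like readlines(), which returns a list of lines rather than one string.
I can't tell from the traceback whether that is what happened, so treat
the following as a minimal sketch under that assumption.  The names
as_html_text and TitleGrabber are made up for illustration, and the
standard library's html.parser (from current Python) stands in for
BeautifulSoup here, since its feed() also expects a single string.

```python
from html.parser import HTMLParser  # stdlib stand-in for the parser's feed()


def as_html_text(data):
    """Return data as one string; join it first if it is a list of lines."""
    if isinstance(data, (list, tuple)):
        return "".join(data)
    return data


class TitleGrabber(HTMLParser):
    """Tiny hypothetical parser that records the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# Roughly what readlines() hands back: a list of strings, not a string.
lines = ["<html><head><title>Net", "reo</title></head>\n", "<body></body></html>\n"]

parser = TitleGrabber()
parser.feed(as_html_text(lines))  # join first, then feed one string
parser.close()
print(parser.title)
```

Joining with "".join(lines) keeps the original line endings, because
each element from readlines() still carries its own newline.  Reading
the whole response with read() instead avoids the list altogether.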

-M

-- 
Michael J. Fromberger             | Lecturer, Dept. of Computer Science
http://www.dartmouth.edu/~sting/  | Dartmouth College, Hanover, NH, USA