html parsing? Or just simple regex'ing?
Michael J. Fromberger
Michael.J.Fromberger at Clothing.Dartmouth.EDU
Mon Nov 15 19:35:26 EST 2004
In article <pan.2004.11.15.16.04.02.657003 at dcs.nac.uci.edu>,
Dan Stromberg <strombrg at dcs.nac.uci.edu> wrote:
> On Thu, 11 Nov 2004 15:05:27 -0500, Michael J. Fromberger wrote:
>
> > In article <pan.2004.11.10.01.37.41.879705 at dcs.nac.uci.edu>,
> > Dan Stromberg <strombrg at dcs.nac.uci.edu> wrote:
> >> [...]
> >>
> >> 1) Would I be better off just regex'ing the html I'm getting back? (I
> >> suppose this depends on the complexity of the html received, eh?)
> >>
> >> 2) Would I be better off feeding the HTML into an HTML parser, and then
> >> traversing that datastructure (is that really how it works?)?
> >
> > I recommend you look at BeautifulSoup:
> >
> > http://www.crummy.com/software/BeautifulSoup/
> >
> > It is very forgiving of the typical affronts HTML writers put into their
> > code.
>
> BeautifulSoup looks great.
>
> Regrettably, I'm getting:
>
> Traceback (most recent call last):
>   File "./netreo.py", line 130, in ?
>     soup.feed(html)
>   File "/Dcs/staff/strombrg/netreo/lib/BeautifulSoup.py", line 308, in feed
>     SGMLParser.feed(self, text)
>   File "/Web/lib/python2.3/sgmllib.py", line 94, in feed
>     self.rawdata = self.rawdata + data
> TypeError: cannot concatenate 'str' and 'list' objects
>
> ...upon feeding some html into the "feed" method.
>
> This is with python 2.3.4, on an RHEL 3 system.
>
> Am I perhaps using a version of python that is too recent?
Are you feeding a string to your parser, or a list? It wants a string;
the TypeError (cannot concatenate 'str' and 'list') suggests you're
giving it a list.
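
Here is a minimal sketch of the likely fix, assuming the list came from
something like readlines(): join the pieces into one string before calling
feed. sgmllib is gone from current Python, so the standard library's
html.parser.HTMLParser stands in for the parser here. TitleCollector,
as_text, and the sample lines are hypothetical names made up for
illustration; none of them come from netreo.py or BeautifulSoup.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Hypothetical stand-in parser: records the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def as_text(html):
    """Return html as a single string, joining it first if it is a list of lines."""
    if isinstance(html, (list, tuple)):
        return "".join(html)
    return html


# Assumed input shape: a list of lines, as readlines() would return.
lines = ["<html><head><title>Net", "reo</title></head>\n", "<body></body></html>\n"]

parser = TitleCollector()
parser.feed(as_text(lines))  # feed() gets one string, never the list itself
parser.close()
print(parser.title)
```

In your script the equivalent change would presumably be something like
soup.feed("".join(html)), though that depends on how html is built.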
-M
--
Michael J. Fromberger | Lecturer, Dept. of Computer Science
http://www.dartmouth.edu/~sting/ | Dartmouth College, Hanover, NH, USA