html parsing? Or just simple regex'ing?
Dan Stromberg
strombrg at dcs.nac.uci.edu
Mon Nov 15 11:03:34 EST 2004
On Thu, 11 Nov 2004 15:05:27 -0500, Michael J. Fromberger wrote:
> In article <pan.2004.11.10.01.37.41.879705 at dcs.nac.uci.edu>,
> Dan Stromberg <strombrg at dcs.nac.uci.edu> wrote:
>
>> I'm working on writing a program that will synchronize one database with
>> another. For the source database, we can just use the python sybase API;
>> that's nice and normal.
>>
>> [...]
>>
>> 1) Would I be better off just regex'ing the html I'm getting back? (I
>> suppose this depends on the complexity of the html received, eh?)
>>
>> 2) Would I be better off feeding the HTML into an HTML parser, and then
>> traversing that datastructure (is that really how it works?)?
>
> I recommend you look at BeautifulSoup:
>
> http://www.crummy.com/software/BeautifulSoup/
>
> It is very forgiving of the typical affronts HTML writers put into their
> code.
BeautifulSoup looks great.
Regrettably, I'm getting:
Traceback (most recent call last):
  File "./netreo.py", line 130, in ?
    soup.feed(html)
  File "/Dcs/staff/strombrg/netreo/lib/BeautifulSoup.py", line 308, in feed
    SGMLParser.feed(self, text)
  File "/Web/lib/python2.3/sgmllib.py", line 94, in feed
    self.rawdata = self.rawdata + data
TypeError: cannot concatenate 'str' and 'list' objects
...upon feeding some html into the "feed" method.
This is with python 2.3.4, on an RHEL 3 system.
Am I perhaps using a version of python that is too recent?
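(The traceback itself points at the likely cause rather than the Python version: sgmllib's feed() does `self.rawdata = self.rawdata + data`, so the argument must be a single string. If the page was read with something like readlines() — an assumption on my part, since that code isn't shown — `html` would be a list of lines, which produces exactly this TypeError. A minimal sketch of the bug and the fix:)

```python
# Sketch of the suspected bug: feed() requires a string, but readlines()
# (hypothetically how `html` was built) returns a list of lines.
html = ["<html>\n", "<body><p>hello</p></body>\n", "</html>\n"]

# Concatenating a str with a list raises TypeError, which is what
# sgmllib's feed() hits internally when handed a list:
try:
    "" + html
except TypeError:
    pass  # "cannot concatenate 'str' and 'list' objects" under Python 2.3

# The fix: join the lines into one string before calling soup.feed():
html_text = "".join(html)
```

(Alternatively, reading the page with read() instead of readlines() would yield a string in the first place.)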