[BangPypers] HTML Parsing in python

Anand Balachandran Pillai abpillai at gmail.com
Tue Oct 20 15:02:58 CEST 2009


On Thu, Sep 10, 2009 at 7:44 PM, Puneet Aggarwal <look4puneet at gmail.com> wrote:

> Thanks all for the suggestions. I think I will start with BeautifulSoup
> (3.0.7a) and will experiment with other suggested libs if it does not fit
> into my requirement or if I face issues with this.
>

You are not going to believe this, but the creator of BeautifulSoup
(Leonard Richardson) advised me to use the SGMLParser class (from the
standard sgmllib module) for parsing HTML. This was back in 2004 (or
2005), when I had written to him about using BeautifulSoup as the parser
in HarvestMan. He advised me to derive a wrapper from SGMLParser, and
that's what I did.
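
For anyone wondering what deriving a wrapper from an event-driven parser
looks like, here is a minimal sketch. Note that this is not the HarvestMan
code: sgmllib (the module that provides SGMLParser) was removed in Python 3,
so the sketch swaps in html.parser.HTMLParser from the standard library,
which follows the same subclass-and-override pattern. The class name
LinkCollector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical wrapper that collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names before calling us;
        # attrs arrives as a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Sample markup (an assumption, deliberately sloppy: mixed case, unclosed <p>).
sample = (
    '<html><body><a href="http://python.org">Python</a> '
    '<A HREF="/docs">Docs</a><p>unclosed</body>'
)

parser = LinkCollector()
parser.feed(sample)
parser.close()
print(parser.links)
```

The base class does the tokenizing, and the subclass only decides what to
do with each start tag, which is why a thin wrapper is often enough for
link extraction in a crawler.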

In case you are interested, you can check out the HTML parser used in
HarvestMan. It is available at:


http://harvestman-crawler.googlecode.com/svn/trunk/HarvestMan/harvestman/lib/pageparser.py



>
> On Thu, Sep 10, 2009 at 7:07 PM, Baishampayan Ghose <b.ghose at gmail.com> wrote:
>
>> > Can anyone suggest a good library for HTML parsing in Python?
>> > I googled and found a few libraries: BeautifulSoup, HTMLParser,
>> > SGMLParser, etc.
>> >
>> > Can anyone suggest which one I should go for, based on your experience?
>>
>> BeautifulSoup was OK, but now it's broken. Use lxml, it's very good.
>>
>> http://codespeak.net/lxml/
>>
>> Regards,
>> BG
>>
>>
>> --
>> Baishampayan Ghose
>> b.ghose at gmail.com
>> _______________________________________________
>> BangPypers mailing list
>> BangPypers at python.org
>> http://mail.python.org/mailman/listinfo/bangpypers
>>


-- 
--Anand