[Baypiggies] HTML Parsers (n00b)

Jeff Enderwick jeff.enderwick at gmail.com
Sun Jan 31 23:58:48 CET 2010


I've used Beautiful Soup to programmatically extract content from Word docs
saved as HTML - yuck!!!
Beautiful Soup performed ... beautifully :-). Speed was NOT a consideration
for me, though.
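
For illustration, here is a minimal sketch of that kind of extraction. It is
not the original script, and it swaps in the standard library's html.parser
for Beautiful Soup so that it runs with no extra installs. The names
TextExtractor, extract_text and WORD_HTML are made up for this example, and
the sample markup is only a guess at what Word-style HTML tends to look like.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <style> and <script> contents."""

    SKIP = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # one entry per non-empty text run

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only runs and collapse internal whitespace.
        if not self._skip_depth and data.strip():
            self._chunks.append(" ".join(data.split()))

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string, one text run per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample resembling Word's "Save as HTML" output.
WORD_HTML = """<html xmlns:o="urn:schemas-microsoft-com:office:office">
<head><style>p.MsoNormal {margin:0in;}</style></head>
<body><p class=MsoNormal><span style='font-size:12.0pt'>Hello,
<o:p></o:p></span></p>
<p class=MsoNormal><b>world</b></p></body></html>"""

print(extract_text(WORD_HTML))
```

With the current bs4 package, BeautifulSoup(html, "html.parser").get_text()
does roughly the same job in one line, though it does not skip <style>
contents for you.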

On Thu, Jan 28, 2010 at 5:26 PM, Jeff Kunce <jjkunce at gmail.com> wrote:

> No question about lxml's speed.  I'm using it (as part of Deliverance) on a
> current project to re-theme a website on the fly.
>
> But for day-to-day use, it's Beautiful Soup.  I can't resist pure Python :)
>
>   -- Jeff
>
>
> On Thu, Jan 28, 2010 at 4:58 PM, Alec Flett <alecf at flett.org> wrote:
>
>> lxml is awesome. Don't be fooled by the name - it has a great understanding
>> of HTML, even malformed HTML.
>>
>> ianbicking did a great comparison years ago but it still stands:
>> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>>
>> and an update:
>>
>> http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/
>>
>> Basically: lxml is fast as hell (it uses libxml2 under the hood), has a low
>> memory footprint, and is very forgiving of wacky HTML, more so than
>> Beautiful Soup.
>>
>> I think pyquery actually uses lxml under the hood? Or at least libxml2?
>>
>>
>> Alec
>>
>>