HTMLParser rejects real-life tagsoup

Thanos Vassilakis tvassila at siac.com
Mon Feb 17 19:38:11 EST 2003


In the real world, most HTML templates or pages created by designers are
non-standard and will break parsers. That is why we at NYSE have used our own
HTML parsers for templating: the first was based on scriptfoundry's tagparser,
which was very fast and easier to use than HTMLParser, and now we use
pso.parser, which is fast, elegant and robust.

http://sourceforge.net/projects/pso/
and see docs at:
http://sourceforge.net/docman/?group_id=49265

thanos



                                                                                                                               
From:    Rene Pijlman <reageer.in at de.nieuwsgroep>
Sent by: python-list-admin@python.org
Date:    02/12/2003 05:09 PM
To:      python-list at python.org
cc:
Subject: Re: HTMLParser rejects real-life tagsoup




Gerhard Häring:
>Rene Pijlman wrote:
>> I've been using the HTMLParser module to process external web
>> pages that I don't control. HTMLParser seems to be rather strict
>> [...]
>> Any suggestions on how to handle this? [...]
>
>I'd try tidying up the HTML first:
>http://www.lemburg.com/files/python/mxTidy.html

Great idea, it works fine now. Thanks!
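
For readers on a current Python, here is a minimal sketch of a different
route from the mxTidy step quoted above. Since Python 3.5 the standard
library's html.parser.HTMLParser has no strict mode and does not raise on
malformed markup, so simple tag soup can often be fed to it directly. The
names LinkCollector and collect_links and the sample markup are made up for
illustration; this is not what anyone in the thread ran in 2003.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html):
    """Feed possibly malformed HTML to the parser and return the hrefs found."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Unclosed tags, an unquoted attribute value, and mixed-case names.
soup = '<p><b>bold <A HREF=foo.html>one</b><td><a href="bar.html">two'
print(collect_links(soup))
```

Tidying first remains the safer choice when the document structure itself
matters: a lenient parser only reports tags in the order it finds them and
does not repair the nesting.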

--
René Pijlman

What do you want to learn?  http://www.leren.nl




