HTMLParser rejects real-life tagsoup
Thanos Vassilakis
tvassila at siac.com
Mon Feb 17 19:38:11 EST 2003
In the real world, most HTML templates or pages created by designers are
non-standard and will break strict parsers. That is why we at NYSE have used
our own HTML parsers for templating: the first was based on scriptfoundry's
tagparser, which was very fast and easier to use than HTMLParser, and now we
use pso.parser, which is fast, elegant and robust.
http://sourceforge.net/projects/pso/
and see docs at:
http://sourceforge.net/docman/?group_id=49265
thanos
Rene Pijlman <reageer.in at de.nieuwsgroep>
Sent by: python-list-admin@python.org
02/12/2003 05:09 PM
To: python-list at python.org
cc:
Subject: Re: HTMLParser rejects real-life tagsoup
Gerhard Häring:
>Rene Pijlman wrote:
>> I've been using the HTMLParser module to process external web
>> pages that I don't control. HTMLParser seems to be rather strict
>> [...]
>> Any suggestions on how to handle this? [...]
>
>I'd try tidying up the HTML first:
>http://www.lemburg.com/files/python/mxTidy.html
Great idea, it works fine now. Thanks!
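
For readers on a current Python, here is a minimal sketch of the tag-soup problem discussed in this thread. It assumes Python 3's html.parser module, which, unlike the strict 2003-era HTMLParser module the thread is about, tolerates malformed markup without a separate tidy step. The LinkCollector class, the extract_links helper and the sample markup are hypothetical names made up for illustration; they are not part of mxTidy or pso.parser.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, even in malformed markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list
        # of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(markup):
    """Return all href values found in markup, in document order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


# Hypothetical tag soup: uppercase tags, an unquoted attribute value,
# and <p>, <b> and <a> elements that are never closed.
soup = '<P>See <A HREF=index.html>this <b>page<p>and <a href="/docs">the docs</a>'

if __name__ == "__main__":
    print(extract_links(soup))
```

The sketch only collects links. Whether a tolerant parser is enough, or whether tidying the markup first is still the better route, depends on how much of the document structure needs to be recovered.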
--
René Pijlman
Wat wil jij leren? http://www.leren.nl