[Expat-discuss] Quick crazy question: HTML?

Fred L. Drake, Jr. fdrake@acm.org
Mon, 12 Mar 2001 09:07:37 -0500 (EST)


On Mon, Mar 12, 2001 at 02:09:26AM -0800, Dru Nelson wrote:
 > How hard would it be to get expat to handle your typical
 > HTML as well? I'm talking about the typical sloppy
 > HTML out in the wild?

Greg Stein writes:
 > Very hard. Super difficult. Insane work. ... :-)
 > 
 > XML was designed *expressly* to get away from the sloppiness of HTML. One of
 > its main design points is to be absolutely rigorous. From that standpoint,
 > there isn't even a desire/motivation to make Expat provide tolerance.

  I agree with Greg on this point.  Having worked on the Grail browser
project, I've put a lot of effort into making "HTML as deployed" work
with some similarity to other browsers -- and believe me, a parser
that supports all the heuristics needed to do that is very different
from one that supports "proper" XML.
  If you need a parser for HTML as deployed, you can take a look at
the source of any of the many HTML parsers out there, but none that
really work with "as deployed" HTML will be an easy read.  Certainly
the Mozilla sources are available, but the "entry fee" to read the
sources is probably pretty high.  The Grail sources are in Python, so
they might not be too hard to wrap your head around, but you'll need to
dig through the "sgml" directory a bit to see the structure of the
parser, and through the Grail-specific code in the main directory to
figure out some of the heuristics used to make it work.
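
  To make the difference concrete, here is a minimal sketch using
only the Python standard library.  It is not Grail's parser and shows
none of its heuristics: html.parser stands in as a generic tolerant
tokenizer, and the SLOPPY sample and the helper names (expat_accepts,
TagCollector, tolerant_start_tags) are made up for illustration.

```python
from html.parser import HTMLParser
from xml.parsers import expat

# Hypothetical sample of "HTML as deployed": unclosed <p> and <br>,
# plus improperly nested <b>/<i>.
SLOPPY = "<html><body><p>First<p>Second<br><b><i>mixed</b></i></body></html>"


def expat_accepts(text):
    """Return True if Expat parses text as well-formed XML."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(text, True)  # True marks this as the final chunk
    except expat.ExpatError:
        return False
    return True


class TagCollector(HTMLParser):
    """Record start tags as they appear, without checking nesting."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tolerant_start_tags(text):
    """Tokenize text with the tolerant parser and return its start tags."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print("expat accepts sloppy input:", expat_accepts(SLOPPY))
    print("tolerant parser saw:", tolerant_start_tags(SLOPPY))
```

  Expat rejects the sample at the first well-formedness error, while
the tolerant tokenizer just reports the tags it sees.  Deciding what
those tags should mean (for example, that a second <p> implicitly
closes the first) is where the heuristics come in, and this sketch
does not attempt any of that.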


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>
PythonLabs at Digital Creations