HTML Parser chokes on WordHTML...

Steven Taschuk staschuk at telusplanet.net
Fri May 2 15:09:18 EDT 2003


Quoth Harald Massa:
  [...]
> first, content of an <-- Tag is taken as data:
  [...]
> <style>
> <!--
>  /* Font Definitions */
> @font-face

This is by design; <style> is considered a "CDATA content
element", which entails that comments are not recognized as
comments inside such elements.  There is a reason for this; see
below.
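
To see this for yourself, here is a minimal sketch.  It assumes a
modern Python 3, where the class lives in html.parser rather than in
the HTMLParser module; the Collector class and the sample markup are
made up for illustration.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Hypothetical helper: records which handler receives what."""

    def __init__(self):
        super().__init__()
        self.data = []
        self.comments = []

    def handle_data(self, data):
        # May be called several times for one run of text, so collect pieces.
        self.data.append(data)

    def handle_comment(self, data):
        self.comments.append(data)


parser = Collector()
parser.feed("<style><!-- p { color: red } --></style><!-- a real comment -->")
parser.close()

style_text = "".join(parser.data)
print(repr(style_text))   # the "<!-- ... -->" inside <style>, reported as data
print(parser.comments)    # only the comment that sits outside <style>
```

The text inside <style> goes to handle_data, markers and all; only
the comment outside the element goes to handle_comment.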

  [...]
> To my understanding no good idea to put the stylesheet inside of the
> HTML-File, but rather legal HTML. [...]

Well, it's a complicated question.  First of all, there's nothing
illegal about having a stylesheet in HTML; it's just text, and
that's fine.  But there are some issues:

Strictly speaking, anything inside <!-- --> is a comment and the
parser should ignore it.

However, the HTML specs say that a browser which does not
recognize an element should treat what's inside it as character
content, to be displayed like normal text.  So, if a browser does
not recognize <style> (lynx doesn't, I think), it will dump the
stylesheet onto the screen, which is ugly.

So, people started putting stylesheets inside <!-- -->, so
browsers which don't recognize <style> wouldn't display the
stylesheet.

But this creates a problem for browsers which *do* recognize
<style>: they have to either (a) ignore the stylesheet because
it's in a comment, or (b) treat <style> as a special case, and
ignore <!-- --> there.

HTMLParser, doing the best it can to deal with real-world
practice, does (b).  (You can change this behaviour by changing
HTMLParser.CDATA_CONTENT_ELEMENTS, I think.)
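
A sketch of that override, assuming Python 3's html.parser and
assuming that removing "style" from the tuple is all it takes; the
StyleCommentParser name is hypothetical.

```python
from html.parser import HTMLParser


class StyleCommentParser(HTMLParser):
    # Assumption: dropping "style" from this tuple makes the parser treat
    # <style> like any other element; <script> keeps its special handling.
    CDATA_CONTENT_ELEMENTS = ("script",)

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


parser = StyleCommentParser()
parser.feed("<style><!-- p { color: red } --></style>")
parser.close()
print(parser.comments)  # the stylesheet text now arrives as a comment
```

The catch is that a stylesheet *not* wrapped in <!-- --> would then
be parsed as ordinary markup, so any "<" or "&" in it could confuse
the parser.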

This is a mess, and is why, as you say, it's better to link to an
external stylesheet via, say,
    <link rel='stylesheet' type='text/css' href='...' />
in the <head> element.

(Exactly the same issues arise with <script> elements, and
HTMLParser ignores <!-- --> there too.)

  [...]
> The second error is: HTML-Parser excepts with: [...]
> HTMLParseError: expected name token, at line 1494, column 29
> 
> Line 1494 from the Error is:
> 
> <p class=Aufzhlung-Strich><![if !supportLists]><span
  [...]
> again, <![if !suportLists]> does not look great, but should be legal
> HTMl - should'nt it? 

No: <![if ...]> isn't legal HTML, so HTMLParser quite properly
rejects it.  In an HTML document, <! normally begins only a
comment (<!--) or a DOCTYPE declaration; other declarations belong
inside a DTD, which is not usually present in an HTML document.

> So... is there any replacement for the HTMLParser from the python.lib
> which even can eat Microsoft Word HTML ? 

I don't know.  Try htmllib and see what happens.  Also try
subclassing HTMLParser to do what you need.
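
If subclassing turns out to be awkward, a cruder alternative is to
strip Word's markers from the text before the parser ever sees it.
This is only a sketch: the regular expression, the function name
and the sample line are my own assumptions, and it has not been
tried on real Word output.

```python
import re

# Hypothetical pattern: matches Word's "<![if ...]>" and "<![endif]>" markers.
WORD_CONDITIONAL = re.compile(r"<!\[(?:if\b[^\]>]*|endif)\]>", re.IGNORECASE)


def strip_word_conditionals(html):
    """Return html with Word's conditional markers removed.

    The content between the markers is kept; only the markers go.
    """
    return WORD_CONDITIONAL.sub("", html)


line = "<p class=Example><![if !supportLists]><span>1.</span><![endif]>text</p>"
cleaned = strip_word_conditionals(line)
print(cleaned)
```

The cleaned string can then be passed to feed() as usual.  Keeping
the content between the markers is a judgment call; you may prefer
to drop it.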

-- 
Steven Taschuk                            staschuk at telusplanet.net
"Our analysis begins with two outrageous benchmarks."
  -- "Implementation strategies for continuations", Clinger et al.




