HTML Parser chokes on WordHTML...
Steven Taschuk
staschuk at telusplanet.net
Fri May 2 15:09:18 EDT 2003
Quoth Harald Massa:
[...]
> first, content of an <!-- Tag is taken as data:
[...]
> <style>
> <!--
> /* Font Definitions */
> @font-face
This is by design; <style> is considered a "CDATA content
element", which entails that comments are not recognized as
comments inside such elements. There is a reason for this; see
below.
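A small sketch of the behaviour described above, in case it helps to see it directly. It assumes Python 3, where the class lives in html.parser rather than the HTMLParser module; the names CommentProbe and probe are made up for illustration.

```python
from html.parser import HTMLParser


class CommentProbe(HTMLParser):
    """Records comments and character data separately."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.data = []

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_data(self, data):
        self.data.append(data)


def probe(markup):
    """Return (comments, joined character data) for a piece of markup."""
    parser = CommentProbe()
    parser.feed(markup)
    parser.close()
    return parser.comments, "".join(parser.data)


# Inside <style>, the "<!-- ... -->" text is reported as data, not as a comment.
style_comments, style_data = probe("<style><!-- p { color: red } --></style>")

# Outside <style>, the same syntax is reported through handle_comment.
plain_comments, plain_data = probe("<p><!-- note --></p>")
```

Here style_comments should come back empty while style_data holds the whole "<!-- ... -->" text, and plain_comments should hold the one comment.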
[...]
> To my understanding no good idea to put the stylesheet inside of the
> HTML-File, but rather legal HTML. [...]
Well, it's a complicated question. First of all, there's nothing
illegal about having a stylesheet in HTML; it's just text, and
that's fine. But there are some issues:
Strictly speaking, anything inside <!-- --> is a comment and the
parser should ignore it.
However, the HTML specs say that if a browser does not
recognize an element, it should still try to render what's inside
it, like normal text. So, if a browser does not recognize <style>
(lynx doesn't, I think), it will dump the stylesheet onto the
screen, which is ugly.
So, people started putting stylesheets inside <!-- -->, so
browsers which don't recognize <style> wouldn't display the
stylesheet.
But this creates problems for browsers which *do* recognize <style>,
since they have to either (a) ignore the stylesheet because it's
in a comment, or (b) treat <style> as a special case, and ignore
<!-- --> there.
HTMLParser, doing the best it can to deal with real-world
practice, does (b). (You can change this behaviour by changing
HTMLParser.CDATA_CONTENT_ELEMENTS, I think.)
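A hedged sketch of that change, assuming Python 3's html.parser, where CDATA_CONTENT_ELEMENTS is a class attribute that a subclass can override. The names StyleCommentParser and style_comments are hypothetical, and this has only been thought through for simple input.

```python
from html.parser import HTMLParser


class StyleCommentParser(HTMLParser):
    # Hypothetical subclass: keep <script> as a CDATA content element,
    # but let <style> be parsed like any ordinary element.
    CDATA_CONTENT_ELEMENTS = ("script",)

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


def style_comments(markup):
    """Return the comments the modified parser reports for the markup."""
    parser = StyleCommentParser()
    parser.feed(markup)
    parser.close()
    return parser.comments


# With <style> removed from the tuple, the wrapped stylesheet arrives
# through handle_comment instead of handle_data.
found = style_comments("<style><!-- p { color: red } --></style>")
```

The catch is that a stylesheet which is *not* wrapped in <!-- --> is then parsed as ordinary markup, so any "<" in it may be misread as the start of a tag.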
This is a mess, and is why, as you say, it's better to link to an
external stylesheet via, say,
<link rel='stylesheet' type='text/css' href='...' />
in the <head> element.
(Exactly the same issues arise with <script> elements, and
HTMLParser ignores <!-- --> there too.)
[...]
> The second error is: HTML-Parser excepts with: [...]
> HTMLParseError: expected name token, at line 1494, column 29
>
> Line 1494 from the Error is:
>
> <p class=Aufzhlung-Strich><![if !supportLists]><span
[...]
> again, <![if !supportLists]> does not look great, but should be legal
> HTML - shouldn't it?
No: <![if ...]> isn't legal HTML, so HTMLParser quite properly
rejects it. Apart from comments, <! is legal only for starting a
DOCTYPE declaration (and inside a DTD, which is not usually
present in an HTML document).
> So... is there any replacement for the HTMLParser from the python.lib
> which even can eat Microsoft Word HTML ?
I don't know. Try htmllib and see what happens. Also try
subclassing HTMLParser to do what you need.
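One possible workaround, sketched under assumptions: strip Word's <![if ...]> and <![endif]> markers before the text ever reaches the parser. This uses Python 3's html.parser; the names WORD_CONDITIONAL, strip_word_conditionals, TagCollector and collect_tags are invented for the example, and the pattern covers only the marker form quoted above, not everything Word emits.

```python
import re
from html.parser import HTMLParser

# Matches markers such as <![if !supportLists]> and <![endif]>.
WORD_CONDITIONAL = re.compile(r"<!\[(?:if\b[^\]]*|endif)\]>", re.IGNORECASE)


def strip_word_conditionals(markup):
    """Remove Word's <![if ...]> / <![endif]> markers, keeping their content."""
    return WORD_CONDITIONAL.sub("", markup)


class TagCollector(HTMLParser):
    """Collects start-tag names, just to show the parse goes through."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def collect_tags(markup):
    """Parse the cleaned markup and return the start tags seen."""
    parser = TagCollector()
    parser.feed(strip_word_conditionals(markup))
    parser.close()
    return parser.tags


sample = "<p class=Aufzhlung-Strich><![if !supportLists]><span>x</span><![endif]></p>"
tags = collect_tags(sample)
```

Note that this keeps whatever sat between the markers, which may or may not be what you want for a given document.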
--
Steven Taschuk staschuk at telusplanet.net
"Our analysis begins with two outrageous benchmarks."
-- "Implementation strategies for continuations", Clinger et al.