htmllib: CR in CDATA

Mark Nottingham mnot at pobox.com
Tue Jun 22 08:48:21 EDT 1999


> well, htmllib doesn't claim to be HTML 4.0 compliant...

But it does claim 2.0:
"""HTML 2.0 parser.

See the HTML 2.0 specification:
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_toc.html
"""

Fine. Now, if we have a look at
http://www.w3.org/MarkUp/html-spec/html-spec_9.html#SEC9.1
we'll see that attributes are marked as CDATA.

Unfortunately, I don't have a copy of the SGML specification, so I can't
definitively say that this is the proper way to treat CDATA; all I have is
the description in the HTML 4.0 docs, as previously referenced. So, it's
probably not a good idea to patch this in sgmllib.py (as I did). However, it
is IMHO reasonable to conclude that, since both 2.0 and 4.0 refer to SGML
for the definition of CDATA, we can apply what we know about it from one to
the other (SGML being a fairly stable spec AFAIK).

I'm certainly willing to admit that this isn't directly specified behaviour
for a 2.0 parser, but I still think it's the Right Thing. At a practical
level, I'm parsing HTML with these constructs in it; if I pass off an HREF
to httplib that has a newline in it, all sorts of bad things happen.

I've ended up calling a cleaning function each time I parse attributes in my
subclassed parser; this does the job nicely. However, IMHO this sort of
lexical processing/second guessing shouldn't be required of the user of a
parser.
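
For illustration, here is a minimal sketch of that kind of cleaning function.
It assumes the HTML 4.0 rules for CDATA attribute values as I read them
(ignore line feeds, replace each carriage return or tab with a single space,
optionally drop leading and trailing white space) and skips entity
replacement. The names clean_cdata and clean_attrs are made up for this
example; they are not part of htmllib or sgmllib.

```python
def clean_cdata(value, strip=True):
    """Normalize a CDATA attribute value (assumed HTML 4.0 behaviour)."""
    # Line feeds are dropped entirely.
    value = value.replace("\n", "")
    # Each carriage return or tab becomes a single space.
    value = value.replace("\r", " ").replace("\t", " ")
    # Leading/trailing white space may optionally be ignored.
    return value.strip() if strip else value


def clean_attrs(attrs):
    """Clean every value in a list of (name, value) attribute pairs."""
    return [(name, clean_cdata(value)) for name, value in attrs]
```

In a subclassed parser, something like clean_attrs(attrs) could be called at
the top of each tag handler before the HREF is handed to httplib.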


> ...and it doesn't claim to be a "user agent", either...

*sigh*
Do we _really_ want to take a trip down this semantic rabbit warren? In HTML
2.0-land, user agent is:
A component of a distributed system that presents an interface and processes
requests on behalf of a user; for example, a www browser or a mail user
agent.

Now, htmllib certainly:
* is a component
* is part of a distributed system (i.e., the Web)
* presents an interface (programmatic)
* processes requests on behalf of a user

I'm curious... if it's not a user agent in the quoted context, what is it?






