HTMLParser and Quotes

Bengt Richter bokr at oz.net
Thu Jan 2 07:58:27 EST 2003


On Thu, 02 Jan 2003 05:37:37 GMT, Richard West <rwest2 at opti.cgi.net> wrote:

>
>This evening I found a case that crashes HTMLParser (lib 2.2.2) which
>probably shouldn't.  Check out the following code:
>
>
>from HTMLParser import HTMLParser
>
>test = """<html>
><body>
><font face=arial,helvetica>test</font>
></body>
></html>"""
>
>x = HTMLParser()
>x.feed(test)
>
>
>The face should obviously have quotes around its value, but under the
>circumstances I would think HTMLParser should take anything up until
>the next space or end of the tag as its value.
>
First a little nit: I don't think it's fair to say it "crashes" (which
implies a loss-of-control bug) when what the program did was complain
about its input, presumably according to design.

From the HTML 4.0 spec:

"The HTML 4.0 specification includes an SGML declaration, three document
type definitions (see the section on HTML version information for
a description of the three), and a list of character references."

...

"3.2 SGML constructs used in HTML

The following sections introduce SGML constructs that are used in HTML."

...

"3.2.2 Attributes

Elements may have associated properties, called attributes, which may have values (by
default, or set by authors or scripts). Attribute/value pairs appear before the final ">"
of an element's start tag. Any number of (legal) attribute value pairs, separated by
spaces, may appear in an element's start tag. They may appear in any order.

In this example, the id attribute is set for an H1 element: 

   <H1 id="section1">
   This is an identified heading thanks to the id attribute
   </H1> 

By default, SGML requires that all attribute values be delimited using either double
quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39).
Single quote marks can be included within the attribute value when the value is
delimited by double quote marks, and vice versa. Authors may also use numeric
character references to represent double quotes (&#34;) and single quotes (&#39;).
For double quotes authors can also use the character entity reference &quot;.

In certain cases, authors may specify the value of an attribute without any quotation
marks. The attribute value may only contain letters (a-z and A-Z), digits (0-9),
hyphens (ASCII decimal 45), and periods (ASCII decimal 46). We recommend using
quotation marks even when it is possible to eliminate them. 

Attribute names are always case-insensitive.

Attribute values are generally case-insensitive. The definition of each attribute in the
reference manual indicates whether its value is case-insensitive.

All the attributes defined by this specification are listed in the attribute index."
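
The unquoted-value rule in the "In certain cases ..." paragraph is easy to express as a pattern. Here is a minimal sketch, assuming Python 3; the names UNQUOTED_OK, may_be_unquoted and offending_chars are my own for illustration and are not part of HTMLParser or the spec.

```python
import re

# Characters the quoted HTML 4.0 paragraph allows in an unquoted
# attribute value: letters, digits, hyphens, and periods.
UNQUOTED_OK = re.compile(r"[A-Za-z0-9.-]+")


def may_be_unquoted(value):
    """Return True if value uses only characters allowed without quotes."""
    return UNQUOTED_OK.fullmatch(value) is not None


def offending_chars(value):
    """Return, in order, the characters that force the value to be quoted."""
    return [ch for ch in value if not UNQUOTED_OK.fullmatch(ch)]


# The value from the original post, and a single-name value for comparison.
print(may_be_unquoted("arial"))
print(may_be_unquoted("arial,helvetica"))
print(offending_chars("arial,helvetica"))
```

This only checks the character set named in the quoted paragraph; it says nothing about how any particular parser version actually tokenizes the tag.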


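From the "In certain cases ..." paragraph above, it looks to me
like the comma in the unquoted face value is the origin of the program's
complaint: a comma is not among the characters (letters, digits, hyphens,
periods) that the spec allows in an unquoted value.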
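
For what it's worth, quoting the value is the remedy the spec itself recommends. Here is a rough sketch of the original test with the face value quoted, assuming Python 3, where the module is html.parser rather than the 2.2 HTMLParser module used in the original post. AttrCollector is a hypothetical helper name of mine, added only so the parsed attributes can be inspected.

```python
from html.parser import HTMLParser  # Python 3 module name (assumption)


class AttrCollector(HTMLParser):
    """Hypothetical helper: records (tag, attrs) for every start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes removed.
        self.seen.append((tag, attrs))


# Same document as in the original post, but with the face value quoted.
test = """<html>
<body>
<font face="arial,helvetica">test</font>
</body>
</html>"""

x = AttrCollector()
x.feed(test)
x.close()

# Keep only the attributes reported for the font tag.
font_attrs = [attrs for tag, attrs in x.seen if tag == "font"]
print(font_attrs)
```

With the quotes in place the comma is just part of the attribute value, so the parser hands back face with the full "arial,helvetica" string.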

Regards,
Bengt Richter



