[Expat-discuss] not well-formed (invalid token) error

Lee Passey lee at novomail.net
Thu Apr 9 00:00:36 CEST 2009


Krishna Kondaka wrote:
> Hi
> 
> I am trying to parse a very simple HTML file but I am getting 'not
> well-formed (invalid token) error'. Is there any thing I can do to
> make this work without getting errors?

Yes, you can convert your HTML file to XHTML.

Expat is an XML parser. HTML is /not/ XML; it is based on SGML syntax, 
and SGML allows certain tags to be implicitly closed. For example, this 
is allowed in HTML:

<p>This is a paragraph
<p>This is another paragraph

In HTML, when a <p> is encountered, it /implicitly/ closes any <p> 
element that is still open.

XHTML is an implementation of HTML with the additional restriction that 
the markup must be not only valid HTML but also well-formed XML. So the 
foregoing HTML snippet must be encoded in XHTML as:

<p>This is a paragraph</p>
<p>This is another paragraph</p>

Note the addition of the closing tags.
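
If you want to see the difference for yourself, here is a minimal sketch 
using Python's standard xml.parsers.expat binding (Python is my choice 
for brevity; the C API behaves the same way). The helper name 
is_well_formed and the <body> wrapper are my own additions, the latter 
only because an XML document needs a single root element.

```python
import xml.parsers.expat


def is_well_formed(markup: str) -> bool:
    """Return True if Expat accepts the markup as a complete XML document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(markup, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


# HTML style: the first <p> is never explicitly closed.
html_snippet = "<body><p>This is a paragraph\n<p>This is another paragraph</body>"

# XHTML style: every <p> has a matching </p>.
xhtml_snippet = (
    "<body><p>This is a paragraph</p>\n<p>This is another paragraph</p></body>"
)

print(is_well_formed(html_snippet))   # False
print(is_well_formed(xhtml_snippet))  # True
```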

Also, HTML has a few tags that are self-contained ("empty") and have no 
closing tags. Among these are <img>, <br> and <hr>. In XML, however, 
every element must be closed explicitly, so these empty tags are written 
with a slash before the final angle bracket.

Your example is valid HTML, but it is not well-formed XML (and so not 
valid XHTML), because the <hr> tag is not closed; use <hr/> instead.
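
To find out where Expat gives up on a file, you can ask it for the 
position and message of the first error. The following is only a sketch 
in Python's xml.parsers.expat binding: first_error is a hypothetical 
helper name, and the two documents are stand-ins I made up, not your 
actual file. The exact message you get depends on the input.

```python
import xml.parsers.expat


def first_error(markup: str):
    """Return (line, column, message) for the first Expat error, or None."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(markup, True)
    except xml.parsers.expat.ExpatError as err:
        # err.code is Expat's numeric error code; ErrorString turns it
        # into the human-readable message Expat would report.
        message = xml.parsers.expat.ErrorString(err.code)
        return (err.lineno, err.offset, message)
    return None


# Hypothetical stand-in documents, differing only in how <hr> is written.
with_open_hr = "<body><p>Above the rule</p><hr><p>Below the rule</p></body>"
with_closed_hr = "<body><p>Above the rule</p><hr/><p>Below the rule</p></body>"

print(first_error(with_open_hr))    # a (line, column, message) tuple
print(first_error(with_closed_hr))  # None
```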

If you do not control your HTML, you can use a tool like HTMLTidy 
(http://tidy.sourceforge.net/) to convert valid (and sometimes invalid) 
HTML to XHTML.
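
If you would rather run the conversion from a script, something along 
these lines should work. It is a sketch that assumes a tidy executable 
is installed and on your PATH; html_to_xhtml is a hypothetical helper 
name. The -asxhtml option asks Tidy for XHTML output and -q suppresses 
the non-essential report.

```python
import shutil
import subprocess
from typing import Optional


def html_to_xhtml(html: str) -> Optional[str]:
    """Run HTML through Tidy and return XHTML, or None if that fails."""
    tidy = shutil.which("tidy")
    if tidy is None:
        # Assumption not met: no tidy executable found on PATH.
        return None
    result = subprocess.run(
        [tidy, "-q", "-asxhtml"],
        input=html,
        capture_output=True,
        text=True,
    )
    # Tidy exits with 0 on success, 1 if there were warnings and
    # 2 if there were errors it could not repair.
    if result.returncode > 1:
        return None
    return result.stdout


if __name__ == "__main__":
    print(html_to_xhtml("<p>This is a paragraph\n<p>This is another paragraph<hr>"))
```

The string that comes back can then be handed to Expat in place of the 
original HTML.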

