[Expat-discuss] Handling invalid tokens

Nick MacDonald nickmacd at gmail.com
Tue May 11 15:01:23 CEST 2010


Patrick:

I'm afraid you're trying to do something you just shouldn't!  If you
read the XML 1.0 specification ( http://www.w3.org/TR/REC-xml/ ),
you'll find this text:

== snip ==
2.2 Characters

[Definition: A parsed entity contains text, a sequence of characters,
which may represent markup or character data.] [Definition: A
character is an atomic unit of text as specified by ISO/IEC 10646:2000
[ISO/IEC 10646]. Legal characters are tab, carriage return, line feed,
and the legal characters of Unicode and ISO/IEC 10646. The versions of
these standards cited in A.1 Normative References were current at the
time this document was prepared. New characters may be added to these
standards by amendments or new editions. Consequently, XML processors
MUST accept any character in the range specified for Char. ]
Character Range
[2]  Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
              | [#x10000-#x10FFFF]
     /* any Unicode character, excluding the surrogate blocks,
        FFFE, and FFFF. */
== snip ==

As you can plainly see, your ESCape character (#x1B) is not in the
Char range, so a file that contains it is not well-formed XML.  eXpat
is not going to allow you to parse XML that is not well-formed, and
this is, I think, a good thing... there is no point having a
specification if people can break the rules willy-nilly.
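
If it helps to see the Char production as code, here is a minimal sketch in Python (my choice of language for illustration; the function name is_xml10_char is made up for this example and is not part of eXpat).  It simply tests a code point against the ranges quoted above:

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point `cp` matches the XML 1.0 Char production.

    Hypothetical helper for illustration only; the ranges are copied
    from production [2] of the XML 1.0 specification.
    """
    return (
        cp in (0x9, 0xA, 0xD)            # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF          # excludes the other C0 controls
        or 0xE000 <= cp <= 0xFFFD        # skips surrogates, FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF     # supplementary planes
    )


if __name__ == "__main__":
    for cp in (0x09, 0x1B, 0x20, 0xFFFE):
        print(f"#x{cp:X}: {'allowed' if is_xml10_char(cp) else 'not allowed'}")
```

ESC is #x1B, which falls in the gap between #xD and #x20, so the check rejects it.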

I know that XML 1.1 changes the support of characters (basically
changing from "only these things are allowed" to "anything not
forbidden is allowed") but I am not really familiar with the spec, and
I don't know if it would help you with ESC... and I don't honestly
know if eXpat handles XML 1.1 as I have never tried.  You could throw
a <?xml version="1.1" encoding="utf-8"?> at the top of your document
and see if anything changes... but if not then:

As I see it, you have two choices...  filter your input on the way to
eXpat so that invalid characters are removed before the parser sees
them, or come up with an alternative encoding ... such as an <ESC/>
tag... but that will only work if the ESCape char is in the right
places... (as text in the body of an element... it's not going to work
inside attribute values, since markup is not allowed there.)
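
To make the first option concrete, here is a rough sketch of such a filter.  It assumes Python and its standard xml.parsers.expat binding rather than the C API, and that the whole document fits in memory as a string.  The names strip_invalid_xml_chars and parse_text are hypothetical helpers invented for this example:

```python
import re
import xml.parsers.expat

# Matches any character outside the XML 1.0 Char production.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Drop characters that XML 1.0 does not allow (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub("", text)


def parse_text(xml_text: str) -> str:
    """Parse a document with Expat and return its concatenated character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Character data may arrive in several pieces, so collect and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_text, True)
    return "".join(chunks)


if __name__ == "__main__":
    raw = "<doc>before\x1bafter</doc>"
    try:
        parse_text(raw)
    except xml.parsers.expat.ExpatError as err:
        print("unfiltered input rejected:", err)
    print("filtered input parsed:", parse_text(strip_invalid_xml_chars(raw)))
```

Note that this throws the ESC away entirely.  If the byte carries meaning for you, you would substitute your own replacement (the <ESC/> idea above) instead of an empty string.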

Good luck,
  Nick


On Mon, May 10, 2010 at 5:20 PM,  <kcirtap at lavabit.com> wrote:
> Hi, I'm trying to parse a UTF-8 XML file that has a byte with a hex value
> of 0x1B (ASCII ESCape).
>
> Whenever I try to load up said XML file with expat, expat gives me "not
> well-formed (invalid token)" for that byte. If I remove that byte, expat
> loads the file just fine.
>
> My question is, how do I make it so that expat doesn't error when it finds
> that byte?

