From yikaikai at gmail.com  Wed May  5 05:55:59 2010
From: yikaikai at gmail.com (zhang yikai)
Date: Wed, 5 May 2010 11:55:59 +0800
Subject: [Expat-discuss] not well-formed (invalid token) at line 1
Message-ID:

Hi all,

I want to use expat to parse an XML file. Here is my code, modified
from the examples:

static void XMLCALL
startElement(void *userData, const char *name, const char **atts)
{
  int i;
  int *depthPtr = (int *)userData;
  for (i = 0; i < *depthPtr; i++)
    putchar('\t');
  puts(name);
  for (i = 0; atts[i]; i += 2) {
    printf(" %s='%s'", atts[i], atts[i + 1]);
  }
  *depthPtr += 1;
}

static void XMLCALL
endElement(void *userData, const char *name)
{
  int *depthPtr = (int *)userData;
  *depthPtr -= 1;
}

int
main(int argc, char *argv[])
{
  char buf[1024] = "\
<INVENTORY>\
<BOOK>\
<TITLE>The Adventures of Huckleberry Finn</TITLE>\
<AUTHOR>Mark Twain</AUTHOR>\
<BINDING>mass market paperback</BINDING>\
<PAGES>298</PAGES>\
<PRICE>$5.49</PRICE>\
</BOOK>\
<BOOK>\
<TITLE>Leaves of Grass</TITLE>\
<AUTHOR>Walt Whitman</AUTHOR>\
<BINDING>hardcover</BINDING>\
<PAGES>462</PAGES>\
<PRICE>$7.75</PRICE>\
</BOOK>\
</INVENTORY>";
  XML_Parser parser = XML_ParserCreate(NULL);
  int done;
  int depth = 0;
  XML_SetUserData(parser, &depth);
  XML_SetElementHandler(parser, startElement, endElement);
  do {
    if (XML_Parse(parser, buf, 1024, 1) == XML_STATUS_ERROR) {
      printf("%s at line %d\n" XML_FMT_INT_MOD "u\n",
             XML_ErrorString(XML_GetErrorCode(parser)),
             XML_GetCurrentLineNumber(parser));
      return 1;
    }
  } while (1);
  XML_ParserFree(parser);
  return 0;
}

I got this result:

INVENTORY
    BOOK
        TITLE
        AUTHOR
        BINDING
        PAGES
        PRICE
    BOOK
        TITLE
        AUTHOR
        BINDING
        PAGES
        PRICE
not well-formed (invalid token) at line 1

What is wrong with my code?

Thank you

From jeremy at omsys.com  Wed May  5 08:31:39 2010
From: jeremy at omsys.com (Jeremy H. Griffith)
Date: Tue, 04 May 2010 23:31:39 -0700
Subject: [Expat-discuss] not well-formed (invalid token) at line 1
In-Reply-To:
References:
Message-ID:

On Wed, 5 May 2010 11:55:59 +0800, zhang yikai wrote:

>What is wrong with my code?

Easy.  You've provided no way to exit the do/while
loop without an error.  So when expat is done, it
tries again and fails.

HTH!

-- Jeremy H. Griffith, at Omni Systems Inc.
   http://www.omsys.com/

From yikaikai at gmail.com  Wed May  5 10:25:32 2010
From: yikaikai at gmail.com (zhang yikai)
Date: Wed, 5 May 2010 16:25:32 +0800
Subject: [Expat-discuss] not well-formed (invalid token) at line 1
In-Reply-To:
References:
Message-ID:

Thank you for your hint. I changed it to

  if (XML_Parse(parser, buf, strlen(buf), XML_TRUE) == XML_STATUS_ERROR) {
    printf("Error: %s\n", XML_ErrorString(XML_GetErrorCode(parser)));
  }
  XML_ParserFree(parser);

and now it works fine.

2010/5/5 Jeremy H. Griffith:
> On Wed, 5 May 2010 11:55:59 +0800, zhang yikai wrote:
>
>>What is wrong with my code?
>
> Easy.  You've provided no way to exit the do/while
> loop without an error.  So when expat is done, it
> tries again and fails.
>
> HTH!
>
> -- Jeremy H. Griffith, at Omni Systems Inc.
>    http://www.omsys.com/
>

From kcirtap at lavabit.com  Mon May 10 23:20:04 2010
From: kcirtap at lavabit.com (kcirtap at lavabit.com)
Date: Mon, 10 May 2010 17:20:04 -0400 (EDT)
Subject: [Expat-discuss] Handling invalid tokens
Message-ID: <6517.71.224.104.216.1273526404.squirrel@lavabit.com>

Hi, I'm trying to parse a UTF-8 XML file that has a byte with a hex value
of 0x1B (ASCII ESCape).

Whenever I try to load up said XML file with expat, expat gives me "not
well-formed (invalid token)" for that byte. If I remove that byte, expat
loads the file just fine.

My question is, how do I make it so that expat doesn't error when it finds
that byte?

Thanks,
Patrick

From nickmacd at gmail.com  Tue May 11 15:01:23 2010
From: nickmacd at gmail.com (Nick MacDonald)
Date: Tue, 11 May 2010 09:01:23 -0400
Subject: [Expat-discuss] Handling invalid tokens
In-Reply-To: <6517.71.224.104.216.1273526404.squirrel@lavabit.com>
References: <6517.71.224.104.216.1273526404.squirrel@lavabit.com>
Message-ID:

Patrick:

I'm afraid you're trying to do something you just shouldn't!
If you read the XML 1.0 specification ( http://www.w3.org/TR/REC-xml/ ),
you'll find this text:

== snip ==
2.2 Characters

[Definition: A parsed entity contains text, a sequence of characters,
which may represent markup or character data.] [Definition: A
character is an atomic unit of text as specified by ISO/IEC 10646:2000
[ISO/IEC 10646]. Legal characters are tab, carriage return, line feed,
and the legal characters of Unicode and ISO/IEC 10646. The versions of
these standards cited in A.1 Normative References were current at the
time this document was prepared. New characters may be added to these
standards by amendments or new editions. Consequently, XML processors
MUST accept any character in the range specified for Char. ]

Character Range

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
             | [#x10000-#x10FFFF]
    /* any Unicode character, excluding the surrogate blocks,
       FFFE, and FFFF. */
== snip ==

As you can plainly see, your ESCape character is not one of the ones
allowed in a valid/well-formed XML file.  eXpat is not going to allow
you to parse invalid XML, and this is, I think, a good thing... there
is no point having a specification if people can break the rules
willy-nilly.

I know that XML 1.1 changes the support of characters (basically
changing from "only these things are allowed" to "anything not
forbidden is allowed") but I am not really familiar with the spec, and
I don't know if it would help you with ESC... and I don't honestly
know if eXpat handles XML 1.1, as I have never tried.  You could throw
a <?xml version="1.1"?> at the top of your document and see if
anything changes... but if not, then:

As I see it, you have two choices... filter your input on the way to
eXpat so that invalid characters are removed before the parser sees
them, or come up with an alternative encoding ... such as an tag...
but that will only work if the ESCape char is in the right places...
(as text in the body of tags... it's not going to work in the
parameters to tags.)

Good luck,
  Nick

On Mon, May 10, 2010 at 5:20 PM, wrote:
> Hi, I'm trying to parse a UTF-8 XML file that has a byte with a hex value
> of 0x1B (ASCII ESCape).
>
> Whenever I try to load up said XML file with expat, expat gives me "not
> well-formed (invalid token)" for that byte. If I remove that byte, expat
> loads the file just fine.
>
> My question is, how do I make it so that expat doesn't error when it finds
> that byte?
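
The first option Nick describes, filtering the input before the parser
sees it, could look roughly like the sketch below. It is not from the
thread: the function names are hypothetical, and it assumes the input is
UTF-8, where every byte below 0x80 is a complete ASCII character, so
dropping single control bytes cannot split a multi-byte sequence. It only
removes the C0 control bytes that the Char production quoted above
excludes (everything below 0x20 except tab, line feed, and carriage
return). It does not check for U+FFFE, U+FFFF, surrogates, or malformed
UTF-8.

```c
#include <stddef.h>

/* Nonzero if c is a C0 control byte outside the XML 1.0 Char production:
 * anything below 0x20 other than tab (0x09), LF (0x0A), and CR (0x0D). */
static int is_forbidden_c0(unsigned char c)
{
    return c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
}

/* Hypothetical helper (not part of expat): removes forbidden C0 bytes
 * from buf in place and returns the new length. Assumes UTF-8 input;
 * bytes >= 0x80 are copied through untouched. */
size_t strip_forbidden_c0(char *buf, size_t len)
{
    size_t in;
    size_t out = 0;

    for (in = 0; in < len; in++) {
        unsigned char c = (unsigned char)buf[in];
        if (!is_forbidden_c0(c))
            buf[out++] = buf[in];  /* keep the byte, compacting as we go */
    }
    return out;
}
```

The returned length, rather than the original one, would then be passed
to XML_Parse. Note that this silently discards data: if the ESC bytes
carry meaning for the application, the alternative-encoding route is
probably the better fit.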