[Expat-bugs] [ expat-Bugs-768010 ] Extended characters not parsed correctly...

Wed Jul 9 09:32:34 EDT 2003

Bugs item #768010, was opened at 2003-07-08 16:13
Message generated for change (Comment added) made by kwaclaw
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=768010&group_id=10127

Category: None
Group: None
>Status: Closed
>Resolution: Rejected
Priority: 5
Submitted By: Tony Palmer (etpalmer)
Assigned to: Nobody/Anonymous (nobody)
Summary: Extended characters not parsed correctly...

Initial Comment:
We have been using the expat parser for a couple of years
and have recently needed to parse extended
characters (128-255).  Even though I specify "iso-8859-1"
in my parser create call, it still returns incorrect
character data.  I am not sure what I am doing wrong.
I am on a Windows 2000 platform with Expat 1.95.6.

I feed in the following:
<?xml version="1.0" encoding="iso-8859-1"?> 
<MESSAGE>TÖNY</MESSAGE>

What I get out of the parse is (start,character data,
hex of character data, end):
MESSAGE
T├ûNY
54 ffffffc3 ffffff96 4e 59
MESSAGE

My code is as follows:

void startElement(void *userData, const char *name, const char **atts)
{
	printf("%s\n",name);
}

void endElement(void *userData, const char *name)
{
	printf("%s\n",name);
}

void characterData(void *userData, const XML_Char *s, int len)
{
	char s2[256000];
	int i;
	strncpy(s2,s,len);
	s2[len]=0;
	printf("%s\n", s2);
	for (i = 0; i < len; ++i)
		printf("%x ",s2[i]);
	printf("\n");
}

void ParseXML(const char* csInput)
{
	XML_Parser parser = XML_ParserCreate("iso-8859-1");
	int len = 0;
	XML_SetElementHandler(parser, startElement, endElement);
	XML_SetCharacterDataHandler(parser, characterData);

	len = strlen(csInput);

	printf("started XML parse\n");

	if (!XML_Parse(parser, csInput, len, 1))
	{
		printf("%s at line %d column %d\n",
			XML_ErrorString(XML_GetErrorCode(parser)),
			XML_GetCurrentLineNumber(parser),
			XML_GetCurrentColumnNumber(parser));
		XML_ParserFree(parser);	/* free the parser on the error path too */
		return;
	}
	XML_ParserFree(parser);
}

Any help would be appreciated.

Thanks,
Tony

----------------------------------------------------------------------

>Comment By: Karl Waclawek (kwaclaw)
Date: 2003-07-09 11:32

Message:
Logged In: YES 
user_id=290026

It generally does not make much sense to process data
internally in a non-standard encoding, since you may receive
input from various sources with different encodings
and would have to write different code for each one.

You likely need to convert to ISO-8859-1 for display
and output purposes only. This is easiest if you use the
version of Expat that reports wide characters (UTF-16),
since ISO-8859-1 corresponds to the first 256 Unicode
code points: take every UTF-16 code unit with a zero
high byte and extract its low byte.
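
The low-byte extraction described here might look roughly like the
sketch below. The function name utf16_to_latin1 and the fallback
parameter are hypothetical and not part of Expat, and the code assumes
a wide-character build in which each code unit fits in an unsigned
short.

```c
#include <stddef.h>

/* Hypothetical helper, not an Expat API.
   Copies len UTF-16 code units from src into dst as ISO-8859-1 bytes.
   Code units above 0x00FF have no ISO-8859-1 equivalent and are
   replaced by the caller-supplied fallback byte (a surrogate pair
   therefore produces two fallback bytes).
   dst must have room for len bytes. Returns the number of bytes written. */
size_t utf16_to_latin1(const unsigned short *src, size_t len,
                       unsigned char *dst, unsigned char fallback)
{
    size_t i;
    for (i = 0; i < len; ++i) {
        if (src[i] <= 0x00FF)
            dst[i] = (unsigned char)src[i];   /* high byte is zero: keep low byte */
        else
            dst[i] = fallback;                /* outside ISO-8859-1 */
    }
    return len;
}
```

The result is not null-terminated; a caller that wants a C string
would append the terminator itself.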

Closing this report, as it is not a bug.

----------------------------------------------------------------------

Comment By: Tony Palmer (etpalmer)
Date: 2003-07-09 11:19

Message:
Logged In: YES 
user_id=818598

Thanks for the clarification.  I guess then I am not sure how I 
am supposed to convert from UTF-8 back to the iso-8859-1 
character that was passed into the parser.
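
One possible way to do this conversion by hand is sketched below. The
name utf8_to_latin1 is hypothetical and not part of Expat, and the
sketch assumes each buffer holds only whole UTF-8 characters. Only the
lead bytes 0xC2 and 0xC3 can start a character in the ISO-8859-1 range
(U+0080 to U+00FF); anything else is reported as an error.

```c
#include <stddef.h>

/* Hypothetical helper, not an Expat API.
   Converts len bytes of UTF-8 in src to ISO-8859-1 bytes in dst.
   dst must have room for len bytes (the output is never longer).
   Returns the number of bytes written, or (size_t)-1 if the input is
   malformed or contains a character above U+00FF. */
size_t utf8_to_latin1(const char *src, size_t len, unsigned char *dst)
{
    size_t i = 0, n = 0;
    while (i < len) {
        unsigned char b = (unsigned char)src[i];
        if (b < 0x80) {
            /* ASCII is identical in both encodings */
            dst[n++] = b;
            i += 1;
        } else if ((b == 0xC2 || b == 0xC3) && i + 1 < len &&
                   ((unsigned char)src[i + 1] & 0xC0) == 0x80) {
            /* two-byte sequence 110000xx 10yyyyyy -> xxyyyyyy */
            dst[n++] = (unsigned char)(((b & 0x03) << 6) |
                                       ((unsigned char)src[i + 1] & 0x3F));
            i += 2;
        } else {
            return (size_t)-1;
        }
    }
    return n;
}
```

Applied to the five bytes from the report (54 c3 96 4e 59), this
yields the four bytes 54 d6 4e 59, where d6 is the ISO-8859-1 code
for Ö.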

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2003-07-08 16:20

Message:
Logged In: YES 
user_id=290026

Are you aware that Expat reports XML content encoded
as UTF-8 or UTF-16 only? The document's encoding
is irrelevant for that.
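
For illustration, the bytes c3 96 in the reported hex dump are the
UTF-8 encoding of Ö, which is the single byte 0xD6 in ISO-8859-1. A
minimal sketch of that encoding step follows; latin1_to_utf8 is a
hypothetical helper name, not an Expat function.

```c
#include <stddef.h>

/* Hypothetical helper, not an Expat API.
   Encodes one ISO-8859-1 byte as UTF-8 into out.
   Returns the number of bytes written (1 or 2). */
size_t latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) {
        out[0] = c;                                /* ASCII: unchanged */
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (c >> 6));     /* lead byte:  110000xx */
    out[1] = (unsigned char)(0x80 | (c & 0x3F));   /* trail byte: 10yyyyyy */
    return 2;
}
```

The ffffff prefix in the dump comes from printing a signed char with
%x, not from the parser.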

----------------------------------------------------------------------
