[Expat-discuss] Fw: Extra character inserted in CharacterData Handler?

Subramanian, Binu binu.subramanian@barconet.com
Fri Jul 26 00:55:05 2002


Hi,

Would you have any idea how to convert the UTF-8 characters back to ASCII
(or extended ASCII) programmatically, so that I would be rid of the
multi-byte character indicator?
I am working with VC 6.0 and I would like my application to run on all
Windows platforms.

Any help or suggestions would be useful.

kr,
Binu
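
One possible approach, sketched in Python for brevity rather than in C: decode
the UTF-8 bytes to characters, then re-encode them in the single-byte code
page the application displays. The function name utf8_to_codepage and the
default of Windows-1252 are assumptions made for illustration. Characters the
target code page cannot represent are replaced with "?" here. From C on
Windows, the Win32 calls MultiByteToWideChar (with CP_UTF8) followed by
WideCharToMultiByte should give the same two-step conversion, but check that
CP_UTF8 is supported on every Windows version you target.

```python
def utf8_to_codepage(data: bytes, codepage: str = "cp1252") -> bytes:
    """Convert UTF-8 bytes to a single-byte code page.

    Step 1 decodes the UTF-8 input into Unicode characters.
    Step 2 encodes those characters in the target code page, substituting
    b"?" for anything the code page has no slot for.
    """
    text = data.decode("utf-8")
    return text.encode(codepage, errors="replace")


# The Euro sign (U+20AC) is three bytes in UTF-8 ...
euro_utf8 = b"\xe2\x82\xac"
# ... and a single byte, 0x80, in Windows-1252.
euro_cp1252 = utf8_to_codepage(euro_utf8)

# ISO-8859-1 has no Euro sign at all, so it degrades to "?".
euro_latin1 = utf8_to_codepage(euro_utf8, "latin-1")
```

Plain ASCII cannot hold the Euro or trademark symbols in any form, so a
Windows code page (or a wide-character string) is the realistic target.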


-----Original Message-----
From: Josh Martin [mailto:Josh.Martin@abq.sc.philips.com]
Sent: 25 July 2002 23:52
To: expat-discuss@lists.sourceforge.net; binu.subramanian@barconet.com
Subject: RE: [Expat-discuss] Fw: Extra character inserted in
CharacterData Handler?


Hi,

I finally figured it out, thanks in no small way to your informative link to
the UTF-8 encoding.

The output you are receiving from expat is completely correct.  Here's the
deal: I was right, although I didn't know why. The Â (A WITH CIRCUMFLEX) IS
a sort of multi-byte character indicator.  It is the first byte of the
two-byte UTF-8 sequence that expat produced for your trademark symbol.  The
output that you are seeing from expat is the properly UTF-8 encoded
characters viewed in the ISO-8859-1 encoding (or whatever ISO-8859
derivative your OS happens to be using).  It just so happens that the
second byte of that UTF-8 sequence is also the byte that represents the
trademark symbol in your OS's encoding, which is why you see the symbol
preceded by the Â character.
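
To make the byte-level picture concrete, here is a small Python sketch (not
part of expat) that encodes a character as UTF-8 and then misreads the bytes
as Windows-1252, which is my assumption for the display encoding on a
Western-European Windows system. The helper names are made up for
illustration. It compares two cases: numeric references carrying Windows-1252
code values (153 and 128) versus the real Unicode code points (8482 and
8364). Which case applies depends on what was actually written to the XML
file.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


def view_as_cp1252(ch: str) -> str:
    """Encode ch as UTF-8, then misread those bytes as Windows-1252."""
    return ch.encode("utf-8").decode("cp1252")


# Case 1 (assumption): the file used Windows-1252 code values as references.
# A parser turns &#153; into U+0099 and &#128; into U+0080.
tm_from_153 = chr(153)
euro_from_128 = chr(128)

# Case 2: the file used the Unicode code points, &#8482; and &#8364;.
tm_unicode = chr(8482)    # U+2122 TRADE MARK SIGN
euro_unicode = chr(8364)  # U+20AC EURO SIGN

# Two-byte sequences start with C2; misread, that byte shows up as Â.
assert utf8_hex(tm_from_153) == "C2 99"
assert view_as_cp1252(tm_from_153) == "Â™"
assert utf8_hex(euro_from_128) == "C2 80"
assert view_as_cp1252(euro_from_128) == "Â€"

# Three-byte sequences start with E2; misread, that byte shows up as â.
assert utf8_hex(tm_unicode) == "E2 84 A2"
assert view_as_cp1252(tm_unicode) == "â„¢"
assert utf8_hex(euro_unicode) == "E2 82 AC"
assert view_as_cp1252(euro_unicode) == "â‚¬"
```

If the display shows Â followed by the expected symbol, the bytes are C2 xx
and the reference was probably a Windows-1252 code value. The real Unicode
code points would instead appear as three characters beginning with â.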

So basically what I'm saying is that if you take the output from the
character data handler, save it in a text file, load it into a browser
(preferably Netscape, as that makes this much easier) and change the
encoding to UTF-8, then the output will look exactly like it is supposed to.
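
The same check can be done without a browser. The sketch below uses Python's
xml.parsers.expat binding; the document and the collect_text helper are made
up for illustration. Note one difference from the C API: the Python binding
hands the handler already-decoded strings, so the explicit encode step is
only a stand-in for the UTF-8 bytes a C character data handler would receive.

```python
import xml.parsers.expat


def collect_text(xml_bytes: bytes) -> str:
    """Parse xml_bytes and return all character data as one string."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Character data may arrive in several pieces, so collect and join.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)
    return "".join(chunks)


doc = b'<?xml version="1.0" encoding="UTF-8"?><price>&#8364;</price>'
text = collect_text(doc)

# Stand-in for the bytes handed to a C handler: three bytes, not one.
handler_bytes = text.encode("utf-8")

# Interpreting those bytes as UTF-8 recovers the single Euro character.
round_trip = handler_bytes.decode("utf-8")
```

Here handler_bytes is E2 82 AC, and decoding it as UTF-8 gives back exactly
one character, U+20AC.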

In the end, expat did what it was supposed to, and your OS did what it was
supposed to, but we (you and I), being mere mortals, were confused because
they did what we said, not what we meant.

I hope this information helps you and I know I certainly learned a lot.
Happy coding.

 - Josh Martin

> Hi,
> 
> No, UTF-8 is not restricted to 8 bits. Please refer to this link:
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
> 
> I have some data which contains characters like the Euro, trademark, etc.
> I want to serialise it in XML format. So when I write the XML file, I
> replace the characters with their numerical entities. This XML file is
> viewed correctly in IE 6.0. But when I parse the XML file, expat prefixes
> the Â character, i.e. I get the Â character followed by the Euro character.
> 
> My XML file has the encoding specified as UTF-8.
> 
> I will try changing the encoding of the parser and check.
> 
> Binu
> -----Original Message-----
> From: Josh Martin [mailto:Josh.Martin@abq.sc.philips.com]
> Sent: 25 July 2002 03:52
> To: expat-discuss@lists.sourceforge.net; binu.subramanian@barconet.com
> Subject: Re: [Expat-discuss] Fw: Extra character inserted in
> CharacterData Handler?
> 
> 
> Hi,
> 
> Call me crazy, but isn't UTF-8 an 8-bit wide character encoding format?
> And if so, isn't the number 8364 a bit out of its league?  If this is true
> then I would think that the Â character is either some sort of multi-byte
> character indicator, or is just expat fudging on the numbers it doesn't
> understand. Try not specifying the encoding format for the XML document
> and the XML parser... see what happens.  Let us know how it goes.  I think
> that is how I solved this problem when I encountered it about a year ago.
> 
> BTW, I thought you were trying to keep the parser from converting
> character entities to their character representation?
> 
>  - Josh Martin
> 
> > Hello,
> > 
> > I am facing exactly the same problem. In my case the characters are the
> > Euro, trademark, etc.
> > When I write the XML file, I replace the Euro character with its
> > numerical entity &#8364;
> > I have specified the encoding for my XML file as UTF-8.
> > 
> > Now when the expat parser parses the file, it inserts the Â character,
> > so the result is Â followed by the Euro character.
> > What should I do to get rid of the extra character?
> > Am I missing something here?
> > Binu
> > 
> > 
> > > 
> >  > The "·" character in your file - 0xB7 - is invalid UTF-8.
> >  > Maybe it is valid ISO-8859-1?
> >  > In that case you must add an XML declaration.
> >  > 
> >  > Actually, 1.95.3 should reject it (and it does so on my system).
> >  
> >  Rolf Ade just pointed out to me that I didn't read your code.
> >  You passed the ISO-8859-1 encoding to the parser, so there
> >  was no error on your side.
> >  
> >  However, what you reported looks exactly like what a word processor
> >  would show you when it expects ISO-8859-1, but gets UTF-8 (tested with
> >  Wordpad).
> >  Now, this would be a correct result, since Expat only passes UTF-8
> >  or UTF-16 to its handlers, no matter what the input.
> >  
> >  Karl
> >  
> > 
> > 
> > 
> > _______________________________________________
> > Expat-discuss mailing list
> > Expat-discuss@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/expat-discuss