[Expat-discuss] Fw: Extra character inserted in CharacterData Handler?

Josh Martin Josh.Martin@abq.sc.philips.com
Thu Jul 25 11:22:07 2002


Hi,

I finally figured it out, thanks in no small part to your informative link on
UTF-8 encoding.

The output you are receiving from expat is completely correct.  Here's the deal:
I was right, although I didn't know why: the Â (A WITH CIRCUMFLEX) IS a sort of
multi-byte character indicator.  It is the first byte of the UTF-8 encoding for
the trademark symbol.  The output that you are seeing from expat is the
properly UTF-8 encoded characters, viewed in the ISO-8859-1 encoding (or whatever
ISO-8859 derivative your OS happens to be using).  It just so happens that the
UTF-8 encoding for the trademark symbol also contains the byte that represents
the trademark symbol in that ISO-8859 encoding (preceded by the Â character).
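
To make the byte-level picture concrete, here is a minimal Python sketch of the
effect.  The helper name utf8_seen_as_latin1 is made up for illustration, and
since I don't know exactly which code points ended up in your file, I use
U+00A9 (the copyright sign) as a stand-in that is certain to behave this way.

```python
def utf8_seen_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("latin-1")

# U+00A9 COPYRIGHT SIGN (stand-in example, not taken from the original file).
sample = "\u00a9"
print(sample.encode("utf-8").hex(" "))   # c2 a9

# Misread as ISO-8859-1, the two bytes become two characters:
# U+00C2 (A WITH CIRCUMFLEX) followed by the sign itself.
garbled = utf8_seen_as_latin1(sample)
assert garbled == "\u00c2\u00a9"

# Every code point in U+0080..U+00BF encodes as 0xC2 plus one more byte whose
# value equals the code point, which is why the "extra" character is always Â.
for code_point in range(0x80, 0xC0):
    assert chr(code_point).encode("utf-8") == bytes([0xC2, code_point])

# By contrast, U+20AC EURO SIGN (decimal 8364) takes three bytes in UTF-8.
print("\u20ac".encode("utf-8").hex(" "))  # e2 82 ac
```

Plain ASCII text passes through unchanged, which is why only the "special"
characters look wrong.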

So basically what I'm saying is that if you take the output from the character
data handler, save it in a text file, load it into a browser (preferably
Netscape, as that will make it much easier :) and change the encoding to
UTF-8, then the output will look exactly like it is supposed to.
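
If you'd rather not use a browser, the same check can be sketched in a few
lines of Python.  This is only an illustration under assumptions: the file
name handler_output.txt and the sample text are invented, and handler_bytes
stands in for whatever your character data handler actually collected.

```python
import os
import tempfile

# Stand-in for the UTF-8 bytes collected from the character data handler.
handler_bytes = "\u00a9 2002".encode("utf-8")

def read_text(path: str, encoding: str) -> str:
    """Read the whole file, interpreting its bytes with the given encoding."""
    with open(path, encoding=encoding, newline="") as f:
        return f.read()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "handler_output.txt")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(handler_bytes)
    wrong = read_text(path, "latin-1")  # what an ISO-8859-1 viewer shows
    right = read_text(path, "utf-8")    # what the data actually says

assert wrong == "\u00c2\u00a9 2002"     # extra A WITH CIRCUMFLEX appears
assert right == "\u00a9 2002"           # no extra character

# Text already garbled this way can be repaired by reversing the misreading.
assert wrong.encode("latin-1").decode("utf-8") == right
```

The bytes on disk are identical in both reads; only the encoding used to
interpret them differs.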

In the end, expat did what it was supposed to, and your OS did what it was 
supposed to, but we (you and I), being mere mortals, were confused because they 
did what we said, not what we meant.

I hope this information helps you, and I know I certainly learned a lot. Happy
coding.

 - Josh Martin

> Hi,
> 
> No, UTF-8 is not restricted to 8 bits. Do refer to this link:
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
> 
> I have some data which contains some characters like Euro, trademark, etc.
> I want to serialise it in XML format. So when i write the XML file, i replace
> the characters with their numerical entities. This XML file is viewed correctly
> in IE 6.0. But when i parse the XML file, the expat prefixes the Â character.
> ie i get the Â character followed by the Euro character.
> 
> My XML file has the encoding specified as UTF-8.
> 
> I will try changing the encoding of the parser and check.
> 
> Binu
> -----Original Message-----
> From: Josh Martin [mailto:Josh.Martin@abq.sc.philips.com]
> Sent: 25 July 2002 03:52
> To: expat-discuss@lists.sourceforge.net; binu.subramanian@barconet.com
> Subject: Re: [Expat-discuss] Fw: Extra character inserted in
> CharacterData Handler?
> 
> 
> Hi,
> 
> Call me crazy, but isn't UTF-8 an 8-bit wide character encoding format?  And
> if so, isn't the number 8364 a bit out of its league?  If this is true then I
> would think that the Â character is either some sort of multi-byte character
> indicator, or is just expat fudging on the numbers it doesn't understand.
> Try not specifying the encoding format for the XML document and the XML
> parser... see what happens.  Let us know how it goes.  I think that is how I
> solved this problem when I encountered it about a year ago.
> 
> BTW, I thought you were trying to keep the parser from converting character
> entities to their character representation?
> 
>  - Josh Martin
> 
> > Hello,
> > 
> > I am facing exactly the same problem. In my case the characters are the
> > Euro, trademark, etc.
> > When i write the xml file, i replace the Euro character with its numerical
> > entity €
> > I have specified the encoding for my XML file as UTF-8.
> > 
> > Now when the expat parser parses the file, it appends the Â character. so it
> > is Â followed by the Euro character.
> > What should i do to get rid of the extra character?
> > Am i missing something here?
> > Binu
> > 
> > 
> > > 
> >  > The "·" character in your file - 0xB7 - is invalid UTF-8.
> >  > Maybe it is valid ISO-8859-1?
> >  > In that case you must add an XML declaration.
> >  > 
> >  > Actually, 1.95.3 should reject it (and it does so on my system).
> >  
> >  Rolf Ade just pointed out to me that I didn't read your code.
> >  You passed the ISO-8859-1 encoding to the parser, so there
> >  was no error on your side.
> >  
> >  However, what you reported looks exactly like what a word processor
> >  would show you when it expects ISO-8859-1, but gets UTF-8 (tested with
> >  Wordpad).
> >  Now, this would be a correct result, since Expat only passes UTF-8
> >  or UTF-16 to its handlers, no matter what the input.
> >  
> >  Karl
> >  
> > 
> > 
> > 
> > _______________________________________________
> > Expat-discuss mailing list
> > Expat-discuss@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/expat-discuss