[Expat-bugs] [ expat-Bugs-1185243 ] Acirc prepended to entity

Tue Apr 19 16:17:56 CEST 2005

Bugs item #1185243, was opened at 2005-04-18 17:04
Message generated for change (Comment added) made by janho
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110127&aid=1185243&group_id=10127

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jan Hochstein (janho)
Assigned to: Nobody/Anonymous (nobody)
Summary: Acirc prepended to entity

Initial Comment:
Consider the following xml document:

-----

<?xml version="1.0" encoding="ISO-8859-1"?>

<!DOCTYPE xml [
<!ENTITY deg     "&#176;">
]>

<anything>
Latitude is 42 &deg; 34' north.
</anything>

-----

The character handler is called three times. The data
it receives is this:
1) "Latitude is 42"
2) "Â°"
3) "34' north."

If your display encoding differs from mine: the first
character received on the second call is HTML &Acirc;,
i.e. byte value 194 (0xC2).

This reproduces with any entity, not just &deg;, on
expat-1.95.7 and expat-1.95.8.

----------------------------------------------------------------------

>Comment By: Jan Hochstein (janho)
Date: 2005-04-19 16:17

Message:
Logged In: YES 
user_id=1234078

You are right, of course. I was not aware that UTF-8
encodes characters with varying numbers of bytes.

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-04-19 15:49

Message:
Logged In: YES 
user_id=290026

UTF-8 is a multi-byte encoding.
Please determine the exact UTF-8 encoding for the character
with Unicode code point 176. It might very well be c2 b0.
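
The encoding asked about here can be checked with a few lines of C.
The following is only a sketch and is not part of Expat; the name
utf8_encode is a hypothetical helper introduced for illustration. It
applies the standard UTF-8 bit layout to a single code point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Expat: encode one Unicode code point
   as UTF-8. 'out' must have room for 4 bytes. Returns the byte count. */
static size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(cp <= 0x10FFFFUL);               /* beyond the Unicode range */
    assert(cp < 0xD800UL || cp > 0xDFFFUL); /* surrogates are not encodable */

    if (cp < 0x80UL) {                      /* 7 bits: one byte, same as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                     /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {                   /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

For code point 176 (0xB0) the second branch applies: the top bits give
0xC0 | 2 = c2 and the low six bits give 0x80 | 0x30 = b0, so the
character occupies two bytes while plain ASCII letters occupy one.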

----------------------------------------------------------------------

Comment By: Jan Hochstein (janho)
Date: 2005-04-19 09:35

Message:
Logged In: YES 
user_id=1234078

First of all, the call-back 2) gets two bytes. That would
mean a 16-bit wide encoding. But call-backs 1) and 3) get
only one byte for each character.

The hex codes of the buffers received by the call-backs are
these:
1) 4c 61 74 69 74 75 64 65 20 69 73 20 34 32 20 
2) c2 b0 
3) 20 33 34 27 20 6e 6f 72 74 68 2e 

I attach the C program used to produce them.

If I change the encoding="" in the xml file to UTF-8 the
problem persists.

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2005-04-18 19:33

Message:
Logged In: YES 
user_id=290026

I cannot reproduce your problem.

Are you aware that Expat reports data in UTF-8 or UTF-16
encoding? Which GUI controls do you use to display
the data from the character handler? Is it the built-in
display of your IDE? It may not handle UTF-8/16.

Why don't you write down the hex values from call-back 2)
and then look up how character 176 (ISO-8859-1) would be
encoded in UTF-8 or UTF-16. Compare this to the call-back
data, and you will know if there is a bug.
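
One way to write down the hex values is to format the handler's buffer
byte by byte instead of printing it as text. The sketch below is an
assumption about how that could be done, not code from this report;
format_hex is a hypothetical helper name. It takes the same (s, len)
pair that a character data handler receives, which is not
NUL-terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write the bytes s[0..len) into 'out' as
   space-separated lowercase hex, e.g. "c2 b0". Returns the number of
   bytes formatted; stops early if 'out' is too small. */
static int format_hex(const char *s, int len, char *out, size_t outsize)
{
    size_t used = 0;
    int i;

    assert(out != NULL && outsize > 0);
    out[0] = '\0';

    for (i = 0; i < len; i++) {
        /* worst case per byte: separator + two hex digits + NUL */
        if (outsize - used < 4)
            break;
        used += (size_t)snprintf(out + used, outsize - used, "%s%02x",
                                 i ? " " : "", (unsigned char)s[i]);
    }
    return i;
}
```

Inside a handler registered with XML_SetCharacterDataHandler, calling
format_hex(s, len, buf, sizeof buf) and printing buf gives output that
can be compared directly against the expected UTF-8 byte sequence.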

----------------------------------------------------------------------
