[ expat-Bugs-481609 ] Wrong umlauts after parsing

noreply@sourceforge.net noreply@sourceforge.net
Mon Apr 22 10:13:13 2002


Bugs item #481609, was opened at 2001-11-14 03:33
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=110127&aid=481609&group_id=10127

Category: XML::Parser (Perl module)
Group: Not a Bug
Status: Closed
Resolution: Invalid
Priority: 5
Submitted By: Thomas Frings (frings)
>Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: Wrong umlauts after parsing

Initial Comment:
Parsing an XML file that contains German umlauts like 
ä ö ü or their encodings like ä ä or ü
results in 'C$' (instead of 'ä'), 'C<' (instead of 'ü') 
or 'C6' (instead of 'ö').

What's going wrong? 

System: Solaris 2.8
        expat 1.95.2
        XML-Parser 2.30

----------------------------------------------------------------------

>Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-04-22 13:11

Message:
Logged In: YES 
user_id=3066

I still can't reproduce this.  I've tried using "ö"
literally with the document marked as iso-8859-1 encoded,
and encoded as &#246; and &#x0F6;, both in an attribute
value and character data.

Please attach a complete (but short) document that exhibits
this problem, and explain your test in detail.
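
A minimal sketch of the kind of test described above, using Python's
bundled Expat binding (xml.parsers.expat) rather than the Perl
XML::Parser module from the report. The element and attribute names
are made up for illustration. Note that the Python binding hands
decoded strings to the handlers, whereas the C library delivers UTF-8.

```python
import xml.parsers.expat

# Document marked as iso-8859-1, with "ö" written literally and as
# decimal and hexadecimal character references, both in attribute
# values and in character data. Names are hypothetical.
doc = (
    '<?xml version="1.0" encoding="iso-8859-1"?>'
    '<doc a="ö" b="&#246;" c="&#x0F6;">ö&#246;&#x0F6;</doc>'
).encode("iso-8859-1")

attrs_seen = {}
chunks = []

def on_start(name, attrs):
    # Record the attribute values exactly as the parser reports them.
    attrs_seen.update(attrs)

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = on_start
# Character data may arrive in several pieces, so collect and join.
parser.CharacterDataHandler = chunks.append
parser.Parse(doc, True)

text = "".join(chunks)
print(attrs_seen, text)
```

If all three spellings come back as the same character, the parser is
handling the umlaut correctly and the problem lies elsewhere.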

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2002-04-22 07:20

Message:
Logged In: NO 

When you write umlauts in attributes, it goes completely 
wrong:
<image id="2" alt="Schön" />
results in a value alt="Schn" or (in newer versions of 
Expat) in a well-formedness error.
When you do alt="Sch&uuml;n" you get alt="Schn" , too.
The only workaround is doing: alt="Sch&amp;uuml;n" , and 
that isn't nice at all.


----------------------------------------------------------------------

Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-04-15 23:39

Message:
Logged In: YES 
user_id=3066

The output shown is not UTF-8, but UTF-8 with the high bit
stripped.  I expect this was an artifact of the display font
or the terminal.  Expat should produce UTF-8 in all cases;
that's part of the intended interface.
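
The high-bit explanation can be checked with a short Python sketch
(the helper name is made up for illustration): encode each umlaut as
UTF-8, clear the top bit of every byte, and compare the result with
the characters in the original report.

```python
def strip_high_bit(text: str) -> str:
    """Encode text as UTF-8, clear bit 7 of every byte, decode as ASCII."""
    return bytes(b & 0x7F for b in text.encode("utf-8")).decode("ascii")

for ch in "äüö":
    # ä is 0xC3 0xA4 in UTF-8; masking with 0x7F gives 0x43 0x24.
    print(ch, "->", strip_high_bit(ch))
```

The results match the 'C$', 'C<' and 'C6' sequences from the initial
comment, which is consistent with a 7-bit display path rather than a
parser fault.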

----------------------------------------------------------------------

Comment By: Simon Gordon (si_gordon)
Date: 2001-11-14 19:03

Message:
Logged In: YES 
user_id=227124

I believe this is UTF-8. Expat always outputs in UTF-8 
rather than either (a) what you want or (b) what the XML 
encoding is set to.

I have long held the belief that this is a bug, even though 
the release notes for 1.95 documented this fact. I had to 
patch my version to output ISO-8859-1 for exactly the same 
reason: I needed umlauted characters in ISO-8859-1, not UTF-8.
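
As an alternative to patching the parser, the UTF-8 output can be
transcoded in application code. The following is only a sketch in
Python, not the patch mentioned above. The function name is
hypothetical, and replacing unrepresentable characters with "?" is an
assumption about what the caller wants.

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Transcode UTF-8 bytes to ISO-8859-1.

    Characters that have no ISO-8859-1 equivalent are replaced
    with "?" instead of raising an error.
    """
    return data.decode("utf-8").encode("iso-8859-1", errors="replace")

# "Schön" as UTF-8 bytes, as a UTF-8 emitting parser would deliver it.
print(utf8_to_latin1(b"Sch\xc3\xb6n"))
```

This keeps the parser's UTF-8 interface intact and moves the encoding
choice to the point where the data leaves the program.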

----------------------------------------------------------------------
