[ expat-Bugs-566240 ] UTF-8 char handling still broken(1.95.3)
noreply@sourceforge.net
Mon Jun 10 21:08:02 2002
Bugs item #566240, was opened at 2002-06-08 13:31
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=110127&aid=566240&group_id=10127
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Rolf Ade (pointsman)
Assigned to: Nobody/Anonymous (nobody)
Summary: UTF-8 char handling still broken(1.95.3)
Initial Comment:
The changes made to fix bug 477667 seem to
have also messed up some things.
I've run all not-well-formed tests
(each should raise an error) and all valid
tests (none should raise an error)
of the OASIS XML test suite, version 2.
I found that in these tests
xmltest/not-wf/sa/166.xml
xmltest/not-wf/sa/167.xml
xmltest/not-wf/sa/171.xml
xmltest/not-wf/sa/172.xml
xmltest/not-wf/sa/173.xml
xmltest/not-wf/sa/174.xml
xmltest/not-wf/sa/175.xml
xmltest/not-wf/sa/177.xml
ibm/not-wf/P02/ibm02n32.xml
ibm/not-wf/P02/ibm02n33.xml
an invalid UTF-8 char isn't reported as an error.
In this test:
ibm/valid/ibm02v01.xml
Expat reports an error for a valid UTF-8 char.
rolf
----------------------------------------------------------------------
>Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-11 00:07
Message:
Logged In: YES
user_id=290026
I think in this case, i.e. ibm/valid/P02/ibm02v01.xml,
the test case is in error, since the file contains the
UTF-8 sequence F0 90 80 5F, which is invalid.
At this point I am not planning further fixes, unless
somebody can show me a reason why this sequence
should be considered valid.
Karl
----------------------------------------------------------------------
Comment By: Rolf Ade (pointsman)
Date: 2002-06-10 23:06
Message:
Logged In: YES
user_id=13222
That problematic test file is
ibm/valid/P02/ibm02v01.xml
Sorry.
----------------------------------------------------------------------
Comment By: Rolf Ade (pointsman)
Date: 2002-06-10 23:02
Message:
Logged In: YES
user_id=13222
(You're right, of course. I should
have distinguished between legal UTF-8
chars and legal XML PCDATA chars. I
confess I still have to look up the
related parts of the notorious
specs again. At the moment, I only
mechanically run Expat against the OASIS
suite and report the strange (i.e. new)
things. Sorry for omitting deeper
analysis.)
Better now, according to the OASIS test
suite.
Only
ibm/valid/ibm02v01.xml
still seems to be wrong. Expat claims
"invalid token", while the test suite
claims that this is valid XML.
rolf
----------------------------------------------------------------------
Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-09 10:03
Message:
Logged In: YES
user_id=290026
Fix checked in.
Please test CVS rev. 1.17 of xmltok.c.
Karl
----------------------------------------------------------------------
Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-08 22:51
Message:
Logged In: YES
user_id=290026
Looking at the spec, it seems that there are in fact
additional restrictions:
Character Range
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
           | [#x10000-#x10FFFF]
    /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
This means we have to re-visit the UTF-8 fix.
Karl
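
The Char production quoted above can be expressed as a range check on a
decoded code point. The following is a minimal sketch in C, not Expat's
actual code; the function name is_xml_char is an assumption made for
illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if the code point cp matches
   production [2] Char of the XML 1.0 spec as quoted above. */
static bool is_xml_char(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)        /* below the surrogate blocks */
        || (cp >= 0xE000 && cp <= 0xFFFD)      /* excludes FFFE and FFFF */
        || (cp >= 0x10000 && cp <= 0x10FFFF);  /* supplementary planes */
}
```

With such a check, U+FFFE and U+FFFF are rejected as XML characters even
though their encodings EF BF BE and EF BF BF are well-formed UTF-8, which
is what the not-wf test cases expect.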
----------------------------------------------------------------------
Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-08 21:15
Message:
Logged In: YES
user_id=290026
I looked at these test cases, and checked them against
Table 3.1B in Unicode 3.2 - have a look at
<http://www.unicode.org/unicode/reports/tr28/>
First, let's deal with James Clark's test cases:
The docs state that they test if the invalid character FFFF
(or FFFE for test case not-wf-sa-167) is present. This
would map to the sequence EF BF BF (or EF BF BE).
Now, the sequences in question are indeed present, but
they are actually valid UTF-8!
So, where does it say that they are not valid in XML?
XMLSpy accepts these test cases as well-formed, btw.
The same then applies to the IBM test cases:
ibm02n32.xml tests for FFFE and ibm02n33.xml
tests for FFFF. Same question as above - valid UTF-8,
but invalid XML?
About the last test case, file ibm/valid/P02/ibm02v01.xml:
It contains the sequence F0 90 80 5F, which is an illegal
UTF-8 sequence according to Table 3.1B in Unicode 3.2.
So, as far as I can tell Expat is correct in how
it checks the UTF-8 sequences, but I am not sure
if XML imposes further restrictions on them.
Anybody care to comment?
Karl
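
To make the byte-level argument concrete, here is a minimal sketch in C of
a decoder that accepts only the well-formed byte ranges of Table 3.1B. The
ranges are written from the table as it was later republished in the
Unicode Standard, so treat the details as an assumption to be checked
against the table itself. This is not the code in Expat's xmltok.c, and
the name utf8_decode is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one UTF-8 sequence starting at s, with len
   bytes available. Returns the number of bytes consumed and stores the
   code point in *cp, or returns 0 if the sequence is ill-formed. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range for the 2nd byte */
    size_t n, i;
    uint32_t v;

    if (len == 0)
        return 0;
    if (s[0] <= 0x7F) {                 /* single-byte (ASCII) */
        *cp = s[0];
        return 1;
    } else if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        n = 2; v = s[0] & 0x1F;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        n = 3; v = s[0] & 0x0F;
        if (s[0] == 0xE0) lo = 0xA0;    /* rules out overlong forms */
        if (s[0] == 0xED) hi = 0x9F;    /* rules out surrogates */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        n = 4; v = s[0] & 0x07;
        if (s[0] == 0xF0) lo = 0x90;    /* rules out overlong forms */
        if (s[0] == 0xF4) hi = 0x8F;    /* nothing above U+10FFFF */
    } else {
        return 0;                       /* 80..C1 and F5..FF never lead */
    }

    if (len < n || s[1] < lo || s[1] > hi)
        return 0;
    v = (v << 6) | (s[1] & 0x3F);
    for (i = 2; i < n; i++) {           /* remaining bytes must be 80..BF */
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0;
        v = (v << 6) | (s[i] & 0x3F);
    }
    *cp = v;
    return n;
}
```

Under these rules EF BF BF decodes to U+FFFF and EF BF BE to U+FFFE, so
both pass the UTF-8 check, while F0 90 80 5F is rejected because its fourth
byte, 5F, lies outside the trailing-byte range 80..BF.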
----------------------------------------------------------------------