[ expat-Bugs-566240 ] UTF-8 char handling still broken(1.95.3)
noreply@sourceforge.net
Tue Jun 11 10:29:03 2002
Bugs item #566240, was opened at 2002-06-08 13:31
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=110127&aid=566240&group_id=10127
Category: None
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Rolf Ade (pointsman)
Assigned to: Nobody/Anonymous (nobody)
Summary: UTF-8 char handling still broken(1.95.3)
Initial Comment:
The changes to fix bug 477667 seem to
have also messed up some things.
I've run all not-wellformedness tests
(should all raise error) and all valid
tests (should all not raise an error)
of the OASIS xml test suite Version 2.
I found that in these tests
xmltest/not-wf/sa/166.xml
xmltest/not-wf/sa/167.xml
xmltest/not-wf/sa/171.xml
xmltest/not-wf/sa/172.xml
xmltest/not-wf/sa/173.xml
xmltest/not-wf/sa/174.xml
xmltest/not-wf/sa/175.xml
xmltest/not-wf/sa/177.xml
ibm/not-wf/P02/ibm02n32.xml
ibm/not-wf/P02/ibm02n33.xml
an invalid UTF-8 char isn't reported as an error.
In this test:
ibm/valid/ibm02v01.xml
expat claims error for a valid UTF-8 char.
rolf
----------------------------------------------------------------------
>Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-11 13:28
Message:
Logged In: YES
user_id=290026
Closed due to user enthusiasm. :-)
Karl
----------------------------------------------------------------------
Comment By: Rolf Ade (pointsman)
Date: 2002-06-11 13:24
Message:
Logged In: YES
user_id=13222
Karl is right. This indeed appears to be a bug in the OASIS
test suite - this was already discussed in the test result
report to #551599.
What is more, up to (and including) 1.95.2 this wasn't
detected by expat, and now it is. So this test file isn't an
example of an expat bug but, on the contrary, an example of the
improved char checking in 1.95.3.
rolf
----------------------------------------------------------------------
Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-11 00:07
Message:
Logged In: YES
user_id=290026
I think in this case, i.e. ibm/valid/P02/ibm02v01.xml,
the test case is in error, since the file contains the
UTF-8 sequence F0 90 80 5F, which is invalid.
At this point I am not planning further fixes, unless
somebody can show me a reason why this sequence
should be considered valid.
Karl
----------------------------------------------------------------------
Comment By: Rolf Ade (pointsman)
Date: 2002-06-10 23:06
Message:
Logged In: YES
user_id=13222
That problematic test file is
ibm/valid/P02/ibm02v01.xml
Sorry.
----------------------------------------------------------------------
Comment By: Rolf Ade (pointsman)
Date: 2002-06-10 23:02
Message:
Logged In: YES
user_id=13222
(You're of course right. I should have
distinguished better between legal UTF-8
chars and legal XML PCDATA chars. I
confess I still have to look up the
related parts of the notorious
specs again. At the moment, I only
mechanically bang it against the OASIS
suite and report the strange (i.e. new)
things. Sorry for omitting deeper
analysis.)
Better now, according to the OASIS test
suite.
Only
ibm/valid/ibm02v01.xml
still seems to be wrong. Expat claims
"invalid token", while the test suite
claims that this is valid XML.
rolf
----------------------------------------------------------------------
Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-09 10:03
Message:
Logged In: YES
user_id=290026
Fix checked in.
Please test CVS rev. 1.17 of xmltok.c.
Karl
----------------------------------------------------------------------
Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-08 22:51
Message:
Logged In: YES
user_id=290026
Looking at the spec, it seems that there are in fact
additional restrictions:
Character Range
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
| [#x10000-#x10FFFF] /* any Unicode character, excluding the
surrogate blocks, FFFE, and FFFF. */
This means we have to re-visit the UTF-8 fix.
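
To make the extra restriction concrete, here is a minimal sketch of the
Char production quoted above as a range check on an already decoded
code point. The function name is_xml_char is hypothetical and is not
taken from xmltok.c; expat itself works on the encoded bytes, so this
only illustrates the rule, not the actual fix.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not expat code): returns true if the Unicode
   code point cp matches the XML 1.0 Char production [2]. */
bool is_xml_char(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)       /* stops before the surrogates */
        || (cp >= 0xE000 && cp <= 0xFFFD)     /* excludes FFFE and FFFF */
        || (cp >= 0x10000 && cp <= 0x10FFFF); /* supplementary planes */
}
```

Under this check FFFE and FFFF are rejected even though their UTF-8
encodings are well-formed byte sequences.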
Karl
----------------------------------------------------------------------
Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-08 21:15
Message:
Logged In: YES
user_id=290026
I looked at these test cases, and checked them against
Table 3.1B in Unicode 3.2 - have a look at
<http://www.unicode.org/unicode/reports/tr28/>
First, let's deal with James Clark's test cases:
The docs state that they test if the invalid character FFFF
(or FFFE for test case not-wf-sa-167) is present. This
would map to the sequence EF BF BF (or EF BF BE).
Now, the sequences in question are indeed present, but
they are actually valid UTF-8!
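
The mapping from code point to bytes can be checked with a small
sketch of a UTF-8 encoder. The name utf8_encode is hypothetical and the
code is for illustration only, it is not part of Expat.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes code point cp as UTF-8 into out.
   Returns the number of bytes written, or 0 if cp is a surrogate
   or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates have no UTF-8 form */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0xFFFF, buf) fills buf with EF BF BF, and
utf8_encode(0xFFFE, buf) gives EF BF BE, matching the sequences found
in the test files.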
So, where does it say that they are not valid in XML?
XMLSpy accepts these test cases as well-formed, btw.
The same then applies to the IBM test cases:
ibm02n32.xml tests for FFFE and ibm02n33.xml
tests for FFFF. Same question as above - valid UTF-8,
but invalid XML?
About the last test case, file ibm/valid/P02/ibm02v01.xml:
It contains the sequence F0 90 80 5F, which is an illegal
UTF-8 sequence according to Table 3.1B in Unicode 3.2.
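
A rough sketch of such a table-driven check follows. It uses the
well-formed byte ranges from the current Unicode Standard, which I
assume agree with Table 3.1B for this particular sequence; the name
utf8_seq_len is hypothetical and this is not how xmltok.c is written.

```c
#include <stddef.h>

/* Hypothetical helper: returns the length (1 to 4) of the well-formed
   UTF-8 sequence starting at s, or 0 if the first n bytes do not
   begin with one. */
size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range for byte 2 */
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] <= 0x7F)
        return 1;
    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        len = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;      /* rejects overlong forms */
        else if (s[0] == 0xED) hi = 0x9F; /* rejects surrogates */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;      /* rejects overlong forms */
        else if (s[0] == 0xF4) hi = 0x8F; /* nothing above U+10FFFF */
    } else {
        return 0; /* 80..C1 and F5..FF never start a sequence */
    }

    if (n < len)
        return 0;
    if (s[1] < lo || s[1] > hi)
        return 0;
    for (i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0; /* trailing bytes must be 80..BF */
    return len;
}
```

For F0 90 80 5F the lead byte F0 announces a four-byte sequence, but
the fourth byte 5F is outside 80..BF, so the function returns 0.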
So, as far as I can tell Expat is correct in how
it checks the UTF-8 sequences, but I am not sure
if XML imposes further restrictions on them.
Anybody care to comment?
Karl
----------------------------------------------------------------------