[ expat-Bugs-566240 ] UTF-8 char handling still broken(1.95.3)

Tue Jun 11 10:25:04 2002

Bugs item #566240, was opened at 2002-06-08 17:31
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=110127&aid=566240&group_id=10127

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Rolf Ade (pointsman)
Assigned to: Nobody/Anonymous (nobody)
Summary: UTF-8 char handling still broken(1.95.3)

Initial Comment:

The changes made to fix bug 477667 seem to
have also broken some things.

I've run all not-well-formedness tests
(which should all raise an error) and all valid
tests (which should not raise an error)
of the OASIS XML test suite, Version 2.

I found that in these tests

xmltest/not-wf/sa/166.xml
xmltest/not-wf/sa/167.xml
xmltest/not-wf/sa/171.xml
xmltest/not-wf/sa/172.xml
xmltest/not-wf/sa/173.xml
xmltest/not-wf/sa/174.xml
xmltest/not-wf/sa/175.xml
xmltest/not-wf/sa/177.xml
ibm/not-wf/P02/ibm02n32.xml
ibm/not-wf/P02/ibm02n33.xml

an invalid UTF-8 char isn't reported as an error.

In this test:

ibm/valid/ibm02v01.xml 

Expat reports an error for a valid UTF-8 char.

rolf

----------------------------------------------------------------------

>Comment By: Rolf Ade (pointsman)
Date: 2002-06-11 17:24

Message:
Logged In: YES 
user_id=13222

Karl is right. This indeed appears to be a bug in the OASIS
test suite. It was already discussed in the test result
report for #551599.

What's more, up to and including 1.95.2, this wasn't
detected by Expat, and now it is. So this test file isn't an
example of an Expat bug but, on the contrary, an example of the
improved char checking in 1.95.3.

rolf

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-11 04:07

Message:
Logged In: YES 
user_id=290026

I think in this case, i.e. ibm/valid/P02/ibm02v01.xml,
the test case is in error, since the file contains the
UTF-8 sequence F0 90 80 5F, which is invalid.

At this point I am not planning further fixes, unless
somebody can show me a reason why this sequence 
should be considered valid.

Karl

----------------------------------------------------------------------

Comment By: Rolf Ade (pointsman)
Date: 2002-06-11 03:06

Message:
Logged In: YES 
user_id=13222

That problematic test file is

ibm/valid/P02/ibm02v01.xml

Sorry.

----------------------------------------------------------------------

Comment By: Rolf Ade (pointsman)
Date: 2002-06-11 03:02

Message:
Logged In: YES 
user_id=13222

(You're right, of course. I should
have distinguished better between legal
UTF-8 chars and legal XML PCDATA chars.
I confess I still have to look up the
related parts of the notorious
specs again. At the moment, I only
mechanically run it against the OASIS
suite and report the strange (i.e. new)
things. Sorry for omitting deeper
analysis.)

Better now, according to the OASIS test
suite.

Only 

ibm/valid/ibm02v01.xml 

still seems to be wrong. Expat reports
"invalid token", while the test suite
claims that this is valid XML.

rolf

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-09 14:03

Message:
Logged In: YES 
user_id=290026

Fix checked in. 
Please test CVS rev. 1.17 of xmltok.c.

Karl

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-09 02:51

Message:
Logged In: YES 
user_id=290026

Looking at the spec, it seems that there are in fact
additional restrictions:

Character Range
[2]    Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
                      | [#x10000-#x10FFFF]
       /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

This means we have to re-visit the UTF-8 fix.
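
For illustration, the Char production quoted above amounts to a range
check on code points. The following is a minimal Python sketch of that
check, not Expat's implementation; the function name is_xml_char is
hypothetical.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches production [2] Char of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)            # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF          # up to just below the surrogate block
        or 0xE000 <= cp <= 0xFFFD        # excludes FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF     # supplementary planes
    )


if __name__ == "__main__":
    for cp in (0x41, 0xD800, 0xFFFE, 0xFFFF, 0x10000):
        print(f"U+{cp:04X}: {is_xml_char(cp)}")
```

Under this check FFFE and FFFF are rejected even though their UTF-8
encodings are well-formed, which is the additional restriction the
production imposes.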

Karl

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2002-06-09 01:15

Message:
Logged In: YES 
user_id=290026

I looked at these test cases and checked them against
Table 3.1B in Unicode 3.2 - have a look at
<http://www.unicode.org/unicode/reports/tr28/>

First, let's deal with James Clark's test cases:
The docs state that they test if the invalid character FFFF
(or FFFE for test case not-wf-sa-167) is present. This
would map to the sequence EF BF BF (or EF BF BE).

Now, the sequences in question are indeed present, but
they are actually valid UTF-8!
So, where does it say that they are not valid in XML?
XMLSpy accepts these test cases as well-formed, btw.
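
The mapping from FFFF and FFFE to those byte sequences can be checked
by hand with the three-byte UTF-8 form. This is a small Python sketch
for illustration only; the function name encode_utf8_3byte is
hypothetical.

```python
def encode_utf8_3byte(cp: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF using the three-byte UTF-8 form."""
    if not (0x0800 <= cp <= 0xFFFF) or (0xD800 <= cp <= 0xDFFF):
        raise ValueError("not encodable as a three-byte UTF-8 sequence")
    return bytes([
        0xE0 | (cp >> 12),           # lead byte: 1110xxxx
        0x80 | ((cp >> 6) & 0x3F),   # continuation byte: 10xxxxxx
        0x80 | (cp & 0x3F),          # continuation byte: 10xxxxxx
    ])


if __name__ == "__main__":
    for cp in (0xFFFF, 0xFFFE):
        print(f"U+{cp:04X} ->", encode_utf8_3byte(cp).hex(" ").upper())
```

Every byte produced falls in the ranges allowed for a three-byte
sequence, so these sequences pass a pure UTF-8 check.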

The same then applies to the IBM test cases:
ibm02n32.xml tests for FFFE and ibm02n33.xml
tests for FFFF. Same question as above - valid UTF-8,
but invalid XML?

About the last test case, file ibm/valid/P02/ibm02v01.xml:
It contains the sequence F0 90 80 5F, which is an illegal
UTF-8 sequence according to Table 3.1B in Unicode 3.2.
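
To see why, note that a lead byte of F0 must be followed by three
trail bytes, and the last one here, 5F, lies outside 80..BF. A minimal
Python sketch of that check, assuming the byte ranges for the F0 row
are F0, 90..BF, 80..BF, 80..BF; the function name is hypothetical and
this is not Expat's code.

```python
def is_wellformed_f0_sequence(seq: bytes) -> bool:
    """Check a four-byte sequence with lead byte F0 against the assumed ranges."""
    if len(seq) != 4 or seq[0] != 0xF0:
        return False
    return (
        0x90 <= seq[1] <= 0xBF       # second byte is restricted after F0
        and 0x80 <= seq[2] <= 0xBF   # ordinary trail byte
        and 0x80 <= seq[3] <= 0xBF   # ordinary trail byte
    )


if __name__ == "__main__":
    bad = bytes([0xF0, 0x90, 0x80, 0x5F])
    print(bad.hex(" ").upper(), "well-formed:", is_wellformed_f0_sequence(bad))
```

Python's own UTF-8 decoder rejects the same four bytes with a
UnicodeDecodeError.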

So, as far as I can tell Expat is correct in how
it checks the UTF-8 sequences, but I am not sure
if XML imposes further restrictions on them.

Anybody care to comment?

Karl

----------------------------------------------------------------------
