[ expat-Bugs-477667 ] illegal utf-8 seqs do not throw error
noreply@sourceforge.net
Fri May 17 12:26:03 2002
Bugs item #477667, was opened at 2001-11-02 14:58
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=110127&aid=477667&group_id=10127
Category: None
Group: None
Status: Open
Resolution: Works For Me
Priority: 6
Submitted By: Patrick McCormick (patrickmc)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: illegal utf-8 seqs do not throw error
Initial Comment:
I have a problem where users like to use iso-8859-1 without declaring it
in the prolog, like this:
<?xml version='1.0'?>
<rule>abécdef</rule>
expat properly defaults to utf-8 in this case. As I understand utf-8, the
é character (0xE9) has a bit pattern that looks like the start of a
three-byte sequence. A 3-byte sequence is supposed to look like this:
bytes | bits | representation
3 | 16 | 1110vvvv 10vvvvvv 10vvvvvv
The two bytes that follow the é (c and d) don't match the 10vvvvvv mask,
so écd is an illegal utf-8 sequence. But expat doesn't throw a
well-formedness error.
Expat uses this macro in xmltok.c to figure out what's illegal:
#define UTF8_INVALID3(p) \
  ((*p) == 0xED \
  ? (((p)[1] & 0x20) != 0) \
  : ((*p) == 0xEF \
     ? ((p)[1] == 0xBF && ((p)[2] == 0xBF || (p)[2] == 0xBE)) \
     : 0))
but this doesn't seem strict enough.
I wrote a patch that makes expat check UTF-8 sequences against
Table 3.1B of the Unicode 3.1 standard:
http://www.unicode.org/unicode/reports/tr27/
as originally clarified in this Corrigendum:
http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html
and it's attached.
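
For readers without the attachment, here is a minimal sketch of the kind
of table-driven check described above. It is not the attached patch and
not expat code. The names utf8_seq_len and utf8_is_valid are hypothetical,
and the byte ranges are the well-formed ranges as given in current
versions of the Unicode standard, which also reject surrogates and so may
be slightly stricter than Table 3.1B.

```c
#include <stddef.h>

/* Hypothetical helper, not part of expat: returns the length (1..4) of
   the well-formed UTF-8 sequence starting at p, or 0 if the bytes do not
   form a legal sequence. n is the number of bytes available at p. */
static size_t utf8_seq_len(const unsigned char *p, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range for 2nd byte */
    size_t len, i;

    if (n == 0)
        return 0;
    if (p[0] <= 0x7F)
        return 1;                        /* plain ASCII */
    if (p[0] >= 0xC2 && p[0] <= 0xDF) {
        len = 2;
    } else if (p[0] >= 0xE0 && p[0] <= 0xEF) {
        len = 3;
        if (p[0] == 0xE0) lo = 0xA0;     /* reject overlong forms */
        if (p[0] == 0xED) hi = 0x9F;     /* reject surrogates */
    } else if (p[0] >= 0xF0 && p[0] <= 0xF4) {
        len = 4;
        if (p[0] == 0xF0) lo = 0x90;     /* reject overlong forms */
        if (p[0] == 0xF4) hi = 0x8F;     /* nothing above U+10FFFF */
    } else {
        return 0;                        /* 80..C1, F5..FF never lead */
    }

    if (n < len)
        return 0;                        /* truncated sequence */
    if (p[1] < lo || p[1] > hi)
        return 0;
    for (i = 2; i < len; i++)
        if (p[i] < 0x80 || p[i] > 0xBF)  /* must match 10vvvvvv */
            return 0;
    return len;
}

/* Returns 1 if the whole buffer is well-formed UTF-8, else 0. */
static int utf8_is_valid(const unsigned char *s, size_t n)
{
    while (n > 0) {
        size_t len = utf8_seq_len(s, n);
        if (len == 0)
            return 0;
        s += len;
        n -= len;
    }
    return 1;
}
```

With this check, the bytes "ab" 0xE9 "cdef" from the example are rejected
because 'c' (0x63) is not in the 80..BF range, while the same text with é
encoded as 0xC3 0xA9 is accepted.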
----------------------------------------------------------------------
>Comment By: Patrick McCormick (patrickmc)
Date: 2002-05-17 12:25
Message:
Logged In: YES
user_id=363812
actually NORMAL_VTABLE *initializes* the struct, it doesn't
define it; that's done at "struct normal_encoding". so
that's how the functions are hooked up.
----------------------------------------------------------------------
Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-05-17 12:24
Message:
Logged In: YES
user_id=3066
Aargh! This is why I hate token pasting! "grep" doesn't
like it either.
It gets glued into four statically defined "struct normal_encoding"
structures, starting with "utf8_encoding_ns".
Ok, I'll keep digging.
----------------------------------------------------------------------
Comment By: Patrick McCormick (patrickmc)
Date: 2002-05-17 12:18
Message:
Logged In: YES
user_id=363812
not referenced? sure it is! you have to tap into the
crazy zen of expat's vtables-without-C++.
look at the struct utf8_encoding. at the bottom, it uses
the macro NORMAL_VTABLE(utf8_), which creates a struct
entry "utf8_invalid3".
the macro IS_INVALID_CHAR turns into a function call to the
appropriate utf8_invalidN struct member. at some point the
struct members are hooked up to the functions, but I'm not
sure where.
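
To make the pattern concrete, here is a stripped-down sketch of a
token-pasted function table in C. It is not expat's actual code: the
struct layout, the VTABLE macro, and the function bodies are simplified
assumptions for illustration, and only the general shape (a prefix pasted
onto member names, and a dispatch macro that pastes the byte count)
follows what is described above.

```c
#include <stddef.h>

/* Hypothetical miniature of the pattern; expat's real struct has many
   more members and its functions take an encoding argument as well. */
struct encoding {
    int (*isInvalid2)(const unsigned char *p);
    int (*isInvalid3)(const unsigned char *p);
};

/* Simplified stand-ins for the real checks: each returns nonzero if the
   trailing bytes do not match the 10vvvvvv continuation pattern. */
static int utf8_isInvalid2(const unsigned char *p)
{
    return (p[1] & 0xC0) != 0x80;
}

static int utf8_isInvalid3(const unsigned char *p)
{
    return (p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80;
}

/* Token pasting: VTABLE(utf8_) expands to
       utf8_isInvalid2, utf8_isInvalid3
   so the full function names never appear at the point of use, which is
   why grep for "utf8_isInvalid3" finds only the definition. */
#define VTABLE(prefix) prefix ## isInvalid2, prefix ## isInvalid3

/* The macro initializes the struct; it does not define the struct type. */
static const struct encoding utf8_encoding = { VTABLE(utf8_) };

/* Dispatch: IS_INVALID_CHAR(enc, p, 3) becomes (enc)->isInvalid3(p). */
#define IS_INVALID_CHAR(enc, p, n) ((enc)->isInvalid ## n(p))
```

Here IS_INVALID_CHAR(&utf8_encoding, p, 3) ends up calling
utf8_isInvalid3(p) through the function pointer, even though that name is
never written out at the call site.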
----------------------------------------------------------------------
Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-05-17 10:58
Message:
Logged In: YES
user_id=3066
It's not just that the UTF8_INVALID3() macro is wrong, but
that it isn't used at all! The macro is referenced from
utf8_isInvalid3(), but that function is not referenced. ;-(
----------------------------------------------------------------------
Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-05-17 09:31
Message:
Logged In: YES
user_id=3066
Ok, I've found a bug in the test case (re-using the parser
without resetting it); I've fixed that in my copy and can
now reproduce the error.
----------------------------------------------------------------------
Comment By: Karl Waclawek (kwaclaw)
Date: 2002-05-17 09:20
Message:
Logged In: YES
user_id=290026
I am using the library directly - with my own code.
Karl
----------------------------------------------------------------------
Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-05-17 08:52
Message:
Logged In: YES
user_id=3066
This is strange. Using the CVS version of Expat, the test
case (in tests/runtests.c:test_illegal_utf8) sees the error
properly reported. xmlwf doesn't report it, however. Are
you using the library directly or going through xmlwf?
I'll see what I can figure out.
----------------------------------------------------------------------
Comment By: Karl Waclawek (kwaclaw)
Date: 2002-05-09 07:44
Message:
Logged In: YES
user_id=290026
There is official conversion code at unicode.org.
Download the files ConvertUTF.c and ConvertUTF.h from
ftp://www.unicode.org/Public/PROGRAMS/CVTUTF/
and then look at the function
static Boolean isLegalUTF8(UTF8 *source, int length)
Karl
----------------------------------------------------------------------
Comment By: Karl Waclawek (kwaclaw)
Date: 2002-05-09 07:24
Message:
Logged In: YES
user_id=290026
I can confirm that the current CVS does indeed not
report an error against:
<?xml version='1.0'?>
<rule>abécdef</rule>
Karl
----------------------------------------------------------------------
Comment By: Rolf Ade (pointsman)
Date: 2002-05-08 14:40
Message:
Logged In: YES
user_id=13222
I'm not happy with closing this bug report without
action. Contrary to Fred's test result, I find that the
described bug is still there (as it was at the time the
bug was reported). I've tested this with the current CVS
HEAD.
The bug is indeed easily demonstrable with the example from
the bug report. I use:
<?xml version='1.0'?>
<rule>abécdef</rule>
The third character of the PCDATA is a small e with acute,
that's 0xe9 in the iso-8859-1 char table (and the unicode
char 00e9), in case there is an encoding problem through the
web interface.
xmlwf passes this test file without any error report, which
is, to the best of my knowledge, wrong.
rxp and libxml (i.e. xmllint) confirm that the test file is
not proper UTF-8.
IMHO, this is a really _crucial_ bug.
Please, __Please__, re-check this.
rolf
----------------------------------------------------------------------
Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-04-19 12:19
Message:
Logged In: YES
user_id=3066
Added a test (tests/runtests.c revision 1.9) that shows this
bug does not exist in the CVS version.
You did not state which version of Expat you're using.
----------------------------------------------------------------------