[ expat-Bugs-477667 ] illegal utf-8 seqs do not throw error
noreply@sourceforge.net
Fri Apr 19 12:28:18 2002
Bugs item #477667, was opened at 2001-11-02 17:58
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=110127&aid=477667&group_id=10127
Category: None
Group: None
>Status: Closed
>Resolution: Works For Me
Priority: 5
Submitted By: Patrick McCormick (patrickmc)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: illegal utf-8 seqs do not throw error
Initial Comment:
I have a problem where users like to use iso-8859-1 without declaring it
in the prolog, like this:
<?xml version='1.0'?>
<rule>abécdef</rule>
expat properly defaults to utf-8 in this case. As I understand utf-8, the
é character (0xE9 in iso-8859-1) has a bit pattern that looks like the
start of a three-byte sequence. A 3-byte sequence is supposed to look like
this:
bytes | bits | representation
3 | 16 | 1110vvvv 10vvvvvv 10vvvvvv
The two bytes that follow (c and d) don't match the 10vvvvvv mask, so écd
is an illegal utf-8 sequence. But expat doesn't throw a
well-formedness error.
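To make the mask argument concrete, here is a minimal sketch in C of the shape test described above. The function names are made up for illustration and are not part of expat; the check covers only the bit patterns, not overlong forms or surrogates.

```c
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10vvvvvv. */
int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Returns 1 if p[0..2] has the shape 1110vvvv 10vvvvvv 10vvvvvv, else 0.
   Hypothetical helper, shape only: it does not reject overlong forms or
   surrogates. */
int has_three_byte_shape(const unsigned char *p, size_t len)
{
    if (len < 3)
        return 0;
    return (p[0] & 0xF0) == 0xE0
        && is_continuation(p[1])
        && is_continuation(p[2]);
}
```

Called on a 1110vvvv lead byte followed by the ASCII bytes for c and d, has_three_byte_shape returns 0, because neither ASCII byte has its top two bits set to 10.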
Expat uses this macro in xmltok.c to figure out what's illegal:
#define UTF8_INVALID3(p) \
  ((*p) == 0xED \
   ? (((p)[1] & 0x20) != 0) \
   : ((*p) == 0xEF \
      ? ((p)[1] == 0xBF && ((p)[2] == 0xBF || (p)[2] == 0xBE)) \
      : 0))
but this doesn't seem strict enough.
I wrote a patch that makes expat check UTF-8 sequences
against Table 3.1B of the Unicode 3.1 standard:
http://www.unicode.org/unicode/reports/tr27/
as originally clarified in this Corrigendum:
http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html
and it's attached.
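The attached patch is not reproduced in this message. As a rough illustration of what a table-driven check for three-byte sequences can look like, here is a sketch in C. The name utf8_valid3 is hypothetical, and the byte ranges follow the current Unicode definition of well-formed UTF-8, which may differ in detail from Table 3.1B (for example in how surrogates are treated).

```c
#include <stddef.h>

/* Returns 1 if p[0..2] is a well-formed three-byte UTF-8 sequence under
   the current Unicode definition, else 0. Hypothetical helper, not the
   patch attached to this report and not expat code. */
int utf8_valid3(const unsigned char *p, size_t len)
{
    unsigned char lo = 0x80, hi = 0xBF;  /* default range for byte 2 */

    if (len < 3 || (p[0] & 0xF0) != 0xE0)
        return 0;
    if (p[0] == 0xE0)
        lo = 0xA0;   /* reject overlong encodings of U+0000..U+07FF */
    else if (p[0] == 0xED)
        hi = 0x9F;   /* reject surrogates U+D800..U+DFFF */
    if (p[1] < lo || p[1] > hi)
        return 0;
    return p[2] >= 0x80 && p[2] <= 0xBF;
}
```

Unlike UTF8_INVALID3, this sketch checks every trailing byte against a range, so a lead byte followed by plain ASCII is rejected. It does accept EF BF BE and EF BF BF, which are well-formed UTF-8 even though the macro above treats them as invalid.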
----------------------------------------------------------------------
>Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2002-04-19 15:19
Message:
Logged In: YES
user_id=3066
Added a test (tests/runtests.c revision 1.9) that shows this
bug does not exist in the CVS version.
You did not state which version of Expat you're using.