[Expat-discuss] does expat detect illegal utf-8 sequences?
Patrick McCormick
patrick@meer.net
Tue Oct 30 16:45:01 2001
I have a problem where users like to use iso-8859-1 without declaring it in
the prolog, like this:
<?xml version='1.0'?>
<rule>abécdef</rule>
expat properly defaults to utf-8 in this case. As I understand utf-8, the
é character (0xE9 in iso-8859-1, binary 11101001) has a bit pattern that
looks like the start of a three-byte sequence. A 3-byte sequence is supposed
to look like this:
bytes | bits | representation
3 | 16 | 1110vvvv 10vvvvvv 10vvvvvv
The next two bytes (c and d, 0x63 and 0x64) both start with a 0 bit, so they
don't match the 10vvvvvv mask, and écd is an illegal utf-8 sequence. But expat
doesn't throw a well-formedness error.
Expat uses this macro in xmltok.c to figure out what's illegal:
#define UTF8_INVALID3(p) \
  ((*p) == 0xED \
   ? (((p)[1] & 0x20) != 0) \
   : ((*p) == 0xEF \
      ? ((p)[1] == 0xBF && ((p)[2] == 0xBF || (p)[2] == 0xBE)) \
      : 0))
but I don't understand what it's checking for. Can someone explain?
If the mask I mention above is correct, the check should look something
like this:
#define UTF8_INVALID3(p) \
(!(((p)[0] & 0xF0) == 0xE0 && \
((p)[1] & 0xC0) == 0x80 && \
((p)[2] & 0xC0) == 0x80))
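To make the intent concrete, here is a minimal sketch of the same mask test
written as a function. The name looks_like_utf8_3byte is made up for
illustration and is not anything in expat. It only tests the bit layout from
the table above, so it says nothing about other things a real validator
would have to reject.

```c
/* Hypothetical helper, not part of expat.
   Returns 1 if p[0..2] has the layout 1110vvvv 10vvvvvv 10vvvvvv,
   and 0 otherwise. Only the bit layout is checked. */
static int looks_like_utf8_3byte(const unsigned char *p)
{
    return (p[0] & 0xF0) == 0xE0   /* lead byte:    1110vvvv */
        && (p[1] & 0xC0) == 0x80   /* continuation: 10vvvvvv */
        && (p[2] & 0xC0) == 0x80;  /* continuation: 10vvvvvv */
}
```

With this, the bytes 0xE9 0x63 0x64 (écd read as iso-8859-1) fail the test
because 0x63 and 0x64 are not continuation bytes, while 0xE2 0x82 0xAC (the
utf-8 encoding of the euro sign) passes.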
It's entirely possible that I am not understanding utf-8 properly. Can
someone explain what is supposed to happen with the document above?
Patrick