[Expat-discuss] does expat detect illegal utf-8 sequences?

Patrick McCormick patrick@meer.net
Tue Oct 30 16:45:01 2001


I have a problem where users like to use iso-8859-1 without declaring it in
the prolog, like this:

<?xml version='1.0'?>
<rule>abécdef</rule>

expat properly defaults to utf-8 in this case.  As I understand utf-8, the
é character (0xE9 in iso-8859-1) has a bit pattern that looks like the start
of a three-byte sequence.  A 3-byte sequence is supposed to look like this:

bytes | bits | representation
    3 |   16 | 1110vvvv 10vvvvvv 10vvvvvv

The two bytes that follow (c and d, 0x63 and 0x64) don't match the 10vvvvvv
mask, so écd is an illegal utf-8 sequence.  But expat doesn't report a
well-formedness error.
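
To make the byte-level argument concrete, here is a minimal sketch in C that
applies the masks from the table to the iso-8859-1 bytes of "écd"
(0xE9 0x63 0x64).  The helper names are my own and are not part of expat, and
the check looks only at the bit pattern, nothing else.

```c
#include <assert.h>

/* Hypothetical helpers for illustration only; none of these exist in expat. */

/* True if b has the form 1110vvvv (lead byte of a 3-byte sequence). */
static int is_lead3(unsigned char b) { return (b & 0xF0) == 0xE0; }

/* True if b has the form 10vvvvvv (continuation byte). */
static int is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

/* True if p[0..2] has the shape 1110vvvv 10vvvvvv 10vvvvvv.
   This tests the bit pattern only, not overlong forms or surrogates. */
static int has_3byte_shape(const unsigned char *p)
{
    return is_lead3(p[0]) && is_cont(p[1]) && is_cont(p[2]);
}

/* Self-check: call this from main() to confirm the claims in the text. */
static void check_latin1_example(void)
{
    /* "écd" as iso-8859-1 bytes. */
    const unsigned char latin1[] = { 0xE9, 0x63, 0x64 };
    /* U+00E9 (é) correctly encoded as utf-8 (2 bytes), followed by 'c'. */
    const unsigned char utf8[] = { 0xC3, 0xA9, 0x63 };
    /* U+20AC (euro sign), a genuine 3-byte sequence. */
    const unsigned char euro[] = { 0xE2, 0x82, 0xAC };

    assert(is_lead3(latin1[0]));                         /* é looks like a lead byte */
    assert(!is_cont(latin1[1]) && !is_cont(latin1[2]));  /* c and d do not continue it */
    assert(!has_3byte_shape(latin1));
    assert(!has_3byte_shape(utf8));                      /* 0xC3 starts a 2-byte sequence */
    assert(has_3byte_shape(euro));
}
```

If the asserts pass, the lead byte matches 1110vvvv while neither following
byte matches 10vvvvvv, which is the mismatch described above.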

Expat uses this macro in xmltok.c to figure out what's illegal:

#define UTF8_INVALID3(p) \
  ((*p) == 0xED \
  ? (((p)[1] & 0x20) != 0) \
  : ((*p) == 0xEF \
     ? ((p)[1] == 0xBF && ((p)[2] == 0xBF || (p)[2] == 0xBE)) \
     : 0))

but I don't understand what it's checking for.  Can someone explain?
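
One way to at least see which inputs the quoted macro flags is to evaluate it
on a few byte triples.  The sketch below copies the macro as quoted; the
wrapper function and the test inputs are my own.  Going by these inputs, the
macro rejects only triples that would decode to the surrogate range
(U+D800 to U+DFFF) or to U+FFFE and U+FFFF, and it never looks at the 10vvvvvv
bits of the trailing bytes.

```c
#include <assert.h>

/* Copied from the macro quoted above. */
#define UTF8_INVALID3(p) \
  ((*p) == 0xED \
  ? (((p)[1] & 0x20) != 0) \
  : ((*p) == 0xEF \
     ? ((p)[1] == 0xBF && ((p)[2] == 0xBF || (p)[2] == 0xBE)) \
     : 0))

/* Hypothetical wrapper (not in expat) so the macro can be called like a function. */
static int quoted_invalid3(const unsigned char *p) { return UTF8_INVALID3(p); }

/* Self-check: call this from main(). */
static void probe_quoted_macro(void)
{
    const unsigned char latin1[]    = { 0xE9, 0x63, 0x64 }; /* "écd" in iso-8859-1 */
    const unsigned char below_sur[] = { 0xED, 0x9F, 0xBF }; /* U+D7FF */
    const unsigned char surrogate[] = { 0xED, 0xA0, 0x80 }; /* would decode to U+D800 */
    const unsigned char fffd[]      = { 0xEF, 0xBF, 0xBD }; /* U+FFFD */
    const unsigned char fffe[]      = { 0xEF, 0xBF, 0xBE }; /* U+FFFE */
    const unsigned char ffff[]      = { 0xEF, 0xBF, 0xBF }; /* U+FFFF */

    assert(quoted_invalid3(latin1) == 0);    /* bad continuation bytes are NOT flagged */
    assert(quoted_invalid3(below_sur) == 0);
    assert(quoted_invalid3(surrogate) == 1); /* 0xED with bit 0x20 set in byte 2 */
    assert(quoted_invalid3(fffd) == 0);
    assert(quoted_invalid3(fffe) == 1);
    assert(quoted_invalid3(ffff) == 1);
}
```

So the macro seems to assume the trailing bytes were already checked somewhere
else, which is the part I can't find.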

If the mask I mention above is correct, the check should look something
like this:

#define UTF8_INVALID3(p) \
  (!(((p)[0] & 0xF0) == 0xE0 && \
     ((p)[1] & 0xC0) == 0x80 && \
     ((p)[2] & 0xC0) == 0x80))
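
A small harness for trying out the proposed check.  This is only a sketch:
the macro is renamed (my choice) so it cannot clash with the real
UTF8_INVALID3, and the wrapper and inputs are made up for the test.

```c
#include <assert.h>

/* The proposed check from above, renamed so it cannot clash with
   expat's own UTF8_INVALID3.  The new name is my invention. */
#define UTF8_INVALID3_SHAPE(p) \
  (!(((p)[0] & 0xF0) == 0xE0 && \
     ((p)[1] & 0xC0) == 0x80 && \
     ((p)[2] & 0xC0) == 0x80))

/* Hypothetical wrapper (not in expat) so the macro can be called like a function. */
static int shape_invalid3(const unsigned char *p) { return UTF8_INVALID3_SHAPE(p); }

/* Self-check: call this from main(). */
static void probe_shape_check(void)
{
    const unsigned char latin1[]    = { 0xE9, 0x63, 0x64 }; /* "écd" in iso-8859-1 */
    const unsigned char euro[]      = { 0xE2, 0x82, 0xAC }; /* U+20AC, valid */
    const unsigned char surrogate[] = { 0xED, 0xA0, 0x80 }; /* would decode to U+D800 */

    assert(shape_invalid3(latin1) == 1);    /* c and d are not 10vvvvvv */
    assert(shape_invalid3(euro) == 0);      /* well-shaped 3-byte sequence */
    assert(shape_invalid3(surrogate) == 0); /* shape is fine, code point is not */
}
```

The last assert shows that the shape check by itself still lets a surrogate
through, so it would have to be combined with the existing test rather than
replace it.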

It's entirely possible that I am not understanding utf-8 properly - can
someone explain what is supposed to happen with the document above?

Patrick