[ expat-Patches-562005 ] Detect invalid UTF-8 sequences

Wed May 29 11:15:04 2002

Patches item #562005, was opened at 2002-05-29 14:14
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=310127&aid=562005&group_id=10127

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Karl Waclawek (kwaclaw)
Assigned to: Nobody/Anonymous (nobody)
Summary: Detect invalid UTF-8 sequences

Initial Comment:

This patch is based on the patch attached to bug #477667.

I have updated the patch to Unicode 3.2 and made
some modifications that should (hopefully) optimize
the code.

This is the table it is now based on:

Table 3.1B. Legal UTF-8 Byte Sequences in Unicode 3.2

Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF    80..BF
U+0800..U+0FFF       E0        A0..BF    80..BF
U+1000..U+CFFF       E1..EC    80..BF    80..BF
U+D000..U+D7FF       ED        80..9F    80..BF
U+D800..U+DFFF       ill-formed
U+E000..U+FFFF       EE..EF    80..BF    80..BF
U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF

Optimization 1)

Analyzing the code in xmltok.c, I found that the
functions utf8_isInvalid2,3,4 are called only when
the first byte of the UTF-8 sequence maps to
BT_LEAD2,3,4 respectively in the table in utf8tab.h 
(I looked at the pre-processed output for that).

This means for the first byte p[0]:
BT_LEAD2 <==> p[0] >= 0xC0 and p[0] <= 0xDF,
  therefore we don't have to check for p[0] > 0xDF
BT_LEAD3 <==> p[0] >= 0xE0 and p[0] <= 0xEF,
  therefore we don't have to check for p[0] < 0xE0
  and p[0] > 0xEF
BT_LEAD4 <==> p[0] >= 0xF0 and p[0] <= 0xF4,
  therefore we don't have to check for p[0] < 0xF0
  and p[0] > 0xF4

  So a sequence is invalid if the following holds:

  BT_LEAD2:
    p[0] < 0xC2 || p[1] < 0x80 || p[1] > 0xBF

  BT_LEAD3:
    p[2] < 0x80 || p[2] > 0xBF ||
    ( if p[0] == 0xE0 then p[1] < 0xA0 || p[1] > 0xBF
      else if p[0] == 0xED then p[1] < 0x80 || p[1] > 0x9F
      otherwise p[1] < 0x80 || p[1] > 0xBF )

  BT_LEAD4:
    p[3] < 0x80 || p[3] > 0xBF ||
    p[2] < 0x80 || p[2] > 0xBF ||
    ( if p[0] == 0xF0 then p[1] < 0x90 || p[1] > 0xBF
      else if p[0] == 0xF4 then p[1] < 0x80 || p[1] > 0x8F
      otherwise p[1] < 0x80 || p[1] > 0xBF )
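
The three checks above can be sketched in C roughly as follows.
This is only an illustration, not the code in the attached
xmltok.c: the names sketch_trail_bad and sketch_invalid2/3/4 are
hypothetical, and each function assumes the caller has already
classified p[0] as BT_LEAD2, BT_LEAD3 or BT_LEAD4, so the range
of the first byte is not checked again.

```c
/* Hypothetical helpers, not the functions from xmltok.c. Each one assumes
   the caller already dispatched on the lead byte class (BT_LEAD2/3/4).
   Each returns nonzero if the sequence is invalid. */

/* Nonzero if byte c lies outside the inclusive range lo..hi. */
static int sketch_trail_bad(unsigned char c, unsigned char lo, unsigned char hi)
{
  return c < lo || c > hi;
}

static int sketch_invalid2(const unsigned char *p)
{
  /* 0xC0 and 0xC1 would be overlong encodings of U+0000..U+007F. */
  return p[0] < 0xC2 || sketch_trail_bad(p[1], 0x80, 0xBF);
}

static int sketch_invalid3(const unsigned char *p)
{
  if (sketch_trail_bad(p[2], 0x80, 0xBF))
    return 1;
  if (p[0] == 0xE0)   /* second byte below 0xA0 would be overlong */
    return sketch_trail_bad(p[1], 0xA0, 0xBF);
  if (p[0] == 0xED)   /* second byte above 0x9F would be a surrogate */
    return sketch_trail_bad(p[1], 0x80, 0x9F);
  return sketch_trail_bad(p[1], 0x80, 0xBF);
}

static int sketch_invalid4(const unsigned char *p)
{
  if (sketch_trail_bad(p[3], 0x80, 0xBF) || sketch_trail_bad(p[2], 0x80, 0xBF))
    return 1;
  if (p[0] == 0xF0)   /* second byte below 0x90 would be overlong */
    return sketch_trail_bad(p[1], 0x90, 0xBF);
  if (p[0] == 0xF4)   /* second byte above 0x8F would exceed U+10FFFF */
    return sketch_trail_bad(p[1], 0x80, 0x8F);
  return sketch_trail_bad(p[1], 0x80, 0xBF);
}
```

For example, the bytes ED A0 80 (an encoded surrogate) are rejected
by sketch_invalid3, while E2 82 AC (U+20AC) is accepted.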

Optimization 2)

  Use conditional expressions, i.e. ( ? : )
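
  As an illustration, the BT_LEAD3 check could be written as a
  single expression like the one below. The macro name
  SKETCH_INVALID3 is hypothetical and this is not the macro from
  the attached xmltok.c; it assumes p points to unsigned char and
  that p[0] is already known to be in 0xE0..0xEF.

```c
/* Hypothetical macro: the BT_LEAD3 check as one expression using ( ? : ).
   Assumes p is a pointer to unsigned char and p[0] is in 0xE0..0xEF.
   Evaluates to nonzero if the three-byte sequence is invalid. */
#define SKETCH_INVALID3(p) \
  ((p)[2] < 0x80 || (p)[2] > 0xBF \
   || ((p)[0] == 0xE0 ? ((p)[1] < 0xA0 || (p)[1] > 0xBF) \
       : (p)[0] == 0xED ? ((p)[1] < 0x80 || (p)[1] > 0x9F) \
       : ((p)[1] < 0x80 || (p)[1] > 0xBF)))
```

  Because it is a plain expression, it can be used directly inside
  another macro or condition without a function call.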

Optimization 3)

  In theory, it should be more efficient to write

    (A & 0x80) == 0     instead of  A < 0x80
  and
    (A & 0xC0) == 0xC0  instead of  A > 0xBF
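
  The two forms are interchangeable only when A is an 8-bit
  unsigned value (0x00..0xFF). A small sketch that checks this
  over every byte value; the function name is hypothetical and it
  says nothing about which form is actually faster:

```c
/* Hypothetical helper: returns 1 if the bit-mask tests agree with the
   comparisons for every value of an 8-bit byte, 0 otherwise. */
static int sketch_masks_match_comparisons(void)
{
  unsigned a;
  for (a = 0; a <= 0xFF; a++) {
    /* top bit clear  <=>  a < 0x80 */
    if (((a & 0x80) == 0) != (a < 0x80))
      return 0;
    /* top two bits set  <=>  a > 0xBF */
    if (((a & 0xC0) == 0xC0) != (a > 0xBF))
      return 0;
  }
  return 1;
}
```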

Check the attached file xmltok.c for the actual
implementation. The patch is based on
revision 1.15 of xmltok.c.

Karl
