[ expat-Patches-562005 ] Detect invalid UTF-8 sequences
noreply@sourceforge.net
Wed May 29 11:15:04 2002
Patches item #562005, was opened at 2002-05-29 14:14
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=310127&aid=562005&group_id=10127
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Karl Waclawek (kwaclaw)
Assigned to: Nobody/Anonymous (nobody)
Summary: Detect invalid UTF-8 sequences
Initial Comment:
This patch is based on the patch attached to bug #477667.
I have updated the patch to Unicode 3.2 and
made some modifications to the code that should
(hopefully) make it faster.
This is the table it is now based on:
Table 3.1B. Legal UTF-8 Byte Sequences in Unicode 3.2
Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF    80..BF
U+0800..U+0FFF       E0        A0..BF    80..BF
U+1000..U+CFFF       E1..EC    80..BF    80..BF
U+D000..U+D7FF       ED        80..9F    80..BF
U+D800..U+DFFF       ill-formed
U+E000..U+FFFF       EE..EF    80..BF    80..BF
U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF
Optimization 1)
Analyzing the code in xmltok.c, I found that the
functions utf8_isInvalid2,3,4 are called only when
the first byte of the UTF-8 sequence maps to
BT_LEAD2,3,4 respectively in the table in utf8tab.h
(I looked at the pre-processed output for that).
This means for the first byte p[0]:
BT_LEAD2 <==> p[0] >= 0xC0 and p[0] <= 0xDF,
  therefore we don't have to check for p[0] > 0xDF
BT_LEAD3 <==> p[0] >= 0xE0 and p[0] <= 0xEF,
  therefore we don't have to check for p[0] < 0xE0
  and p[0] > 0xEF
BT_LEAD4 <==> p[0] >= 0xF0 and p[0] <= 0xF4,
  therefore we don't have to check for p[0] < 0xF0
  and p[0] > 0xF4
So our checks for an invalid UTF-8 sequence are:
BT_LEAD2:
  p[0] < 0xC2 || p[1] < 0x80 || p[1] > 0xBF
BT_LEAD3:
  p[2] < 0x80 || p[2] > 0xBF ||
    if p[0] == 0xE0 then p[1] < 0xA0 || p[1] > 0xBF
    if p[0] == 0xED then p[1] < 0x80 || p[1] > 0x9F
    otherwise            p[1] < 0x80 || p[1] > 0xBF
BT_LEAD4:
  p[3] < 0x80 || p[3] > 0xBF ||
  p[2] < 0x80 || p[2] > 0xBF ||
    if p[0] == 0xF0 then p[1] < 0x90 || p[1] > 0xBF
    if p[0] == 0xF4 then p[1] < 0x80 || p[1] > 0x8F
    otherwise            p[1] < 0x80 || p[1] > 0xBF
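For readers without the attachment, the three checks above can be written out roughly as follows. This is only a sketch and not the code in the attached xmltok.c: the sketch_ function names are made up for illustration, the pointer is assumed to be unsigned char, and the caller is assumed to have classified p[0] as BT_LEAD2/3/4 already and to have made sure enough bytes are available.

```c
/* Hypothetical names, not the functions from xmltok.c.
   Each function returns nonzero when the sequence starting at p is invalid.
   Assumption: p[0] has already been classified by its lead-byte type. */

/* BT_LEAD2: p[0] is known to be in C0..DF. */
static int sketch_isInvalid2(const unsigned char *p)
{
    return p[0] < 0xC2 || p[1] < 0x80 || p[1] > 0xBF;
}

/* BT_LEAD3: p[0] is known to be in E0..EF. */
static int sketch_isInvalid3(const unsigned char *p)
{
    if (p[2] < 0x80 || p[2] > 0xBF)
        return 1;
    if (p[0] == 0xE0)                       /* rejects overlong forms */
        return p[1] < 0xA0 || p[1] > 0xBF;
    if (p[0] == 0xED)                       /* rejects U+D800..U+DFFF */
        return p[1] < 0x80 || p[1] > 0x9F;
    return p[1] < 0x80 || p[1] > 0xBF;
}

/* BT_LEAD4: p[0] is known to be in F0..F4. */
static int sketch_isInvalid4(const unsigned char *p)
{
    if (p[3] < 0x80 || p[3] > 0xBF || p[2] < 0x80 || p[2] > 0xBF)
        return 1;
    if (p[0] == 0xF0)                       /* rejects overlong forms */
        return p[1] < 0x90 || p[1] > 0xBF;
    if (p[0] == 0xF4)                       /* rejects values above U+10FFFF */
        return p[1] < 0x80 || p[1] > 0x8F;
    return p[1] < 0x80 || p[1] > 0xBF;
}
```

With these, C3 A9 passes the 2-byte check, while C0 80 (overlong) and ED A0 80 (a surrogate) are rejected, which matches the table.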
Optimization 2)
Use conditional expressions, i.e. ( ? : )
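As an illustration of what this could look like, here is the 4-byte check written as a single expression with ( ? : ). It is a sketch under the same assumptions as Optimization 1 (p[0] is already known to be in F0..F4); the function name is hypothetical and the attached file may well differ.

```c
/* Hypothetical name; returns nonzero when the 4-byte sequence at p is
   invalid. Assumption: p[0] was already classified as BT_LEAD4. */
static int sketch_isInvalid4_cond(const unsigned char *p)
{
    /* The conditional expression picks the allowed range for p[1]
       depending on the lead byte; it needs its own parentheses because
       ?: binds less tightly than ||. */
    return p[3] < 0x80 || p[3] > 0xBF
        || p[2] < 0x80 || p[2] > 0xBF
        || (p[0] == 0xF0 ? (p[1] < 0x90 || p[1] > 0xBF)
          : p[0] == 0xF4 ? (p[1] < 0x80 || p[1] > 0x8F)
          :                (p[1] < 0x80 || p[1] > 0xBF));
}
```

Written this way the whole test is one expression, so it can also be used as a macro body.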
Optimization 3)
In theory, it should be more efficient to write
(A & 0x80) == 0 instead of A < 0x80
and
(A & 0xC0) == 0xC0 instead of A > 0xBF
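The two equivalences hold for any byte value 0..255, and they can be checked exhaustively with a small sketch like the one below. The macro and function names are made up for this example.

```c
/* Hypothetical helper names. b is assumed to hold a byte value 0..255. */
#define SKETCH_BELOW_80(b) (((b) & 0x80) == 0)     /* same as (b) < 0x80 */
#define SKETCH_ABOVE_BF(b) (((b) & 0xC0) == 0xC0)  /* same as (b) > 0xBF */

/* Returns 1 if the mask tests agree with the plain comparisons
   for every possible byte value, 0 otherwise. */
static int sketch_masks_agree(void)
{
    unsigned b;
    for (b = 0; b <= 0xFF; b++) {
        if (SKETCH_BELOW_80(b) != (b < 0x80))
            return 0;
        if (SKETCH_ABOVE_BF(b) != (b > 0xBF))
            return 0;
    }
    return 1;
}
```

Whether the mask form is actually faster depends on the compiler and the target, so it is worth measuring rather than assuming.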
Check the attached file xmltok.c for the actual
implementation. The patch is based on
revision 1.15 of xmltok.c.
Karl