[Patches] Objects/unicodeobject.c(PyUnicode_DecodeUTF8): Fix error handling

13 May 2000 11:49:56 +0200

--=-=-=

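The attached patch fixes error handling in PyUnicode_DecodeUTF8 and
improves the treatment of invalid byte sequences in "replace" mode.
The UTF8_ERROR macro used `continue' inside its own do/while wrapper,
which only left the macro instead of advancing the decoding loop; it
now jumps to a label at the end of the loop body.  In addition, an
incomplete or otherwise invalid UTF-8 sequence now generates exactly
one replacement character.  As a result, the Python UTF-8 decoder
passes Markus Kuhn's UTF-8 stress test.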
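
For reviewers who want to see the macro problem in isolation, here is a
small stand-alone sketch.  The macro and function names are made up for
illustration and do not appear in unicodeobject.c; the only point is
that `continue' inside a do { ... } while (0) wrapper binds to that
wrapper, so the rest of the outer loop body still runs, whereas a goto
to a label at the end of the body really skips it.

```c
#include <assert.h>

/* Hypothetical macros, for illustration only. */
#define SKIP_WITH_CONTINUE() do { continue; } while (0)
#define SKIP_WITH_GOTO()     do { goto next; } while (0)

/* Returns how often the code after the macro was reached. */
int count_with_continue(int n)
{
    int i, reached = 0;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        /* `continue' re-tests the macro's own while (0) and falls out
           of the macro, so execution simply carries on below. */
        SKIP_WITH_CONTINUE();
        reached++;
    }
    return reached;
}

int count_with_goto(int n)
{
    int i, reached = 0;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        /* The goto leaves the macro and skips the rest of the body. */
        SKIP_WITH_GOTO();
        reached++;
    next:
        ;   /* a label has to be followed by a statement */
    }
    return reached;
}
```

With n = 3 the first function reaches the increment on every
iteration, the second one never does.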

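To make the "exactly one replacement character" rule concrete, the
following is a simplified stand-alone sketch, not a transcription of
the patch.  It assumes a 16-bit output buffer with room for at least
`len' code units, and the names decode_utf8_replace, sequence_length
and REPLACEMENT_CHAR are mine.  Unlike the patch, it also collapses a
sequence truncated at the end of the input into a single replacement,
and it never skips more continuation bytes than the lead byte
announces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFD

/* Sequence length announced by a lead byte; 0 for a stray
   continuation byte or for 0xFE/0xFF. */
static int sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead < 0xc0) return 0;
    if (lead < 0xe0) return 2;
    if (lead < 0xf0) return 3;
    if (lead < 0xf8) return 4;
    if (lead < 0xfc) return 5;
    if (lead < 0xfe) return 6;
    return 0;
}

/* Decodes `len' bytes at `s' into `out' and returns the number of
   16-bit units written.  Every invalid, overlong, surrogate or
   out-of-range sequence yields one REPLACEMENT_CHAR. */
size_t decode_utf8_replace(const unsigned char *s, size_t len, uint16_t *out)
{
    const unsigned char *e;
    uint16_t *p = out;

    assert(s != NULL && out != NULL);
    e = s + len;

    while (s < e) {
        int n = sequence_length(*s);
        int have = 1;   /* bytes of this sequence actually present */
        uint32_t ch;

        while (have < n && s + have < e && (s[have] & 0xc0) == 0x80)
            have++;

        /* Stray byte, unsupported range (beyond 16 bits), or missing
           continuation bytes: one replacement, skip what belongs to
           the broken sequence. */
        if (n == 0 || n > 3 || have < n) {
            *p++ = REPLACEMENT_CHAR;
            s += have;
            continue;
        }

        if (n == 1)
            ch = s[0];
        else if (n == 2)
            ch = ((uint32_t)(s[0] & 0x1f) << 6) | (s[1] & 0x3f);
        else
            ch = ((uint32_t)(s[0] & 0x0f) << 12) |
                 ((uint32_t)(s[1] & 0x3f) << 6) | (s[2] & 0x3f);

        /* Overlong encodings and surrogates are well-formed on the
           byte level but still illegal. */
        if ((n == 2 && ch < 0x80) ||
            (n == 3 && (ch < 0x800 || (ch >= 0xd800 && ch < 0xe000))))
            ch = REPLACEMENT_CHAR;

        *p++ = (uint16_t)ch;
        s += n;
    }
    return (size_t)(p - out);
}
```

For example, the truncated sequence E2 82 followed by 'A' decodes to
U+FFFD and 'A', and the overlong pair C0 AF decodes to a single U+FFFD.
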
I confirm that, to the best of my knowledge and belief, this
contribution is free of any claims of third parties under copyright,
patent or other rights or interests ("claims").  To the extent that
I have any such claims, I hereby grant to CNRI a nonexclusive,
irrevocable, royalty-free, worldwide license to reproduce, distribute,
perform and/or display publicly, prepare derivative versions, and
otherwise use this contribution as part of the Python software and its
related documentation, or any derivative versions thereof, at no cost
to CNRI or its licensed users, and to authorize others to do so.

I acknowledge that CNRI may, at its sole discretion, decide whether or
not to incorporate this contribution in the Python software and its
related documentation.  I further grant CNRI permission to use my name
and other identifying information provided to CNRI by me for use in
connection with the Python software and its related documentation.


--=-=-=
Content-Type: text/x-patch
Content-Disposition: attachment; filename=python-utf8.diff

Index: unicodeobject.c
===================================================================
RCS file: /projects/cvsroot/python/dist/src/Objects/unicodeobject.c,v
retrieving revision 2.21
diff -u -r2.21 unicodeobject.c

--- unicodeobject.c	2000/05/09 19:54:43	2.21
+++ unicodeobject.c	2000/05/13 09:32:13
@@ -582,7 +582,8 @@
 #define UTF8_ERROR(details)  do {                       \
     if (utf8_decoding_error(&s, &p, errors, details))   \
         goto onError;                                   \
-    continue;                                           \
+    else                                                \
+        goto nextCharacter;                             \
 } while (0)
 
 PyObject *PyUnicode_DecodeUTF8(const char *s,
@@ -631,31 +632,48 @@
             break;
 
         case 2:
-            if ((s[1] & 0xc0) != 0x80) 
+	    if ((s[1] & 0xc0) != 0x80) {
                 UTF8_ERROR("invalid data");
+	    }
             ch = ((s[0] & 0x1f) << 6) + (s[1] & 0x3f);
-            if (ch < 0x80)
+            if (ch < 0x80) {
+		/* Skip rest of this sequence. */
+		s++;
                 UTF8_ERROR("illegal encoding");
-	    else
+	    } else
 		*p++ = ch;
             break;
 
         case 3:
             if ((s[1] & 0xc0) != 0x80 || 
-                (s[2] & 0xc0) != 0x80) 
+                (s[2] & 0xc0) != 0x80) {
+		/* Skip character which likely belongs to this sequence. */
+		if ((s[1] & 0xc0) == 0x80) {
+		    s++;
+		}
                 UTF8_ERROR("invalid data");
+	    }
             ch = ((s[0] & 0x0f) << 12) + ((s[1] & 0x3f) << 6) + (s[2] & 0x3f);
-            if (ch < 0x800 || (ch >= 0xd800 && ch < 0xe000))
+            if (ch < 0x800 || (ch >= 0xd800 && ch < 0xe000)) {
+		/* Skip rest of this sequence. */
+		s += 2;
                 UTF8_ERROR("illegal encoding");
-	    else
+	    } else
 		*p++ = ch;
             break;
 
         default:
             /* Other sizes are only needed for UCS-4 */
-            UTF8_ERROR("unsupported Unicode code range");
+	    /* Skip over these characters. */
+	    s++;
+	    while (s < e && ((*s & 0xc0) == 0x80)) s++;
+	    /* UTF8_ERROR will skip one character. */
+	    s--;
+	    UTF8_ERROR("unsupported Unicode code range");
         }
         s += n;
+
+    nextCharacter: ;
     }
 
     /* Adjust length */

--=-=-=--