UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)

Fredrik Lundh fredrik@pythonware.com
Wed, 10 Nov 1999 09:14:21 +0100


Tim Peters wrote:
> UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by
> hand -- it breaks apart and rearranges bytes at the bit level, and
> everything other than 7-bit ASCII requires solid strings of "high-bit"
> characters.

unless you're using a UTF-8 aware editor, of course ;-)

(some days, I think we need some way to tell the compiler
what encoding we're using for the source file...)

> This is painful for people to enter manually on both counts --
> and no common reference gives the UTF-8 encoding of glyphs
> directly.  So, as discussed earlier, we should follow Java's lead
> and also introduce a \u escape sequence:
> 
>     octet:           hexdigit hexdigit
>     unicodecode:     octet octet
>     unicode_escape:  "\\u" unicodecode
> 
> Inside a u'' string, I guess this should expand to the UTF-8 encoding of the
> Unicode character at the unicodecode code position.  For consistency, then,
> it should probably expand the same way inside "regular strings" too.  Unlike
> Java does, I'd rather not give it a meaning outside string literals.

good idea.  and for some reason, a patch for this is included
in the unicode distribution (see the attached str2utf.c).
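
for reference, a minimal sketch of the bit shuffling Tim describes,
applied to a single 16-bit code point.  utf8_encode_bmp is a
hypothetical helper name, not something from the attached patch or the
Python sources; it assumes the caller supplies room for three bytes and
does not special-case surrogates.

```c
#include <stddef.h>

/* Hypothetical helper (illustration only): encode one 16-bit code
   point as UTF-8.  Writes 1 to 3 bytes to out and returns the count.
   Assumes ch <= 0xFFFF and that out has room for 3 bytes. */
size_t
utf8_encode_bmp(unsigned int ch, unsigned char *out)
{
    if (ch < 0x80) {
        /* 0xxxxxxx: plain 7-bit ASCII, stored as-is */
        out[0] = (unsigned char) ch;
        return 1;
    }
    if (ch < 0x800) {
        /* 110xxxxx 10xxxxxx: 11 payload bits split 5 + 6 */
        out[0] = (unsigned char) (0xc0 | (ch >> 6));
        out[1] = (unsigned char) (0x80 | (ch & 0x3f));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx: 16 payload bits split 4 + 6 + 6 */
    out[0] = (unsigned char) (0xe0 | ((ch >> 12) & 0x0f));
    out[1] = (unsigned char) (0x80 | ((ch >> 6) & 0x3f));
    out[2] = (unsigned char) (0x80 | (ch & 0x3f));
    return 3;
}
```

with this sketch, \u00f6 (U+00F6) comes out as the two bytes 0xC3 0xB6,
and U+20AC as the three bytes 0xE2 0x82 0xAC.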

> The other point is a nit:  The vast bulk of UTF-8 encodings encode
> characters in UCS-4 space outside of Unicode.  In good Pythonic fashion,
> those must either be explicitly outlawed, or explicitly defined.  I vote for
> outlawed, in the sense of detected error that raises an exception.  That
> leaves our future options open.

I vote for 'outlaw'.
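
one way to read "outlawed" at the C level is a decoder that refuses
anything outside the 16-bit range.  a rough sketch, under the assumption
that "Unicode" here means code points up to U+FFFF.  utf8_decode_bmp and
its -1 error return are made up for illustration (real code would
presumably raise a Python exception instead), and the overlong-form
check is an extra assumption that was not discussed in this thread.

```c
#include <stddef.h>

/* true if b is a UTF-8 continuation byte (10xxxxxx) */
static int
is_cont(unsigned char b)
{
    return (b & 0xc0) == 0x80;
}

/* Hypothetical helper (illustration only): decode one UTF-8 sequence
   from s, where len bytes are available.  On success, stores the code
   point in *ch and returns the number of bytes consumed (1 to 3).
   Returns -1 for truncated or malformed input, for overlong forms,
   and for any lead byte that starts a 4-, 5- or 6-byte sequence. */
int
utf8_decode_bmp(const unsigned char *s, size_t len, unsigned int *ch)
{
    unsigned int c;

    if (len == 0)
        return -1;
    if (s[0] < 0x80) {
        *ch = s[0];
        return 1;
    }
    if ((s[0] & 0xe0) == 0xc0) {
        if (len < 2 || !is_cont(s[1]))
            return -1;
        c = ((s[0] & 0x1fu) << 6) | (s[1] & 0x3fu);
        if (c < 0x80)
            return -1;          /* overlong two-byte form */
        *ch = c;
        return 2;
    }
    if ((s[0] & 0xf0) == 0xe0) {
        if (len < 3 || !is_cont(s[1]) || !is_cont(s[2]))
            return -1;
        c = ((s[0] & 0x0fu) << 12) | ((s[1] & 0x3fu) << 6) | (s[2] & 0x3fu);
        if (c < 0x800)
            return -1;          /* overlong three-byte form */
        *ch = c;
        return 3;
    }
    /* stray continuation byte, or a lead byte for a sequence that
       encodes something beyond 16 bits: outlawed */
    return -1;
}
```

rejecting the longer sequences outright keeps the door open: they can
be given a meaning later without changing the behaviour of any input
that is accepted today.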

</F>


/* A small code snippet that translates \uxxxx syntax to UTF-8 text.
   To be cut and pasted into Python/compile.c */

/* Written by Fredrik Lundh, January 1999. */

/* Documentation (for the language reference):

\uxxxx -- Unicode character with hexadecimal value xxxx.  The
character is stored using UTF-8 encoding, which means that this
sequence can expand to up to three bytes in the resulting string.

Note that the 'u' must be followed by four hexadecimal digits.  If
fewer digits are given, the sequence is left in the resulting string
exactly as given.  If more digits are given, only the first four are
translated to Unicode, and the remaining digits are left in the
resulting string.

*/

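#include <ctype.h>
#include <stdio.h>

/* stand-in for the macro from Python.h; the cast keeps the ctype
   calls well-defined on platforms where char is signed */
#define Py_CHARMASK(ch) ((unsigned char) (ch))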

void
convert(const char *s, char *p)
{
    while (*s) {
        if (*s != '\\') {
            *p++ = *s++;
            continue;
        }
        s++;
        switch (*s++) {

/* -------------------------------------------------------------------- */
/* copy this section to the appropriate place in compile.c... */

        case 'u':
            /* \uxxxx => UTF-8 encoded unicode character */
            if (isxdigit(Py_CHARMASK(s[0])) && isxdigit(Py_CHARMASK(s[1])) &&
                isxdigit(Py_CHARMASK(s[2])) && isxdigit(Py_CHARMASK(s[3]))) {
                /* fetch hexadecimal character value */
                unsigned int n, ch = 0;
                for (n = 0; n < 4; n++) {
                    int c = Py_CHARMASK(*s);
                    s++;
                    ch = (ch << 4) & ~0xF;
                    if (isdigit(c))
                        ch += c - '0';
                    else if (islower(c))
                        ch += 10 + c - 'a';
                    else
                        ch += 10 + c - 'A';
                }
                /* store as UTF-8 */
                if (ch < 0x80)
                    *p++ = (char) ch;
                else {
                    if (ch < 0x800) {
                        *p++ = 0xc0 | (ch >> 6);
                        *p++ = 0x80 | (ch & 0x3f);
                    } else {
                        *p++ = 0xe0 | (ch >> 12);
                        *p++ = 0x80 | ((ch >> 6) & 0x3f);
                        *p++ = 0x80 | (ch & 0x3f);
                    }
                }
                break;
            } else
                goto bogus;

/* -------------------------------------------------------------------- */

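        default:

bogus:      *p++ = '\\';
            if (s[-1] == '\0') {
                /* lone backslash at end of input: step back to the
                   terminator so the loop stops instead of running on */
                s--;
                break;
            }
            *p++ = s[-1];
            break;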
        }
    }
    *p++ = '\0';
}

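int
main(void)
{
    int i;
    unsigned char buffer[100];

    convert("Link\\u00f6ping", (char *) buffer);

    /* print control and non-ASCII bytes as octal escapes;
       this should print Link\303\266ping */
    for (i = 0; buffer[i]; i++)
        if (buffer[i] < 0x20 || buffer[i] >= 0x80)
            printf("\\%03o", buffer[i]);
        else
            printf("%c", buffer[i]);
    printf("\n");

    return 0;
}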