[I18n-sig] UCS-4 configuration

Wed, 27 Jun 2001 04:24:44 -0400

[Martin v. Loewis]
> I would never remotely consider questioning your authority, how could I?

LOL!  If authority were of any help in getting software to work, Guido
wouldn't need any of us:  he could just scowl at it, and it would all fall
into place <wink>.

> The specific code in question is in PyUnicode_DecodeUTF16. It gets a
> char*, and converts it to a Py_UCS2* (where Py_UCS2 is unsigned short).
> It then fetches a Py_UCS2 after another, byte-swapping if appropriate,
> and advances the Py_UCS2* by one. The intention is that this retrieves
> the bytes of the input in pairs.
>
> Is that code correct even if sizeof(unsigned short)>2?

Oh no.  Clearly, if sizeof(Py_UCS2) > 2, it will read more than 2 bytes each
time.  But the *obvious* way to read two bytes is to use a char* pointer!
Say q and e were declared

    const unsigned char*

instead of Py_UCS2*.  Then for big-endian getting "the next" char is just

    ch = (q[0] << 8) | q[1];
    q += 2;

and swap "0" and "1" for a little-endian machine.  The code would get
substantially simpler.  In fact, you can skip all the embedded #ifdefs and
repeated (bo == 1), (bo == -1) tests by setting up invariants

    int lo_index, hi_index;

appropriately at the start before the loop-- setting one of those to 1 and
the other to 0 --and then do

    ch = (q[hi_index] << 8) | q[lo_index];
    q += 2;

unconditionally inside the loop whenever fetching another pair.  Now C
doesn't guarantee that a byte is 8 bits either, but that's one thing that's
true even on a Cray (they actually read 64 bits under the covers and
shift+mask, but it looks like "8 bits" to C code); I don't know of any
modern box on which it isn't true, and it's exceedingly unlikely any new
architecture won't play along.
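
Here is a minimal sketch of the index-invariant loop described above, written
as a standalone helper.  The name fetch_pairs, its signature, the output type,
and the convention that byteorder == -1 means little-endian input (anything
else is treated as big-endian) are assumptions made for illustration; this is
not the actual PyUnicode_DecodeUTF16 code, and it does nothing about
surrogates or BOMs.

```c
#include <stddef.h>

/* Hypothetical helper: read nbytes of input as 16-bit units, one byte at a
   time, so the result does not depend on sizeof(unsigned short) or on the
   endianness of the machine running the code.
   byteorder == -1: little-endian input (assumed convention);
   any other value: big-endian input.
   Returns the number of units written to out. */
static size_t
fetch_pairs(const unsigned char *q, size_t nbytes, int byteorder,
            unsigned long *out)
{
    /* Stop before a trailing odd byte, if any. */
    const unsigned char *e = q + (nbytes & ~(size_t)1);
    /* Invariants set once, before the loop: one is 0, the other is 1. */
    int hi_index = (byteorder == -1) ? 1 : 0;
    int lo_index = 1 - hi_index;
    size_t n = 0;

    while (q < e) {
        /* No #ifdefs and no byte-order tests inside the loop. */
        out[n++] = ((unsigned long)q[hi_index] << 8) | q[lo_index];
        q += 2;
    }
    return n;
}
```

Fed the bytes 0x12 0x34, this yields 0x1234 when the input is treated as
big-endian and 0x3412 when it is treated as little-endian, regardless of the
platform's own byte order.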

Everything else should "just work" then.  BTW, the existing byte-swapping
code doesn't work right either for sizeof(Py_UCS2) > 2, because in

    ch = (ch >> 8) | (ch << 8);

there's an assumption that the left shift is end-off, i.e. that bits shifted
past bit 15 just vanish.  Fetch a byte at a time as above and none of that
fiddling is needed.  Else the existing byte-swapping code needs either

    ch &= 0xffff;

after, or

    ch = (ch >> 8) | ((ch & 0xff) << 8);

in the body.  But we'd be better off getting rid of Py_UCS2 thingies
entirely in this routine (they don't *mean* "UCS2", they *mean* "exactly two
bytes", and that can't always be met).
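
To make the shift problem concrete, here is a small sketch that uses
unsigned long (guaranteed to be at least 32 bits) as a stand-in for a
"Py_UCS2" wider than two bytes.  The typedef and function names are
hypothetical, chosen only for this illustration, and each function assumes
its argument already fits in 16 bits.

```c
/* Stand-in for a 16-bit-unit type that is actually wider than 16 bits. */
typedef unsigned long wide_unit;

/* The existing swap: the high byte is shifted up past bit 15 and stays
   there, because the type is wide enough to hold it. */
static wide_unit
swap_naive(wide_unit ch)
{
    return (ch >> 8) | (ch << 8);
}

/* First repair: mask the result back down to 16 bits afterwards. */
static wide_unit
swap_masked_after(wide_unit ch)
{
    ch = (ch >> 8) | (ch << 8);
    ch &= 0xffff;
    return ch;
}

/* Second repair: only ever shift the low byte upward. */
static wide_unit
swap_masked_inside(wide_unit ch)
{
    return (ch >> 8) | ((ch & 0xff) << 8);
}
```

With an input of 0x1234, swap_naive returns 0x123412, while both repaired
versions return 0x3412.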