[Python-Dev] Internationalization Toolkit

Greg Stein gstein@lyra.org
Fri, 12 Nov 1999 15:26:08 -0800 (PST)


On Fri, 12 Nov 1999, Fred L. Drake, Jr. wrote:
> M.-A. Lemburg writes:
>  > The abbreviation BOM is quite common w/r to Unicode.

True.

>   Yes: "w/r to Unicode".  In sys, it's out of context and should
> receive a more descriptive name.  I think using BOM in unicodec is
> good.

I agree and believe that we can avoid putting it into sys altogether.

>  >   BOM_BE: '\376\377' 
>  >     (corresponds to Unicode 0x0000FEFF in UTF-16 
>  >      == ZERO WIDTH NO-BREAK SPACE)

Are you sure about that interpretation? I thought the BOM characters
(0xFEFF and 0xFFFE) were *reserved* in the UCS-2 space.
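One way to check the two readings is to ask a Unicode character database directly. The following is only a sketch, assuming a modern Python 3 with the standard unicodedata module (not available when this was written); describe() is a hypothetical helper name.

```python
import unicodedata


def describe(code_point):
    """Hypothetical helper: return (general category, name or None)."""
    ch = chr(code_point)
    # name() returns the supplied default when a code point has no name.
    return unicodedata.category(ch), unicodedata.name(ch, None)


for cp in (0xFEFF, 0xFFFE):
    category, name = describe(cp)
    print('U+%04X category=%s name=%s' % (cp, category, name))
```

Under that assumption, U+FEFF is reported as an assigned format character named ZERO WIDTH NO-BREAK SPACE, while U+FFFE is unassigned and has no name, which is what lets a reader tell a byte-swapped BOM from a real character.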

>   I'd also add BOM to be the same as sys.byte_order_mark.  Perhaps
> even instead of sys.byte_order_mark (just to localize the areas of
> code that are affected).

### unicodec.py ###
import struct

BOM = struct.pack('H', 0x0000FEFF)  # 'H' (unsigned short): 0xFEFF does not fit a signed 'h'
BOM_BE = '\376\377'
...


If somebody needs the BOM, then they should go to unicodec.py (or some
other module). I do not believe we need to put that stuff into the sys
module. It is just too easy to create the value in Python.
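To illustrate how little code that takes, here is a minimal sketch, assuming a modern Python 3 where struct.pack() returns bytes and the standard codecs module carries equivalent constants. BOM_LE and detect_byte_order() are hypothetical names added for illustration; they are not part of the proposal.

```python
import codecs
import struct
import sys

# U+FEFF packed as an unsigned 16-bit value in each byte order.
BOM_BE = struct.pack('>H', 0xFEFF)  # big-endian: FE FF
BOM_LE = struct.pack('<H', 0xFEFF)  # little-endian: FF FE
BOM = struct.pack('@H', 0xFEFF)     # native order of this machine

# Assumption: the Python 3 codecs module exposes the same byte strings.
SAME_AS_STDLIB = (
    BOM_BE == codecs.BOM_UTF16_BE
    and BOM_LE == codecs.BOM_UTF16_LE
    and BOM == codecs.BOM
)


def detect_byte_order(data):
    """Hypothetical helper: byte order implied by a leading BOM, or None."""
    if data[:2] == BOM_BE:
        return 'big'
    if data[:2] == BOM_LE:
        return 'little'
    return None
```

For data written by the same machine, detect_byte_order(BOM + payload) should agree with sys.byteorder.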

Cheers,
-g

p.s. to be pedantic, the pack() format could be '@H' (explicit native byte order, unsigned short)

--
Greg Stein, http://www.lyra.org/