[Python-Dev] Internationalization Toolkit
Greg Stein
gstein@lyra.org
Fri, 12 Nov 1999 15:26:08 -0800 (PST)
On Fri, 12 Nov 1999, Fred L. Drake, Jr. wrote:
> M.-A. Lemburg writes:
> > The abbreviation BOM is quite common w/r to Unicode.
True.
> Yes: "w/r to Unicode". In sys, it's out of context and should
> receive a more descriptive name. I think using BOM in unicodec is
> good.
I agree and believe that we can avoid putting it into sys altogether.
> > BOM_BE: '\376\377'
> > (corresponds to Unicode 0x0000FEFF in UTF-16
> > == ZERO WIDTH NO-BREAK SPACE)
Are you sure about that interpretation? I thought the BOM characters
(0xFEFF and 0xFFFE) were *reserved* in the UCS-2 space.
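For what it's worth, the question can be checked against the Unicode character database. The sketch below assumes a current Python 3 interpreter and its standard unicodedata module, neither of which is part of this thread; the variable names are only illustrative. It looks up both code points and shows that encoding U+FEFF as big-endian UTF-16 gives the two bytes quoted above.

```python
import unicodedata

bom = '\ufeff'       # the code point used as the byte order mark
swapped = '\ufffe'   # what a reader sees if the two bytes are swapped

# U+FEFF is an assigned character with a formal name.
bom_name = unicodedata.name(bom)

# U+FFFE has no name; passing a default avoids the ValueError that
# unicodedata.name() raises for unnamed code points.
swapped_name = unicodedata.name(swapped, None)

# General category 'Cn' means "not assigned".
swapped_category = unicodedata.category(swapped)

# Big-endian UTF-16 encoding of U+FEFF: the bytes 0xFE 0xFF,
# i.e. '\376\377' in octal notation.
bom_be_bytes = bom.encode('utf-16-be')

print(bom_name, swapped_name, swapped_category, bom_be_bytes)
```

So, at least in the database shipped with current Pythons, U+FEFF carries the name ZERO WIDTH NO-BREAK SPACE, while U+FFFE is the unassigned one of the pair.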
> I'd also add BOM to be the same as sys.byte_order_mark. Perhaps
> even instead of sys.byte_order_mark (just to localize the areas of
> code that are affected).
### unicodec.py ###
import struct
BOM = struct.pack('H', 0xFEFF)  # unsigned: 0xFEFF does not fit a signed short
BOM_BE = '\376\377'
...
If somebody needs the BOM, then they should go to unicodec.py (or some
other module). I do not believe we need to put that stuff into the sys
module. It is just too easy to create the value in Python.
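To illustrate how little code that takes, here is a minimal sketch, assuming a current Python 3 where byte strings are written as b'...' literals. The BOM_LE constant and the detect_byte_order() helper are hypothetical names added for the example; they are not part of the unicodec proposal.

```python
import struct
import sys

# Big- and little-endian marks, spelled out byte by byte.
BOM_BE = b'\xfe\xff'   # the same two bytes as octal '\376\377'
BOM_LE = b'\xff\xfe'   # hypothetical companion constant

# Native-order mark: 'H' packs an unsigned 16-bit integer, and '='
# selects native byte order with standard sizes.
BOM = struct.pack('=H', 0xFEFF)

def detect_byte_order(data):
    """Return 'big', 'little', or None based on a leading UTF-16 BOM."""
    if data[:2] == BOM_BE:
        return 'big'
    if data[:2] == BOM_LE:
        return 'little'
    return None

# The native mark should always agree with the interpreter's own view.
print(detect_byte_order(BOM), sys.byteorder)
```

Nothing here needs support from sys beyond sys.byteorder, which is used only for the comparison on the last line.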
Cheers,
-g
p.s. to be pedantic, the pack() format could carry an explicit '@' prefix
(native byte order, which is also the default)
--
Greg Stein, http://www.lyra.org/