editing in Unicode

effbot at pythonware.com
Fri Sep 8 06:42:31 EDT 2000


marcin wrote:
> Since UTF-16 is not compatible with ASCII, it does not make much
> sense to have just a string encoded in UTF-16 and the rest of code in
> ASCII. If UTF-16 is to be used, it would probably have to be specified
> externally to the source.

Not necessarily: XML solves this by requiring a certain character
sequence first in the file, but only if you insist on using a non-
ASCII compatible encoding.  In Python, this sequence could for
example be "#!", and the compiler could figure things out by
looking at the first four bytes:

  00 00 00 23: UCS-4, big-endian machine
  23 00 00 00: UCS-4, little-endian machine
  FE FF -- --: UTF-16, big-endian
  FF FE -- --: UTF-16, little-endian
  00 23 00 21: UTF-16, big-endian, no Byte Order Mark
  23 00 21 00: UTF-16, little-endian, no Byte Order Mark
  23 21 -- --: UTF-8 or other ASCII-compatible encoding
  -- -- -- --: same, hopefully
  (check the encoding pragma for details; default is
  "unknown" as in 2.0.  also see below)
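
A minimal Python sketch of the lookup in the table above, for
illustration only.  The function name sniff_source_encoding and the
labels it returns are made up here and are not part of any proposal;
anything that matches no row falls through to the ASCII-compatible
case, where an encoding pragma would have to settle the details.

```python
def sniff_source_encoding(head: bytes) -> str:
    """Guess the encoding family of a source file from its first bytes.

    Hypothetical helper: the returned labels are illustrative only.
    """
    first4 = head[:4]
    first2 = head[:2]

    # "#" (0x23) stored as one 32-bit unit.
    if first4 == b"\x00\x00\x00\x23":
        return "ucs-4-be"
    if first4 == b"\x23\x00\x00\x00":
        return "ucs-4-le"

    # Byte Order Mark U+FEFF stored as one 16-bit unit.
    if first2 == b"\xfe\xff":
        return "utf-16-be"
    if first2 == b"\xff\xfe":
        return "utf-16-le"

    # "#!" stored as two 16-bit units, no Byte Order Mark.
    if first4 == b"\x00\x23\x00\x21":
        return "utf-16-be"
    if first4 == b"\x23\x00\x21\x00":
        return "utf-16-le"

    # "#!" in single bytes, or anything else: assume an
    # ASCII-compatible encoding and leave the rest to a pragma.
    return "ascii-compatible"
```

Note that the sketch only handles the rows listed above; a UCS-4
file that starts with a Byte Order Mark is not in the table and is
not treated specially here.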

> IMHO there should be a way of specifying the encoding of the source
> in the source

Definitely.  Hopefully, that will go into 2.1.

Note that in 2.0, the default source encoding is "unknown". With
this encoding, "" string literals store 8-bit characters as is,
and u"" string literals treat 8-bit characters as ISO 8859-1.
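
A rough illustration of those two behaviours, written with explicit
byte strings so it runs on a current Python rather than on 2.0.  The
variable names are made up for the example, and the decode() call
only stands in for what the 2.0 compiler is described as doing with
u"" literals; it is not the compiler's actual code path.

```python
# Bytes as they might sit in a source file; 0xE9 is an 8-bit character.
raw = b"caf\xe9"

# "" literal under the "unknown" encoding: the bytes are kept as is.
as_plain = raw

# u"" literal under the "unknown" encoding: each 8-bit character is
# read as ISO 8859-1, so byte 0xE9 becomes code point U+00E9.
as_unicode = raw.decode("iso-8859-1")

print(as_plain[3] == 0xE9)           # the byte value is unchanged
print(ord(as_unicode[3]) == 0xE9)    # same number, now a code point
```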

> and they should be only ASCII-compatible encodings.

Maybe, maybe not.

</F>


