[Python-Dev] Support of UTF-16 and UTF-32 source encodings

Stephen J. Turnbull stephen at xemacs.org
Sun Nov 15 11:40:09 EST 2015


Laura Creighton writes:

 > Steve Turnbull, who lives in Japan, and speaks and writes Japanese
 > is saying that "he cannot see any reason for allowing non-ASCII
 > compatible encodings in Cpython".
 > 
 > This makes me wonder.
 > 
 > Is this along the lines of 'even in Japan we do not want such
 > things' or along the lines of 'when in Japan we want such things
 > we want to so brutally do so much more, so keep the reference
 > implementation simple, and don't try to help us with this 
 > seems-like-a-good-idea-but-isnt-in-practice' ideas like this one,
 > or
 > ....

I'm saying that to my knowledge Japan is the most complicated place
there is when it comes to encodings, and even so, nobody here seems to
be using UTF-16 as the encoding for program sources (or any other
text/* media).  Of course, as Steve Dower pointed out, it's in heavy
use as an internal text encoding, in OS APIs, in some languages'
stdlib APIs (i.e., Java and I suppose .NET), and I guess in
single-application file formats (Word), but the programs that use
those APIs are written in ASCII-compatible encodings (and Shift JIS
and Big5).  The Japanese
don't need or want UTF-16 in text files, etc.

Besides that, I can also say that PEP 263 didn't legislate the use of
ASCII-compatible encodings.  For one thing, Shift JIS and Big5 aren't
100% ASCII-compatible, because they use bytes in the range 0x20-0x7f
inside multibyte characters.
They're just close enough to ASCII compatible to mostly "just work",
at least on Microsoft OSes provided by OEMs in the relevant countries.
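
To make the "not 100% compatible" point concrete, here is a minimal
sketch using only Python's built-in codecs.  The helper name
ascii_range_bytes is made up for this illustration; the sample
characters are the usual "backslash problem" examples.

```python
def ascii_range_bytes(char, encoding):
    """Return the bytes below 0x80 in the encoded form of one character.

    For a non-ASCII character, any such byte could be mistaken for an
    ASCII character by a tool that scans the file byte by byte.
    """
    return [b for b in char.encode(encoding) if b < 0x80]


# KATAKANA LETTER SO is encoded as 0x83 0x5C in Shift JIS, so its
# second byte is the same as an ASCII backslash.
print(ascii_range_bytes("ソ", "shift_jis"))

# The same check against Big5 and against UTF-8; UTF-8 never uses
# bytes below 0x80 inside a multibyte character.
print(ascii_range_bytes("功", "big5"))
print(ascii_range_bytes("ソ", "utf-8"))
```

That is why these encodings "mostly" work: pure-ASCII text such as a
coding cookie has the same bytes, but a byte-oriented scanner can
still see a stray backslash or similar inside a multibyte character.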

What PEP 263 did do was to specify that non-ASCII-compatible encodings
are not supported by the PEP 263 mechanism for declaring the encoding
of a Python source program.  That's because it looks for a "magic
number" which is the ASCII-encoded form of "coding:" in the first two
lines.  It doesn't rule out alternative mechanisms for encoding
detection (specifically, use of the UTF-16 "BOM" signature); it just
doesn't propose implementing them.
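
A rough sketch of why that mechanism can't see a cookie in UTF-16
source.  The regular expression is adapted from the one given in
PEP 263; find_coding_cookie is a made-up name for this illustration
and is not the code CPython actually runs.

```python
import re

# Adapted from PEP 263.  It is applied to raw bytes, so it can only
# match when "coding" is stored as consecutive ASCII bytes.
COOKIE_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def find_coding_cookie(raw):
    """Return the encoding declared in the first two lines, or None."""
    for line in raw.split(b"\n")[:2]:
        match = COOKIE_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


source = "# -*- coding: shift_jis -*-\nprint('hello')\n"

# The cookie line is pure ASCII, so its Shift JIS bytes are unchanged.
print(find_coding_cookie(source.encode("shift_jis")))  # shift_jis

# In UTF-16 every ASCII character is padded with a NUL byte (and the
# file starts with a BOM), so the magic bytes never appear.
print(find_coding_cookie(source.encode("utf-16")))     # None
```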

IIRC nobody has ever asked for them, but I think the idea is absurd,
so I have to admit I may have seen a request and forgotten it instantly.

Bottom line: as long as Python (or the launcher) is able to transcode
the source to the internal Unicode format (UTF-8 in Python 2, and
widechar or PEP 393 in Python 3) before actually beginning parsing,
any on-disk encoding is OK.  But I just don't see a use case for
UTF-16.  If I'm wrong, I think that this feature should be added to
launchers, not CPython, because putting it in CPython forces the
decoder to know what formats other than ASCII are implemented and to
try heuristics to guess, rather than just obeying the coding cookie.
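
For illustration only, a launcher-side transcoder along those lines
might look like the sketch below.  transcode_to_utf8 is a
hypothetical helper, not a feature of any existing launcher, and it
assumes the file starts with a BOM rather than trying to guess.

```python
import codecs

# Order matters: the UTF-32-LE BOM (FF FE 00 00) begins with the
# UTF-16-LE BOM (FF FE), so the UTF-32 signatures are checked first.
# This assumes the source text itself never starts with U+0000.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def transcode_to_utf8(raw):
    """Return UTF-8 bytes for BOM-marked UTF-16/UTF-32 input.

    Anything else is returned unchanged, leaving it to the normal
    coding-cookie handling.
    """
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding).encode("utf-8")
    return raw


utf16_source = "print('hello')\n".encode("utf-16")  # BOM is prepended
print(transcode_to_utf8(utf16_source))
```

The interpreter would then be handed plain UTF-8 and would never
need to know that the file on disk was anything else.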


