[Python-Dev] Support of UTF-16 and UTF-32 source encodings

Sun Nov 15 10:17:49 EST 2015

I'm approaching this from the premise that we would like to avoid 
needless surprises for users not versed in text encoding. I did a simple 
experiment with notepad on Windows 7 as if a naïve user. If I write the 
one-line program:

print("Hello world.") # by Jeff

It runs, no surprise.

We may legitimately encounter Unicode in string literals and comments. 
If I write:

print("j't'kif Anaïs!") # par Hervé

and try to save it, notepad tells me this file "contains characters in 
Unicode format which will be lost if you save this as an ANSI encoded 
text file." To keep the Unicode information I should cancel and choose a 
Unicode option. In the "Save as" dialogue the default encoding is ANSI. 
The second option "Unicode" is clearly right as the warning said 
"Unicode" 3 times and I don't know what big-endian or UTF-8 mean. Good, 
that worked. Closed and reopened, it looks exactly as I typed it.

But the bytes I actually wrote on disk consist of a BOM and UTF-16-LE. 
And running it I get:
   File "bonjour.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xff' in file bonjour.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

If I take the hint here and save as UTF-8, then it works, including 
printing the accent. Inspection of the bytes shows it starts with a 
UTF-8 BOM.
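
The byte inspection can be reproduced without Notepad. The following is a minimal Python 3 sketch; `notepad_bytes` is a hypothetical helper, and the mapping from Notepad's "Save as" options to Python codec names (ANSI as cp1252, "Unicode" as BOM plus UTF-16-LE, UTF-8 as utf-8-sig) is an assumption based on the observations above, not documented Notepad behaviour.

```python
import codecs

# The one-line program from above; accents are written as escapes so that
# this sketch itself stays pure ASCII.
SOURCE = 'print("j\'t\'kif Ana\u00efs!") # par Herv\u00e9\r\n'


def notepad_bytes(option):
    """Approximate the bytes each Notepad encoding option writes (assumed mapping)."""
    if option == "ANSI":
        # Assumes a Western European machine, where ANSI means cp1252.
        return SOURCE.encode("cp1252")
    if option == "Unicode":
        # Byte order mark FF FE followed by little-endian UTF-16.
        return codecs.BOM_UTF16_LE + SOURCE.encode("utf-16-le")
    if option == "UTF-8":
        # utf-8-sig prepends the three-byte UTF-8 signature EF BB BF.
        return SOURCE.encode("utf-8-sig")
    raise ValueError("unknown option: %r" % option)


for option in ("ANSI", "Unicode", "UTF-8"):
    # Show only the first four bytes, where the BOM (if any) lives.
    print(option, notepad_bytes(option)[:4])
```

The leading '\xff' reported in the SyntaxError is the first byte of the UTF-16-LE BOM.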

In Jython I get the same results (choking on UTF-16), but saved as 
UTF-8, it works. I just have to make sure the string is a Unicode literal 
if I want it to print correctly, as we're at 2.7. Jython has a checkered past 
with encodings, but tries to do exactly the same as CPython 2.7.x.

Now, a fact I haven't mentioned is that my machine was localised to 
simplified Chinese (to diagnose some bug) during this test. If I 
re-localise to my usual English (UK), I do not get the guidance from 
notepad: instead it quietly saves as cp1252 (the Windows code page close 
to Latin-1), perhaps because I'm in Western Europe. Python baulks at 
this, at the first accented 
character. If I save from notepad as Unicode or UTF-8 the results are as 
before, including the BOM.

In some circumstances, then, the natural result of using notepad and not 
sticking to ASCII may be UTF-16-LE with a BOM, or an 8-bit ANSI code page 
such as cp1252, depending on localisation. The Python error message gives 
a clue to what a user should do, but they would need some background, a 
helpful teacher, or the Internet to sort it out.
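
One way a user (or a launcher, as suggested in the quoted message below) might sort it out is to sniff the BOM and re-save the file as UTF-8. This is only a sketch under stated assumptions: `to_utf8` is a hypothetical helper, the cp1252 fallback is a guess suited to a Western European machine, and BOM-less UTF-16 or UTF-32 is not handled.

```python
import codecs

# Longer BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def to_utf8(raw, fallback="cp1252"):
    """Return the source bytes re-encoded as UTF-8 without a BOM.

    The fallback code page is an assumption; it is only used when the
    bytes carry no BOM and are not valid UTF-8.
    """
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Strip the BOM and decode with the encoding it announces.
            text = raw[len(bom):].decode(encoding)
            break
    else:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            text = raw.decode(fallback)
    return text.encode("utf-8")


line = 'print("j\'t\'kif Ana\u00efs!") # par Herv\u00e9\n'
for saved in (
    codecs.BOM_UTF16_LE + line.encode("utf-16-le"),  # notepad "Unicode"
    line.encode("utf-8-sig"),                        # notepad "UTF-8"
    line.encode("cp1252"),                           # notepad "ANSI" (UK)
):
    # Each variant should come back as the same BOM-less UTF-8 bytes.
    print(to_utf8(saved) == line.encode("utf-8"))
```

Guessing the 8-bit code page is the weak point: without a BOM or a PEP 263 declaration, nothing in the file says which ANSI code page wrote it.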

Jeff Allen

On 15/11/2015 07:23, Stephen J. Turnbull wrote:
> Steve Dower writes:
>
>   > Saying [UTF-16] is rarely used is rather exposing your own
>   > unawareness though - it could arguably be the most commonly used
>   > encoding (depending on how you define "used").
>
> Because we're discussing the storage of .py files, the relevant
> definition is the one used by the Unicode Standard, of course: a
> text/plain stream intended to be manipulated by any conformant Unicode
> processor that claims to handle text/plain.  File formats with in-band
> formatting codes and allowing embedded non-text content like Word, or
> operating system or stdlib APIs, don't count.  Nor have I seen UTF-16
> used in email or HTML since the unregretted days of Win2k betas[1]
> (but I don't frequent Windows- or Java-oriented sites, so I have to
> admit my experience is limited in a possibly relevant way).
>
> In Japan my impression is that modern versions of Windows have
> Memopad[sic] configured to emit UTF-8-with-signature by default for
> new files, and if not, the abomination known as Shift JIS (I'm not
> sure if that is a user or OEM option, though).  Never a widechar
> encoding (after all, the whole point of Shift JIS was to use an 8-bit
> encoding for the katakana syllabary to save space or bandwidth).
>
> I think if anyone wants to use UTF-16 or UTF-32 for exchange of Python
> programs, they probably already know how to convert them to UTF-8.  As
> somebody already suggested, this can be delegated to the py.exe
> launcher, if necessary, AFAICS.
>
> I don't see any good reason for allowing non-ASCII-compatible
> encodings in the reference CPython interpreter.
>
> However, having mentioned Windows and Java, I have to wonder about
> IronPython and Jython, respectively.  Having never lived in either of
> those environments, I don't know what text encoding their users might
> prefer (or even occasionally encounter) in Python program source.
>
> Steve
>
> Footnotes:
> [1]  The version of Outlook Express shipped with them would emit
> "HTML" mail with ASCII tags and UTF-16-encoded text (even if it was
> encodable in pure ASCII).  No, it wasn't spam, either, so it probably
> really was Outlook Express as it claimed to be in one of the headers.