[Python-ideas] Py3 unicode impositions

Paul Moore p.f.moore at gmail.com
Mon Feb 13 12:14:42 CET 2012


On 13 February 2012 05:42, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Paul Moore writes:
>
>  > > But you obviously do know the convention -- use UTF-8.
>  >
>  > No. I know that a lot of Unix people advocate UTF-8, and I gather it's
>  > rapidly becoming standard in the Unix world. But I work on Windows,
>  > and UTF-8 is not the standard there. I have no idea if UTF-8 is
>  > accepted cross-platform,
>
> It is.  All of Microsoft's programs (and I suppose most third-party
> software, too) that I know of will happily import UTF-8-encoded text,
> and produce it as well.  Most Microsoft-specific file formats (eg,
> Word) use UTF-16 internally, but they can't be read by most
> text-oriented programs, so in practice they're app/octet-strm.

If I create a new text file in Notepad or Vim on my PC, it's not
created in UTF-8 by default. Vim uses Latin-1, and Notepad uses "ANSI"
(which I'm pretty sure translates to CP1252, but there are so few
differences between this and Latin-1 that I can't easily test it at
the moment). If I do "chcp" in a console window, I get codepage 850,
and in CMD, echo a£b >file.txt encodes the file in CP850.
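
To make that concrete, here's a quick REPL check of what a CP850 £
looks like when misread as CP1252 - I believe these values are right,
but check against your own codepage:

>>> 'a£b'.encode('cp850')
b'a\x9cb'
>>> 'a£b'.encode('cp1252')
b'a\xa3b'
>>> b'a\x9cb'.decode('cp1252')
'aœb'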

echo a£b >file.txt in PowerShell creates little-endian UTF-16 with a
BOM. The Out-File cmdlet in PowerShell (which lets me specify an
encoding to override the UTF-16 of standard redirection) says this
about its -Encoding parameter:

 -Encoding <string>
     Specifies the type of character encoding used in the file. Valid
     values are "Unicode", "UTF7", "UTF8", "UTF32", "ASCII",
     "BigEndianUnicode", "Default", and "OEM". "Unicode" is the default.

     "Default" uses the encoding of the system's current ANSI code page.

     "OEM" uses the current original equipment manufacturer code page
     identifier for the operating system.

With this I can at least get UTF-8 (with BOM). But it's a long way
from simple to do so...
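
Reading that UTF-16 back in Python 3 is at least a one-liner, since
the "utf-16" codec consumes the BOM and works out the byte order for
itself (assuming file.txt is the one produced by the redirection
above):

>>> open('file.txt', encoding='utf-16').read()
'a£b\n'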

Basically, in my experience Windows users are not likely to produce
UTF-8-encoded files unless they make a specific effort to do so.

I have heard anecdotal evidence that attempts to set the configuration
on Windows to produce UTF-8 by default hit significant issues. So
don't expect to see Windows users producing UTF-8 by default anytime
soon.

> The problem is the one you point out: files you receive from third
> parties are still fairly likely to be in a non-Unicode encoding.

And, if I don't concentrate, I produce non-UTF8 files myself.

The good news is that Python 3 generally works fine with files I
produce myself, as it follows the system encoding.

>python
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getpreferredencoding()
'cp1252'

Near enough, as the only non-ASCII character I tend to use is £, and
Latin-1 and CP1252 agree on that (and I know what a CP850 £ sign
looks like when misread as Latin-1/CP1252, so I can spot that
particular error).
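
It's the missing encoding argument that routes you to that default; a
quick way to see it (file.txt is just a scratch name, and the bytes
below are from my CP1252 box):

>>> with open('file.txt', 'w') as f:  # no encoding given
...     f.write('a£b')
...
3
>>> open('file.txt', 'rb').read()
b'a\xa3b'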

Of course, that means that processing UTF-8 always requires me to set
the encoding explicitly. Which in turn means that (if I care - back to
the original point) I need to go checking for non-ASCII characters, do
a quick hex dump to confirm they look like UTF-8, and set the
encoding. Or go with the default and risk mojibake (CP1252 is not
Latin-1, AIUI, so it won't round-trip arbitrary bytes). Or go the
"don't care" route. All of this simply because I feel that it's
impolite to corrupt someone's name in my output just because it
contains an accented letter :-)
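
The hex-dump step can at least be automated, because UTF-8 is strict
enough that legacy 8-bit text containing non-ASCII bytes almost never
decodes as it by accident. Something like this throwaway helper (the
name is made up) is what I actually do:

    def looks_like_utf8(path):
        # Strict decode: CP1252/CP850 text with non-ASCII bytes will
        # almost always fail; pure ASCII passes, which is harmless.
        with open(path, 'rb') as f:
            data = f.read()
        try:
            data.decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False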

As I say:
- I know what to do
- It can be a lot of work
- Frankly, the damage is minor (these are usually personal or low-risk scripts)
- The temptation to say "stuff it" and get on with my life is high
- It frustrates me that Python by default tempts me to *not* do the right thing

Maybe the answer is to have some form of encoding-detection function
in the standard library. It doesn't have to be 100% accurate, and it
certainly shouldn't be used anywhere by default, but it would be
available for people who want to do the right thing without totally
over-engineering things.
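
To be clear about the scope I have in mind, a sketch (guess_encoding
is a made-up name, and the heuristics are deliberately crude - BOM
first, then UTF-8, then punt to the system default):

    import codecs, locale

    def guess_encoding(data):
        # A BOM is unambiguous, so honour it first.
        for bom, enc in [(codecs.BOM_UTF8, 'utf-8-sig'),
                         (codecs.BOM_UTF16_LE, 'utf-16'),
                         (codecs.BOM_UTF16_BE, 'utf-16')]:
            if data.startswith(bom):
                return enc
        # No BOM: bytes that decode as UTF-8 almost certainly are.
        try:
            data.decode('utf-8')
            return 'utf-8'
        except UnicodeDecodeError:
            # Otherwise fall back to the system default and hope.
            return locale.getpreferredencoding(False)

Nowhere near 100% accurate, but it would cover the Notepad, CMD and
PowerShell cases above.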

> True.  But for personal use, and for communicating with people you
> have some influence over, you can use/recommend UTF-8 safely as far I
> know.  I occasionally get asked by Japanese people why files I send in
> UTF-8 are broken; it invariably turns out that they sent me a file in
> Shift JIS that contained a non-JIS (!) character and my software
> translated it to REPLACEMENT CHARACTER before sending as UTF-8.

Maybe it's different in Japan, where character sets are more of a
common knowledge issue? But if I tried to say to one of my colleagues
that the spooled output of a SQL query they sent me (from a database
with one encoding, through a client with no real encoding handling
beyond global OS-level defaults) didn't use UTF-8, I'd get a blank
look at best.

I've had to debug encoding issues for database programmers only to
find that they don't even know what encodings are about - and they are
writing multilingual applications! (Before someone says it: yes, of
course this is terrible and shouldn't happen - but it does, and these
are the places I get weirdly-encoded text files from...)

>  > I think people are much more aware of the issues, but cross-platform
>  > handling remains a hard problem. I don't wish to make assumptions, but
>  > your insistence that UTF-8 is a viable solution suggests to me that
>  > you don't know much about the handling of Unicode on Windows. I wish I
>  > had that luxury...
>
> I don't understand what you mean by that.  Windows doesn't make
> handling any non-Unicode encodings easy, in my experience, except for
> the local code page.  So, OK, if you're in a monolingual Windows
> environment (eg, the typical Japanese office), everybody uses a common
> legacy encoding for file exchange (including URLs and MIME filename=
> :-(, in particular Shift JIS), and only that encoding works well (ie,
> without the assistance of senior tech support personnel).  Handling
> Unicode, though, isn't really an issue; all of Microsoft's programs
> happily deal with UTF-8 and UTF-16 (in its several varieties).

What I was trying to say was that typical Windows environments (where
people don't interact often with Unix utilities, or if they do it's
with ASCII characters almost exclusively) hide the details of Unicode
from the end user to the extent that they don't know what's going on
under the hood, and don't need to care. Much like Python 2, I guess
:-)

> Indeed.  Do you really see UTF-16 in files that you process with
> Python?

PowerShell generates it; see above. But no, not often, and it's easy
to fix. Meh; for "easy", read

   cmd /c "iconv -f utf-16 -t utf-8 u1 >u2"
or
   set-content u2 (get-content u1) -encoding utf8
if I don't mind a BOM.
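
Or, keeping it in Python (a sketch; the "utf-16" codec strips the BOM
on the way in, and plain "utf-8" writes none on the way out):

    with open('u1', encoding='utf-16') as src:
        text = src.read()
    with open('u2', 'w', encoding='utf-8') as dst:
        dst.write(text)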

No, Unicode on Windows isn't easy :-(

Paul


