[issue13643] 'ascii' is a bad filesystem default encoding

Thu Dec 22 00:02:31 CET 2011

Martin Pool <mbp at sourcefrog.net> added the comment:

On 21 December 2011 12:41, Antoine Pitrou <report at bugs.python.org> wrote:
>
> Antoine Pitrou <pitrou at free.fr> added the comment:
>
>> The standard encoding is UTF-8.
>
> How so? I don't know of any Linux or Unix spec which says so. If you get
> the Linux heads to standardize this then I'll certainly be very happy
> (and countless others will, too). But AFAIK this it not the case and I
> don't see why you are asking Python to make a choice that OS vendors
> refuse to make. You are certainly asking the wrong project to solve this
> problem.

It is a de facto, not de jure standard: UTF-8 is how things are
typically stored.  Other software (eg gnome file handling utilities)
makes this assumption.  See eg
<http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux>.

I would be happy to see an authoritative document saying this is how
things _should_ be stored, but I can't find one yet.  But in Unix
there are no ultimate authorities: even if someone announced filenames
are utf-8 there will obviously continue to be many machines where in
practice they are not.

I started asking about it over here, to see if at least Ubuntu can
have an opinion that this is how things should normally be:
https://lists.ubuntu.com/archives/ubuntu-devel/2011-December/034588.html

I'm not sure what you expect a technical solution at the OS level
would look like.  The api is 8-bit strings and that's not likely to
change.  It's possible to have a situation where no locale is
specified.  Applications unavoidably need to have some opinion about
what to do there.  Other applications assume the filenames are utf-8.
Python assumes that text in general will be UTF-8
(getdefaultencoding).

It is almost like your caricature of OS developers as being
anglocentric, but in fact here it's Python that assumes everything is
probably ascii - or more charitably, it is just assuming that failing
when things aren't ascii is the best tradeoff.  Maybe it is.

One OS-level fix is to try to reduce the number of situations where
people see no locale, or the C locale, and give them C.UTF-8 instead.
That is probably worth doing.  But having no locale can still happen,
and I think Python could handle that better, so the changes are
complimentary.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue13643>
_______________________________________