Newbie question about text encoding

Chris Angelico rosuav at gmail.com
Fri Mar 6 10:27:22 EST 2015


On Sat, Mar 7, 2015 at 1:50 AM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> Rustom Mody wrote:
>
>> On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
>
> [snip example of an analogous situation with NULs]
>
>> Strawman.
>
> Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
> they really should say is "Yes, that's a good argument, I'm afraid I can't
> argue against it, at least not without considerable thought", I'd be a
> wealthy man...

If I had a dollar for every time anyone said "If I had <insert
currency unit here> for every time...", I'd go meta all day long and
profit from it... :)

> - If you are writing your own file system layer, it's 2015 fer fecks sake,
> file names should be Unicode strings, not bytes! (That's one part of the
> Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
> system, whichever you please, but again remember that both are
> variable-width formats.

I agree that that part of the Unix model needs to change, but there
are two viable ways to move forward:

1) Keep file names as bytes, but mandate that they be valid UTF-8
streams, and recommend that they be decoded as UTF-8 for display to a
human.
2) Change the entire protocol stack from the file system upwards so
that file names become Unicode strings.

Trouble with #2 is that file names need to be passed around somehow,
which means bytes in memory. So ultimately, #2 really means "keep file
names as bytes, and mandate an encoding all the way up the stack"...
so it's a massive documentation change that really comes down to the
same thing as #1.
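
To make #1 a bit more concrete, here's a rough sketch of what "bytes,
but mandated to be valid UTF-8" could look like from Python. The
function names are made up for illustration; nothing here is an
existing file system API, and a real mandate would be enforced by the
kernel or the FS driver, not by user code.

```python
def is_valid_utf8_name(raw: bytes) -> bool:
    """Return True if the raw file name is well-formed UTF-8.

    Hypothetical check: this is the rule option 1 would mandate.
    """
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def display_name(raw: bytes) -> str:
    """Decode a file name for showing to a human.

    Bytes that are not valid UTF-8 are shown as U+FFFD instead of
    raising, so a listing never crashes on a legacy name.
    """
    return raw.decode("utf-8", errors="replace")
```

The name stays bytes everywhere it is stored or passed around; decoding
only happens at the point of display.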

This is one area where, as I understand it, Mac OS got it right. It's
time for other Unix variants to adopt the same policy. The bulk of
file names will be ASCII-only anyway, so requiring UTF-8 won't affect
them; a lot of others are already UTF-8; so all we need is a
transition scheme for the remaining ones. If there's a known FS
encoding, it ought to be possible to have a file system conversion
tool that goes through everything, decodes, re-encodes UTF-8, and then
flags the file system as UTF-8 compliant. All that'd be left would be
the file names that are broken already - ones that don't decode in the
FS encoding - and there's nothing to be done with them but wrap them
up into something probably-meaningless-but-reversible.
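
The per-name step of such a conversion tool might look something like
this. It's only a sketch under my own assumptions: the "broken-" prefix
plus hex is an invented wrapping scheme, the function names are
hypothetical, and a real tool would also have to deal with collisions,
directory walking, and the actual renames.

```python
from typing import Tuple

# Invented marker for names that could not be decoded (an assumption,
# not any existing convention).
BROKEN_PREFIX = b"broken-"


def convert_name(raw: bytes, fs_encoding: str) -> Tuple[bytes, bool]:
    """Re-encode one file name from fs_encoding to UTF-8.

    Returns (new_name, ok). If raw does not decode in fs_encoding,
    ok is False and new_name is an ASCII-only, reversible wrapping
    of the original bytes.
    """
    try:
        text = raw.decode(fs_encoding)
    except UnicodeDecodeError:
        # Already-broken name: keep every byte, as hex, so nothing is lost.
        return BROKEN_PREFIX + raw.hex().encode("ascii"), False
    return text.encode("utf-8"), True


def unwrap_name(wrapped: bytes) -> bytes:
    """Recover the original bytes from a name wrapped by convert_name."""
    return bytes.fromhex(wrapped[len(BROKEN_PREFIX):].decode("ascii"))
```

Either way the result is valid UTF-8, so the file system could be
flagged as compliant once every name has been through it, and the
wrapped names can still be mapped back to their original bytes.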

When can we start doing this? ext5?

ChrisA


