Ah Python, you have spoiled me for all other languages

Chris Angelico rosuav at gmail.com
Sun Jun 7 10:47:14 EDT 2015


On Sun, Jun 7, 2015 at 11:24 PM, Steven D'Aprano <steve at pearwood.info> wrote:
>> (Unless you want to say that all strings are guaranteed to
>> be NFC/NFD normalized, such that s1 and s2 would actually be
>> identical, which I suppose is plausible. I'm not sure what the
>> advantage would be, though. And certainly you wouldn't want to
>> K-normalize strings automatically.)
>
> I believe that filenames on Apple file systems (HFS+ if I remember
> correctly) are guaranteed to be both normalised and correctly encoded as
> UTF-8. If you could live in a purely Apple world, you'd have far fewer
> filename hassles.

Yep. Actually, there should be nothing stopping the next Linux file
system ("ext5" or whatever) from enforcing the same thing;
byte-oriented filename APIs would still work just fine, but you could
have some confidence that at least local file systems will normally be
decodable as UTF-8. Then the only time you'd have to worry about
encoding problems would be network or removable file systems - no
worrying about "what's the FS encoding", because it'll just be UTF-8.
(Hmm. Point of interest: What happens on a Mac if you network-mount
something that isn't Unicode? If the enforcement of UTF-8 and
normalization is done at the file system level, it's no different from
the current Linux situation, where basically anything goes.)

But that's file names, not strings in a program. I'm not sure that
mandating that strings be normalized is particularly useful, but on
the flip side, I'm not sure of any situation where it'd be majorly
problematic either. There are ambiguities in some encodings, and as
soon as you decode from them and re-encode, you've effectively folded
those ambiguities to some canonical form; if your language
automatically normalized strings, you'd just have the same effect of
folding. And then you could have encode methods that stipulate the
other form of normalization - say you NFD everything internally, you
could then have a method "a\u0301".encode("utf-8", combine=True) which
NFC normalizes prior to encoding (and would thus be C3 A1 instead of
61 CC 81). Are there any languages out there that work this way?

ChrisA



More information about the Python-list mailing list