[Python-3000] canonicalization [was: On PEP 3116: new I/O base classes]

Fri Jun 22 16:15:14 CEST 2007

Guido van Rossum writes:

 > > If I ask for just one character, do I get only the o, without the
 > > diaeresis, or do I get both (since they are linguistically one
 > > letter), or does it depend on how some editor happened to store it?
 > 
 > It should get you the next code unit as it comes out of the
 > incremental codec. (Did you see my semantic model I described in a
 > different thread?)

I don't like this<wink>, but since that's the way it's gonna be ...

 > > Distinguishing strings based on an accident of storage would violate
 > > unicode standards.  (More precisely, it would be a violation of
 > > standards to assume that they are distinguished.)
 > 
 > I don't give a damn about this requirement of the Unicode standard.

... this requirement does not apply to the Python str type as you have
described it.
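
To make the "one character" question concrete, here is a minimal
sketch, assuming a current Python 3 where str holds code points and
io.TextIOWrapper stands in for the text I/O layer under discussion.
The helper name first_char is mine, purely for illustration.

```python
import io

def first_char(raw: bytes) -> str:
    """Read one character from UTF-8 bytes through a text-mode stream."""
    return io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8").read(1)

precomposed = "\u00f6".encode("utf-8")   # U+00F6, o with diaeresis
decomposed = "o\u0308".encode("utf-8")   # 'o' followed by U+0308

got_precomposed = first_char(precomposed)
got_decomposed = first_char(decomposed)

# What read(1) returns depends on how the text happened to be stored.
print(ascii(got_precomposed), ascii(got_decomposed))
```

No normalization happens on the way in; the reader just hands back
whatever unit the decoder produced first.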

I think at this stage we're asking for trouble if we have any
normalization by default, even in the TextIO module.  str is not text,
it's an array of code units.  str is going to be used to implement
codecs, I/O buffers, all kinds of things that don't necessarily have
Unicode text semantics.  Unless the Python language itself defines the
semantics of the array of code units, EIBTI.  This accords with
Martin's statement about identifiers being the only thing he proposed
normalizing.

Even if we know a user wants text, I don't see any state of the art
that allows us to guess which normalization will be most useful to
him.  I think for identifiers, NFKC is almost a no-brainer.  But for
strings it is not at all obvious.  NFC violates useful string
invariants such as len(a) + len(b) == len(a+b).  AFAICS, NFD does
not.  OTOH, if you don't need strings to obey array invariants, NFC is
much more friendly to "dumb" UIs that just display the characters as
they get them, without trying to find an equivalent that is in the
font for missing characters.
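
A small sketch of the length point, using the standard unicodedata
module.  It checks one pair of strings only, so it illustrates the
claim rather than proving it; the helper names nfc and nfd are
shorthand I introduced.

```python
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

def nfd(s):
    return unicodedata.normalize("NFD", s)

a = "o"
b = "\u0308"  # COMBINING DIAERESIS; each piece is already NFC and NFD

# Sum of the normalized lengths versus length of the normalized sum.
nfc_lengths = (len(nfc(a)) + len(nfc(b)), len(nfc(a + b)))
nfd_lengths = (len(nfd(a)) + len(nfd(b)), len(nfd(a + b)))

print(nfc_lengths)  # the two numbers differ: NFC composed the pair
print(nfd_lengths)  # the two numbers agree for this pair
```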

And it seems plausible that some applications will mix normalizations
inside of the Python instance.  The app must handle this; Python
can't.  Even if you carry normalization information around with your
str object, what normalization is Python supposed to apply to nfd_str
+ nfc_str?  But surely that operation is permissible!
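
For instance (a sketch with the standard unicodedata module; the
variable names follow the ones above), the concatenation can end up
in neither form, and nothing in the result says which one the app
wanted:

```python
import unicodedata

nfd_str = unicodedata.normalize("NFD", "\u00f6")   # 'o' + U+0308
nfc_str = unicodedata.normalize("NFC", "e\u0301")  # U+00E9

mixed = nfd_str + nfc_str  # perfectly legal, just not normalized

is_nfc = mixed == unicodedata.normalize("NFC", mixed)
is_nfd = mixed == unicodedata.normalize("NFD", mixed)
print(is_nfc, is_nfd)
```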

 > > In practice, binary concerns do intrude even for text data; you may
 > > well want to save it back out in the original encoding, without any
 > > spurious changes.

Then for the purposes of this discussion, it's not text, it's binary.
In many cases it will need to be read as bytes and stored that way
until written back out.

I.e., many legacy encodings do not support roundtrips, such as those
that use ISO 2022 extension techniques: there's no rule against having
a mode-changing sequence and its inverse in succession, and it's
occasionally seen in the wild.  Even UTF-8 has unnormalized
(non-shortest-form) representations for many characters, and it was
only recently that Unicode came to require that they be treated as
errors, and not interpreted (producing them has always been forbidden).
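
As an illustration of the UTF-8 case, here is a sketch assuming a
current Python 3 decoder; try_decode is a hypothetical helper, not an
existing API.  The bytes C0 AF are a two-byte, non-shortest encoding
of "/" (U+002F), and the decoder refuses them rather than interpreting
them, so such input can only survive a roundtrip if it is kept as
bytes.

```python
def try_decode(raw):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

shortest_slash = b"\x2f"      # the only valid UTF-8 encoding of '/'
overlong_slash = b"\xc0\xaf"  # same code point, padded to two bytes

decoded_shortest = try_decode(shortest_slash)
decoded_overlong = try_decode(overlong_slash)
print(decoded_shortest, decoded_overlong)
```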