[Python-3000] canonicalization [was: On PEP 3116: new I/O base classes]

Jim Jewett jimjjewett at gmail.com
Thu Jun 21 18:12:22 CEST 2007


Should canonicalization be an extra feature of the Text IO, on par
with character encoding?

On 6/20/07, Daniel Stutzbach <daniel at stutzbachenterprises.com> wrote:
> On 6/20/07, Bill Janssen <janssen at parc.com> wrote:

[For the TextIO, as opposed to the raw IO, Bill originally proposed
dropping read(n), because character count is not well-defined.  Daniel
objected that not all text has useful line breaks.]

> > ... just saying "give me N characters" isn't enough.
> > We need to say, "N characters assuming a text
> > encoding of M, with a normalization policy of Q,
> > and a newline policy of R".

[Daniel points out that TextIO already handles M and R.]

> I'm not sure I 100% understand what you mean by
> "normalization policy" (Q).  Could you give an example?

How many characters are there in ö?

If I ask for just one character, do I get only the o, without the
diaeresis; do I get both code points (since they are linguistically
one letter); or does it depend on how some editor happened to store
the text?
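
A quick illustration in Python 3 terms (a minimal sketch; the two
storage forms are U+00F6 versus U+006F plus U+0308 COMBINING
DIAERESIS):

    # One letter, two possible storage forms.
    composed = "\u00f6"      # precomposed: a single code point
    decomposed = "o\u0308"   # "o" followed by a combining diaeresis

    print(len(composed))           # 1
    print(len(decomposed))         # 2
    print(composed == decomposed)  # False, though they render identically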

Distinguishing strings based on an accident of storage would violate
the Unicode standard.  (More precisely, the standard forbids assuming
that two canonically equivalent sequences are distinguished.)

To the extent that you are treating the data as text rather than
binary, NFC or NFD normalization should always be appropriate.
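
For illustration, the standard library already exposes these forms
through unicodedata.normalize; a minimal sketch:

    import unicodedata

    composed = "\u00f6"      # NFC form of ö
    decomposed = "o\u0308"   # NFD form of ö

    # After normalizing both strings to a common form, they compare equal.
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True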

In practice, binary concerns do intrude even for text data; you may
well want to save it back out in the original encoding, without any
spurious changes.
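
To make that concern concrete, here is a sketch of how an
unconditional NFC pass alters the bytes of NFD-stored data on a
round-trip:

    import unicodedata

    original = b"o\xcc\x88"           # UTF-8 for "o" + U+0308 (NFD storage)
    text = original.decode("utf-8")
    nfc = unicodedata.normalize("NFC", text)

    # Writing the normalized text back changes the file's bytes.
    print(nfc.encode("utf-8"))              # b'\xc3\xb6' (U+00F6)
    print(nfc.encode("utf-8") == original)  # False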

Proposal:

    open would default to NFC.

    import would open source code with NFKC.

    An explicit canonicalization of None would allow round-trips
    without spurious binary-level changes.
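
A rough sketch of what such a knob might look like.  The wrapper class
and its "form" parameter are hypothetical, purely to make the proposal
concrete; nothing like this exists in the PEP 3116 draft:

    import io
    import unicodedata

    class NormalizingTextIO(io.TextIOBase):
        """Hypothetical text wrapper that canonicalizes on read.

        form may be "NFC", "NFD", "NFKC", "NFKD", or None (no
        canonicalization, for byte-faithful round-trips).
        """

        def __init__(self, inner, form="NFC"):
            self._inner = inner   # an underlying text stream
            self._form = form

        def read(self, n=-1):
            text = self._inner.read(n)
            if self._form is None:
                return text
            # Naive: normalizing a chunk can change its length, and a
            # combining mark falling at a chunk boundary can be
            # mishandled -- exactly the read(n) ambiguity above.
            return unicodedata.normalize(self._form, text)

    # Usage (assuming a UTF-8 text file "data.txt"):
    f = NormalizingTextIO(open("data.txt", encoding="utf-8"), form="NFC")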

-jJ

