[Python-3000] canonicalization [was: On PEP 3116: new I/O base classes]

Guido van Rossum guido at python.org
Fri Jun 22 07:54:23 CEST 2007


On 6/21/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> Should canonicalization be an extra feature of the Text IO, on par
> with character encoding?
>
> On 6/20/07, Daniel Stutzbach <daniel at stutzbachenterprises.com> wrote:
> > On 6/20/07, Bill Janssen <janssen at parc.com> wrote:
>
> [For the TextIO, as opposed to the raw IO, Bill originally proposed
> dropping read(n), because character count is not well-defined.  Dan
> objected that not all text has useful line breaks.]
>
> > > ... just saying "give me N characters" isn't enough.
> > > We need to say, "N characters assuming a text
> > > encoding of M, with a normalization policy of Q,
> > > and a newline policy of R".
>
> [Daniel points out that TextIO already handles M and R]
>
> > I'm not sure I 100% understand what you mean by
> > "normalization policy" (Q).  Could you give an example?
>
> How many characters are there in ö?
>
> If I ask for just one character, do I get only the o, without the
> diaeresis, or do I get both (since they are linguistically one
> letter), or does it depend on how some editor happened to store it?

It should get you the next code unit as it comes out of the
incremental codec. (Did you see the semantic model I described in a
different thread?)
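
To make that concrete, here's a minimal sketch using the existing
codecs machinery (assuming UTF-8 input; the byte string is made up
for illustration):

    import codecs

    # "ö" stored in decomposed form: U+006F LATIN SMALL LETTER O
    # followed by U+0308 COMBINING DIAERESIS, encoded as UTF-8.
    data = b"o\xcc\x88"

    dec = codecs.getincrementaldecoder("utf-8")()
    print(dec.decode(data[:1]))  # 'o' -- the base letter comes out alone
    print(dec.decode(data[1:]))  # '\u0308' -- the combining mark follows

So if you ask for one character you get the bare 'o'; the diaeresis
shows up on the next read.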

> Distinguishing strings based on an accident of storage would violate
> unicode standards.  (More precisely, it would be a violation of
> standards to assume that they are distinguished.)

I don't give a damn about this requirement of the Unicode standard. At
least, I don't think Python should enforce it at the level of the str
data type, and that includes str objects returned by the I/O library.
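
To be explicit about what that means: two canonically equivalent
strings are distinct str objects until somebody normalizes them. A
minimal sketch with the unicodedata module:

    import unicodedata

    s1 = "\u00f6"   # 'ö' precomposed: one code point
    s2 = "o\u0308"  # 'ö' decomposed: base letter + combining mark

    print(s1 == s2)          # False: str compares code points, nothing more
    print(len(s1), len(s2))  # 1 2
    print(unicodedata.normalize("NFC", s2) == s1)  # True, once normalized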

> To the extent that you are treating the data as text rather than
> binary, NFC or NFD normalization should always be appropriate.
>
> In practice, binary concerns do intrude even for text data; you may
> well want to save it back out in the original encoding, without any
> spurious changes.
>
> Proposal:
>
>     open would default to NFC.
>
>     import would open source code with NFKC.
>
>     An explicit None canonicalization would allow round-trips without
> spurious binary-level changes.

Counter-proposal: normalization is provided as library functionality.
Applications are responsible for normalizing data when they need it
normalized and can't be sure that it isn't already normalized. The
source parser used by import (and a few other places) is an
"application" in this sense and can certainly apply whatever
normalization is required. Have we agreed on the level of
normalization for source code yet? I'm pretty sure we have agreed on
*when* it happens, i.e. (logically) before the lexer starts scanning
the source code.
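
For illustration only (this is not a decision on the level!), here is
what applying the library functionality would look like; note that
NFKC additionally folds compatibility characters, which is the part
that matters for identifiers:

    import unicodedata

    ident = "\ufb01le"  # starts with the ligature 'ﬁ' (U+FB01)
    print(unicodedata.normalize("NFKC", ident))  # 'file'
    print(unicodedata.normalize("NFC", ident))   # 'ﬁle' -- NFC leaves it alone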

I would not be against an additional optional layer in the I/O stack
that applies normalization. We could even have an optional parameter
to open() to push this onto the stack. But I don't think it should be
the default.
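
A rough sketch of what such an optional layer could look like (the
class name and its placement in the stack are hypothetical, not a
design):

    import io
    import unicodedata

    class NormalizingReader:
        # Hypothetical wrapper; a real version would have to buffer
        # across read() boundaries so a combining mark is never split
        # from its base character.
        def __init__(self, stream, form="NFC"):
            self._stream = stream
            self._form = form

        def read(self, n=-1):
            return unicodedata.normalize(self._form, self._stream.read(n))

    f = NormalizingReader(io.StringIO("o\u0308"))
    print(len(f.read()))  # 1 -- decomposed input comes out as U+00F6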

What is the status of normalization in Java? Does Java source code get
normalized before it is parsed? What if \u.... is used? Do the Java
I/O library classes normalize text?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

