[Python-3000] canonicalization [was: On PEP 3116: new I/O base classes]

"Martin v. Löwis" martin at v.loewis.de
Fri Jun 22 08:45:27 CEST 2007


> Counter-proposal: normalization is provided as library functionality.
> Applications are responsible for normalizing data when they need it
> to be normalized and they can't be sure that it isn't already
> normalized. The source parser used by import and a few other places is
> an "application" in this sense and can certainly apply whatever
> normalization is required. Have we agreed on the level of
> normalization for source code yet? I'm pretty sure we have agreed on
> *when* it happens, i.e. (logically) before the lexer starts scanning
> the source code.

That isn't actually my view: I would apply normalization *only* to
identifiers, i.e. leave string literals unmodified. If people would
rather see normalization applied to the entire input, that would be
an option, of course (although perhaps more expensive to implement,
as you need to perform it on all source, even if that source turns
out to be ASCII only).
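
To illustrate the identifier-only approach, here is a minimal Python
sketch. It assumes NFC as the target form (the form is not settled in
this thread) and uses the standard tokenize module as a stand-in for
the real lexer; the helper name normalize_identifiers is hypothetical.

```python
import io
import tokenize
import unicodedata


def normalize_identifiers(source, form="NFC"):
    """Yield (token_type, token_string) pairs; only NAME tokens are normalized.

    The normalization form is an assumption made for illustration.
    """
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        text = tok.string
        if tok.type == tokenize.NAME:
            # Identifiers (and keywords) are normalized ...
            text = unicodedata.normalize(form, text)
        # ... while string literals and everything else pass through as-is.
        yield tok.type, text


# U+212B ANGSTROM SIGN is canonically equivalent to U+00C5
# (LATIN CAPITAL LETTER A WITH RING ABOVE), its NFC form.
source = "\u212b = '\u212b'\n"
tokens = list(normalize_identifiers(source))

names = [s for t, s in tokens if t == tokenize.NAME]
strings = [s for t, s in tokens if t == tokenize.STRING]
```

After this runs, names holds the identifier in its NFC spelling, while
strings still holds the literal exactly as it was written in the source.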

> What is the status of normalization in Java? Does Java source code get
> normalized before it is parsed? 

The JLS is silent on that issue, so I think the answer is "no".
A quick test (see attached file) confirms this: the compiler
reports an error "cannot find symbol" even though the symbol
would be defined under NFC (or NFD).
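
The attachment is not reproduced in the archive. As a rough analogue
of the effect (not the Java test itself), a Python sketch with a dict
standing in for a symbol table; all names here are hypothetical:

```python
import unicodedata

nfc_name = "caf\u00e9"    # precomposed: U+00E9
nfd_name = "cafe\u0301"   # decomposed: 'e' followed by U+0301

# The "definition" uses the NFC spelling.
symbols = {nfc_name: 1}

# Without normalization, the NFD spelling is simply a different key.
found_raw = nfd_name in symbols


def canon(name, form="NFC"):
    # Assumption: NFC chosen for illustration; NFD would work equally well
    # as long as definitions and lookups use the same form.
    return unicodedata.normalize(form, name)


normalized_symbols = {canon(k): v for k, v in symbols.items()}
found_normalized = canon(nfd_name) in normalized_symbols
```

The raw lookup fails, and the lookup succeeds once both sides are
brought to the same normalization form.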

> What if \u.... is used? 

It just gets inserted as-is.

> Do the Java I/O library classes normalize text?

The java.io.InputStreamReader doesn't; see attached code.
It appears that the JRE had no public normalization API
until Java 6, where you can use java.text.Normalizer. Before that,
the class lived at sun.text.Normalizer and was (apparently)
used only for URIs (normalizing to NFC), collation (performing
NFD on request), and regular expressions (likewise).

Apparently, Sun doesn't consider Unicode normalization
an issue.

Regards,
Martin


-------------- next part --------------
A non-text attachment was scrubbed...
Name: foo.java
Type: text/x-java
Size: 53 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070622/ec269a30/attachment.java 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: r.java
Type: text/x-java
Size: 479 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-3000/attachments/20070622/ec269a30/attachment-0001.java 

