[Python-Dev] RE: Defining Unicode Literal Encodings

M.-A. Lemburg mal at lemburg.com
Fri Jul 13 17:56:40 EDT 2001


Tim Peters wrote:
> 
> [M.-A. Lemburg]
> > PEP: 0263 (?)
> > Title: Defining Unicode Literal Encodings
> > Version: $Revision: 1.0 $
> > Author: mal at lemburg.com (Marc-André Lemburg)
> > Status: Draft
> > Type: Standards Track
> > Python-Version: 2.3
> > Created: 06-Jun-2001
> > Post-History:
> 
> Since this depends on PEP 244, it should also have a
> 
>   Requires: 244
> 
> header line.

Ok, I'll add that.
 
> > ...
> > ... can be set using the "directive" statement proposed in PEP 244.
> >
> >     The syntax for the directives is as follows:
> >
> >     'directive' WS+ 'unicodeencoding' WS* '=' WS* PYTHONSTRINGLITERAL
> >     'directive' WS+ 'rawunicodeencoding' WS* '=' WS* PYTHONSTRINGLITERAL
> 
> PEP 244 doesn't allow these spellings:  at most one atom is allowed after
> the directive name, and
> 
>     = "whatever"
> 
> isn't an atom.  Remove the '=' and PEP 244 is happy, though.  If you want to
> keep the "=", PEP 244 has to change.

True... would that pose a problem ?
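
To make the two spellings concrete, here is a rough sketch (present-day Python 3, not part of the PEP) of a matcher that accepts the directive both with and without the "=". The helper name parse_encoding_directive is hypothetical, and the string-literal pattern is a deliberately simplified stand-in for PYTHONSTRINGLITERAL:

```python
import re

# Simplified stand-in for PYTHONSTRINGLITERAL: a single- or double-quoted
# string without escapes or embedded quotes (an assumption of this sketch).
_DIRECTIVE = re.compile(
    r"""^directive[ \t]+
        (?P<name>unicodeencoding|rawunicodeencoding)
        [ \t]*(?P<eq>=)?[ \t]*
        (?P<quote>['"])(?P<value>[^'"\\]*)(?P=quote)
        [ \t]*$""",
    re.VERBOSE,
)


def parse_encoding_directive(line):
    """Return (name, encoding, uses_equals), or None if the line does not match."""
    match = _DIRECTIVE.match(line.strip())
    if match is None:
        return None
    return match.group("name"), match.group("value"), match.group("eq") is not None
```

The third element of the result tells whether the "=" form was used, which is the form PEP 244 would have to be changed to allow.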
 
[Paul]
> I think that there should be a single directive for:
> 
>  * unicode strings
>  * 8-bit strings
>  * comments
> 
> If a user uses UTF-8 for 8-bit strings and Shift-JIS for Unicode, there
> is basically no text editor in the world that is going to do the right
> thing. And it isn't possible for a web server to properly associate an
> encoding. In general, it isn't a useful configuration.

Please don't mix 8-bit strings with Unicode literals: 8-bit
strings don't carry any encoding information, so there would be
nowhere to store an encoding given for them.

Comments, OTOH, are part of the program text, so they have to be ASCII
just like the Python source itself.

Note that it doesn't make sense to use an encoding which is not an
ASCII superset for the Unicode literal encoding (as you and others
have noted).
Since all builtin Python encodings are ASCII-supersets, this
shouldn't pose much of a problem, though ;-)
 
> Also, no matter what the directive says, I think that \uXXXX should
> continue to work. Just as in 8-bit strings, it should be possible to mix
> and match direct encoded input and backslash-escaped characters.
> Sometimes one is convenient (because of your keyboard setup) and
> sometimes the other is convenient. This proposal exists only to improve
> typing convenience so we should go all the way and allow both.

Hmm, good point, but hard to implement. We'd probably need a two
phase decoding for this to work:

1. decode the given Unicode literal encoding
2. decode any Unicode escapes in the Unicode string
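
A minimal sketch of these two phases, written in present-day Python 3 and only as an illustration, not as the proposed implementation. The function name decode_unicode_literal is hypothetical, and the sketch only handles \uXXXX escapes; it does not treat an escaped backslash (\\u...) specially:

```python
import re

# Matches a backslash, 'u' and exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_unicode_literal(raw_bytes, encoding):
    """Decode the body of a Unicode literal in two phases."""
    # Phase 1: decode the raw source bytes using the declared literal encoding.
    text = raw_bytes.decode(encoding)
    # Phase 2: replace each \uXXXX escape with the character it names.
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
```

With this, a Latin-1 encoded literal body can mix directly typed characters such as "ü" with escapes such as \u20ac, which is the mix-and-match behaviour Paul asks for.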
 
> I strongly think we should restrict the directive to one per file and in
> fact I would say it should be one of the first two lines. It should be
> immediately following the shebang line if there is one. This is to allow
> text editors to detect it as they detect XML encoding declarations.
> 
> My opinions are influenced by the fact that I've helped implement
> Unicode support in a Python/XML editor. XML makes it easy to give the
> user a good experience. Python could too if we are careful.

I think that allowing one directive per file is the way to go,
but I'm not sure about the exact position. Basically, I think it
should go "near" the top, but not necessarily before any doc-string
in the file.
 
> [Guido]
> > Hm, then the directive would syntactically have to *precede* the
> > docstring.  That currently doesn't work -- the docstring may only be
> > preceded by blank lines and comments.  Lots of tools for processing
> > docstrings already have this built into them.  Is it worth breaking
> > them so that editors can remain stupid?
> 
> No.

Agreed.

Note that the PEP doesn't require the directive to be placed before the
doc-string. That point is still open. Technically, the compiler
will only need to know about the encoding before the first
Unicode literal in the source file.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/



