PEP: Defining Unicode Literal Encodings (revision 1.1)

M.-A. Lemburg mal at lemburg.com
Sat Jul 14 07:32:10 EDT 2001


Skip Montanaro wrote:
> 
>     mal> Here's an updated version which clarifies some issues...
>     ...
>     mal>     I propose to make the Unicode literal encodings (both standard
>     mal>     and raw) a per-source file option which can be set using the
>     mal>     "directive" statement proposed in PEP 244 in a slightly
>     mal>     extended form (by adding the '=' between the directive name and
>     mal>     its value).
> 
> I think you need to motivate the need for a different syntax than is defined
> in PEP 244.  I didn't see any obvious reason why the '=' is required.

I'm not picky about the '='; if people don't want it, I'll
happily drop it from the PEP. The only reason I think it may be
worthwhile adding it is because it simply looks right:

directive unicodeencoding = 'latin-1'

rather than

directive unicodeencoding 'latin-1'

(Note that internally this will set a flag to a value, so the
assignment character '=' seems to fit in nicely.)
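
To make the "set a flag to a value" reading concrete, here is a minimal
sketch in present-day Python of a line parser that accepts both spellings.
The function name parse_directive and the regular expression are
assumptions made for illustration only; neither PEP 244 nor this proposal
defines them.

```python
import re

# Hypothetical pattern: 'directive', a name, an optional '=', then a quoted value.
_DIRECTIVE_RE = re.compile(
    r"""^\s*directive\s+(?P<name>[A-Za-z_]\w*)\s*(?:=\s*)?(?P<quote>['"])(?P<value>[^'"]*)(?P=quote)\s*$"""
)

def parse_directive(line):
    """Return (name, value) for a directive line, or None if it does not match."""
    match = _DIRECTIVE_RE.match(line)
    if match is None:
        return None
    return match.group("name"), match.group("value")

# Both spellings yield the same name/value pair.
with_equals = parse_directive("directive unicodeencoding = 'latin-1'")
without_equals = parse_directive("directive unicodeencoding 'latin-1'")
```

Either way the result is a name bound to a value, which is why the '='
form reads naturally as an assignment.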
 
> Also, how do you propose to address /F's objections, particularly that the
> directive can't syntactically appear before the module's docstring (where it
> makes sense that the module author would logically want to use a non-default
> encoding)?

Guido hinted at the problem of breaking code, and Tim objected
to requiring this.

I don't see the need to use Unicode literals
as module doc-strings, so I think the problem is not a real one
(8-bit strings can be written using any encoding, just as they can
now).

Still, if people would like to use Unicode literals for module
doc-strings, then they should place the directive *before* the
doc-string, accepting that this could break some tools (the PEP currently
does not restrict the placement of the directive). Alternatively,
we could allow placing the directive into a comment, e.g.

#!/usr/local/python
#directive unicodeencoding = 'utf-8'
u"""
     This is a Unicode doc-string
"""
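
As a rough sketch of how a tool might pick up the comment form, the
following present-day Python scans only the leading comment lines of a
source text. The helper name find_unicode_encoding, the exact comment
pattern, and the choice to stop at the first non-comment line are all
assumptions of this sketch, not something the PEP specifies.

```python
import re

# Hypothetical comment form: "#directive unicodeencoding = '<name>'".
_COMMENT_DIRECTIVE_RE = re.compile(
    r"""^#\s*directive\s+unicodeencoding\s*=\s*(['"])([^'"]+)\1\s*$"""
)

def find_unicode_encoding(source, default=None):
    """Scan the leading comment lines of `source` for the directive comment."""
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped.startswith("#"):
            break  # first non-comment line (including a blank one): stop looking
        match = _COMMENT_DIRECTIVE_RE.match(stripped)
        if match:
            return match.group(2)
    return default

example = (
    "#!/usr/local/python\n"
    "#directive unicodeencoding = 'utf-8'\n"
    'u"""\n'
    "     This is a Unicode doc-string\n"
    '"""\n'
)
encoding = find_unicode_encoding(example)
```

Because the directive lives in a comment, it can precede the doc-string
without changing what existing tools see as the first statement.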

About Fredrik's idea that the source code should only use one 
encoding: 

Well, that's possible with the proposed directive, since
Unicode literals are the only data for which Python is encoding-aware;
all other parts are under the programmer's control, e.g.

#!/usr/local/python
""" Module Docs...
"""
directive unicodeencoding = 'latin-1'
...
u = u"Héllô Wörld !"
...

will give you pretty much what Fredrik asked for. 

Note that since Python does not assign encoding information to
8-bit strings, comments, etc., the only parts of a Python program
for which the programmer must explicitly tell Python which
encoding to assume are the Unicode literals.
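
A small illustration of why the encoding has to be stated for Unicode
literals: the same bytes in a source file turn into different text
depending on the encoding assumed. This is a sketch in present-day
Python; the byte string and the choice of cp437 as the "wrong" encoding
are just examples.

```python
# The bytes a latin-1 editor would write for the literal in the example above.
raw = b"H\xe9ll\xf4 W\xf6rld !"

# Decoding with the encoding the author intended recovers the text.
as_latin1 = raw.decode("latin-1")

# Decoding the very same bytes with another 8-bit encoding gives other characters.
as_cp437 = raw.decode("cp437")

print(as_latin1)
print(as_cp437)
print(as_latin1 == as_cp437)
```

An 8-bit string would simply carry the raw bytes along unchanged, so no
such choice arises there.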

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/




