[Python-Dev] Re: PEP: Defining Unicode Literal Encodings (revision 1.1)

M.-A. Lemburg mal@lemburg.com
Sat, 14 Jul 2001 13:32:10 +0200


Skip Montanaro wrote:
>=20
>     mal> Here's an updated version which clarifies some issues...
>     ...
>     mal>     I propose to make the Unicode literal encodings (both stan=
dard
>     mal>     and raw) a per-source file option which can be set using t=
he
>     mal>     "directive" statement proposed in PEP 244 in a slightly
>     mal>     extended form (by adding the '=3D' between the directive n=
ame and
>     mal>     it's value).
>=20
> I think you need to motivate the need for a different syntax than is de=
fined
> in PEP 244.  I didn't see any obvious reason why the '=3D' is required.

I'm not picky about the '=3D'; if people don't want it, I'll
happily drop it from the PEP. The only reason I think it may be
worthwhile adding it is because it simply looks right:

directive unicodeencoding =3D 'latin-1'

rather than

directive unicodeencoding 'latin-1'

(Note that internally this will set a flag to a value, so the
assigning character of '=3D' seems to fit in nicely.)
=20
> Also, how do you propose to address /F's objections, particularly that =
the
> directive can't syntactically appear before the module's docstring (whe=
re it
> makes sense that the module author would logically want to use a non-de=
fault
> encoding)?

Guido hinted to the problem of breaking code, Tim objected
to requiring this.=20

I don't see the need to use Unicode literals
as module doc-strings, so I think the problem is not a real one
(8-bit strings can be written using any encoding just like you can=20
now).

Still, if people would like to use Unicode literals for module
doc-strings, then they should place the directive *before* the
doc-string accepting that this could break some tools (the PEP currently
does not restrict the placement of the directive). Alternatively,
we could allow placing the directive into a comment, e.g.

#!/usr/local/python
#directive unicodeencoding =3D 'utf-8'
u"""
     This is a Unicode doc-string
"""

About Fredrik's idea that the source code should only use one=20
encoding:=20

Well, that's possible with the proposed directive, since=20
only Unicode literals carry data for Python is encoding-aware
and all other parts are under the programmer's control, e.g.

#!/usr/local/python
""" Module Docs...
"""
directive unicodeencoding =3D 'latin-1'
...
u =3D "H=E9ll=F4 W=F6rld !"
...

will give you pretty much what Fredrik asked for.=20

Note that since Python does not assign encoding information to=20
8-bit strings, comments etc. the only parts in a Python program=20
for which the programmer must explicitly tell Python which=20
encoding to assume are the Unicode literals.

--=20
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/