[Python-Dev] PEP 263 - Defining Python Source Code Encodings

M.-A. Lemburg mal@lemburg.com
Sun, 14 Jul 2002 19:02:13 +0200


Martin v. Loewis wrote:
> "M.-A. Lemburg" <mal@lemburg.com> writes:
> 
> 
>>>Can you elaborate on what you think the difference is? I believe the PEP
>>>is silent on this specific aspect,
>>
>>It does mention this as part of phase 2.
> 
> 
> All I can find is
> 
> <quote>
> The builtin compile() API will be enhanced to accept Unicode as input.
> </quote>
> 
> That leaves open the question of what the compile function *does*
> beyond merely accepting Unicode strings; it is clear that it tries
> to compile it, as it would a byte string.

Oh, I thought it would be clear from reading the complete
text:

"""
     2. Change the tokenizer/compiler base string type from char* to
        Py_UNICODE* and apply the encoding to the complete file.

        Source files which fail to decode cause an error to be raised
        during compilation.

        The builtin compile() API will be enhanced to accept Unicode as
        input. 8-bit string input is subject to the standard procedure
        for encoding detection as described above.
"""

Of course, we no longer need to convert the tokenizer to
work on Py_UNICODE, so the updated text should mention
that compile() encodes Unicode input to UTF-8 and then
continues with the usual processing. (Also see my reply
to Fredrik.)
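
To illustrate, here's a rough Python sketch of that flow (pure
illustration, not the actual C-level code; _compile_bytes stands
in for the existing byte string compilation path):

    def compile_any(source, filename, mode):
        if isinstance(source, unicode):
            # Encode the Unicode source to UTF-8 and hand it to the
            # normal byte string path. The tokenizer then has to
            # treat the buffer as UTF-8 instead of running the usual
            # encoding declaration detection on it.
            source = source.encode('utf-8')
        return _compile_bytes(source, filename, mode)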

> The unspecified aspect is the treatment of byte strings within the
> Unicode string. The current compiler treats them "as-is"; this is
> clearly no option. The reasonable options are:
> 
> 1. convert to byte string using "ascii" encoding,
> 2. convert to byte string using "utf-8" encoding,
> 3. convert to byte string using system default encoding,
> 4. convert to byte string using encoding declared inside the code
>    string. If that route is taken, the question is what happens
>    if no encoding declaration is found.
> 
> 
>>No need for this. The PEP already mentions it.
> 
> 
> Can you please quote the precise words in the text of the PEP that
> answer the question which of the four options above is taken?

Option 2. Ideally, the tokenizer would skip the encoding
declaration detection and start directly with the UTF-8
string (this also solves the problems you'd run into when
the Unicode source itself contains a source encoding
comment).

Is that possible with the implementation?
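
For what it's worth, this is what option 2 would mean at the
Python level (a small sketch, assuming compile() already accepts
Unicode input as per the PEP):

    src = u'x = "\u00e9"\n'   # Unicode source; non-ASCII char inside
                              # an 8-bit string literal
    ns = {}
    exec compile(src, '<test>', 'exec') in ns
    # under option 2 the literal comes back UTF-8 encoded
    assert ns['x'] == '\xc3\xa9'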

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/