[Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

Tue Feb 14 18:58:11 CET 2006

Guido van Rossum wrote:
> On 2/13/06, M.-A. Lemburg <mal at egenix.com> wrote:
>> Guido van Rossum wrote:
>>> It'd be cruel and unusual punishment though to have to write
>>>
>>>   bytes("abc", "Latin-1")
>>>
>>> I propose that the default encoding (for basestring instances) ought
>>> to be "ascii" just like everywhere else. (Meaning, it should really be
>>> the system default encoding, which defaults to "ascii" and is
>>> intentionally hard to change.)
>> We're talking about Py3k here: "abc" will be a Unicode string,
>> so why restrict the conversion to 7 bits when you can have 8 bits
>> without any conversion problems ?
> 
> As Phillip guessed, I was indeed thinking about introducing bytes()
> sooner than that, perhaps even in 2.5 (though I don't want anything
> rushed).

Hmm, that is probably going to be too early. As the thread shows
there are lots of things to take into account, esp. since if you
plan to introduce byte() in 2.x, the upgrade path to 3.x would
have to be carefully planned. Otherwise, we end up introducing
a feature which is meant to prepare for 3.x and then we end up
causing breakage when the move is finally implemented.

> Even in Py3k though, the encoding issue stands -- what if the file
> encoding is Unicode? Then using Latin-1 to encode bytes by default
> might not by what the user expected. Or what if the file encoding is
> something totally different? (Cyrillic, Greek, Japanese, Klingon.)
> Anything default but ASCII isn't going to work as expected. ASCII
> isn't going to work as expected either, but it will complain loudly
> (by throwing a UnicodeError) whenever you try it, rather than causing
> subtle bugs later.

I think there's a misunderstanding here: in Py3k, all "string"
literals will be converted from the source code encoding to
Unicode. There are no ambiguities - a Klingon character will still
map to the same ordinal used to create the byte content regardless
of whether the source file is encoded in UTF-8, UTF-16 or
some Klingon charset (are there any ?).

Furthermore, by restricting to ASCII you'd also outrule hex escapes
which seem to be the natural choice for presenting binary data in
literals - the Unicode representation would then only be an
implementation detail of the way Python treats "string" literals
and a user would certainly expect to find e.g. \x88 in the bytes object
if she writes bytes('\x88').

But maybe you have something different in mind... I'm talking
about ways to create bytes() in Py3k using "string" literals.

>> While we're at it: I'd suggest that we remove the auto-conversion
>> from bytes to Unicode in Py3k and the default encoding along with
>> it.
> 
> I'm not sure which auto-conversion you're talking about, since there
> is no bytes type yet. If you're talking about the auto-conversion from
> str to unicode: the bytes type should not be assumed to have *any*
> properties that the current str type has, and that includes
> auto-conversion.

I was talking about the automatic conversion of 8-bit strings to
Unicode - which was a key feature to make the introduction of
Unicode less painful, but will no longer be necessary in Py3k.

>> In Py3k the standard lib will have to be Unicode compatible
>> anyway and string parser markers like "s#" will have to go away
>> as well, so there's not much need for this anymore.
>>
>> (Maybe a bit radical, but I guess that's what Py3k is meant for.)
> 
> Right.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Feb 14 2006)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::