[Python-3000] setup.py fails in the py3k-struni branch

"Martin v. Löwis" martin at v.loewis.de
Sat Jun 16 01:00:10 CEST 2007


> This was in the context that it is decided by the community that a str8
> type is needed and it does not go away.

I think *that* context has not occurred. People wanted a read-only
bytes type, not a byte-oriented character string type.

> The alternative is for str8 to be replaced by byte objects which I
> believe was, and still is, the plan if possible.

That type is already implemented.

> The same semantic issues will also be present in bytes objects in one
> form or another when handling data acquired from sources that use
> encoded strings.  They don't go away even if str8 does go away.

No they don't. The bytes type doesn't have an encoding associated
with it, and it shouldn't. Values may not even represent text,
but, say, image data.
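
To illustrate, here is a minimal sketch. It assumes the bytes type behaves as it does in released Python 3, which may differ in detail from the py3k-struni branch. The same byte values give different text depending on the encoding the caller supplies, and some byte values are not text at all:

```python
# A bytes object stores only integer values; it carries no encoding.
raw = b"\xc3\xa9"

# The caller must choose an encoding; the same bytes yield different text.
as_utf8 = raw.decode("utf-8")      # one character: U+00E9
as_latin1 = raw.decode("latin-1")  # two characters: U+00C3 U+00A9

# The first eight bytes of a PNG file: image data, not text in any encoding.
png_signature = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])
try:
    png_signature.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```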

> It sort of depends on how someone wants to handle situations where
> encoded strings are encountered.  Do they decode them and convert
> everything to unicode and then convert back as needed for any output?
> Or can they keep the data in the encoded form for the duration?  I
> expect different people will feel differently on this.

In Py3k, they will use the string type, because anything else will
just be too tedious.
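
A sketch of that pattern follows. The Latin-1 input and UTF-8 output are assumptions chosen purely for illustration: decode once at the input boundary, work on the string type, and encode once on output.

```python
incoming = b"caf\xe9"                # bytes as read from a Latin-1 source (assumed)
text = incoming.decode("latin-1")    # decode once, at the input boundary
label = "Order: " + text             # ordinary string operations on characters
outgoing = label.encode("utf-8")     # encode once, at the output boundary
```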

>> As for creating str8 objects from bytes objects: If you want
>> the str8 object to carry an encoding, you would have to *specify*
>> the encoding when creating the str8 object, since the bytes object
>> does not have that information. This is *very* hard, as you may
>> not know what the encoding is when you need to create the str8
>> object.
> 
> True, and this also applies if you want to convert an already encoded
> bytes object to unicode as well.

Right, and therefore it can never be automatic - whereas the conversion
between a bytes object and a str8 object *could* be automatic otherwise
(assuming the str8 type survives at all).
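
A small sketch of what "never automatic" looks like in practice, assuming the behaviour of released Python 3 (the sample data is made up for illustration): mixing the two types is refused rather than guessed at, and the caller has to name the encoding.

```python
data = b"caf\xe9"  # bytes of unknown origin (illustrative value)

# Mixing bytes and str is refused; no encoding is guessed.
try:
    "menu: " + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The caller has to name the encoding explicitly.
text = data.decode("latin-1")

# Naming the wrong encoding can fail outright.
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```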

> One approach is to possibly use a factory function that uses metaclasses
> or mixins to create these based either on a str base type or a bytes
> object.
> 
>      Latin1 = get_encoded_str_type('latin-1')
> 
>      s1 = Latin1('Hello ')
[snip]

I think I have lost track of precisely what problem you are trying to solve.

>> It's easy to tell what happens now: the bytes of those input
>> strings are just appended; the result string does not follow
>> a consistent character encoding anymore. This answer does
>> not apply to your proposed modification, as it does not answer
>> what the value of the .encoding attribute of the str8 would be
>> after concatenation (likewise for slicing).
> 
> And what is the use of appending unlike encoded str8 types?

You may need to put encoded text into binary data, e.g. putting
a file name into a zip file. Some of the bytes will be utf-8
encoded, others will be cp437 encoded, others will be data structures
of the zip file, and the rest will be compressed bytes.

Likewise for constructing MIME messages: different pieces will
use different encodings.
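
A rough sketch of such mixed content follows. The container layout and the pack_entry helper are hypothetical and are *not* the real zip format; the point is only that one byte sequence holds differently encoded names, binary length fields, and compressed data side by side.

```python
import struct
import zlib

def pack_entry(name: str, name_encoding: str, payload: bytes) -> bytes:
    """Pack one entry of a hypothetical container format (not real zip)."""
    encoded_name = name.encode(name_encoding)
    compressed = zlib.compress(payload)
    # Header: two little-endian length fields, i.e. a binary data structure.
    header = struct.pack("<HI", len(encoded_name), len(compressed))
    return header + encoded_name + compressed

# One blob, two entries, two different name encodings.
blob = (
    pack_entry("r\u00e9sum\u00e9.txt", "cp437", b"old-style name")
    + pack_entry("r\u00e9sum\u00e9.txt", "utf-8", b"new-style name")
)
```

No single encoding describes blob as a whole, so attaching one .encoding attribute to it would be meaningless.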

> I think what Guido is thinking is we may need keep str8 around (for a
> while) as a 'C' compatible string type for purposes of interfacing to
> 'C' code.

That might be. I hope not, and I have plans to eliminate the need for
many such places (providing Unicode-oriented APIs in some cases,
and using the bytes type in other cases).

In cases where we still have char*, I think the API should specify that
this must be ASCII most of the time, with UTF-8 in selected other
cases; arbitrary binary data only when interfacing to the bytes
type.

Regards,
Martin

