differences between python 2.2 and python 2.3

Bengt Richter bokr at oz.net
Sun Dec 7 14:39:44 EST 2003


On 07 Dec 2003 18:31:49 +0100, martin at v.loewis.de (Martin v. Löwis) wrote:

>"Fredrik Lundh" <fredrik at pythonware.com> writes:
>
>> otoh, it would make sense to use 8-bit strings to store Unicode strings
>> that happen to contain only Unicode code points in the full 8-bit range
>> (0..255).
>
>I'm not sure about the advantages. It would give a more efficient
>representation, yes, but at the cost of a slower implementation. Codecs
>often cannot know in advance whether a string will contain only
>latin-1 (unless they are the latin-1 or the ascii codec), so they
>would need to scan over the input first.
But if every string that represented characters carried a .coding attribute,
the codec could just consult that attribute and know without scanning.
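
For concreteness, here is a minimal Python-level sketch of that idea. The
.coding attribute and the CodedStr class are hypothetical (nothing like this
exists in 2.2 or 2.3); it just shows how a codec could trust the label
instead of scanning the bytes:

    # hypothetical: an 8-bit string that carries its encoding name
    class CodedStr(str):
        def __new__(cls, data, coding):
            self = str.__new__(cls, data)
            self.coding = coding      # e.g. 'latin-1', 'ascii'
            return self

    s = CodedStr('abc', 'latin-1')
    # the codec can trust the label, no scan of the input needed:
    u = unicode(s, s.coding)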

>
>In addition, operations like PyUnicode_AsUnicode would be very
>difficult to implement (unless you have *two* representation pointers
>in the Unicode object - at which time the memory savings are
>questionable).
Maybe if unicode objects used strings with a standard .coding attribute of
'unicode' for the normalized representation (meaning the system's standard
unicode encoding, probably utf-16le byte strings), then any string with a
.coding attribute could be captured instantly and plugged into the unicode
object by reference as an alternative representation of the unicode data.
Conversions of the actual representation format could then be lazy:
normalizing when sensible, but also possibly performing multi-string
operations in their native encodings when those are compatible.
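
To make "lazy" concrete, here is a rough sketch of what such a wrapper might
look like (hypothetical names; the real unicode type works nothing like
this today):

    # hypothetical: keep the native encoded bytes, normalize on demand
    class LazyText:
        def __init__(self, data, coding):
            self._data = data          # raw bytes, native encoding
            self._coding = coding      # e.g. 'latin-1'
            self._unicode = None       # canonical form, built lazily
        def as_unicode(self):
            if self._unicode is None:  # first request: normalize
                self._unicode = unicode(self._data, self._coding)
            return self._unicode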

IOW, u'abc' + u'def' might capture 'abc' from source text compiled under
# -*- coding: latin-1 -*- and lazily keep an internal pointer to an 'abc'
string with .coding='latin-1'. Ditto for u'def', so when they are added you
get u'abcdef', but internally the addition concatenates two latin-1
representations of the unicode characters, producing the result without ever
changing encoding. As far as the unicode object interface is concerned,
nothing would change (except maybe some debugging/inspecting hooks to get at
internal details), but the representation could vary privately. A sketch of
the same-encoding shortcut follows below.
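
Continuing the hypothetical LazyText sketch above, the latin-1 + latin-1
case might go like this: when both operands share a coding, the raw bytes
are concatenated directly and the shared coding is kept; only mixed codings
force normalization:

    def add_text(a, b):
        # same native coding on both sides: no re-encoding needed
        if a._coding == b._coding:
            return LazyText(a._data + b._data, a._coding)
        # otherwise normalize both sides first, then re-encode
        u = a.as_unicode() + b.as_unicode()
        return LazyText(u.encode('utf-8'), 'utf-8')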

(BTW, all the standard encoding names could be interned, so is-comparisons
could be used to check them.)
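
A sketch of that using the builtin intern(); with every coding name passed
through intern() once, hot paths could compare by identity instead of by
value:

    LATIN1 = intern('latin-1')

    def is_latin1(coding):
        # valid only if all coding names pass through intern() first
        return coding is LATIN1

    print is_latin1(intern('latin-1'))   # True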

>
>> I assume you meant:
>> 
>>     Yes, all library functions that expect *text* strings should support
>>     Unicode objects.
>
>Correct.
>
>> having written Python's Unicode string type, I'm now thinking that
>> it might have been better to use a polymorphic "text" type with
>> either UTF-8 or encoded char or wchar buffers, and do dynamic
>> translation based on usage patterns.  I've been playing with this
>> idea in Pytte, but as usual, there's so much code, and so little
>> time...
This sounds very similar to what I have been trying to say.

>
>"Better" in what sense? Would it even be better if you had to preserve
>all the C-level API that we currently have?
>
Not sure what that entails.
Regards,
Bengt Richter



