Flexible string representation, unicode, typography, ...

Thu Aug 30 12:00:52 EDT 2012

On Thu, 30 Aug 2012 07:02:24 -0400, Roy Smith wrote:

> In article <503f0e45$0$9416$c3e8da3$76491128 at news.astraweb.com>,
>  Steven D'Aprano <steve+comp.lang.python at pearwood.info> wrote:
> 
>> The only thing which is innovative here is that instead of the Python
>> compiler declaring that "all strings will be stored in UCS-2", the
>> compiler chooses an implementation for each string as needed. So some
>> strings will be stored internally as UCS-4, some as UCS-2, and some as
>> ASCII (which is a standard, but not the Unicode consortium's standard).
> 
> Is the implementation smart enough to know that x == y is always False
> if x and y are using different internal representations?

But x == y is not necessarily False just because x and y have
different representations. There may be circumstances where two strings 
have different internal representations even though their content is the 
same, so it's an unsafe optimization to automatically treat them as 
unequal.
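
To make that concrete, here is a minimal sketch of a toy string type that
can store the same text at different widths. ToyStr, its CODECS table and
its code_points() method are hypothetical names invented for illustration;
this is not how CPython lays out strings, only a model of the idea that
equality has to look at content rather than at the storage width.

```python
class ToyStr:
    """Hypothetical model: text stored at a fixed width per character.

    Not CPython's real layout; for illustration only.
    """

    # Width in bytes -> codec used to emulate that storage width.
    # Width 2 only holds characters in the Basic Multilingual Plane.
    CODECS = {1: "latin-1", 2: "utf-16-le", 4: "utf-32-le"}

    def __init__(self, text, width):
        self.width = width
        self.data = text.encode(self.CODECS[width])

    def code_points(self):
        # Recover the abstract text from the raw storage.
        return self.data.decode(self.CODECS[self.width])

    def __eq__(self, other):
        if not isinstance(other, ToyStr):
            return NotImplemented
        # A width mismatch alone proves nothing, so compare the content.
        return self.code_points() == other.code_points()


narrow = ToyStr("spam", 1)   # 4 bytes of storage
wide = ToyStr("spam", 4)     # 16 bytes of storage

same_content = narrow == wide            # compares decoded text
same_storage = narrow.data == wide.data  # compares raw bytes
```

Here same_content is True while same_storage is False. An implementation
that returned False as soon as the widths differed would get this case
wrong, unless it could guarantee that every string is always stored in its
narrowest possible form.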

The closest existing equivalent here is the relationship between ints and 
longs in Python 2. 42 == 42L even though they have different internal 
representations and take up a different amount of space.
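
Python 3 has no separate long type, so 42L is a syntax error there. As a
rough analogue (my substitution, not the int/long case itself), int and
float show the same behaviour: two objects with different types and
different internal layouts that still compare equal.

```python
a = 42      # int
b = 42.0    # float: a different type with a different internal layout

same_value = a == b                # equality is decided by value
same_type = type(a) is type(b)     # the types differ
same_hash = hash(a) == hash(b)     # equal numbers must hash equal
```

Here same_value and same_hash are True while same_type is False, so the
comparison cannot be short-circuited on the type or layout alone.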

My expectation is that the initial implementation of PEP 393 will be 
relatively unoptimized, and over the next few releases it will get more 
efficient. That's usually the way these things go.

-- 
Steven