Flexible string representation, unicode, typography, ...

Thu Aug 30 16:44:32 EDT 2012

On 8/30/2012 12:00 PM, Steven D'Aprano wrote:
> On Thu, 30 Aug 2012 07:02:24 -0400, Roy Smith wrote:
>
>> In article <503f0e45$0$9416$c3e8da3$76491128 at news.astraweb.com>,
>>   Steven D'Aprano <steve+comp.lang.python at pearwood.info> wrote:
>>
>>> The only thing which is innovative here is that instead of the Python
>>> compiler declaring that "all strings will be stored in UCS-2", the
>>> compiler chooses an implementation for each string as needed. So some
>>> strings will be stored internally as UCS-4, some as UCS-2, and some as
>>> ASCII (which is a standard, but not the Unicode consortium's standard).
>>
>> Is the implementation smart enough to know that x == y is always False
>> if x and y are using different internal representations?

Yes, after checking lengths, and in same circumstances, x != y is True. From
http://hg.python.org/cpython/file/ab6ab44921b2/Objects/unicodeobject.c

PyObject *
PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
{
     int result;

     if (PyUnicode_Check(left) && PyUnicode_Check(right)) {
         PyObject *v;
         if (PyUnicode_READY(left) == -1 ||
             PyUnicode_READY(right) == -1)
             return NULL;
         if (PyUnicode_GET_LENGTH(left) != PyUnicode_GET_LENGTH(right) ||
             PyUnicode_KIND(left) != PyUnicode_KIND(right)) {
             if (op == Py_EQ) {
                 Py_INCREF(Py_False);
                 return Py_False;
             }
             if (op == Py_NE) {
                 Py_INCREF(Py_True);
                 return Py_True;
             }
         }
...
KIND is 1,2,4 bytes/char

'a in s' is also False if a chars are wider than s chars.

If s is all ascii, s.encode('ascii') or s.encode('utf-8') is a fast, 
constant time operation, as I showed earlier in this discussion. This is 
one thing that is much faster in 3.3.

Such things can be tested by timing with different lengths of strings, 
where the initial string creation is done in setup code rather than in 
the repeated operation code.

> But x and y are not necessarily always False just because they have
> different representations. There may be circumstances where two strings
> have different internal representations even though their content is the
> same, so it's an unsafe optimization to automatically treat them as
> unequal.

I am sure that str objects are always in canonical form once visible to 
Python code. Note that unready (non-canonical) objects are rejected by 
the rich comparison function.

> My expectation is that the initial implementation of PEP 393 will be
> relatively unoptimized,

The initial implementation was a year ago. At least three people have 
expended considerable effort improving it since, so that the slowdown 
mentioned in the PEP has mostly disappeared. The things that are still 
slower are somewhat balanced by things that are faster.

-- 
Terry Jan Reedy