Flexible string representation, unicode, typography, ...

Sun Sep 2 21:54:42 EDT 2012

On Sun, 02 Sep 2012 23:38:49 +0300, Serhiy Storchaka wrote:

> On 30.08.12 09:55, Steven D'Aprano wrote:
>> And Python's solution uses those: UCS-2, UCS-4, and UTF-8.
> 
> I see that this misconception widely spread.

I am not familiar enough with the C implementation to tell what Python 
3.3 actually does, and the PEP assumes a fair amount of familiarity with 
the CPython source. So I welcome corrections.

> In fact Python 3.3 uses four kinds of ready strings.
> 
> * ASCII. All codes <= U+007F.
> * UCS1. All codes <= U+00FF, at least one code > U+007F. 
> * UCS2. All codes <= U+FFFF, at least one code > U+00FF. 
> * UCS4. All codes <= U+0010FFFF, at least one code > U+FFFF.

Where UCS1 is equivalent to Latin-1, correct?

UCS2 is what Python 3.2 narrow builds uses for all strings, including 
codes > U+FFFF using surrogate pairs.

UCS4 is what Python 3.2 wide builds uses for all strings.

This means that Python 3.3 will no longer have surrogate pairs.

Am I right?

> Indexing is O(0) for any string.

I think you mean O(1) for constant-time lookups.

> Also the string can optionally cache UTF-8 and wchar_t* representation.

Right, that's the bit that wasn't clear -- the UTF-8 data is a cache, not 
the canonical representation.

-- 
Steven