Flexible string representation, unicode, typography, ...

Serhiy Storchaka storchaka at gmail.com
Sun Sep 2 16:38:49 EDT 2012


On 30.08.12 09:55, Steven D'Aprano wrote:
> And Python's solution uses those: UCS-2, UCS-4, and UTF-8.

I see that this misconception is widespread. In fact, Python 3.3 uses 
four kinds of ready strings.

* ASCII. All codes <= U+007F.
* UCS1. All codes <= U+00FF, at least one code > U+007F.
* UCS2. All codes <= U+FFFF, at least one code > U+00FF.
* UCS4. All codes <= U+10FFFF, at least one code > U+FFFF.
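
As a rough illustration of the list above, the kind can be derived from the
largest code point in the string. The helper name string_kind is made up
for this sketch and is not part of any Python API; treating the empty
string as ASCII is also an assumption of the sketch.

```python
def string_kind(s: str) -> str:
    """Name the storage kind from the list above that fits s.

    Hypothetical helper for illustration only; CPython makes this
    choice internally in C when the string is created.
    """
    # Largest code point in the string (0 for the empty string).
    top = max(map(ord, s), default=0)
    if top <= 0x7F:
        return "ASCII"
    if top <= 0xFF:
        return "UCS1"
    if top <= 0xFFFF:
        return "UCS2"
    return "UCS4"


if __name__ == "__main__":
    for sample in ("abc", "caf\u00e9", "\u20ac", "\U0001F600"):
        print(ascii(sample), string_kind(sample))
```

Note that a single wide character is enough to move the whole string to
the wider kind.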

Indexing is O(1) for any string.
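
Constant-time indexing follows from every character in a given string
occupying the same number of bytes. One way to observe this from Python,
assuming CPython 3.3 or later (other implementations may report different
sizes), is to compare sys.getsizeof for two lengths of the same repeated
character. The helper name bytes_per_char is made up for this sketch.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate the storage used per character for strings made of ch.

    Hypothetical helper: the fixed object header cancels out when the
    sizes of two strings of the same kind are subtracted.
    """
    short = sys.getsizeof(ch * n)
    long = sys.getsizeof(ch * (2 * n))
    return (long - short) // n


if __name__ == "__main__":
    for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
        print(ascii(ch), bytes_per_char(ch))
```

On CPython 3.3 or later this should report 1, 1, 2 and 4 bytes for the
ASCII, UCS1, UCS2 and UCS4 samples respectively.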

Also, the string can optionally cache its UTF-8 and wchar_t* representations.




