[Python-Dev] len(chr(i)) = 2?

M.-A. Lemburg mal at egenix.com
Thu Nov 25 10:57:17 CET 2010


Alexander Belopolsky wrote:
> On Wed, Nov 24, 2010 at 9:17 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> ..
>>  > I note that an opinion has been raised on this thread that
>>  > if we want compressed internal representation for strings, we should
>>  > use UTF-8.  I tend to agree, but UTF-8 has been repeatedly rejected as
>>  > too hard to implement.  What makes UTF-16 easier than UTF-8?  Only the
>>  > fact that you can ignore bugs longer, in my view.
>>
>> That's mostly true.  My guess is that we can probably ignore those
>> bugs for as long as it takes someone to write the higher-level
>> libraries that James suggests and MAL has actually proposed and
>> started a PEP for.
>>
> 
> As far as I can tell, that PEP generated a grand total of one comment in
> nine years.  This may or may not be indicative of how far away we are
> from seeing it implemented.  :-)

At the time it was too early for people to start thinking about
these issues. Actual use of Unicode really only started a few years
ago.

Since I didn't have a need for such an indexing module myself
(and didn't have much time to work on it anyway), I punted on the
idea.

If someone else wants to pick up the idea, I'd gladly help out with
the details.

> As far as the UTF-8 vs. UCS-2/4 debate goes, I have an idea that may be
> even more far-fetched.  Once upon a time, Python Unicode strings supported
> the buffer protocol and would lazily fill an internal buffer with bytes in
> the default encoding.  In 3.x the default encoding has been fixed as
> UTF-8 and buffer protocol support was removed from strings, but the
> internal buffer caching the (now UTF-8) encoded representation remained.
> Maybe we can now implement the defenc logic in reverse.  Recall that
> strings are stored as UCS-2/4 sequences, but once a buffer is requested
> in 2.x Python code, or a char* is obtained via
> _PyUnicode_AsStringAndSize() at the C level in 3.x, an internal buffer
> is filled with UTF-8 bytes and defenc is set to point to that buffer.

The original idea was for that buffer to go away once we moved
to Unicode for strings. Reality has shown that we still need
to keep the buffer, though, since the UTF-8 representation
of Unicode objects is used a lot.
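
To make the defenc idea concrete, here's a toy sketch in pure
Python (the real thing lives in C inside the unicode object, of
course, so take this purely as an analogy): the canonical UCS-2/4
text is kept as-is, and the UTF-8 buffer is only filled on first
request and then reused.

    class CachedUTF8String:
        # Toy analogy of the defenc cache, not CPython's implementation.
        def __init__(self, text):
            self._text = text    # canonical UCS-2/4 representation
            self._utf8 = None    # lazily filled cache, like defenc

        def as_utf8(self):
            # Encode once, on first request, then hand out the cache.
            if self._utf8 is None:
                self._utf8 = self._text.encode("utf-8")
            return self._utf8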

>   So the idea is for strings to store their data as a UTF-8 buffer
> pointed to by defenc upon construction.  If an application uses string
> indexing, UTF-8-only strings will lazily fill their UCS-2/4 buffer.
> Proper, Unicode-aware algorithms such as grapheme, word or line
> iteration, or simple operations such as concatenation, search or
> substitution, would operate directly on defenc buffers.  Presumably,
> over time fewer and fewer applications would use code unit indexing
> that requires the UCS-2/4 buffer, and eventually Python strings could
> stop supporting indexing altogether, just like they stopped supporting
> the buffer protocol in 3.x.

I don't follow you: how would UTF-8, which has even more issues
with variable-length representation of code points, make anything
easier compared to UTF-16, which has far fewer such issues, and
then only for non-BMP code points?
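
A quick illustration of the difference (any 3.x will do):

    # UTF-8 uses one to four bytes per code point, while UTF-16 uses
    # two bytes for everything in the BMP and four (a surrogate pair)
    # only for the non-BMP range.
    for cp in (0x41, 0xE9, 0x4E2D, 0x10400):
        ch = chr(cp)
        print("U+%06X  utf-8: %d bytes  utf-16: %d bytes" % (
            cp,
            len(ch.encode("utf-8")),
            len(ch.encode("utf-16-le")),   # -le to leave out the BOM
        ))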

Please note that we can only provide one way of string indexing
in Python using the standard s[1] notation, and since we want
that operation to be fast, i.e. no worse than O(1), using the
code units as items is the only reasonable way to implement it.
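
This is also what the subject line of this thread boils down to:
on a narrow (UTF-16) build, indexing is O(1) precisely because
s[i] returns a code unit, even when that code unit is only half
of a surrogate pair:

    s = chr(0x10400)         # DESERET CAPITAL LETTER LONG I, non-BMP
    print(len(s))            # 2 on a narrow build, 1 on a wide build
    print(hex(ord(s[0])))    # 0xd801 (high surrogate) on a narrow
                             # build, 0x10400 on a wide build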

With an indexing module, we could then let applications work
based on higher level indexing schemes such as complete code
points (skipping surrogates), combining code point sequences,
graphemes (ignoring e.g. most control code points and zero-width
code points), words (with some customization as to where to
break words, which will likely have to be language dependent),
lines (which can be complicated for scripts that use columns
instead ;-)), paragraphs, etc.
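
Just to sketch what one of these higher level schemes could look
like, here is a minimal combining-aware iterator (the name
iter_graphemes and the simplistic logic are mine; real grapheme
segmentation per UAX #29 needs quite a bit more than this):

    import unicodedata

    def iter_graphemes(s):
        # Yield each base character together with the combining
        # marks that immediately follow it.
        cluster = ""
        for ch in s:
            if cluster and unicodedata.combining(ch) == 0:
                yield cluster
                cluster = ""
            cluster += ch
        if cluster:
            yield cluster

    # 'e' + COMBINING ACUTE ACCENT groups into a single item:
    print(list(iter_graphemes("e\u0301tude")))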

It would also help to add transparent indexing for right-to-left
scripts and text that uses both left-to-right and right-to-left
text (BIDI).

However, in order for these indexing methods to actually work,
they will need to return references to the code units, so we cannot
just drop that access method.

* Back on the surrogates topic:

In any case, I think this discussion is losing its grip on reality.

By far, most strings you find in actual applications don't use
surrogates at all, so the problem is being exaggerated.

If you need to be careful about surrogates for some reason, I think
a single new method .hassurrogates() on string objects would
go a long way toward making detection and special-casing
a lot easier.
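
In pure Python, the test itself is trivial (a C implementation on
the string object could of course scan much faster and cache the
result; this is just a sketch of what the proposed method would
check):

    def hassurrogates(s):
        # True if s contains any code unit in the surrogate range.
        return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)

    print(hassurrogates("hello"))      # False
    print(hassurrogates("\ud800"))     # True -- lone high surrogate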

If adding support for surrogates doesn't make sense (e.g. in the
case of the formatting methods), then we simply punt on that and
leave such handling to other tools.

* Regarding preventing surrogates from entering the Python
runtime:

It is by far more important to maintain round-trip safety for
Unicode data than to get every bit of code working correctly
with surrogates (often, there won't be a single correct way).

With a new method for fast detection of surrogates, we could
protect code which obviously doesn't work with surrogates and
then consider each case individually by either adding special
cases as necessary or punting on the support.
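
For illustration, this is the kind of round-trip that has to keep
working; since 3.1, the 'surrogateescape' error handler (PEP 383)
deliberately smuggles undecodable bytes into strings as lone
surrogates:

    raw = b"caf\xe9"    # Latin-1 bytes, not valid UTF-8
    s = raw.decode("utf-8", "surrogateescape")
    print(repr(s))      # 'caf\udce9' -- contains a lone surrogate
    assert s.encode("utf-8", "surrogateescape") == raw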

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Nov 25 2010)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/

