[Python-Dev] PEP 393 Summer of Code Project

Stephen J. Turnbull stephen at xemacs.org
Thu Sep 1 20:28:06 CEST 2011


Glenn Linderman writes:

 > Windows 7 64-bit on one of my computers happily crashes several
 > times a day when it detects inconsistent internal state... under
 > the theory, I guess, that losing work is better than saving bad
 > work.  You sound the opposite.

Definitely.  Windows apps habitually overwrite existing work; saving
when inconsistent would be a bad idea.  The apps I work on dump their
unsaved buffers to new files, and give you a chance to look at them
before instating them as the current version when you restart.

 > Except, I'm not sure how PEP 393 space optimization fits with the other 
 > operations.  It may even be that an application-wide complex-grapheme 
 > cache would save significant space, although if it uses high-bits in a 
 > string representation to reference the cache, PEP 393 would jump 
 > immediately to something > 16 bits per grapheme... but likely would 
 > anyway, if complex-graphemes are in the data stream.

The only language I know of that uses thousands of complex graphemes
is Korean ... and the precomposed forms are already in the BMP.  I
don't know how many accented forms you're likely to see in Vietnamese,
but I suspect it's less than 6400 (the number of characters in private
space in the BMP).  So for most applications, I believe that mapping
both non-BMP code points and grapheme clusters into that private space
should be feasible.  The only potential counterexample I can think of
is display of Arabic, which I have heard has thousands of glyphs in
good fonts because of the various ways ligatures form in that script.
However, AFAIK no apps encode these as characters; I'm just admitting
that it *might* be useful.

This will require some care in registering such characters and
clusters because input text may already use private space according to
some convention, which would need to be respected.  Still, 6400
characters is a lot, even for the Japanese (IIRC the combined
repertoire of "corporate characters" that for some reason never made
it into the JIS sets is about 600, but almost all of them are already
in the BMP).  I believe the total number of Japanese emoticons is
about 200, but I doubt that any given text is likely to use more than
a few.  So I think there's plenty of space there.

This has a few advantages: (1) since these are real characters, all
Unicode algorithms will apply as long as the appropriate properties
are applied to the character in the database, and (2) it works with a
narrow code unit (specifically, UCS-2, but it could also be used with
UTF-8).  If you really need more than 6400 grapheme clusters, promote
to UTF-32, and get two more whole planes full (about 130,000 code
points).
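
To make the arithmetic concrete, the private use ranges are fixed in
the standard, so a couple of lines of Python reproduce those numbers:

    # Sizes of the Unicode Private Use Areas.
    BMP_PUA = range(0xE000, 0xF8FF + 1)            # the 6400 BMP code points
    PLANE_15_PUA = range(0xF0000, 0xFFFFD + 1)     # Supplementary PUA-A
    PLANE_16_PUA = range(0x100000, 0x10FFFD + 1)   # Supplementary PUA-B

    print(len(BMP_PUA))                            # 6400
    print(len(PLANE_15_PUA) + len(PLANE_16_PUA))   # 131068, i.e. "about 130,000"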

 > I didn't attribute any efficiency to flagging lone surrogates (BI-5).  
 > Since Windows uses a non-validated UCS-2 or UTF-16 character type, any 
 > Python program that obtains data from Windows APIs may be confronted 
 > with lone surrogates or inappropriate combining characters at any
 > time.

I don't think so.  AFAIK all that data must pass through a codec,
which will validate it unless you specifically tell it not to.
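
For instance, with the default 'strict' error handler the UTF-16 codec
refuses a lone surrogate outright (a quick illustration; I haven't run
this on Windows itself):

    # A lone high surrogate (U+D800) followed by 'A', little-endian UTF-16.
    raw = b'\x00\xd8A\x00'
    try:
        raw.decode('utf-16-le')
    except UnicodeDecodeError as exc:
        print(exc)          # the default 'strict' handler rejects the data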

 > Round-tripping that data seems useful,

The standard doesn't forbid that.  (ISTR it did so in the past, but
what is required in 6.0 is a specific algorithm for identifying
well-formed portions of the text, basically "if you're currently in an
invalid region, read individual code units and attempt to assemble a
valid sequence -- as soon as you do, that is a valid code point, and
you switch into valid state and return to the normal algorithm".)

Specifically, since surrogates are not characters, leaving them in the
data does not constitute "interpreting them as characters."  I don't
recall if any of the error handlers allow this, though.
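
(For what it's worth, in Python 3 the 'surrogatepass' handler does let
lone surrogates through the UTF-8 codec if you ask for it explicitly,
so that data can round-trip:

    s = '\ud800'                                # a lone high surrogate, not a character
    data = s.encode('utf-8', 'surrogatepass')   # b'\xed\xa0\x80'
    assert data.decode('utf-8', 'surrogatepass') == s

That's opt-in behavior, though; the default 'strict' handler still refuses.)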

 > However, returning modified forms of it to Windows as UCS-2 or
 > UTF-16 data may still cause other applications to later
 > accidentally combine the characters, if the modifications
 > juxtaposed things to make them look reasonably, even if
 > accidentally.

In CPython AFAIK (I don't do Windows) this can only happen if you use
a non-default error setting in the output codec.

 > After writing all those ideas down, I actually preferred some of
 > the others, that achieved O(1) real grapheme indexing, rather than
 > caching character properties.

If you need O(1) grapheme indexing, use of private space seems a
winner to me.  It's just defining private precombined characters, and
they won't bother any Unicode application, even if they leak out.
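
A rough sketch of what I mean (the registry, the PUA_START constant,
and the base-plus-combining-marks rule below are my own
simplifications, not anything from PEP 393 or UAX #29):

    import unicodedata

    PUA_START = 0xE000       # first BMP private-use code point
    _registry = {}           # cluster (str) -> private-use character
    _reverse = {}            # private-use character -> original cluster

    def _register(cluster):
        if cluster not in _registry:
            pua = chr(PUA_START + len(_registry))
            _registry[cluster] = pua
            _reverse[pua] = cluster
        return _registry[cluster]

    def to_indexable(text):
        # Replace each base+combining-marks run with one private-use character.
        out, i = [], 0
        while i < len(text):
            j = i + 1
            while j < len(text) and unicodedata.combining(text[j]):
                j += 1
            cluster = text[i:j]
            out.append(cluster if len(cluster) == 1 else _register(cluster))
            i = j
        return ''.join(out)

    def from_indexable(text):
        # Expand registered private-use characters back to their clusters.
        return ''.join(_reverse.get(ch, ch) for ch in text)

    s = to_indexable('e\u0301galite\u0301')   # "egalite" with combining accents
    print(len(s))                             # 7: one code point per grapheme
    print(from_indexable(s[0]))               # the first grapheme, fetched in O(1)

As long as the registered characters stay consistent within the
application, s[i] is a real character and the usual string operations
keep working on it.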

 > > What are the costs to applications that don't want the cache?
 > > How is the bit-cache affected by PEP 393?
 > 
 > If it is a separate type from str, then it costs nothing except the
 > extra code space to implement the cache for those applications that
 > do want it... most of which wouldn't be loaded for applications
 > that don't, if done as a module or C extension.

I'm talking about the bit-cache (which all of your BI-N referred to,
at least indirectly).  Many applications will want to work with fully
composed characters, whether they're represented in a single code
point or not.  But they may not care about any of the bit-cache ideas.

 > OK... ignore the bit-cache idea (BI-1), and reread the others without 
 > having your mind clogged with that one, and see if any of them make 
 > sense to you then.  But you may be too biased by the "minor" needs of 
 > keeping the internal representation similar to the stream representation 
 > to see any value in them.

No, I'm biased by the fact that I already have good ways to do them without
leaving the set of representations provided by Unicode (often ways
which provide additional advantages), and by the fact that I myself
don't know any use cases for the bit-cache yet.

 > I rather like BI-2, since it allows O(1) indexing of graphemes.

I do too (without suggesting a non-standard representation, i.e., by
using private space), but I'm sure that wheel has been reinvented quite
frequently.  It's a very common trick in text processing, although I
don't know of other applications where it's specifically used to turn
data that "fails to be an array just a little bit" into a true array
(although I suppose you could view fixed-width EUC encodings that
way).
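
For reference, the flavor of BI-2 I have in mind is just an offset
table over the unchanged text (again using a simplified
base-plus-combining-marks rule rather than the real UAX #29
boundaries):

    import unicodedata

    def grapheme_offsets(text):
        # Start offset of each grapheme, plus a sentinel at the end.
        offsets, i = [], 0
        while i < len(text):
            offsets.append(i)
            i += 1
            while i < len(text) and unicodedata.combining(text[i]):
                i += 1
        offsets.append(len(text))
        return offsets

    text = 'e\u0301galite\u0301'
    offs = grapheme_offsets(text)
    print(len(offs) - 1)              # 7 graphemes
    print(text[offs[6]:offs[7]])      # the last grapheme, sliced in O(1)

The table costs one integer per grapheme, but the string itself stays
in whatever representation PEP 393 picks.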


