[Python-3000] How will unicode get used?

Josiah Carlson jcarlson at uci.edu
Wed Sep 20 23:59:22 CEST 2006


"Adam Olsen" <rhamph at gmail.com> wrote:
> 
> On 9/20/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> >
> > "Adam Olsen" <rhamph at gmail.com> wrote:
> > > Before we can decide on the internal representation of our unicode
> > > objects, we need to decide on their external interface.  My thoughts
> > > so far:
> >
> > I believe the only options up for actual decision is what the internal
> > representation of a unicode object will be.  Utf-8 that is never changed?
> > Utf-8 that is converted to ucs-2/4 on certain kinds of accesses?
> > Latin-1/ucs-2/ucs-4 depending on code point content?  Always ucs-2/4,
> > depending on compiler switch?
> 
> Just a minor nit.  I doubt we could accept UCS-2, we'd want UTF-16
> instead, with all the variable-width goodness that brings in.

If we are opting for a *single* internal representation, then UTF-16 or
UTF-32 are really the only options.

> > > * Most transformation and testing methods (.lower(), .islower(), etc)
> > > can be copied directly from 2.x.  They require no special
> > > implementation to perform reasonably.
> >
> > A decoding variant of these would be required if the underlying
> > representation of a particular string is not latin-1, ucs-2, or ucs-4.
> 
> That makes no sense.  They can operate on any encoding we design them
> to.  The cost is always O(n) with the length of the string.

I was thinking .startswith() and .endswith(), but assuming *some*
canonical representation (UTF-16, UTF-32, etc.) this is trivial to
implement.  I take back my concerns on this particular point.


> > Whether or not we choose to go with a varying internal representation
> > (the latin-1/ucs-2/ucs-4 variant I have been suggesting),
> >
> >
> > > * Indexing and slicing is the big issue.  Do we need constant-time
> > > integer slicing?  .find() could be changed to return a token that
> > > could be used as a constant-time offset.  Incrementing the token would
> > > have linear costs, but that's no big deal if the offsets are always
> > > small.
> >
> > If by "constant-time integer slicing" you mean "find the start and end
> > memory offsets of a slice in constant time", I would say yes.
> >
> > Generally, I think tokens (in unicode strings) are a waste of time and
> > implementation.  Giving each string a fixed-width per character allows
> > methods on those unicode strings to be far simpler in implementation.
> 
> However, I can imagine there might be use cases, such as the .find()
> output on one string being used to slice a different string, which
> tokens wouldn't support.  I haven't been able to dream up any sane
> examples, which is why I asked about it here.  I want to see specific
> examples showing that tokens won't work.

    p = s[6:-6]

Or even in actual code I use today:

    p = s.lstrip()
    lil = len(s) - len(p)
    si = s[:lil]
    lil += si.count('\t')*(self.GetTabWidth()-1)
    
    #s is the original line
    #p is the line without leading indentation
    #si is the line indentation characters
    #lil is the indentation of the line in columns

If I can't slice based on character index, then we end up in a
situation similar to the one the wxPython StyledTextCtrl runs into right
now: its content is encoded as utf-8 internally, so users have to use
the fairly annoying PositionBefore(pos) and PositionAfter(pos) methods
to discover where characters start/end.  While it is possible to handle
everything this way, it is *damn annoying*, and some users have gone so
far as to say that it *doesn't work* for Europeans.

While I won't make the claim that it *doesn't work*, it is a pain in the
ass.
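
To make the annoyance concrete, here is a rough sketch (not my actual
code) of walking a StyledTextCtrl buffer one character at a time; stc is
assumed to be an existing wx.stc.StyledTextCtrl instance:

    # Sketch: the StyledTextCtrl buffer is utf-8, so positions are byte
    # offsets and character boundaries have to be asked for explicitly.
    def iter_chars(stc):
        pos = 0
        end = stc.GetLength()
        while pos < end:
            nxt = stc.PositionAfter(pos)      # next character boundary
            yield stc.GetTextRange(pos, nxt)  # that character's text
            pos = nxt

With character-indexed strings, the same loop is just "for ch in s".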


> Using only utf-8 would be simpler than three distinct representations.
>  And if memory usage is an issue (which it seems to be, albeit in a
> vague way), we could make a custom encoding that's even simpler and
> more space efficient than utf-8.

One of the reasons I've been pushing for the 3 representations is that
the scheme is (arguably) optimal for any particular string.
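
As a rough sketch of what I mean (the names are illustrative only, not
a proposed API), picking the representation only requires looking at
the widest code point in the string:

    # Illustrative sketch: choose the narrowest fixed width that can
    # hold every code point in the string.
    def pick_width(codepoints):
        widest = max(codepoints) if codepoints else 0
        if widest < 0x100:
            return 1    # latin-1: one byte per character
        elif widest < 0x10000:
            return 2    # ucs-2: two bytes per character
        else:
            return 4    # ucs-4: four bytes per character

A pure-ASCII or Latin-1 string pays one byte per character, and only
strings that actually contain astral characters pay for four.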


> > > * Grapheme clusters, words, lines, other groupings, do we need/want
> > > ways to slice based on them too?
> >
> > No.
> 
> Can you explain your reasoning?

We can already split based on words, lines, etc., using .split() and
re.split().  Building additional functionality for text.word[4] seems to
be a waste of time.
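
For example (just an illustration of the existing facilities):

    import re

    text = "The quick  brown fox"
    words = re.split(r'\s+', text)
    words[3]    # 'fox' -- the effect of a hypothetical text.word[3]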


> > > * Cheap slicing and concatenation (between O(1) and O(log(n))), do we
> > > want to support them?  Now would be the time.
> >
> > This would imply a tree-based string, which Guido has specifically
> > stated would not happen.  Never mind that it would be a beast to
> > implement and maintain or that it would exclude the possibility for
> > offering the single-segment buffer interface, without reprocessing.
> 
> The only reference I found was this:
> http://mail.python.org/pipermail/python-3000/2006-August/003334.html
> 
> I interpret that as him being very sceptical, not an outright refusal.
> 
> Allowing external code to operate on a python string in-place seems
> tenuous at best.  Even with three types (Latin-1, UCS-2, UCS-4) you
> would need to automatically copy and convert if the wrong type is
> given.

The only benefits that utf-8 gains over any other internal
representation are that it is an arguably minimal-sized representation
and that it is commonly used among other C libraries.

The benefit gained by using the three internal representations is
primarily simplicity: when manipulating any one of the three
representations, you know that the value at offset X is the code point
of character X in the string.
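
A sketch of the difference (purely illustrative; data stands for the
raw bytes of a utf-8 string):

    # With a fixed width per character, the memory offset of character
    # i is simple arithmetic:
    def byte_offset(index, char_width):
        return index * char_width            # O(1)

    # With utf-8, the same question means scanning past every
    # variable-width character in front of it:
    def utf8_byte_offset(data, index):
        i = 0
        for _ in range(index):
            i += 1
            while i < len(data) and (data[i] & 0xC0) == 0x80:
                i += 1                       # skip continuation bytes
        return i                             # O(index)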

Further, with a slight change in how the single-segment buffer interface
is defined (returning the width of the character), C extensions that want
to deal with unicode strings in *native* format (due to concerns about
speed) could do so without having to worry about reencoding,
variable-width characters, etc.

You can get this same behavior by always using UTF-32 (aka UCS-4), but
at least 1/4 of the underlying data is always going to be nulls (code
points are limited to 0x0010ffff), and for many people (in Europe, the
US, and anywhere else with code points < 65536), 1/2 to 3/4 of the
underlying data is going to be nulls.
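
The waste is easy to demonstrate (illustrative, using the utf-32-le
codec):

    data = "hello".encode("utf-32-le")
    len(data)          # 20 bytes for 5 characters
    data.count(0)      # 15 of those bytes are zero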

While I would imagine that people could deal with UTF-16 as an
underlying representation (from a data waste perspective), the potential
for varying-width characters in such an encoding is a pain in the ass
(like it is for UTF-8).
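
The problem case is anything outside the BMP, which needs a surrogate
pair.  For example (illustrative):

    # U+1D11E (MUSICAL SYMBOL G CLEF) is outside the BMP, so it takes
    # two 16-bit code units in UTF-16.
    s = "a\U0001D11Eb"
    len(s)                              # 3 characters
    len(s.encode("utf-16-le")) // 2     # 4 UTF-16 code units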

Regardless of our choice, *some platform* is going to be angry.  Why?
GTK takes utf-8 encoded strings, and Windows takes utf-16 (I don't know
what Qt or the Linux system calls take).  Whatever the underlying
representation, *someone* is going to have to recode when dealing with
GUI or OS-level operations.


 - Josiah


