Flexible string representation, unicode, typography, ...

Wed Aug 29 04:05:10 EDT 2012

On Tue, 28 Aug 2012 22:15:31 -0600, Ian Kelly wrote:

> On Tue, Aug 28, 2012 at 8:42 PM, rusi <rustompmody at gmail.com> wrote:

>> How difficult would it be to give the choice of string engine as a
>> command-line flag?
>> This would avoid the nuisance of having two binaries -- narrow and
>> wide.
> 
> Quite difficult.  Even if we avoid having two or three separate
> binaries, we would still have separate binary representations of the
> string structs.  It makes the maintainability of the software go down
> instead of up.

In fairness, there are already multiple binary representations of strings 
in Python 3.3:

- ASCII-only strings use a 1-byte format (PyASCIIObject);

- Compact Unicode objects (PyCompactUnicodeObject), which, if I'm reading
  correctly, appear to use a variable-width UTF-8 format, but are only
  used when the string length and maximum character are known ahead of
  time;

- Legacy string objects (PyUnicodeObject), which are not compact, and
  which may use as their internal format:

    * 1-byte characters for Latin1-compatible strings;

    * 2-byte UCS-2 characters for strings in the Basic Multilingual Plane;

    * 4-byte UCS-4 characters for strings with at least one non-BMP
      character.

http://www.python.org/dev/peps/pep-0393/#specification
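
For anyone who wants to check the per-character widths from Python itself
rather than from the PEP, here is a rough sketch. It assumes CPython 3.3 or
later; sys.getsizeof reports implementation details, so other interpreters
may give different numbers. The helper name bytes_per_char is my own, not
part of any API.

```python
import sys

def bytes_per_char(sample, n=1000):
    """Estimate storage per character by comparing two string lengths.

    Subtracting the sizes cancels out the fixed object header, leaving
    only the cost of the extra n characters.
    """
    shorter = sample * n
    longer = sample * (2 * n)
    return (sys.getsizeof(longer) - sys.getsizeof(shorter)) / n

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),        # e with acute accent
    ("BMP", "\u20ac"),          # euro sign
    ("non-BMP", "\U0001F600"),  # an emoji outside the BMP
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On a PEP 393 build I would expect this to print 1.0, 1.0, 2.0 and 4.0
respectively.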

By my calculations, that makes *five* different internal formats for 
strings, at least two of which are capable of representing all Unicode 
characters. I don't think it would add that much additional complexity to 
have a runtime option --always-wide-strings to always use the UCS-4 
format. For, you know, crazy people with more memory than sense.

But I don't think there's any point in exposing further runtime options 
to choose the string representation:

- neither the ASCII nor Latin1 representations can store arbitrary
  Unicode chars, so they're out;

- the UTF-8 format is only used under restrictive circumstances, and so
  is (probably?) unsuitable for all strings;

- the UCS-2 format can store them, by using surrogate pairs, but that's
  troublesome to get right; some might even say buggy.
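
To make the surrogate pair point concrete, here is a small sketch that
splits a string into its 16-bit code units, which is roughly what a narrow
build stores. The helper utf16_code_units is purely illustrative, and the
final line assumes Python 3.3 or later.

```python
def utf16_code_units(s):
    """Return the 16-bit code units of s when encoded as UTF-16."""
    data = s.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

units = utf16_code_units(clef)
print([hex(u) for u in units])  # ['0xd834', '0xdd1e'], a surrogate pair

# Two code units for one character: code that indexes or slices by code
# unit can cut the pair in half. Under Python 3.3+ the string itself has
# length 1.
print(len(clef), len(units))
```

Any code that treats those two code units as two characters will get
lengths, indexes and slices wrong, which is where the "buggy" reputation
comes from.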

>> And it would give the python programmer a choice of efficiency
>> profiles.
> 
> So instead of having just one test for my Unicode-handling code, I'll
> now have to run that same test *three times* -- once for each possible
> string engine option.  Choice isn't always a good thing.

There is that too.

-- 
Steven