Flexible string representation, unicode, typography, ...
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Wed Aug 29 04:05:10 EDT 2012
On Tue, 28 Aug 2012 22:15:31 -0600, Ian Kelly wrote:
> On Tue, Aug 28, 2012 at 8:42 PM, rusi <rustompmody at gmail.com> wrote:
>> How difficult would it be to give the choice of string engine as a
>> command-line flag?
>> This would avoid the nuisance of having two binaries -- narrow and
>> wide.
>
> Quite difficult. Even if we avoid having two or three separate
> binaries, we would still have separate binary representations of the
> string structs. It makes the maintainability of the software go down
> instead of up.
In fairness, there are already multiple binary representations of strings
in Python 3.3:
- ASCII-only strings use a 1-byte format (PyASCIIObject);
- Compact Unicode objects (PyCompactUnicodeObject), which, if I'm reading
correctly, appear to use a non-fixed width UTF-8 format, but are only
used when the string length and maximum character are known ahead of
time;
- Legacy string objects (PyUnicodeObject), which are not compact, and
which may use as their internal format:
* 1-byte characters for Latin1-compatible strings;
* 2-byte UCS-2 characters for strings in the Basic Multilingual Plane;
* 4-byte UCS-4 characters for strings with at least one non-BMP
character.
http://www.python.org/dev/peps/pep-0393/#specification
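For anyone who wants to poke at this from the interpreter, here is a rough
sketch that estimates how many bytes each character costs in storage. It
assumes CPython 3.3 or later, where sys.getsizeof reports the real object
size; the helper name bytes_per_char and the sample characters are my own
choices for illustration, not anything taken from the PEP.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing two string sizes."""
    # Both strings have the same kind of header, so subtracting their
    # sizes leaves only the cost of the extra n characters.
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

# One sample character per range discussed above (hypothetical picks).
samples = {
    "ASCII": "a",
    "Latin-1": "\xe9",
    "BMP": "\u20ac",
    "non-BMP": "\U0001F600",
}

widths = {name: bytes_per_char(ch) for name, ch in samples.items()}
for name, width in widths.items():
    print(name, width)
```

This only measures the character data, not the struct headers, so it says
nothing about which of the struct layouts a given string ended up in.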
By my calculations, that makes *five* different internal formats for
strings, at least two of which are capable of representing all Unicode
characters. I don't think it would add that much additional complexity to
have a runtime option --always-wide-strings to always use the UCS-4
format. For, you know, crazy people with more memory than sense.
But I don't think there's any point in exposing further runtime options
to choose the string representation:
- neither the ASCII nor Latin1 representations can store arbitrary
Unicode chars, so they're out;
- the UTF-8 format is only used under restrictive circumstances, and so
is (probably?) unsuitable for all strings;
- the UCS-2 format can, by using surrogate pairs, but that's troublesome
to get right; some might even say buggy.
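To make the surrogate pair point concrete, here is a small sketch of how a
single non-BMP character turns into two 16-bit code units. The function
names utf16_code_units and to_surrogates are made up for this example; the
arithmetic is the standard UTF-16 rule, and the length check assumes
Python 3.3 or later (or an older wide build).

```python
def utf16_code_units(s):
    """Return the 16-bit code units of s when encoded as UTF-16."""
    data = s.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]

def to_surrogates(cp):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    # Top 10 bits go in the high surrogate, bottom 10 in the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

ch = "\U0001F600"
units = utf16_code_units(ch)
print(len(ch), [hex(u) for u in units])
```

Code that indexes or slices by 16-bit unit can land between the two halves
of the pair, which is where the trouble tends to start.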
>> And it would give the python programmer a choice of efficiency
>> profiles.
>
> So instead of having just one test for my Unicode-handling code, I'll
> now have to run that same test *three times* -- once for each possible
> string engine option. Choice isn't always a good thing.
There is that too.
--
Steven