New internal string format in 3.3

Michael Torrie torriem at gmail.com
Mon Aug 20 01:38:58 EDT 2012


On 08/19/2012 11:51 AM, wxjmfauth at gmail.com wrote:
> Five minutes after a closed my interactive interpreters windows,
> the day I tested this stuff. I though:
> "Too bad I did not noted the extremely bad cases I found, I'm pretty
> sure, this problem will arrive on the table".

Reading through this thread (which is entertaining), I am reminded of
the old saying, "premature optimization is the root of all evil." This
"problem" that you have discovered, if fixed the way you propose
(always using a 4-byte UCS-4 representation internally), would be just
such a premature optimization.  It would come at a high cost with very
little real-world impact.
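
To make the memory side of that trade-off concrete, here is a rough
sketch that measures how many bytes each character costs for strings
drawn from different ranges.  It assumes CPython 3.3 or later; the
helper name bytes_per_char is made up for illustration, and
sys.getsizeof reports implementation-specific numbers.

```python
import sys

def bytes_per_char(ch):
    """Estimate per-character storage by comparing two string lengths.

    Subtracting the sizes cancels out the fixed object header, leaving
    only the cost of the 1000 extra characters.
    """
    small = sys.getsizeof(ch * 100)
    large = sys.getsizeof(ch * 1100)
    return (large - small) // 1000

# One sample character per range: ASCII, BMP, and outside the BMP.
samples = {
    "ascii": "a",
    "bmp": "\u20ac",          # EURO SIGN
    "astral": "\U0001F600",   # a code point above U+FFFF
}

widths = {name: bytes_per_char(ch) for name, ch in samples.items()}

for name, width in widths.items():
    print(name, width)
```

On an interpreter that always stored 4 bytes per character, all three
rows would show the widest figure, including the pure ASCII case.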

As others have made abundantly clear, the overhead of changing internal
string representations is a cost that's only manifest during the
creation of the immutable string object.  If your code is doing a lot of
operations on immutable strings, which by definition creates new
immutable string objects, then the real speed problem is in your
algorithm.  If you are working on a string as if it were a buffer, doing
many searches, replaces, etc., then you need to work on an object
designed for I/O, such as io.StringIO.  If implemented half correctly, I
imagine that StringIO internally uses the widest possible character
representation in the buffer.  I could be wrong here.
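
As a minimal sketch of the buffer approach, the snippet below
accumulates mixed-width text in an io.StringIO and creates the final
immutable string only once.  The function names and sample data are
made up for illustration, and "".join() is shown alongside as the other
common way to avoid repeated concatenation; no claim is made here about
which is faster or how StringIO stores its contents.

```python
import io

def build_with_stringio(pieces):
    """Write each piece into a text buffer, then materialize once."""
    buf = io.StringIO()
    for piece in pieces:
        buf.write(piece)
    return buf.getvalue()

def build_with_join(pieces):
    """Collect the pieces and join them in a single step."""
    return "".join(pieces)

# Pieces deliberately mix ASCII, BMP, and non-BMP characters.
pieces = ["abc", "\u20ac", "\U0001F600", "xyz"]

result_a = build_with_stringio(pieces)
result_b = build_with_join(pieces)

print(result_a == result_b)
print(len(result_a))
```

Either way, only one final string object is created, instead of a new
one for every intermediate concatenation.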

As to your other problem, Python generally tries to follow Unicode
encoding rules to the letter.  Thus if a piece of text cannot be
represented in the character set of the terminal, then Python will
properly raise an error.  The other languages you have tried likely
fudge it somehow: they display what they can, or something similar.  In
general the Windows command window is an outdated thing that no serious
programmer can rely on to display Unicode text.  Use a proper GUI API,
or use a better terminal that can handle UTF-8.
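
The strict behaviour can be reproduced without a terminal by encoding
by hand.  This sketch assumes cp437, one of the legacy Windows console
code pages, as a stand-in for the terminal's character set; the helper
try_encode and the sample text are made up for illustration.

```python
def try_encode(text, encoding, errors="strict"):
    """Return the encoded bytes, or the exception if encoding fails."""
    try:
        return text.encode(encoding, errors)
    except UnicodeEncodeError as exc:
        return exc

text = "price: \u20ac5"   # contains EURO SIGN, which cp437 lacks

# Strict mode refuses to encode what the code page cannot represent.
strict = try_encode(text, "cp437")

# "replace" is the fudge: unencodable characters become "?".
replaced = try_encode(text, "cp437", errors="replace")

# UTF-8 can represent every code point, so nothing is lost.
utf8 = try_encode(text, "utf-8")

print(type(strict).__name__)
print(replaced)
print(utf8)
```

The "replace" handler is roughly what "display what they can" amounts
to: the output appears, but the original character is gone.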

The TL;DR version: You're right that converting string representations
internally incurs overhead, but if your program is slow because of this
you're doing it wrong.  It's not symptomatic of some Python disease.


