[Patches] [ python-Patches-1629305 ] The Unicode "lazy strings" patches

SourceForge.net noreply at sourceforge.net
Wed Jan 10 21:30:36 CET 2007


Patches item #1629305, was opened at 2007-01-06 09:37
Message generated for change (Comment added) made by lhastings
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1629305&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Core (C code)
Group: Python 3000
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Larry Hastings (lhastings)
Assigned to: Nobody/Anonymous (nobody)
Summary: The Unicode "lazy strings" patches

Initial Comment:
These are patches to add lazy processing to Unicode strings for Python 3000.  I plan to post separate patches for both "lazy concatenation" and "lazy slices", as I suspect "lazy concatenation" has a much higher chance of being accepted.

There is a long discussion about "lazy concatenation" here:
http://mail.python.org/pipermail/python-dev/2006-October/069224.html
And another long discussion about "lazy slices" here:
http://mail.python.org/pipermail/python-dev/2006-October/069506.html

Note that, unlike the 8-bit-character strings patches, I don't expect the "lazy slices" patch to be dependent on the "lazy concatenation" patch.  Unicode objects are stored differently, and already use a pointer to a separately-allocated buffer.   This was the big (and mildly controversial) change made by the 8-bit-character "lazy concatenation" patch, and "lazy slices" needed it too.  Since Unicode objects already look like that, the Unicode lazy patches should be independent.

----------------------------------------------------------------------

>Comment By: Larry Hastings (lhastings)
Date: 2007-01-10 20:30

Message:
Logged In: YES 
user_id=364875
Originator: YES

Much of what I do in Python is text processing.  My largest Python project
to date was an IDL which spewed out loads of text; I've also written an
HTML formatter or two.  I seem to do an awful lot of string concatenation
in Python, and I'd like it to be fast.  I'm not alone in this, as there
have been several patches to Python in recent years to speed up string
concatenation.

Perhaps you aren't familiar with my original justification for the patch. 
I've always hated the "".join() idiom for string concatenation, as it
violates the "There should be one--and preferably only one--obvious way to
do it" principle (and arguably others).  With lazy concatenation, the
obvious way (using +) becomes competitive with "".join(), thus dispensing
with the need for this unobvious and distracting idiom.
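
To make that concrete, here's the sort of thing I mean (a made-up toy
example, nothing to do with the patch itself):

    name = "Larry"
    count = 3

    # The obvious way: plain + concatenation.
    msg = "Hello, " + name + ", you have " + str(count) + " messages."

    # The idiom people are taught to use instead.
    msg = "".join(["Hello, ", name, ", you have ", str(count), " messages."])

With lazy concatenation the first spelling should be competitive with the
second.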

For a more thorough dissection of the (original) patch, including its
implementation and lots of discussion from other people, please see the
original thread on c.l.p:
http://groups.google.com/group/comp.lang.python/browse_frm/thread/b8a8f20bc3c81bcf
Please ignore the benchmarks there, as they were quite flawed.

And, no, I haven't seen a lot of code manipulating Unicode strings yet,
but then I'm not a Python shaker-and-mover.  Obviously I expect to see a
whole lot more when Py3k is adopted.

----------------------------------------------------------------------

Comment By: Josiah Carlson (josiahcarlson)
Date: 2007-01-10 18:24

Message:
Logged In: YES 
user_id=341410
Originator: NO

From what I understand, the point of the lazy strings patch is to make
certain operations faster.  What operations?  Generally speaking, looped
concatenation (x += y) and other looping operations that have
traditionally been slow: O(n^2).

While this error is still common among new users of Python, they
generally only get bitten once.  They ask about it on python-list and are
told:  z = []; z.append(y); x = ''.join(z).
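
Spelled out as a loop, the two patterns look roughly like this (just an
illustration of the usual advice, not code quoted from anywhere in
particular):

    pieces = ["spam"] * 10000

    # The historically O(n^2) pattern: each += may copy everything
    # built so far.
    x = ""
    for y in pieces:
        x += y

    # The rewrite people are pointed to: amortized O(n).
    z = []
    for y in pieces:
        z.append(y)
    x = "".join(z)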

Then again, the only place where I've seen the iterative building up of
*text* is really in document reformatting (like textwrap).  Basically all
other use-cases (that I have seen) generally involve the manipulation of
binary data.  Larry, out of curiosity, have you found code out there that
currently loops and concatenates unicode?

----------------------------------------------------------------------

Comment By: Larry Hastings (lhastings)
Date: 2007-01-09 01:26

Message:
Logged In: YES 
user_id=364875
Originator: YES

Continuing the comedy of errors, concat patch #2 was actually the same as
#1; it still lacked the fix for detecting a NULL return from PyMem_NEW().
Fixed in concat patch #3.  (Deleting concat patch #2.)
File Added: lch.py3k.unicode.lazy.concat.patch.3.txt

----------------------------------------------------------------------

Comment By: Larry Hastings (lhastings)
Date: 2007-01-09 01:10

Message:
Logged In: YES 
user_id=364875
Originator: YES

Revised the lazy concatenation patch to add (doh!) a check for when
PyMem_NEW() fails in PyUnicode_AsUnicode().
File Added: lch.py3k.unicode.lazy.concat.patch.2.txt

----------------------------------------------------------------------

Comment By: Larry Hastings (lhastings)
Date: 2007-01-08 18:50

Message:
Logged In: YES 
user_id=364875
Originator: YES

jcarlson:
The first time someone calls PyUnicode_AsUnicode() on a concatenation
object, it renders the string, and that rendering is where the cost is
paid.  In general it is linear, though linear in *what* depends: it
iterates over the m concatenated strings and the n total characters they
contain, so it is O(m + n), and whether m or n matters more depends on
their values.  After rendering, the object behaves like any other Unicode
string, including O(1) array element lookup.
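
If it helps, the idea can be sketched in pure Python (a rough model of
the mechanism only, not the actual C implementation in the patch):

    class LazyConcat:
        # Toy model of a "concatenation object": it holds references to
        # its operands and only builds the real string the first time
        # one is actually needed.

        def __init__(self, pieces=()):
            self._pieces = [str(p) for p in pieces]  # the m operand strings
            self._rendered = None                    # built on first access

        def __add__(self, other):
            # Cheap: just remember one more operand; no characters are
            # copied here.
            if isinstance(other, LazyConcat):
                extra = other._pieces
            else:
                extra = [str(other)]
            return LazyConcat(self._pieces + extra)

        def _render(self):
            # Done at most once, analogous to the first call to
            # PyUnicode_AsUnicode(): linear in the m pieces and their n
            # total characters.
            if self._rendered is None:
                self._rendered = "".join(self._pieces)
            return self._rendered

        def __getitem__(self, i):
            # The first lookup pays for rendering; after that it's O(1).
            return self._render()[i]

    s = LazyConcat(["foo"]) + "bar" + "baz"  # no joining happens here
    print(s[3])                              # renders, then prints "b"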

If you're referring to GvR's statement "I mention performance because s[i]
should remain an O(1) operation.", here:
http://mail.python.org/pipermail/python-3000/2006-December/005281.html
I suspect this refers to the UCS-2 vs. UTF-16 debate.

lemberg:
Your criticisms are fair; lazy evaluation is a tradeoff.  In general my
response to theories about how it will affect performance is "I invite you
to try it and see".

As for causing memory errors, the only problem I see is not checking for a
NULL return from PyMem_NEW() in PyUnicode_AsUnicode().  But that's a bug,
not a flaw in my approach, and I'll fix that bug today.  I don't see how
"[my] approach can cause memory errors" in any sort of larger sense.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2007-01-08 10:59

Message:
Logged In: YES 
user_id=38388
Originator: NO

While I don't think the added complexity in the implementation is worth
it, given that there are other ways of achieving the same kind of
performance (e.g. list of Unicode strings), some comments:

 * you add a long field to every Unicode object - so every single object
in the system pays 4-8 bytes for the small performance advantage

 * Unicode objects are often accessed using PyUnicode_AS_UNICODE(); this
operation doesn't allow passing back errors, yet your lazy evaluation
approach can cause memory errors - how are you going to deal with them?
(currently you don't even test for them)

 * the lazy approach keeps all partial Unicode objects alive until they
finally get concatenated; if you have lots of those (e.g. if you use x +=
y in a loop), then you pay the complete Python object overhead for every
single partial Unicode object in the list of strings - given that most
such operations use short strings, you are likely creating a memory
overhead far greater than the total length of all the strings (rough
arithmetic below)



----------------------------------------------------------------------

Comment By: Josiah Carlson (josiahcarlson)
Date: 2007-01-07 05:08

Message:
Logged In: YES 
user_id=341410
Originator: NO

What are the performance characteristics of each operation?  I presume
that a + b for unicode strings a and b is O(1) time (if I understand your
implementation correctly).  But according to my reading, (a + b + c +
...)[i] is O(number of concatenations performed).  Is this correct?

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1629305&group_id=5470


More information about the Patches mailing list