Light slices + COW

castironpi at gmail.com castironpi at gmail.com
Sat May 3 19:54:54 EDT 2008


On May 3, 1:59 pm, bearophileH... at lycos.com wrote:
> Sometimes different languages suggests me ways to cross-pollinate
> them.
>
> (Note: probably someone has already suggested (and even implemented)
> the following ideas, but I like to know why they aren't fit for
> Python).
>
> Python generators now allow me to package some code patterns common in
> my code, that other (older) languages didn't allow me to extract; for
> example given a sequence (array, list, etc) of items now and then I
> need to scan all the upper triangle of their cross matrix:
>
> def xpairs(seq):
>
>     len_seq = len(seq)
>
>     for i, e1 in enumerate(seq):
>
>         for j in xrange(i+1, len_seq):
>
>             yield e1, seq[j]
>
> Or adjacent ones (there are ways to generalize the following function,
> but this the most common situation for me):
>
> def xpairwise(iterable):
>
>     return izip(iterable, islice(iterable, 1, None))
>
> That xpairs() generator is nice, but it's not the best possible code
> (but you may disagree with me, and you may think that code better than
> the successive D code that uses two slices). Inside it I can't use a
> list slicing because it copies subparts of the list, probably becoming
> too much slow (for example if len(seq) == 1000).
>
> Compared to Python, the D language is at lower level, so you may
> expect it to have a worse syntax (but the ShedSkin compiler has shown
> me once for all that you can have a language that is both high-level,
> with a short & sexy syntax, and very fast), but you can define a
> xpairs() iterable class in it too, and the core of that iterable class
> there's a method that may look like this:
>
> if (this.seq.length > 1)
>
>   foreach (i, e1; this.seq[0 .. $-1])
>
>     foreach (e2; this.seq[i+1 .. $]) {
>
>       result = dg(e1, e2); if (result) break; // yield
>
>     }
>
> Where the strange last line is the yield, and the $ inside [] is the
> length of the current array (a very clever idea that avoids the need
> for negative indexes).
>
> That D code is as fast or faster than the code you can back-translate
> from Python, this is possible because in D arrays slices are very
> light, they are a struct of <length, pointer> (in the future they may
> change to a couple of pointers, to speed up the array slice scanning).
> So if you slice an array you don't copy its contents, just the start-
> end points of the slice. If you read/scan the slice that's all you
> have, while if you write on it, D uses a Copy-On-Write strategy, that
> is a just-in-time copy of the slice contents.
>
> In practice this speeds up lot of code that manages strings and arrays
> (like a XML parser, making it among the faster ones).
>
> Being the slice a struct, it's stack allocated, so the inner foreach
> doesn't even create a slice object in the heap each time the outer
> foreach loops.
>
> One problem with this strategy is that if you have a huge string, and
> you keep only a little slice of it, the D garbage collector will keep
> it all in memory. To avoid that problem you have to duplicate the
> slice manually:
>
> somestring[inf...sup].dup;
>
> I think Psyco in some situations is able to manage string slices
> avoiding the copy.
>
> I think Python 3.0 too may enjoy a similar strategy of light slices +
> COW for strings, lists and arrays (tuples and strings don't need the
> COW).
>
> Bye,
>
> bearophile

In my understanding, the operating system links files across multiple
sections on a disk, while keeping those details from client software.

Files:

AAAA    BBBB AA CCCCCCCCC

While File A still reads as: AAAAAA, correctly.

Modifications to B as follow:

Files:

AAAA    BBBB AA CCCCCCCCC BBBB

In the case of a large mutable string, modifications can cause linear-
time operations, even if only making a constant-time change.

String:

abcdefg

Modification:

aAbcdefg

causes the entire string to be recopied.  Expensive at scale.

A string-on-disk structure could provide linking, such as a String
File Allocation Table.

abcdefg  A

correctly reads as:

aAbcdefg



More information about the Python-list mailing list