[Pandas-dev] #3089 [PERF: regression from 0.10.1] discussion

Jeff Reback jreback at yahoo.com
Wed Mar 20 21:02:39 CET 2013


I am on 64-bit Linux (I use Windows too, but try to avoid it whenever possible!)

I agree with your assessment wrt 32/64-bit and perf.

I am not sure that these corner cases are that big a deal; more important, I think, is that we test the most common cases for perf.


On Mar 20, 2013, at 3:56 PM, Stephen Lin <swlin at post.harvard.edu> wrote:

> Thanks Jeff!
> 
> So ignoring the testing methodology issue for now, I've done the small
> fix suggested, but apparently it *is* too restrictive, because it
> negatively affects two other tests that were previously improved (the
> two "degenerate" tests improved 25% by adding the restriction, while
> these two other tests regressed 25%). I will do some more testing to
> see if I can find a justifiable way of avoiding this degenerate case,
> (hopefully) without hardcoding a magic number... (But maybe we
> shouldn't bother with this degenerate case at all? I'm a fan of making
> all improvements monotonic, so I'd prefer not to regress this case
> even if it's degenerate, but I don't know yet how reliably I can do
> that for all situations and processor/compiler/OS combinations...)
> 
> Also, Jeff, I reviewed my vbenches vs. the ones you published on GitHub
> for this issue, and I think the reason some of my larger performance
> impacts don't show up in your results is the vectorization issue (you
> ARE on 64-bit, right?)... I'm not 100% sure, but I think it's likely
> because x86-64 allows more vectorization optimizations even without
> memmove, so the effect of this optimization is not as great there.
> However, there are plenty of people still using 32-bit OSes (I have a
> 64-bit machine but just never bothered to install 64-bit Ubuntu), so
> it's definitely still worthwhile to do this.
> 
> In any case, I believe that VC++9 (i.e. 2008) (which is still used to
> build the pre-built Windows binaries, I think? correct me if I'm
> wrong) does rather poorly on vectorization, even when it's allowed.
> Worse, though, it's usually not allowed, because Windows 32-bit builds
> generally have to assume lowest-common-denominator hardware (SSE,
> which dates from the Pentium III, and SSE2, from the Pentium 4, only
> became requirements to install Windows with Windows *8* :D) since they
> are not compiled on the user's machine. (You can only avoid this by
> abandoning compatibility with older machines or by jumping through
> hoops to check CPUID at runtime and modify program behavior
> accordingly, which I don't think Cython does.)
> 
> Anyway, I'll fill in with more info when I have some.
> 
> Stephen
> 
> On Wed, Mar 20, 2013 at 2:56 PM, Jeff Reback <jeffreback at gmail.com> wrote:
>> awesome explanation Stephen!
>> 
>> I'd vote for #2
>> 
>> essentially create a testing constructor (kind of like y-p's mkdf),
>> but one that creates only a numpy random array that is c-contiguous
>> by default (with an option for f-contiguous), and then use that
>> everywhere we have np.random.randn... (EVERYWHERE!)
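>> 
>> Something like this, maybe (an untested sketch; the name rand_arr is
>> just a placeholder):
>> 
>>     import numpy as np
>> 
>>     def rand_arr(rows, cols, order='C'):
>>         # random values in the requested memory layout; np.asarray
>>         # copies only when the layout doesn't already match
>>         return np.asarray(np.random.randn(rows, cols), order=order)
>> 
>> e.g. rand_arr(10000, 4) for the common c-contiguous case, and
>> rand_arr(10000, 4, order='F') to exercise the f-contiguous path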
>> 
>> and second, I guess, if it helps, look at the c/f-contiguity of the
>> ops where appropriate...
>> 
>> my 2c
>> 
>> 
>> 
>> 
>> On Wed, Mar 20, 2013 at 2:24 PM, Stephen Lin <swlin at post.harvard.edu> wrote:
>>> 
>>> OK, here goes, the issue is the following...
>>> 
>>> The optimization in question switches to row-by-row or
>>> column-by-column copying for 2-d arrays when possible, namely when:
>>> 
>>> 1. the input array (where the array in question is Block.values) is
>>> c-contiguous for takes along axis 0 or f-contiguous for takes along
>>> axis 1 of the array, and
>>> 2. the contiguity of the output array matches the contiguity of the
>>> input.
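>>> 
>>> In numpy terms, the check is roughly this (a paraphrase of the
>>> condition, not the actual Cython code):
>>> 
>>>     import numpy as np
>>> 
>>>     def take_can_use_memmove(values, out, axis):
>>>         # 1. input must be c-contiguous for a take along axis 0,
>>>         #    f-contiguous for a take along axis 1
>>>         # 2. the output's layout must match the input's
>>>         if axis == 0:
>>>             return values.flags.c_contiguous and out.flags.c_contiguous
>>>         return values.flags.f_contiguous and out.flags.f_contiguous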
>>> 
>>> Almost all the time, Block.values is stored c-contiguously, such that
>>> each row of the Block corresponds to a column of the DataFrame. So
>>> the optimization only really kicks in, effectively, when reindexing
>>> along the column axis of the DataFrame (i.e. axis 0 of the Block); it
>>> basically means we call memmove once per DataFrame column rather than
>>> iterating in a loop and copying elements. This is good because most
>>> sane DataFrame objects have more rows than columns, so we call
>>> memmove only a few times (i.e. once per column), each time for a
>>> large block of values (i.e. all rows for that column), so any
>>> overhead from calling memmove is outweighed by the benefit of a
>>> hand-optimized copy (which probably involves vectorization,
>>> alignment/cache optimization, loop unrolling, etc.).
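>>> 
>>> Conceptually, the fast path is just this (a plain-Python sketch of
>>> the idea, not the actual Cython code):
>>> 
>>>     import numpy as np
>>> 
>>>     def take_axis0_fast(values, indexer, out):
>>>         # values and out are both c-contiguous, so each values[ix]
>>>         # is one contiguous run of memory, and each assignment below
>>>         # is a single block copy (memmove in the real code) instead
>>>         # of an element-by-element loop
>>>         for i, ix in enumerate(indexer):
>>>             out[i] = values[ix]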
>>> 
>>> C-contiguous blocks result from basically every Pandas operation that
>>> operates on blocks, with the only exceptions (as far as I can tell)
>>> being creating a DataFrame directly from a 2-d ndarray or creating
>>> the transpose of a homogeneous DataFrame (but not a heterogeneous
>>> one) without copying; this is basically an optimization to avoid
>>> creating the c-contiguous version of an array when the f-contiguous
>>> one is already available, but it's the exception rather than the
>>> rule, and pretty much any modification of the DataFrame will
>>> immediately require reallocation and copying to a new c-contiguous
>>> block.
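>>> 
>>> You can see the two layouts directly; note that _data and
>>> Block.values are internals, so this is just for poking around:
>>> 
>>>     import numpy as np
>>>     import pandas as pd
>>> 
>>>     arr = np.random.randn(10000, 4)   # c-contiguous, numpy's default
>>>     df = pd.DataFrame(arr)            # no copy: the block holds arr.T
>>>     print(df._data.blocks[0].values.flags.f_contiguous)   # True
>>> 
>>>     df2 = pd.DataFrame({'a': arr[:, 0], 'b': arr[:, 1]})
>>>     print(df2._data.blocks[0].values.flags.c_contiguous)  # True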
>>> 
>>> Unfortunately many of the DataFrame tests, including the two in
>>> question here, are (for simplicity) only testing the case where
>>> homogeneous 2-d data is passed to the DataFrame constructor, which
>>> results in (non-representative) f-contiguous blocks. An additional
>>> issue with this test is that it's creating a very long but thin
>>> array (10,000 long, 4 wide) and reindexing along the index
>>> dimension, so row-by-row (from the DataFrame perspective) copying is
>>> done over and over using memmove on 4-element arrays. Furthermore,
>>> the alignment and width in bytes of each 4-element array happens to
>>> be a convenient multiple of 128 bits, which is the multiple required
>>> for vectorized SIMD instructions, so it turns out the
>>> element-by-element copying is fairly efficient when such operations
>>> are available (as is guaranteed on x86-64, but not necessarily
>>> x86-32), and the call to memmove has more overhead than
>>> element-by-element copying.
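>>> 
>>> For reference, the degenerate case is essentially just this (shapes
>>> roughly matching the vbench):
>>> 
>>>     import numpy as np
>>>     import pandas as pd
>>> 
>>>     df = pd.DataFrame(np.random.randn(10000, 4))  # f-contiguous block
>>>     indexer = np.random.permutation(10000)
>>>     df2 = df.reindex(indexer)  # take along the index: 10,000 tiny
>>>                                # 4-element copies, one memmove each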
>>> 
>>> So the issue is basically only happening because all the following are
>>> true:
>>> 
>>> 1. The DataFrame is constructed directly from a 2-d homogeneous
>>> ndarray (which has the default c-contiguous layout, so the block
>>> becomes f-contiguous).
>>> 2. There has been no operation after construction of the DataFrame
>>> requiring reallocation of any sort (otherwise the block would become
>>> c-contiguous).
>>> 3. The reindexing is done on the index axis (otherwise no optimization
>>> would be triggered, since it requires the right axis/contiguity
>>> combination).
>>> 4. The DataFrame is long but thin (otherwise memmove would not be
>>> called repeatedly to do small copies).
>>> 5. The C compiler is not inlining memmove properly, for whatever reason,
>>> and
>>> 6. (possibly) The alignment/width of the data happens to be such that
>>> SIMD operations can be used directly, so the benefit of eliding the
>>> loop is not very great and is exceeded by the overhead of the
>>> memmove call.
>>> 
>>> To be honest, it's common C practice to call memmove/memcpy (the
>>> performance of the two doesn't really differ in my testing of this
>>> case) even for very small arrays, assuming that the implementation
>>> is sane enough to inline it and do the right thing either way, so I'm
>>> really surprised about #5: I would not have thought it to be an issue
>>> with a modern compiler, since calling memcpy can't do anything but
>>> provide the compiler more, not less, information about your
>>> intentions (and the overhead of the memmove aliasing check is not
>>> significant here).
>>> 
>>> Anyway, so it's a corner case, and I didn't catch it originally
>>> because I independently tested the effects of 1) allocating the
>>> output array to be f-contiguous instead of c-contiguous by default
>>> when the input array is f-contiguous and 2) converting loops into
>>> memmove calls when possible, both of which have a positive
>>> performance effect independently but combine to adversely affect
>>> these two tests.
>>> 
>>> I can revert the change that "allocates the output array to be
>>> f-contiguous instead of c-contiguous by default when the input array
>>> is f-contiguous", meaning that this optimization will almost never be
>>> triggered for an f-contiguous input array (unless the caller
>>> explicitly provides the output array as f-contiguous), but I'd rather
>>> not, because the optimization is actually kind of useful in less
>>> degenerate cases where you want to quickly produce a reindexed
>>> version of an f-contiguous array, for whatever reason, even though
>>> those cases are rarer.
>>> 
>>> So I think what I'm going to do instead, to avoid the degenerate case
>>> above, is to trigger the optimization only when the take operation is
>>> done along the shorter of the two dimensions (i.e. so the copied
>>> dimension is the longer of the two): that will definitely fix this
>>> test (since it'll avoid this optimization completely), but I suppose
>>> there might be other degenerate cases I haven't thought about. I'll
>>> submit a PR later today for this, if no one finds any objection to
>>> the idea.
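>>> 
>>> In other words, the guard would look something like this (a sketch
>>> of the check only, not the actual take implementation):
>>> 
>>>     def memmove_worthwhile(values, take_axis):
>>>         # only use the memmove path when the take is along the
>>>         # shorter dimension, so each memmove moves a long contiguous
>>>         # run and the per-call overhead is amortized
>>>         copy_len = values.shape[1 - take_axis]
>>>         take_len = values.shape[take_axis]
>>>         return copy_len >= take_len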
>>> 
>>> However, I think it might be skewing our performance results to be
>>> testing DataFrame objects constructed from 2-d ndarrays, since
>>> they're not representative; in addition to the issue above, it means
>>> that many tests are actually incorporating the cost of converting an
>>> f-contiguous array into a c-contiguous array on top of what they're
>>> actually trying to test. Two possible solutions are:
>>> 
>>> 1. Change the DataFrame constructor (and possibly DataFrame.T) to
>>> normalize all blocks to c-contiguous.
>>> 2. Leave the DataFrame constructor as-is but either change existing
>>> tests to exercise the more common use case (c-contiguous blocks) or
>>> add such tests alongside the current ones.
>>> 
>>> I think #2 is probably best, since #1 will have a performance impact
>>> for the use cases (however rare) where an entire workflow can avoid
>>> triggering conversion from f-contiguous blocks to c-contiguous blocks.
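>>> 
>>> Concretely, for #2, a test can opt into the common layout with a
>>> one-line change, since passing an f-contiguous ndarray means the
>>> stored transpose is c-contiguous (hypothetical example):
>>> 
>>>     import numpy as np
>>>     import pandas as pd
>>> 
>>>     # instead of pd.DataFrame(np.random.randn(10000, 4)), which
>>>     # yields an f-contiguous block:
>>>     df = pd.DataFrame(np.asfortranarray(np.random.randn(10000, 4)))
>>>     # the block should now be c-contiguous, as in typical usage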
>>> 
>>> Let me know what you all think,
>>> Stephen
>>> 
>>> On Wed, Mar 20, 2013 at 1:25 AM, Stephen Lin <swlin at post.harvard.edu>
>>> wrote:
>>>> Ahh! I figured it out...the platform issue is part of it, but mostly
>>>> it's that two (independently tested) commits had a weird effect when
>>>> merged.
>>>> 
>>>> And the reason they did so is that, it turns out, this particular
>>>> test (like all of our reindexing tests) is testing something very
>>>> non-representative, because of the way the tests are constructed, so
>>>> we're not really getting representative performance data,
>>>> unfortunately (it has to do with the DataFrame constructor and
>>>> c-contiguity vs f-contiguity). We should probably write new tests to
>>>> fix this issue.
>>>> 
>>>> I'll write up a fuller explanation when I get a chance. Anyway, sorry
>>>> for sending you on a git bisect goose chase, Jeff.
>>>> 
>>>> Stephen
>>>> 
>>>> On Wed, Mar 20, 2013 at 1:01 AM, Stephen Lin <swlin at post.harvard.edu>
>>>> wrote:
>>>>> As per the "we're getting too chatty on GitHub" comment, should we be
>>>>> moving extended issue discussion about bugs to this list whenever
>>>>> possible?
>>>>> 
>>>>> I posted a few comments on #3089 just now but realized maybe
>>>>> starting an e-mail chain would be better...
>>>>> 
>>>>> Anyway, I'm looking into the issue; I suspect it's a corner case
>>>>> due to an array that's very large in one dimension but small in
>>>>> another, and possibly that compiler and architecture differences
>>>>> are causing different results as well... Jeff, do you mind sending
>>>>> me the output of "gcc -dumpmachine" and "gcc -dumpspecs" on the
>>>>> machine you ran vb_suite on?
>>>>> 
>>>>> I'll set up a 64-bit dev machine going forward so I can test on both
>>>>> platforms.
>>>>> 
>>>>> Thanks,
>>>>> Stephen
>>> _______________________________________________
>>> Pandas-dev mailing list
>>> Pandas-dev at python.org
>>> http://mail.python.org/mailman/listinfo/pandas-dev

