[Pandas-dev] #3089 [PERF: regression from 0.10.1] discussion

Stephen Lin swlin at post.harvard.edu
Wed Mar 20 21:12:19 CET 2013


(also, just one more thing to note...native numpy take operations ARE
using memmove, so it's possible that prior to using memmove ourselves,
we were actually performing more poorly than numpy with our Cython
takes in some cases, if the host platform/compiler/OS optimizes
memmove much better than normal array ops...which is not out of the
question for Windows builds)

On Wed, Mar 20, 2013 at 4:07 PM, Stephen Lin <swlin at post.harvard.edu> wrote:
> So we can just ignore this and write new tests, if everyone thinks
> that's appropriate. Anyone else have thoughts?
>
> I'd prefer monotonic improvement myself, honestly...but it might not
> be a reasonable ideal to strive for in this case.
>
> Stephen
>
> On Wed, Mar 20, 2013 at 4:02 PM, Jeff Reback <jreback at yahoo.com> wrote:
>> I am on 64bit Linux (I use windows too, but try to avoid whenever possible!)
>>
>> I agree with your assessment wrt 32/64
>> and perf -
>>
>> I am not sure that these corner cases are that big a deal; more important, I think, is that we test the most common cases for perf
>>
>>
>> On Mar 20, 2013, at 3:56 PM, Stephen Lin <swlin at post.harvard.edu> wrote:
>>
>>> Thanks Jeff!
>>>
>>> So, ignoring the testing methodology issue for now: I've made the small
>>> fix suggested, but apparently it *is* too restrictive, because it
>>> negatively affects two other tests that were previously improved (the
>>> two "degenerate" tests improved 25% by adding the restriction, while
>>> these two tests regressed 25%). I will do some more testing to see if
>>> I can find a justifiable way of avoiding this degenerate case,
>>> (hopefully) without hardcoding a magic number... (But maybe we should
>>> just not bother with this degenerate case at all? I'm a fan of making
>>> all improvements monotonic, so I'd prefer not to regress this case
>>> even if it's degenerate, but I don't know yet how reliably I can do
>>> that for all situations and processor/compiler/OS combinations...)
>>>
>>> Also, Jeff, I reviewed my vbenches against the ones you published on
>>> GitHub for this issue, and I think the reason some of my larger
>>> performance impacts don't show up in your results is the
>>> vectorization issue (you ARE on 64-bit, right?). I'm not 100% sure,
>>> but it's likely that x86-64 allows more vectorization optimizations
>>> even without memmove, so the effect of this change is smaller there.
>>> However, there are plenty of people still using 32-bit OSes (I have a
>>> 64-bit machine but just never bothered to install 64-bit Ubuntu), so
>>> it's definitely still worthwhile to do this.
>>>
>>> In any case, I believe that VC++9 (i.e. 2008) (which is still used for
>>> the pre-built binary Windows build, I think? correct me if I'm wrong)
>>> does rather poorly on vectorization, even when it's allowed. Worse,
>>> though, it's usually not allowed, because Windows 32-bit builds
>>> generally have to assume lowest-common-denominator hardware (SSE,
>>> which dates from the Pentium III, and SSE2, from the Pentium 4, only
>>> became requirements for installing Windows with Windows *8* :D) since
>>> they are not compiled on the user's machine. (You can only avoid this
>>> by abandoning compatibility with older machines or jumping through
>>> hoops to detect CPUID at runtime and modifying program behavior
>>> accordingly, which I don't think Cython does.)
>>>
>>> Anyway, I'll fill in with more info when I have some.
>>>
>>> Stephen
>>>
>>> On Wed, Mar 20, 2013 at 2:56 PM, Jeff Reback <jeffreback at gmail.com> wrote:
>>>> awesome explanation Stephen!
>>>>
>>>> I'd vote for #2
>>>>
>>>> essentially create a testing constructor (kind of like y-p's mkdf)
>>>> that creates only a numpy random array, c-contiguous by default
>>>> (with an option for f), and then use that (EVERYWHERE!) where we have
>>>> np.random.randn.......
>>>>
>>>> and second, I guess, if it helps, look at the c/f contiguousness
>>>> of the ops where appropriate...
>>>>
>>>> my 2c
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Mar 20, 2013 at 2:24 PM, Stephen Lin <swlin at post.harvard.edu> wrote:
>>>>>
>>>>> OK, here goes, the issue is the following...
>>>>>
>>>>> The optimization in question switches to row-by-row or
>>>>> column-by-column copying for 2-d arrays when possible, namely when:
>>>>>
>>>>> 1. the input array (where the array in question is Block.values) is
>>>>> c-contiguous for takes along axis0 or f-contiguous for takes along
>>>>> axis1 of the array, and
>>>>> 2. the contiguity of the output array matches the contiguity of the input
>>>>>
>>>>> Almost all the time, Block.values is stored c-contiguously, such that
>>>>> each row of the Block corresponds to a column of the DataFrame. So the
>>>>> optimization only really kicks in, effectively, when reindexing along
>>>>> the column axis of the DataFrame (i.e. axis 0 of the Block); it
>>>>> basically means we call memmove once per DataFrame column rather than
>>>>> iterating in a loop and copying elements. This is good because most
>>>>> sane DataFrame objects have more rows than columns, so we call
>>>>> memmove only a few times (i.e. once per column), each time for a large
>>>>> block of values (i.e. all rows for that column at once), so any
>>>>> overhead from calling memmove is outweighed by the benefit of a
>>>>> hand-optimized copy (which probably involves vectorization,
>>>>> alignment/cache optimization, loop unrolling, etc.).
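>>>>>
>>>>> (In numpy terms, the fast path amounts to something like the sketch
>>>>> below; this is just an illustration of the copy pattern, not the
>>>>> actual Cython take code:)
>>>>>
>>>>>     import numpy as np
>>>>>
>>>>>     values = np.random.randn(4, 10000)   # like Block.values: 4 DataFrame
>>>>>                                          # columns x 10000 rows, c-contiguous
>>>>>     indexer = np.array([2, 0, 3])        # take along axis 0 of the Block
>>>>>     out = np.empty((len(indexer), values.shape[1]), dtype=values.dtype)
>>>>>     for i, idx in enumerate(indexer):
>>>>>         # values[idx] is one contiguous run of 10000 doubles, so each
>>>>>         # assignment is effectively a single big memmove per column
>>>>>         out[i] = values[idx]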
>>>>>
>>>>> C-contiguous blocks result from basically every Pandas operation that
>>>>> operates on blocks, the only exceptions being (as far as I can tell)
>>>>> creating a DataFrame directly from a 2-d ndarray or taking the
>>>>> transpose of a homogeneous DataFrame (but not a heterogeneous one)
>>>>> without copying; this is basically an optimization to avoid creating
>>>>> the c-contiguous version of an array when the f-contiguous one is
>>>>> already available, but it's the exception rather than the rule, and
>>>>> pretty much any modification of the DataFrame will immediately require
>>>>> reallocation and copying to a new c-contiguous block.
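>>>>>
>>>>> (Concretely, in numpy-only terms, a minimal sketch of why the
>>>>> 2-d-ndarray constructor case ends up f-contiguous:)
>>>>>
>>>>>     import numpy as np
>>>>>
>>>>>     arr = np.random.randn(10000, 4)        # 2-d ndarray, c-contiguous by default
>>>>>     block_values = arr.T                   # transposed view, no copy made...
>>>>>     print(block_values.flags['F_CONTIGUOUS'])    # ...so the block is f-contiguous
>>>>>
>>>>>     fixed = np.ascontiguousarray(block_values)   # any reallocation/copy restores
>>>>>     print(fixed.flags['C_CONTIGUOUS'])           # the usual c-contiguous layout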
>>>>>
>>>>> Unfortunately, many of the DataFrame tests, including the two in
>>>>> question here, are (for simplicity) only testing the case where
>>>>> homogeneous 2-d data is passed to the DataFrame constructor, which
>>>>> results in (non-representative) f-contiguous blocks. An additional
>>>>> issue with this test is that it creates a very long but thin array
>>>>> (10,000 rows long, 4 columns wide) and reindexes along the index
>>>>> dimension, so row-by-row (from the DataFrame perspective) copying is
>>>>> done over and over using memmove on 4-element arrays. Furthermore, the
>>>>> alignment and width in bytes of each 4-element array happens to be a
>>>>> convenient multiple of 128 bits, which is the width required for
>>>>> vectorized SIMD instructions, so the element-by-element copying turns
>>>>> out to be fairly efficient when such operations are available (as is
>>>>> guaranteed on x86-64, but not necessarily on x86-32), and the call to
>>>>> memmove has more overhead than element-by-element copying.
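>>>>>
>>>>> (Roughly, the shape of what the test exercises; this is not the exact
>>>>> vbench code, just an illustration of the degenerate pattern:)
>>>>>
>>>>>     import numpy as np
>>>>>     from pandas import DataFrame
>>>>>
>>>>>     df = DataFrame(np.random.randn(10000, 4))   # long and thin; block ends
>>>>>                                                 # up f-contiguous (see above)
>>>>>     idx = np.arange(10000)
>>>>>     np.random.shuffle(idx)
>>>>>     df.reindex(idx)   # take along the index axis: one memmove per DataFrame
>>>>>                       # row, each copying only 4 doubles (32 bytes)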
>>>>>
>>>>> So the issue is basically only happening because all the following are
>>>>> true:
>>>>>
>>>>> 1. The DataFrame is constructed directly from a 2-d homogeneous ndarray
>>>>> (which has the default c-contiguous layout, so the block becomes
>>>>> f-contiguous).
>>>>> 2. There has been no operation after construction of the DataFrame
>>>>> requiring reallocation of any sort (otherwise the block would become
>>>>> c-contiguous).
>>>>> 3. The reindexing is done on the index axis (otherwise no optimization
>>>>> would be triggered, since it requires the right axis/contiguity
>>>>> combination).
>>>>> 4. The DataFrame is long but thin (otherwise memmove would not be
>>>>> called repeatedly to do small copies).
>>>>> 5. The C compiler is not inlining memmove properly, for whatever reason,
>>>>> and
>>>>> 6. (possibly) The alignment/width of the data happens to be such that
>>>>> SIMD operations can be used directly, so the benefit of eliding the
>>>>> loop is not very great and is exceeded by the overhead of the memmove
>>>>> call.
>>>>>
>>>>> To be honest, it's common C practice to call memmove/memcpy (the
>>>>> performance of the two doesn't really differ in my testing of this
>>>>> case) even for very small arrays, assuming that the implementation
>>>>> is sane enough to inline it and do the right thing either way, so I'm
>>>>> really surprised about #5: I would not have thought it to be an issue
>>>>> with a modern compiler, since calling memcpy can't do anything but
>>>>> provide the compiler with more, not less, information about your
>>>>> intentions (and the overhead of memmove's aliasing check is not
>>>>> significant here).
>>>>>
>>>>> Anyway, so it's a corner case, and I didn't catch it originally
>>>>> because I independently tested the effect of 1) allocating the output
>>>>> array to be f-contiguous instead of c-contiguous by default when the
>>>>> input array is f-contiguous and 2) converting loops into memmove calls
>>>>> when possible, both of which have a positive performance effect on
>>>>> their own but combine to adversely affect these two tests.
>>>>>
>>>>> I can revert the change that "allocates the output array to be
>>>>> f-contiguous instead of c-contiguous by default when the input array
>>>>> is f-contiguous", meaning that this optimization will almost never be
>>>>> triggered for an f-contiguous input array (unless the caller
>>>>> explicitly provides an f-contiguous output array), but I'd rather
>>>>> not, because the optimization is actually kind of useful in less
>>>>> degenerate cases where you want to quickly produce a reindexed version
>>>>> of an f-contiguous array, for whatever reason, even though those cases
>>>>> are rarer.
>>>>>
>>>>> So I think what I'm going to do instead, to avoid the degenerate case
>>>>> above, is to trigger the optimization only when the take operation is
>>>>> done along the shorter of the two dimensions (i.e. so the copied
>>>>> dimension is the longer of the two): that will definitely fix this
>>>>> test (since it'll avoid this optimization completely), but I suppose
>>>>> there might be other degenerate cases I haven't thought about. I'll
>>>>> submit a PR later today for this, if no one finds any objection to the
>>>>> idea.
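>>>>>
>>>>> (The condition I have in mind is essentially the following; the
>>>>> function name is just a placeholder, not the actual Cython code:)
>>>>>
>>>>>     def should_use_memmove_path(values, take_axis):
>>>>>         # only take the per-chunk memmove path when the take is along the
>>>>>         # shorter dimension, so each contiguous copy is a long run of values
>>>>>         copy_axis = 1 - take_axis
>>>>>         return values.shape[take_axis] <= values.shape[copy_axis]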
>>>>>
>>>>> However, I think it might be skewing our performance results to be
>>>>> testing DataFrame objects constructed from 2-d ndarrays, since they're
>>>>> not representative; in addition to the issue above, it means that many
>>>>> tests are actually incorporating the cost of converting an
>>>>> f-contiguous array into a c-contiguous array on top of what they're
>>>>> actually trying to test. Two possible solutions are:
>>>>>
>>>>> 1. Change DataFrame constructor (and possibly DataFrame.T) to
>>>>> normalize all blocks as c-contiguous.
>>>>> 2. Leave the DataFrame constructor as-is, but either change existing
>>>>> tests to exercise the more common use case (c-contiguous blocks) or
>>>>> add such tests in addition to the current ones.
>>>>>
>>>>> I think #2 is probably best, since #1 will have a performance impact
>>>>> for the use cases (however rare) where an entire workflow can avoid
>>>>> triggering conversion from f-contiguous blocks to c-contiguous blocks.
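>>>>>
>>>>> (For #2, something along the lines of Jeff's suggestion would do; the
>>>>> helper name and signature here are just placeholders, and it assumes
>>>>> the constructor keeps storing the transposed view without copying:)
>>>>>
>>>>>     import numpy as np
>>>>>     from pandas import DataFrame
>>>>>
>>>>>     def make_random_frame(nrows, ncols, f_contiguous_block=False):
>>>>>         # hypothetical test helper: build a DataFrame whose block has the
>>>>>         # requested contiguity; real workflows almost always end up with
>>>>>         # c-contiguous blocks, so that's the default here
>>>>>         arr = np.random.randn(nrows, ncols)     # c-contiguous ndarray
>>>>>         if not f_contiguous_block:
>>>>>             # an f-contiguous input should leave the block c-contiguous
>>>>>             arr = np.asfortranarray(arr)
>>>>>         return DataFrame(arr)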
>>>>>
>>>>> Let me know what you all think,
>>>>> Stephen
>>>>>
>>>>> On Wed, Mar 20, 2013 at 1:25 AM, Stephen Lin <swlin at post.harvard.edu>
>>>>> wrote:
>>>>>> Ahh! I figured it out...the platform issue is part of it, but mostly
>>>>>> it's that two (independently tested) commits had a weird effect when
>>>>>> merged.
>>>>>>
>>>>>> And the reason they did is that, as it turns out, this particular test
>>>>>> (like all of our reindexing tests) is testing something very
>>>>>> non-representative, because of the way they're constructed, so we're
>>>>>> unfortunately not really getting representative performance data (it
>>>>>> has to do with the DataFrame constructor and c-contiguity vs
>>>>>> f-contiguity). We should probably write new tests to fix this issue.
>>>>>>
>>>>>> I'll write up a fuller explanation when I get a chance. Anyway, sorry
>>>>>> for sending you on a git bisect goose chase, Jeff.
>>>>>>
>>>>>> Stephen
>>>>>>
>>>>>> On Wed, Mar 20, 2013 at 1:01 AM, Stephen Lin <swlin at post.harvard.edu>
>>>>>> wrote:
>>>>>>> As per the "we're getting too chatty on GitHub" comment, should we be
>>>>>>> moving extended issue discussion about bugs to this list whenever
>>>>>>> possible?
>>>>>>>
>>>>>>> I posted a few comments on #3089 just now but realized maybe starting
>>>>>>> an e-mail chain would be better..
>>>>>>>
>>>>>>> Anyway, I'm looking into the issue; I suspect it's a corner case due
>>>>>>> to an array that's very large in one dimension but small in another,
>>>>>>> and possibly that compiler and architecture differences are causing
>>>>>>> different results as well... Jeff, do you mind sending me the output
>>>>>>> of "gcc -dumpmachine" and "gcc -dumpspecs" on the machine you ran
>>>>>>> vb_suite on?
>>>>>>>
>>>>>>> I'll set up a 64-bit dev machine going forward so I can test on both
>>>>>>> platforms.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Stephen
>>>>> _______________________________________________
>>>>> Pandas-dev mailing list
>>>>> Pandas-dev at python.org
>>>>> http://mail.python.org/mailman/listinfo/pandas-dev

