From wesmckinn at gmail.com Fri Mar 15 02:21:25 2013 From: wesmckinn at gmail.com (Wes McKinney) Date: Thu, 14 Mar 2013 21:21:25 -0400 Subject: [Pandas-dev] Welcome Message-ID: I just had this new mailing list created for high level discussions around pandas development. Hard to believe we've made it this long without one. I'll make an announcement about the development mailing list on PyData. Thanks, Wes From wesmckinn at gmail.com Wed Mar 20 00:00:48 2013 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 19 Mar 2013 19:00:48 -0400 Subject: [Pandas-dev] Managing the pandas firehose Message-ID: Hi all, Welcome to the new pandas developer list! I thought it would be good to have a place for higher level discussions about the project and other initiatives, so I made this. One note that I wanted to pass on as we move toward the 0.11 release and going forward-- if I could get your help classifying and categorizing incoming issues, that would be a big help of staying on top of things. What does this mean? - Incoming issues: mark milestone as next release (bugs and other "must fixes" or low hanging fruit), next next release at your discretion. "Someday" otherwise. On GitHub you can see there are 30-something issues that have no milestone-- in January there were over 100 and I had a "milestone classification" binge. Would be great to not be the only one =P - Pull requests: also mark with a milestone please! This helps keep track of what release pull requests were a part of later on. - Label accordingly-- you all have been doing a good job with this. Code review and pull requests: - For one or two commits that aren't likely to be controversial (e.g. Jeff has been doing a lot of little doc additions), I don't mind if you push directly to master. If you think having someone else (doesn't need to be me necessarily) sign off would be good, then leave until that happens. - I don't mind if you use the green button-- I waffle between regular merges and cherry-picks when the number of commits is small. My main concern with ongoing development is making sure that things don't fall through the cracks and that bugs that come into the issue tracker get promptly classified. Any other thoughts? At some point we'll have to think about release management-- I have been carrying that torch since pandas 0.1, but at some point maybe someone else will do it. Part of it relies on having access to a fully-equipped Windows VM with 32 and 64 bit versions across all Python versions-- I have a virtualbox image that should get hosted someplace that is not a physical box in my apartment at some point. - Wes From changshe at gmail.com Wed Mar 20 00:18:43 2013 From: changshe at gmail.com (Chang She) Date: Tue, 19 Mar 2013 16:18:43 -0700 Subject: [Pandas-dev] Managing the pandas firehose In-Reply-To: References: Message-ID: Just to tack on to this email, I've started talking to some folks about applying for a grant to fund pandas development for the next year or so and wanted to get your thoughts on hiring someone to spend substantial time on pandas. There are several big questions here: 1. What are the main things that need done in the next year? 2. What exactly would that person be responsible for? Would he/she be full-time or part-time? 3. How much money would that take? 4. What organization would the money be funneled through (needs to be a non-profit)? 5. What metrics can we track over the next year or so to show whether the grant was successful? 6. How/who do we hire? 
Some of the stuff that Wes outlined in his email can definitely fall on this hypothetical person. Since we're all volunteers, having someone hired to make sure things don't fall through the cracks would give us a peace of mind and save us some stress. In any case, your thoughts would be appreciated (alternative funding ideas are also very welcome!) On Mar 19, 2013, at 4:00 PM, Wes McKinney wrote: > Hi all, > > Welcome to the new pandas developer list! I thought it would be good > to have a place for higher level discussions about the project and > other initiatives, so I made this. > > One note that I wanted to pass on as we move toward the 0.11 release > and going forward-- if I could get your help classifying and > categorizing incoming issues, that would be a big help of staying on > top of things. What does this mean? > > - Incoming issues: mark milestone as next release (bugs and other > "must fixes" or low hanging fruit), next next release at your > discretion. "Someday" otherwise. On GitHub you can see there are > 30-something issues that have no milestone-- in January there were > over 100 and I had a "milestone classification" binge. Would be great > to not be the only one =P > - Pull requests: also mark with a milestone please! This helps keep > track of what release pull requests were a part of later on. > - Label accordingly-- you all have been doing a good job with this. > > Code review and pull requests: > - For one or two commits that aren't likely to be controversial (e.g. > Jeff has been doing a lot of little doc additions), I don't mind if > you push directly to master. If you think having someone else (doesn't > need to be me necessarily) sign off would be good, then leave until > that happens. > - I don't mind if you use the green button-- I waffle between regular > merges and cherry-picks when the number of commits is small. > > My main concern with ongoing development is making sure that things > don't fall through the cracks and that bugs that come into the issue > tracker get promptly classified. Any other thoughts? > > At some point we'll have to think about release management-- I have > been carrying that torch since pandas 0.1, but at some point maybe > someone else will do it. Part of it relies on having access to a > fully-equipped Windows VM with 32 and 64 bit versions across all > Python versions-- I have a virtualbox image that should get hosted > someplace that is not a physical box in my apartment at some point. > > - Wes > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev From jeffreback at gmail.com Wed Mar 20 00:42:24 2013 From: jeffreback at gmail.com (Jeff Reback) Date: Tue, 19 Mar 2013 19:42:24 -0400 Subject: [Pandas-dev] Managing the pandas firehose In-Reply-To: References: Message-ID: So from the developer page, here is the roadmap 1. DONE numpy.datetime64 integration, scikits.timeseries codebase integration. Substantially improved time series functionality*.* 2. Improved PyTables (HDF5) integration 3. Tools for working with data sets that do not fit into memory 4. Improved SQL / relational database tools 5. Better statistical graphics using matplotlib 6. Integration with D3.js 7. NDFrame data structure for arbitrarily high-dimensional labeled data 8. Extend GroupBy functionality to regular ndarrays, record arrays 9. Better support for NumPy dtype hierarchy without sacrificing usability 10. *DONE Add a Factor data type (in R parlance)* 11. 
Better support for integer NA values 12. (0.10) Better memory usage and performance when reading very large CSV files blue = done < 0.11 orange = 0.11 yellow = some support, more needed IMHO I think 8 is prob more trouble than its worth out-of-core (3) is very important 5,6 pretty useful 11 a toss-up, depends on if pandas waits for numpy support or roll your own any other items that should be on this list? On Tue, Mar 19, 2013 at 7:18 PM, Chang She wrote: > Just to tack on to this email, I've started talking to some folks about > applying for a grant to fund pandas development for the next year or so and > wanted to get your thoughts on hiring someone to spend substantial time on > pandas. > > There are several big questions here: > > 1. What are the main things that need done in the next year? > 2. What exactly would that person be responsible for? Would he/she be > full-time or part-time? > 3. How much money would that take? > 4. What organization would the money be funneled through (needs to be a > non-profit)? > 5. What metrics can we track over the next year or so to show whether the > grant was successful? > 6. How/who do we hire? > > > Some of the stuff that Wes outlined in his email can definitely fall on > this hypothetical person. Since we're all volunteers, having someone hired > to make sure things don't fall through the cracks would give us a peace of > mind and save us some stress. > > In any case, your thoughts would be appreciated (alternative funding ideas > are also very welcome!) > > > > On Mar 19, 2013, at 4:00 PM, Wes McKinney wrote: > > > Hi all, > > > > Welcome to the new pandas developer list! I thought it would be good > > to have a place for higher level discussions about the project and > > other initiatives, so I made this. > > > > One note that I wanted to pass on as we move toward the 0.11 release > > and going forward-- if I could get your help classifying and > > categorizing incoming issues, that would be a big help of staying on > > top of things. What does this mean? > > > > - Incoming issues: mark milestone as next release (bugs and other > > "must fixes" or low hanging fruit), next next release at your > > discretion. "Someday" otherwise. On GitHub you can see there are > > 30-something issues that have no milestone-- in January there were > > over 100 and I had a "milestone classification" binge. Would be great > > to not be the only one =P > > - Pull requests: also mark with a milestone please! This helps keep > > track of what release pull requests were a part of later on. > > - Label accordingly-- you all have been doing a good job with this. > > > > Code review and pull requests: > > - For one or two commits that aren't likely to be controversial (e.g. > > Jeff has been doing a lot of little doc additions), I don't mind if > > you push directly to master. If you think having someone else (doesn't > > need to be me necessarily) sign off would be good, then leave until > > that happens. > > - I don't mind if you use the green button-- I waffle between regular > > merges and cherry-picks when the number of commits is small. > > > > My main concern with ongoing development is making sure that things > > don't fall through the cracks and that bugs that come into the issue > > tracker get promptly classified. Any other thoughts? > > > > At some point we'll have to think about release management-- I have > > been carrying that torch since pandas 0.1, but at some point maybe > > someone else will do it. 
Part of it relies on having access to a > > fully-equipped Windows VM with 32 and 64 bit versions across all > > Python versions-- I have a virtualbox image that should get hosted > > someplace that is not a physical box in my apartment at some point. > > > > - Wes > > _______________________________________________ > > Pandas-dev mailing list > > Pandas-dev at python.org > > http://mail.python.org/mailman/listinfo/pandas-dev > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From swlin at post.harvard.edu Wed Mar 20 06:01:35 2013 From: swlin at post.harvard.edu (Stephen Lin) Date: Wed, 20 Mar 2013 01:01:35 -0400 Subject: [Pandas-dev] #3089 [PERF: regression from 0.10.1] discussion Message-ID: As per the "we're getting too chatty on GitHub" comment, should we be moving extended issue discussion about bugs to this list whenever possible? I posted a few comments on #3089 just now but realized maybe starting an e-mail chain would be better.. Anyway, I'm looking into the issue, I suspect it's a corner case due to an array that's very large in one dimension but small in another, and possibly that there's compiler and architecture differences causing different results as well....Jeff, do you mind sending me your the output of "gcc -dumpmachine" and "gcc -dumpspecs" on the machine you ran vb_suite on? I'll set up a 64-bit dev machine going forward so I can test on both platforms. Thanks, Stephen From swlin at post.harvard.edu Wed Mar 20 06:25:08 2013 From: swlin at post.harvard.edu (Stephen Lin) Date: Wed, 20 Mar 2013 01:25:08 -0400 Subject: [Pandas-dev] #3089 [PERF: regression from 0.10.1] discussion In-Reply-To: References: Message-ID: Ahh! I figured it out...the platform issue is part of it, but mostly it's that two (independently tested) commits had a weird effect when merged. And the reason they did so is because this particular test turns out all of our reindexing tests are testing something very non-representative, because of the way they're constructed, so we're not really getting representative performance data unfortunately (it has to do with the DataFrame constructor and c-contiguity vs f-contiguity). We should probably write new tests to fix this issue. I'll write up a fuller explanation when I get a chance. Anyway, sorry for sending you on a git bisect goose chase, Jeff. Stephen On Wed, Mar 20, 2013 at 1:01 AM, Stephen Lin wrote: > As per the "we're getting too chatty on GitHub" comment, should we be > moving extended issue discussion about bugs to this list whenever > possible? > > I posted a few comments on #3089 just now but realized maybe starting > an e-mail chain would be better.. > > Anyway, I'm looking into the issue, I suspect it's a corner case due > to an array that's very large in one dimension but small in another, > and possibly that there's compiler and architecture differences > causing different results as well....Jeff, do you mind sending me your > the output of "gcc -dumpmachine" and "gcc -dumpspecs" on the machine > you ran vb_suite on? > > I'll set up a 64-bit dev machine going forward so I can test on both platforms. 
> > Thanks, > Stephen From jeffreback at gmail.com Wed Mar 20 11:03:17 2013 From: jeffreback at gmail.com (Jeff Reback) Date: Wed, 20 Mar 2013 06:03:17 -0400 Subject: [Pandas-dev] Managing the pandas firehose In-Reply-To: References: Message-ID: <2E6A1D4A-5883-4C82-895B-8B2BBAE7D2D2@gmail.com> it seems that a detailed roadmap (with links to issues) could be easily setup at https://github.com/pydata/pandas/wiki then link this back to the developer page has this been thought about already? I can be reached on my cell 917-971-6387 On Mar 19, 2013, at 7:18 PM, Chang She wrote: > Just to tack on to this email, I've started talking to some folks about applying for a grant to fund pandas development for the next year or so and wanted to get your thoughts on hiring someone to spend substantial time on pandas. > > There are several big questions here: > > 1. What are the main things that need done in the next year? > 2. What exactly would that person be responsible for? Would he/she be full-time or part-time? > 3. How much money would that take? > 4. What organization would the money be funneled through (needs to be a non-profit)? > 5. What metrics can we track over the next year or so to show whether the grant was successful? > 6. How/who do we hire? > > > Some of the stuff that Wes outlined in his email can definitely fall on this hypothetical person. Since we're all volunteers, having someone hired to make sure things don't fall through the cracks would give us a peace of mind and save us some stress. > > In any case, your thoughts would be appreciated (alternative funding ideas are also very welcome!) > > > > On Mar 19, 2013, at 4:00 PM, Wes McKinney wrote: > >> Hi all, >> >> Welcome to the new pandas developer list! I thought it would be good >> to have a place for higher level discussions about the project and >> other initiatives, so I made this. >> >> One note that I wanted to pass on as we move toward the 0.11 release >> and going forward-- if I could get your help classifying and >> categorizing incoming issues, that would be a big help of staying on >> top of things. What does this mean? >> >> - Incoming issues: mark milestone as next release (bugs and other >> "must fixes" or low hanging fruit), next next release at your >> discretion. "Someday" otherwise. On GitHub you can see there are >> 30-something issues that have no milestone-- in January there were >> over 100 and I had a "milestone classification" binge. Would be great >> to not be the only one =P >> - Pull requests: also mark with a milestone please! This helps keep >> track of what release pull requests were a part of later on. >> - Label accordingly-- you all have been doing a good job with this. >> >> Code review and pull requests: >> - For one or two commits that aren't likely to be controversial (e.g. >> Jeff has been doing a lot of little doc additions), I don't mind if >> you push directly to master. If you think having someone else (doesn't >> need to be me necessarily) sign off would be good, then leave until >> that happens. >> - I don't mind if you use the green button-- I waffle between regular >> merges and cherry-picks when the number of commits is small. >> >> My main concern with ongoing development is making sure that things >> don't fall through the cracks and that bugs that come into the issue >> tracker get promptly classified. Any other thoughts? 
>> >> At some point we'll have to think about release management-- I have >> been carrying that torch since pandas 0.1, but at some point maybe >> someone else will do it. Part of it relies on having access to a >> fully-equipped Windows VM with 32 and 64 bit versions across all >> Python versions-- I have a virtualbox image that should get hosted >> someplace that is not a physical box in my apartment at some point. >> >> - Wes >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> http://mail.python.org/mailman/listinfo/pandas-dev > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeffreback at gmail.com Wed Mar 20 11:14:21 2013 From: jeffreback at gmail.com (Jeff Reback) Date: Wed, 20 Mar 2013 06:14:21 -0400 Subject: [Pandas-dev] #3089 [PERF: regression from 0.10.1] discussion In-Reply-To: References: Message-ID: <8C98ED00-AC93-46E9-80DB-F6E7E8F85CFA@gmail.com> It was an academic exercise :) not that these are actual quotes.... "premature optimization is the root of all evil" "benchmarking to widgets just helps you make better widgets" On Mar 20, 2013, at 1:25 AM, Stephen Lin wrote: > Ahh! I figured it out...the platform issue is part of it, but mostly > it's that two (independently tested) commits had a weird effect when > merged. > > And the reason they did so is because this particular test turns out > all of our reindexing tests are testing something very > non-representative, because of the way they're constructed, so we're > not really getting representative performance data unfortunately (it > has to do with the DataFrame constructor and c-contiguity vs > f-contiguity). We should probably write new tests to fix this issue. > > I'll write up a fuller explanation when I get a chance. Anyway, sorry > for sending you on a git bisect goose chase, Jeff. > > Stephen > > On Wed, Mar 20, 2013 at 1:01 AM, Stephen Lin wrote: >> As per the "we're getting too chatty on GitHub" comment, should we be >> moving extended issue discussion about bugs to this list whenever >> possible? >> >> I posted a few comments on #3089 just now but realized maybe starting >> an e-mail chain would be better.. >> >> Anyway, I'm looking into the issue, I suspect it's a corner case due >> to an array that's very large in one dimension but small in another, >> and possibly that there's compiler and architecture differences >> causing different results as well....Jeff, do you mind sending me your >> the output of "gcc -dumpmachine" and "gcc -dumpspecs" on the machine >> you ran vb_suite on? >> >> I'll set up a 64-bit dev machine going forward so I can test on both platforms. >> >> Thanks, >> Stephen > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev From swlin at post.harvard.edu Wed Mar 20 19:24:24 2013 From: swlin at post.harvard.edu (Stephen Lin) Date: Wed, 20 Mar 2013 14:24:24 -0400 Subject: [Pandas-dev] #3089 [PERF: regression from 0.10.1] discussion In-Reply-To: References: Message-ID: OK, here goes, the issue is the following... The optimization is question optimizes to row-by-row or column-by-column copying for 2-d arrays when possible, namely when: 1. 
the input array (where the array in question is Block.values) is c-contiguous for takes along axis0 or f-contiguous for takes along axis1 of the array, and
2. the contiguity of the output array matches the contiguity of the input

Almost all the time, Block.values is stored c-contiguously, such that each row of the Block corresponds to a column of the DataFrame. So the optimization only really kicks in, effectively, when reindexing along the column axis of the DataFrame (i.e. axis 0 of the Block); it basically means we call memmove once per DataFrame column rather than iterating in a loop and copying elements. This is good because most sane DataFrame objects have more rows than columns, so we call memmove few times (i.e. once per column) for a large block of values (i.e. all rows for that column at a time), so any overhead from calling memmove will be outweighed by the benefit of a hand-optimized copy (which probably involves vectorization, alignment/cache optimization, loop unrolling, etc.)

C-contiguous blocks result from basically every Pandas operation that operates on blocks, with the only exceptions (as far as I can tell) being creating a DataFrame directly from a 2-d ndarray or creating the transpose of a homogeneous DataFrame (but not a heterogeneous one) without copying; this is basically an optimization to avoid creating the c-contiguous version of an array when the f-contiguous one is already available, but it's the exception rather than the rule, and pretty much any modification of the DataFrame will immediately require reallocation and copying to a new c-contiguous block.

Unfortunately many of the DataFrame tests, including the two in question here, are (for simplicity) only testing the case where homogeneous 2-d data is passed to the DataFrame, which results in (non-representative) f-contiguous blocks. An additional issue with this test is that it's creating a very long but thin array (10,000 long, 4 wide) and reindexing along the index dimension, so row-by-row (from the DataFrame perspective) copying is done over and over using memmove on 4-element arrays. Furthermore, the alignment and width in bytes of each 4-element array happens to be a convenient multiple of 128 bits, which is the multiple required for vectorized SIMD instructions, so it turns out the element-by-element copying is fairly efficient when such operations are available (as is guaranteed on x86-64, but not necessarily x86-32), and the call to memmove has more overhead than element-by-element copying.

So the issue is basically only happening because all the following are true:

1. The DataFrame is constructed directly from a 2-d homogeneous ndarray (which has the default c-contiguous layout, so the block becomes f-contiguous).
2. There has been no operation after construction of the DataFrame requiring reallocation of any sort (otherwise the block would become c-contiguous).
3. The reindexing is done on the index axis (otherwise no optimization would be triggered, since it requires the right axis/contiguity combination).
4. The DataFrame is long but thin (otherwise memmove would not be called repeatedly to do small copies).
5. The C compiler is not inlining memmove properly, for whatever reason, and
6. (possibly) The alignment/width of the data happens to be such that SIMD operations can be used directly, so the overhead of eliding the loop is not very great and is exceeded by the overhead of the memmove.
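A minimal sketch of the layouts in question (illustrative only, not from the original message; the ._data/Block attributes poked at below are pandas internals and version-dependent, and the expected output assumes a 2013-era pandas/NumPy):

    import numpy as np
    import pandas as pd

    arr = np.random.randn(10000, 4)   # C-contiguous ndarray: each 4-element row is one contiguous run
    df = pd.DataFrame(arr)            # constructed directly from a 2-d ndarray

    # The block stores the transpose of the frame's data; reusing the input
    # without copying leaves it F-contiguous with shape (4, 10000).
    blk = df._data.blocks[0].values
    print(blk.shape, blk.flags['C_CONTIGUOUS'], blk.flags['F_CONTIGUOUS'])
    # expected on a 2013-era pandas: (4, 10000) False True

    # Reindexing the frame's index is a take along axis 1 of this block, so each
    # contiguous run copied per selected index is only 4 floats (32 bytes).
    # numpy's take is used here just to make the shapes concrete; pandas uses
    # its own Cython take routines for the actual reindex.
    idx = np.random.permutation(10000)
    sub = blk.take(idx, axis=1)
    print(sub.shape)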
To be honest, it's common C practice to call memmove/memcpy (the performance of the two doesn't really differ from my testing in this case) even for very small arrays and to assume that the implementation is sane enough to inline it and do the right thing either way, so I'm really surprised about #5: I would not have thought it to be an issue with a modern compiler, since calling memcpy can't do anything but provide the compiler more, not less, information about your intentions (and the overhead of the memmove aliasing check is not significant here).

Anyway, so it's a corner case, and I didn't catch it originally because I tested independently the effect of 1) allocating the output array to be f-contiguous instead of c-contiguous by default when the input array is f-contiguous and 2) converting loops into memmove when possible, both of which have a positive performance effect independently but combine to adversely affect these two tests.

I can revert the change that "allocates the output array to be f-contiguous instead of c-contiguous by default when the input array is f-contiguous", meaning that this optimization will almost never be triggered for an f-contiguous input array (unless the caller explicitly provides an output array as f-contiguous), but I'd rather not, because the optimization is actually kind of useful in less degenerate cases when you want to quickly produce a reindexed version of an f-contiguous array, for whatever reason, even though those cases are rarer.

So I think what I'm going to do instead, to avoid the degenerate case above, is to trigger the optimization only when the take operation is done along the shorter of the two dimensions (i.e. so the copied dimension is the longer of the two): that will definitely fix this test (since it'll avoid this optimization completely), but I suppose there might be other degenerate cases I haven't thought about. I'll submit a PR later today for this, if no one finds any objection to the idea.

However, I think it might be skewing our performance results to be testing DataFrame objects constructed from 2-d ndarrays, since they're not representative; in addition to the issue above, it means that many tests are actually incorporating the cost of converting an f-contiguous array into a c-contiguous array on top of what they're actually trying to test. Two possible solutions are:

1. Change the DataFrame constructor (and possibly DataFrame.T) to normalize all blocks as c-contiguous.
2. Leave the DataFrame constructor as-is but either change existing tests to exercise the more common use case (c-contiguous blocks) or add them in addition to the current ones.

I think #2 is probably best, since #1 will have a performance impact for the use cases (however rare) where an entire workflow can avoid triggering conversion from f-contiguous blocks to c-contiguous blocks.

Let me know what you all think,
Stephen

On Wed, Mar 20, 2013 at 1:25 AM, Stephen Lin wrote:
> Ahh! I figured it out...the platform issue is part of it, but mostly
> it's that two (independently tested) commits had a weird effect when
> merged.
>
> And the reason they did so is because this particular test turns out
> all of our reindexing tests are testing something very
> non-representative, because of the way they're constructed, so we're
> not really getting representative performance data unfortunately (it
> has to do with the DataFrame constructor and c-contiguity vs
> f-contiguity). We should probably write new tests to fix this issue.
> > I'll write up a fuller explanation when I get a chance. Anyway, sorry > for sending you on a git bisect goose chase, Jeff. > > Stephen > > On Wed, Mar 20, 2013 at 1:01 AM, Stephen Lin wrote: >> As per the "we're getting too chatty on GitHub" comment, should we be >> moving extended issue discussion about bugs to this list whenever >> possible? >> >> I posted a few comments on #3089 just now but realized maybe starting >> an e-mail chain would be better.. >> >> Anyway, I'm looking into the issue, I suspect it's a corner case due >> to an array that's very large in one dimension but small in another, >> and possibly that there's compiler and architecture differences >> causing different results as well....Jeff, do you mind sending me your >> the output of "gcc -dumpmachine" and "gcc -dumpspecs" on the machine >> you ran vb_suite on? >> >> I'll set up a 64-bit dev machine going forward so I can test on both platforms. >> >> Thanks, >> Stephen From swlin at post.harvard.edu Wed Mar 20 19:46:25 2013 From: swlin at post.harvard.edu (Stephen Lin) Date: Wed, 20 Mar 2013 14:46:25 -0400 Subject: [Pandas-dev] #3089 [PERF: regression from 0.10.1] discussion In-Reply-To: References: Message-ID: p.s. also, "triggering the optimization only when the take operation is done along the shorter of the two dimensions" is probably more restrictive than it has to be, but I'm not comfortable hardcoding a lower-limit size for calling memmove (I searched for guidance on setting such a limit appropriately online, but couldn't find any: I think the presumption is usually that it doesn't matter if the compiler does the right thing) On Wed, Mar 20, 2013 at 1:25 AM, Stephen Lin wrote: > Ahh! I figured it out...the platform issue is part of it, but mostly > it's that two (independently tested) commits had a weird effect when > merged. > > And the reason they did so is because this particular test turns out > all of our reindexing tests are testing something very > non-representative, because of the way they're constructed, so we're > not really getting representative performance data unfortunately (it > has to do with the DataFrame constructor and c-contiguity vs > f-contiguity). We should probably write new tests to fix this issue. > > I'll write up a fuller explanation when I get a chance. Anyway, sorry > for sending you on a git bisect goose chase, Jeff. > > Stephen > > On Wed, Mar 20, 2013 at 1:01 AM, Stephen Lin wrote: >> As per the "we're getting too chatty on GitHub" comment, should we be >> moving extended issue discussion about bugs to this list whenever >> possible? >> >> I posted a few comments on #3089 just now but realized maybe starting >> an e-mail chain would be better.. >> >> Anyway, I'm looking into the issue, I suspect it's a corner case due >> to an array that's very large in one dimension but small in another, >> and possibly that there's compiler and architecture differences >> causing different results as well....Jeff, do you mind sending me your >> the output of "gcc -dumpmachine" and "gcc -dumpspecs" on the machine >> you ran vb_suite on? >> >> I'll set up a 64-bit dev machine going forward so I can test on both platforms. >> >> Thanks, >> Stephen From jeffreback at gmail.com Wed Mar 20 19:56:17 2013 From: jeffreback at gmail.com (Jeff Reback) Date: Wed, 20 Mar 2013 14:56:17 -0400 Subject: [Pandas-dev] #3089 [PERF: regression from 0.10.1] discussion In-Reply-To: References: Message-ID: awesome explanation Stephen! 
I'd vote for #2 essentially create a testing constructor (kind of like y-p's mkdf), but creates only a numpy random array, that by default is c-continguous (with option for f ), and then use that where we have (EVERYWHERE)! np.random.randn....... and second I guess if it helps, look at the c/f contiguous ness of the ops where appropriate... my 2c On Wed, Mar 20, 2013 at 2:24 PM, Stephen Lin wrote: > OK, here goes, the issue is the following... > > The optimization is question optimizes to row-by-row or > column-by-column copying for 2-d arrays when possible, namely when: > > 1. the input array (where the array in question is Block.values) is > c-contiguous for takes along axis0 or f-contiguous for takes along > axis1 of the array, and > 2. the contiguity of the output array matches the contiguity of the input > > Almost all the time, Block.values is stored c-contiguously, such that > each row of the Block corresponds to a column of the DataFrame. So the > optimization only really kicks in, effectively, when reindexing along > the column axis of the DataFrame (i.e. axis 0 of the Block); it > basically means we call memmove once per DataFrame column rather than > iterating in a loop and copying elements. This is good because most > sane DataFrame objects are have more rows than columns, so we call > memmove few times (i.e. once per column) for a large block of values > (i.e. all rows for that column at a time), so any overhead from > calling memmove will be outweighed by the benefit of a hand optimized > copy (which probably involves vectorization, alignment/cache > optimization, loop unrolling, etc.) > > C-contiguous blocks result from basically every Pandas operation that > operates on blocks, with the only exceptions of (as far as I can tell) > creating a DataFrame directly from a 2-d ndarray or creating the > transpose of a homogenous DataFrame (but not a heterogenous one) > without copying; this is basically an optimization to avoid creating > the c-contigous version of an array when the f-contiguous one is > already available, but it's the exception rather than the rule and > pretty any modification of the DataFrame will immediately require > reallocation and copying to a new c-contiguous block. > > Unfortunately many of the DataFrame tests, including the two in > question here, are (for simplicity) only testing the case where a > homogenous 2-d data is passed to the DataFrame, which results in > (non-representative) f-contiguous blocks. An additional issue with > this test is that it's creating a very long but thin array (10,000 > long, 4 wide) and reindexing along the index dimension, so row-by-row > (from the DataFrame perspective) copying is done over and over using > memmove on 4 element arrays. Furthermore, the alignment and width in > bytes of each 4 element array happens to be a convenient multiple of > 128bits, which is the multiple required for vectorized SIMD > instructions, so it turns out the element-by-element copying is fairly > efficient when such operations are available (as is guaranteed on > x86-64, but not necessarily x86-32), and the call to memmove has more > overhead than element-by-element copying. > > So the issue is basically only happening because all the following are > true: > > 1. The DataFrame is constructed directly by a 2-d homogenous ndarray > (which has the default c-contiguous continuity, so the block becomes > f-contiguous). > 2. 
There has been no operation after construction of the DataFrame > requiring reallocation of any sort (otherwise the block would become > c-contiguous). > 3. The reindexing is done on the index axis (otherwise no optimization > would be triggered, since it requires the right axis/contiguity > combination). > 4. The DataFrame is long but thin (otherwise memmove would not be > called repeatedly to do small copies). > 5. The C compiler is not inlining memmove properly, for whatever reason, > and > 6. (possibly) The alignment/width of the data happens to be such that > SIMD operations can be used directly, so the overhead of the eliding > the loop is not very great and exceeded by the overhead of the > memmove. > > To be honest, it's common C practice to call memmove/memcpy (the > performance of the two don't really differ from my testing in this > case) even for very small arrays and assuming that the implementation > is sane enough to inline it and do the right thing either way, so I'm > really surprised about #5: I would not have thought it to be an issue > with a modern compiler, since calling memcpy can't do anything but > provide the compiler more, not less, information about your intentions > (and the overhead of the memmove aliasing check is not significant > here). > > Anyway, so it's a corner case, and I didn't catch it originally > because I tested independently the effect of 1) allocates the output > array to be f-contiguous instead of c-contiguous by default when the > input array is f-contiguous and 2) converting loops into memmove when > possible, both of which have a positive performance effect > independently but combine to adversely affect these two tests. > > I can revert the change that "allocates the output array to be > f-contiguous instead of c-contiguous by default when the input array > is f-contiguous", meaning that this optimization will almost never be > triggered for an f-contiguous input array (unless the caller > explicitly provides an output array as f-contiguous), but I'd rather > not because the optimization is actually kind of useful in less > degenerate cases when you want to quickly produce a reindexed version > of a f-contiguous array, for whatever reason, even though the cases > are rarer. > > So I think what I'm going to do instead, to avoid the degenerate case > above, is to trigger the optimization only when the take operation is > done along the shorter of the two dimensions (i.e. so the copied > dimension is the longer of the two): that will definitely fix this > test (since it'll avoid this optimization completely) but I suppose > there might be other degenerate cases I haven't thought about it. I'll > submit a PR later today for this, if no one finds any objection to the > idea. > > However, I think it might be skewed our performance results to be > testing DataFrame objects constructed by 2-d ndarrays, since they're > not representative; in addition to the issue above, it means that many > tests are actually incorporating the cost of converting an > f-contiguous array into a c-contiguous array on top of what they're > actually trying to test. Two possible solutions are: > > 1. Change DataFrame constructor (and possibly DataFrame.T) to > normalize all blocks as c-contiguous. > 2. Leave DataFrame constructor as-is but either change existing tests > to exercise the more common use case (c-contiguous blocks) or add them > in addition to the current ones. 
> > I think #2 is probably best, since #1 will have a performance impact > for the use cases (however rare) where an entire workflow can avoid > triggering conversion from f-contiguous blocks to c-contiguous blocks. > > Let me know what you all think, > Stephen > > On Wed, Mar 20, 2013 at 1:25 AM, Stephen Lin > wrote: > > Ahh! I figured it out...the platform issue is part of it, but mostly > > it's that two (independently tested) commits had a weird effect when > > merged. > > > > And the reason they did so is because this particular test turns out > > all of our reindexing tests are testing something very > > non-representative, because of the way they're constructed, so we're > > not really getting representative performance data unfortunately (it > > has to do with the DataFrame constructor and c-contiguity vs > > f-contiguity). We should probably write new tests to fix this issue. > > > > I'll write up a fuller explanation when I get a chance. Anyway, sorry > > for sending you on a git bisect goose chase, Jeff. > > > > Stephen > > > > On Wed, Mar 20, 2013 at 1:01 AM, Stephen Lin > wrote: > >> As per the "we're getting too chatty on GitHub" comment, should we be > >> moving extended issue discussion about bugs to this list whenever > >> possible? > >> > >> I posted a few comments on #3089 just now but realized maybe starting > >> an e-mail chain would be better.. > >> > >> Anyway, I'm looking into the issue, I suspect it's a corner case due > >> to an array that's very large in one dimension but small in another, > >> and possibly that there's compiler and architecture differences > >> causing different results as well....Jeff, do you mind sending me your > >> the output of "gcc -dumpmachine" and "gcc -dumpspecs" on the machine > >> you ran vb_suite on? > >> > >> I'll set up a 64-bit dev machine going forward so I can test on both > platforms. > >> > >> Thanks, > >> Stephen > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From swlin at post.harvard.edu Wed Mar 20 20:56:35 2013 From: swlin at post.harvard.edu (Stephen Lin) Date: Wed, 20 Mar 2013 15:56:35 -0400 Subject: [Pandas-dev] #3089 [PERF: regression from 0.10.1] discussion In-Reply-To: References: Message-ID: Thanks Jeff! So ignoring the testing methodology issue for now, I've done the small fix suggested but apparently it *is* too restrictive because it negatively affects two other tests that were previously improved (so the two "degenerate" tests improved 25% by adding the restriction while these two tests regressed 25%). I will do some more testing to see if I can find a justifiable way of avoiding this degenerate case, (hopefully) without hardcoding a magic number... (But maybe we should just not bother with this degenerate case anyway, perhaps? I'm a fan of making all improvements monotonic, so I'd prefer not to have to regress this case even if it's degenerate, but I don't know yet how reliably I can do that for situations and all processor/compiler/OS combinations...) 
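For reference, the restriction being discussed boils down to a guard along these lines; this is a rough sketch with illustrative names, not the actual Cython take routine in pandas, which also has to account for dtype, strides, and the output array's contiguity:

    def use_memmove(values, take_axis):
        # Hypothetical guard: each memmove copies one contiguous run along the
        # non-take axis, so only elide the element-by-element loop when that
        # run spans the longer of the two dimensions and the per-call overhead
        # is amortized over many elements; otherwise fall back to the loop.
        run_length = values.shape[1 - take_axis]   # elements copied per memmove
        take_length = values.shape[take_axis]      # dimension being taken along
        return run_length >= take_length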
Also, Jeff, I reviewed my vbenches vs the ones you published on GitHub for this issue, and I think the reason that some of my larger performance impacts are not shown in your results is because of the vectorization issue (you ARE on 64-bit, right?)...I'm not 100% sure but I really think it's likely that it's because x86-64 allows more vectorization optimizations even without memmove, so the effect of this optimization is not that great. However, there's plenty of people still using 32-bit OSes (I have a 64-bit machine but just never bothered to install 64-bit Ubuntu), so it's definitely worthwhile still to do this. In any case, I believe that VC++9 (i.e. 2008) (which still hosts the pre-built binary windows build still, I think? correct me if I'm wrong) does rather poorly on vectorization, even when it's allowed. Worse, though, it's usually not allowed because Windows 32-bit builds generally have to assume lowest-common-denominator hardware (SSE, which is from Pentium III, and SSE2, from Pentium IV, only became a requirements to install Windows with Windows *8* :D) since they are not compiled on the user machine. (You can only avoid this by abandoning compatibility with older machines or going through hoops to detect CPUID at runtime and modifying program behavior accordingly, which I don't think Cython does.) Anyway, I'll fill in with more info when I have some. Stephen On Wed, Mar 20, 2013 at 2:56 PM, Jeff Reback wrote: > awesome explanation Stephen! > > I'd vote for #2 > > essentially create a testing constructor (kind of like y-p's mkdf), > but creates only a numpy random array, that by default is c-continguous > (with option for f ), and then use that where we have (EVERYWHERE)! > np.random.randn....... > > and second I guess if it helps, look at the c/f contiguous ness > of the ops where appropriate... > > my 2c > > > > > On Wed, Mar 20, 2013 at 2:24 PM, Stephen Lin wrote: >> >> OK, here goes, the issue is the following... >> >> The optimization is question optimizes to row-by-row or >> column-by-column copying for 2-d arrays when possible, namely when: >> >> 1. the input array (where the array in question is Block.values) is >> c-contiguous for takes along axis0 or f-contiguous for takes along >> axis1 of the array, and >> 2. the contiguity of the output array matches the contiguity of the input >> >> Almost all the time, Block.values is stored c-contiguously, such that >> each row of the Block corresponds to a column of the DataFrame. So the >> optimization only really kicks in, effectively, when reindexing along >> the column axis of the DataFrame (i.e. axis 0 of the Block); it >> basically means we call memmove once per DataFrame column rather than >> iterating in a loop and copying elements. This is good because most >> sane DataFrame objects are have more rows than columns, so we call >> memmove few times (i.e. once per column) for a large block of values >> (i.e. all rows for that column at a time), so any overhead from >> calling memmove will be outweighed by the benefit of a hand optimized >> copy (which probably involves vectorization, alignment/cache >> optimization, loop unrolling, etc.) 
>> >> C-contiguous blocks result from basically every Pandas operation that >> operates on blocks, with the only exceptions of (as far as I can tell) >> creating a DataFrame directly from a 2-d ndarray or creating the >> transpose of a homogenous DataFrame (but not a heterogenous one) >> without copying; this is basically an optimization to avoid creating >> the c-contigous version of an array when the f-contiguous one is >> already available, but it's the exception rather than the rule and >> pretty any modification of the DataFrame will immediately require >> reallocation and copying to a new c-contiguous block. >> >> Unfortunately many of the DataFrame tests, including the two in >> question here, are (for simplicity) only testing the case where a >> homogenous 2-d data is passed to the DataFrame, which results in >> (non-representative) f-contiguous blocks. An additional issue with >> this test is that it's creating a very long but thin array (10,000 >> long, 4 wide) and reindexing along the index dimension, so row-by-row >> (from the DataFrame perspective) copying is done over and over using >> memmove on 4 element arrays. Furthermore, the alignment and width in >> bytes of each 4 element array happens to be a convenient multiple of >> 128bits, which is the multiple required for vectorized SIMD >> instructions, so it turns out the element-by-element copying is fairly >> efficient when such operations are available (as is guaranteed on >> x86-64, but not necessarily x86-32), and the call to memmove has more >> overhead than element-by-element copying. >> >> So the issue is basically only happening because all the following are >> true: >> >> 1. The DataFrame is constructed directly by a 2-d homogenous ndarray >> (which has the default c-contiguous continuity, so the block becomes >> f-contiguous). >> 2. There has been no operation after construction of the DataFrame >> requiring reallocation of any sort (otherwise the block would become >> c-contiguous). >> 3. The reindexing is done on the index axis (otherwise no optimization >> would be triggered, since it requires the right axis/contiguity >> combination). >> 4. The DataFrame is long but thin (otherwise memmove would not be >> called repeatedly to do small copies). >> 5. The C compiler is not inlining memmove properly, for whatever reason, >> and >> 6. (possibly) The alignment/width of the data happens to be such that >> SIMD operations can be used directly, so the overhead of the eliding >> the loop is not very great and exceeded by the overhead of the >> memmove. >> >> To be honest, it's common C practice to call memmove/memcpy (the >> performance of the two don't really differ from my testing in this >> case) even for very small arrays and assuming that the implementation >> is sane enough to inline it and do the right thing either way, so I'm >> really surprised about #5: I would not have thought it to be an issue >> with a modern compiler, since calling memcpy can't do anything but >> provide the compiler more, not less, information about your intentions >> (and the overhead of the memmove aliasing check is not significant >> here). 
>> >> Anyway, so it's a corner case, and I didn't catch it originally >> because I tested independently the effect of 1) allocates the output >> array to be f-contiguous instead of c-contiguous by default when the >> input array is f-contiguous and 2) converting loops into memmove when >> possible, both of which have a positive performance effect >> independently but combine to adversely affect these two tests. >> >> I can revert the change that "allocates the output array to be >> f-contiguous instead of c-contiguous by default when the input array >> is f-contiguous", meaning that this optimization will almost never be >> triggered for an f-contiguous input array (unless the caller >> explicitly provides an output array as f-contiguous), but I'd rather >> not because the optimization is actually kind of useful in less >> degenerate cases when you want to quickly produce a reindexed version >> of a f-contiguous array, for whatever reason, even though the cases >> are rarer. >> >> So I think what I'm going to do instead, to avoid the degenerate case >> above, is to trigger the optimization only when the take operation is >> done along the shorter of the two dimensions (i.e. so the copied >> dimension is the longer of the two): that will definitely fix this >> test (since it'll avoid this optimization completely) but I suppose >> there might be other degenerate cases I haven't thought about it. I'll >> submit a PR later today for this, if no one finds any objection to the >> idea. >> >> However, I think it might be skewed our performance results to be >> testing DataFrame objects constructed by 2-d ndarrays, since they're >> not representative; in addition to the issue above, it means that many >> tests are actually incorporating the cost of converting an >> f-contiguous array into a c-contiguous array on top of what they're >> actually trying to test. Two possible solutions are: >> >> 1. Change DataFrame constructor (and possibly DataFrame.T) to >> normalize all blocks as c-contiguous. >> 2. Leave DataFrame constructor as-is but either change existing tests >> to exercise the more common use case (c-contiguous blocks) or add them >> in addition to the current ones. >> >> I think #2 is probably best, since #1 will have a performance impact >> for the use cases (however rare) where an entire workflow can avoid >> triggering conversion from f-contiguous blocks to c-contiguous blocks. >> >> Let me know what you all think, >> Stephen >> >> On Wed, Mar 20, 2013 at 1:25 AM, Stephen Lin >> wrote: >> > Ahh! I figured it out...the platform issue is part of it, but mostly >> > it's that two (independently tested) commits had a weird effect when >> > merged. >> > >> > And the reason they did so is because this particular test turns out >> > all of our reindexing tests are testing something very >> > non-representative, because of the way they're constructed, so we're >> > not really getting representative performance data unfortunately (it >> > has to do with the DataFrame constructor and c-contiguity vs >> > f-contiguity). We should probably write new tests to fix this issue. >> > >> > I'll write up a fuller explanation when I get a chance. Anyway, sorry >> > for sending you on a git bisect goose chase, Jeff. >> > >> > Stephen >> > >> > On Wed, Mar 20, 2013 at 1:01 AM, Stephen Lin >> > wrote: >> >> As per the "we're getting too chatty on GitHub" comment, should we be >> >> moving extended issue discussion about bugs to this list whenever >> >> possible? 
>> >> I posted a few comments on #3089 just now but realized maybe starting
>> an e-mail chain would be better..
>>
>> Anyway, I'm looking into the issue, I suspect it's a corner case due
>> to an array that's very large in one dimension but small in another,
>> and possibly that there's compiler and architecture differences
>> causing different results as well....Jeff, do you mind sending me your
>> the output of "gcc -dumpmachine" and "gcc -dumpspecs" on the machine
>> you ran vb_suite on?
>>
>> I'll set up a 64-bit dev machine going forward so I can test on both
>> platforms.
>>
>> Thanks,
>> Stephen
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> http://mail.python.org/mailman/listinfo/pandas-dev
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> http://mail.python.org/mailman/listinfo/pandas-dev
>

From yoval at gmx.com Wed Mar 20 21:05:14 2013 From: yoval at gmx.com (yoval p.) Date: Wed, 20 Mar 2013 21:05:14 +0100 Subject: [Pandas-dev] Fast py2/py3 testing, fast vbench Message-ID: <20130320200515.28610@gmx.com>

I've made some improvements to the tooling we have for development; just making sure everyone is aware of what's available.

- Closed GH3099, caching Cython build artifacts when running setup.py.
- setup.py now checks for the BUILD_CACHE_DIR envar, so you can enable caching without touching the source code.
- Once enabled, with a warm cache, testing py26/27/32/33 takes only a couple of minutes on a quad-core machine, compared with Travis's ~15.
- If caching is enabled (for future commits, the envar is sufficient), test_perf.sh will run much faster.
- I've added an option to filter vbenches by regex when running test_perf.sh.

Quick iteration makes everything easier; I hope these changes do that. Here's an example of all of the above, comparing two adjacent commits on a reduced set of vbenches in 1 min flat:

$ export BUILD_CACHE_DIR="/tmp/.pandas_build_cache/"
$ time ./test_perf.sh -b 18c7e6c -t 18c7e6c^ -r reindex
...
Results:
                                  t_head  t_baseline   ratio
name
dataframe_reindex                 0.3726      0.3726  1.0000
reindex_fillna_backfill_float32   0.0961      0.0961  1.0000
reindex_fillna_pad_float32        0.0959      0.0959  1.0000
frame_reindex_upcast             17.7334     17.7334  1.0000
reindex_daterange_backfill        0.1649      0.1649  1.0000
reindex_fillna_pad                0.1052      0.1052  1.0000
reindex_daterange_pad             0.1757      0.1757  1.0000
reindex_frame_level_align         1.0109      1.0109  1.0000
reindex_fillna_backfill           0.1035      0.1035  1.0000
reindex_frame_level_reindex       0.9586      0.9586  1.0000
frame_reindex_columns             0.3101      0.3101  1.0000
reindex_multiindex                1.1427      1.1427  1.0000

Columns: test_name | target_duration [ms] | baseline_duration [ms] | ratio
- a ratio of 1.30 means the target commit is 30% slower than the baseline.

Target [18c7e6c] : BLD: check for BUILD_CACHE_DIR envar in setup.py
Baseline [18c7e6c] : BLD: check for BUILD_CACHE_DIR envar in setup.py

*** Results were also written to the logfile at '/home/user1/src/pandas/vb_suite.log'

real 0m58.561s
user 0m52.699s
sys 0m1.645s
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From swlin at post.harvard.edu Wed Mar 20 21:07:15 2013 From: swlin at post.harvard.edu (Stephen Lin) Date: Wed, 20 Mar 2013 16:07:15 -0400 Subject: [Pandas-dev] #3089 [PERF: regression from 0.10.1] discussion In-Reply-To: <25AE4CC2-1087-4332-92F2-CD11B8080F03@yahoo.com> References: <25AE4CC2-1087-4332-92F2-CD11B8080F03@yahoo.com> Message-ID: So we can just ignore this and write new tests, if everyone thinks that's appropriate. Anyone else have thoughts? I'd prefer monotonic improvement myself, honestly...but it might not be a reasonable ideal to strive for in this case. Stephen On Wed, Mar 20, 2013 at 4:02 PM, Jeff Reback wrote: > I am on 64bit Linux (I use windows too, but try to avoid whenever possible!) > > I agree with your assessment wrt 32/64 > and perf - > > I am not sure that these corner cases r that big a deal, more important I think is that we test the most common cases for perf > > > On Mar 20, 2013, at 3:56 PM, Stephen Lin wrote: > >> Thanks Jeff! >> >> So ignoring the testing methodology issue for now, I've done the small >> fix suggested but apparently it *is* too restrictive because it >> negatively affects two other tests that were previously improved (so >> the two "degenerate" tests improved 25% by adding the restriction >> while these two tests regressed 25%). I will do some more testing to >> see if I can find a justifiable way of avoiding this degenerate case, >> (hopefully) without hardcoding a magic number... (But maybe we should >> just not bother with this degenerate case anyway, perhaps? I'm a fan >> of making all improvements monotonic, so I'd prefer not to have to >> regress this case even if it's degenerate, but I don't know yet how >> reliably I can do that for situations and all processor/compiler/OS >> combinations...) >> >> Also, Jeff, I reviewed my vbenches vs the ones you published on GitHub >> for this issue, and I think the reason that some of my larger >> performance impacts are not shown in your results is because of the >> vectorization issue (you ARE on 64-bit, right?)...I'm not 100% sure >> but I really think it's likely that it's because x86-64 allows more >> vectorization optimizations even without memmove, so the effect of >> this optimization is not that great. However, there's plenty of people >> still using 32-bit OSes (I have a 64-bit machine but just never >> bothered to install 64-bit Ubuntu), so it's definitely worthwhile >> still to do this. >> >> In any case, I believe that VC++9 (i.e. 2008) (which still hosts the >> pre-built binary windows build still, I think? correct me if I'm >> wrong) does rather poorly on vectorization, even when it's allowed. >> Worse, though, it's usually not allowed because Windows 32-bit builds >> generally have to assume lowest-common-denominator hardware (SSE, >> which is from Pentium III, and SSE2, from Pentium IV, only became a >> requirements to install Windows with Windows *8* :D) since they are >> not compiled on the user machine. (You can only avoid this by >> abandoning compatibility with older machines or going through hoops to >> detect CPUID at runtime and modifying program behavior accordingly, >> which I don't think Cython does.) >> >> Anyway, I'll fill in with more info when I have some. >> >> Stephen >> >> On Wed, Mar 20, 2013 at 2:56 PM, Jeff Reback wrote: >>> awesome explanation Stephen! 
>>> >>> I'd vote for #2 >>> >>> essentially create a testing constructor (kind of like y-p's mkdf), >>> but creates only a numpy random array, that by default is c-continguous >>> (with option for f ), and then use that where we have (EVERYWHERE)! >>> np.random.randn....... >>> >>> and second I guess if it helps, look at the c/f contiguous ness >>> of the ops where appropriate... >>> >>> my 2c >>> >>> >>> >>> >>> On Wed, Mar 20, 2013 at 2:24 PM, Stephen Lin wrote: >>>> >>>> OK, here goes, the issue is the following... >>>> >>>> The optimization is question optimizes to row-by-row or >>>> column-by-column copying for 2-d arrays when possible, namely when: >>>> >>>> 1. the input array (where the array in question is Block.values) is >>>> c-contiguous for takes along axis0 or f-contiguous for takes along >>>> axis1 of the array, and >>>> 2. the contiguity of the output array matches the contiguity of the input >>>> >>>> Almost all the time, Block.values is stored c-contiguously, such that >>>> each row of the Block corresponds to a column of the DataFrame. So the >>>> optimization only really kicks in, effectively, when reindexing along >>>> the column axis of the DataFrame (i.e. axis 0 of the Block); it >>>> basically means we call memmove once per DataFrame column rather than >>>> iterating in a loop and copying elements. This is good because most >>>> sane DataFrame objects are have more rows than columns, so we call >>>> memmove few times (i.e. once per column) for a large block of values >>>> (i.e. all rows for that column at a time), so any overhead from >>>> calling memmove will be outweighed by the benefit of a hand optimized >>>> copy (which probably involves vectorization, alignment/cache >>>> optimization, loop unrolling, etc.) >>>> >>>> C-contiguous blocks result from basically every Pandas operation that >>>> operates on blocks, with the only exceptions of (as far as I can tell) >>>> creating a DataFrame directly from a 2-d ndarray or creating the >>>> transpose of a homogenous DataFrame (but not a heterogenous one) >>>> without copying; this is basically an optimization to avoid creating >>>> the c-contigous version of an array when the f-contiguous one is >>>> already available, but it's the exception rather than the rule and >>>> pretty any modification of the DataFrame will immediately require >>>> reallocation and copying to a new c-contiguous block. >>>> >>>> Unfortunately many of the DataFrame tests, including the two in >>>> question here, are (for simplicity) only testing the case where a >>>> homogenous 2-d data is passed to the DataFrame, which results in >>>> (non-representative) f-contiguous blocks. An additional issue with >>>> this test is that it's creating a very long but thin array (10,000 >>>> long, 4 wide) and reindexing along the index dimension, so row-by-row >>>> (from the DataFrame perspective) copying is done over and over using >>>> memmove on 4 element arrays. Furthermore, the alignment and width in >>>> bytes of each 4 element array happens to be a convenient multiple of >>>> 128bits, which is the multiple required for vectorized SIMD >>>> instructions, so it turns out the element-by-element copying is fairly >>>> efficient when such operations are available (as is guaranteed on >>>> x86-64, but not necessarily x86-32), and the call to memmove has more >>>> overhead than element-by-element copying. >>>> >>>> So the issue is basically only happening because all the following are >>>> true: >>>> >>>> 1. 
The DataFrame is constructed directly by a 2-d homogenous ndarray >>>> (which has the default c-contiguous continuity, so the block becomes >>>> f-contiguous). >>>> 2. There has been no operation after construction of the DataFrame >>>> requiring reallocation of any sort (otherwise the block would become >>>> c-contiguous). >>>> 3. The reindexing is done on the index axis (otherwise no optimization >>>> would be triggered, since it requires the right axis/contiguity >>>> combination). >>>> 4. The DataFrame is long but thin (otherwise memmove would not be >>>> called repeatedly to do small copies). >>>> 5. The C compiler is not inlining memmove properly, for whatever reason, >>>> and >>>> 6. (possibly) The alignment/width of the data happens to be such that >>>> SIMD operations can be used directly, so the overhead of the eliding >>>> the loop is not very great and exceeded by the overhead of the >>>> memmove. >>>> >>>> To be honest, it's common C practice to call memmove/memcpy (the >>>> performance of the two don't really differ from my testing in this >>>> case) even for very small arrays and assuming that the implementation >>>> is sane enough to inline it and do the right thing either way, so I'm >>>> really surprised about #5: I would not have thought it to be an issue >>>> with a modern compiler, since calling memcpy can't do anything but >>>> provide the compiler more, not less, information about your intentions >>>> (and the overhead of the memmove aliasing check is not significant >>>> here). >>>> >>>> Anyway, so it's a corner case, and I didn't catch it originally >>>> because I tested independently the effect of 1) allocates the output >>>> array to be f-contiguous instead of c-contiguous by default when the >>>> input array is f-contiguous and 2) converting loops into memmove when >>>> possible, both of which have a positive performance effect >>>> independently but combine to adversely affect these two tests. >>>> >>>> I can revert the change that "allocates the output array to be >>>> f-contiguous instead of c-contiguous by default when the input array >>>> is f-contiguous", meaning that this optimization will almost never be >>>> triggered for an f-contiguous input array (unless the caller >>>> explicitly provides an output array as f-contiguous), but I'd rather >>>> not because the optimization is actually kind of useful in less >>>> degenerate cases when you want to quickly produce a reindexed version >>>> of a f-contiguous array, for whatever reason, even though the cases >>>> are rarer. >>>> >>>> So I think what I'm going to do instead, to avoid the degenerate case >>>> above, is to trigger the optimization only when the take operation is >>>> done along the shorter of the two dimensions (i.e. so the copied >>>> dimension is the longer of the two): that will definitely fix this >>>> test (since it'll avoid this optimization completely) but I suppose >>>> there might be other degenerate cases I haven't thought about it. I'll >>>> submit a PR later today for this, if no one finds any objection to the >>>> idea. >>>> >>>> However, I think it might be skewed our performance results to be >>>> testing DataFrame objects constructed by 2-d ndarrays, since they're >>>> not representative; in addition to the issue above, it means that many >>>> tests are actually incorporating the cost of converting an >>>> f-contiguous array into a c-contiguous array on top of what they're >>>> actually trying to test. Two possible solutions are: >>>> >>>> 1. 
Change DataFrame constructor (and possibly DataFrame.T) to >>>> normalize all blocks as c-contiguous. >>>> 2. Leave DataFrame constructor as-is but either change existing tests >>>> to exercise the more common use case (c-contiguous blocks) or add them >>>> in addition to the current ones. >>>> >>>> I think #2 is probably best, since #1 will have a performance impact >>>> for the use cases (however rare) where an entire workflow can avoid >>>> triggering conversion from f-contiguous blocks to c-contiguous blocks. >>>> >>>> Let me know what you all think, >>>> Stephen >>>> >>>> On Wed, Mar 20, 2013 at 1:25 AM, Stephen Lin >>>> wrote: >>>>> Ahh! I figured it out...the platform issue is part of it, but mostly >>>>> it's that two (independently tested) commits had a weird effect when >>>>> merged. >>>>> >>>>> And the reason they did so is because this particular test turns out >>>>> all of our reindexing tests are testing something very >>>>> non-representative, because of the way they're constructed, so we're >>>>> not really getting representative performance data unfortunately (it >>>>> has to do with the DataFrame constructor and c-contiguity vs >>>>> f-contiguity). We should probably write new tests to fix this issue. >>>>> >>>>> I'll write up a fuller explanation when I get a chance. Anyway, sorry >>>>> for sending you on a git bisect goose chase, Jeff. >>>>> >>>>> Stephen >>>>> >>>>> On Wed, Mar 20, 2013 at 1:01 AM, Stephen Lin >>>>> wrote: >>>>>> As per the "we're getting too chatty on GitHub" comment, should we be >>>>>> moving extended issue discussion about bugs to this list whenever >>>>>> possible? >>>>>> >>>>>> I posted a few comments on #3089 just now but realized maybe starting >>>>>> an e-mail chain would be better.. >>>>>> >>>>>> Anyway, I'm looking into the issue, I suspect it's a corner case due >>>>>> to an array that's very large in one dimension but small in another, >>>>>> and possibly that there's compiler and architecture differences >>>>>> causing different results as well....Jeff, do you mind sending me your >>>>>> the output of "gcc -dumpmachine" and "gcc -dumpspecs" on the machine >>>>>> you ran vb_suite on? >>>>>> >>>>>> I'll set up a 64-bit dev machine going forward so I can test on both >>>>>> platforms. >>>>>> >>>>>> Thanks, >>>>>> Stephen >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> http://mail.python.org/mailman/listinfo/pandas-dev >>> >>> >>> >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> http://mail.python.org/mailman/listinfo/pandas-dev >>> From swlin at post.harvard.edu Wed Mar 20 21:12:19 2013 From: swlin at post.harvard.edu (Stephen Lin) Date: Wed, 20 Mar 2013 16:12:19 -0400 Subject: [Pandas-dev] #3089 [PERF: regression from 0.10.1] discussion In-Reply-To: References: <25AE4CC2-1087-4332-92F2-CD11B8080F03@yahoo.com> Message-ID: (also, just one more thing to note...native numpy take operations ARE using memmove, so it's possible that prior to using memmove ourselves, we were actually performing more poorly than numpy with our Cython takes in some cases, if the host platform/compiler/OS optimizes memmove much better than normal array ops...which is not out of the question for Windows builds) On Wed, Mar 20, 2013 at 4:07 PM, Stephen Lin wrote: > So we can just ignore this and write new tests, if everyone thinks > that's appropriate. Anyone else have thoughts? 
> > I'd prefer monotonic improvement myself, honestly...but it might not > be a reasonable ideal to strive for in this case. > > Stephen > > On Wed, Mar 20, 2013 at 4:02 PM, Jeff Reback wrote: >> I am on 64bit Linux (I use windows too, but try to avoid whenever possible!) >> >> I agree with your assessment wrt 32/64 >> and perf - >> >> I am not sure that these corner cases r that big a deal, more important I think is that we test the most common cases for perf >> >> >> On Mar 20, 2013, at 3:56 PM, Stephen Lin wrote: >> >>> Thanks Jeff! >>> >>> So ignoring the testing methodology issue for now, I've done the small >>> fix suggested but apparently it *is* too restrictive because it >>> negatively affects two other tests that were previously improved (so >>> the two "degenerate" tests improved 25% by adding the restriction >>> while these two tests regressed 25%). I will do some more testing to >>> see if I can find a justifiable way of avoiding this degenerate case, >>> (hopefully) without hardcoding a magic number... (But maybe we should >>> just not bother with this degenerate case anyway, perhaps? I'm a fan >>> of making all improvements monotonic, so I'd prefer not to have to >>> regress this case even if it's degenerate, but I don't know yet how >>> reliably I can do that for situations and all processor/compiler/OS >>> combinations...) >>> >>> Also, Jeff, I reviewed my vbenches vs the ones you published on GitHub >>> for this issue, and I think the reason that some of my larger >>> performance impacts are not shown in your results is because of the >>> vectorization issue (you ARE on 64-bit, right?)...I'm not 100% sure >>> but I really think it's likely that it's because x86-64 allows more >>> vectorization optimizations even without memmove, so the effect of >>> this optimization is not that great. However, there's plenty of people >>> still using 32-bit OSes (I have a 64-bit machine but just never >>> bothered to install 64-bit Ubuntu), so it's definitely worthwhile >>> still to do this. >>> >>> In any case, I believe that VC++9 (i.e. 2008) (which still hosts the >>> pre-built binary windows build still, I think? correct me if I'm >>> wrong) does rather poorly on vectorization, even when it's allowed. >>> Worse, though, it's usually not allowed because Windows 32-bit builds >>> generally have to assume lowest-common-denominator hardware (SSE, >>> which is from Pentium III, and SSE2, from Pentium IV, only became a >>> requirements to install Windows with Windows *8* :D) since they are >>> not compiled on the user machine. (You can only avoid this by >>> abandoning compatibility with older machines or going through hoops to >>> detect CPUID at runtime and modifying program behavior accordingly, >>> which I don't think Cython does.) >>> >>> Anyway, I'll fill in with more info when I have some. >>> >>> Stephen >>> >>> On Wed, Mar 20, 2013 at 2:56 PM, Jeff Reback wrote: >>>> awesome explanation Stephen! >>>> >>>> I'd vote for #2 >>>> >>>> essentially create a testing constructor (kind of like y-p's mkdf), >>>> but creates only a numpy random array, that by default is c-continguous >>>> (with option for f ), and then use that where we have (EVERYWHERE)! >>>> np.random.randn....... >>>> >>>> and second I guess if it helps, look at the c/f contiguous ness >>>> of the ops where appropriate... >>>> >>>> my 2c >>>> >>>> >>>> >>>> >>>> On Wed, Mar 20, 2013 at 2:24 PM, Stephen Lin wrote: >>>>> >>>>> OK, here goes, the issue is the following... 
>>>>> >>>>> The optimization is question optimizes to row-by-row or >>>>> column-by-column copying for 2-d arrays when possible, namely when: >>>>> >>>>> 1. the input array (where the array in question is Block.values) is >>>>> c-contiguous for takes along axis0 or f-contiguous for takes along >>>>> axis1 of the array, and >>>>> 2. the contiguity of the output array matches the contiguity of the input >>>>> >>>>> Almost all the time, Block.values is stored c-contiguously, such that >>>>> each row of the Block corresponds to a column of the DataFrame. So the >>>>> optimization only really kicks in, effectively, when reindexing along >>>>> the column axis of the DataFrame (i.e. axis 0 of the Block); it >>>>> basically means we call memmove once per DataFrame column rather than >>>>> iterating in a loop and copying elements. This is good because most >>>>> sane DataFrame objects are have more rows than columns, so we call >>>>> memmove few times (i.e. once per column) for a large block of values >>>>> (i.e. all rows for that column at a time), so any overhead from >>>>> calling memmove will be outweighed by the benefit of a hand optimized >>>>> copy (which probably involves vectorization, alignment/cache >>>>> optimization, loop unrolling, etc.) >>>>> >>>>> C-contiguous blocks result from basically every Pandas operation that >>>>> operates on blocks, with the only exceptions of (as far as I can tell) >>>>> creating a DataFrame directly from a 2-d ndarray or creating the >>>>> transpose of a homogenous DataFrame (but not a heterogenous one) >>>>> without copying; this is basically an optimization to avoid creating >>>>> the c-contigous version of an array when the f-contiguous one is >>>>> already available, but it's the exception rather than the rule and >>>>> pretty any modification of the DataFrame will immediately require >>>>> reallocation and copying to a new c-contiguous block. >>>>> >>>>> Unfortunately many of the DataFrame tests, including the two in >>>>> question here, are (for simplicity) only testing the case where a >>>>> homogenous 2-d data is passed to the DataFrame, which results in >>>>> (non-representative) f-contiguous blocks. An additional issue with >>>>> this test is that it's creating a very long but thin array (10,000 >>>>> long, 4 wide) and reindexing along the index dimension, so row-by-row >>>>> (from the DataFrame perspective) copying is done over and over using >>>>> memmove on 4 element arrays. Furthermore, the alignment and width in >>>>> bytes of each 4 element array happens to be a convenient multiple of >>>>> 128bits, which is the multiple required for vectorized SIMD >>>>> instructions, so it turns out the element-by-element copying is fairly >>>>> efficient when such operations are available (as is guaranteed on >>>>> x86-64, but not necessarily x86-32), and the call to memmove has more >>>>> overhead than element-by-element copying. >>>>> >>>>> So the issue is basically only happening because all the following are >>>>> true: >>>>> >>>>> 1. The DataFrame is constructed directly by a 2-d homogenous ndarray >>>>> (which has the default c-contiguous continuity, so the block becomes >>>>> f-contiguous). >>>>> 2. There has been no operation after construction of the DataFrame >>>>> requiring reallocation of any sort (otherwise the block would become >>>>> c-contiguous). >>>>> 3. The reindexing is done on the index axis (otherwise no optimization >>>>> would be triggered, since it requires the right axis/contiguity >>>>> combination). >>>>> 4. 
The DataFrame is long but thin (otherwise memmove would not be >>>>> called repeatedly to do small copies). >>>>> 5. The C compiler is not inlining memmove properly, for whatever reason, >>>>> and >>>>> 6. (possibly) The alignment/width of the data happens to be such that >>>>> SIMD operations can be used directly, so the overhead of the eliding >>>>> the loop is not very great and exceeded by the overhead of the >>>>> memmove. >>>>> >>>>> To be honest, it's common C practice to call memmove/memcpy (the >>>>> performance of the two don't really differ from my testing in this >>>>> case) even for very small arrays and assuming that the implementation >>>>> is sane enough to inline it and do the right thing either way, so I'm >>>>> really surprised about #5: I would not have thought it to be an issue >>>>> with a modern compiler, since calling memcpy can't do anything but >>>>> provide the compiler more, not less, information about your intentions >>>>> (and the overhead of the memmove aliasing check is not significant >>>>> here). >>>>> >>>>> Anyway, so it's a corner case, and I didn't catch it originally >>>>> because I tested independently the effect of 1) allocates the output >>>>> array to be f-contiguous instead of c-contiguous by default when the >>>>> input array is f-contiguous and 2) converting loops into memmove when >>>>> possible, both of which have a positive performance effect >>>>> independently but combine to adversely affect these two tests. >>>>> >>>>> I can revert the change that "allocates the output array to be >>>>> f-contiguous instead of c-contiguous by default when the input array >>>>> is f-contiguous", meaning that this optimization will almost never be >>>>> triggered for an f-contiguous input array (unless the caller >>>>> explicitly provides an output array as f-contiguous), but I'd rather >>>>> not because the optimization is actually kind of useful in less >>>>> degenerate cases when you want to quickly produce a reindexed version >>>>> of a f-contiguous array, for whatever reason, even though the cases >>>>> are rarer. >>>>> >>>>> So I think what I'm going to do instead, to avoid the degenerate case >>>>> above, is to trigger the optimization only when the take operation is >>>>> done along the shorter of the two dimensions (i.e. so the copied >>>>> dimension is the longer of the two): that will definitely fix this >>>>> test (since it'll avoid this optimization completely) but I suppose >>>>> there might be other degenerate cases I haven't thought about it. I'll >>>>> submit a PR later today for this, if no one finds any objection to the >>>>> idea. >>>>> >>>>> However, I think it might be skewed our performance results to be >>>>> testing DataFrame objects constructed by 2-d ndarrays, since they're >>>>> not representative; in addition to the issue above, it means that many >>>>> tests are actually incorporating the cost of converting an >>>>> f-contiguous array into a c-contiguous array on top of what they're >>>>> actually trying to test. Two possible solutions are: >>>>> >>>>> 1. Change DataFrame constructor (and possibly DataFrame.T) to >>>>> normalize all blocks as c-contiguous. >>>>> 2. Leave DataFrame constructor as-is but either change existing tests >>>>> to exercise the more common use case (c-contiguous blocks) or add them >>>>> in addition to the current ones. 
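[As an aside, a quick way to see the two block layouts being discussed, plus a sketch of the kind of test constructor option #2 would want; the `_data.blocks` peek relies on pandas-internal attributes of this era and the `rand_frame` helper name is made up, so treat this as illustrative only:]

```python
import numpy as np
import pandas as pd

# Degenerate benchmark case: long, thin, built straight from a 2-d ndarray,
# so the single block is the (no-copy) F-contiguous transpose of the input.
df_f = pd.DataFrame(np.random.randn(10000, 4))
print(df_f._data.blocks[0].values.flags['F_CONTIGUOUS'])   # True

# Most real workflows reallocate at least once; an explicit copy is one cheap
# way to get the representative C-contiguous block layout.
df_c = df_f.copy()
print(df_c._data.blocks[0].values.flags['C_CONTIGUOUS'])   # True

def rand_frame(nrows, ncols, c_contiguous=True):
    """Hypothetical test constructor: random data, C-contiguous block by
    default, F-contiguous on request for the rarer no-copy cases."""
    df = pd.DataFrame(np.random.randn(nrows, ncols))
    return df.copy() if c_contiguous else df
```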
>>>>> >>>>> I think #2 is probably best, since #1 will have a performance impact >>>>> for the use cases (however rare) where an entire workflow can avoid >>>>> triggering conversion from f-contiguous blocks to c-contiguous blocks. >>>>> >>>>> Let me know what you all think, >>>>> Stephen >>>>> >>>>> On Wed, Mar 20, 2013 at 1:25 AM, Stephen Lin >>>>> wrote: >>>>>> Ahh! I figured it out...the platform issue is part of it, but mostly >>>>>> it's that two (independently tested) commits had a weird effect when >>>>>> merged. >>>>>> >>>>>> And the reason they did so is because this particular test turns out >>>>>> all of our reindexing tests are testing something very >>>>>> non-representative, because of the way they're constructed, so we're >>>>>> not really getting representative performance data unfortunately (it >>>>>> has to do with the DataFrame constructor and c-contiguity vs >>>>>> f-contiguity). We should probably write new tests to fix this issue. >>>>>> >>>>>> I'll write up a fuller explanation when I get a chance. Anyway, sorry >>>>>> for sending you on a git bisect goose chase, Jeff. >>>>>> >>>>>> Stephen >>>>>> >>>>>> On Wed, Mar 20, 2013 at 1:01 AM, Stephen Lin >>>>>> wrote: >>>>>>> As per the "we're getting too chatty on GitHub" comment, should we be >>>>>>> moving extended issue discussion about bugs to this list whenever >>>>>>> possible? >>>>>>> >>>>>>> I posted a few comments on #3089 just now but realized maybe starting >>>>>>> an e-mail chain would be better.. >>>>>>> >>>>>>> Anyway, I'm looking into the issue, I suspect it's a corner case due >>>>>>> to an array that's very large in one dimension but small in another, >>>>>>> and possibly that there's compiler and architecture differences >>>>>>> causing different results as well....Jeff, do you mind sending me your >>>>>>> the output of "gcc -dumpmachine" and "gcc -dumpspecs" on the machine >>>>>>> you ran vb_suite on? >>>>>>> >>>>>>> I'll set up a 64-bit dev machine going forward so I can test on both >>>>>>> platforms. >>>>>>> >>>>>>> Thanks, >>>>>>> Stephen >>>>> _______________________________________________ >>>>> Pandas-dev mailing list >>>>> Pandas-dev at python.org >>>>> http://mail.python.org/mailman/listinfo/pandas-dev >>>> >>>> >>>> >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> http://mail.python.org/mailman/listinfo/pandas-dev >>>> From jreback at yahoo.com Wed Mar 20 21:02:39 2013 From: jreback at yahoo.com (Jeff Reback) Date: Wed, 20 Mar 2013 16:02:39 -0400 Subject: [Pandas-dev] #3089 [PERF: regression from 0.10.1] discussion In-Reply-To: References: Message-ID: <25AE4CC2-1087-4332-92F2-CD11B8080F03@yahoo.com> I am on 64bit Linux (I use windows too, but try to avoid whenever possible!) I agree with your assessment wrt 32/64 and perf - I am not sure that these corner cases r that big a deal, more important I think is that we test the most common cases for perf On Mar 20, 2013, at 3:56 PM, Stephen Lin wrote: > Thanks Jeff! > > So ignoring the testing methodology issue for now, I've done the small > fix suggested but apparently it *is* too restrictive because it > negatively affects two other tests that were previously improved (so > the two "degenerate" tests improved 25% by adding the restriction > while these two tests regressed 25%). I will do some more testing to > see if I can find a justifiable way of avoiding this degenerate case, > (hopefully) without hardcoding a magic number... 
(But maybe we should > just not bother with this degenerate case anyway, perhaps? I'm a fan > of making all improvements monotonic, so I'd prefer not to have to > regress this case even if it's degenerate, but I don't know yet how > reliably I can do that for situations and all processor/compiler/OS > combinations...) > > Also, Jeff, I reviewed my vbenches vs the ones you published on GitHub > for this issue, and I think the reason that some of my larger > performance impacts are not shown in your results is because of the > vectorization issue (you ARE on 64-bit, right?)...I'm not 100% sure > but I really think it's likely that it's because x86-64 allows more > vectorization optimizations even without memmove, so the effect of > this optimization is not that great. However, there's plenty of people > still using 32-bit OSes (I have a 64-bit machine but just never > bothered to install 64-bit Ubuntu), so it's definitely worthwhile > still to do this. > > In any case, I believe that VC++9 (i.e. 2008) (which still hosts the > pre-built binary windows build still, I think? correct me if I'm > wrong) does rather poorly on vectorization, even when it's allowed. > Worse, though, it's usually not allowed because Windows 32-bit builds > generally have to assume lowest-common-denominator hardware (SSE, > which is from Pentium III, and SSE2, from Pentium IV, only became a > requirements to install Windows with Windows *8* :D) since they are > not compiled on the user machine. (You can only avoid this by > abandoning compatibility with older machines or going through hoops to > detect CPUID at runtime and modifying program behavior accordingly, > which I don't think Cython does.) > > Anyway, I'll fill in with more info when I have some. > > Stephen > > On Wed, Mar 20, 2013 at 2:56 PM, Jeff Reback wrote: >> awesome explanation Stephen! >> >> I'd vote for #2 >> >> essentially create a testing constructor (kind of like y-p's mkdf), >> but creates only a numpy random array, that by default is c-continguous >> (with option for f ), and then use that where we have (EVERYWHERE)! >> np.random.randn....... >> >> and second I guess if it helps, look at the c/f contiguous ness >> of the ops where appropriate... >> >> my 2c >> >> >> >> >> On Wed, Mar 20, 2013 at 2:24 PM, Stephen Lin wrote: >>> >>> OK, here goes, the issue is the following... >>> >>> The optimization is question optimizes to row-by-row or >>> column-by-column copying for 2-d arrays when possible, namely when: >>> >>> 1. the input array (where the array in question is Block.values) is >>> c-contiguous for takes along axis0 or f-contiguous for takes along >>> axis1 of the array, and >>> 2. the contiguity of the output array matches the contiguity of the input >>> >>> Almost all the time, Block.values is stored c-contiguously, such that >>> each row of the Block corresponds to a column of the DataFrame. So the >>> optimization only really kicks in, effectively, when reindexing along >>> the column axis of the DataFrame (i.e. axis 0 of the Block); it >>> basically means we call memmove once per DataFrame column rather than >>> iterating in a loop and copying elements. This is good because most >>> sane DataFrame objects are have more rows than columns, so we call >>> memmove few times (i.e. once per column) for a large block of values >>> (i.e. 
all rows for that column at a time), so any overhead from >>> calling memmove will be outweighed by the benefit of a hand optimized >>> copy (which probably involves vectorization, alignment/cache >>> optimization, loop unrolling, etc.) >>> >>> C-contiguous blocks result from basically every Pandas operation that >>> operates on blocks, with the only exceptions of (as far as I can tell) >>> creating a DataFrame directly from a 2-d ndarray or creating the >>> transpose of a homogenous DataFrame (but not a heterogenous one) >>> without copying; this is basically an optimization to avoid creating >>> the c-contigous version of an array when the f-contiguous one is >>> already available, but it's the exception rather than the rule and >>> pretty any modification of the DataFrame will immediately require >>> reallocation and copying to a new c-contiguous block. >>> >>> Unfortunately many of the DataFrame tests, including the two in >>> question here, are (for simplicity) only testing the case where a >>> homogenous 2-d data is passed to the DataFrame, which results in >>> (non-representative) f-contiguous blocks. An additional issue with >>> this test is that it's creating a very long but thin array (10,000 >>> long, 4 wide) and reindexing along the index dimension, so row-by-row >>> (from the DataFrame perspective) copying is done over and over using >>> memmove on 4 element arrays. Furthermore, the alignment and width in >>> bytes of each 4 element array happens to be a convenient multiple of >>> 128bits, which is the multiple required for vectorized SIMD >>> instructions, so it turns out the element-by-element copying is fairly >>> efficient when such operations are available (as is guaranteed on >>> x86-64, but not necessarily x86-32), and the call to memmove has more >>> overhead than element-by-element copying. >>> >>> So the issue is basically only happening because all the following are >>> true: >>> >>> 1. The DataFrame is constructed directly by a 2-d homogenous ndarray >>> (which has the default c-contiguous continuity, so the block becomes >>> f-contiguous). >>> 2. There has been no operation after construction of the DataFrame >>> requiring reallocation of any sort (otherwise the block would become >>> c-contiguous). >>> 3. The reindexing is done on the index axis (otherwise no optimization >>> would be triggered, since it requires the right axis/contiguity >>> combination). >>> 4. The DataFrame is long but thin (otherwise memmove would not be >>> called repeatedly to do small copies). >>> 5. The C compiler is not inlining memmove properly, for whatever reason, >>> and >>> 6. (possibly) The alignment/width of the data happens to be such that >>> SIMD operations can be used directly, so the overhead of the eliding >>> the loop is not very great and exceeded by the overhead of the >>> memmove. >>> >>> To be honest, it's common C practice to call memmove/memcpy (the >>> performance of the two don't really differ from my testing in this >>> case) even for very small arrays and assuming that the implementation >>> is sane enough to inline it and do the right thing either way, so I'm >>> really surprised about #5: I would not have thought it to be an issue >>> with a modern compiler, since calling memcpy can't do anything but >>> provide the compiler more, not less, information about your intentions >>> (and the overhead of the memmove aliasing check is not significant >>> here). 
>>> >>> Anyway, so it's a corner case, and I didn't catch it originally >>> because I tested independently the effect of 1) allocates the output >>> array to be f-contiguous instead of c-contiguous by default when the >>> input array is f-contiguous and 2) converting loops into memmove when >>> possible, both of which have a positive performance effect >>> independently but combine to adversely affect these two tests. >>> >>> I can revert the change that "allocates the output array to be >>> f-contiguous instead of c-contiguous by default when the input array >>> is f-contiguous", meaning that this optimization will almost never be >>> triggered for an f-contiguous input array (unless the caller >>> explicitly provides an output array as f-contiguous), but I'd rather >>> not because the optimization is actually kind of useful in less >>> degenerate cases when you want to quickly produce a reindexed version >>> of a f-contiguous array, for whatever reason, even though the cases >>> are rarer. >>> >>> So I think what I'm going to do instead, to avoid the degenerate case >>> above, is to trigger the optimization only when the take operation is >>> done along the shorter of the two dimensions (i.e. so the copied >>> dimension is the longer of the two): that will definitely fix this >>> test (since it'll avoid this optimization completely) but I suppose >>> there might be other degenerate cases I haven't thought about it. I'll >>> submit a PR later today for this, if no one finds any objection to the >>> idea. >>> >>> However, I think it might be skewed our performance results to be >>> testing DataFrame objects constructed by 2-d ndarrays, since they're >>> not representative; in addition to the issue above, it means that many >>> tests are actually incorporating the cost of converting an >>> f-contiguous array into a c-contiguous array on top of what they're >>> actually trying to test. Two possible solutions are: >>> >>> 1. Change DataFrame constructor (and possibly DataFrame.T) to >>> normalize all blocks as c-contiguous. >>> 2. Leave DataFrame constructor as-is but either change existing tests >>> to exercise the more common use case (c-contiguous blocks) or add them >>> in addition to the current ones. >>> >>> I think #2 is probably best, since #1 will have a performance impact >>> for the use cases (however rare) where an entire workflow can avoid >>> triggering conversion from f-contiguous blocks to c-contiguous blocks. >>> >>> Let me know what you all think, >>> Stephen >>> >>> On Wed, Mar 20, 2013 at 1:25 AM, Stephen Lin >>> wrote: >>>> Ahh! I figured it out...the platform issue is part of it, but mostly >>>> it's that two (independently tested) commits had a weird effect when >>>> merged. >>>> >>>> And the reason they did so is because this particular test turns out >>>> all of our reindexing tests are testing something very >>>> non-representative, because of the way they're constructed, so we're >>>> not really getting representative performance data unfortunately (it >>>> has to do with the DataFrame constructor and c-contiguity vs >>>> f-contiguity). We should probably write new tests to fix this issue. >>>> >>>> I'll write up a fuller explanation when I get a chance. Anyway, sorry >>>> for sending you on a git bisect goose chase, Jeff. 
>>>> >>>> Stephen >>>> >>>> On Wed, Mar 20, 2013 at 1:01 AM, Stephen Lin >>>> wrote: >>>>> As per the "we're getting too chatty on GitHub" comment, should we be >>>>> moving extended issue discussion about bugs to this list whenever >>>>> possible? >>>>> >>>>> I posted a few comments on #3089 just now but realized maybe starting >>>>> an e-mail chain would be better.. >>>>> >>>>> Anyway, I'm looking into the issue, I suspect it's a corner case due >>>>> to an array that's very large in one dimension but small in another, >>>>> and possibly that there's compiler and architecture differences >>>>> causing different results as well....Jeff, do you mind sending me your >>>>> the output of "gcc -dumpmachine" and "gcc -dumpspecs" on the machine >>>>> you ran vb_suite on? >>>>> >>>>> I'll set up a 64-bit dev machine going forward so I can test on both >>>>> platforms. >>>>> >>>>> Thanks, >>>>> Stephen >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> http://mail.python.org/mailman/listinfo/pandas-dev >> >> >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> http://mail.python.org/mailman/listinfo/pandas-dev >> From yoval at gmx.com Sat Mar 23 00:44:20 2013 From: yoval at gmx.com (yoval p.) Date: Sat, 23 Mar 2013 00:44:20 +0100 Subject: [Pandas-dev] I fought travis and won (sort of). Message-ID: <20130322234420.28600@gmx.com> Hi guys, I've been frustrated with the turn-around time on travis, as it's become less a "CI" service then a "ATFMLI" (about 25 minutes later integration). Much of that build time is taken up by cythonizing and compilation even though the majority of PRs don't touch the cython code, and that's all wasted work. I hacked out a POC using network storage to cache build results bringing a complete run down to about ~8 minutes. Admittedly not amazing, but 2.5X-3X all the same. If any of you have an S3 API key you're willing to throw my way, I can set this up so anyone can opt in, via a magic incantation included in the commit message. Cheers, yoval -------------- next part -------------- An HTML attachment was scrubbed... URL: From jreback at yahoo.com Sat Mar 23 01:56:12 2013 From: jreback at yahoo.com (Jeff Reback) Date: Fri, 22 Mar 2013 20:56:12 -0400 Subject: [Pandas-dev] docs & builds Message-ID: <0F58569D-CD9B-4ECE-99B5-D83B7FA041F2@yahoo.com> Wes/Chang not sure exactly how the doc builds happen, though I usually see updated by 5pm est, working? also windows dev builds stopped as of 3/14 thanks Jeff I can be reached on my cell 917-971-6387 From wesmckinn at gmail.com Sat Mar 23 17:01:07 2013 From: wesmckinn at gmail.com (Wes McKinney) Date: Sat, 23 Mar 2013 12:01:07 -0400 Subject: [Pandas-dev] docs & builds In-Reply-To: <0F58569D-CD9B-4ECE-99B5-D83B7FA041F2@yahoo.com> References: <0F58569D-CD9B-4ECE-99B5-D83B7FA041F2@yahoo.com> Message-ID: On Fri, Mar 22, 2013 at 8:56 PM, Jeff Reback wrote: > Wes/Chang > > not sure exactly how the doc builds happen, though I usually see updated by 5pm est, working? > > also windows dev builds stopped as of 3/14 > > thanks > > Jeff > > > I can be reached on my cell 917-971-6387 > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev Windows dev builds are updated. I had turned off the VM before I left for PyCon fully intending to run it on my laptop but forgot to do so. 
It will be offline a couple days next week due to moving. - Wes From changshe at gmail.com Sat Mar 23 18:08:04 2013 From: changshe at gmail.com (Chang She) Date: Sat, 23 Mar 2013 10:08:04 -0700 Subject: [Pandas-dev] docs & builds In-Reply-To: References: <0F58569D-CD9B-4ECE-99B5-D83B7FA041F2@yahoo.com> Message-ID: <2FB2162E-D142-4CFF-9329-9C12918EA8C7@gmail.com> The docs built are kicked off everyday on the same machine that was running the VM, but thank god we're not building the docs in a windows environment :) On Mar 23, 2013, at 9:01 AM, Wes McKinney wrote: > On Fri, Mar 22, 2013 at 8:56 PM, Jeff Reback wrote: >> Wes/Chang >> >> not sure exactly how the doc builds happen, though I usually see updated by 5pm est, working? >> >> also windows dev builds stopped as of 3/14 >> >> thanks >> >> Jeff >> >> >> I can be reached on my cell 917-971-6387 >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> http://mail.python.org/mailman/listinfo/pandas-dev > > Windows dev builds are updated. I had turned off the VM before I left > for PyCon fully intending to run it on my laptop but forgot to do so. > It will be offline a couple days next week due to moving. > > - Wes > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev From jeffreback at gmail.com Sun Mar 24 03:28:56 2013 From: jeffreback at gmail.com (Jeff Reback) Date: Sat, 23 Mar 2013 22:28:56 -0400 Subject: [Pandas-dev] docs & builds In-Reply-To: <2FB2162E-D142-4CFF-9329-9C12918EA8C7@gmail.com> References: <0F58569D-CD9B-4ECE-99B5-D83B7FA041F2@yahoo.com> <2FB2162E-D142-4CFF-9329-9C12918EA8C7@gmail.com> Message-ID: <1DBC5FFC-0D41-4241-B5F6-6F6AA1542588@gmail.com> looks like they r updated thxs! I can be reached on my cell 917-971-6387 On Mar 23, 2013, at 1:08 PM, Chang She wrote: > The docs built are kicked off everyday on the same machine that was running the VM, but thank god we're not building the docs in a windows environment :) > > On Mar 23, 2013, at 9:01 AM, Wes McKinney wrote: > >> On Fri, Mar 22, 2013 at 8:56 PM, Jeff Reback wrote: >>> Wes/Chang >>> >>> not sure exactly how the doc builds happen, though I usually see updated by 5pm est, working? >>> >>> also windows dev builds stopped as of 3/14 >>> >>> thanks >>> >>> Jeff >>> >>> >>> I can be reached on my cell 917-971-6387 >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> http://mail.python.org/mailman/listinfo/pandas-dev >> >> Windows dev builds are updated. I had turned off the VM before I left >> for PyCon fully intending to run it on my laptop but forgot to do so. >> It will be offline a couple days next week due to moving. >> >> - Wes >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> http://mail.python.org/mailman/listinfo/pandas-dev > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev From wesmckinn at gmail.com Tue Mar 26 02:58:59 2013 From: wesmckinn at gmail.com (Wes McKinney) Date: Mon, 25 Mar 2013 21:58:59 -0400 Subject: [Pandas-dev] Fast py2/py3 testing, fast vbench In-Reply-To: <20130320200515.28610@gmx.com> References: <20130320200515.28610@gmx.com> Message-ID: On Wed, Mar 20, 2013 at 4:05 PM, yoval p. 
wrote: > I've made some improvement to the tooling we have > for development, just making sure everyone is > aware of what's available. > > - closed GH3099, caching cython build artifacts > when running setup.py. > - setup.py now checks for BUILD_CACHE_DIR envar > so you can enable it without touch the source code > - Once enabled, with a warm cache testing py26/27/32/33 > takes only a couple of minutes compares with travis' ~15 > on a quad core machine > - if caching is enabled (for future commits, the envar is sufficiant) > test_perf.sh will run much faster. > - i've added an option to filter vbench by regex when running > test_perf.sh. > > Quick iteration makes everything easier, I hope these > changes do that. > > Here's an example of all of the above, comparing two adjacent > commits on a reduced set of vbenches in 1min flat: > > ? export BUILD_CACHE_DIR="/tmp/.pandas_build_cache/" > ? time ./test_perf.sh -b 18c7e6c -t 18c7e6c^ -r reindex > ... > Results: > t_head t_baseline ratio > name > dataframe_reindex 0.3726 0.3726 1.0000 > reindex_fillna_backfill_float32 0.0961 0.0961 1.0000 > reindex_fillna_pad_float32 0.0959 0.0959 1.0000 > frame_reindex_upcast 17.7334 17.7334 1.0000 > reindex_daterange_backfill 0.1649 0.1649 1.0000 > reindex_fillna_pad 0.1052 0.1052 1.0000 > reindex_daterange_pad 0.1757 0.1757 1.0000 > reindex_frame_level_align 1.0109 1.0109 1.0000 > reindex_fillna_backfill 0.1035 0.1035 1.0000 > reindex_frame_level_reindex 0.9586 0.9586 1.0000 > frame_reindex_columns 0.3101 0.3101 1.0000 > reindex_multiindex 1.1427 1.1427 1.0000 > > Columns: test_name | target_duration [ms] | baseline_duration [ms] | ratio > > - a Ratio of 1.30 means the target commit is 30% slower then the baseline. > > Target [18c7e6c] : BLD: check for BUILD_CACHE_DIR envar in setup.py > Baseline [18c7e6c] : BLD: check for BUILD_CACHE_DIR envar in setup.py > > > *** Results were also written to the logfile at > '/home/user1/src/pandas/vb_suite.log' > > > real 0m58.561s > user 0m52.699s > sys 0m1.645s > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev > This is really great. Caching of builds is a no-brainer and the vbench suite has gotten quite large (surprised it's not more popular? we are avant garde). Thanks y-p! From swlin at post.harvard.edu Wed Mar 27 01:39:30 2013 From: swlin at post.harvard.edu (Stephen Lin) Date: Tue, 26 Mar 2013 20:39:30 -0400 Subject: [Pandas-dev] Small question about vb_suite Message-ID: Hey guys, Just curious, Is there a convenient way to run just a particular set of benchmarks using one's locally checked out copy, right than going through the whole rigamorale of checking out two commits and comparing them? I'm modifying some tests to see if I can improve their stability, and I just want to check for syntax errors and such quickly... Stephen From wesmckinn at gmail.com Wed Mar 27 02:12:26 2013 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 26 Mar 2013 21:12:26 -0400 Subject: [Pandas-dev] Small question about vb_suite In-Reply-To: References: Message-ID: On Tue, Mar 26, 2013 at 8:39 PM, Stephen Lin wrote: > Hey guys, > > Just curious, Is there a convenient way to run just a particular set > of benchmarks using one's locally checked out copy, right than going > through the whole rigamorale of checking out two commits and comparing > them? 
I'm modifying some tests to see if I can improve their > stability, and I just want to check for syntax errors and such > quickly... > > Stephen > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > http://mail.python.org/mailman/listinfo/pandas-dev The Benchmark objects have a "run" method: In [2]: import reindex In [3]: reindex.reindex_fillna_pad.run() Out[3]: {'loops': 1000, 'repeat': 3, 'succeeded': True, 'timing': 0.12795305252075195, 'units': 'ms'} Make a list of benchmarks of interest, any, with a little getattr action, you should be in business. - Wes From yoval at gmx.com Wed Mar 27 11:48:49 2013 From: yoval at gmx.com (yoval p.) Date: Wed, 27 Mar 2013 11:48:49 +0100 Subject: [Pandas-dev] Small question about vb_suite Message-ID: <20130327104850.37450@gmx.com> ----- Original Message ----- From: Stephen Lin Sent: 03/27/13 02:39 AM To: pandas-dev at python.org Subject: [Pandas-dev] Small question about vb_suite Hey guys, Just curious, Is there a convenient way to run just a particular set of benchmarks using one's locally checked out copy, right than going through the whole rigamorale of checking out two commits and comparing them? I'm modifying some tests to see if I can improve their stability, and I just want to check for syntax errors and such quickly... Stephen _______________________________________________ Pandas-dev mailing list Pandas-dev at python.org http://mail.python.org/mailman/listinfo/pandas-dev That would be vb_suite/perf_HEAD, which really should be part of test_perf. ...So now it is. ``` test_perf -H -r frame_.+ ``` Yoval -------------- next part -------------- An HTML attachment was scrubbed... URL:
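[For reference, a minimal sketch of the getattr approach Wes describes above, run from inside a local checkout's vb_suite directory; the benchmark names are taken from the reindex table earlier in the thread and the selection is illustrative:]

```python
# Run a hand-picked subset of vbench benchmarks against the local build,
# skipping the full two-commit comparison that test_perf.sh performs.
import reindex

names = ['dataframe_reindex', 'reindex_fillna_pad', 'reindex_fillna_backfill']
for name in names:
    bench = getattr(reindex, name)
    result = bench.run()   # dict with 'timing', 'units', 'succeeded', ...
    print('%s: %.4f %s' % (name, result['timing'], result['units']))
```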