[Numpy-discussion] Possible roadmap addendum: building better text file readers

Sun Feb 26 14:16:10 EST 2012

On Sun, Feb 26, 2012 at 1:00 PM, Nathaniel Smith <njs at pobox.com> wrote:

> On Sun, Feb 26, 2012 at 5:23 PM, Warren Weckesser
> <warren.weckesser at enthought.com> wrote:
> > I haven't pushed it to the extreme, but the "big" example (in the
> > examples/ directory) is a 1 gig text file with 2 million rows and 50
> > fields in each row.  This is read in less than 30 seconds (but that's
> > with a solid state drive).
>
> Obviously this was just a quick test, but FYI, a solid state drive
> shouldn't really make any difference here -- this is a pure sequential
> read, and for those, SSDs are if anything actually slower than
> traditional spinning-platter drives.
>
>

Good point.

> For this kind of benchmarking, you'd really rather be measuring the
> CPU time, or reading byte streams that are already in memory. If you
> can process more MB/s than the drive can provide, then your code is
> effectively perfectly fast. Looking at this number has a few
> advantages:
>  - You get more repeatable measurements (no disk buffers and stuff
> messing with you)
>  - If your code can go faster than your drive, then the drive won't
> make your benchmark look bad
>  - There are probably users out there that have faster drives than you
> (e.g., I just measured ~340 megabytes/s off our lab's main RAID
> array), so it's nice to be able to measure optimizations even after
> they stop mattering on your equipment.
>
>
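
A minimal sketch of the kind of measurement suggested above: parse a byte
stream that is already in memory and time it with the process CPU clock,
so the drive never enters the picture.  The parse_rows function is only a
hypothetical stand-in for whatever reader is being benchmarked, and the
synthetic data and function names are assumptions for illustration.

```python
import io
import time


def make_rows(n_rows, n_fields):
    """Build a synthetic comma-separated table as one in-memory string."""
    line = ",".join(f"{i}.5" for i in range(n_fields))
    return "\n".join([line] * n_rows) + "\n"


def parse_rows(stream):
    """Stand-in parser (hypothetical): split each line and convert to float."""
    return [[float(tok) for tok in line.split(",")] for line in stream]


def cpu_throughput(text, parser=parse_rows):
    """Return (rows, MB parsed per CPU-second) for parsing `text` from memory."""
    n_bytes = len(text.encode("utf-8"))
    start = time.process_time()  # CPU time of this process, not wall-clock
    rows = parser(io.StringIO(text))
    elapsed = time.process_time() - start
    # Guard against a zero reading on very small inputs or coarse clocks.
    mb_per_s = (n_bytes / 1e6) / elapsed if elapsed > 0 else float("inf")
    return rows, mb_per_s


if __name__ == "__main__":
    text = make_rows(n_rows=20000, n_fields=50)
    rows, rate = cpu_throughput(text)
    print(f"{len(rows)} rows parsed at {rate:.1f} MB per CPU-second")
```

Comparing the reported rate with the drive's sequential read rate shows
whether the parser or the disk is the bottleneck.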

For anyone benchmarking software like this against files on disk, be sure
to clear the disk cache before each run.  On Linux:

$ sync
$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"

On Mac OS X:

$ purge

I'm not sure what the equivalent is on Windows.
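
For scripted benchmarks, the commands above could be wrapped in a small
helper, roughly like the sketch below.  The function names are made up for
illustration, and only the two platforms mentioned here are covered; any
other platform raises an error rather than guessing at a command.

```python
import subprocess
import sys


def drop_cache_commands(platform=sys.platform):
    """Return the cache-clearing commands for `platform` as argv lists."""
    if platform.startswith("linux"):
        return [
            ["sync"],
            ["sudo", "sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"],
        ]
    if platform == "darwin":
        return [["purge"]]
    # No equivalent is known here for other platforms (e.g. Windows).
    raise NotImplementedError(f"no known cache-drop command for {platform!r}")


def drop_cache():
    """Run the commands; on Linux this prompts for sudo credentials."""
    for argv in drop_cache_commands():
        subprocess.run(argv, check=True)
```

Calling drop_cache() right before each timed read keeps every run starting
from a cold cache.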

Warren