[Numpy-discussion] Possible roadmap addendum: building better text file readers

Nathaniel Smith njs at pobox.com
Sun Feb 26 14:49:47 EST 2012


On Sun, Feb 26, 2012 at 7:16 PM, Warren Weckesser
<warren.weckesser at enthought.com> wrote:
> On Sun, Feb 26, 2012 at 1:00 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> For this kind of benchmarking, you'd really rather be measuring the
>> CPU time, or reading byte streams that are already in memory. If you
>> can process more MB/s than the drive can provide, then your code is
>> effectively perfectly fast. Looking at this number has a few
>> advantages:
>>  - You get more repeatable measurements (no disk buffers and stuff
>> messing with you)
>>  - If your code can go faster than your drive, then the drive won't
>> make your benchmark look bad
>>  - There are probably users out there that have faster drives than you
>> (e.g., I just measured ~340 megabytes/s off our lab's main RAID
>> array), so it's nice to be able to measure optimizations even after
>> they stop mattering on your equipment.
>
>
> For anyone benchmarking software like this, be sure to clear the disk cache
> before each run.  In linux:

Err, my argument was that you should do exactly the opposite, and just
worry about hot-cache times (or time how long it takes to read a big
in-memory buffer, to avoid having to think about the OS's caching
strategies).
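
A minimal sketch of that kind of measurement, assuming a made-up
stand-in parser (parse_floats) and synthetic data (make_csv_bytes);
neither is anyone's real reader. It times the parser in CPU time
against a buffer already in memory and reports MB/s, so the disk and
the OS cache never enter the picture:

```python
import time


def make_csv_bytes(n_rows, n_cols=5):
    """Build a synthetic CSV payload in memory (hypothetical test data)."""
    row = ",".join(["1.5"] * n_cols) + "\n"
    return (row * n_rows).encode("ascii")


def parse_floats(buf):
    """Stand-in parser: decode, split each line on commas, convert to float."""
    text = buf.decode("ascii")
    return [[float(x) for x in line.split(",")] for line in text.splitlines()]


def throughput_mb_per_s(parser, data, repeats=3):
    """Return best-of-N throughput in MB/s, measured in process CPU time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.process_time()
        parser(data)
        elapsed = time.process_time() - start
        best = min(best, elapsed)
    # Guard against a clock too coarse to register the run at all.
    return len(data) / 1e6 / best if best > 0 else float("inf")


data = make_csv_bytes(20000)
rate = throughput_mb_per_s(parse_floats, data)
print(f"{rate:.1f} MB/s")
```

The number to compare against is whatever your drive can deliver on
a sequential read; if the parser's rate is above that, the parser is
not the bottleneck.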

Clearing the disk cache is very important for getting meaningful,
repeatable benchmarks in code where you know that the cache will
usually be cold and where hitting the disk will have unpredictable
effects (i.e., pretty much anything doing random access, like
databases, which have complicated locality patterns and may or may
not trigger readahead). But here we're talking about pure
sequential reads, where the disk just goes however fast it goes, and
your code can either keep up or not.

One minor point where the OS interface could matter: it's good to set
up your code so it can use mmap() instead of read(), since this can
reduce overhead. read() has to copy the data from the disk into the
OS's page cache, and then from the page cache into your process's
buffer; mmap() maps the cached pages directly into your address space
and so skips the second copy.
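
For illustration, here is a rough sketch of the two access styles
using Python's standard mmap module. The helper names and the sample
file are invented for the example, and whether mmap() actually wins
depends on the OS and the access pattern, so treat it as something
to measure rather than a guarantee:

```python
import mmap
import os
import tempfile


def count_newlines_read(path, chunk_size=1 << 20):
    """Count newlines with read(): each chunk is copied into a bytes object."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


def count_newlines_mmap(path):
    """Count newlines by scanning a read-only memory map of the file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            total = 0
            pos = mm.find(b"\n")
            while pos != -1:
                total += 1
                pos = mm.find(b"\n", pos + 1)
            return total


# Write a small sample file (hypothetical test data).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"1.0,2.0,3.0\n" * 1000)
    path = tmp.name

try:
    n_read = count_newlines_read(path)
    n_mmap = count_newlines_mmap(path)
finally:
    os.remove(path)

print(n_read, n_mmap)
```

If the parser is written against a generic buffer interface, it can
accept either the bytes from read() or the mmap object, which keeps
the choice open.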

-- Nathaniel


