usage of os.posix_fadvise

Wolfgang Maier wolfgang.maier at biologie.uni-freiburg.de
Thu May 30 13:54:12 EDT 2013


Antoine Pitrou wrote:

>Hi,

>Wolfgang Maier <wolfgang.maier <at> biologie.uni-freiburg.de> writes:
>> 
>> Dear all,
>> I was just experimenting for the first time with os.posix_fadvise(), which
>> is new in Python 3.3. I'm reading from a really huge file (several GB) and I
>> want to use the data only once, so I don't want OS-level page caching. I
>> tried os.posix_fadvise with the os.POSIX_FADV_NOREUSE and with the
>> os.POSIX_FADV_DONTNEED flags, but neither seemed to have any effect on the
>> caching behaviour of Ubuntu (still uses all available memory to page cache
>> my I/O).
>> Specifically, I was trying this:
>> 
>> import os
>> fd = os.open('myfile', os.O_RDONLY)
>> # wasn't sure about the len parameter in fadvise,
>> # so thought I just use it on the first 4GB
>> os.posix_fadvise(fd, 0, 4000000000, os.POSIX_FADV_NOREUSE) # or DONTNEED
>
>The Linux version of "man posix_fadvise" probably holds the answer:
>
>"In kernels before 2.6.18, POSIX_FADV_NOREUSE had the same semantics
>as POSIX_FADV_WILLNEED.  This was probably a bug; since kernel
>2.6.18, this flag is a no-op."
>
>"POSIX_FADV_DONTNEED attempts to free cached pages associated with the
>specified region.  This is useful, for example, while streaming large
>files.  A program may periodically request the kernel to free cached
>data that has already been used, so that more useful cached pages  are
>not discarded instead."
>
>So, in summary:
>
>- POSIX_FADV_NOREUSE doesn't do anything on (modern) Linux kernels
>- POSIX_FADV_DONTNEED must be called *after* you are done with a range of
>  data, not before you read it (note that I haven't tested to confirm it :-))
>
>Regards
>
>Antoine.

Hi Antoine,
you're right and thanks a lot for this great piece of information.
The following quick check works like a charm now:

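>>> import os
>>> fo = open('myfile', 'rb')
>>> chunk_size = 16184
>>> last_flush = 0  # offset up to which cached pages were already dropped
>>> d = fo.read(chunk_size)
>>> pos = chunk_size  # bytes read since the last flush
>>> while d:
...     d = fo.read(chunk_size)
...     pos += chunk_size
...     if pos > 2000000000:
...         print('another 2GB read, flushing')
...         # arguments are (fd, offset, len, advice); len is a byte count,
...         # not an end offset
...         os.posix_fadvise(fo.fileno(), last_flush, pos,
...                          os.POSIX_FADV_DONTNEED)
...         last_flush += pos
...         pos = 0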

With this, page caching for my huge file (30 GB in this case) still occurs,
of course, but it never occupies more than about 2 GB of memory. This way it
should interfere less with the cached data of other applications.
I still have to test carefully how much that improves the overall performance
of the system, but for the moment I'm more than happy!
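
The same idea can be wrapped in a reusable generator. This is only a
minimal sketch, not tested on the 30 GB file: the name read_once and the
parameters chunk_size and flush_every are made up for illustration. It
opens the file unbuffered so that the byte count matches what was actually
requested from the kernel, and it skips the advice on platforms where
os.posix_fadvise does not exist.

```python
import os


def read_once(path, chunk_size=1 << 20, flush_every=1 << 30):
    """Yield chunks of `path`, periodically asking the kernel to drop
    the page-cache pages for the part that was already consumed.

    read_once, chunk_size and flush_every are illustrative names only.
    """
    # os.posix_fadvise is not available on every platform
    can_advise = hasattr(os, "posix_fadvise")
    # buffering=0: no Python-level read-ahead, so `offset` is accurate
    with open(path, "rb", buffering=0) as fo:
        flushed = 0  # everything before this offset was already advised
        offset = 0   # total number of bytes read so far
        while True:
            chunk = fo.read(chunk_size)
            if not chunk:
                break
            offset += len(chunk)
            yield chunk
            # advise only *after* the consumer is done with the range
            if can_advise and offset - flushed >= flush_every:
                os.posix_fadvise(fo.fileno(), flushed, offset - flushed,
                                 os.POSIX_FADV_DONTNEED)
                flushed = offset
        # drop whatever is left after the last full flush interval
        if can_advise and offset > flushed:
            os.posix_fadvise(fo.fileno(), flushed, offset - flushed,
                             os.POSIX_FADV_DONTNEED)
```

Usage would look like "for chunk in read_once('myfile'): ...", with
flush_every playing the role of the 2 GB threshold above.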
Best wishes,
Wolfgang

P.S.: Maybe these new os module features could use a bit more documentation? 
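
On the len parameter I was unsure about in my first message: the arguments
are (fd, offset, len, advice), and per POSIX a len of 0 means "from offset
to the end of the file". A small sketch of a helper built on that, where
the name drop_file_cache is hypothetical and not part of the os module:

```python
import os


def drop_file_cache(path):
    """Ask the kernel to drop cached pages for the whole file at `path`.

    Hypothetical helper; returns False where os.posix_fadvise is missing.
    The call is advisory, so the kernel may still keep some pages.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, len=0: the whole file, whatever its size
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return True
```

This avoids guessing a length such as 4000000000 for a file of unknown size.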



