[Neuroimaging] indexed access to gzipped files

Nathaniel Smith njs at pobox.com
Fri Mar 11 19:55:32 EST 2016


On Fri, Mar 11, 2016 at 2:20 PM, paul mccarthy <pauldmccarthy at gmail.com> wrote:
> Hi all,
>
> Sorry for the delay in my joining the conversation.
>
> Brendan is correct - this is not a memmap solution. The approach that I've
> implemented (which I have to emphasise is not my idea - I've just got it
> working in Python) just improves random seek/read time of the uncompressed
> data stream, while keeping the compressed data on disk. This is achieved by
> building an index of mappings between locations in the compressed and
> uncompressed data streams. The index can be fully built when the file is
> initially opened, or can be built on-demand as the file handle is used.
>
> So once an index is built, the IndexedGzipFile class can be used to read in
> parts of the compressed data, without having to decompress the stream from
> the beginning every time you seek to a new location. Decompressing from the
> start is what reading GZIP files typically requires, and it is a fundamental
> limitation of the GZIP format.
>
> As Gael (and others) pointed out, using a different compression format would
> remove the need for silly indexing techniques like the one that I have
> implemented. But I figured that having something like indexed_gzip would
> make life a bit easier for those of us who have to work with large amounts
> of existing .nii.gz files, at least until a new file format is adopted.
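
To make the index idea quoted above concrete, here is a minimal sketch in
pure Python. It is not how indexed_gzip is implemented; it only shows the
mapping between uncompressed and compressed offsets. The names build_index
and read_at and the 4096-byte chunk size are assumptions made up for this
example. Each checkpoint keeps a copy of the zlib decompressor state, so a
read can resume from the nearest checkpoint instead of from byte 0.

```python
import bisect
import gzip
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS  # ask zlib to parse the gzip header/trailer


def build_index(compressed, chunk=4096):
    """Return checkpoints: (uncompressed_offset, compressed_offset, state)."""
    decomp = zlib.decompressobj(GZIP_WBITS)
    index = [(0, 0, decomp.copy())]
    u_off = 0
    for c_off in range(0, len(compressed), chunk):
        u_off += len(decomp.decompress(compressed[c_off:c_off + chunk]))
        if decomp.eof:
            break
        # Snapshot the decompressor after every chunk of compressed input.
        index.append((u_off, c_off + chunk, decomp.copy()))
    return index


def read_at(compressed, index, offset, length, chunk=4096):
    """Read `length` uncompressed bytes starting at uncompressed `offset`."""
    # Last checkpoint at or before the requested uncompressed offset.
    i = bisect.bisect_right([u for u, _, _ in index], offset) - 1
    u_off, c_off, state = index[i]
    decomp = state.copy()  # leave the stored checkpoint reusable
    skip = offset - u_off
    out = bytearray()
    while len(out) < skip + length and c_off < len(compressed) and not decomp.eof:
        out += decomp.decompress(compressed[c_off:c_off + chunk])
        c_off += chunk
    return bytes(out[skip:skip + length])


# Hypothetical demo data standing in for a large .nii.gz payload.
data = b"".join(str(i).encode() for i in range(200000))
compressed = gzip.compress(data)
index = build_index(compressed)
assert read_at(compressed, index, 500000, 20) == data[500000:500020]
```

Each saved state carries zlib's sliding window (up to 32 KiB), so a real
implementation would space its checkpoints much further apart than this
sketch does.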

It's possible to create .gz files that allow seeking but are still
compliant with all the usual standards (e.g. regular gunzip still
works):

  http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html

It sounds like the Biopython folks are on top of this...
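
A rough sketch of why such files stay gunzip-compatible, using only the
standard library: the data is compressed as a series of independent gzip
members, and a concatenation of gzip members is itself a valid gzip stream.
The function names and the side index of member offsets are assumptions
for illustration. Real BGZF instead records each block's compressed size in
the gzip header's extra field, so a reader can walk the blocks without a
separate index.

```python
import gzip

BLOCK_SIZE = 65536  # uncompressed bytes per member (arbitrary choice here)


def write_blocked_gzip(data, block_size=BLOCK_SIZE):
    """Compress data as concatenated gzip members; return (blob, offsets)."""
    blob = bytearray()
    offsets = []  # compressed start offset of each member
    for start in range(0, len(data), block_size):
        offsets.append(len(blob))
        blob += gzip.compress(data[start:start + block_size])
    offsets.append(len(blob))  # sentinel: end of the last member
    return bytes(blob), offsets


def read_range(blob, offsets, offset, length, block_size=BLOCK_SIZE):
    """Read uncompressed bytes by decompressing only the members needed."""
    first = offset // block_size
    last = min((offset + length - 1) // block_size, len(offsets) - 2)
    parts = [
        gzip.decompress(blob[offsets[b]:offsets[b + 1]])
        for b in range(first, last + 1)
    ]
    skip = offset - first * block_size
    return b"".join(parts)[skip:skip + length]


# Hypothetical demo data.
data = b"".join(str(i).encode() for i in range(200000))
blob, offsets = write_blocked_gzip(data)

# An ordinary gzip reader still sees the whole payload...
assert gzip.decompress(blob) == data
# ...while a block-aware reader touches only the members it needs.
assert read_range(blob, offsets, 300000, 100) == data[300000:300100]
```

The price is a somewhat worse compression ratio, since each member starts
with an empty dictionary.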

The excellent xz tool suite has similar features:

  http://blastedbio.blogspot.com/2013/04/random-access-to-blocked-xz-format-bxzf.html

> Going back to the topic of memory-mapping - I'm pretty sure that it is
> completely impossible to achieve true memory-mapping of compressed data,
> unless you're working at the OS kernel level.

100% pedantic and impractical correction: technically it is totally
possible; the Dato folks did it for their numpy/SArray wrappers. The
solution is to implement your own VM mapping system by registering
your page fault routine as a SIGSEGV handler, and have it call mmap to
manipulate the page tables. (If the previous sentence doesn't mean
anything to you, then that's probably a good thing... there's a
difference between whether you *can* do something and whether you
*should* ;-).)

(Also, the result is unlikely to be particularly fast, and you still
need some way to actually do the fast random access to the compressed
disk file.)

-n

-- 
Nathaniel J. Smith -- https://vorpus.org

