Determine whether STDIN is gzipped

Alex Martelli aleaxit at yahoo.com
Tue Jan 9 07:39:23 EST 2001


"Carsten Gaebler" <cg at schlund.de> wrote in message
news:3A5ADB84.194BAD4E at schlund.de...
> Hi there!
>
> I'm writing a program that reads its input data from STDIN. Now I'd like
> it to be able to determine whether the input data is gzipped or not and
    [snip]
> f = gzip.GzipFile("", "rb", fileobj=sys.stdin)
> try:
>     f.readline() # raises exception if not gzipped
> except:
>     f = sys.stdin
>
> while 1:
>     line = f.readline()
    [snip]
> The problem is that if STDIN is not gzipped I am missing the first few
> bytes of the first line in the while loop, if STDIN is gzipped I am even
> missing the whole first line, all because of the readline() in the try
> block. Unfortunately f.seek(0) doesn't work for STDIN. Any hints?

The part about "missing the whole first line if gzipped" is trivial
to remedy, of course -- change from the try onwards to:

    try:
        line = f.readline()
    except IOError:
        f = sys.stdin
        line = f.readline()

    while line:
        # processing
        line = f.readline()

(the except was changed just to be more specific -- except without
an exception type specification is rarely what one wants!).


The part about "missing the first few bytes" is also not all that
hard -- you just have to make sure you stash those 'first few bytes'
somewhere, while the GzipFile is reading them.  Specifically, only 2
bytes should need saving: gzip.GzipFile seems to do a first read
of 2 bytes (looking for the 'magic number' that marks a gzipped
file) before any further reads.  A little wrapping should suffice...:


import gzip
import sys

class Wrap:
    def __init__(self, file):
        self.file = file
        self.data = None    # set by the very first call to read
    def read(self, *args):
        if self.data is None:
            # first call: stash what gzip reads, so we can recover it
            self.data = self.file.read(*args)
            return self.data
        return self.file.read(*args)
    def tell(self, *args):
        return self.file.tell(*args)
    def seek(self, *args):
        return self.file.seek(*args)

wrapper = Wrap(sys.stdin)
f = gzip.GzipFile("", "rb", fileobj=wrapper)

try:
    line = f.readline() # raises exception if not gzipped
except IOError:
    f = sys.stdin
    # put back the bytes gzip consumed while checking the magic number
    line = wrapper.data + f.readline()

etc, as above.  As you see, a Wrap instance just delegates read, tell
and seek calls (the ones we expect from GzipFile's readline, after
a little experimentation) right to the file object it wraps,
*except* that, on the *very first* call to .read, it also stashes
away the returned data as its own instance-data attribute '.data'.
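
For readers on Python 3, where sys.stdin is a text stream and the raw
bytes are available as sys.stdin.buffer, the same idea might be sketched
as below.  This is an illustrative adaptation, not code from the original
message: the names StashFirstRead, open_maybe_gzipped and read_lines are
made up for the example, io.BytesIO objects stand in for standard input,
and it assumes that gzip still begins with a 2-byte read for the magic
number and never calls tell or seek while reading forward (so only read
is delegated).

```python
import gzip
import io


class StashFirstRead:
    """Delegate read() to a binary stream, remembering the first chunk."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.data = None  # bytes returned by the very first read()

    def read(self, size=-1):
        chunk = self.fileobj.read(size)
        if self.data is None:
            self.data = chunk
        return chunk


def open_maybe_gzipped(raw):
    """Return (first_line, stream) for a gzipped or plain binary stream."""
    wrapper = StashFirstRead(raw)
    f = gzip.GzipFile(fileobj=wrapper, mode="rb")
    try:
        return f.readline(), f
    except OSError:  # gzip.BadGzipFile is a subclass of OSError
        # Not gzipped: glue the stashed bytes back onto the first line.
        return wrapper.data + raw.readline(), raw


def read_lines(raw):
    """Collect all lines using the 'while line:' loop shape from above."""
    line, f = open_maybe_gzipped(raw)
    lines = []
    while line:
        lines.append(line)
        line = f.readline()
    return lines


# Stand-ins for standard input: the same text, plain and gzipped.
text = b"first line\nsecond line\n"
plain = read_lines(io.BytesIO(text))
zipped = read_lines(io.BytesIO(gzip.compress(text)))
```

In a real program one would presumably pass sys.stdin.buffer in place of
the BytesIO objects.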

This should work, with one caveat: if a newline falls within those
first two bytes of non-gzipped input, wrapper.data + f.readline()
glues the first two lines together, so split that string if it
matters to you.  Should performance prove unsatisfactory, we
might play dirtier tricks yet (an f.fileobj = wrapper.file right
after the first line in the try-body should get the wrapper out
of the way nicely, for example -- but that means _very_ heavy
coupling to gzip's implementation, so I wouldn't do it unless
measurements had demonstrated that the wrapper's overhead is
really critical to the overall application's performance; there
are less-heavily-coupling speedup alternatives to consider, too,
such as NOT reading one line at a time, but several!-).
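
Reading several lines per call could look something like the sketch
below.  It leans on the standard readlines(hint) method of file objects;
the helper name iter_line_batches and the 64 KB default are hypothetical
choices for illustration, and whether batching actually helps is
something only measurement can tell.

```python
import gzip
import io


def iter_line_batches(f, hint=64 * 1024):
    """Yield lists of lines, each list holding roughly `hint` bytes."""
    while True:
        batch = f.readlines(hint)
        if not batch:
            break
        yield batch


# In-memory gzipped data standing in for standard input.
payload = b"".join(b"line %d\n" % i for i in range(1000))
f = gzip.GzipFile(fileobj=io.BytesIO(gzip.compress(payload)), mode="rb")
batches = list(iter_line_batches(f, hint=256))
total_lines = sum(len(batch) for batch in batches)
```

The processing loop then iterates over each batch, so the per-call
overhead of the wrapper and of gzip is paid once per batch, not once
per line.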


Crucial issue -- IS this the "simplest thing that could
possibly work"?  I think that, roughly, it may be.  You have
indicated that you just can't afford reading _all_ of sys.stdin
into memory, so you will have to read it in pieces -- and the
very first piece will need to be examined twice in some cases
(by gzip, then by yourself if the file is NOT gzipped).  As
gzip calls self.fileobj.read, that call will somehow have to
save the data being read and returned (the first time only).

One could examine the magic number first, but then a wrapper
would be needed anyway to let gzip recover that data through
its own call to read; unless one wants to subclass GzipFile
and override _read_gzip_header, but that doesn't look like a
particularly clever idea either -- it's a complex method doing
lots of clever stuff.  Wrapping the file object one way or
another would appear to be substantially simpler.
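
For comparison, the "examine the magic number first" variant might be
sketched as follows in Python 3.  Everything named here (GZIP_MAGIC,
Replay, open_sniffed) is hypothetical, and io.BytesIO again stands in
for standard input.  The wrapper hands the two sniffed bytes back to
whoever reads next, be it gzip or the plain-text loop.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


class Replay(io.RawIOBase):
    """Raw stream that serves `prefix` first, then reads on from `fileobj`."""

    def __init__(self, prefix, fileobj):
        super().__init__()
        self.prefix = prefix
        self.fileobj = fileobj

    def readable(self):
        return True

    def readinto(self, buf):
        if self.prefix:
            # Hand back the sniffed bytes before touching the real stream.
            n = min(len(buf), len(self.prefix))
            buf[:n] = self.prefix[:n]
            self.prefix = self.prefix[n:]
            return n
        data = self.fileobj.read(len(buf))
        buf[:len(data)] = data
        return len(data)


def open_sniffed(raw):
    """Return a line-readable stream over gzipped or plain binary input."""
    head = raw.read(2)
    stream = io.BufferedReader(Replay(head, raw))
    if head == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=stream, mode="rb")
    return stream


# The text starts with a newline on purpose, to exercise the edge case.
text = b"\nx\nfirst line\n"
plain_lines = open_sniffed(io.BytesIO(text)).readlines()
gzip_lines = open_sniffed(io.BytesIO(gzip.compress(text))).readlines()
```

Because the sniffed bytes are replayed through an ordinary buffered
reader, a newline among them is handled like any other.  Note that a
matching magic number only says the input looks like gzip; a corrupt
stream can still raise an error later on.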


Alex





