Reading compressed files

Jordan jordan.taylor2 at gmail.com
Wed Feb 21 12:55:38 EST 2007


On Feb 21, 5:21 am, Steve Holden <s... at holdenweb.com> wrote:
> Shadab Sayani wrote:
> > Hi,
> > I have compressed files compressed using different techniques
> > (especially unix compress). So I want to have a module that reads any of
> > these (.Z,.bz,.tgz files) files and manipulates the data.
> > The data has a syntax.It contains
> > HEADER (some information)
> > BODY      (some information)
> > FOOTER   (some information)
> > If it were a normal text file I can get the values corresponding to
> > HEADER BODY and FOOTER by open function.
> > But here the files are in different format .Z , .bz ,.tgz,.gz .But I
> > know these are the only formats.Also I cannot rely upon the extensions
> > of the file (a .Z file can have no extension at all).Is there a way to
> > identify  which file am I reading and then  read it?If so how to read it?
> > Thanks and Regards,
> > Shadab.
>
> The usual way is that used by the "file" utility - take a look at the
> /etc/magic file to see if you can gain any clues from that.
>
> regards
>   Steve
> --
> Steve Holden       +44 150 684 7255  +1 800 494 3119
> Holden Web LLC/Ltd          http://www.holdenweb.com
> Skype: holdenweb    http://del.icio.us/steve.holden
> Blog of Note:          http://holdenweb.blogspot.com
> See you at PyCon?        http://us.pycon.org/TX2007

You really need to check the docs and do a little research before
asking questions like this.  Check out the tarfile, gzip and bz2
modules (use these for opening .tgz, .gz, .bz2 etc), and then go search
for how each format (.Z, .tgz, .gz, .bz etc) lays out its file header,
so that you can determine what type of archive each file is.
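
A minimal sketch of that header check, assuming Python 3. The names
MAGIC, sniff_format and sniff_file are made up for illustration; the
byte signatures are the usual leading "magic" bytes for gzip, Unix
compress and bzip2. A .tgz is a gzip stream wrapping a tar archive, so
it shows up as gzip here.

```python
# Leading "magic" bytes of each format (hypothetical table name).
MAGIC = {
    b"\x1f\x8b": "gzip",      # .gz, and .tgz (gzip-compressed tar)
    b"\x1f\x9d": "compress",  # .Z (Unix compress)
    b"BZh": "bzip2",          # .bz / .bz2
}


def sniff_format(first_bytes):
    """Return a format name based on the leading bytes, or None."""
    for magic, name in MAGIC.items():
        if first_bytes.startswith(magic):
            return name
    return None


def sniff_file(path):
    """Read the first few bytes of a file in binary mode and classify them."""
    with open(path, "rb") as f:
        return sniff_format(f.read(4))
```

Because only the file contents are inspected, this works even when the
file has no extension or a misleading one.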

> The data has a syntax.It contains
> HEADER (some information)
> BODY      (some information)
> FOOTER   (some information)
> If it were a normal text file I can get the values corresponding to
> HEADER BODY and FOOTER by open function.
> But here the files are in different format .Z , .bz ,.tgz,.gz ...

What do you mean by "if it were a normal text file"? Open the file with
'rb' to read it as binary and you shouldn't have a problem, provided
you know the layout of each of these formats.  What you need to do is
research each format and write a regexp or other string searching
function that identifies it from the first few bytes of the archive
header. While you're at it, open a few archives with a hex editor or
with open(...,'rb') and take a look at the first bytes of each file to
see if you can work it out yourself.
Good luck.
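
A rough sketch of the 'rb' approach, assuming Python 3 (where gzip.open
and bz2.open both exist). The function name read_decompressed is made
up for illustration. As far as I know the standard library has no
reader for .Z files, so that case is only detected here, not decoded.

```python
import bz2
import gzip


def read_decompressed(path):
    """Return the decompressed bytes of a gzip or bzip2 file.

    Files with no recognised signature are assumed to be uncompressed
    and are returned as-is.
    """
    # Peek at the header in binary mode to decide how to open the file.
    with open(path, "rb") as f:
        head = f.read(3)

    if head.startswith(b"\x1f\x8b"):      # gzip (.gz, .tgz)
        opener = gzip.open
    elif head.startswith(b"BZh"):         # bzip2 (.bz, .bz2)
        opener = bz2.open
    elif head.startswith(b"\x1f\x9d"):    # Unix compress (.Z)
        raise NotImplementedError(
            ".Z data needs an external tool or a third-party module")
    else:
        opener = open                     # assumption: plain file

    with opener(path, "rb") as f:
        return f.read()
```

For a .tgz this returns the raw tar stream; tarfile.open(path) can be
used instead to walk the archive members. Once the bytes are
decompressed, the HEADER, BODY and FOOTER parts can be pulled out the
same way as from a normal file.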

Cheers,
Jordan



