[issue20781] BZ2File doesn't decompress some .bz2 files correctly

Thu Feb 27 10:24:06 CET 2014

Nadeem Vawda added the comment:

> How does one create a multi-stream bzip2 file in the first place?

If you didn't do so deliberately, I would guess that you used a parallel
compression tool like pbzip2 or lbzip2 to create your bz2 file. These tools work
by splitting the input into chunks, compressing each chunk as a separate stream,
and then concatenating these streams afterward.

Another possibility is that you just concatenated two existing bz2 files, e.g.:

    $ cat first.bz2 second.bz2 >multi.bz2

> And how do I tell it's multi-stream.

I don't know of any pre-existing tools to do this, but you can write a script
for it yourself, by feeding the file's data through a BZ2Decompressor. When the
decompress() method raises EOFError, you're at the end of the first stream. If
the decompressor's unused_data attribute is non-empty, or there is data that has
not yet been read from the input file, then it is either (a) a multi-stream bz2
file or (b) a bz2 file with other metadata tacked on to the end.

To distinguish between cases (a) and (b), take unused_data + rest_of_input_file
and feed it into a new BZ2Decompressor. If don't get an IOError, then you've got
a multi-stream bz2 file.

(If you *do* get an IOError, then that's case (b) - someone's appended non-bz2
 data to the end of a bz2 file. For example, Gentoo and Sabayon Linux packages
 are bz2 files with package metadata appended, according to issue 19839.)

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue20781>
_______________________________________