[issue10900] bz2 module fails to uncompress large files

Charles-Francois Natali report at bugs.python.org
Tue Mar 1 23:11:33 CET 2011


Charles-Francois Natali <neologix at free.fr> added the comment:

> Stupid questions are always worth asking. I did check the MD5 sum earlier
> and just checked it again (since I copied the file from one machine to
> another):
>
> ebwolf at ubuntu:/opt$ md5sum /host/full-planet-110115-1800.osm.bz2
> 0e3f81ef0dd415d8f90f1378666a400c  /host/full-planet-110115-1800.osm.bz2
> ebwolf at ubuntu:/opt$ cat full-planet-110115-1800.osm.bz2.md5
> 0e3f81ef0dd415d8f90f1378666a400c  full-planet-110115-1800.osm.bz2
>

Well, that only proves that the file wasn't corrupted during the download.
But this doesn't prove that the file on the remote server isn't
corrupt (see for example the link I gave you, the guy used rsync and
had a correct checksum but was still unable to extract the file).

> There you have it. I was able to convert the bz2 to gzip with no errors:
>
> bzcat full-planet-110115-1800.osm.bz2 | gzip > full-planet.osm.gz
>

How big is full-planet.osm.gz?
I'm asking because bzip2 uses bzlib as well, and can very well return
after having uncompressed only half the file.
A more interesting test would be:
$ bzip2 -cd full-planet-110115-1800.osm.bz2 | bzip2 -c > full-planet.new.osm.bz2
$ md5sum full-planet.*.bz2
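
A similar check can be sketched from Python. This is only a rough sketch: it
assumes a Python 3 interpreter, where bz2.BZ2Decompressor exposes the eof and
unused_data attributes, and the function name and chunk size are made up for
illustration. It decompresses the first bz2 stream only and reports how many
bytes came out, whether the end-of-stream marker was seen, and how many
compressed bytes were left over after it.

```python
import bz2


def probe_first_stream(path, chunk_size=1024 * 1024):
    """Decompress the first bz2 stream in `path` (hypothetical helper).

    Returns (decompressed_bytes, reached_end_of_stream, compressed_bytes_left).
    """
    decomp = bz2.BZ2Decompressor()
    total = 0
    with open(path, "rb") as f:
        while not decomp.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # file ended before the stream did: truncated archive
            total += len(decomp.decompress(chunk))
        # Bytes the decompressor received after the end-of-stream marker...
        leftover = len(decomp.unused_data)
        # ...plus whatever is still unread in the file.
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            leftover += len(chunk)
    return total, decomp.eof, leftover
```

If the second value is False, the file ends in the middle of a stream. If the
third value is non-zero, decompression stopped cleanly but there is more
compressed data after that point in the file.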

> FYI: This problem came up last year with no resolution:
>
> http://mail.python.org/pipermail/tutor/2010-February/074610.html
>

Yeah, and it was also on an OSM file.
Now, I know that OSM is probably one of the biggest providers of huge
archives, but it's surprising that every time there's a problem with
bz2, it's with an OSM file, no?

Look at what I just found, a message from an OSM admin dating from late 2010:

"""
On 26 October 2010 13:47, Anthony <osm <at> inbox.org> wrote:
> a <at> A-PC:/media/usbdrive$ cat full-planet-101022.osm.bz2.md5
> 0a90fec8ce66bdd82984c2ee8c6bb6ac  full-planet-101022.osm.bz2
> a <at> A-PC:/media/usbdrive$ md5sum full-planet-101022.osm.bz2
> c652430b00668c30bb04816ff16cbfbe  full-planet-101022.osm.bz2
>
> Just me?
>

We had problems with the network card in that machine last night
causing some corruption, try
rsync://planet.openstreetmap.org/planet/full-experimental/ the file
into a good state.

Although best to wait a few hours, currently packet loss issues on
server's upstream network.

Regards
 Grant
"""

> In general, is it best to always read the same number of bytes?

In that case, it doesn't matter.

> And what is the best value to pass for buffering in BZ2File? I just made up
> something hoping it would work.

The default one ;-) (don't provide any)
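
Something along these lines is what I have in mind. It's a minimal sketch, not
your actual script: the function name and the 1 MiB chunk size are arbitrary
choices of mine, and BZ2File is opened without any buffering argument.

```python
import bz2


def count_uncompressed_bytes(path, chunk_size=1024 * 1024):
    """Read a bz2 file in fixed-size chunks and return the total size read."""
    total = 0
    f = bz2.BZ2File(path)  # read mode by default, no buffering argument
    try:
        while True:
            data = f.read(chunk_size)
            if not data:
                break  # an empty read means end of file
            # The last chunk is usually shorter than chunk_size; that's fine.
            total += len(data)
    finally:
        f.close()
    return total
```

Comparing the returned total with the size of the output of bzip2 -cd on the
same file tells you whether the module stopped early.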

> Colin was using an OSM planet file from some time last year and it quit at exactly 900000 bytes.

OSM again :-)
900000 bytes is exactly the default bz2 block size...
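
The block size can be read straight from the file: a bzip2 stream starts with
"BZh" followed by a digit from 1 to 9, which is the block size in units of
100000 bytes (9 being the default level). A small sketch, with a helper name I
made up:

```python
def bz2_block_size(path):
    """Return the block size in bytes declared in a bz2 file's header."""
    with open(path, "rb") as f:
        header = f.read(4)
    level = header[3:4]
    if len(header) != 4 or header[:3] != b"BZh" or not b"1" <= level <= b"9":
        raise ValueError("not a bzip2 stream header: %r" % (header,))
    # The header digit is the compression level; each level adds 100000 bytes.
    return int(level) * 100000
```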

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue10900>
_______________________________________

