[New-bugs-announce] [issue7011] tarinfo has problems with longlinks

Mon Sep 28 11:21:28 CEST 2009

New submission from Patrick Gerken <patrick.gerken at computer.org>:

Sadly, I have not been able to debug this far enough to provide a
thorough test case. I can provide information on how to reproduce the
problem on request. I have a tar file and a diff to tarfile.py with some
pdb breakpoints that only get activated in the middle of the file, just
before the problematic data.

Installing an egg fails, and setuptools eats the original error.
The original error is this:
ValueError: 'invalid literal for int(): \xcf\xcf\xdf\xfc\xe9\xcd\xa9\xa9'

That happens in the call to next() in the class TarFile. There we read a
chunk of file data and let TarInfo parse it as a header. But the chunk of
data is actually the beginning of an image stored in the tar file.
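
To illustrate why a block of image data produces this kind of error, here
is a minimal sketch in present-day Python 3. The helper parse_octal_field
is a hypothetical stand-in for the number parsing applied to tar header
fields, not the code from tarfile.py; it assumes the field holds an octal
number terminated by NUL or space.

```python
def parse_octal_field(raw):
    """Parse a NUL- or space-terminated octal number from a header field.

    Hypothetical helper for illustration only, not tarfile's own code.
    """
    text = raw.split(b"\0", 1)[0].strip().decode("latin-1")
    return int(text or "0", 8)


# A well-formed mode field parses fine.
mode = parse_octal_field(b"0000644\0")

# The bytes from the traceback are not octal digits, so int() rejects them.
error = None
try:
    parse_octal_field(b"\xcf\xcf\xdf\xfc\xe9\xcd\xa9\xa9")
except ValueError as exc:
    error = exc
```

Any 512-byte block that is really file content will almost always fail
this way when it is treated as a header.
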
Here is a more thorough report of my pdb findings:

Environment:
I created an egg on Linux, which resulted in a tar.gz file. Installing
that egg fails, because the tarfile library has problems reading the tar
file. tar itself can extract the full file without problems.
I have a self-compiled Python 2.4.6.

The last file that is apparently read correctly from TarFile.next is a
LONGLINK, tarinfo.type == 'L'.
This type has a method callback in TarFile.TYPE_METH, which is used for
returning the real TarInfo object. That goes into proc_gnulong of
tarfile.py.
This proc_gnulong method calls next again, to get the real file info, I
think.
The next buffer that is read contains a file name that is exactly
100 characters long, and seems to be a directory, because it has a
trailing slash, but its file type is '0'.
I suspected the error here, and to cross-check, I looked at the output of
"tar -tf" on the tar file. I expect tar to return the file names in the
same order as Python reads them in. Before the directory that next seems
to find, its parent directory is listed. The previous tarinfo
object is exactly about this parent directory. So it looks like we
actually have a directory entry here.
Enough wild guesses and more observations: The next call of
TarFile.next() creates a TarInfo object again. Here, at about line 693,
the code checks if the file is a regular file but ends with a slash. If
so, it changes the file type from '0', regular file, to '5', DIRTYPE. It
actually does that with our TarInfo object.
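
For reference, the LONGLINK mechanism mentioned above can be inspected
with the tarfile module of a current Python 3 (not the 2.4 version this
report is about). The sketch below writes one member whose name exceeds
100 characters in GNU format and then looks at the raw header blocks. The
member name and payload are made up for illustration; the offsets assume
the standard 512-byte header layout with the type flag at byte 156.

```python
import io
import tarfile

# Hypothetical member: a name longer than 100 characters and a tiny payload.
long_name = "pkg/" + "d" * 110 + "/image.gif"
payload = b"GIF89a" + b"\0" * 10

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tf:
    info = tarfile.TarInfo(long_name)
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))

raw = buf.getvalue()

# Block 0: the extra 'L' header announcing a long name.
first_name = raw[0:100].rstrip(b"\0")
first_type = raw[156:157]

# Block 1: the data of the 'L' entry, which is the full name.
stored_name = raw[512:512 + len(long_name)].decode("ascii")

# Block 2: the real header, whose name field holds only 100 bytes.
real_name_field = raw[1024:1124]
real_type = raw[1024 + 156:1024 + 157]

# Reading the archive back yields the full name again.
buf.seek(0)
with tarfile.open(fileobj=buf) as tf:
    names = tf.getnames()
```

The point relevant to this report is that the header following an 'L'
entry carries only the first 100 bytes of the name.
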

The TarInfo object is created successfully and the next method continues
to run. Then, around line 1650, there is a check, if tarinfo.isreg() or
tarinfo.type not in SUPPORTED_TYPES:...
Here the offset for reading the next TarInfo buffer is increased by the
size of the file's data in the tar file. But not for our TarInfo
object, because it is no longer a regular file. If I pad the offset
manually, everything continues to work. But I won't do it this time.
The call to next finishes, and after a while TarFile.next() is called
again. This time, next tries to read a chunk of data again, but the
chunk is actual file content. It starts with 'GIF89a...',
which makes sense, since the directory contains images. Here parsing of
the tar file fails.
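
The effect described above can be modelled without tarfile at all. The
following is a simplified sketch, not the code from tarfile.py: it
assumes 512-byte blocks, models only the isreg() part of the offset
check, and uses made-up member names and sizes.

```python
BLOCK = 512
REGTYPE, DIRTYPE = "0", "5"


def data_blocks(size):
    """Bytes taken by `size` bytes of member data, padded to full blocks."""
    return ((size + BLOCK - 1) // BLOCK) * BLOCK


def true_header_offsets(members):
    """Where the headers really are: every header is followed by its data."""
    offsets, offset = [], 0
    for _name, _typeflag, size in members:
        offsets.append(offset)
        offset += BLOCK + data_blocks(size)
    return offsets


def reader_header_offsets(members):
    """Where a reader looks if it skips data only for regular files,
    after turning '0' entries with a trailing slash into directories."""
    offsets, offset = [], 0
    for name, typeflag, size in members:
        offsets.append(offset)
        offset += BLOCK
        if typeflag == REGTYPE and name.endswith("/"):
            typeflag = DIRTYPE
        if typeflag == REGTYPE:
            offset += data_blocks(size)
    return offsets


# Hypothetical archive: (name, type flag, size in bytes).
members = [
    ("pkg/images/", "5", 0),
    ("pkg/images/sub/", "0", 1300),  # type '0', trailing slash, has data
    ("pkg/next.txt", "0", 10),
]

actual = true_header_offsets(members)      # [0, 512, 2560]
assumed = reader_header_offsets(members)   # [0, 512, 1024]
```

In this model the third header is searched for at offset 1024, which lies
inside the data of the second member, so the reader would try to parse
file content as a header.
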

----------
components: Library (Lib)
messages: 93198
nosy: do3cc
severity: normal
status: open
title: tarinfo has problems with longlinks
versions: Python 2.4

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue7011>
_______________________________________