fseek In Compressed Files

Mon Feb 3 15:57:38 EST 2014

Ayushi Dalmia <ayushidalmia2604 at gmail.com> wrote in message:
> On Thursday, January 30, 2014 4:20:26 PM UTC+5:30, Ayushi Dalmia wrote:
>> Hello,
>>
>> I need to randomly access a bzip2 or gzip file. How can I set the offset for a line and later retrieve the line from the file using the offset? Pointers in this direction will help.
> 
> This is what I have done:
> 
> import bz2
> import sys
> from random import randint
> 
> index={}
> 
> data=[]
> f=open('temp.txt','r')
> for line in f:
>     data.append(line)
> 
> filename='temp1.txt.bz2'
> with bz2.BZ2File(filename, 'wb', compresslevel=9) as f:
>     f.writelines(data)
> 
> prevsize=0
> list1=[]
> offset={}
> with bz2.BZ2File(filename, 'rb') as f:
>     for line in f:
>         words=line.strip().split(' ')
>         list1.append(words[0])
>         offset[words[0]]= prevsize
>         prevsize = sys.getsizeof(line)+prevsize

sys.getsizeof reports the in-memory size of a Python object, which is
unrelated to the number of bytes a text line occupies on disk. len()
might come closer, unless you're on Windows, where text mode translates
line endings. You really should be using tell() to record the offsets
for a later seek(). In text mode, any other calculation is not legal,
i.e. the result is undefined.
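
A minimal sketch of the tell()/seek() approach, assuming Python 3 and
binary mode. The helper names build_offsets and read_line_at are made
up for illustration. The offsets are positions in the uncompressed
stream, and seek() on a BZ2File is emulated, so it can be slow on
large files.

```python
import bz2

def build_offsets(path):
    """Map the first field of each line to the line's starting offset.

    Offsets are positions in the uncompressed data, as reported by tell().
    """
    offsets = {}
    with bz2.BZ2File(path, 'rb') as f:
        while True:
            pos = f.tell()           # position before the line is read
            line = f.readline()
            if not line:             # empty bytes object means end of file
                break
            key = line.split(b' ', 1)[0]
            offsets[key] = pos
    return offsets

def read_line_at(path, pos):
    """Seek to a previously recorded offset and return the line there."""
    with bz2.BZ2File(path, 'rb') as f:
        f.seek(pos)                  # emulated: decompresses up to pos
        return f.readline()
```

Because the file is opened in binary mode, keys and lines are bytes,
e.g. read_line_at(filename, offsets[b'3']).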

> 
> 
> data=[]
> count=0
> 
> with bz2.BZ2File(filename, 'rb') as f:
>     while count<20:
>         y=randint(1,25)
>         print y
>         print offset[str(y)]
>         count+=1
>         f.seek(int(offset[str(y)]))
>         x= f.readline()
>         data.append(x)
> 
> f=open('b.txt','w')
> f.write(''.join(data))
> f.close()
> 
> where temp.txt is the posting list file which is first written in a compressed format and then read later.

I thought you were starting with a compressed file. If you're
being given an uncompressed file, just deal with it directly.

> I am trying to build the index for the entire Wikipedia dump, which needs to be done in a space- and time-optimised way. The temp.txt is as follows:
> 
> 1 456 t0b3c0i0e0:784 t0b2c0i0e0:801 t0b2c0i0e0
> 2 221 t0b1c0i0e0:774 t0b1c0i0e0:801 t0b2c0i0e0
> 3 455 t0b7c0i0e0:456 t0b1c0i0e0:459 t0b2c0i0e0:669 t0b10c11i3e0:673 t0b1c0i0e0:678 t0b2c0i1e0:854 t0b1c0i0e0
> 4 410 t0b4c0i0e0:553 t0b1c0i0e0:609 t0b1c0i0e0
> 5 90 t0b1c0i0e0

So every line begins with its line number in ASCII form? If true,
the dict above called offset should just be a list.
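
A rough sketch of the list version, assuming the lines are numbered
consecutively from 1 and are handled as bytes. The function name
line_offsets is only illustrative. Line n then lives at index n - 1,
so no key lookup is needed.

```python
def line_offsets(lines):
    """Return a list whose entry i is the byte offset of lines[i]."""
    offsets = []
    pos = 0
    for line in lines:
        offsets.append(pos)
        pos += len(line)     # len() of a bytes line is its uncompressed size
    return offsets

# Sample lines in the same shape as temp.txt
lines = [
    b"1 456 t0b3c0i0e0\n",
    b"2 221 t0b1c0i0e0\n",
    b"3 455 t0b7c0i0e0\n",
]
offsets = line_offsets(lines)
offset_of_line_3 = offsets[3 - 1]    # line number n maps to index n - 1
```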

Maybe you should just quote the entire assignment.  You're
 probably adding way too much complication to it.

-- 
DaveA