fseek In Compressed Files

Tue Feb 4 07:39:47 EST 2014

On Tuesday, February 4, 2014 2:27:38 AM UTC+5:30, Dave Angel wrote:
> Ayushi Dalmia <ayushidalmia2604 at gmail.com> Wrote in message:
> 
> > On Thursday, January 30, 2014 4:20:26 PM UTC+5:30, Ayushi Dalmia wrote:
> 
> >> Hello,
> >>
> >> I need to randomly access a bzip2 or gzip file. How can I set the offset for a line and later retrieve the line from the file using the offset? Pointers in this direction will help.
> 
> > 
> 
> > This is what I have done:
> >
> > import bz2
> > import sys
> > from random import randint
> >
> > index={}
> >
> > data=[]
> > f=open('temp.txt','r')
> > for line in f:
> >     data.append(line)
> >
> > filename='temp1.txt.bz2'
> > with bz2.BZ2File(filename, 'wb', compresslevel=9) as f:
> >     f.writelines(data)
> >
> > prevsize=0
> > list1=[]
> > offset={}
> > with bz2.BZ2File(filename, 'rb') as f:
> >     for line in f:
> >         words=line.strip().split(' ')
> >         list1.append(words[0])
> >         offset[words[0]]= prevsize
> >         prevsize = sys.getsizeof(line)+prevsize
> 
> 
> 
> sys.getsizeof looks at the internal size of a Python object, and is
>  totally unrelated to the size on disk of a text line. len() might
>  come closer, unless you're on Windows. You really should be using
>  tell() to define the offsets for a later seek(). In text mode any
>  other calculation is not legal, i.e. undefined.
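
To illustrate the tell()/seek() advice above, here is a minimal sketch, not the poster's code. It assumes Python 3 (bz2.open, bytes lines) and that the first whitespace-separated field of each line is the key. The helper names build_offsets and fetch_line and the demo file are made up for the example. The offsets recorded are positions in the decompressed stream, not in the .bz2 file on disk.

```python
import bz2
import os
import tempfile


def build_offsets(path):
    """Map the first field of each line to its offset in the decompressed stream."""
    offsets = {}
    with bz2.open(path, "rb") as f:
        while True:
            pos = f.tell()              # position before the line is read
            line = f.readline()
            if not line:                # empty bytes object means end of file
                break
            fields = line.split(None, 1)
            if fields:                  # skip blank lines
                offsets[fields[0].decode("ascii")] = pos
    return offsets


def fetch_line(path, offsets, key):
    """Seek to a recorded offset and return the line that starts there."""
    with bz2.open(path, "rb") as f:
        f.seek(offsets[key])
        return f.readline()


# Small demo file (hypothetical data in the same shape as the posting list).
lines = [
    b"1 456 t0b3c0i0e0:784 t0b2c0i0e0\n",
    b"2 221 t0b1c0i0e0:774 t0b1c0i0e0\n",
    b"3 455 t0b7c0i0e0\n",
]
demo_path = os.path.join(tempfile.mkdtemp(), "temp1.txt.bz2")
with bz2.open(demo_path, "wb") as f:
    f.writelines(lines)

offsets = build_offsets(demo_path)
print(fetch_line(demo_path, offsets, "2"))
```

Note that seeking in a BZ2File is emulated: the module has to decompress from the start of the stream (or from the current position) up to the target offset, so each random access can be slow on a large file even though the offsets themselves are correct.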
> 
> 
> 
> > 
> 
> > 
> 
> > data=[]
> > count=0
> >
> > with bz2.BZ2File(filename, 'rb') as f:
> >     while count<20:
> >         y=randint(1,25)
> >         print y
> >         print offset[str(y)]
> >         count+=1
> >         f.seek(int(offset[str(y)]))
> >         x= f.readline()
> >         data.append(x)
> >
> > f=open('b.txt','w')
> > f.write(''.join(data))
> > f.close()
> >
> > where temp.txt is the posting list file which is first written in a compressed format and then read later.
> 
> 
> 
> I thought you were starting with a compressed file.  If you're
>  being given an uncompressed file, just deal with it directly.
> 
>  
> 
> 
> 
> > I am trying to build the index for the entire Wikipedia dump, which needs to be done in a space- and time-optimised way. The temp.txt is as follows:
> >
> > 1 456 t0b3c0i0e0:784 t0b2c0i0e0:801 t0b2c0i0e0
> > 2 221 t0b1c0i0e0:774 t0b1c0i0e0:801 t0b2c0i0e0
> > 3 455 t0b7c0i0e0:456 t0b1c0i0e0:459 t0b2c0i0e0:669 t0b10c11i3e0:673 t0b1c0i0e0:678 t0b2c0i1e0:854 t0b1c0i0e0
> > 4 410 t0b4c0i0e0:553 t0b1c0i0e0:609 t0b1c0i0e0
> > 5 90 t0b1c0i0e0
> 
> 
> 
> So every line begins with its line number in ASCII form?  If true,
>  the dict above called offset should just be a list.
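
A rough sketch of the list-based index suggested above, working directly on the uncompressed file. It assumes Python 3, that line n of the file really does start with the number n, and that the file is opened in binary mode so that len() of each line equals its length on disk. The names build_offset_list and read_line are invented for the example, and the third sample line is abbreviated.

```python
import os
import tempfile


def build_offset_list(path):
    """Return a list in which entry n-1 is the byte offset of line n."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:         # binary mode: offsets are plain byte counts
        for line in f:
            offsets.append(pos)
            pos += len(line)            # bytes length, newline included
    return offsets


def read_line(path, offsets, line_number):
    """Return line `line_number` (1-based) using the precomputed offsets."""
    with open(path, "rb") as f:
        f.seek(offsets[line_number - 1])
        return f.readline()


# Demo data in the shape of temp.txt (line 3 shortened for the example).
sample = [
    b"1 456 t0b3c0i0e0:784 t0b2c0i0e0:801 t0b2c0i0e0\n",
    b"2 221 t0b1c0i0e0:774 t0b1c0i0e0:801 t0b2c0i0e0\n",
    b"3 455 t0b7c0i0e0:456 t0b1c0i0e0\n",
]
sample_path = os.path.join(tempfile.mkdtemp(), "temp.txt")
with open(sample_path, "wb") as f:
    f.writelines(sample)

sample_offsets = build_offset_list(sample_path)
print(read_line(sample_path, sample_offsets, 2))
```

Because the key is just the line number, a list indexed by line number minus one holds the same information as the dict without storing the keys.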
> 
>  
> 
> 
> 
> Maybe you should just quote the entire assignment.  You're
>  probably adding way too much complication to it.
> 
> 
> 
> -- 
> 
> DaveA

Hey! I am new here. Sorry about the incorrect posts. Didn't understand the protocol then.

Although I have the uncompressed text, I cannot start right away with them