fseek In Compressed Files

Ayushi Dalmia ayushidalmia2604 at gmail.com
Thu Jan 30 08:34:57 EST 2014


On Thursday, January 30, 2014 4:20:26 PM UTC+5:30, Ayushi Dalmia wrote:
> Hello,
> 
> I need to randomly access a bzip2 or gzip file. How can I record the offset of a line and later retrieve the line from the file using that offset? Pointers in this direction would help.

This is what I have done:

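import bz2
from random import randint

# Read the uncompressed posting list as bytes, one entry per line.
with open('temp.txt', 'rb') as f:
    data = f.readlines()

# Write the same lines to a bzip2-compressed file.
filename = 'temp1.txt.bz2'
with bz2.BZ2File(filename, 'wb', compresslevel=9) as f:
    f.writelines(data)

# Map each term id (the first field of a line) to the offset of that line
# in the uncompressed stream. len(line) is the line's size in bytes;
# sys.getsizeof(line) would also count the Python object overhead.
prevsize = 0
list1 = []
offset = {}
with bz2.BZ2File(filename, 'rb') as f:
    for line in f:
        key = int(line.split()[0])
        list1.append(key)
        offset[key] = prevsize
        prevsize += len(line)

# Fetch 20 random lines by seeking to their recorded offsets.
data = []
count = 0
with bz2.BZ2File(filename, 'rb') as f:
    while count < 20:
        y = randint(1, 25)
        print(y)
        print(offset[y])
        count += 1
        f.seek(offset[y])
        x = f.readline()
        data.append(x)

# Save the retrieved lines.
with open('b.txt', 'wb') as f:
    f.write(b''.join(data))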

Here temp.txt is the posting list file, which is first written in compressed form and then read back later. I am trying to build an index for the entire Wikipedia dump, and this needs to be done in a space- and time-efficient way. temp.txt looks like this:

1 456 t0b3c0i0e0:784 t0b2c0i0e0:801 t0b2c0i0e0
2 221 t0b1c0i0e0:774 t0b1c0i0e0:801 t0b2c0i0e0
3 455 t0b7c0i0e0:456 t0b1c0i0e0:459 t0b2c0i0e0:669 t0b10c11i3e0:673 t0b1c0i0e0:678 t0b2c0i1e0:854 t0b1c0i0e0
4 410 t0b4c0i0e0:553 t0b1c0i0e0:609 t0b1c0i0e0
5 90 t0b1c0i0e0
6 727 t0b2c0i0e0
7 431 t0b2c0i1e0
8 532 t0b1c0i0e0:652 t0b1c0i0e0:727 t0b2c0i0e0
9 378 t0b1c0i0e0
10 666 t0b2c0i0e0
11 405 t0b1c0i0e0
12 702 t0b1c0i0e0
13 755 t0b1c0i0e0
14 781 t0b1c0i0e0
15 593 t0b1c0i0e0
16 725 t0b1c0i0e0
17 989 t0b2c0i1e0
18 221 t0b1c0i0e0:402 t0b1c0i0e0:842 t0b1c0i0e0
19 405 t0b1c0i0e0
20 200 t0b1c0i0e0:300 t0b1c0i0e0:398 t0b1c0i0e0:649 t0b1c0i0e0
21 66 t0b1c0i0e0
22 30 t0b1c0i0e0
23 126 t0b1c0i0e0:895 t0b1c0i0e0
24 355 t0b1c0i0e0:374 t0b1c0i0e0:378 t0b1c0i0e0:431 t0b3c0i0e0:482 t0b1c0i0e0:546 t0b3c0i0e0:578 t0b1c0i0e0
25 198 t0b1c0i0e0
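
For reference, the offset idea can be sketched in a self-contained form as below. This is only a minimal sketch under some assumptions: the helper names build_offset_index, read_line_at and demo are made up for illustration, the first field of every line is assumed to be an integer id, and offsets are counted in bytes of the uncompressed stream (which is what BZ2File.seek expects). Note that the bz2 documentation describes seeking as emulated, so each seek may be slow on a large file.

```python
import bz2
import os
import tempfile


def build_offset_index(path):
    """Map the first field of each line to the line's uncompressed byte offset."""
    index = {}
    position = 0
    with bz2.BZ2File(path, 'rb') as f:
        for line in f:
            fields = line.split()
            if fields:  # skip blank lines
                index[int(fields[0])] = position
            position += len(line)  # bytes in the uncompressed stream
    return index


def read_line_at(path, position):
    """Seek to an uncompressed offset and return the line that starts there."""
    with bz2.BZ2File(path, 'rb') as f:
        f.seek(position)
        return f.readline()


def demo():
    """Round-trip three sample lines through a temporary .bz2 file."""
    lines = [b"1 456 t0b3c0i0e0\n", b"2 221 t0b1c0i0e0\n", b"5 90 t0b1c0i0e0\n"]
    fd, path = tempfile.mkstemp(suffix=".bz2")
    os.close(fd)
    try:
        with bz2.BZ2File(path, 'wb') as f:
            f.writelines(lines)
        index = build_offset_index(path)
        # Fetch the lines out of order to exercise seeking.
        return index, [read_line_at(path, index[k]) for k in (5, 1, 2)]
    finally:
        os.remove(path)
```

Calling demo() returns the index together with the lines for ids 5, 1 and 2, fetched in that order. Keeping the index keyed by integer id avoids any mismatch between text and byte strings when looking up an offset.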