Finding size of Variable
Peter Otten
__peter__ at web.de
Wed Feb 5 03:27:15 EST 2014
Ayushi Dalmia wrote:
> On Wednesday, February 5, 2014 12:51:31 AM UTC+5:30, Dave Angel wrote:
>> Ayushi Dalmia <ayushidalmia2604 at gmail.com> wrote in message:
>>
>> > Where am I going wrong? What are the alternatives I can try?
>>
>> You've rejected all the alternatives so far without showing your
>> code, or even properly specifying your problem.
>>
>> To get the "total" size of a list of strings, try (untested):
>>
>>     a = sys.getsizeof(mylist)
>>     for item in mylist:
>>         a += sys.getsizeof(item)
>>
>> This can be high if some of the strings are interned and get
>> counted twice. But you're not likely to get closer without some
>> knowledge of the data objects and where they come from.
>>
>> --
>> DaveA
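Dave's untested sketch can be fleshed out into a runnable form; `words` is a made-up example list, and the overestimation he mentions still applies:

```python
import sys

def total_size(strings):
    # Count the list object itself plus each string it holds.
    # Interned strings that appear more than once are counted each
    # time, so this can overestimate, as Dave notes.
    total = sys.getsizeof(strings)
    for item in strings:
        total += sys.getsizeof(item)
    return total

words = ["alpha", "beta", "gamma"]
print(total_size(words))
```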
>
> Hello Dave,
>
> I just thought that saving others' time is better, and hence I explained
> only a subset of my problem. Here is what I am trying to do:
>
> I am trying to index the current wikipedia dump without using databases
> and create a search engine for Wikipedia documents. Note, I CANNOT USE
> DATABASES. My approach:
>
> I am parsing the wikipedia pages using SAX Parser, and then, I am dumping
> the words along with the posting list (a list of doc ids in which the word
> is present) into different files after reading 'X' number of pages. Now
> these files may have the same word and hence I need to merge them and
> write the final index again. Now these final index files must be of
> limited size. This is where I am stuck: I need to know how to determine
> the size of the content in a variable before I write it to the file.
>
> Here is the code for my merging:
>
> def mergeFiles(pathOfFolder, countFile):
>     listOfWords={}
>     indexFile={}
>     topOfFile={}
>     flag=[0]*countFile
>     data=defaultdict(list)
>     heap=[]
>     countFinalFile=0
>     for i in xrange(countFile):
>         fileName = pathOfFolder+'\index'+str(i)+'.txt.bz2'
>         indexFile[i]= bz2.BZ2File(fileName, 'rb')
>         flag[i]=1
>         topOfFile[i]=indexFile[i].readline().strip()
>         listOfWords[i] = topOfFile[i].split(' ')
>         if listOfWords[i][0] not in heap:
>             heapq.heappush(heap, listOfWords[i][0])
At this point you have already gone wrong: your heap contains the
complete data, and you have performed a lot of O(N) membership tests on
it. This is both slow and memory-hungry. See
http://code.activestate.com/recipes/491285-iterator-merge/
for a sane way to merge sorted data from multiple files. Your code becomes
(untested):
with open("outfile.txt", "wb") as outfile:
    infiles = []
    for i in xrange(countFile):
        filename = os.path.join(pathOfFolder, 'index'+str(i)+'.txt.bz2')
        infiles.append(bz2.BZ2File(filename, "rb"))
    outfile.writelines(imerge(*infiles))
    for infile in infiles:
        infile.close()
Once you have your data in a single file you can read from that file and do
the postprocessing you mention below.
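For reference, the imerge() from that recipe exists in the standard library as heapq.merge() (Python 2.6 and later); a tiny self-contained demonstration with two made-up sorted line lists:

```python
import heapq

# heapq.merge() lazily merges already-sorted iterables without
# loading everything into memory -- exactly what the recipe's
# imerge() does.
a = ["apple 1\n", "cat 3\n", "zebra 7\n"]
b = ["bat 2\n", "cat 5\n", "dog 4\n"]

merged = list(heapq.merge(a, b))
# Lines come out in sorted order; duplicate keys ("cat") end up
# adjacent, ready for a grouping pass afterwards.
```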
>     while any(flag)==1:
>         temp = heapq.heappop(heap)
>         for i in xrange(countFile):
>             if flag[i]==1:
>                 if listOfWords[i][0]==temp:
>                     # This is where I am stuck. I cannot wait until memory
>                     # error, as I need to do some postprocessing too.
>                     try:
>                         data[temp].extend(listOfWords[i][1:])
>                     except MemoryError:
>                         writeFinalIndex(data, countFinalFile, pathOfFolder)
>                         data=defaultdict(list)
>                         countFinalFile+=1
>                     topOfFile[i]=indexFile[i].readline().strip()
>                     if topOfFile[i]=='':
>                         flag[i]=0
>                         indexFile[i].close()
>                         os.remove(pathOfFolder+'\index'+str(i)+'.txt.bz2')
>                     else:
>                         listOfWords[i] = topOfFile[i].split(' ')
>                         if listOfWords[i][0] not in heap:
>                             heapq.heappush(heap, listOfWords[i][0])
>     writeFinalIndex(data, countFinalFile, pathOfFolder)
>
> countFile is the number of files, and the writeFinalIndex method writes
> the index to a file.
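As for the size limit: rather than waiting for MemoryError, you can track the number of bytes you have written and roll over to a new output file when a record would push the current one past the limit. A minimal sketch, assuming the merged lines are sorted "word docid docid..." strings; `write_limited` and the `open_out(n)` factory are hypothetical names, not part of your code:

```python
from itertools import groupby

def write_limited(lines, limit, open_out):
    # Group sorted "word docid..." lines by word, merge their posting
    # lists, and start a new output file whenever the current one
    # would exceed `limit` bytes. open_out(n) is a hypothetical
    # factory returning a writable file object for chunk number n.
    chunk, written = 0, 0
    out = open_out(chunk)
    for word, group in groupby(lines, key=lambda l: l.split(' ', 1)[0]):
        ids = []
        for line in group:
            ids.extend(line.strip().split(' ')[1:])
        record = word + ' ' + ' '.join(ids) + '\n'
        if written and written + len(record) > limit:
            out.close()
            chunk += 1
            out = open_out(chunk)
            written = 0
        out.write(record)
        written += len(record)
    out.close()
    return chunk + 1  # number of chunk files produced
```

Because the size is measured before writing, no file ever exceeds the limit by more than one record, and you never rely on the interpreter running out of memory.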