[Tutor] first python program to find citeulike duplicates

Suresh Krishna madzientist at gmail.com
Thu Nov 20 12:54:20 CET 2008


hi everybody,

i wrote this to solve the problem of exact duplicate entries in my  
citeulike library, that i wanted to remove. so i exported my entries in  
ris format, and then parsed the entries to find exact duplicates based on  
matching fields. the exact duplicates came about because i uploaded the  
same RIS file twice to my citeulike library, as a result of the upload  
being interrupted the first time.
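
The "matching fields" idea can be sketched on its own, as a rough illustration rather than the exact method used in the script below. The helper name `entry_key` and the two sample records are made up for this example; the tag list is the one the script below checks. Two entries count as exact duplicates when every line carrying one of those tags is identical, so a field outside the list (the `ID` lines here) does not affect the comparison.

```python
# Tags whose lines must match for two entries to count as exact duplicates
# (the same tags the script below checks).
KEY_TAGS = ('TY', 'JF', 'EP', 'TI', 'SP', 'KW', 'AU', 'PY', 'UR')


def entry_key(entry_lines, tags=KEY_TAGS):
    """Return a hashable key built from the lines whose RIS tag is in `tags`."""
    # entry_key is a hypothetical helper name, not part of the original script.
    return tuple(line.rstrip('\r\n') for line in entry_lines if line[:2] in tags)


# Made-up sample records: they differ only in a field that is not in KEY_TAGS.
first = ["TY  - JOUR\n", "TI  - A title\n", "AU  - Doe, J.\n", "ID  - 101\n"]
second = ["TY  - JOUR\n", "TI  - A title\n", "AU  - Doe, J.\n", "ID  - 202\n"]

print(entry_key(first) == entry_key(second))  # True: the ID lines are ignored
```

Using a tuple of lines instead of one joined string keeps the key hashable, so it can later be stored in a set.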

it works (i think), but since this is my very first python program, i  
would really appreciate feedback on how the program could be improved.

thanks much !!!!

suresh

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~éééé

InFileName = "original_library.ris"
INBIBFILE = open(InFileName, 'r')

OutFileName = "C:/users/skrishna/desktop/library_without_duplicates.ris"
OUTBIBFILE = open(OutFileName, 'w')

OutDupFileName = "C:/users/skrishna/desktop/library_of_duplicates.ris"
OUTDUPBIBFILE = open(OutDupFileName, 'w')

current_entry=[]
current_keyval=[]
current_keys=[]

numduplicates=0

for line in INBIBFILE:  # large file, so iterate instead of using readlines()

    if not current_entry and line.isspace():
        continue  # don't write out successive blanks or initial blanks
    elif current_entry and line.isspace():
        # reached a blank line that marks the end of the current entry

        # generate a key based on certain fields
        keyvalue = ''.join(current_keyval)
        if keyvalue not in current_keys:  # is a unique entry
            current_keys.append(keyvalue)  # append current key to list of keys
            current_entry.append(line)  # add the blank line to current entry
            # write out to the new bib file without duplicates
            OUTBIBFILE.writelines(current_entry)
            current_entry = []  # clear current entry for next one
            current_keyval = []  # clear current key
        else:
            numduplicates += 1  # increment the number of duplicates
            current_entry.append(line)  # add the blank line at end of entry
            # write out to the duplicates file
            OUTDUPBIBFILE.writelines(current_entry)
            current_entry = []  # clear current entry for next one
            current_keyval = []  # clear current key
    elif len(line) > 2:  # not a blank, so more stuff in current entry
        current_entry.append(line)
        # only lines that start with these field tags go into the key
        if line[0:2] in ('TY', 'JF', 'EP', 'TI', 'SP', 'KW', 'AU', 'PY', 'UR'):
            current_keyval.append(line)  # append to current key

INBIBFILE.close()
OUTBIBFILE.close()
OUTDUPBIBFILE.close()

print(numduplicates)
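
One possible restructuring of the loop above, offered as a sketch under a few assumptions rather than a definitive rewrite. The names `iter_entries` and `split_duplicates` and the in-memory sample are hypothetical, and the code targets Python 3. It differs from the script in three ways: keys are kept in a set, so each membership test does not scan a growing list; the generator also yields a final entry that is not followed by a blank line, which the loop above would leave unwritten; and non-blank lines of any length are kept. Passing file objects as arguments lets the same function run on real files or on `io.StringIO` buffers.

```python
import io

KEY_TAGS = ('TY', 'JF', 'EP', 'TI', 'SP', 'KW', 'AU', 'PY', 'UR')


def iter_entries(lines):
    """Yield each entry as a list of lines; blank lines separate entries."""
    entry = []
    for line in lines:
        if line.strip():
            entry.append(line)
        elif entry:
            yield entry
            entry = []
    if entry:  # last entry when the input does not end with a blank line
        yield entry


def split_duplicates(infile, outfile, dupfile, tags=KEY_TAGS):
    """Write first occurrences to outfile and repeats to dupfile.

    Returns the number of repeated entries.
    """
    seen = set()
    numduplicates = 0
    for entry in iter_entries(infile):
        key = tuple(line.rstrip('\r\n') for line in entry if line[:2] in tags)
        if key in seen:
            numduplicates += 1
            target = dupfile
        else:
            seen.add(key)
            target = outfile
        for line in entry:
            target.write(line.rstrip('\r\n') + '\n')
        target.write('\n')  # blank line closes the entry
    return numduplicates


# Made-up sample: the second entry repeats the first; the third has no
# trailing blank line.
sample = io.StringIO("TY  - JOUR\nTI  - A\n\nTY  - JOUR\nTI  - A\n\nTY  - JOUR\nTI  - B")
kept, dups = io.StringIO(), io.StringIO()
print(split_duplicates(sample, kept, dups))  # 1
```

With real files, the three arguments could be opened in a single `with` statement so they are closed even if an error occurs partway through.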


