xlrd and cPickle.dump

John Machin sjmachin at lexicon.net
Wed Apr 2 17:10:48 EDT 2008


patrick.waldo at gmail.com wrote:
>> FWIW, it works here on 2.5.1 without errors or warnings. Output is:
>> 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]
>> 0.6.1
> 
> I guess it's a version issue then...

I say again: Don't guess.
> 
> I forgot about sorted!  Yes, that would make sense!
> 
> Thanks for the input.
> 
> 
> On Apr 2, 4:23 pm, patrick.wa... at gmail.com wrote:
>> Still no luck:
>>
>> Traceback (most recent call last):
>>   File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework
>> \scriptutils.py", line 310, in RunScript
>>     exec codeObject in __main__.__dict__
>>   File "C:\text analysis\pickle_test2.py", line 13, in ?
>>    cPickle.dump(Data_sheet, pickle_file, -1)
>> PicklingError: Can't pickle <type 'module'>: attribute lookup
>> __builtin__.module failed

I didn't notice that the exception had changed from the original:
     "TypeError: can't pickle file objects" (with protocol=0)
to:
     "TypeError: can't pickle module objects" (pickling an xlrd.Book 
object with protocol=-1)
and now to:
     "PicklingError: Can't pickle <type 'module'>: attribute lookup 
__builtin__.module failed" (pickling an xlrd.Sheet object with protocol -1)

I'm wondering if this is some unfortunate side effect of running the 
script in the pywin IDE ("exec codeObject in __main__.__dict__"). Can 
you reproduce the problem by running the script in the Command Prompt 
window? What version of pywin32 are you using?
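
For what it's worth, if the pickling is only there to avoid re-parsing 
the .xls file on every run, one workaround that sidesteps the problem 
entirely is to pickle the extracted cell values rather than the Book or 
Sheet object. A minimal sketch, reusing your paths (not tested against 
your file):

import cPickle, xlrd

book = xlrd.open_workbook("C:\\test\\test.xls")
sheet = book.sheet_by_index(0)
# a plain list of lists pickles cleanly; the Sheet object does not,
# because pickle chokes on some of the objects a Sheet hangs on to
data = [sheet.row_values(n) for n in xrange(sheet.nrows)]
pickle_file = open("C:\\test\\pickle.pickle", 'wb')
cPickle.dump(data, pickle_file, -1)
pickle_file.close()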


>>
>> My code remains the same, except I added 'wb' and the -1 following
>> your suggestions:
>>
>> import cPickle,xlrd, sys
>>
>> print sys.version
>> print xlrd.__VERSION__
>>
>> data_path = """C:\\test\\test.xls"""
>> pickle_path = """C:\\test\\pickle.pickle"""
>>
>> book = xlrd.open_workbook(data_path)
>> Data_sheet = book.sheet_by_index(0)
>>
>> pickle_file = open(pickle_path, 'wb')
>> cPickle.dump(Data_sheet, pickle_file, -1)
>> pickle_file.close()
>>
>> To begin with (I forgot to mention this before) I get this error:
>> WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-
>> zero

"WARNING" != "error". If that's the only message you get, ignore it; it 
means that your XLS file was created by the perl XLS-writing package or 
a copier thereof.

>>
>> I'm not sure what this means.
>>
>>> What do you describe as "simple manipulations"? Please describe your
>>> computer, including how much memory it has.
>> I have a 1.8Ghz HP dv6000 with 2Gb of ram, which should be speedy
>> enough for my programming projects.  However, when I try to print out
>> the rows in the excel file, my computer gets very slow and choppy,
>> which makes experimenting slow and frustrating.

Just printing the rows is VERY UNLIKELY to cause this. Demonstrate this 
to yourself by using xlrd's supplied runxlrd script:

command_prompt> c:\python24\scripts\runxlrd.py show yourfile.xls


>>  Maybe cPickle won't
>> solve this problem at all!

99.9% chance, not "maybe".

>>  For this first part, I am trying to make
>> ID numbers for the different permutation of categories, topics, and
>> sub_topics.  So I will have [book,non-fiction,biography],[book,non-
>> fiction,history-general],[book,fiction,literature], etc..
>> so I want the combination of
>> [book,non-fiction,biography] = 1
>> [book,non-fiction,history-general] = 2
>> [book,fiction,literature] = 3
>> etc...
>>
>> My code does this, except sort returns None, which is strange.

list.sort() returns None by definition; it sorts the list object's 
contents in situ.
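
For example, with the same kind of nested list you test below:

>>> nest_list = [['bbc', 'cds'], ['jim', 'ex'], ['abc', 'sd']]
>>> print nest_list.sort()   # sorts nest_list in place, returns None
None
>>> nest_list
[['abc', 'sd'], ['bbc', 'cds'], ['jim', 'ex']]
>>> sorted([['b'], ['a']])   # sorted() returns a new list, leaves its argument alone
[['a'], ['b']]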

>> I just
>> want an alphabetical sort of the first option, which sort should do
>> automatically.  When I do a test like>>>nest_list = [['bbc', 'cds'], ['jim', 'ex'],['abc', 'sd']]
>>>>> nest_list.sort()
>> [['abc', 'sd'], ['bbc', 'cds'], ['jim', 'ex']]
>> It works fine, but not for my rows.

Why are you sorting?

>>
>> Here's the code (unpickled/unsorted):
>> import xlrd, pyExcelerator
>>
>> path_file = "C:\\text_analysis\\test.xls"
>> book = xlrd.open_workbook(path_file)
>> ProcFT_QC = book.sheet_by_index(0)
>> log_path = "C:\\text_analysis\\ID_Log.log"
>> logfile = open(log_path,'wb')
>>
>> set_rows = []

The test "x in y", where y is a list, has to compare x against the 
existing items one at a time (on average, about half of them). You are 
doing that test N times, so if the number of unique rows is U, it will 
do roughly N*U/4 comparisons, and you said N is about 50,000.
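
As a rough illustration of why a set helps (timings are machine-dependent; 
this just shows the shape of the difference):

command_prompt> python -m timeit -s "y = range(50000)" "49999 in y"
command_prompt> python -m timeit -s "y = set(range(50000))" "49999 in y"

The first (list) test has to scan the whole list; the second (set) is a 
single hash lookup, no matter how big the set gets.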

The changes below make y a set; consequently x needs to be a tuple 
instead of a list (lists aren't hashable, so they can't go into a set).

set_rows = set()

>> rows = []
>> db = {}
>> n=0
>> while n<ProcFT_QC.nrows:
>>     rows.append(ProcFT_QC.row_values(n, 6,9))

rows.append(tuple(ProcFT_QC.row_values(n, 6,9)))

>>     n+=1
>> print rows.sort() #Outputs None
>> ID = 1
>> for row in rows:
>>     if row not in set_rows:
>>         set_rows.append(row)

set_rows.add(row)

>>         db[ID] = row
>>         entry = str(ID) + '|' + str(row).strip('u[]') + '\r\n'

Presuming your data is actually ASCII, you could save time and memory by 
converting it to str once, as you extract it from the spreadsheet, rather 
than relying on the repr-strip hack at output time. Note also that the 
strip argument changes from 'u[]' to 'u()' because row is now a tuple:

entry = str(ID) + '|' + str(row).strip('u()') + '\r\n'

>>         logfile.write(entry)
>>         ID+=1
>> logfile.close()
>>
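
Putting those changes together, the loop might end up looking something 
like this (just a sketch, not tested; it keeps your 6:9 column slice, 
assumes as above that the cells are plain ASCII text, and writes the 
fields '|'-separated rather than using the repr-strip hack):

import xlrd

book = xlrd.open_workbook("C:\\text_analysis\\test.xls")
ProcFT_QC = book.sheet_by_index(0)
logfile = open("C:\\text_analysis\\ID_Log.log", 'wb')

seen = set()   # membership tests against a set don't slow down as it grows
db = {}
ID = 1
for n in xrange(ProcFT_QC.nrows):
    # convert each cell to str once, here, and use a (hashable) tuple
    row = tuple(str(cell) for cell in ProcFT_QC.row_values(n, 6, 9))
    if row not in seen:
        seen.add(row)
        db[ID] = row
        logfile.write(str(ID) + '|' + '|'.join(row) + '\r\n')
        ID += 1
logfile.close()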

HTH,
John


