xlrd and cPickle.dump
John Machin
sjmachin at lexicon.net
Wed Apr 2 17:10:48 EDT 2008
patrick.waldo at gmail.com wrote:
>> FWIW, it works here on 2.5.1 without errors or warnings. Ouput is:
>> 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]
>> 0.6.1
>
> I guess it's a version issue then...
I say again: Don't guess.
>
> I forgot about sorted! Yes, that would make sense!
>
> Thanks for the input.
>
>
> On Apr 2, 4:23 pm, patrick.wa... at gmail.com wrote:
>> Still no luck:
>>
>> Traceback (most recent call last):
>> File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework
>> \scriptutils.py", line 310, in RunScript
>> exec codeObject in __main__.__dict__
>> File "C:\text analysis\pickle_test2.py", line 13, in ?
>> cPickle.dump(Data_sheet, pickle_file, -1)
>> PicklingError: Can't pickle <type 'module'>: attribute lookup
>> __builtin__.module failed
I didn't notice that the exception had changed from the original:
"TypeError: can't pickle file objects" (with protocol=0)
to:
"TypeError: can't pickle module objects" (pickling an xlrd.Book
object with protocol=-1)
and now to:
"PicklingError: Can't pickle <type 'module'>: attribute lookup
__builtin__.module failed" (pickling an xlrd.Sheet object with protocol -1)
I'm wondering if this is some unfortunate side effect of running the
script in the pywin IDE ("exec codeObject in __main__.__dict__"). Can
you reproduce the problem by running the script in the Command Prompt
window? What version of pywin32 are you using?
>>
>> My code remains the same, except I added 'wb' and the -1 following
>> your suggestions:
>>
>> import cPickle,xlrd, sys
>>
>> print sys.version
>> print xlrd.__VERSION__
>>
>> data_path = """C:\\test\\test.xls"""
>> pickle_path = """C:\\test\\pickle.pickle"""
>>
>> book = xlrd.open_workbook(data_path)
>> Data_sheet = book.sheet_by_index(0)
>>
>> pickle_file = open(pickle_path, 'wb')cPickle.dump(Data_sheet, pickle_file, -1)
>> pickle_file.close()
>>
>> To begin with (I forgot to mention this before) I get this error:
>> WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-
>> zero
"WARNING" != "error". If that's the only message you get, ignore it; it
means that your XLS file was created by the perl XLS-writing package or
a copier thereof.
>>
>> I'm not sure what this means.
>>
>>> What do you describe as "simple manipulations"? Please describe your
>>> computer, including how much memory it has.
>> I have a 1.8Ghz HP dv6000 with 2Gb of ram, which should be speedy
>> enough for my programming projects. However, when I try to print out
>> the rows in the excel file, my computer gets very slow and choppy,
>> which makes experimenting slow and frustrating.
Just printing the rows is VERY UNLIKELY to cause this. Demonstrate this
to yourself by using xlrd's supplied runxlrd script:
command_prompt> c:\python24\scripts\runxlrd.py show yourfile.xls
>> Maybe cPickle won't
>> solve this problem at all!
99.9% chance, not "maybe".
>> For this first part, I am trying to make
>> ID numbers for the different permutation of categories, topics, and
>> sub_topics. So I will have [book,non-fiction,biography],[book,non-
>> fiction,history-general],[book,fiction,literature], etc..
>> so I want the combination of
>> [book,non-fiction,biography] = 1
>> [book,non-fiction,history-general] = 2
>> [book,fiction,literature] = 3
>> etc...
>>
>> My code does this, except sort returns None, which is strange.
list.sort() returns None by definition; it sorts the list object's
contents in situ.
> I just
>> want an alphabetical sort of the first option, which sort should do
>> automatically. When I do a test like>>>nest_list = [['bbc', 'cds'], ['jim', 'ex'],['abc', 'sd']]
>>>>> nest_list.sort()
>> [['abc', 'sd'], ['bbc', 'cds'], ['jim', 'ex']]
>> It works fine, but not for my rows.
Why are you sorting?
>>
>> Here's the code (unpickled/unsorted):
>> import xlrd, pyExcelerator
>>
>> path_file = "C:\\text_analysis\\test.xls"
>> book = xlrd.open_workbook(path_file)
>> ProcFT_QC = book.sheet_by_index(0)
>> log_path = "C:\\text_analysis\\ID_Log.log"
>> logfile = open(log_path,'wb')
>>
>> set_rows = []
The test x in y where y is a sequence needs to compare with half of the
existing items on average. You are doing that test N times. If the
number of unique rows is U, it will do about N*U/4 comparisons. You
said N is about 50,000.
The changes below make y a set; consequentially x needs to be a tuple
instead of a list.
set_rows = set()
>> rows = []
>> db = {}
>> n=0
>> while n<ProcFT_QC.nrows:
>> rows.append(ProcFT_QC.row_values(n, 6,9))
rows.append(tuple(ProcFT_QC.row_values(n, 6,9)))
>> n+=1
>> print rows.sort() #Outputs None
>> ID = 1
>> for row in rows:
>> if row not in set_rows:
>> set_rows.append(row)
set_rows.add(row)
>> db[ID] = row
>> entry = str(ID) + '|' + str(row).strip('u[]') + '\r\n'
Presuming your data is actually ASCII, you could save time and memory by
converting it once as you extract it from the spreadsheet.
entry = str(ID) + '|' + str(row).strip('u()') + '\r\n'
>> logfile.write(entry)
>> ID+=1
>> logfile.close()
>>
HTH,
John
More information about the Python-list
mailing list