Out of memory while reading Excel file

Mahmood Naderan nt_mahmood at yahoo.com
Wed May 10 11:12:51 EDT 2017


Thanks for your reply. The openpyxl part (reading the workbook) works fine. I printed some debug information and found that when it reaches the np.array call, the memory usage shoots up after about 10 seconds.


So, I think numpy is unable to manage the memory.
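One thing I might try is to pre-allocate the array and fill it row by row, so the intermediate list of lists (about 5 million cell values for 100K rows x 50 columns) is never built before np.array runs. A rough sketch, assuming the cells hold purely numeric data (the file name is a placeholder):

import numpy as np
from openpyxl import load_workbook

fname = "large_file.xlsx"   # placeholder path, as in the snippet below
W = load_workbook(fname, read_only=True)
p = W.worksheets[0]

# Pre-allocate the target array; dtype=float assumes numeric cells only
# (empty or text cells would make the assignment fail).
a = np.empty((p.max_row, p.max_column), dtype=float)
for r, row in enumerate(p.rows):
    for c, cell in enumerate(row):
        a[r, c] = cell.value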


 
Regards,
Mahmood


On Wednesday, May 10, 2017 7:25 PM, Peter Otten <__peter__ at web.de> wrote:



Mahmood Naderan via Python-list wrote:

> Hello,
>
> The following code, which uses openpyxl and numpy, fails to read large
> Excel (xlsx) files. The file is 20 MB and contains 100K rows and 50
> columns.
>
> W = load_workbook(fname, read_only = True)
> p = W.worksheets[0]
> a=[]
> m = p.max_row
> n = p.max_column
>
> np.array([[i.value for i in j] for j in p.rows])
>
> How can I fix that? I am stuck at this problem. For medium-sized files
> (16K rows and 50 columns) it is fine.


The docs at

https://openpyxl.readthedocs.io/en/default/optimized.html#read-only-mode

promise "(near) constant memory consumption" for the sample script below:

from openpyxl import load_workbook

wb = load_workbook(filename='large_file.xlsx', read_only=True)
ws = wb['big_data']

for row in ws.rows:
    for cell in row:
        print(cell.value)

If you change only the file and worksheet name to your needs -- does the 
script run to completion in reasonable time (redirect stdout to /dev/null) 
and with reasonable memory usage?
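A quick, rough way to check the peak memory of the test script on Linux is the standard resource module, e.g. added at the end of the script:

import sys
import resource

# ru_maxrss is reported in kilobytes on Linux (bytes on macOS); write to
# stderr so the number is not lost in the stdout redirect to /dev/null.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS:", peak, "kB", file=sys.stderr)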


If it does you may be wasting memory elsewhere; otherwise you might need to 
convert the xlsx file to csv using your spreadsheet application before 
processing the data in Python.
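If you go the csv route, a minimal sketch of loading the converted file with numpy -- "data.csv" and the single header row are assumptions about the export:

import numpy as np

# Adjust the file name, delimiter and header count to the real export.
data = np.genfromtxt("data.csv", delimiter=",", skip_header=1)
print(data.shape)   # e.g. (100000, 50) for the sheet described above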




