Out of memory while reading excel file

Peter Otten __peter__ at web.de
Thu May 11 03:53:11 EDT 2017


Mahmood Naderan via Python-list wrote:

> I wrote this:
> 
> a = np.zeros((p.max_row, p.max_column), dtype=object)
> for y, row in enumerate(p.rows):
>       for cell in row:
>             print (cell.value)
>             a[y] = cell.value

In the line above you overwrite the row in the numpy array with the cell 
value. In combination with numpy's "broadcasting" you end up with all values 
in a row set to the rightmost cell in the spreadsheet row, just like in 

>>> import numpy
>>> a = numpy.array([[0, 0, 0]])
>>> a
array([[0, 0, 0]])
>>> for x in 1, 2, 3:
...     a[0] = x
... 
>>> a
array([[3, 3, 3]])


The correct code:

for y, row in enumerate(ws.rows):
    a[y] = [cell.value for cell in row]

I think I posted it before ;)

>      print (a[y])
> 
> 
> For one of the cells, I see
> 
> NM_198576.3
> ['NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3'
> 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3' 'NM_198576.3']
> 
>  
> These are 50 NM_198576.3 in a[y] and 50 is the number of columns in my
> excel file (p.max_column)
> 
> 
> 
> The excel file looks like
> 
> CHR1     11,202,100     NM_198576.3     PASS     3.08932    G|B|C     -   
> .   .   .
> 
> 
> 
> Note that in each row, some cells are '-' or '.' only. I want to read all
> cells as string. Then I will write the matrix in a file and my main code
> (java) will process that. I chose openpyxl for reading excel files,
> because Apache POI (a java package for manipulating excel files) consumes
> huge memory even for medium files.
> 
> So my python script only transforms an xlsx file to a txt file keeping the
> cell positions and formats.

What kind of text file?

> Any suggestion?

In that case there's no need to load the data into memory. For example, to 
convert xlsx to csv:

#!/usr/bin/env python3
from openpyxl import load_workbook
import csv

source = "beta.xlsx"
dest = "gamma.csv"
sheet = 'alpha'

wb = load_workbook(filename=source, read_only=True)
ws = wb[sheet]

with open(dest, "w") as outstream:
    csv.writer(outstream).writerows(
        [cell.value for cell in row]
        for row in ws.rows
    )





More information about the Python-list mailing list