change data in large excel file(more than 240 000 rows on sheet)

Laura Creighton lac at openend.se
Fri Oct 9 07:07:44 EDT 2015


In a message of Fri, 09 Oct 2015 01:21:37 -0700, gall.pavgal.gall at gmail.com writes:
>Thanks Laura!
>But i need to change existing excel file and if i use the optimised reader, i can read data only, but i can't change data.  

Hmmm.  When I have this problem I have other problems as well,
and they can pretty well all be summarised as 'I never wanted
an excel file anyway'.

So first I convert the xlsx to a CSV.

import openpyxl
import csv

workbook = openpyxl.load_workbook('my_big_file.xlsx', read_only=True)
sheet = workbook['big_data']

with open('my_big_file.csv', 'w', newline='') as f:
    c = csv.writer(f)
    for r in sheet.rows:
        c.writerow([cell.value for cell in r])

This may be enough for you: the Python csv module
https://docs.python.org/3.5/library/csv.html
may be all you need to make the changes you want, provided you can
process the file one line at a time.  Reading it all into memory
would just give you the same problem again.
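If a line-by-line pass is enough, the streaming edit can be sketched
like this (the column index and the old/new values are made up --
put your real change inside the loop):

```python
import csv

def rewrite_csv(src_path, dst_path):
    """Stream rows from src_path, change some, write them to dst_path.

    Only one row is in memory at a time, so file size does not matter.
    The edit below (rewriting column 0) is purely an illustration.
    """
    with open(src_path, newline='') as src, \
         open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if row and row[0] == 'old':   # hypothetical condition
                row[0] = 'new'            # hypothetical replacement
            writer.writerow(row)
```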

But I have never actually done this.  Instead I load the csv into a
postgresql database, make my changes there, export it again as a
csv, and then reverse the process to get an xlsx again (if I need
one).  I don't know any way to get postgresql to take the xlsx
directly, which is why I have the two-step process with the csv in
the middle.

If you are more comfortable with MySQL or SQLite or some other database,
by all means use that.  They might even have a way to import xlsx directly.
I only do this because it is what I am used to.
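For what it is worth, the same csv -> database -> csv roundtrip can
be sketched with Python's built-in sqlite3 module.  The table and
file names here are just examples, and every column is loaded as
plain text:

```python
import csv
import sqlite3

def csv_to_sqlite(csv_path, db_path, table='big_data'):
    """Load a CSV (first row = headers) into an SQLite table."""
    with open(csv_path, newline='') as f:
        reader = csv.reader(f)
        headers = next(reader)
        conn = sqlite3.connect(db_path)
        cols = ', '.join('"%s"' % h for h in headers)
        conn.execute('CREATE TABLE "%s" (%s)' % (table, cols))
        marks = ', '.join('?' * len(headers))
        conn.executemany(
            'INSERT INTO "%s" VALUES (%s)' % (table, marks), reader)
        conn.commit()
    return conn

def sqlite_to_csv(conn, csv_path, table='big_data'):
    """Dump the table back out to CSV, headers first."""
    cur = conn.execute('SELECT * FROM "%s"' % table)
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(d[0] for d in cur.description)
        writer.writerows(cur)
```

Between the two calls you make your changes with ordinary SQL
UPDATE statements on the connection.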

The other thing to consider is pandas.
https://pypi.python.org/pypi/pandas

It uses openpyxl under the hood, but is set up for big data.  It accepts
xlsx files and also connects to databases.  It may be exactly what
you are looking for -- but I haven't tried this myself for such large
datasets, so I don't know for sure.
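If you want to try pandas, the basic pattern is something like this --
untested by me at this scale, and the sheet name, column name and
the edit itself are only placeholders (pandas needs openpyxl
installed to handle .xlsx files):

```python
import pandas as pd

def change_workbook(src, dst, sheet='big_data'):
    """Read one sheet into a DataFrame, change it, write it back.

    The 'status' column and the old/new values are hypothetical --
    substitute whatever change you actually need.
    """
    df = pd.read_excel(src, sheet_name=sheet)
    df.loc[df['status'] == 'old', 'status'] = 'new'
    df.to_excel(dst, sheet_name=sheet, index=False)
```

Note that read_excel pulls the whole sheet into memory, so whether
this works for 240 000 rows depends on how much RAM you have.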

Laura





More information about the Python-list mailing list