python and very large data sets???
David Bolen
db3l at fitlinxx.com
Wed Apr 24 13:37:44 EDT 2002
zaka07 at hotmail.com (Rad) writes:
(working with 80GB worth of text file data)
> I have a few weeks to prepare myself before data arrives and my
> question is: am I going the right way about the project, is Python
> (with humanly written code) capable of doing this kind of stuff
> relatively quickly?
If you're comparing manipulating the files with Python versus
"learning" C (even if you're also learning Python), I think that even
with any reduction in performance in Python you'll likely have a net
gain in time. My bet is you'd easily eat up any time savings by the
final C application in terms of building and debugging the code. And
in this sort of application, you may find your overall performance
swamped by I/O handling rather than computation, so Python may not
be at that much of a disadvantage. And some of your correlation
requirements would take much more code in C than in Python, thanks
to Python's native data structures such as dictionaries.
This does depend somewhat on the frequency of these manipulations
(e.g., are you building something that will be used daily on data sets
of this size or just this once), but even if this is a long term
project, you sound like you have some immediate deadlines for this
specific data set - and you can always choose to re-implement for
repeated runs later if you find you need to.
A few suggestions:
* While you don't have the files yet it sounds like you have the specs
for the files - write up some small test files and use them to work
on the code. And even when you initially get the files, extract a
small portion of them to develop against to streamline testing.
Assure yourself as much as possible that you're comfortable with
your code before turning it loose on the real 15GB files.
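A quick sketch of the sort of thing I mean - carve the first few thousand lines off a big file into a small development sample (the filenames and line count here are just placeholders; adjust to taste):

```python
# Copy at most max_lines lines from a large source file into a small
# sample file for development and testing. Returns how many lines
# were actually copied (fewer if the source is shorter).
def extract_sample(source, dest, max_lines=5000):
    count = 0
    out = open(dest, 'w')
    for line in open(source):
        out.write(line)
        count += 1
        if count >= max_lines:
            break
    out.close()
    return count
```

Run that once against the real file when it arrives and you have something small enough to iterate on quickly.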
* Some of the cross-referencing you seem to need to do will most
likely be easy matches to Python dictionaries (perhaps keeping an
index of key fields and where they exist in the source file) so
practice up on them.
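For instance, one pass over a file can build a dictionary mapping each record's key field to the byte offset of its line, so later lookups seek straight to the record instead of rescanning 15GB. The '|' delimiter and key-in-first-field layout below are made up for illustration - substitute whatever your spec says (the file is opened in binary mode so the offsets from tell() are reliable):

```python
# Build an in-memory index: key field -> byte offset of its line.
def build_index(path, delimiter=b'|'):
    index = {}
    f = open(path, 'rb')
    while True:
        offset = f.tell()      # position before reading the line
        line = f.readline()
        if not line:
            break
        key = line.split(delimiter, 1)[0].decode()
        index[key] = offset
    f.close()
    return index

# Fetch a single record by key using the index, without a full scan.
def fetch(path, index, key):
    f = open(path, 'rb')
    f.seek(index[key])
    line = f.readline()
    f.close()
    return line.decode()
```

At 80GB you'd want to check that the keys themselves fit in memory, but an index of keys and offsets is far smaller than the data it points into.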
* Use the latest Python (2.2) - many improvements have been made to
  basic I/O performance in the common file-processing loop (e.g.,
  "for line in file:") that you'll get the advantage of.
* Don't skimp on your hardware. Find the beefiest machine you've got
handy, with the fastest drive, and ensure you have plenty of memory.
Spend a few bucks on memory (it's cheap) if you need to. It's quite
likely that in final processing, you'll be heavily dependent on I/O
performance - just copying a 15GB file is a time consuming process.
* I don't know your platform, but since these are text files, make
sure you gain some familiarity with any text processing utilities
you may have. If you're under Unix, you want to be very comfortable
with stuff like head, tail, split, grep, and diff. Tools like these
can make it much easier to manipulate the files during development
testing or even as part of an overall process of handling them. If
you can find the time to check out awk you may even find that if
this is a one shot task, a few quick awk scripts are all you need.
If you aren't under Unix, I'd suggest getting ahold of equivalent
tools (personally, I'd just install Cygwin and get all the same Unix
ones).
* Be sure you have a good text editor or pager (e.g., less) that
  you're comfortable with for examining the raw files if you run into
  questions. You're unlikely to want to load a 15GB file into any
  editor, regardless of the OS's virtual memory handling, but the
  text tools above can chop the file into more manageable pieces if
  you need to check something out. And a simple pager like less can
  page through a file without keeping much of it in memory.
--
-- David
--
/-----------------------------------------------------------------------\
\ David Bolen \ E-mail: db3l at fitlinxx.com /
| FitLinxx, Inc. \ Phone: (203) 708-5192 |
/ 860 Canal Street, Stamford, CT 06902 \ Fax: (203) 316-5150 \
\-----------------------------------------------------------------------/