python and very large data sets???

David Bolen db3l at fitlinxx.com
Wed Apr 24 13:37:44 EDT 2002


zaka07 at hotmail.com (Rad) writes:

(working with 80GB worth of text file data)

> I have a few weeks to prepare myself before data arrives and my
> question is: am I going the right way about the project, is Python
> (with humanly written code) capable of doing this kind of staff
> relatively quickly?

If you're comparing manipulating the files with Python versus
"learning" C (even if you're also learning Python), I think that even
with any reduction in performance in Python you'll likely have a net
gain in time.  My bet is you'd easily eat up any time savings by the
final C application in terms of building and debugging the code.  And
in this sort of application, you may find your overall performance
swamped by I/O handling rather than computation, so Python may not
be at that much of a disadvantage.  And some of your correlation
requirements would require much more coding in C than in Python due
to the native structures such as dictionaries in Python.

This does depend somewhat on the frequency of these manipulations
(e.g., are you building something that will be used daily on data sets
of this size or just this once), but even if this is a long term
project, you sound like you have some immediate deadlines for this
specific data set - and you can always choose to re-implement for
repeated runs later if you find you need to.
  
A few suggestions:

* While you don't have the files yet, it sounds like you have the specs
  for the files - write up some small test files and use them to work
  on the code.  And even when you initially get the files, extract a
  small portion of them to develop against to streamline testing.
  Assure yourself as much as possible that you're comfortable with
  your code before spending time turning it loose on the real 15GB
  files.

* Some of the cross-referencing you seem to need to do will most
  likely be easy matches to Python dictionaries (perhaps keeping an
  index of key fields and where they exist in the source file) so
  practice up on them.

* Use the latest Python (2.2) - it brought many improvements to basic
  I/O performance in the common file processing loop (e.g.,
  "for line in file:"), and you'll get those for free.

* Don't skimp on your hardware.  Find the beefiest machine you've got
  handy, with the fastest drive, and ensure you have plenty of memory.
  Spend a few bucks on memory (it's cheap) if you need to.  It's quite
  likely that in final processing, you'll be heavily dependent on I/O
  performance - just copying a 15GB file is a time consuming process.
  
* I don't know your platform, but since these are text files, make
  sure you gain some familiarity with any text processing utilities
  you may have.  If you're under Unix, you want to be very comfortable
  with stuff like head, tail, split, grep, and diff.  Tools like these
  can make it much easier to manipulate the files during development
  testing or even as part of an overall process of handling them.  If
  you can find the time to check out awk you may even find that if
  this is a one shot task, a few quick awk scripts are all you need.

  If you aren't under Unix, I'd suggest getting ahold of equivalent
  tools (personally, I'd just install Cygwin and get all the same Unix
  ones).

* Be sure you have a good text editor or browser (e.g., less) that
  you're comfortable with for examining the raw files if you run into
  questions.  You're unlikely to really want to load a 15GB file into
  any editor or OS, regardless of virtual memory handling, but some of
  the previous text tools can chop up the file into more manageable
  pieces if you need to check something out.  And a simple browser
  like less can page through a file without keeping much of it in memory.
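
To make the test-file and dictionary-index suggestions above a bit more
concrete, here is a rough sketch.  It is written for a current Python 3
rather than the 2.2 mentioned above, and everything in it is an
assumption for illustration: the function names are made up, the records
are assumed to be one per line with tab-separated fields, and the index
of keys is assumed to fit in memory even though the files do not.

```python
import itertools
from collections import defaultdict


def write_sample(src_path, dst_path, max_lines=1000):
    """Copy the first max_lines lines of src_path into dst_path.

    Lines are streamed one at a time, so the big file is never loaded whole.
    Returns the number of lines actually written.
    """
    count = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in itertools.islice(src, max_lines):
            dst.write(line)
            count += 1
    return count


def build_offset_index(path, key_field=0, delimiter=b"\t"):
    """Map each key to the byte offsets of the lines that carry it.

    Hypothetical layout: one delimiter-separated record per line.
    Blank lines are skipped but still counted toward the offset.
    """
    index = defaultdict(list)
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            start = offset
            offset += len(line)  # binary mode, so len() is the byte count
            fields = line.rstrip(b"\r\n").split(delimiter)
            if line.strip() and key_field < len(fields):
                index[fields[key_field]].append(start)
    return index


def read_record(path, offset):
    """Return the single line that starts at the given byte offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline()
```

Running write_sample() once gives you a small extract to develop
against, and build_offset_index() plus read_record() let you look up
matching records by key without rescanning the whole file each time.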
  
--
-- David
-- 
/-----------------------------------------------------------------------\
 \               David Bolen            \   E-mail: db3l at fitlinxx.com  /
  |             FitLinxx, Inc.            \  Phone: (203) 708-5192    |
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150     \
\-----------------------------------------------------------------------/


