[Tutor] Reading/dealing/matching with truly huge (ascii) files
Steven D'Aprano
steve at pearwood.info
Wed Feb 22 10:00:03 CET 2012
On Wed, Feb 22, 2012 at 04:44:57PM +1100, Elaina Ann Hyde wrote:
> So, Python question of the day: I have 2 files that I could normally just
> read in with asciitable, The first file is a 12 column 8000 row table that
> I have read in via asciitable and manipulated. The second file is
> enormous, has over 50,000 rows and about 20 columns. What I want to do is
> find the best match for (file 1 column 1 and 2) with (file 2 column 4 and
> 5), return all rows that match from the huge file, join them together and
> save the whole mess as a file with 8000 rows (assuming the smaller table
> finds one match per row) and 32=12+20 columns.
I don't know much about asciitable, so I'm going to have to guess what
some of your code does. I think the critical part is where you grab a
column from each file:
    Radeg=dat['ra-drad']*180./math.pi
    Decdeg=dat['dec-drad']*180./math.pi
    Radeg2=dat2['ra-drad']*180./math.pi
    Decdeg2=dat2['dec-drad']*180./math.pi
and then compare them, element by element:
    for i in xrange(len(Radeg)):
        for j in xrange(len(Radeg2)):
            #select the value if it is very, very, very close
            ...
The selection criteria are messy and complicated. Start by cleaning them
up: elegant code is easier to work with. The first step is to operate on
the items in the columns directly, rather than indirectly via an index
value.
Instead of writing your for-loops like this:
    for i in xrange(len(column)):
        do something with column[i]
        do another thing with column[i]
Python can iterate over the values in the column directly:
    for x in column:
        do something with x
        do another thing with x
You don't save any lines, but you gain a lot of clarity without the
unnecessary indirection.
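To make that concrete, here is a minimal runnable sketch. The list is a made-up stand-in for a real column, and I've used range rather than xrange so it also runs on Python 3. Both loops build the same result:

```python
# Toy stand-in for a column; the values are made up for illustration.
column = [1.1, 2.2, 3.3]

# Index-based loop: the index exists only to fetch the value.
doubled_by_index = []
for i in range(len(column)):
    doubled_by_index.append(column[i] * 2)

# Direct iteration: same result, no index needed.
doubled_direct = []
for x in column:
    doubled_direct.append(x * 2)

print(doubled_by_index == doubled_direct)
```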
Disclaimer: I have never used asciitable, and it is possible that
asciitable's column type does not support this. If not, that's pretty
awful design! But you can rescue the situation by manually assigning to
a variable inside the loop:
    for i in xrange(len(column)):
        x = column[i]
        do something with x
        do another thing with x
If you need the index as well, use the enumerate function:
    for i, x in enumerate(column):
        ...
Using that form, if column = [1.1, 2.2, 3.3, ...] then (i, x) will
take the values (0, 1.1), (1, 2.2), (2, 3.3) ...
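You can check this yourself at the interpreter. A tiny sketch, again with a plain list standing in for the column:

```python
column = [1.1, 2.2, 3.3]  # made-up values standing in for a real column

# enumerate yields (index, value) pairs; list() collects them for display.
pairs = list(enumerate(column))
print(pairs)  # [(0, 1.1), (1, 2.2), (2, 3.3)]
```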
However, in your case, you have not one column but two. This is where
the zip function comes to the rescue. It lines the columns up like teeth
in a zipper:
    delta = 0.000001
    for i, (a, b) in enumerate(zip(Radeg, Decdeg)):
        for j, (c, d) in enumerate(zip(Radeg2, Decdeg2)):
            if i == j:  # skip an iteration -- but why????
                continue
            if a <= c+delta and a >= c-delta \
                and b <= d+delta and b >= d-delta:
                write_stuff_to_file(...)
Now we can simplify the selection criteria:
    delta = 0.000001
    for i, (a, b) in enumerate(zip(Radeg, Decdeg)):
        for j, (c, d) in enumerate(zip(Radeg2, Decdeg2)):
            if i == j:  # skip an iteration -- but why????
                continue
            if c-delta <= a <= c+delta and d-delta <= b <= d+delta:
                write_stuff_to_file(...)
Already easier to read. (And also likely to be a little faster, although
not enough to really make a difference.) Or at least, I find it easier
to read, and I hope you do too!
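If you want to convince yourself that the two forms of the test agree, here is a small sketch. The function names close_long and close_chained are my own inventions, the coordinates are made up, and I have left out the "i == j" skip because I don't know what it is for:

```python
def close_long(a, b, c, d, delta):
    # The original, long-winded form of the test.
    return (a <= c + delta and a >= c - delta
            and b <= d + delta and b >= d - delta)

def close_chained(a, b, c, d, delta):
    # The same test written with chained comparisons.
    return c - delta <= a <= c + delta and d - delta <= b <= d + delta

delta = 0.000001
# Made-up coordinates standing in for the real columns.
ra1, dec1 = [10.0, 20.0, 30.0], [-5.0, 0.0, 5.0]
ra2, dec2 = [10.0000005, 20.1, 30.0], [-5.0, 0.0, 5.0000004]

agree = True
matches = []
for i, (a, b) in enumerate(zip(ra1, dec1)):
    for j, (c, d) in enumerate(zip(ra2, dec2)):
        long_result = close_long(a, b, c, d, delta)
        if long_result != close_chained(a, b, c, d, delta):
            agree = False
        if long_result:
            matches.append((i, j))

print(agree, matches)
```

Both functions perform exactly the same four comparisons, so they should always give the same answer; the chained form just says it in fewer words.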
You're comparing each of the (a,b) values from the small file (8,000
rows) with each of the (c,d) values from the large file (50,000 rows).
That is 8000*50000 = 400 million comparisons, which isn't going to be
fast in Python unless you can avoid some of those comparisons.
If you can assume that there will only be one match per row, then once
you have found that match, you can skip to the next iteration of the
outer loop by breaking out of the inner loop, and avoid all the
remaining comparisons for that row! If the matches are spread evenly
through the large file, that is roughly half of the 50,000 on average.
If you can do this, that will be a BIG saving.
    delta = 0.000001
    for i, (a, b) in enumerate(zip(Radeg, Decdeg)):
        for j, (c, d) in enumerate(zip(Radeg2, Decdeg2)):
            if i == j:  # skip an iteration -- but why????
                continue
            if c-delta <= a <= c+delta and d-delta <= b <= d+delta:
                write_stuff_to_file(...)
                # there can only be one match, and we've just found it,
                # so go on with the next outer loop
                break
But I don't know if that is a safe assumption to make. That depends on
the semantics of your data.
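To see what the break buys you, here is a toy sketch that counts comparisons with and without it. The function count_comparisons and the data are made up for illustration; the lists of pairs play the role of zip(Radeg, Decdeg) and zip(Radeg2, Decdeg2), and again I have left out the "i == j" skip:

```python
def count_comparisons(small, large, delta, stop_at_first_match):
    """Return (number of comparisons made, list of (i, j) matches)."""
    comparisons = 0
    matches = []
    for i, (a, b) in enumerate(small):
        for j, (c, d) in enumerate(large):
            comparisons += 1
            if c - delta <= a <= c + delta and d - delta <= b <= d + delta:
                matches.append((i, j))
                if stop_at_first_match:
                    # Assumes at most one match per row of the small table.
                    break
    return comparisons, matches

delta = 0.000001
small = [(1.0, 1.0), (2.0, 2.0)]
large = [(2.0, 2.0), (1.0, 1.0), (3.0, 3.0), (4.0, 4.0)]

full_count, full_matches = count_comparisons(small, large, delta, False)
early_count, early_matches = count_comparisons(small, large, delta, True)
print(full_count, early_count)  # 8 3
```

With this toy data both versions find the same matches, but that is only because each row really does have exactly one match. If a row could match more than once, the break would silently drop the later matches.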
The next thing to look at is the "write_stuff_to_file(...)" placeholder,
which I'm using to stand in for your code:
    fopen.write( " ".join([str(k) for k in list(dat[i])]) + " " +
                 " ".join([str(k) for k in list(dat[j])])+"\n")
That can probably be improved, but frankly I have to go now. I'll try
to come back to it later.
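In the meantime, one possible direction, as a rough sketch only: pull the string-building out into a small helper so the write call stays short. The name format_joined_row is made up, and I am assuming that a row can be iterated over like a plain list, which I have not checked against asciitable:

```python
def format_joined_row(row1, row2, sep=" "):
    """Join the fields of two rows into one line of text.

    Assumes each row can be iterated like a plain sequence (unverified
    for asciitable rows).
    """
    fields = [str(k) for k in row1] + [str(k) for k in row2]
    return sep.join(fields) + "\n"

# Plain lists standing in for one row from each table.
line = format_joined_row([1, 2.5, "a"], [3, "b"])
print(repr(line))  # '1 2.5 a 3 b\n'
```

The write call would then take the result of that helper instead of building the string inline.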
P.S. you have:
> import astropysics
Is that module really called astropysics with no H?
--
Steven