[Tutor] Reading/dealing/matching with truly huge (ascii) files

Wed Feb 22 10:00:03 CET 2012

On Wed, Feb 22, 2012 at 04:44:57PM +1100, Elaina Ann Hyde wrote:
> So, Python question of the day:  I have 2 files that I could normally just
> read in with asciitable. The first file is a 12 column 8000 row table that
> I have read in via asciitable and manipulated.  The second file is
> enormous, has over 50,000 rows and about 20 columns.  What I want to do is
> find the best match for (file 1 column 1 and 2) with (file 2 column 4 and
> 5), return all rows that match from the huge file, join them together and
> save the whole mess as a file with 8000 rows (assuming the smaller table
> finds one match per row) and 32=12+20 columns.

I don't know much about asciitable, so I'm going to have to guess what 
some of your code does. I think the critical part is where you grab a 
column from each file:

Radeg=dat['ra-drad']*180./math.pi
Decdeg=dat['dec-drad']*180./math.pi

Radeg2=dat2['ra-drad']*180./math.pi
Decdeg2=dat2['dec-drad']*180./math.pi

and then compare them, element by element:

for i in xrange(len(Radeg)):
    for j in xrange(len(Radeg2)):
        # select the value if it is very, very, very close
        ...

The selection criteria are messy and complicated. Start by cleaning them 
up: elegant code is easier to work with. The first step is to operate on 
items in the columns directly, rather than indirectly via an index 
value.

Instead of writing your for-loops like this:

for i in xrange(len(column)):
    do something with column[i]
    do another thing with column[i]

Python can iterate over the values in the column directly:

for x in column:
    do something with x
    do another thing with x

You don't save any lines, but you gain a lot of clarity without the 
unnecessary indirection.

Disclaimer: I have never used asciitable, and it is possible that 
asciitable's column type does not support this. If not, that's pretty 
awful design! But you can rescue the situation by manually assigning to 
a variable inside the loop:

for i in xrange(len(column)):
    x = column[i]
    do something with x
    do another thing with x

If you need the index as well, use the enumerate function:

for i, x in enumerate(column):
    ...

Using that form, if column = [1.1, 2.2, 3.3, ...] then (i, x) will 
take the values (0, 1.1), (1, 2.2), (2, 3.3) ... 
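
To convince yourself that these loop styles visit the same values, here 
is a small self-contained sketch. The list and the function names are 
made up for illustration, and a plain list stands in for an asciitable 
column, which I am assuming behaves like any other sequence. It uses 
range so that it also runs on Python 3; xrange does the same job on 
Python 2.

```python
# A plain list standing in for a table column (hypothetical data).
column = [1.1, 2.2, 3.3]

def by_index(col):
    # Indirect: loop over positions, then look each value up.
    return [col[i] for i in range(len(col))]

def direct(col):
    # Direct: loop over the values themselves.
    return [x for x in col]

def with_position(col):
    # enumerate yields (index, value) pairs.
    return [(i, x) for i, x in enumerate(col)]

pairs = with_position(column)
```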

However, in your case, you have not one column but two. This is where 
the zip function comes to the rescue. It lines the columns up like teeth 
in a zipper:

delta = 0.000001
for i, (a, b) in enumerate(zip(Radeg, Decdeg)):
    for j, (c, d) in enumerate(zip(Radeg2, Decdeg2)):
        if i == j: # skip an iteration -- but why????
            continue
        if a <= c+delta and a >= c-delta \
        and b <= d+delta and b >= d-delta:
            write_stuff_to_file(...)

Now we can simplify the selection criteria:

delta = 0.000001
for i, (a, b) in enumerate(zip(Radeg, Decdeg)):
    for j, (c, d) in enumerate(zip(Radeg2, Decdeg2)):
        if i == j: # skip an iteration -- but why????
            continue
        if c-delta <= a <= c+delta and d-delta <= b <= d+delta:
            write_stuff_to_file(...)

Already easier to read. (And also likely to be a little faster, although 
not enough to really make a difference.) Or at least, I find it easier 
to read, and I hope you do too!
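
If you want to check that the chained form really tests the same thing 
as the spelled-out form, a quick sketch like this will do. The function 
names and the sample numbers are invented for the purpose; only the two 
conditions come from the loops above.

```python
def within_long(a, b, c, d, delta):
    # The spelled-out form of the test.
    return (a <= c + delta and a >= c - delta
            and b <= d + delta and b >= d - delta)

def within_chained(a, b, c, d, delta):
    # Chained comparisons: x <= y <= z means (x <= y) and (y <= z).
    return c - delta <= a <= c + delta and d - delta <= b <= d + delta

delta = 0.000001
samples = [
    (10.0, 20.0, 10.0, 20.0),        # identical points
    (10.0, 20.0, 10.0000005, 20.0),  # inside the tolerance
    (10.0, 20.0, 10.1, 20.0),        # first coordinate too far apart
    (10.0, 20.0, 10.0, 19.9),        # second coordinate too far apart
]
results = [within_chained(a, b, c, d, delta) for a, b, c, d in samples]
```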

You're comparing each of the (a,b) values from the small file (8,000 
rows) with each of the (c,d) values from the large file (50,000 rows). 
That is 8000*50000 = 400 million comparisons, which isn't going to be 
fast in Python unless you can avoid some of them.

If you can assume that there will only be one match per row, then once 
you have found that match, you can skip to the next iteration of the 
outer loop by breaking out of the inner loop. How much that saves 
depends on where the match sits in the large file: if matches are spread 
evenly, you skip about half of the 50,000 comparisons per row on 
average. If you can do this, that will be a BIG saving.

delta = 0.000001
for i, (a, b) in enumerate(zip(Radeg, Decdeg)):
    for j, (c, d) in enumerate(zip(Radeg2, Decdeg2)):
        if i == j: # skip an iteration -- but why????
            continue
        if c-delta <= a <= c+delta and d-delta <= b <= d+delta:
            write_stuff_to_file(...)
            # there can only be one match, and we've just found it,
            # so go on with the next outer loop
            break

But I don't know if that is a safe assumption to make. That depends on 
the semantics of your data.
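
If the one-match assumption is not safe, another way to avoid 
comparisons is to sort the large file's first coordinate once and use 
binary search to pick out only the rows that could possibly match. This 
is a different technique from the nested loops above, not something 
taken from your code. The name find_matches and the sample numbers are 
invented for illustration, and I am assuming the columns behave like 
sequences of floats. Treat it as a sketch rather than a drop-in 
replacement.

```python
from bisect import bisect_left, bisect_right

def find_matches(ra1, dec1, ra2, dec2, delta):
    """Return (i, j) pairs where row i of the small table is within
    delta of row j of the large table in both coordinates."""
    # Sort the large table's row numbers by their first coordinate, once.
    order = sorted(range(len(ra2)), key=lambda j: ra2[j])
    sorted_ra = [ra2[j] for j in order]
    matches = []
    for i, (a, b) in enumerate(zip(ra1, dec1)):
        # Binary search for the slice with a-delta <= value <= a+delta
        # (the same window as the nested-loop test, up to rounding at
        # the very edge of the tolerance).
        lo = bisect_left(sorted_ra, a - delta)
        hi = bisect_right(sorted_ra, a + delta)
        for j in order[lo:hi]:
            # Only these few candidates need the second check.
            if b - delta <= dec2[j] <= b + delta:
                matches.append((i, j))
    return matches

# Made-up sample data: two small-table rows, four large-table rows.
ra1, dec1 = [10.0, 50.0], [-5.0, 20.0]
ra2 = [50.0000004, 99.0, 10.0000003, 10.0000002]
dec2 = [20.0000001, 0.0, -5.0000002, 7.0]
matches = find_matches(ra1, dec1, ra2, dec2, 0.000001)
```

Unlike the break version, this collects every match for a row, so it 
does not depend on there being exactly one. The sort is done once, and 
each row of the small table then costs a binary search plus a check of 
the handful of nearby candidates.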

The next thing to look at is the "write_stuff_to_file(...)" placeholder, 
which I'm using to stand in for your code:

fopen.write( "     ".join([str(k) for k in list(dat[i])]) + "     " + 
    "     ".join([str(k) for k in list(dat[j])])+"\n")

The next step would be to see if that can be improved, but frankly I 
have to go now. I'll try to come back to that later.
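
In the meantime, one possible tidy-up is to build the output line in a 
small helper, so the formatting lives in one place and the loop body 
stays short. This is only a sketch: format_row and SEP are names I made 
up, and plain tuples stand in for asciitable rows, which I am assuming 
can be turned into lists the way your code already does.

```python
SEP = "     "  # the same separator string used in the write line above

def format_row(row1, row2, sep=SEP):
    # Chain the fields of both rows and join them in a single pass.
    fields = [str(k) for k in list(row1) + list(row2)]
    return sep.join(fields) + "\n"

# Hypothetical rows, just to show the shape of the result.
line = format_row((1, 2.5, "a"), (3, "b"))
```

Inside the loop, the write would then be a single call that passes the 
two matching rows to format_row and hands the result to fopen.write.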

P.S. you have:

> import astropysics

Is that module really called astropysics with no H?

-- 
Steven