[Tutor] Reading/dealing/matching with truly huge (ascii) files

Peter Otten __peter__ at web.de
Thu Feb 23 09:39:48 CET 2012


Elaina Ann Hyde wrote:

> Thanks for all the helpful hints, I really like the idea of using
> distances instead of a limit.  Walter was right that the 'i != j'
> condition was causing problems.  I think that Alan and Steven's use of
> the index separately was great, as it makes this much easier to test
> (and yes, 'astropysics' is a valid package; it's in there for later,
> when I convert astrophysical coordinates and whatnot; pretty great but
> a little buggy, FYI).  So I thought, hey, why not try to do a little of
> all these ideas, and, if you'll forgive the change in syntax, I think
> the problem is that the file might really just be too big to handle,
> and I'm not sure I have the right idea with the best_match:

> The errors are as follows:

> dat2=asciitable.read(y,Reader=asciitable.NoHeader,data_start=4,fill_values=['nan','-9.999'])
>   File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/asciitable-0.8.0-py2.7.egg/asciitable/ui.py", line 131, in read
>     dat = _guess(table, new_kwargs)
>   File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/asciitable-0.8.0-py2.7.egg/asciitable/ui.py", line 175, in _guess
>     dat = reader.read(table)
>   File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/asciitable-0.8.0-py2.7.egg/asciitable/core.py", line 841, in read
>     self.lines = self.inputter.get_lines(table)
>   File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/asciitable-0.8.0-py2.7.egg/asciitable/core.py", line 158, in get_lines
>     lines = table.splitlines()
> MemoryError
> ----------------------
> So this means I don't have enough memory to run through the large file?
> Even if I just read in with asciitable I get this problem.  I looked
> again, and the large file is 1.5GB of text lines, so very large.  I was
> thinking of trying to tell the read function to skip lines that are too
> far away, since the file is much, much bigger than the area I need.
> Thanks for the comments so far.
> ~Elaina
> 

Hmm, 1.5GB would be about 30,000 bytes per line if the 50,000 lines you 
mentioned before are correct. What does

$ wc <bigfile>

say?
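
Or, if you prefer to stay in Python, here is a minimal untested sketch
that gets the same counts without loading the whole file into memory
("bigfile.txt" is just a placeholder for your actual file name):

lines = chars = 0
with open("bigfile.txt", "rb") as f:
    for line in f:
        lines += 1          # wc's line count
        chars += len(line)  # wc's byte count (binary mode)
print lines, chars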

Can you post the first few lines of <bigfile> here or on pastebin.com? I 
don't have asciitable installed, but a quick look into the code suggests 
it consumes a lot more memory than necessary for your problem: the 
traceback shows the MemoryError in table.splitlines(), so apparently the 
whole 1.5GB file is read into a single string which is then split into a 
list of lines on top of that. If the file format is simple, a viable 
alternative may be to extract the interesting columns manually, together 
with the line index. Once you have the best matches you can build the 
result from <bigfile> and the indices of the best matches.
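
Something along these lines, untested, and guessing at the format:
whitespace-separated columns, the two coordinates in the first two
columns, four header lines (to match your data_start=4), and 'nan' and
'-9.999' as missing-value markers (your fill_values). heapq.nsmallest()
stands in for whatever best-match criterion you actually use:

import heapq

# First pass: collect (line index, ra, dec), not the lines themselves.
coords = []
with open("bigfile.txt") as f:  # placeholder name
    for index, line in enumerate(f):
        if index < 4:
            continue  # skip the header
        fields = line.split()
        if "nan" in fields[:2] or "-9.999" in fields[:2]:
            continue  # skip the file's missing-value markers
        try:
            ra, dec = float(fields[0]), float(fields[1])
        except (ValueError, IndexError):
            continue  # skip lines that do not parse
        coords.append((index, ra, dec))

# Example: the indices of the ten rows closest to one target point.
target = (123.456, -78.9)  # made-up coordinates
def distance(row):
    index, ra, dec = row
    return (ra - target[0]) ** 2 + (dec - target[1]) ** 2
wanted = set(row[0] for row in heapq.nsmallest(10, coords, key=distance))

# Second pass: pull the complete lines for the winning indices.
with open("bigfile.txt") as f:
    result = [line for index, line in enumerate(f) if index in wanted]

Storing three numbers per line instead of the whole table should keep
this far below the memory that asciitable needs for the full file.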

Alternatively, you can split <bigfile> into a few parts, calculate the 
best matches for each part, and finally calculate the best matches among 
the combined partial results, as in the sketch below.
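
Again untested, with the same guessed file format; nearest() (a plain
closest-points helper I made up) stands in for your real matching step:

import heapq
import itertools

def read_rows(filename):
    # Same guessed format as above (missing-value filter omitted for
    # brevity): four header lines, then whitespace-separated columns
    # with the coordinates in the first two.
    with open(filename) as f:
        for index, line in enumerate(f):
            if index < 4:
                continue
            fields = line.split()
            try:
                yield index, float(fields[0]), float(fields[1])
            except (ValueError, IndexError):
                pass  # skip lines that do not parse

def nearest(rows, n, target):
    # The n rows closest to target, by squared coordinate offset.
    def distance(row):
        index, ra, dec = row
        return (ra - target[0]) ** 2 + (dec - target[1]) ** 2
    return heapq.nsmallest(n, rows, key=distance)

target = (123.456, -78.9)  # made-up coordinates
rows = read_rows("bigfile.txt")  # placeholder name
partial = []
while True:
    part = list(itertools.islice(rows, 100000))  # 100,000 rows per part
    if not part:
        break
    partial.extend(nearest(part, 10, target))
best = nearest(partial, 10, target)  # the best of the partial matches

Because every part contributes its own top ten, the overall top ten
cannot be missed in the final step.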


