[Numpy-discussion] fromfile() for reading text (one more time!)

Tue Jan 5 03:30:17 EST 2010

Christopher Barker, on 2010-01-04 17:05, wrote:
> Hi folks,
> 
> I'm taking a look once again at fromfile() for reading text files. I
> often need to read a LOT of numbers from a text file, and it
> can actually be pretty darn slow to do it the normal Python way:
> 
> for line in file:
>     data = map(float, line.strip().split())
> 
> 
> or various other versions that are similar. It really does take longer 
> to read the text, split it up, convert to a number, then put that number 
> into a numpy array, than it does to simply read it straight into the array.
> 
> However, as it stands, fromfile() turns out to be next to useless for
> anything but whitespace-separated text. The full set of ideas is here:
> 
> http://projects.scipy.org/numpy/ticket/909
> 
> However, for the moment, I'm digging into the code to address a 
> particular problem -- reading files like this:
> 
> 123, 65.6, 789
> 23,  3.2,  34
> ...
> 
> That is comma (or whatever) separated text -- pretty common stuff.
> 
> The problem with the current code is that you can't read more than one
> line at a time with fromfile:
> 
> a = np.fromfile(infile, sep=",")
> 
> will read until it doesn't find a comma, and thus only one line, as 
> there is no comma after each line. As this is a really typical case, I 
> think it should be supported.

Just a potshot, but have you tried np.loadtxt?

I find it pretty fast.
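
In case it helps, here is a minimal sketch of what I mean, using the two
sample rows from your message. The StringIO object only stands in for a
real file, and the variable names are mine, purely for illustration; the
`delimiter` keyword is what tells loadtxt to split on commas.

```python
import io

import numpy as np

# Stand-in for a real file on disk; the rows are the sample data quoted above.
infile = io.StringIO("123, 65.6, 789\n23,  3.2,  34\n")

# delimiter="," splits each line on commas; each line becomes one row.
a = np.loadtxt(infile, delimiter=",")

print(a.shape)
print(a)
```

Unlike fromfile(), this keeps the row structure, so you get a 2-D array
back rather than a flat one. Whether it is fast enough for your files is
something you would have to time yourself.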

> 
> Here is the question:
> 
> The work of finding the separator is done in:
> 
> multiarray/ctors.c:  fromfile_skip_separator()
> 
> It looks like it wouldn't be too hard to add some code in there to look 
> for a newline, and consider that a valid separator. However, that would 
> break backward compatibility. So maybe a flag could be passed in, saying 
> you wanted to support newlines. The problem is that flag would have to 
> get passed all the way through to this function (and also for fromstring).
> 
> I also notice that it supports separators of arbitrary length, though I
> wonder how useful that is. It also does odd things with spaces
> embedded in the separator:
> 
> ", $ #" matches all of:  ",$#"   ", $#"  ",$ #"
> 
> Is it worth trying to fix that?
> 
> 
> In the longer term, it would be really nice to support comments as well,
> though that would require more of a refactoring of the code, I think
> (though maybe not -- I suppose a call to fromfile_skip_separator() could
> look for a comment character, then if it found one, skip to where the
> comment ends -- hmmm).
> 
> thanks for any feedback,
> 
> -Chris
> 
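
On the comments point: as far as I know, np.loadtxt already skips
comments through its `comments` keyword, so it may cover that case too.
A rough sketch, assuming "#" is the comment character (the sample text
and variable names are made up for illustration):

```python
import io

import numpy as np

# Hypothetical file contents: comment lines mixed in with comma-separated rows.
text = (
    "# measurements, made-up header\n"
    "123, 65.6, 789\n"
    "# a comment between rows\n"
    "23,  3.2,  34\n"
)

# comments="#" tells loadtxt to ignore everything from "#" to end of line.
b = np.loadtxt(io.StringIO(text), delimiter=",", comments="#")

print(b)
```

This says nothing about how fromfile() itself should handle comments; it
is only a workaround with what is there today.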