[Numpy-discussion] fromfile() for reading text (one more time!)
alan at ajackson.org
Mon Jan 4 22:39:42 EST 2010
>Hi folks,
>
>I'm taking a look once again at fromfile() for reading text files. I
>often have the need to read a LOT of numbers from a text file, and it
>can actually be pretty darn slow to do it the normal Python way:
>
>data = []
>for line in infile:
>    data.extend(map(float, line.strip().split()))
>
>
>or various other versions that are similar. It really does take longer
>to read the text, split it up, convert to a number, then put that number
>into a numpy array, than it does to simply read it straight into the array.
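>
>For what it's worth, the fast path already works well for
>whitespace-separated text. A minimal sketch, assuming "data.txt" holds
>rows of three whitespace-separated numbers:
>
>import numpy as np
>
># fromfile parses the text at C level -- much faster than a
># Python-level loop over the lines
>a = np.fromfile("data.txt", dtype=float, sep=" ")
>a = a.reshape(-1, 3)  # recover the row structure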
>
>However, as it stands, fromfile() turns out to be next to useless for
>anything but whitespace-separated text. Full set of ideas here:
>
>http://projects.scipy.org/numpy/ticket/909
>
>However, for the moment, I'm digging into the code to address a
>particular problem -- reading files like this:
>
>123, 65.6, 789
>23, 3.2, 34
>...
>
>That is, comma- (or whatever-) separated text -- pretty common stuff.
>
>The problem with the current code is that you can't read more than one
>line at a time with fromfile:
>
>a = np.fromfile(infile, sep=",")
>
>will read until it fails to find a comma, and thus reads only one line,
>since there is no comma at the end of each line. As this is a really
>typical case, I think it should be supported.
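>
>In the meantime, one workaround (a sketch of my own, assuming a
>three-column file named "data.csv" like the sample above) is to slurp
>the file and turn the newlines into commas before handing the text to
>fromstring:
>
>import numpy as np
>
># a bare "\n" is not a comma, so fromfile stops at the end of the
># first line; replacing newlines with commas first makes the whole
># file one comma-separated stream
>text = open("data.csv").read().strip().replace("\n", ",")
>a = np.fromstring(text, dtype=float, sep=",").reshape(-1, 3)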
>
>Here is the question:
>
>The work of finding the separator is done in:
>
>multiarray/ctors.c: fromfile_skip_separator()
>
>It looks like it wouldn't be too hard to add some code in there to look
>for a newline, and consider that a valid separator. However, that would
>break backward compatibility. So maybe a flag could be passed in, saying
>you wanted to support newlines. The problem is that flag would have to
>get passed all the way through to this function (and also for fromstring).
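>
>Something like this, perhaps (the allow_newlines flag is purely
>hypothetical -- just a sketch of what the call might look like):
>
># hypothetical flag, not something numpy has today; it would make
># "\n" acceptable as a separator alongside ","
>a = np.fromfile(infile, sep=",", allow_newlines=True)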
>
>I also notice that it supports separators of arbitrary length, though I
>wonder how useful that is. It also does odd things with spaces
>embedded in the separator:
>
>", $ #" matches all of: ",$#" ", $#" ",$ #"
>
>Is it worth trying to fix that?
>
>
>In the longer term, it would be really nice to support comments as well,
>though that would require more of a refactoring of the code, I think
>(though maybe not -- I suppose a call to fromfile_skip_separator() could
>look for a comment character and, if it found one, skip to where the
>comment ends -- hmmm).
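>
>Until then, a pure-Python pre-filter is one way to cope. A rough
>sketch of my own (the names are made up), feeding the filtered lines
>to the slow path:
>
>import numpy as np
>
>def strip_comments(f, comment="#"):
>    """Yield lines of f with comments and blank lines removed."""
>    for line in f:
>        line = line.split(comment, 1)[0]
>        if line.strip():
>            yield line
>
># parse the surviving lines the slow way, treating commas as spaces
>rows = [[float(x) for x in line.replace(",", " ").split()]
>        for line in strip_comments(open("data.csv"))]
>a = np.array(rows)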
>
>thanks for any feedback,
>
>-Chris
>
I agree. I've tried using it, and usually find that it doesn't quite get
there. I rather like the R commands for reading text files - except then
I have to use R, which is painful after using Python and NumPy. (Although
ggplot2 is awfully nice too ... but that is for a later post.)

read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", row.names, col.names,
           as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown")

read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".",
         fill = TRUE, comment.char = "", ...)

read.csv2(file, header = TRUE, sep = ";", quote = "\"", dec = ",",
          fill = TRUE, comment.char = "", ...)

read.delim(file, header = TRUE, sep = "\t", quote = "\"", dec = ".",
           fill = TRUE, comment.char = "", ...)

read.delim2(file, header = TRUE, sep = "\t", quote = "\"", dec = ",",
            fill = TRUE, comment.char = "", ...)

There is really only read.table; the others are just aliases with
different defaults. But the flexibility is great, as you can see.
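
That alias-with-different-defaults pattern would translate naturally to
Python. A hypothetical sketch (none of these functions exist in numpy,
and the names are made up purely for illustration):

# one general reader does all the real work; the others just
# re-default its arguments, as read.csv and friends do for read.table
def read_table(fname, sep=None, comment="#", skiprows=0):
    raise NotImplementedError  # the single real implementation

def read_csv(fname, **kw):
    kw.setdefault("sep", ",")
    return read_table(fname, **kw)

def read_delim(fname, **kw):
    kw.setdefault("sep", "\t")
    return read_table(fname, **kw)
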
--
-----------------------------------------------------------------------
| Alan K. Jackson | To see a World in a Grain of Sand |
| alan at ajackson.org | And a Heaven in a Wild Flower, |
| www.ajackson.org | Hold Infinity in the palm of your hand |
| Houston, Texas | And Eternity in an hour. - Blake |
-----------------------------------------------------------------------