[Numpy-discussion] Possible roadmap addendum: building better text file readers

Warren Weckesser warren.weckesser at enthought.com
Wed Mar 21 01:41:25 EDT 2012


On Tue, Mar 20, 2012 at 5:59 PM, Chris Barker <chris.barker at noaa.gov> wrote:

> Warren et al:
>
> On Wed, Mar 7, 2012 at 7:49 AM, Warren Weckesser
> <warren.weckesser at enthought.com> wrote:
> > If you are set up with Cython to build extension modules,
>
> I am
>
> > and you don't mind
> > testing an unreleased and experimental reader,
>
> and I don't.
>
> > you can try the text reader
> > that I'm working on: https://github.com/WarrenWeckesser/textreader
>
> It just took me a while to get around to it!
>
> First of all: this is pretty much exactly what I've been looking for
> for years, and never got around to writing myself - thanks!
>
> My comments/suggestions:
>
> 1) a docstring for the textreader module would be nice.
>
> 2) "tzoffset" -- this is tricky stuff. Ideally, it should be able to
> parse an ISO datetime string timezone specifier, but short of that, I
> think the default should be None or UTC -- time zones are too ugly to
> presume anything!
>
> 3) it breaks with the old MacOS style line endings: \r only. Maybe no
> need to support that, but it turns out one of my old test files still
> had them!
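> 
> If bare "\r" isn't worth supporting in the reader itself, normalizing the
> file up front is an easy workaround (plain Python, file names here are
> just placeholders):
> 
>   # rewrite old-Mac (\r-only) line endings as \n before parsing
>   with open("old_mac_file.txt", "rb") as f:
>       data = f.read().replace(b"\r\n", b"\n").replace(b"\r", b"\n")
>   with open("normalized.txt", "wb") as f:
>       f.write(data)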
>
> 4) when I try to read more rows than are in the file, I get:
>   File "textreader.pyx", line 247, in textreader.readrows
> (python/textreader.c:3469)
>  ValueError: negative dimensions are not allowed
>
> good to get an error, but it's not very informative!
>
> 5) for reading float64 values -- I get something different with
> textreader than with the python "float()":
>  input: "678.901"
>  float("") :  678.90099999999995
>  textreader : 678.90100000000007
>
> both values agree to about as many figures as a double can hold, but the
> difference is curious...
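> 
> A quick way to see how far apart the two results actually are (the second
> literal below is just the value textreader printed):
> 
>   x = float("678.901")            # Python's parser
>   y = 678.90100000000007          # what textreader returned
>   print(repr(x), repr(y))
>   print(abs(y - x), 2.0 ** -43)   # gap vs. spacing of doubles near 678.9
> 
> which puts them about one unit in the last place apart, i.e. the two
> parsers just round the final bit differently.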
>
>
> 6) Performance issue: in my case, I'm reading a big file that's in
> chunks -- each one has a header indicating how many rows follow, then
> the rows, so I parse it out bit by bit.
> For smallish files, it's much faster than pure python, and almost as
> fast as some old C code of mine that is far less flexible.
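> 
> A simplified sketch of that kind of loop (not the enclosed test code --
> parse_rows() is a placeholder for whichever reader does the work, and the
> header is assumed to be a line holding the row count):
> 
>   import numpy as np
> 
>   def read_chunked(path, parse_rows):
>       chunks = []
>       with open(path) as f:
>           while True:
>               header = f.readline()
>               if not header:
>                   break                    # end of file
>               nrows = int(header)          # rows in this chunk
>               chunks.append(parse_rows(f, nrows))  # consume exactly nrows
>       return np.concatenate(chunks)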
>
> But for large files it's much slower -- indeed slower than a pure
> python version for my use case.
>
> I did a simplified test -- with 10,000 rows:
>
> total number of rows:  10000
> pure python took: 1.410408 seconds
> pure python chunks took: 1.613094 seconds
> textreader all at once took: 0.067098 seconds
> textreader in chunks took : 0.131802 seconds
>
> but with 1,000,000 rows:
>
> total number of rows:  1000000
> total number of chunks:  1000
> pure python took: 30.712564 seconds
> pure python chunks took: 31.313225 seconds
> textreader all at once took: 1.314924 seconds
> textreader in chunks took : 9.684819 seconds
>
> then it gets even worse with a smaller chunk size:
>
> total number of rows:  1000000
> total number of chunks:  10000
> pure python took: 30.032246 seconds
> pure python chunks took: 42.010589 seconds
> textreader all at once took: 1.318613 seconds
> textreader in chunks took : 87.743729 seconds
>
> my code, which is C that basically runs fscanf over the file, has
> essentially no performance hit from doing it in chunks -- so I think
> something is wrong here.
>
> Sorry, I haven't dug into the code to try to figure out what yet --
> does it rewind the file each time maybe?
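> 
> (From the numbers above, the chunked textreader runs work out to roughly
> 9 ms per call whether a chunk holds 1000 rows or 100, which smells like a
> fixed per-call cost or a rescan rather than per-row work. Timing the calls
> individually would show which -- read_chunk() here is a placeholder for
> however one chunk actually gets parsed:)
> 
>   import time
> 
>   def time_chunks(f, nrows, nchunks, read_chunk):
>       for i in range(nchunks):
>           t0 = time.time()
>           read_chunk(f, nrows)
>           # growing per-call times would point at re-reading the file
>           # from the start on each call; flat times point at fixed setup
>           print(i, time.time() - t0)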
>
> Enclosed is my test code.
>
> -Chris
>


Chris,

Thanks!  The feedback is great.  I won't have time to get back to this for
another week or so, but then I'll look into the issues you reported.

Warren



>
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov
>

