[SciPy-dev] matlab io - request for testing

Sat Feb 21 02:58:19 EST 2009

Hi,

> I finally got a chance to test with my nasty file, and with r5561, it
> now takes ~32 minutes of cpu time to load (as compared to ~5 minutes
> for 0.7.0, and 3 seconds for 0.6.0). All the time is in
> zlibstreams.py:read.
>
> I talked to the guy whose data it is now, though, and he okayed my
> distributing an example:
>  http://roberts.vorpus.org/~njs/tmp/test.mat
>  http://roberts.vorpus.org/~njs/tmp/test-mat.txt
>  http://roberts.vorpus.org/~njs/tmp/test-mat.profile
> (Sorry the file is so large, all my attempts to minimize it somehow
> also fixed whatever is making it so pathological.)

Thanks - that's very useful.

> Does that help track things down? (This is also a good example file
> for why struct_as_record=True can be Very Very Useless, and if you
> combine struct_as_record=True with squeeze_me=True, the file ends up
> as gibberish -- a big tuple of anonymous variables, not so useful...)

Also useful - thank you.

> I'm also wondering, though, if (as you mentioned downthread somewhere)
> the matlab IO code ends up doing a single short read and then reads
> the whole actual matrix data in one fell swoop, then what benefit does
> this streaming code give us? I thought that the point was that one
> could read small chunks and avoid taking the memory for a large
> temporary buffer, but if that's not happening, then it seems like a
> very slow and fragile chunk of code for no benefit.

It may be that we'll have to pull it.  The purpose of the two-stage
read - and the original purpose of the code - was to let someone who
wants one particular variable read just enough of the zlib stream to
get the name, so that they can skip the variable if the name is not
the one they are looking for.  Otherwise, they would have to read the
whole stream - which might be very large - just to get the name.
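
For concreteness, here is a rough sketch of that idea using only the
standard library's zlib.decompressobj.  It is not the zlibstreams.py
code; the helper name peek_decompressed, the 64-byte chunk size, and
the 16-byte "name" field are all made up for illustration and do not
reflect the real MAT-file layout.

```python
import zlib


def peek_decompressed(compressed, n_bytes, chunk_size=64):
    """Return (first n_bytes of decompressed data, compressed bytes fed in).

    Hypothetical helper for illustration only; not the zlibstreams.py code.
    """
    decomp = zlib.decompressobj()
    out = b""
    pos = 0
    while len(out) < n_bytes and pos < len(compressed):
        chunk = compressed[pos:pos + chunk_size]
        pos += len(chunk)
        # max_length caps the output, so we never inflate more than we need.
        out += decomp.decompress(chunk, n_bytes - len(out))
    return out, pos


# Made-up payload: a 16-byte "name" field followed by ~1 MB of data.
name = b"my_variable".ljust(16, b"\0")
payload = name + bytes(range(256)) * 4000
compressed = zlib.compress(payload)

prefix, consumed = peek_decompressed(compressed, 16)
print(prefix.rstrip(b"\0"), consumed, len(compressed))
```

A reader built this way could compare the prefix against the wanted
name and, on a mismatch, move on without inflating the rest of the
stream.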

Thanks again,

Matthew