[Numpy-discussion] Re: [GS-discuss] Re: NumPy, Python DB-API and MySQL

Tim Churches tchur at bigpond.com
Fri Apr 14 07:25:33 EDT 2000


Andy Dustman wrote:
> 
> On Sun, 9 Apr 2000, Tim Churches wrote:
> 
> > I've been experimenting with pulling quantitative data out of a MySQL
> > table into NumPy arrays via Andy Dustman's excellent MySQLdb module and
> > then calculating various statistics from the data using Gary Strangman's
> > excellent stats.py functions, which when operating on NumPy arrays are
> > lightning-fast.
> > [...snip...]
> 
> It might be possible to do something like this. I would prefer that such a
> feature work as a seperate module (i.e. I don't think it is generally
> applicable to MySQLdb/_mysql). Or perhaps it could be a compile-time
> option for _mysql (-DUSE_NUMPY).

The latter sounds good. I agree that most users of MySQLdb would not
need it, so they shouldn't be burdened with it.

> 
> The object that you want to mess with is the _mysql result object. It
> contains an attribute MYSQL_RES *result, which is a pointer to the actual
> MySQL structure. I don't remember if NumPy arrays are extensible or not,
> i.e. can rows be appended? 

No they can't. I suspect that is the price to be paid for the efficient
storage offered by NumPy arrays.

>                            That would affect the design. If they are not
> extensible, then you are probably limited to using mysql_store_result()
> (result set stored on the client side), as opposed to mysql_use_result()
> (result set stored on the server side). mysql_store_result is probably
> preferable in this case anyway, so extensibility doesn't matter, as we can
> find the size of the result set in advance with mysql_num_rows(). Then we
> know the full size of the array.

Yes, but the problem with mysql_store_result() is the large amount of
memory required to store the result set. Couldn't the user be
responsible for predetermining the size of the array via a query such as
"select count(*) from sometable where...." and then pass this value as a
parameter to the executeNumPy() method? In MySQL at least such count(*)
queries are resolved very quickly so such an approach wouldn't take
twice the time. Then mysql_use_result() could be used to populate the
initialised NumPy array with data row, so there so only ever one
complete copy of the data in memory, and that copy is in the NumPy
array.

> 
> However, with very large result sets, it may be necessary to use
> mysql_use_result(), in which case the array will need to be extended,
> possibly row-by-row.
> 
> I could do this, but I need to know how to create and assign values to a
> NumPy array from within C. Or perhaps an initial (empty) array with the
> correct number of columns can be passed. I am pretty sure NumPy arrays
> look like sequences (of sequences), so assignment should not be a big
> problem. Easiest solution (for me, and puts least bloat in _mysql) would
> be for the user to pass in a NumPy array.

I'll look at the NumPy docs re this. Can any of the NumPy developers
give some clues re this?

> 
> Question: Would it be adequate to put all columns returned into the array?
> If label columns need to be returned, this could pose a problem. They may
> have to be returned as a separate query. Or else non-numeric columns would
> be excluded and returned in a list of tuples (this would be harder).

Yes, more thought needed here - my initial thought was one NumPy array
per column, particularly since NumPy arrays must be homogenous wrt data
type. Each NumPy array could be named the same as the column from which
it is derived.

Cheers,

Tim C






More information about the NumPy-Discussion mailing list