[SciPy-user] [Timeseries] Linux installation error

Matt Knox mattknox.ca at gmail.com
Thu Apr 2 21:56:19 EDT 2009


>> In case that doesn't work I'll probably resort to storing the
>> historical data in sql as well.  Can someone give me some pointers on
>> how I would go about that perhaps?  It's about 20 000 - 30 000 stocks
>> with on average about a decades worth of daily data.  Would I dump all
>> of that into a single table?  20 000 tables?  hdf5 certainly is much
>> better suited for something like that...

In terms of raw throughput on reading/writing simple stock price history, yes,
HDF5 will be better. But if you need any kind of concurrent access, or plan
to put some of the data up on a website, you will quickly find that the
robustness of a modern relational database outweighs the performance benefits
of a pure HDF5 solution. You could look at a hybrid solution, but the complexity
often isn't worth it. I think HDF5 and similar storage formats are
great for research projects and ad-hoc analysis, but for
a large-scale production system it is going to be hard to beat a modern
relational database. And you may be surprised by the performance you get from
a modern relational db running on modern hardware. These systems are designed to
handle LOTS of data.

Anyway, your question really comes down to database design, and I would highly
recommend doing some introductory reading on the basics of table design.
One thing about relational dbs is that it is generally very hard to change
the schema (i.e. the table structures) once your application gets to a certain
size, so you really need to plan the overall architecture ahead of time. Things
like HDF5 allow for a somewhat more carefree approach.

Also, having managed a large equity database before, I can say that it is not
something I can adequately explain in a brief email to a mailing list.

But I will say this... a typical setup would have a "master security" table
(which may actually take several tables to properly describe the securities)
with an integer id for every security, mapping it to the various
identifiers like CUSIPs, ISINs, tickers, etc. You will also need to account
for tickers changing over time.
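
Just to make that concrete, here is a minimal sketch of such a mapping (the
table and column names are made up purely for illustration), using sqlite3
from the standard library:

import sqlite3

conn = sqlite3.connect("equities.db")
conn.executescript("""
CREATE TABLE security (
    security_id INTEGER PRIMARY KEY,
    cusip       TEXT,
    isin        TEXT
);

-- tickers change over time, so keep them in a separate table with
-- validity dates instead of as a single column on the security table
CREATE TABLE ticker_history (
    security_id INTEGER REFERENCES security(security_id),
    ticker      TEXT,
    start_date  TEXT,
    end_date    TEXT    -- NULL while the ticker is still current
);
""")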

To store your actual data (let's just say it is price data for now), you could
have a table with 3 fields:

[security_id] [int],
[date] [datetime],
[price] [decimal](12,5)

The natural primary key here would be ([security_id], [date]). I won't get
into the topic of "surrogate keys", but you may want to google that too. You
would probably have a foreign key on the security_id field referencing your
master security table -- although maybe not, depending on performance
considerations.

Then there are things like indexes to consider, to optimize performance for
your usage patterns.
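
As a rough sketch of that price table (again with sqlite3 and hypothetical
names; a production db would use a real DECIMAL type and tune the indexes to
the actual query patterns):

import sqlite3

conn = sqlite3.connect("equities.db")
conn.executescript("""
CREATE TABLE price (
    security_id INTEGER REFERENCES security(security_id),
    date        TEXT,
    price       NUMERIC,    -- sqlite has no DECIMAL(12,5) type
    PRIMARY KEY (security_id, date)
);

-- an extra index to speed up "all prices on a given date" queries
CREATE INDEX idx_price_date ON price(date);
""")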

And you definitely DO NOT want 20,000 tables to store 20,000 stocks.

You'll also need to think about how you handle corporate actions in your db,
like stock splits, mergers, etc. If you get the design of this wrong, you
will be in for a world of pain maintaining the database.
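
Purely as an illustration (this is nowhere near a complete design), a stock
split could be recorded in a table like:

import sqlite3

conn = sqlite3.connect("equities.db")
conn.executescript("""
CREATE TABLE stock_split (
    security_id INTEGER REFERENCES security(security_id),
    ex_date     TEXT,
    factor      REAL    -- e.g. 2.0 for a 2-for-1 split
);
""")

and historical prices would then be adjusted by the cumulative factor of all
splits occurring after each price date.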

Now, as for reading your data into Python, you will find that numpy arrays
with compound data types work quite nicely. I often query out a big chunk of
data, store it in a numpy array with a compound data type for caching
purposes, then filter that array to get the specific pieces of data I need
(e.g. read 100 stocks' worth of data at a time into a single array, then
filter that array rather than hitting the db once for each stock). This is
something I'll probably add to the database examples in the documentation at
some point.
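
A minimal sketch of that pattern (assuming the hypothetical price table from
above):

import sqlite3
import numpy as np

conn = sqlite3.connect("equities.db")
rows = conn.execute(
    "SELECT security_id, date, price FROM price "
    "WHERE security_id BETWEEN ? AND ?", (1, 100)).fetchall()

# compound (structured) dtype: one record per row of the result set
dt = np.dtype([('security_id', np.int32),
               ('date', 'U10'),
               ('price', np.float64)])
cache = np.array(rows, dtype=dt)

# filter the cached array instead of hitting the db once per stock
one_stock = cache[cache['security_id'] == 42]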

>> Is there any possibility to use python's decimal data type?  I saw you
>> used it in your sql example on the database side, but I'm guessing
>> numpy doesn't allow for this?  I sometimes have problems with equality
>> testing after several divisions / multiplications, so at the moment I
>> revert to the "is kind of" instead of the "is equal to" approach of
>> comparison...

Numpy does not support a decimal type. I agree it would be a really nice
addition, but I can generally live without it and probably won't have the
motivation to contribute it to numpy any time soon. You can use an object
array to store decimal values, but that is not recommended. You should just
cast the decimal values to floats before pulling them into Python, unless you
are talking about an accounting system or something else where that level of
accuracy matters.
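
For example, if your db driver hands back decimal.Decimal values, a cast like
this (just a sketch) gets them into a plain float array:

from decimal import Decimal
import numpy as np

raw = [Decimal('12.34500'), Decimal('12.40000')]    # e.g. from a db cursor

# cast to float on the way in; an object array of Decimals would also work
# but is slow and loses most of numpy's vectorized operations
prices = np.array([float(p) for p in raw], dtype=np.float64)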

And I would agree with Josef's comments regarding the approach to equality
checks.
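
In numpy terms that usually means comparing with a tolerance rather than with
exact equality, e.g.:

import numpy as np

a = (0.1 + 0.2) * 10.0
b = 3.0
print(a == b)               # may be False due to floating point rounding
print(np.allclose(a, b))    # True: equal within a small tolerance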





