Python Data Analysis Recommendations

Sameer Grover sameer.grover.1 at gmail.com
Fri Jan 1 12:55:56 EST 2016


I also collect data by sweeping multiple parameters in a similar fashion,
and I find pandas very convenient for analysis.
I don't use all of pandas' features. I mainly use it for selecting certain
rows from the data, occasionally for database-style merge operations, and
for plotting with matplotlib. All of this can be done in pure numpy, but
with pandas I don't have to keep track of all the indices.

This is what my workflow is like (warning - sloppy code):

import pandas as pd

data = pd.DataFrame(<some numpy array read from file>)
data.columns = ['temperature', 'voltage_measured', 'voltage_applied',
                'channels']
for channel in data.channels.unique():
    for temperature in data.temperature.unique():
        # fit_slope(x) fits x.voltage_measured against x.voltage_applied
        # and returns the slope
        subset = data[(data['temperature'] == temperature) &
                      (data['channels'] == channel)]
        slope = fit_slope(subset)
        # append (channel, temperature, slope) to a final plotting array, etc.
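
The same double loop can also be collapsed with groupby, which hands each
(channel, temperature) subset to the fitting function in turn. A rough
sketch, assuming fit_slope is just a least-squares fit via numpy.polyfit
(my actual fitting code may differ):

import numpy as np

def fit_slope(group):
    # least-squares slope of voltage_measured vs voltage_applied
    slope, _intercept = np.polyfit(group['voltage_applied'],
                                   group['voltage_measured'], 1)
    return slope

# one slope per (channel, temperature) pair, ready for plotting
results = (data.groupby(['channels', 'temperature'])
               .apply(fit_slope)
               .reset_index(name='slope'))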


I imagine your database-driven approach would do something similar, but you
might find pandas more convenient given that it can all be done in Python
and you won't have to resort to SQL queries.
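
For example, the selection and join operations translate roughly like this,
continuing with the data frame from above (the calibration table and its
values here are invented purely for illustration):

# SQL: SELECT * FROM data WHERE temperature = 300 AND channels = 2
subset = data.query('temperature == 300 and channels == 2')

# SQL-style join against a hypothetical per-channel calibration table
calibration = pd.DataFrame({'channels': [0, 1, 2, 3],
                            'gain': [1.00, 1.01, 0.99, 1.02]})
merged = data.merge(calibration, on='channels', how='left')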

My data is small enough that I can get away with storing it as plain text,
but HDF5 is definitely a better solution.

In addition to PyTables, there is also h5py (http://www.h5py.org/). I
prefer the latter, but you might like PyTables because it is more
database-like.
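
To give a flavour of h5py (the file and dataset names below are made up),
each run can go in its own dataset, with the sweep parameters stored as
attributes:

import h5py
import numpy as np

# write one dataset per run; attributes hold the sweep parameters
with h5py.File('sweep.h5', 'w') as f:
    dset = f.create_dataset('run_001', data=np.random.rand(100, 4))
    dset.attrs['temperature'] = 300.0

# read it back
with h5py.File('sweep.h5', 'r') as f:
    run = f['run_001'][...]
    temperature = f['run_001'].attrs['temperature']

pandas can also read and write HDF5 directly through DataFrame.to_hdf and
pandas.read_hdf, which use PyTables underneath.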

Sameer



On 31 December 2015 at 22:45, Rob Gaddi <rgaddi at highlandtechnology.invalid>
wrote:

> I'm looking for some advice on handling data collection/analysis in
> Python.  I do a lot of big, time consuming experiments in which I run a
> long data collection (a day or a weekend) in which I sweep a bunch of
> variables, then come back offline and try to cut the data into something
> that makes sense.
>
> For example, my last data collection looked (neglecting all the actual
> equipment control code in each loop) like:
>
> for t in temperatures:
>   for r in voltage_ranges:
>     for v in test_voltages[r]:
>       for c in channels:
>         for n in range(100):
>           record_data()
>
> I've been using Sqlite (through peewee) as the data backend, setting up
> a couple tables with a basically hierarchical relationship, and then
> handling analysis with a rough cut of SQL queries against the
> original data, Numpy/Scipy for further refinement, and Matplotlib
> to actually do the visualization.  For example, one graph was "How does
> the slope of a straight-line fit between measured and applied voltage vary
> as a function of temperature on each channel?"
>
> The whole process feels a bit grindy; like I keep having to do a lot of
> ad-hoc stitching things together.  And I keep hearing about pandas,
> PyTables, and HDF5.  Would that be making my life notably easier?  If
> so, does anyone have any references on it that they've found
> particularly useful?  The tutorials I've seen so far seem to not give
> much detail on what the point of what they're doing is; it's all "how
> you write the code" rather than "why you write the code".  Paying money
> for books is acceptable; this is all on the company's time/dime.
>
> Thanks,
> Rob
>
> --
> Rob Gaddi, Highland Technology -- www.highlandtechnology.com
> Email address domain is currently out of order.  See above to fix.


