[Tutor] Creating 2 Subarrays for large dataset

Alan Gauld alan.gauld at yahoo.co.uk
Mon Jun 12 13:51:02 EDT 2017


On 12/06/17 16:52, Peter Gibson wrote:

> I have a large, 4 column data set that is in the form of an array.  

Do you mean 4 separate arrays or a single 4xN array?
Or do you mean an N size array of 4 item tuples?
Or are the 4 columns part of a class?

There are lots of ways to interpret that statement.
Also are you using NumPy arrays or standard Python
arrays, or standard Python lists/tuples treated as arrays?

These all have a bearing.

It might help if you posted a small sample of
your real data structure.

> last column, there is either a 1 or a 2, and they are not organized in any
> predictable manner (ex of an array of the last columns:
> 1,2,2,1,2,2,1,1,1,1,2,1,1, ect).
> 
> I would like to cut this large data set into two different arrays, one
> where the final column has a 1 in it, and the other where the final column
> of the data has a 2 in it.

A trivial way to do that (assuming 4 arrays called
col1,col2,col3,col4) is to create two lists/arrays
and iterate over the data filtering on col4:

ones = []
twos = []
# enumerate() gives the row index i alongside the col4 value n
for i, n in enumerate(col4):
   if n == 1: ones.append((col1[i],col2[i],col3[i]))
   else: twos.append((col1[i],col2[i],col3[i]))

If you must have arrays rather than lists that's a relatively
minor tweak.
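
If the data turns out to be a single list of 4-item tuples instead
(one of the possibilities raised above), the same filtering might look
something like the sketch below. The names rows, ones and twos and the
sample values are invented for illustration; they are not taken from
the real data set.

```python
# Hypothetical sample data: each row is (a, b, c, flag), where flag is 1 or 2.
rows = [
    (0.1, 0.2, 0.3, 1),
    (1.1, 1.2, 1.3, 2),
    (2.1, 2.2, 2.3, 2),
    (3.1, 3.2, 3.3, 1),
]

# Keep the first three columns of each row, split on the value in the last one.
ones = [row[:3] for row in rows if row[3] == 1]
twos = [row[:3] for row in rows if row[3] == 2]
```

This walks the data twice, which is usually fine for modest sizes; a
single loop with two append() calls avoids the second pass.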

However, depending on your data structures there are
likely more efficient ways to do it (e.g. sorting on
column 4 then using slicing, or using a dict keyed
on col4, etc).

But it all depends on how large 'large' is, what the
actual time constraints are, what the real data structures
look like etc. Premature optimisation is the root of all
evil...
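
As a rough sketch of the dict-keyed-on-col4 idea mentioned above, again
assuming four separate, equal-length columns called col1 to col4 (the
sample values and the name groups are made up for the example):

```python
from collections import defaultdict

# Hypothetical sample columns; col4 holds the 1/2 flag.
col1 = [10, 11, 12, 13]
col2 = [20, 21, 22, 23]
col3 = [30, 31, 32, 33]
col4 = [1, 2, 2, 1]

# Group the first three columns of each row under its col4 value.
groups = defaultdict(list)
for a, b, c, flag in zip(col1, col2, col3, col4):
    groups[flag].append((a, b, c))

ones = groups[1]
twos = groups[2]
```

This makes one pass over the data and would also cope with more than
two distinct values in the last column, should that ever be needed.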

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



