[melbourne-pug] Lists or Arrays - When do I use what and why - Or : Confused much? I am, Now

Ben Dyer ben.dyer at taguchimail.com
Mon Feb 7 15:06:28 CET 2011


To put some concrete numbers to this, I ran some test scripts on an Amazon m2.4xlarge instance (68GB RAM). Loading 100m records with uniformly distributed coordinate values in the range (-100.0, 100.0) into a 200x200x200 numpy array (with each element of the 3d array containing a 1d numpy array of records) takes 42m13s of CPU time and 3.2GB RAM. Using a similar structure based on Python dicts, lists and tuples (3d sparse matrix using dicts, with each element being a list, containing a tuple per record) has consumed 106m17s of CPU time and 10GB RAM for 42.5m records so far — and it's scaling much, much worse.
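For comparison, the dict/list/tuple structure was shaped roughly like this (a sketch only, not the exact test script; here the "3d sparse matrix" is just a dict keyed on the truncated (x, y, z) coordinates):

> # rough sketch of the dict/list/tuple structure used for comparison
> points = {}
> with open('/data/xyzirgb-test', 'r') as f:
>     for line in f:
>         fields = line.split()
>         if len(fields) < 7:
>             continue
>         rec = (
>             float(fields[0]), float(fields[1]), float(fields[2]), # X, Y, Z
>             int(fields[3]), # I
>             int(fields[4]), int(fields[5]), int(fields[6]) # R, G, B
>         )
>         coord = int(rec[0]), int(rec[1]), int(rec[2])
>         points.setdefault(coord, []).append(rec)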

numpy arrays support easy access to records within a bin, e.g.:

> >>> data[5,1,2]
> array([ (5.3820352554321289, 1.9680783748626709, 2.4225616455078125, 8863, 83, 123, 155),
>        (5.9297022819519043, 1.285321831703186, 2.2762794494628906, 77, 135, 127, 190),
>        (5.1742649078369141, 1.0804551839828491, 2.246323823928833, 700, 126, 192, 215),
>        (5.7373204231262207, 1.5889065265655518, 2.0195131301879883, 1027, 113, 186, 87),
>        (5.2434444427490234, 1.3134119510650635, 2.1608052253723145, 8901, 29, 203, 36),
>        (5.6749391555786133, 1.5029317140579224, 2.0740163326263428, -4095, 187, 212, 183),
>        (5.874603271484375, 1.645256519317627, 2.7637510299682617, 8471, 81, 88, 57),
>        (5.076021671295166, 1.2720983028411865, 2.7047195434570312, -5897, 239, 228, 105),
>        (5.4287075996398926, 1.7871297597885132, 2.3681249618530273, 5266, 130, 119, 252),
>        (5.1363730430603027, 1.4398738145828247, 2.2409327030181885, -106, 101, 41, 236)], 
>       dtype=[('x', '<f4'), ('y', '<f4'), ('z', '<f4'), ('i', '<i4'), ('r', '|u1'), ('g', '|u1'), ('b', '|u1')])

Slicing is also supported, so "data[5,1]" returns all records for X=5, Y=1; "data[5,:,2]" returns everything for X=5, Z=2.
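Since empty bins are just None (see the loading code below), collecting every record in a slice means skipping those, e.g. something along these lines:

> # all records for X=5, Y=1, across every Z bin (empty bins are None)
> bins = [b for b in data[5, 1] if b is not None]
> if bins:
>     recs = numpy.concatenate(bins)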

My numpy code was as follows:

> import numpy
> record_type = numpy.dtype({
>     'names': ['x', 'y', 'z', 'i', 'r', 'g', 'b'], 
>     'formats': ['f4', 'f4', 'f4', 'i4', 'u1', 'u1', 'u1']
> })
> data = numpy.empty((200, 200, 200), dtype='object')
> with open('/data/xyzirgb-test', 'r') as f:
>     for line in f:
>         line = line.split()
>         if len(line) < 7:
>             continue
>         
>         line = (
>             float(line[0]), float(line[1]), float(line[2]), # X, Y, Z
>             int(line[3]), # I
>             int(line[4]), int(line[5]), int(line[6]) # R, G, B
>         )
>         coord = int(line[0]), int(line[1]), int(line[2])
>         rec = numpy.array([line], dtype=record_type)
>         data[coord] = rec if data[coord] is None else numpy.append(data[coord], rec, axis=0)
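One caveat on the above: numpy.append copies the whole per-bin array on every insert. With roughly 12 records per bin on average (100,000,000 points over 8,000,000 bins) that's not a big deal, but if the bins were much larger it would probably be quicker to accumulate plain lists while loading and convert each bin to a structured array once at the end, something like (untested):

> # inside the loading loop, in place of the numpy.append line:
>         if data[coord] is None:
>             data[coord] = []
>         data[coord].append(line)
> 
> # then one pass at the end to turn each bin into a structured array:
> for idx in numpy.ndindex(data.shape):
>     if data[idx] is not None:
>         data[idx] = numpy.array(data[idx], dtype=record_type)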




On 07/02/2011, at 20:12 , Ben Dyer wrote:

> David,
> 
> Assuming you need to be able to refer back to the X, Y, Z co-ordinates at full precision, storing 100 million records as a flat array of C structs (3x double, 1x int32, 3x uint8) would take 2.9GB of RAM.
> 
> Having run some tests using various native Python data structures to read a file of 100 million records, I'd say the only sane way to do this on PC-class hardware is to use numpy/scipy:
> • Sparse matrix classes: http://docs.scipy.org/doc/scipy/reference/sparse.html
> • numpy array data types: http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
> 
> The appropriate structure will depend on the expected sparseness of the data (e.g. if your XYZ values are in the range (-100, 100) then you'll have 100,000,000 points spread across 8,000,000 possible bins, so using a sparse data structure probably won't be worth it). You also need to take into account the fact that you'll have a variable number of data points in each bin, so for the non-sparse approach you'd need something like a 3-dimensional array of 1-dimensional arrays of records, each record containing 3x double, 1x int32, 3x uint8 (numpy doesn't support variable-length arrays, and using the "array of native Python types" functionality with tuples etc. would probably consume too much memory).
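> In numpy terms that record layout would be something along these lines (illustrative only):
> 
> import numpy
> record_type = numpy.dtype([('x', 'f8'), ('y', 'f8'), ('z', 'f8'),  # 3x double
>                            ('i', 'i4'),                            # 1x int32
>                            ('r', 'u1'), ('g', 'u1'), ('b', 'u1')]) # 3x uint8
> data = numpy.empty((200, 200, 200), dtype='object')  # one 1d record array (or None) per bin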
> 
>> I have a large file of data which consists of  the following:  X,Y,Z,I,R,G,B
>> As sampled below.   There are a large number of files with variable lengths.
>> A few contain 100 million data points but mostly there are between 3 million
>> and 10 million points. 
> 
> All the above does of course assume that when you say "[you] wish to read this file in to memory and assign it into an array" you're referring to one of the "… large number of file[s] …" (of which, "[a] few contain 100 million data points …"). If you're actually referring to the "… large file of data …" at the start of that paragraph, taken to mean the entire contents of each data file, then there's no way you're going to fit all of that in memory on a system with "desktop" amounts of RAM, so you'd be better off processing on disk or using an extra-large (high memory) Amazon EC2 instance.
> 
> Regards,
> Ben
> 
> On 07/02/2011, at 18:08 , David Crisp wrote:
> 
>> Thanks William,
>> 
>> I read up on the dictionaries, and one of the main points I spotted was a reference on one of the sites:
>> 
>> (a) More than one entry per key not allowed. Which means no duplicate key is allowed. When duplicate keys encountered during assignment, the last assignment wins.
>> 
>> By my reading, that is saying there can only be one X value of 10; if a new X value of 10 comes along, it will supersede the previous one.
>> 
>> Because these points represent 3D space, there can actually be more than one X value of 10, and indeed more than one Y value of 10 as well.
>> 
>> Or have I misunderstood the restrictions on dictionaries?
>> 
>> Regards,
>> David
>> 
>> 
>> 
>> On Mon, 7 Feb 2011, William ML Leslie wrote:
>> 
>>> On 7 February 2011 13:15, David Crisp <dcrisp at netspace.net.au> wrote:
>>>> I then wish to iterate through the array and 'bin'  the X values into a
>>>> range.  For instance every X value between -54 and -53 goes into bin 1,  -53
>>>> to -52 goes into bin 2
>>>> 
>>>> Then I wish to iterate through this bin array and sort the Y values into the
>>>> same sort of ranges.
>>>> 
>>>> Then I wish to sort the Z values into the same sort of ranges.  This should
>>>> give me an array of Array[X][Y][Z], where Array[0][0][0] should give me the
>>>> very first cell, and I can reference any of the values simply by:
>>>> 
>>>>>>> print Array[100][100][10]
>>>> 
>>>> and get the resulting Z values for that location.
>>> 
>>> A better data structure for an index like this might actually be a
>>> dictionary. A dictionary can be keyed on (X, Y, Z) tuples.
>>> 
>>> You could do something just a little neater than that:
>>> 
>>> points = {}
>>> for line in file:
>>>  fields = line.split(None, 6)
>>>  if len(fields) < 7:
>>>      # skip blank lines and count lines
>>>      continue
>>>  x, y, z, i, r, g, b = map(float, fields)
>>>  region = points.setdefault((int(x), int(y), int(z)), [])
>>>  region.append((x, y, z, i, r, g, b))
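>>>
>>> Looking a bin up afterwards is then just a dict access, e.g.:
>>>
>>> # every record whose truncated (x, y, z) is (5, 1, 2); [] if that bin is empty
>>> region = points.get((5, 1, 2), [])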
>>> 
>>> On tuples:
>>> 
>>> Tuples are (generally) for representing heterogeneous data; much like
>>> structs or classes in other languages.  Each "column" of your table
>>> represents a different concept; whether an intensity, a location, or
>>> whatever, and it looks like they are even supposed to be different
>>> types.  Whenever each column has a distinct meaning/type, a
>>> heterogeneous collection type like a tuple, namedtuple or class is
>>> probably the way to go.
>>> 
>>> On the other hand, you probably want to talk about "the list of
>>> points"; dealing with them as a homogeneous collection.  Each point
>>> in the list, or row of your table, represents the same type of data.
>>> 
>>> Someone had suggested doing a talk next month on appropriate use of
>>> data structures, so it is probably a good idea to get some questions
>>> going on in here.  Using numpy for the detail rows is reasonable too,
>>> but that means you either give up unboxing the floats (unboxing roughly
>>> halves the memory usage of this table on CPython) or give up
>>> having different types for different columns; so it might not be
>>> worthwhile.
>>> 
>>> 


