[Neuroimaging] [EXTERNAL] Re: CZI grant - what would you like to see in Nibabel?

Emanuele Olivetti olivetti at fbk.eu
Thu Jul 30 16:56:21 EDT 2020


On Thu, Jul 30, 2020 at 6:55 PM Reid, Robert I. (Rob) via Neuroimaging <
neuroimaging at python.org> wrote:

> [...]
>
>
>
> I got a 404 error for Emanuele’s load_trk.py URL. Maybe things were
> rearranged?
>

My mistake, that was a private repo with some code in progress. I've made
available the relevant (and self-contained) file from here:
  https://github.com/emanuele/load_trk.git


>
>
> Could near numpy performance without resampling be achieved by storing the
> tractogram as a dict of numpy arrays keyed by the number of points in their
> streamlines?
>
> e.g. {20: <numpy array of all the streamlines with 20 points>, 21: <numpy
> array of all the streamlines with 21 points>, …}
>
>
>
Good question and interesting suggestion.

Nevertheless, in most of our work, we need an easy way to access the
coordinates of streamlines and to have a unique ID for each streamline. In
years of experiments and coding, we have almost always ended up with one of
these two simple data structures:
1) In the case of a tractogram (T) where the streamlines have a mixed
number of points: a numpy.array of M elements (e.g. M = 10 million) with
dtype=np.object, where each element/object is a streamline, i.e. an n x 3
matrix, where n is its number of points, which may change from streamline
to streamline.
2) In the case of a tractogram (T) where the streamlines all have the same
number of points, e.g. 16: a numpy array of shape M x 16 x 3, typically
with dtype=numpy.float32 to save some space (remember the 10M streamlines?
That's 2 GB).
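As a small sketch of the two layouts (with M = 5 streamlines instead of
millions, and random coordinates, purely for illustration; note that in
recent numpy versions the dtype is written plain `object` rather than
`np.object`):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) Mixed number of points: an object array whose elements are
#    (n x 3) float32 matrices, where n varies per streamline.
T_mixed = np.empty(5, dtype=object)
for i, n in enumerate([20, 21, 16, 30, 25]):
    T_mixed[i] = rng.random((n, 3)).astype(np.float32)

# 2) Same number of points (e.g. 16): one dense (M x 16 x 3) array.
T_fixed = rng.random((5, 16, 3)).astype(np.float32)

print(T_mixed[3].shape)  # (30, 3)
print(T_fixed.shape)     # (5, 16, 3)
```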

Why numpy.array? Mainly for its very convenient indexing. First, each
streamline has a unique ID (in the example above, an int between 0 and
9999999), which is simply its position in the array. So if, for example, we
write a nearest-neighbour algorithm and compute the 100 nearest neighbours
of the streamline with ID=123456, we just need a list or numpy.array
(neighbours) of 100 integers to store the result. If we need to retrieve
those neighbouring streamlines, it's just "T[neighbours]", which is also
very fast.
Moreover, many algorithms in data analysis / machine learning operate
directly on numpy.arrays.
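A hypothetical toy example of that ID-based retrieval (random data, small
M, and a made-up neighbour list standing in for the output of a k-NN
query):

```python
import numpy as np

rng = np.random.default_rng(1)

# Object array of 1000 streamlines with a mixed number of points;
# each streamline's ID is its position in the array.
T = np.empty(1000, dtype=object)
for i in range(1000):
    n = int(rng.integers(10, 30))       # points per streamline
    T[i] = rng.random((n, 3)).astype(np.float32)

neighbours = [3, 17, 256, 999]          # e.g. IDs from a k-NN query
selected = T[neighbours]                # one fancy-indexing step

print(selected.shape)                   # (4,)
```

The same `T[neighbours]` expression works unchanged on the dense
(M x 16 x 3) layout, where it returns a (4 x 16 x 3) array.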

A side note: at the repo above, I've just added a small test. Loading 4
million [*] streamlines with a mixed number of points takes 1 minute with
our load_streamlines() and 6 seconds with numpy.load() - not far from what
I reported in the previous message, where instead all streamlines had the
same number of points.

Best,

Emanuele

[*]: In this case, the problem is the excessive use of RAM when doing
numpy.save() - that's why I could test only 4 million on my laptop.





