Filtering XArray Datasets?

Israel Brewster ijbrewster at alaska.edu
Mon Jun 6 18:28:41 EDT 2022


I have some large (>100GB) datasets loaded into memory in a two-dimensional (X and Y) NumPy array backed XArray dataset. At one point I want to filter the data using a boolean array created by performing a boolean operation on the dataset that is, I want to filter the dataset for all points with a longitude value greater than, say, 50 and less than 60, just to give an example (hopefully that all makes sense?).

Currently I am doing this by creating a boolean array (data[‘latitude’]>50, for example), and then applying that boolean array to the dataset using .where(), with drop=True. This appears to work, but has two issues:

1) It’s slow. On my large datasets, applying where can take several minutes (vs. just seconds to use a boolean array to index a similarly sized numpy array)
2) It uses large amounts of memory (which is REALLY a problem when the array is already using 100GB+)

What it looks like is that values corresponding to True in the boolean array are copied to a new XArray object, thereby potentially doubling memory usage until it is complete, at which point the original object can be dropped, thereby freeing the memory.

Is there any solution for these issues? Some way to do an in-place filtering? 
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



More information about the Python-list mailing list