Filtering XArray Datasets?

Dennis Lee Bieber wlfraed at ix.netcom.com
Mon Jun 6 23:29:02 EDT 2022


On Mon, 6 Jun 2022 14:28:41 -0800, Israel Brewster <ijbrewster at alaska.edu>
declaimed the following:

>I have some large (>100GB) datasets loaded into memory in a two-dimensional (X and Y) NumPy array backed

	Unless you have some massive number-cruncher machine with terabytes of
RAM, you are running with a lot of page swapping -- not just cached pages in
unused RAM, but actual disk I/O.

	Pretty much anything that has to scan the data is going to be slow!

>
>Currently I am doing this by creating a boolean array (data['latitude']>50, for example), and then applying that boolean array to the dataset using .where(), with drop=True. This appears to work, but has two issues:
>

	FYI: your first paragraph said "longitude", not "latitude".
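
	For reference, the approach being described looks roughly like this
(a minimal sketch; the data-variable and coordinate names are assumed
purely for illustration):

import numpy as np
import xarray as xr

# Small stand-in for the real dataset: a 2-D grid with latitude/longitude
# coordinates (names assumed for illustration).
ds = xr.Dataset(
    {"temperature": (("y", "x"), np.random.rand(4, 5))},
    coords={
        "latitude": (("y", "x"), np.linspace(40, 70, 20).reshape(4, 5)),
        "longitude": (("y", "x"), np.linspace(-170, -130, 20).reshape(4, 5)),
    },
)

# Build a boolean mask and apply it with .where(..., drop=True), which masks
# non-matching cells and drops rows/columns that are masked entirely.
mask = ds["latitude"] > 50
subset = ds.where(mask, drop=True)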

>1) It’s slow. On my large datasets, applying where can take several minutes (vs. just seconds to use a boolean array to index a similarly sized numpy array)
>2) It uses large amounts of memory (which is REALLY a problem when the array is already using 100GB+)
>

	Personally, given the size of the data, and that it is going to involve
lots of page swapping... I'd try to convert the datasets into some RDBMS --
maybe with indices defined on the latitude/longitude columns, allowing queries
to scan the index to find matching records and return those (perhaps
processing one record at a time with "for rec in cursor:" rather than doing a
.fetchall()).
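
	A minimal sketch of that pattern using the standard-library sqlite3
module (the file, table, and column names here are assumptions for
illustration):

import sqlite3

conn = sqlite3.connect("observations.db")

# Hypothetical schema holding one record per grid point.
conn.execute(
    "CREATE TABLE IF NOT EXISTS obs (latitude REAL, longitude REAL, value REAL)"
)
# Indices on the filter columns let the query planner avoid a full table scan.
conn.execute("CREATE INDEX IF NOT EXISTS idx_obs_lat ON obs (latitude)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_obs_lon ON obs (longitude)")

cursor = conn.execute(
    "SELECT latitude, longitude, value FROM obs WHERE latitude > ?", (50.0,)
)

# Iterate the cursor rather than calling .fetchall(), so only one row at a
# time needs to be held in memory.
for lat, lon, value in cursor:
    pass  # per-record processing goes here

conn.close()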

	Some RDBMSs even have extensions for spatial data handling.
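
	SQLite, for one, ships an R*Tree module (compiled into most builds);
it is not a full spatial extension like PostGIS, but it illustrates the
idea of an index built for lat/lon range queries:

import sqlite3

conn = sqlite3.connect(":memory:")

# An R*Tree virtual table stores an id plus a lat/lon bounding box per row
# (degenerate boxes for point data); range queries walk the tree index.
conn.execute(
    "CREATE VIRTUAL TABLE obs_rtree "
    "USING rtree(id, min_lat, max_lat, min_lon, max_lon)"
)
conn.execute("INSERT INTO obs_rtree VALUES (1, 55.0, 55.0, -150.0, -150.0)")

# Everything north of 50 degrees latitude:
for row in conn.execute("SELECT id FROM obs_rtree WHERE min_lat > 50"):
    print(row)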


-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
	wlfraed at ix.netcom.com    http://wlfraed.microdiversity.freeddns.org/

