[Numpy-discussion] using loadtxt to load a text file in to a numpy array

Andrew Collette andrew.collette at gmail.com
Tue Jan 21 18:22:50 EST 2014


Hi Chris,

Just stumbled on this discussion (I'm the lead author of h5py).

We would be overjoyed if there were a 1-byte text type available in
NumPy.  String handling is the source of major pain right now in the
HDF5 world.  All HDF5 strings are text (opaque types are used for
binary data), but we're forced into using the "S" type most of the
time because (1) the "U" type doesn't round-trip between HDF5 and
NumPy, as there's no fixed-width wide-character string type in HDF5,
and (2) "U" takes 4x the space, which is a problem for big scientific
datasets.
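
The size difference in point (2) can be checked directly in NumPy. The following is a minimal sketch; the variable names and sample values are illustrative and not taken from h5py:

```python
import numpy as np

# Two fixed-width dtypes, each holding up to 10 characters per element.
bytes_dtype = np.dtype("S10")    # 1 byte per character
unicode_dtype = np.dtype("U10")  # UCS-4 storage: 4 bytes per character

print(bytes_dtype.itemsize)    # 10
print(unicode_dtype.itemsize)  # 40

# The same ASCII data stored both ways.
names = np.array([b"alpha", b"beta"], dtype="S10")
wide = names.astype("U10")  # the bytes are decoded as ASCII during the cast

print(names.nbytes, wide.nbytes)  # 20 80
```

For a dataset with many fixed-width string fields, that factor of four applies to every element, whether or not the characters need more than one byte.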

ASCII-only would be preferable, partly for selfish reasons (HDF5's
default is ASCII only), and partly to make it possible to copy them
into containers labelled "UTF-8" without manually inspecting every
value.
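
The reason ASCII-only data can be relabelled as UTF-8 without inspection is that every ASCII byte sequence is already valid UTF-8, which is not true of other 1-byte encodings such as Latin-1. A small pure-Python sketch of that property, with made-up sample strings:

```python
# ASCII bytes decode identically under UTF-8, so no per-value check is needed.
ascii_bytes = "temperature".encode("ascii")
same_text = ascii_bytes.decode("utf-8")
print(same_text == "temperature")  # True

# A Latin-1 byte string with a non-ASCII character is not valid UTF-8.
latin1_bytes = "café".encode("latin-1")  # b'caf\xe9'
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(latin1_is_valid_utf8)  # False
```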

> """At the high-level interface, h5py exposes three kinds of strings. Each
> maps to a specific type within Python (but see str_py3 below):
>
> Fixed-length ASCII (NumPy S type)
> ....
> """
> This is wrong, or misguided, or maybe only a little confusing -- 'S' is not
> an ASCII string (even though I wish it were...). But clearly the HDF folks
> think we need one!

Yes, this was intended to state that the HDF5 "Fixed-width ASCII" type
maps to NumPy "S" at conversion time, which is obviously a wretched
solution on Py3.

> >>> dset = f.create_dataset("string_ds", (100,), dtype="S10")
> """
> Pardon my py3 ignorance -- is numpy.string_ the same as 'S' in py3? From
> another post, I thought you'd need to use numpy.bytes_ (which is the same on
> py2)

It does produce an instance of 'numpy.bytes_', although I think the
h5py docs should be changed to use bytes_ explicitly.
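
This can be checked in plain NumPy, without h5py. The sketch below assumes Python 3 and uses an illustrative array in place of a real HDF5 dataset:

```python
import numpy as np

# The scalar type behind the "S" dtype is numpy.bytes_.
s_dtype = np.dtype("S10")
print(s_dtype.type is np.bytes_)  # True

# Elements read back from an "S" array are bytes, not str, on Python 3.
arr = np.array([b"hello"], dtype="S10")
elem = arr[0]
print(isinstance(elem, bytes))  # True
print(isinstance(elem, str))    # False

# Getting text back requires an explicit decode.
text = elem.decode("ascii")
print(text)  # hello
```

The explicit decode step is what makes the "S" mapping awkward on Py3 for data that is conceptually text.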

Andrew


