[Numpy-discussion] String type again.

Andrew Collette andrew.collette at gmail.com
Mon Jul 14 13:39:41 EDT 2014


Hi Chuck,

> This note proposes to adapt the currently existing 'a'
> type letter, currently aliased to 'S', as a new fixed encoding dtype. Python
> 3.3 introduced two one byte internal representations for unicode strings,
> ascii and latin1. Ascii has the advantage that it is a subset of UTF-8,
> whereas latin1 has a few more symbols. Another possibility is to just make
> it an UTF-8 encoding, but I think this would involve more overhead as Python
> would need to determine the maximum character size.

For storing data in HDF5 (PyTables or h5py), it would be somewhat
cleaner if either ASCII or UTF-8 is used, as these are the only two
character sets officially supported by the HDF5 library.  Latin-1 would
require a custom read/write converter, which isn't the end of the world
but would be tricky to do correctly, and likely somewhat slow.
We'd also run into truncation issues, since every Latin-1 character
outside the ASCII range becomes a two-byte sequence in UTF-8.
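
To make the truncation concern concrete, here is a minimal sketch in
plain Python. The sample string and the 4-byte field width are
assumptions chosen for illustration; nothing here is NumPy or HDF5 API.

```python
# Illustrative only: a 4-character Latin-1 string exactly fills a 4-byte
# fixed-width field, but its UTF-8 encoding needs 5 bytes.
text = "caf\xe9"                         # "cafe" with an acute accent on the e

latin1_bytes = text.encode("latin-1")    # one byte per character
utf8_bytes = text.encode("utf-8")        # U+00E9 becomes the two bytes C3 A9

field_width = len(latin1_bytes)          # assumed width of the target field

# Naively cutting the UTF-8 bytes to the field width splits the
# two-byte sequence, leaving data that is no longer valid UTF-8.
truncated = utf8_bytes[:field_width]

try:
    truncated.decode("utf-8")
    truncated_is_valid = True
except UnicodeDecodeError:
    truncated_is_valid = False
```

So a converter would either need a wider destination field (up to twice
the Latin-1 width) or a policy for dropping whole characters.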

I assume 'a' strings would still be null-padded?
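
By null-padded I mean the convention sketched below, written in plain
Python rather than against NumPy itself. The helper names `store` and
`load` are hypothetical, and the sketch only mimics fixed-width storage
where trailing NUL bytes are treated as padding.

```python
def store(value: bytes, width: int) -> bytes:
    """Pad value with NUL bytes up to a fixed field width (hypothetical helper)."""
    if len(value) > width:
        raise ValueError("value does not fit in the field")
    return value.ljust(width, b"\x00")


def load(field: bytes) -> bytes:
    """Strip trailing NUL padding when reading a field back (hypothetical helper)."""
    return field.rstrip(b"\x00")


field = store(b"abc", 6)            # three data bytes, three padding bytes
roundtrip = load(field)             # padding is removed on read

# A value that genuinely ends in NUL cannot be told apart from padding,
# so its trailing NUL is lost on the round trip.
lossy = load(store(b"ab\x00", 6))
```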

Andrew


