[Numpy-discussion] A one-byte string dtype?

Charles R Harris charlesr.harris at gmail.com
Tue Jan 21 13:14:28 EST 2014


On Tue, Jan 21, 2014 at 11:00 AM, Chris Barker <chris.barker at noaa.gov> wrote:

> A lot of good discussion here -- too much to comment on individually, but it
> seems we can boil it down to a few somewhat distinct proposals:
>
> 1) a one-byte-per-char dtype:
>
> This would provide compact, high-efficiency storage for common text
> for scientific computing. It is analogous to a lower-precision numeric type
> -- i.e. it could not store arbitrary unicode strings -- only the subset that
> is compatible with the suggested encoding.
>  Suggested encoding: latin-1
>  Other options:
>      - ascii only.
>      - settable to any one-byte-per-char encoding supported by python.
>        I like this IFF it's pretty easy, but it may add significant
>        complications (and overhead) for comparisons, etc.
>
> NOTE: This is NOT a way to conflate bytes and text, and not a way to "go
> back to the py2 mojibake hell" -- the goal here is to very clearly have
> this be text data, and have a clearly defined encoding. Which is why we
> can't just use 'S' -- or adapt 'S' to do this. Rather, it is a way
> to conveniently and efficiently use numpy for text that is ansi compatible.
>
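To make the storage trade-off in (1) concrete, here is a rough sketch using only what numpy offers today. It emulates a one-byte-per-char layout by encoding a 'U' array to latin-1, which yields an 'S' array. That is exactly the bytes/text conflation the proposal wants to avoid, so treat it as an illustration of the memory footprint and of the "subset of unicode" limitation, not as the proposed dtype. The sample words and variable names are made up for the example.

```python
import numpy as np

# Sample labels; the escapes are U+00E9 (e-acute), U+00C5 and U+00F6.
words = ["temp", "caf\u00e9", "\u00c5ngstr\u00f6m"]

# Existing unicode dtype: fixed width, 4 bytes per character.
as_unicode = np.array(words)

# Emulate a one-byte-per-char layout by encoding each element to latin-1.
# The result is an 'S' (bytes) array, so text semantics are lost here.
as_latin1 = np.char.encode(as_unicode, "latin-1")

print(as_unicode.dtype.itemsize)  # bytes per element in the 'U' array
print(as_latin1.dtype.itemsize)   # bytes per element in the 'S' array

# Decoding with the same encoding restores the original text.
round_trip = np.char.decode(as_latin1, "latin-1")

# Characters outside latin-1 (here U+03C0, Greek pi) cannot be stored.
try:
    np.char.encode(np.array(["\u03c0"]), "latin-1")
    pi_fits = True
except UnicodeEncodeError:
    pi_fits = False
```

The longest word has 8 characters, so the 'U' array needs four times the per-element storage of the latin-1 bytes array. A dedicated dtype would presumably keep the compact layout while handing back str rather than bytes.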
> 2) a utf-8 dtype:
>     NOTE: this CANNOT be used in place of (1) above. It is not a one-byte
> per char encoding, so it would not fit snugly into the numpy data model.
>    It would give compact memory use for mostly-ascii data, so that would
> be nice.
>
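The point in (2) about utf-8 not being one byte per char can be checked with plain Python. This is a small sketch with made-up sample strings. It shows that the byte length only matches the character count for pure ascii, and that cutting a buffer at a fixed byte count can split a character, which is the awkward part for a fixed-itemsize dtype.

```python
# Sample strings: ascii, one latin-1 char, the euro sign, one emoji.
samples = ["numpy", "caf\u00e9", "\u20ac100", "\U0001F600"]

# Map each string to (number of characters, number of UTF-8 bytes).
sizes = {text: (len(text), len(text.encode("utf-8"))) for text in samples}
for text, (n_chars, n_bytes) in sizes.items():
    print(f"{text!a}: {n_chars} chars, {n_bytes} bytes")

# Cutting a UTF-8 buffer at a fixed byte count can split a character:
# the first 4 bytes of the 5-byte encoding end in a dangling lead byte.
truncated = "caf\u00e9".encode("utf-8")[:4]
try:
    truncated.decode("utf-8")
    truncation_ok = True
except UnicodeDecodeError:
    truncation_ok = False
```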
> 3) a fully python-3-like (PEP 393) flexible unicode dtype.
>   This would get us the advantages of the new py3 unicode model -- compact
> and efficient when it can be, but also supporting all of unicode. Honestly,
> this seems like more work than it's worth to me, at least given the current
> numpy dtype model -- maybe a nice addition to dynd. You can, after
> all, simply use an object array with py3 strings in it. Though perhaps
> using the py3 unicode type, but having a dtype that specifically links to
> that, rather than a generic python object would be a good compromise.
>
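For (3), the "compact and efficient when it can be" behaviour of PEP 393 strings can be observed from Python itself. The sketch below assumes CPython 3.3 or later, since other interpreters may store strings differently. `bytes_per_char` is a hypothetical helper written for this example, not part of any library.

```python
import sys

def bytes_per_char(ch):
    """Estimate CPython's per-character storage for a string made of `ch`."""
    # Compare two fresh strings of different lengths so the fixed object
    # header cancels out and only the per-character cost remains.
    return (sys.getsizeof(ch * 2000) - sys.getsizeof(ch * 1000)) // 1000

# One representative character per storage class.
widths = {
    "ascii 'a'": bytes_per_char("a"),
    "latin-1 U+00E9": bytes_per_char("\u00e9"),
    "BMP U+03B1": bytes_per_char("\u03b1"),
    "astral U+1F600": bytes_per_char("\U0001F600"),
}
print(widths)
```

Under that assumption the widest character in a string sets the width for the whole string (1, 2 or 4 bytes per character). An object array of py3 strings already gets this behaviour per element, at the cost of one Python object per entry.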
>
> Hmm -- I guess despite what I said, I just wrote the starting point for a
> NEP...
>
>
The NEP should also mention the reasons for adding a new data type.

<snip>

Chuck