[Numpy-discussion] Extent of unicode types in numpy

Tue Feb 7 12:09:05 EST 2006

On Tue, 07 Feb 2006 at 12:26 -0700, Travis Oliphant wrote:
> Python itself hands us this difference.  Is it really so different than 
> the fact that python integers are either 32-bit or 64-bit depending on 
> the platform?
> 
> Perhaps what this is telling us, is that we do indeed need another 
> data-type for 4-byte unicode.   It's how we solve the problem of 32-bit 
> or 64-bit integers (we have a 64-bit integer on all platforms).

Agreed.

> Then in NumPy we can support going back and forth between UCS-2 (which 
> we can then say is UTF-16) and UCS-4.

If this could be implemented, then excellent!

> The issue with saving to disk is really one of encoding anyway.  So, if 
> PyTables wants to do this correctly, then it should be using a 
> particular encoding anyway.

The problem with Unicode encodings is that most of them (I'm thinking of
UTF-8 and UTF-16) use (correct me if I'm wrong here) more than one code
unit to encode values that don't fit in a single one: multi-byte
sequences in UTF-8 for anything beyond 7 bits, and surrogate pairs in
UTF-16 for anything beyond the 16-bit range. This leads to a *variable*
length of the encoded output. And this is precisely the point: PyTables
(like NumPy itself, or any other piece of software with efficiency in
mind) requires a *fixed* amount of space for keeping data, not a space
that can be bigger or smaller depending on the number of extra code
units needed to encode a certain Unicode string.
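A minimal sketch of the variable-length point, assuming a modern Python 3
interpreter. The helper name `encoded_sizes` and the sample characters are
made up for illustration; only the standard `str.encode` codecs are used.

```python
# One character from each size class: ASCII, Latin-1, rest of the BMP,
# and a character beyond the BMP (U+1D11E, MUSICAL SYMBOL G CLEF).
samples = ["a", "\u00e9", "\u20ac", "\U0001D11E"]

def encoded_sizes(ch):
    """Return the byte counts of ch in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return tuple(len(ch.encode(enc))
                 for enc in ("utf-8", "utf-16-le", "utf-32-le"))

sizes = {ch: encoded_sizes(ch) for ch in samples}

for ch, (u8, u16, u32) in sizes.items():
    print(f"U+{ord(ch):06X}: utf-8={u8} utf-16={u16} utf-32={u32}")
```

Only the UTF-32 column stays constant at 4 bytes per character; the UTF-8
size grows from 1 to 4 bytes, and UTF-16 jumps from 2 to 4 bytes once a
surrogate pair is needed.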

But if what you are saying is that NumPy would adopt a 32-bit Unicode
type internally and then do the appropriate conversion to/from the
Python interpreter, then this is perfect, because it is the NumPy
buffer that will be written to and read from disk, not the Python
object, and the buffer of such a NumPy object meets the requirements
for an efficient buffer: fixed length *and* large enough to hold
*every* Unicode character without the need for a variable-length
encoding.
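To make the fixed-length idea concrete, here is a rough sketch in plain
Python 3 of a fixed-width, 32-bit-per-character record. It is not how
NumPy or PyTables store data; `pack_fixed`, `unpack_fixed` and `WIDTH` are
hypothetical names, and NUL padding is just one possible convention.

```python
WIDTH = 8  # assumed field width, in characters

def pack_fixed(s, width=WIDTH):
    """Encode s as exactly `width` 32-bit code points, NUL-padded."""
    if len(s) > width:
        raise ValueError("string longer than field width")
    return s.ljust(width, "\0").encode("utf-32-le")

def unpack_fixed(buf):
    """Decode a record written by pack_fixed, dropping the NUL padding."""
    return buf.decode("utf-32-le").rstrip("\0")

# ASCII, a Latin-1 character, and a character beyond the BMP.
words = ["numpy", "caf\u00e9", "\U0001D11E clef"]
records = [pack_fixed(w) for w in words]

# Every record occupies the same number of bytes, whatever it contains.
print([len(r) for r in records])
```

Because each record is always `4 * WIDTH` bytes, the n-th string in a file
can be located by simple offset arithmetic, which is exactly what a
variable-length encoding does not allow.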

> I think the best solution is to define separate UCS4 and UCS2 data-types 
> and handle conversion between them using the casting functions.   This 
> is a bit of work to implement, but not too bad...

Well, I don't quite follow you here. I thought that you were proposing a
32-bit Unicode type for NumPy and then converting it appropriately to
UCS2 (conversion to UCS4 wouldn't be necessary, as it would be the same
as the native NumPy Unicode type) only when the user requires a
scalar out of the NumPy object. But here you are talking about defining
separate UCS4 and UCS2 data-types. I admit that I'm lost here...

Regards,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"