[Numpy-discussion] newunicode branch started to fix unicode to always be UCS4
Francesc Altet
faltet at carabos.com
Thu Feb 9 04:50:03 EST 2006
On Thursday 09 February 2006 06:36, Travis Oliphant wrote:
> Travis Oliphant wrote:
> > I've started a branch on SVN to fix the unicode implementation in
> > NumPy so that internally all unicode arrays use UCS4. When a scalar
> > is obtained it will be the Python unicode scalar and the required
> > conversions (and data-copying) will be done.
> > If anybody would like to help the branch is
>
> Well, it turned out not to be too difficult. It is done.
Oh my! If I hadn't met you in person, I would be tempted to think that
you are not human ;-)
> All Unicode
> arrays are now always 4-bytes-per character in NumPy. The length is
> specified in terms of characters (not bytes). This is different than
> other types, but it's consistent with the use of Unicode as characters.
Yes, I think this is a good idea.
> The array-scalar that a unicode array produces inherits directly from
> Python unicode type which has either 2 or 4 bytes depending on the build.
>
> On narrow builds where Python unicode is only 2-bytes, the 4-byte
> unicode is converted to 2-byte using surrogate pairs.
Very good!
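For readers less familiar with surrogate pairs, here is a minimal
sketch (in plain Python, not NumPy's actual C implementation; the
helper name is mine) of how a UCS4 code point above the Basic
Multilingual Plane maps onto two UTF-16 code units:

```python
def to_surrogate_pair(code_point):
    """Split a code point > 0xFFFF into a (high, low) surrogate pair."""
    assert 0x10000 <= code_point <= 0x10FFFF
    offset = code_point - 0x10000        # 20-bit offset into the supplementary planes
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low

# U+1D11E (MUSICAL SYMBOL G CLEF) -> (0xD834, 0xDD1E)
print([hex(u) for u in to_surrogate_pair(0x1D11E)])
```

This agrees with what Python's own UTF-16 codec produces for the same
character, so on narrow builds the round trip should be lossless for
any valid code point.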
> There may be lingering bugs of course, so please try it out and report
> problems.
Well, I've tried it for a while and it seems to me that you did a
very good job! Just one little thing:
# Using an UCS4 interpreter here
>>> len(buffer(numpy.array("qsds", 'U4')[()]))
16
>>> numpy.array("qsds", 'U4')[()].dtype
dtype('<U4')
>>> len(buffer(numpy.array("qsds", 'U3')[()]))
12
>>> numpy.array("qsds", 'U3')[()].dtype
dtype('<U3')
So far so good. But with a UCS2 interpreter we have:
# Using an UCS2 interpreter here
>>> len(buffer(numpy.array("qsds", 'U4')[()]))
8 # Fine
>>> numpy.array("qsds", 'U4')[()].dtype
dtype('<U2') # Shouldn't this be U4?
>>> len(buffer(numpy.array("qsds", 'U3')[()]))
6 # Fine
>>> numpy.array("qsds", 'U3')[()].dtype
dtype('<U1') # Shouldn't this be U3?
I'll try to do more serious testing and contribute the results back as
a series of unit tests.
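For what it's worth, here is a sketch of the kind of check I have in
mind (the function name is mine): with the new always-UCS4 storage,
the itemsize of a 'U<n>' dtype should always be 4*n bytes regardless
of the interpreter's internal unicode width.

```python
import numpy as np

def test_unicode_itemsize():
    # UCS4 storage: 4 bytes per character, for any declared length
    for n in (1, 3, 4, 8):
        dt = np.dtype('U%d' % n)
        assert dt.itemsize == 4 * n

test_unicode_itemsize()
```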
Finally, one final consideration. From a FAQ about Unicode
(http://www.cl.cam.ac.uk/~mgk25/unicode.html), one can read:
"""
No endianess is implied by the encoding names UCS-2, UCS-4, UTF-16,
and UTF-32, though ISO 10646-1 says that Bigendian should be preferred
unless otherwise agreed. It has become customary to append the letters
"BE" (Bigendian, high-byte first) and "LE" (Littleendian, low-byte
first) to the encoding names in order to explicitly specify a byte
order.
"""
In NumPy, it seems that the endianness is the same as the platform's,
while the ISO recommendation seems to say that big-endian should be
preferred. I don't know what Python's convention is here, but in any
case I'd follow the Python convention, not the ISO one.
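As far as I can tell, Python's own convention is native byte order:
encoding to plain 'utf-32' (without an explicit -be/-le suffix) emits
code units in the platform's order, marked with a BOM. A quick
stdlib-only check:

```python
import codecs
import sys

# Encoding with the endian-neutral codec uses the native byte order
# and prepends the matching byte-order mark.
bom = "A".encode("utf-32")[:4]

if sys.byteorder == "little":
    assert bom == codecs.BOM_UTF32_LE
else:
    assert bom == codecs.BOM_UTF32_BE
```

So a NumPy that stores '<U4' on little-endian platforms would be
consistent with what the interpreter itself does.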
Cheers,
--
>0,0< Francesc Altet http://www.carabos.com/
V V Cárabos Coop. V. Enjoy Data
"-"