[Numpy-discussion] Extent of unicode types in numpy

Francesc Altet faltet at carabos.com
Tue Feb 7 06:44:01 EST 2006


On Tuesday 07 February 2006 08:16, Travis Oliphant wrote:
> In current SVN, numpy assumes 'w' is 2-byte unicode and 'W' is 4-byte
> unicode in the array interface typestring.   Right now these codes
> require that the number of bytes be specified explicitly (to satisfy the
> array interface requirement).   There is still only 1 Unicode data-type
> on the platform and it has the size of Python's Py_UNICODE type.  The
> character 'U' continues to be useful on data-type construction to stand
> for a unicode string of a specific character length. Its internal dtype
> representation will use 'w' or 'W' depending on how Python was compiled.
>
> This may not solve all issues, but at least it's a bit more consistent
> and solves the problem of
>
> dtype(dtype('U8').str) not producing the same datatype.
>
> It also solves the problem of unicode written out with one compilation
> of Python and attempted to be read in with another (it won't let you
> because only one of 'w#' or 'W#' is supported on a platform).

While I agree that this solution is more consistent, I must say that
I'm not very comfortable with having to deal with two different widths
for unicode characters. What bothers me is the lack of portability of
unicode strings when saving them to disk with a UCS4-enabled Python
interpreter and retrieving them with a UCS2-enabled one, in the
context of PyTables (or any other database). Let's suppose that a user
has a numpy object of type unicode that has been created with a
UCS4-enabled Python. This would look like:

# UCS4-aware interpreter here
>>> numpy.array(u"\U000110fc", "U1")
array(u'\U000110fc', dtype=(unicode,4))

Now, suppose that you save this in a PyTables file (for example) and
you want to regenerate it on a Python interpreter compiled with UCS2.
As the buffer on-disk has a fixed length, we are forced to use unicode
types twice as large as containers for this data. So the net effect is
that we will end up in the UCS2 interpreter with an object like:

# UCS2-aware interpreter here
>>> numpy.array(u"\U000110fc", "U2")
array(u'\U000110fc', dtype=(unicode,4))

which is apparently the same as the one above, but not quite. To begin
with, the former is a unicode scalar with only *one* character, while
the latter has *two* characters. But worse than that, the
interpretation of the original content changes drastically on the UCS2
platform. For example, if we select the first and second characters of
the string on the UCS2-aware platform, we get:

>>> numpy.array(u"\U000110fc", "U2")[()][0]
u'\ud804'
>>> numpy.array(u"\U000110fc", "U2")[()][1]
u'\udcfc'

which have nothing to do with the original \U000110fc character (I'd
expect to get at least the truncated values \u0001 and \u10fc). I
think this is because of the convention used to represent 32-bit
unicode characters in UTF-16, a technique called "surrogate pairs"
(see: http://www.unicode.org/glossary/).

All in all, my opinion is that allowing the coexistence of different
sizes of unicode types in numpy would be a recipe for disaster when
one wants to transport unicode characters between platforms whose
Python interpreters were compiled with different unicode sizes.
Consequently, I'd propose to support just one size of unicode type in
numpy, namely the 4-byte one, and if this size doesn't match that of
the underlying Python interpreter, then refuse to deliver native
unicode objects when the user asks for them. Something like this
would work:

# UCS2-aware interpreter here
>>> h=numpy.array(u"\U000110fc", "U1")
>>> h  # This is a 'true' 32-bit unicode array in numpy
array(u'\U000110fc', dtype=(unicode,4))
>>> h[()]    # Try to get a native unicode object in python
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: unicode sizes in numpy and your Python interpreter don't
match. Sorry, but you should get a UCS4-enabled Python interpreter if
you want to successfully complete this operation.
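For what it's worth, here is a pure-Python sketch of the guard I have
in mind (the function name, the little-endian assumption and the
message are just illustrative; the real check would have to live in
numpy's scalar-conversion code):

import struct
import sys

def to_python_unicode(buf):
    """Turn a raw little-endian UCS4 buffer into a Python unicode
    object, refusing when the interpreter is a narrow (UCS2) build."""
    if sys.maxunicode <= 0xffff:   # narrow build: 0xffff, wide: 0x10ffff
        raise ValueError("unicode sizes in numpy and your Python "
                         "interpreter don't match")
    # Unpack the raw 4-byte code points and join them into a string.
    codepoints = struct.unpack("<%dI" % (len(buf) // 4), buf)
    return u"".join([unichr(cp) for cp in codepoints])

On a UCS4 build this simply decodes the buffer; on a UCS2 build it
raises the error shown above instead of silently delivering surrogate
pairs.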

As a bonus, we could get rid of the 'w' and 'W' typecodes, which have
been introduced a bit forcedly, IMO. I don't know, however, how
difficult it would be to implement this in numpy. Another option would
be to refuse to compile numpy with UCS2-aware interpreters, although
that sounds a bit extreme; but see below.

OTOH, I'm not an expert in Unicode, but after googling a bit, I've
found interesting recommendations about its use in Python. The first
is from Uche Ogbuji in http://www.xml.com/pub/a/2005/06/15/py-xml.html.
Here is the relevant excerpt:

"""
I also want to mention another general principle to keep in mind: if
possible, use a Python install compiled to use UCS4 character storage
[...] UCS4 uses more space to store characters, but there are some
problems for XML processing in UCS2, which the Python core team is
reluctant to address because the only known fixes would be too much of
a burden on performance. Luckily, most distributors have heeded this
advice and ship UCS4 builds of Python.
"""

So, it seems that the Python crew is not interested in solving
problems with UCS2. Now, towards the end of PEP 261 ('Support for
"wide" Unicode characters') one can read this final conclusion:

"""
This PEP represents the least-effort solution. Over the next several
years, 32-bit Unicode characters will become more common and that may
either convince us that we need a more sophisticated solution or (on
the other hand) convince us that simply mandating wide Unicode
characters is an appropriate solution.
"""

This PEP dates from 27-Jun-2001, so the "next several years" the
author refers to are nowadays. In fact, the interpreters on my
Debian-based Linux box are both compiled with UCS4. Despite this, it
seems that the default when compiling Python is still UCS2, given
that you need to pass the flag "--enable-unicode=ucs4" if you want to
end up with a UCS4-enabled interpreter. I wonder why they keep this
default if it can positively lead to problems with XML, as Uche
Ogbuji points out.

Anyway, I don't know if the recommendation of compiling Python with
UCS4 is spread enough or not in the different distributions, but
people can easily check this with:

>>> len(buffer(u"u"))
4

If the output of this is 4 (as in my example), then the interpreter is
using UCS4; if it is 2, it is using UCS2.
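
An equivalent check that avoids the buffer trick is sys.maxunicode,
which reports the largest code point the interpreter can store:

>>> import sys
>>> sys.maxunicode   # 1114111 (0x10ffff) on UCS4 builds, 65535 on UCS2
1114111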

Finally, I agree that asking for help about these issues on the
python list would be a good idea.

Cheers,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"




