[Python-Dev] Help with Unicode arrays in NumPy

"Martin v. Löwis" martin at v.loewis.de
Tue Feb 7 21:53:16 CET 2006


Travis E. Oliphant wrote:
> Numpy supports arrays of arbitrary fixed-length "records".  It is
> much more than numeric-only data now.  One of the fields that a
> record can contain is a string.  If strings are supported, it makes
> sense to support unicode strings as well.

Hmm. How do you support strings in fixed-length records? Strings are
variable-sized, after all.

One common application is that you have a C struct in some API
with a fixed-size array for string data (either with a length
field or null-terminated); in that case, it is moderately useful
to model such a struct in Python. However, transferring this to
Unicode is pointless: there aren't any similar Unicode structs
that need support.

> This allows NumPy to memory-map arbitrary data-files on disk.

Ok, so this is the "C struct" case. Then why do you need Unicode
support there? Which common file format has embedded fixed-size
Unicode data?
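
(For byte strings, at any rate, that use case is already simple; a
rough sketch, with the file name and layout invented:)

import numpy as np

# Map a file of fixed-length records in place: no copying, no parsing.
rec_t = np.dtype([('name', 'S10'), ('value', '<i4')])
data = np.memmap('records.bin', dtype=rec_t, mode='r')

print(data['value'].sum())   # column-wise operations on the mapped file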

> Perhaps you should explain why you think NumPy "shouldn't support
> Unicode"

I think I said "Unicode arrays", not Unicode. Unicode arrays are
a pointless data type, IMO. Unicode always comes in strings
(i.e. variable sized, either null-terminated or with an introducing
length). On disk/on the wire Unicode comes as UTF-8 more often
than not.

Using UCS-2/UCS-4 as an on-disk representation is also questionable
practice (although admittedly Microsoft uses that a lot).

> That is currently what is done.  The current unicode data-type is 
> exactly what Python uses.

Then I wonder how this fits with the "map arbitrary files on
disk" use case.

> The chararray subclass gives to unicode and string arrays all the 
> methods of unicode and strings (operating on an element-by-element
> basis).

For strings, I can see use cases (although I wonder how you deal
with data formats that also support variable-sized strings, as
most data formats supporting strings do).
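
For concreteness, those element-wise methods look roughly like this
(array contents invented; using the chararray interface mentioned
above):

import numpy as np

names = np.char.array([b'spam', b'eggs', b'bacon'])

# The string methods are applied element-by-element across the array.
print(names.upper())           # [b'SPAM' b'EGGS' b'BACON']
print(names.startswith(b'b'))  # [False False  True]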

> Please explain why having zero of them is *sufficient*.

Because I (still) cannot imagine any specific application that
might need such a feature (IOW: YAGNI).

>> If the purpose is to support arbitrary Unicode characters, it
>> should use 4 bytes (as two bytes are insufficient to represent
>> arbitrary Unicode characters).
> 
> 
> And Python does not support arbitrary Unicode characters on narrow 
> builds?  Then how is \U0010FFFF represented?

It's represented using UTF-16. Try this for yourself:

py> len(u"\U0010FFFF")
2
py> u"\U0010FFFF"[0]
u'\udbff'
py> u"\U0010FFFF"[1]
u'\udfff'

This has all kinds of non-obvious implications.

> The purpose is to represent bytes as they might exist in a file or 
> data-stream according to the users specification.

See, and this is precisely the statement that I challenge. Sure,
they "might" exist - but I'd rather expect that they don't.

If they exist, "Unicode" might come as variable-sized UTF-8, UTF-16,
or UTF-32. In either case, NumPy should already support that by
mapping a string object onto the encoded bytes, to which you then
can apply .decode() should you need to process the actual Unicode
data.

> The purpose is 
> whatever the user wants them for.  It's the same purpose as having an
>  unsigned 64-bit data-type --- because users may need it to represent
>  data as it exists in a file.

No. I would expect you to have 64-bit longs because users *do* need
them, and because there wouldn't be an easy work-around if users
didn't have them. For Unicode, it's different: users don't directly
need Unicode arrays (at least, not many users do), and if they do,
there is an easy work-around for their absence.

Say I want to process NTFS run lists. In NTFS run lists, there are
24-bit integers, 40-bit integers, and 4-bit integers (i.e. nibbles).
Can I represent them all in NumPy? Can I have NumPy transparently
map a sequence of run list records (which are variable-sized) as an
array of run list records?
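
(The practical answer is "only by hand": read raw bytes and assemble
the odd-sized fields yourself. A rough sketch, with the byte layout
simplified for illustration:)

import numpy as np

# One simplified run-list entry: header byte, 3-byte length, 3-byte offset.
raw = np.frombuffer(b'\x33\x01\x02\x03\x04\x05\x06', dtype=np.uint8)

header = raw[0]
len_size, off_size = header & 0x0F, header >> 4   # the two nibbles

# Assemble a little-endian 24-bit integer byte by byte.
length = int(raw[1]) | int(raw[2]) << 8 | int(raw[3]) << 16
print(len_size, off_size, length)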

Regards,
Martin


