[SciPy-dev] Some Q's vis-a-vis Numpy unicode support

Tue Aug 11 23:40:24 EDT 2009

On Tue, Aug 11, 2009 at 11:18 PM, <josef.pktd at gmail.com> wrote:
> On Tue, Aug 11, 2009 at 10:28 PM, David
> Goldsmith<d_l_goldsmith at yahoo.com> wrote:
>> Thanks, Josef.  This may just be an artifact of working in a DOS Terminal (but your example, though not printing the accented e, did at least print something different for b vs. b.capitalize()), or it may be because I don't know the right encoding to use, but I tried your code w/ what I found on Wikipedia to be the unicode for the Greek letter delta, namely, u'\x03b04', with both 'cp1252' and 'iso8859-7' encoding (the latter being inferred from the same Wikipedia article) and here's what I get:
>>
>>>>> b = np.array([u'\x03b04',u'\x03b04'],'<U1').view(np.chararray)
>>>>> print b.encode('cp1252')[0]
>> ♥
>>>>> print b.capitalize().encode('cp1252')[0]
>> ♥
>>>>> print b.encode('iso8859-7')[0]
>> ♥
>>>>> print b.capitalize().encode('iso8859-7')[0]
>> ♥
>>
>> i.e., no difference.  If I'm doing something wrong, please let me know; otherwise, for the purpose of documenting chararray.capitalize() - which is my ultimate goal - is there any rhyme or reason behind which unicode characters capitalize() works on and which it doesn't?
>>
>> Thanks,
>>
>> DG
>> --- On Tue, 8/11/09, josef.pktd at gmail.com <josef.pktd at gmail.com> wrote:
>>
>>> actually this works (in Idle)
>>>
>>> >>> b =
>>> np.array([u'\xe9',u'\xe9'],'<U1').view(np.chararray)
>>> >>> print b.encode('cp1252')[0]
>>> é
>>> >>> print b.capitalize().encode('cp1252')[0]
>>> É
>>> >>> print b[0].encode('cp1252')
>>> é
>>>
>>>
>>> This looks like a bug. Or is it a known limitation that chararrays
>>> cannot be 0-d?
>>>
>>> >>> b0 = np.array(u'\xe9','<U1').view(np.chararray)
>>> >>> print b0.encode('cp1252')
>>> Traceback (most recent call last):
>>>   File "<pyshell#47>", line 1, in <module>
>>>     print b0.encode('cp1252')
>>>   File "C:\Programs\Python25\Lib\site-packages\numpy\core\defchararray.py", line 217, in encode
>>>     return self._generalmethod('encode', broadcast(self, encoding, errors))
>>>   File "C:\Programs\Python25\Lib\site-packages\numpy\core\defchararray.py", line 162, in _generalmethod
>>>     newarr[:] = res
>>> ValueError: cannot slice a 0-d array
>>>
>>>
>>> >
>>> > Josef
>>> >
>>> >>>
>>> >>> Unless the answer is "No," my real question:
>>> >>>
>>> >>> 1) Does chararray.capitalize() capitalize non-Roman letters
>>> >>> that have different lower-case and upper-case forms (e.g.,
>>> >>> the Greek letters)?  If "yes," are there any exceptions
>>> >>> (e.g., Russian letters)?
>
> I think yes; the exceptions are languages whose script has no capital
> letters, e.g. Cantonese (Chinese)?
> http://www.isthisthingon.org/unicode/index.phtml?page=03&subpage=B&glyph=03B04
>  ??? google search for 03B04,
>
>>> >>>
>>> >>> Thanks!
>>> >>>
>>> >>> DG
>>> >>>
>>> >>>
>
> I have problems finding the correct codes for the characters and
> usually need a word processor.
>
> To me it looks like your character is not a Greek delta:
>
>>>> print u'\x03b04'
>  b04
>>>> print u'\u03b04'
> ΰ4
>>>> print u'\u03b4'
> δ
>
> I don't know what it is, since it doesn't render to anything meaningful.
>
> I managed to get the Greek delta through the HTML code for it, δ, from the page:
> http://www.isthisthingon.org/unicode/index.phtml?page=00&subpage=3&hilite=003B4
>
>
> running this script:
>
>
> # -*- coding: utf-8 -*-
> import numpy as np
>
> sd = u'δ'
> print sd
>
> b = np.array([u'\u03b4',u'\u0394'],'<U1').view(np.chararray)
> print b[0]
> print repr(b[0])
> print b.capitalize()[0]
> print repr(b.capitalize()[0])
>
> ***********
> prints this in my Idle shell
>>>>
> δ
> δ
> u'\u03b4'
> Δ
> u'\u0394'
>
> delta is correctly capitalized
>
>
> Josef
>

Trying again without copying and pasting non-ASCII characters: the page at
http://www.isthisthingon.org/unicode/index.phtml?page=00&subpage=3&glyph=003B4

also gives the UTF-8 bytes \xCE\xB4; everything looks OK starting from this.
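
For readers on Python 3, where bytes and text are separate types, the same
check might look roughly like the sketch below. It assumes Python 3 (the
sessions in this thread are Python 2), the names raw and delta are only
illustrative, and ascii() is used so the output does not depend on what the
console can display.

```python
# Assumption: Python 3, where bytes and str are distinct types.
raw = b'\xce\xb4'              # the UTF-8 bytes listed for Greek small delta
delta = raw.decode('utf-8')    # a one-character str

print(ascii(delta))                        # escaped form of the small delta
print(ascii(delta.capitalize()))           # escaped form of the capital delta
print(delta.capitalize().encode('utf-8'))  # UTF-8 bytes of the capital delta
```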

Josef

>>> '\xCE\xB4'.decode('utf8')
u'\u03b4'
>>> print '\xCE\xB4'.decode('utf8')
δ
>>> print '\xCE\xB4'.decode('utf8').capitalize()
Δ
>>> b = np.array(['\xCE\xB4'.decode('utf8'),'\xCE\xB4'.decode('utf8')],'<U1').view(np.chararray)
>>> b
chararray([u'\u03b4', u'\u03b4'],
      dtype='<U1')
>>> print b[0]
δ
>>> print b.capitalize()[0]
Δ