[SciPy-dev] Some Q's vis-a-vis Numpy unicode support

josef.pktd at gmail.com josef.pktd at gmail.com
Wed Aug 12 09:54:09 EDT 2009


On Wed, Aug 12, 2009 at 1:45 AM, David Goldsmith<d_l_goldsmith at yahoo.com> wrote:
> Actually, since you seem so into it ;-)
This was just a refresher; I struggled much more the first time I
tried to use non-English filenames and files.

> can you write me a little script (just 'cause it seems like you could do it faster) to print all the unicode characters u such that u == u.capitalize()?

u == u.capitalize() holds for most of them.
The webpage lists 89,674 unicode characters, and I didn't want to try
all of them.

The script below prints the unicode characters among the first 1000 code
points for which u != u.capitalize()

josef

-----------------------------

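# Python 2 script
import numpy as np

print unichr(30)

maxcode = 1000  # I don't want to try 38000
start = 0       # 38000  is boring
# one-character unicode array, viewed as a chararray to get the string methods
umany = np.array([unichr(i) for i in xrange(start, start + maxcode)],
                 '<U1').view(np.chararray)

# True wherever capitalize() changes the character
capmask = (umany != umany.capitalize())
umanycap = umany[capmask]


print umany
print capmask
print '%2.2f percent differ in capitalize' % (
    np.sum(capmask) / float(len(umany)) * 100)

for i in xrange(len(umanycap)):
    try:
        print umanycap[i],
    except UnicodeEncodeError:
        # the console encoding cannot represent this character
        print "\n%r doesn't print" % umanycap[i]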
--------------
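
For anyone rerunning this on Python 3, here is a minimal sketch of the same scan using plain str instead of a chararray. numpy documents chararray.capitalize() as calling str.capitalize element-wise, so the per-character answer should match, but that is an assumption on my part and not something tested here. The function name changed_by_capitalize and its parameters are made up for illustration. The counts depend on the Unicode tables shipped with your Python, so none are quoted.

```python
def changed_by_capitalize(start=0, count=1000):
    """Return the characters in [start, start + count) that capitalize() changes."""
    changed = []
    for code in range(start, start + count):
        ch = chr(code)
        if ch != ch.capitalize():
            changed.append(ch)
    return changed


chars = changed_by_capitalize(0, 1000)
print("%.2f percent differ in capitalize" % (100.0 * len(chars) / 1000))
# Print code points rather than glyphs so the output survives any console.
print(" ".join("U+%04X" % ord(c) for c in chars))
```

Printing the code points sidesteps the "doesn't print" problem from the script above, since the output is pure ASCII.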

>
> DG
>
> --- On Tue, 8/11/09, josef.pktd at gmail.com <josef.pktd at gmail.com> wrote:
>
>> From: josef.pktd at gmail.com <josef.pktd at gmail.com>
>> Subject: Re: [SciPy-dev] Some Q's vis-a-vis Numpy unicode support
>> To: "SciPy Developers List" <scipy-dev at scipy.org>
>> Date: Tuesday, August 11, 2009, 8:59 PM
>> On Tue, Aug 11, 2009 at 11:40 PM, <josef.pktd at gmail.com> wrote:
>> > On Tue, Aug 11, 2009 at 11:18 PM, <josef.pktd at gmail.com> wrote:
>> >> On Tue, Aug 11, 2009 at 10:28 PM, David Goldsmith<d_l_goldsmith at yahoo.com> wrote:
>> >>> Thanks, Josef.  This may just be an artifact of working in a DOS
>> >>> Terminal (but your example, though not printing the accented e, did
>> >>> at least print something different for b vs. b.capitalize()), or it
>> >>> may be because I don't know the right encoding to use, but I tried
>> >>> your code w/ what I found on Wikipedia to be the unicode for the
>> >>> Greek letter delta, namely, u'\x03b04', with both 'cp1252' and
>> >>> 'iso8859-7' encoding (the latter being inferred from the same
>> >>> Wikipedia article) and here's what I get:
>> >>>
>> >>>>>> b = np.array([u'\x03b04',u'\x03b04'],'<U1').view(np.chararray)
>> >>>>>> print b.encode('cp1252')[0]
>> >>> ♥
>> >>>>>> print b.capitalize().encode('cp1252')[0]
>> >>> ♥
>> >>>>>> print b.encode('iso8859-7')[0]
>> >>> ♥
>> >>>>>> print b.capitalize().encode('iso8859-7')[0]
>> >>> ♥
>> >>>
>> >>> i.e., no difference.  If I'm doing something wrong, please let me
>> >>> know; otherwise, for the purpose of documenting
>> >>> chararray.capitalize() - which is my ultimate goal - is there any
>> >>> rhyme or reason behind which unicode characters capitalize() works
>> >>> on and which it doesn't?
>> >>>
>> >>> Thanks,
>> >>>
>> >>> DG
>> >>> --- On Tue, 8/11/09, josef.pktd at gmail.com <josef.pktd at gmail.com> wrote:
>> >>>
>> >>>> actually this works (in Idle)
>> >>>>
>> >>>> >>> b = np.array([u'\xe9',u'\xe9'],'<U1').view(np.chararray)
>> >>>> >>> print b.encode('cp1252')[0]
>> >>>> é
>> >>>> >>> print b.capitalize().encode('cp1252')[0]
>> >>>> É
>> >>>> >>> print b[0].encode('cp1252')
>> >>>> é
>> >>>>
>> >>>>
>> >>>> this looks like a bug? or is it a known limitation that chararrays
>> >>>> cannot be 0-d
>> >>>>
>> >>>> >>> b0 = np.array(u'\xe9','<U1').view(np.chararray)
>> >>>> >>> print b0.encode('cp1252')
>> >>>> Traceback (most recent call last):
>> >>>>   File "<pyshell#47>", line 1, in <module>
>> >>>>     print b0.encode('cp1252')
>> >>>>   File "C:\Programs\Python25\Lib\site-packages\numpy\core\defchararray.py", line 217, in encode
>> >>>>     return self._generalmethod('encode', broadcast(self, encoding, errors))
>> >>>>   File "C:\Programs\Python25\Lib\site-packages\numpy\core\defchararray.py", line 162, in _generalmethod
>> >>>>     newarr[:] = res
>> >>>> ValueError: cannot slice a 0-d array
>> >>>>
>> >>>>
>> >>>> >
>> >>>> > Josef
>> >>>> >
>> >>>> >>>
>> >>>> >>> Unless the answer is "No," my real question:
>> >>>> >>>
>> >>>> >>> 1) Does chararray.capitalize() capitalize non-Roman letters
>> >>>> >>> that have different lower-case and upper-case forms (e.g.,
>> >>>> >>> the Greek letters)?  If "yes," are there any exceptions
>> >>>> >>> (e.g., Russian letters)?
>> >>
>> >> I think yes; the exceptions are languages for which no capital
>> >> letters exist, e.g. Cantonese (Chinese)?
>> >> http://www.isthisthingon.org/unicode/index.phtml?page=03&subpage=B&glyph=03B04
>> >>  ??? google search for 03B04,
>> >>
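
A quick way to check the "no capital letters" case: a character with no case mapping comes back from capitalize() unchanged. The sketch below is Python 3 with plain str (an assumption; the thread itself uses Python 2 chararrays), and the helper name is_cased is made up for illustration. It reads 03B04 as the four-digit code point U+3B04, which is a CJK ideograph, and compares it with the Greek delta.

```python
import unicodedata


def is_cased(ch):
    """True if the character changes under upper() or lower()."""
    return ch != ch.upper() or ch != ch.lower()


for ch in ("\u3b04", "\u4e2d", "\u03b4"):
    print("U+%04X" % ord(ch),
          unicodedata.category(ch),
          "cased" if is_cased(ch) else "uncased",
          "capitalize() leaves it unchanged:", ch.capitalize() == ch)
```

The general category is a useful hint here: ideographs are reported as "Lo" (letter, other), while letters that have case are "Ll" or "Lu".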
>> >>>> >>>
>> >>>> >>> Thanks!
>> >>>> >>>
>> >>>> >>> DG
>> >>>> >>>
>> >>>> >>>
>> >>
>> >> I have problems finding the correct codes for the characters and
>> >> usually need a word processor.
>> >>
>> >> To me it looks like your character is not a Greek delta
>> >>
>> >>>>> print u'\x03b04'
>> >>  b04
>> >>>>> print u'\u03b04'
>> >> ΰ4
>> >>>>> print u'\u03b4'
>> >> δ
>> >>
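
The three literals above differ only in how the escape is parsed: \x consumes exactly two hex digits and \u exactly four, so any extra digits are left over as ordinary characters. A small Python 3 sketch of that (plain str literals follow the same escape rules as the u'...' literals here; the helper name describe is made up for illustration):

```python
import unicodedata


def describe(s):
    """List (code point, Unicode name) for each character in s."""
    return [("U+%04X" % ord(c), unicodedata.name(c, "<unnamed>")) for c in s]


# Same digits, three different parses.
for literal in ("\x03b04", "\u03b04", "\u03b4"):
    print(len(literal), describe(literal))
```

Only the last literal is a single character, and only that one is the Greek small letter delta.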
>> >> I don't know what it is since it doesn't render to anything meaningful
>> >>
>> >> I managed to get the Greek delta through the html code for it δ from page:
>> >> http://www.isthisthingon.org/unicode/index.phtml?page=00&subpage=3&hilite=003B4
>> >>
>> >>
>> >> running this script:
>> >>
>> >>
>> >> # -*- coding: utf-8 -*-
>> >> import numpy as np
>> >>
>> >> sd = u'δ'
>> >> print sd
>> >>
>> >> b = np.array([u'\u03b4',u'\u0394'],'<U1').view(np.chararray)
>> >> print b[0]
>> >> print repr(b[0])
>> >> print b.capitalize()[0]
>> >> print repr(b.capitalize()[0])
>> >>
>> >> ***********
>> >> prints this in my Idle shell
>> >>>>>
>> >> δ
>> >> δ
>> >> u'\u03b4'
>> >> Δ
>> >> u'\u0394'
>> >>
>> >> delta is correctly capitalized
>> >>
>> >>
>> >> Josef
>> >>
>> >
>> >
>> > trying without copying and pasting non-ASCII characters:
>> > the page at
>> > http://www.isthisthingon.org/unicode/index.phtml?page=00&subpage=3&glyph=003B4
>> >
>> > also has the utf8 code \xCE\xB4; everything looks ok starting from this.
>> >
>> > Josef
>> >
>> >>>> '\xCE\xB4'.decode('utf8')
>> > u'\u03b4'
>> >>>> print '\xCE\xB4'.decode('utf8')
>> > δ
>> >>>> print '\xCE\xB4'.decode('utf8').capitalize()
>> > Δ
>> >>>> b = np.array(['\xCE\xB4'.decode('utf8'),'\xCE\xB4'.decode('utf8')],'<U1').view(np.chararray)
>> >>>> b
>> > chararray([u'\u03b4', u'\u03b4'],
>> >      dtype='<U1')
>> >>>> print b[0]
>> > δ
>> >>>> print b.capitalize()[0]
>> > Δ
>> >
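
In Python 3 the same round trip starts from a bytes literal, since str no longer has decode(). A minimal sketch, again with plain str rather than a chararray (an assumption made to keep it free of dependencies):

```python
# UTF-8 bytes for the Greek small letter delta, as listed on the page above.
delta = b"\xCE\xB4".decode("utf8")

# ascii() shows the escaped form, which is safe on any console encoding.
print(ascii(delta))
print(ascii(delta.capitalize()))

# Encoding the capitalized character gives its own UTF-8 byte sequence.
print(delta.capitalize().encode("utf8"))
```

Starting from the byte sequence avoids pasting non-ASCII characters into the source file, which is the same reason for doing it in the session above.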
>>
>> and for the fun of it,
>> a Russian (Cyrillic) character that capitalizes
>>
>> >>> print '\xD0\xB9'.decode('utf8')
>> й
>> >>> print '\xD0\xB9'.decode('utf8').capitalize()
>> Й
>> >>> '\xD0\xB9'.decode('utf8')
>> u'\u0439'
>> >>> '\xD0\xB9'.decode('utf8').capitalize()
>> u'\u0419'
>>
>>
>> and a German letter that doesn't have a capitalized version
>>
>> >>> print '\xC3\x9F'.decode('utf8').capitalize()
>> ß
>> >>> print '\xC3\x9F'.decode('utf8')
>> ß
>> >>> '\xC3\x9F'.decode('utf8')
>> u'\xdf'
>> >>> '\xC3\x9F'.decode('utf8').capitalize()
>> u'\xdf'
>>
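
The session above is from Python 2.5. Python 3 applies Unicode's full case mappings, under which this letter expands to two letters, so the same check can give a different answer there. A hedged Python 3 sketch with plain str (the exact capitalize() result depends on the Python version, so it is printed rather than stated):

```python
# UTF-8 bytes for U+00DF, LATIN SMALL LETTER SHARP S.
eszett = b"\xC3\x9F".decode("utf8")

print(ascii(eszett.upper()))       # full case mapping expands it to two letters
print(ascii(eszett.capitalize()))  # version dependent, so just print it

# Unicode also defines a capital sharp s, U+1E9E. It lowercases to U+00DF,
# but upper() on U+00DF does not produce it.
capital_eszett = "\u1e9e"
print(capital_eszett.lower() == eszett)
print(eszett.upper() == capital_eszett)
```

So whether a character "has a capitalized version" depends on which case mapping the interpreter applies, not only on the character itself.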
>> and here's a nice picture of unicode 03B04
>> http://www.cns11643.gov.tw/seeker/english/showfont.jsp?ucode=03B04
>>
>> and here are all unicode characters (although my browser doesn't
>> display most of them)
>> http://www.isthisthingon.org/unicode/allchars1.php
>>
>>
>> I hope this helps,
>>
>> Josef
>> _______________________________________________
>> Scipy-dev mailing list
>> Scipy-dev at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>
>
>
>
>


