[SciPy-dev] Some Q's vis-a-vis Numpy unicode support

Wed Aug 12 01:45:17 EDT 2009

Actually, since you seem so into it ;-) can you write me a little script (just 'cause it seems like you could do it faster) to print all the unicode characters u such that u == u.capitalize()?

DG
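
A minimal sketch of such a script, assuming Python 3 (where every str is unicode) rather than the Python 2.5 used elsewhere in this thread. The function name unchanged_by_capitalize and its start/stop parameters are made up for illustration. It yields code points instead of printing the characters themselves, since many of them will not display in a DOS terminal.

```python
import sys


def unchanged_by_capitalize(start=0, stop=sys.maxunicode + 1):
    """Yield each code point cp in [start, stop) whose character
    satisfies chr(cp) == chr(cp).capitalize()."""
    for cp in range(start, stop):
        ch = chr(cp)
        if ch == ch.capitalize():
            yield cp


if __name__ == "__main__":
    # Keep the demo short: Basic Latin letters/punctuation and the Greek block.
    for lo, hi in [(0x41, 0x7B), (0x391, 0x3CA)]:
        hits = list(unchanged_by_capitalize(lo, hi))
        print("U+%04X..U+%04X: %d code points unchanged" % (lo, hi - 1, len(hits)))
```

Calling list(unchanged_by_capitalize()) with no arguments walks the whole code point range. The result depends on the Unicode database shipped with the interpreter, so it need not match what Python 2.5 would report.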

--- On Tue, 8/11/09, josef.pktd at gmail.com <josef.pktd at gmail.com> wrote:

> From: josef.pktd at gmail.com <josef.pktd at gmail.com>
> Subject: Re: [SciPy-dev] Some Q's vis-a-vis Numpy unicode support
> To: "SciPy Developers List" <scipy-dev at scipy.org>
> Date: Tuesday, August 11, 2009, 8:59 PM
> On Tue, Aug 11, 2009 at 11:40 PM, <josef.pktd at gmail.com> wrote:
> > On Tue, Aug 11, 2009 at 11:18 PM, <josef.pktd at gmail.com> wrote:
> >> On Tue, Aug 11, 2009 at 10:28 PM, David Goldsmith<d_l_goldsmith at yahoo.com> wrote:
> >>> Thanks, Josef.  This may just be an artifact of working in a DOS
> >>> Terminal (but your example, though not printing the accented e, did at
> >>> least print something different for b vs. b.capitalize()), or it may be
> >>> because I don't know the right encoding to use, but I tried your code
> >>> w/ what I found on Wikipedia to be the unicode for the Greek letter
> >>> delta, namely, u'\x03b04', with both 'cp1252' and 'iso8859-7' encoding
> >>> (the latter being inferred from the same Wikipedia article) and here's
> >>> what I get:
> >>>
> >>> >>> b = np.array([u'\x03b04',u'\x03b04'],'<U1').view(np.chararray)
> >>> >>> print b.encode('cp1252')[0]
> >>> ♥
> >>> >>> print b.capitalize().encode('cp1252')[0]
> >>> ♥
> >>> >>> print b.encode('iso8859-7')[0]
> >>> ♥
> >>> >>> print b.capitalize().encode('iso8859-7')[0]
> >>> ♥
> >>>
> >>> i.e., no difference.  If I'm doing something wrong, please let me
> >>> know; otherwise, for the purpose of documenting chararray.capitalize()
> >>> - which is my ultimate goal - is there any rhyme or reason behind which
> >>> unicode characters capitalize() works on and which it doesn't?
> >>>
> >>> Thanks,
> >>>
> >>> DG
> >>> --- On Tue, 8/11/09, josef.pktd at gmail.com <josef.pktd at gmail.com> wrote:
> >>>
> >>>> actually this works (in Idle)
> >>>>
> >>>> >>> b = np.array([u'\xe9',u'\xe9'],'<U1').view(np.chararray)
> >>>> >>> print b.encode('cp1252')[0]
> >>>> é
> >>>> >>> print b.capitalize().encode('cp1252')[0]
> >>>> É
> >>>> >>> print b[0].encode('cp1252')
> >>>> é
> >>>>
> >>>>
> >>>> this looks like a bug? or is it a known limitation that
> >>>> chararrays cannot be 0-d
> >>>>
> >>>> >>> b0 = np.array(u'\xe9','<U1').view(np.chararray)
> >>>> >>> print b0.encode('cp1252')
> >>>> Traceback (most recent call last):
> >>>>   File "<pyshell#47>", line 1, in <module>
> >>>>     print b0.encode('cp1252')
> >>>>   File "C:\Programs\Python25\Lib\site-packages\numpy\core\defchararray.py", line 217, in encode
> >>>>     return self._generalmethod('encode', broadcast(self, encoding, errors))
> >>>>   File "C:\Programs\Python25\Lib\site-packages\numpy\core\defchararray.py", line 162, in _generalmethod
> >>>>     newarr[:] = res
> >>>> ValueError: cannot slice a 0-d array
> >>>>
> >>>>
> >>>> >
> >>>> > Josef
> >>>> >
> >>>> >>>
> >>>> >>> Unless the answer is "No," my real question:
> >>>> >>>
> >>>> >>> 1) Does chararray.capitalize() capitalize non-Roman letters
> >>>> >>> that have different lower-case and upper-case forms (e.g.,
> >>>> >>> the Greek letters)?  If "yes," are there any exceptions
> >>>> >>> (e.g., Russian letters)?
> >>
> >> I think yes; the exceptions are languages for which no capital
> >> letters exist, e.g. Cantonese (Chinese)?
> >> http://www.isthisthingon.org/unicode/index.phtml?page=03&subpage=B&glyph=03B04
> >>  ??? google search for 03B04,
> >>
> >>>> >>>
> >>>> >>> Thanks!
> >>>> >>>
> >>>> >>> DG
> >>>> >>>
> >>>> >>>
> >>
> >> I have problems finding the correct codes for the characters and
> >> usually need a word processor.
> >>
> >> To me it looks like your character is not a Greek delta
> >>
> >> >>> print u'\x03b04'
> >>  b04
> >> >>> print u'\u03b04'
> >> ΰ4
> >> >>> print u'\u03b4'
> >> δ
> >>
> >> I don't know what it is since it doesn't render to anything meaningful
> >>
> >> I managed to get the Greek delta through the HTML code for it δ from page:
> >> http://www.isthisthingon.org/unicode/index.phtml?page=00&subpage=3&hilite=003B4
> >>
> >>
> >> running this script:
> >>
> >>
> >> # -*- coding: utf-8 -*-
> >> import numpy as np
> >>
> >> sd = u'δ'
> >> print sd
> >>
> >> b = np.array([u'\u03b4',u'\u0394'],'<U1').view(np.chararray)
> >> print b[0]
> >> print repr(b[0])
> >> print b.capitalize()[0]
> >> print repr(b.capitalize()[0])
> >>
> >> ***********
> >> prints this in my Idle shell
> >> >>>
> >> δ
> >> δ
> >> u'\u03b4'
> >> Δ
> >> u'\u0394'
> >>
> >> delta is correctly capitalized
> >>
> >>
> >> Josef
> >>
> >
> >
> > trying without copying and pasting non-ASCII characters
> > the page at
> > http://www.isthisthingon.org/unicode/index.phtml?page=00&subpage=3&glyph=003B4
> >
> > also has the utf8 code \xCE\xB4,  everything looks ok starting from this.
> >
> > Josef
> >
> > >>> '\xCE\xB4'.decode('utf8')
> > u'\u03b4'
> > >>> print '\xCE\xB4'.decode('utf8')
> > δ
> > >>> print '\xCE\xB4'.decode('utf8').capitalize()
> > Δ
> > >>> b = np.array(['\xCE\xB4'.decode('utf8'),'\xCE\xB4'.decode('utf8')],'<U1').view(np.chararray)
> > >>> b
> > chararray([u'\u03b4', u'\u03b4'],
> >      dtype='<U1')
> > >>> print b[0]
> > δ
> > >>> print b.capitalize()[0]
> > Δ
> >
> 
> and for the fun of it,
> a Russian (Cyrillic) character that capitalizes
> 
> >>> print '\xD0\xB9'.decode('utf8')
> й
> >>> print '\xD0\xB9'.decode('utf8').capitalize()
> Й
> >>> '\xD0\xB9'.decode('utf8')
> u'\u0439'
> >>> '\xD0\xB9'.decode('utf8').capitalize()
> u'\u0419'
> 
> 
> and a German letter that doesn't have a capitalized version
> 
> >>> print '\xC3\x9F'.decode('utf8').capitalize()
> ß
> >>> print '\xC3\x9F'.decode('utf8')
> ß
> >>> '\xC3\x9F'.decode('utf8')
> u'\xdf'
> >>> '\xC3\x9F'.decode('utf8').capitalize()
> u'\xdf'
> 
> and here's a nice picture of unicode 03B04
> http://www.cns11643.gov.tw/seeker/english/showfont.jsp?ucode=03B04
> 
> and here are all unicode characters (although my browser doesn't
> display most of them)
> http://www.isthisthingon.org/unicode/allchars1.php
> 
> 
> I hope this helps,
> 
> Josef
> _______________________________________________
> Scipy-dev mailing list
> Scipy-dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
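
The three spellings tried in the quoted thread differ only in how Python parses the escape sequence, which a short sketch can make explicit. This assumes Python 3 string literals (no u prefix needed) and uses only the standard unicodedata module; the variable names are illustrative and not from the thread.

```python
import unicodedata

s_x = "\x03b04"   # \x consumes exactly two hex digits: chr(0x03) followed by "b04"
s_u = "\u03b04"   # \u consumes exactly four hex digits: chr(0x03B0) followed by "4"
delta = "\u03b4"  # the single character that was intended

print(len(s_x), len(s_u), len(delta))        # lengths 4, 2 and 1
print(unicodedata.name(s_u[0]))              # the letter in front of the "4"
print(unicodedata.name(delta))               # lower-case form
print(unicodedata.name(delta.capitalize()))  # upper-case form
print(unicodedata.name(chr(0x3B04)))         # code point U+3B04, the "03B04" looked up above
```

Since dtype '<U1' keeps only the first character of each string, u'\x03b04' ends up stored as the control character U+0003, which would explain the ♥ shown by a DOS (code page 437) console and why capitalize() appears to do nothing to it.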