[Python-ideas] Extend unicodedata with a name search

Sat Oct 4 08:29:24 CEST 2014

On Sat, Oct 04, 2014 at 03:50:33PM +1000, Chris Angelico wrote:
> On Sat, Oct 4, 2014 at 1:17 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> >   - startswith, endswith, contains: probably sufficient, but I suppose
> >     one would like at least conjunction and disjunction operations:
> >     unicodematch.contains('GREEK', 'SMALL', 'ALPHA', op='and')
> >     unicodematch.startswith('PIECE OF', 'PILE OF', op='or')
> >     (OK, that's pretty horrible, but it gives an idea.)
> 
> There's an easier way, though it would take a bit of setup work. Start
> by building up an actual list in RAM of [unicodedata.name(chr(i)) for
> i in range(sys.maxunicode+1)] and then do regular string operations.
> I'm fairly sure most Python programmers can figure out how to search a
> list of strings according to whatever rules they like - maybe using
> contains/startswith/endswith, or maybe regexps, or whatever.

py> x = [unicodedata.name(chr(i)) for i in range(sys.maxunicode+1)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
ValueError: no such name

That range covers 1114112 code points, and most of them are unassigned.
Even some of the assigned ones don't have names:

py> unicodedata.name('\0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
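
One way around the exception, as a rough sketch: unicodedata.name() takes
an optional second argument that is returned in place of raising
ValueError. The variable name `named` below is only illustrative.

```python
import sys
import unicodedata

# Map code point -> name, skipping every code point for which
# unicodedata.name() has no name (unassigned, controls, surrogates, ...).
named = {}
for i in range(sys.maxunicode + 1):
    name = unicodedata.name(chr(i), None)  # None instead of ValueError
    if name is not None:
        named[i] = name
```

The resulting dict is much smaller than the full range, but it is still a
second copy of data the unicodedata module already holds.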

But even once you deal with those complications, you'll end up 
duplicating information which (I presume) Python already has, and still 
end up needing to do a linear search in slow Python code looking for 
what you want. I think there are probably better solutions. Or at least, 
I hope there are better solutions :-)
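
For comparison, here is roughly what the pure-Python linear search looks
like if you skip the intermediate list and scan on demand. This is only a
sketch: search_names() and its `op` parameter are hypothetical, not an
existing or proposed unicodedata API.

```python
import sys
import unicodedata

def search_names(*words, op=all):
    """Yield (char, name) pairs whose name contains the given words.

    op=all requires every word to appear (conjunction);
    op=any accepts a name containing at least one (disjunction).
    """
    for i in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(i), None)  # None for unnamed code points
        if name is not None and op(w in name for w in words):
            yield chr(i), name

# Conjunction: names containing GREEK and SMALL and ALPHA.
matches = list(search_names('GREEK', 'SMALL', 'ALPHA'))
```

It avoids holding a duplicate table in memory, but every call still walks
all 1114112 code points in Python code, which is the cost I'm complaining
about above.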

-- 
Steven