[Python-ideas] Extend unicodedata with a name search
Steven D'Aprano
steve at pearwood.info
Sat Oct 4 08:29:24 CEST 2014
On Sat, Oct 04, 2014 at 03:50:33PM +1000, Chris Angelico wrote:
> On Sat, Oct 4, 2014 at 1:17 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> > - startswith, endswith, contains: probably sufficient, but I suppose
> > one would like at least conjunction and disjunction operations:
> > unicodematch.contains('GREEK', 'SMALL', 'ALPHA', op='and')
> > unicodematch.startswith('PIECE OF', 'PILE OF', op='or')
> > (OK, that's pretty horrible, but it gives an idea.)
>
> There's an easier way, though it would take a bit of setup work. Start
> by building up an actual list in RAM of [unicodedata.name(chr(i)) for
> i in range(sys.maxunicode+1)] and then do regular string operations.
> I'm fairly sure most Python programmers can figure out how to search a
> list of strings according to whatever rules they like - maybe using
> contains/startswith/endswith, or maybe regexps, or whatever.
py> x = [unicodedata.name(chr(i)) for i in range(sys.maxunicode+1)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
ValueError: no such name
There are 1114112 such code points, and most of them are unused.
Some of the used ones don't have names:
py> unicodedata.name('\0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
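
As a rough sketch of one way around both complications: unicodedata.name()
accepts an optional second argument that is returned instead of raising
ValueError, so unnamed code points can simply be skipped. The helper name
named_code_points is made up for illustration and is not part of any
existing module.

```python
import sys
import unicodedata

def named_code_points():
    """Yield (code point, name) pairs, skipping code points with no name.

    Hypothetical helper, not part of unicodedata.
    """
    for i in range(sys.maxunicode + 1):
        # The second argument is a default returned instead of raising
        # ValueError when the character has no name.
        name = unicodedata.name(chr(i), None)
        if name is not None:
            yield i, name

# Build the table once; unnamed and unassigned code points are left out.
names = dict(named_code_points())
```

The resulting dict is much smaller than the full range of code points, but
it is still a copy of data the interpreter already holds.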
But even once you deal with those complications, you'll end up
duplicating information which (I presume) Python already has, and still
end up needing to do a linear search in slow Python code looking for
what you want. I think there are probably better solutions. Or at least,
I hope there are better solutions :-)
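
To make the cost concrete, here is a minimal sketch of the linear search
described above, with the "and"/"or" combination from the quoted proposal.
The function name search_names and its match_all parameter are assumptions
made for illustration; nothing like this exists in unicodedata.

```python
import sys
import unicodedata

def search_names(*words, match_all=True):
    """Return {name: character} for names containing the given words.

    Hypothetical helper: match_all=True requires every word to appear
    in the name (conjunction), match_all=False requires at least one
    (disjunction).
    """
    combine = all if match_all else any
    words = [w.upper() for w in words]
    found = {}
    # Linear scan over every code point, in pure Python.
    for i in range(sys.maxunicode + 1):
        # '' is returned for code points that have no name.
        name = unicodedata.name(chr(i), '')
        if name and combine(w in name for w in words):
            found[name] = chr(i)
    return found

greek = search_names('GREEK', 'SMALL', 'ALPHA')
piles = search_names('PIECE OF', 'PILE OF', match_all=False)
```

Every call walks all 1114112 code points again, which is the inefficiency
being complained about. When the exact name is already known,
unicodedata.lookup() does the job without any scan.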
--
Steven