[Python-ideas] Extend unicodedata with a name search

Stephen J. Turnbull stephen at xemacs.org
Sat Oct 4 08:47:57 CEST 2014


Chris Angelico writes:

 > Start by building up an actual list in RAM of
 > [unicodedata.name(chr(i)) for i in range(sys.maxunicode+1)] and
 > then do regular string operations.  I'm fairly sure most Python
 > programmers can figure out how to search a list of strings
 > according to whatever rules they like - maybe using
 > contains/startswith/endswith, or maybe regexps, or whatever.

OK.  Times are quite imprecise, but after importing re, sys, and unicodedata:

>>> names = [unicodedata.name(chr(i)) for i in range(sys.maxunicode+1)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
ValueError: no such name

oops, although you didn't actually claim that would work. :-)  (BTW,
chr(0) has no name.  At least it was instantaneous. :-)  Then

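>>> names = []    # the failed comprehension above never bound `names`
>>> for i in range(sys.maxunicode+1):
...     try:
...         names.append(unicodedata.name(chr(i)))
...     except ValueError:
...         pass    # unnamed code point (e.g. chr(0)); skip it
... 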

takes between 1 and 2 seconds, while

>>> names.index("PILE OF POO")
61721
>>> "PILE OF POO" in names
True

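is instantaneous.  Note: 61721 is *much* smaller than 0x1F4A9 (128169),
because unnamed code points were skipped when building the list, so the
list index is not the code point.  And now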

>>> pops = [name for name in names if re.match("^P\\S* O.* P", name)]
>>> pops
['PILE OF POO']

takes just noticeable time (250ms, maybe?).  This is on a 4-year-old
2.7GHz i7 MacBook Pro running "Mavericks".
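
A minimal sketch, not part of the session above, of one way to keep the
list index equal to the code point.  It relies on the optional default
argument of unicodedata.name, which returns the default instead of
raising ValueError for unnamed code points.  The helper search_names is
a hypothetical name introduced here for illustration.

```python
import re
import sys
import unicodedata

# Passing a default ("") avoids the ValueError for unnamed code points,
# so names[i] is always the name of chr(i) (or "" if it has none).
names = [unicodedata.name(chr(i), "") for i in range(sys.maxunicode + 1)]

def search_names(pattern):
    """Hypothetical helper: (code point, name) pairs whose name matches."""
    rx = re.compile(pattern)
    return [(i, name) for i, name in enumerate(names)
            if name and rx.match(name)]

hits = search_names(r"P\S* O.* P")
for cp, name in hits:
    print("U+%04X %s" % (cp, name))
```

With this layout names.index("PILE OF POO") is the code point itself, at
the cost of storing an empty string for every unnamed code point.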

Plenty good for my use cases.
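
The timings quoted above are eyeballed.  A rough sketch of how they
might be measured, assuming time.perf_counter is precise enough for the
purpose; the helpers timed and build_names are hypothetical names, and
the numbers will differ by machine and Python version.

```python
import re
import sys
import time
import unicodedata

def timed(func):
    """Hypothetical helper: return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start

def build_names():
    """Collect the names of all named code points, skipping unnamed ones."""
    names = []
    for i in range(sys.maxunicode + 1):
        try:
            names.append(unicodedata.name(chr(i)))
        except ValueError:
            pass  # unnamed code point, skip it
    return names

names, build_secs = timed(build_names)
pops, search_secs = timed(
    lambda: [n for n in names if re.match(r"^P\S* O.* P", n)])
print("build: %.3fs  search: %.3fs" % (build_secs, search_secs))
```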


