[Python-ideas] Extend unicodedata with a name search

Stephen J. Turnbull stephen at xemacs.org
Sat Oct 4 05:17:58 CEST 2014


M.-A. Lemburg writes:
 > On 03.10.2014 23:10, Philipp A. wrote:

 > > Unfortunately, unicodedata is very limited.

Philipp, do you really mean *very* limited?  If so, I wonder what else
you think is missing besides "fuzzy" name lookup.  The UCD is defined
by the standard, and AFAICS access to all properties is provided.

 > > But the name database is only queryable using full names! I want
 > > to do unicodedata.search('clock') and get a list of dozens of glyphs
 
 > You should be able to code this as a PyPI package. I don't think
 > it's a use case that warrants making the unicodedata module more
 > complex.

I think it's unfortunate that unicodedata is limited in this
particular way, since the database is in C and, as you point out,
hardly extensible.  For example, as a native English speaker who
enjoys wordplay I was able to guess which euphemism is the source of
the name of U+1F4A9 without looking it up, but I doubt a non-native
speaker would be able to.  A builtin ability to do fuzzy searches
("unicodenames.startswith('PILE OF')") would be useful.
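
For illustration only, here is a minimal pure-Python sketch of such a
prefix lookup, built on the existing unicodedata.name().  The helper
name names_startswith is hypothetical and not part of any module; the
sketch simply scans every code point, so it is slow compared to what a
C-level index could do.

```python
import sys
import unicodedata


def names_startswith(prefix):
    """Yield (character, name) pairs whose Unicode name starts with prefix.

    Hypothetical helper for illustration; not part of unicodedata.
    """
    prefix = prefix.upper()  # character names in the UCD are upper case
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # The second argument is returned for code points without a name.
        name = unicodedata.name(ch, None)
        if name is not None and name.startswith(prefix):
            yield ch, name


for ch, name in names_startswith('PILE OF'):
    print('U+%04X %s' % (ord(ch), name))
```

A brute-force scan like this is the sort of thing a PyPI package
could do today; the cost is a full pass over the code space per query.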

OTOH, a little thought convinced me that I don't know the TOOWTDI for
fuzzy search here:

  - regexp: database will be a huge string or similar

  - startswith, endswith, contains: probably sufficient, but I suppose
    one would like at least conjunction and disjunction operations:
    unicodematch.contains('GREEK', 'SMALL', 'ALPHA', op='and')
    unicodematch.startswith('PIECE OF', 'PILE OF', op='or')
    (OK, that's pretty horrible, but it gives an idea.)

  - something else?
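
To make the options above concrete, here is a rough sketch, assuming
a name table built once from unicodedata.name().  The names NAMES,
search, and search_re and the how/op keywords are all hypothetical
and chosen only for this example; none of this is an existing API.

```python
import re
import sys
import unicodedata

# Build the code point -> name table once; the full scan is the slow part.
NAMES = {}
for cp in range(sys.maxunicode + 1):
    name = unicodedata.name(chr(cp), None)
    if name is not None:
        NAMES[cp] = name


def search(*terms, how='contains', op='and'):
    """Return characters whose names match the terms (hypothetical helper).

    how: 'contains', 'startswith' or 'endswith'
    op:  'and' (every term must match) or 'or' (any term may match)
    """
    tests = {
        'contains': lambda name, term: term in name,
        'startswith': lambda name, term: name.startswith(term),
        'endswith': lambda name, term: name.endswith(term),
    }
    if how not in tests:
        raise ValueError('unknown how: %r' % (how,))
    if op not in ('and', 'or'):
        raise ValueError('unknown op: %r' % (op,))
    test = tests[how]
    combine = all if op == 'and' else any
    terms = [term.upper() for term in terms]
    return [chr(cp) for cp, name in NAMES.items()
            if combine(test(name, term) for term in terms)]


def search_re(pattern):
    """Regexp variant: match each name separately, not one huge string."""
    rx = re.compile(pattern)
    return [chr(cp) for cp, name in NAMES.items() if rx.search(name)]


print(search('GREEK', 'SMALL', 'ALPHA', op='and'))
print(search('PIECE OF', 'PILE OF', how='startswith', op='or'))
print(search_re(r'^PILE OF'))
```

Applying the regexp to one name at a time avoids building a single
huge string, at the price of a Python-level loop over the table.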



