[Python-ideas] Extend unicodedata with a name search
Stephen J. Turnbull
stephen at xemacs.org
Sat Oct 4 05:17:58 CEST 2014
M.-A. Lemburg writes:
> On 03.10.2014 23:10, Philipp A. wrote:
> > Unfortunately, unicodedata is very limited.
Philipp, do you really mean *very* limited? If so, I wonder what else
you think is missing besides "fuzzy" name lookup. The UCD is defined
by the standard, and AFAICS access to all properties is provided.
> > But the name database is only queryable using full names! I want
> > to do unicodedata.search('clock') and get a list of dozens of glyphs
> You should be able to code this as a PyPI package. I don't think
> it's a use case that warrants making the unicodedata module more
> complex.
I think it's unfortunate that unicodedata is limited in this
particular way, since the database is in C, and as you point out
hardly extensible. For example, as a native English speaker who
enjoys wordplay I was able to guess which euphemism is the source of
the name of U+1F4A9 without looking it up, but I doubt a non-native
speaker would be able to. A built-in ability to do fuzzy searches
("unicodenames.startswith('PILE OF')") would be useful.
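For what it's worth, a prefix lookup like the one above can be
approximated today in pure Python by walking every code point and asking
unicodedata.name() for its name. This is only a rough sketch: the helper
names iter_names and names_startingwith are made up for illustration and
are not part of unicodedata, and the brute-force scan over the whole code
space is slow compared with what a C-level index could offer.

```python
import sys
import unicodedata


def iter_names():
    """Yield (character, name) for every code point that has a name.

    Hypothetical helper, not part of the unicodedata module.
    """
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # The second argument is returned instead of raising ValueError
        # for unassigned code points, surrogates, controls, and so on.
        name = unicodedata.name(ch, None)
        if name is not None:
            yield ch, name


def names_startingwith(prefix):
    """Return (character, name) pairs whose name starts with prefix."""
    prefix = prefix.upper()  # character names are upper case
    return [(ch, name) for ch, name in iter_names() if name.startswith(prefix)]


if __name__ == "__main__":
    for ch, name in names_startingwith("PILE OF"):
        print("U+%04X %s" % (ord(ch), name))
```

The result depends on the Unicode version bundled with the running
Python, so the list of matches may differ between releases.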
OTOH, a little thought convinced me that I don't know the TOOWTDI for
fuzzy search here:
 - regexp: database will be a huge string or similar
 - startswith, endswith, contains: probably sufficient, but I suppose
   one would like at least conjunction and disjunction operations:
       unicodematch.contains('GREEK', 'SMALL', 'ALPHA', op='and')
       unicodematch.startswith('PIECE OF', 'PILE OF', op='or')
   (OK, that's pretty horrible, but it gives an idea.)
 - something else?
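
To make the conjunction/disjunction idea concrete, here is one way the
op='and' / op='or' calls could behave, written as a plain function on top
of unicodedata.name(). Everything here is an assumption for the sake of
illustration: the function name search_names, the "where" keyword, and
the choice to scan all code points on every call are mine, not an
existing or proposed API.

```python
import sys
import unicodedata

# How a single word is tested against a character name.
_TESTS = {
    "contains": lambda name, word: word in name,
    "startswith": lambda name, word: name.startswith(word),
    "endswith": lambda name, word: name.endswith(word),
}

# How the per-word results are combined.
_COMBINE = {"and": all, "or": any}


def search_names(*words, op="and", where="contains"):
    """Return (character, name) pairs whose name matches the words.

    Hypothetical sketch: op selects conjunction or disjunction, where
    selects the kind of match applied to each word.
    """
    combine = _COMBINE[op]
    test = _TESTS[where]
    words = [w.upper() for w in words]  # character names are upper case
    hits = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, None)  # None for unnamed code points
        if name is not None and combine(test(name, w) for w in words):
            hits.append((ch, name))
    return hits


if __name__ == "__main__":
    for ch, name in search_names("GREEK", "SMALL", "ALPHA", op="and"):
        print("U+%04X %s" % (ord(ch), name))
    for ch, name in search_names("PIECE OF", "PILE OF", op="or",
                                 where="startswith"):
        print("U+%04X %s" % (ord(ch), name))
```

A real implementation would presumably build the name list once and
reuse it, rather than rescanning the code space for each query.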