Extend unicodedata with a name/pattern/regex search for character entity references?

Ned Batchelder ned at nedbatchelder.com
Sun Sep 4 19:40:40 EDT 2016


On Saturday, September 3, 2016 at 7:55:48 AM UTC-4, Veek. M wrote:
> https://mail.python.org/pipermail//python-ideas/2014-October/029630.htm
> 
> Wanted to know if the idea in the above link had been implemented, and if 
> there's a module that accepts a pattern like 'cap' and gives you all the 
> instances of Unicode 'CAP' characters.
>  ⋂ \bigcap
>  ⊓ \sqcap
>  ∩ \cap
>  ♑ \capricornus
>  ⪸ \succapprox
>  ⪷ \precapprox
> 
> (the above is from TeX)
> 
> I found two useful modules in this regard: unicode_tex, unicodedata
> but unicodedata is a builtin which does not do globs or regexes - so it's 
> kind of limiting in nature.
> 
> Would be nice if you could search html/xml character entity references 
> as well.

The unicodedata module has all the information you need for searching
Unicode character names.  While it doesn't provide regex or glob matching,
the data is all in memory, so iterating over every character and filtering
by name yourself is quick enough.

Note, though, that 'CAP' also appears in 'CAPITAL', so a plain substring
search gives more than 1800 matches:

    >>> import unicodedata
    >>> for c in range(32, 0x110000):
    ...     try:
    ...         name = unicodedata.name(chr(c))
    ...     except ValueError:
    ...         # code points without a name raise ValueError; skip them
    ...         continue
    ...     if 'CAP' in name:
    ...         print(c, name)
    ...
    65 LATIN CAPITAL LETTER A
    66 LATIN CAPITAL LETTER B
    ..
    .. many other lines, mostly with CAPITAL in them ..
    ..
    917593 TAG LATIN CAPITAL LETTER Y
    917594 TAG LATIN CAPITAL LETTER Z
    >>>

Filtering out the names that contain "CAPITAL" leaves these:

    8419 COMBINING ENCLOSING KEYCAP
    8851 SQUARE CAP
    9232 SYMBOL FOR DATA LINK ESCAPE
    9243 SYMBOL FOR ESCAPE
    9809 CAPRICORN
    11839 CAPITULUM
    41657 YI SYLLABLE CAP
    52290 HANGUL SYLLABLE CAP
    66003 PHAISTOS DISC SIGN CAPTIVE
    119050 MUSICAL SYMBOL DA CAPO
    127750 CITYSCAPE AT DUSK
    127891 GRADUATION CAP
    127956 SNOW CAPPED MOUNTAIN
    127961 CITYSCAPE
    128287 KEYCAP TEN
    128846 ALCHEMICAL SYMBOL FOR CAPUT MORTUUM
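
If you want regex matching rather than a substring test, the same loop can
be wrapped in a small helper.  This is only a sketch: search_names and its
exclude parameter are names I made up for illustration, not part of
unicodedata.  It relies on the optional default argument of
unicodedata.name() to skip unnamed code points instead of catching
ValueError.

```python
import re
import sys
import unicodedata


def search_names(pattern, exclude=None):
    """Yield (code point, name) for each character whose name matches pattern.

    search_names and exclude are hypothetical names for this sketch;
    unicodedata itself offers no search function.
    """
    want = re.compile(pattern)
    skip = re.compile(exclude) if exclude else None
    for cp in range(32, sys.maxunicode + 1):
        # The second argument is returned instead of raising ValueError
        # when the code point has no name.
        name = unicodedata.name(chr(cp), "")
        if not name or not want.search(name):
            continue
        if skip is not None and skip.search(name):
            continue
        yield cp, name


if __name__ == "__main__":
    # Whole-word match, so names that merely contain CAPITAL are not reported.
    for cp, name in search_names(r"\bCAP\b"):
        print(cp, name)
```

A word-boundary pattern like r"\bCAP\b" avoids the CAPITAL problem directly,
and search_names("CAP", exclude="CAPITAL") should reproduce the kind of
filtering I did by hand above.  The exact set of hits depends on the Unicode
version your Python was built with.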

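For the HTML/XML character entity references, the standard library's
html.entities module has an html5 dict that maps entity names to their
text, so the same approach should work there.  Again a sketch under
assumptions: search_entities is a made-up helper name, and I strip the
trailing ';' so that legacy names listed both with and without it are
reported once.

```python
import re
from html.entities import html5


def search_entities(pattern):
    """Return sorted (entity name, text) pairs whose name matches pattern.

    search_entities is a hypothetical helper for this sketch, not part of
    html.entities.
    """
    want = re.compile(pattern)
    found = set()
    for key, text in html5.items():
        # Most keys end in ';' ("cap;"); a few legacy ones also appear
        # without it ("amp"), so normalise before matching.
        name = key.rstrip(";")
        if want.search(name):
            found.add((name, text))
    return sorted(found)


if __name__ == "__main__":
    for name, text in search_entities("cap"):
        # ascii() keeps the output printable on any terminal encoding.
        print("&%s;" % name, ascii(text))
```

Entity names are case-sensitive, so pass re.IGNORECASE-style patterns such
as "(?i)cap" if you want both spellings.
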
--Ned.


