Extend unicodedata with a name/pattern/regex search for character entity references?

Chris Angelico rosuav at gmail.com
Sun Sep 4 19:52:28 EDT 2016


On Mon, Sep 5, 2016 at 9:40 AM, Ned Batchelder <ned at nedbatchelder.com> wrote:
> But, 'CAP' appears in 'CAPITAL', which gives more than 1800 matches:
>
>     >>> for c in range(32, 0x110000):
>     ...   try:
>     ...     name = unicodedata.name(chr(c))
>     ...   except ValueError:
>     ...     continue
>     ...   if 'CAP' in name:
>     ...     print(c, name)
>     ...
>     65 LATIN CAPITAL LETTER A
>     66 LATIN CAPITAL LETTER B
>     ..
>     .. many other lines, mostly with CAPITAL in them ..
>     ..
>     917593 TAG LATIN CAPITAL LETTER Y
>     917594 TAG LATIN CAPITAL LETTER Z
>     >>>

FWIW, hex is much more common for displaying Unicode codepoints than
decimal is. So I'd print it like this (incorporating the 'not CAPITAL'
filter):

>>> import unicodedata
>>> for c in range(32, 0x110000):
...     try:
...         name = unicodedata.name(chr(c))
...     except ValueError:
...         continue
...     if 'CAP' in name and 'CAPITAL' not in name:
...         print("U+%04X %s" % (c, name))
...
U+20E3 COMBINING ENCLOSING KEYCAP
U+2293 SQUARE CAP
U+2410 SYMBOL FOR DATA LINK ESCAPE
U+241B SYMBOL FOR ESCAPE
U+2651 CAPRICORN
U+2E3F CAPITULUM
U+A2B9 YI SYLLABLE CAP
U+CC42 HANGUL SYLLABLE CAP
U+101D3 PHAISTOS DISC SIGN CAPTIVE
U+1D10A MUSICAL SYMBOL DA CAPO
U+1F306 CITYSCAPE AT DUSK
U+1F393 GRADUATION CAP
U+1F3D4 SNOW CAPPED MOUNTAIN
U+1F3D9 CITYSCAPE
U+1F51F KEYCAP TEN
U+1F74E ALCHEMICAL SYMBOL FOR CAPUT MORTUUM
>>>

This takes advantage of %04X setting a minimum width of four hex digits,
not a maximum, so codepoints above U+FFFF simply print with five or six :)
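
If you wanted this as something reusable rather than a one-off loop, here's
a rough sketch. To be clear, format_codepoint and find_names are names I've
made up for illustration; nothing like them exists in unicodedata today, and
a real proposal might well look different (regex support, for instance).

```python
import sys
import unicodedata


def format_codepoint(cp):
    """Return the conventional U+XXXX label for an integer codepoint."""
    # %04X pads to at least four hex digits but never truncates longer values.
    return "U+%04X" % cp


def find_names(include, exclude=None):
    """Yield (codepoint, name) pairs for named characters.

    Hypothetical helper: keeps names containing `include` and, if `exclude`
    is given, drops names that also contain `exclude`.
    """
    for cp in range(32, sys.maxunicode + 1):
        try:
            name = unicodedata.name(chr(cp))
        except ValueError:
            # Codepoints without a name (unassigned, controls, surrogates).
            continue
        if include in name and (exclude is None or exclude not in name):
            yield cp, name


if __name__ == "__main__":
    for cp, name in find_names("CAP", exclude="CAPITAL"):
        print(format_codepoint(cp), name)
```

The exact list you get back depends on the Unicode database version your
Python ships with, so newer interpreters may show more matches than the
session above.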

ChrisA


