Extend unicodedata with a name/pattern/regex search for character entity references?
Chris Angelico
rosuav at gmail.com
Sun Sep 4 19:52:28 EDT 2016
On Mon, Sep 5, 2016 at 9:40 AM, Ned Batchelder <ned at nedbatchelder.com> wrote:
> But, 'CAP' appears in 'CAPITAL', which gives more than 1800 matches:
>
> >>> for c in range(32, 0x110000):
> ...     try:
> ...         name = unicodedata.name(chr(c))
> ...     except ValueError:
> ...         continue
> ...     if 'CAP' in name:
> ...         print(c, name)
> ...
> 65 LATIN CAPITAL LETTER A
> 66 LATIN CAPITAL LETTER B
> ..
> .. many other lines, mostly with CAPITAL in them ..
> ..
> 917593 TAG LATIN CAPITAL LETTER Y
> 917594 TAG LATIN CAPITAL LETTER Z
> >>>
FWIW, hex is much more common for displaying Unicode codepoints than
decimal is. So I'd print it like this (incorporating the 'not CAPITAL'
filter):
>>> import unicodedata
>>> for c in range(32, 0x110000):
...     try:
...         name = unicodedata.name(chr(c))
...     except ValueError:
...         continue  # no name for this codepoint
...     if 'CAP' in name and 'CAPITAL' not in name:
...         print("U+%04X %s" % (c, name))
...
U+20E3 COMBINING ENCLOSING KEYCAP
U+2293 SQUARE CAP
U+2410 SYMBOL FOR DATA LINK ESCAPE
U+241B SYMBOL FOR ESCAPE
U+2651 CAPRICORN
U+2E3F CAPITULUM
U+A2B9 YI SYLLABLE CAP
U+CC42 HANGUL SYLLABLE CAP
U+101D3 PHAISTOS DISC SIGN CAPTIVE
U+1D10A MUSICAL SYMBOL DA CAPO
U+1F306 CITYSCAPE AT DUSK
U+1F393 GRADUATION CAP
U+1F3D4 SNOW CAPPED MOUNTAIN
U+1F3D9 CITYSCAPE
U+1F51F KEYCAP TEN
U+1F74E ALCHEMICAL SYMBOL FOR CAPUT MORTUUM
>>>
Takes advantage of %04X giving a minimum, but not maximum, of four digits :)
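
For anyone who wants to reuse this outside the interactive prompt, here is a rough sketch of the same loop as a pair of functions. The names format_codepoint and find_names, and the exclude parameter, are made up for illustration; they are not part of unicodedata.

```python
import unicodedata

def format_codepoint(c):
    """Return the conventional U+XXXX label for an integer codepoint."""
    # %04X pads to at least four hex digits but never truncates longer values.
    return "U+%04X" % c

def find_names(substring, exclude=None):
    """Yield (label, name) for each named character whose name contains
    substring and, if exclude is given, does not contain exclude.
    (Hypothetical helper, not a unicodedata API.)"""
    for c in range(32, 0x110000):
        try:
            name = unicodedata.name(chr(c))
        except ValueError:
            continue  # this codepoint has no name
        if substring in name and (exclude is None or exclude not in name):
            yield format_codepoint(c), name

print(format_codepoint(0x41))     # U+0041
print(format_codepoint(0x1F393))  # U+1F393
```

Calling find_names('CAP', exclude='CAPITAL') should walk the same range as the loop above; the exact set of matches depends on the Unicode version your Python ships with.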
ChrisA