Python Unicode handling wins again -- mostly

wxjmfauth at gmail.com
Tue Dec 3 13:34:59 EST 2013


On Tuesday, December 3, 2013 06:06:26 UTC+1, Steven D'Aprano wrote:
> On Mon, 02 Dec 2013 16:14:13 -0500, Ned Batchelder wrote:
> 
> > On 12/2/13 3:38 PM, Ethan Furman wrote:
> >
> >> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
> >>>
> >>> Out of the nine tests, Python 3.3 passes six, with three tests being
> >>> failures or dubious. If you believe that the native string type should
> >>> operate on code-points, then you'll think that Python does the right
> >>> thing.
> >>
> >> I think Python is doing it correctly.  If I want to operate on
> >> "clusters" I'll normalize the string first.
> >>
> >> Thanks for this excellent post.
> >>
> >> --
> >> ~Ethan~
> >
> > This is where my knowledge about Unicode gets fuzzy.  Isn't it the case
> > that some grapheme clusters (or whatever the right word is) can't be
> > normalized down to a single code point?  Characters can accept many
> > accents, for example.  In that case, you can't always normalize and use
> > the existing string methods, but would need more specialized code.
> 
> That is correct.
> 
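
[Editor's note: a concrete stdlib-only illustration of Ned's point. 'e' plus a combining acute has a precomposed code point, so NFC collapses it; 'x' plus the same accent has no precomposed form, so the grapheme cluster stays two code points long no matter how you normalize.]

```python
import unicodedata

# 'e' + U+0301 has a precomposed form (U+00E9), so NFC collapses it.
e_acute = unicodedata.normalize('NFC', 'e\u0301')
print(len(e_acute))  # 1

# 'x' + U+0301 has no precomposed code point, so the one-grapheme
# cluster remains two code points even after NFC normalization.
x_acute = unicodedata.normalize('NFC', 'x\u0301')
print(len(x_acute))  # 2
```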
> If Unicode had a distinct code point for every possible combination of 
> base character plus an arbitrary number of diacritics or accents, the 
> 0x10FFFF code points wouldn't be anywhere near enough.
> 
> I see over 300 diacritics used just in the first 5000 code points. Let's 
> pretend that's only 100, and that you can use up to a maximum of 5 at a 
> time. That gives 79375496 combinations per base character, much larger 
> than the total number of Unicode code points.
> 
> If anyone wishes to check my logic:
> 
> # count distinct combining chars
> import unicodedata
> s = ''.join(chr(i) for i in range(33, 5000))
> s = unicodedata.normalize('NFD', s)
> t = [c for c in s if unicodedata.combining(c)]
> len(set(t))
> 
> # calculate the number of combinations
> def comb(r, n):
>     """Combinations nCr"""
>     p = 1
>     for i in range(r+1, n+1):
>         p *= i
>     for i in range(1, n-r+1):
>         p //= i  # floor division keeps the result an exact int
>     return p
> 
> sum(comb(i, 100) for i in range(6))
> 
> I'm not suggesting that all of those accents are necessarily in use in 
> the real world, but there are languages which construct arbitrary 
> combinations of accents. (Or so I have been led to believe.)
> 
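
[Editor's note: the arithmetic checks out. On modern Python (3.8+, so newer than this 2013 thread) the same sum can be computed directly with the stdlib's math.comb.]

```python
import math

# Ways to attach 0..5 diacritics drawn from a pool of 100:
# C(100,0) + C(100,1) + ... + C(100,5)
total = sum(math.comb(100, k) for k in range(6))
print(total)  # 79375496, matching Steven's figure
```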

From one of my libraries, BMP only:

>>> import fourbiunicode5
>>> print(len(fourbiunicode5.AllCombiningMarks))
240
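
[Editor's note: fourbiunicode5 is a private library, so the 240 above can't be reproduced directly. A rough stdlib sketch, counting assigned BMP code points with a nonzero canonical combining class (which may not match jmf's definition of a combining mark), looks like this; the exact count depends on the Unicode version Python ships with.]

```python
import unicodedata

# Count BMP code points whose canonical combining class is nonzero,
# mirroring the unicodedata.combining() test Steven used above.
count = sum(1 for i in range(0x10000) if unicodedata.combining(chr(i)))
print(count)
```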


jmf



