Python Unicode handling wins again -- mostly
wxjmfauth at gmail.com
Tue Dec 3 13:34:59 EST 2013
On Tuesday, 3 December 2013 06:06:26 UTC+1, Steven D'Aprano wrote:
> On Mon, 02 Dec 2013 16:14:13 -0500, Ned Batchelder wrote:
>
> > On 12/2/13 3:38 PM, Ethan Furman wrote:
> >> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
> >>>
> >>> Out of the nine tests, Python 3.3 passes six, with three tests being
> >>> failures or dubious. If you believe that the native string type should
> >>> operate on code-points, then you'll think that Python does the right
> >>> thing.
> >>
> >> I think Python is doing it correctly. If I want to operate on
> >> "clusters" I'll normalize the string first.
> >>
> >> Thanks for this excellent post.
> >>
> >> --
> >> ~Ethan~
> >
> > This is where my knowledge about Unicode gets fuzzy. Isn't it the case
> > that some grapheme clusters (or whatever the right word is) can't be
> > normalized down to a single code point? Characters can accept many
> > accents, for example. In that case, you can't always normalize and use
> > the existing string methods, but would need more specialized code.
>
> That is correct.
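
A quick way to see this for yourself, as a minimal sketch using only the
standard unicodedata module. The two sample clusters are my own picks for
illustration, not taken from the tests discussed earlier in the thread:

```python
import unicodedata

# 'e' + COMBINING ACUTE ACCENT has a precomposed form (U+00E9),
# so NFC collapses the pair into a single code point.
composed = unicodedata.normalize('NFC', 'e\u0301')

# 'q' + COMBINING TILDE has no precomposed form, so NFC leaves
# the cluster as two code points.
uncomposed = unicodedata.normalize('NFC', 'q\u0303')

print(len(composed), len(uncomposed))  # 1 2
```

So len(), slicing and friends still see two items for the second cluster,
even after normalizing.
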
>
> If Unicode had a distinct code point for every possible combination of
> base-character plus an arbitrary number of diacritics or accents, the
> 0x10FFFF code points wouldn't be anywhere near enough.
>
> I see over 300 diacritics used just in the first 5000 code points. Let's
> pretend that's only 100, and that you can use up to a maximum of 5 at a
> time. That gives 79375496 combinations per base character, much larger
> than the total number of Unicode code points.
>
> If anyone wishes to check my logic:
>
> # count distinct combining chars
> import unicodedata
> s = ''.join(chr(i) for i in range(33, 5000))
> s = unicodedata.normalize('NFD', s)
> t = [c for c in s if unicodedata.combining(c)]
> len(set(t))
>
> # calculate the number of combinations
> def comb(r, n):
>     """Combinations nCr"""
>     p = 1
>     for i in range(r+1, n+1):
>         p *= i
>     for i in range(1, n-r+1):
>         p //= i  # integer division keeps the result an exact int
>     return p
>
> sum(comb(i, 100) for i in range(6))
>
> I'm not suggesting that all of those accents are necessarily in use in
> the real world, but there are languages which construct arbitrary
> combinations of accents. (Or so I have been led to believe.)
>
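
The arithmetic in the quoted text can be cross-checked with math.comb, as a
rough sketch. This assumes Python 3.8 or later (math.comb does not exist in
3.3), and it counts the same thing the quoted code does: unordered picks of
0 to 5 distinct marks out of 100, with the order of the marks ignored.

```python
import math

# Number of ways to choose 0, 1, ..., 5 distinct marks out of 100.
total = sum(math.comb(100, k) for k in range(6))

# Size of the Unicode code space, U+0000 through U+10FFFF.
code_points = 0x10FFFF + 1

print(total)        # 79375496
print(code_points)  # 1114112
```

That is roughly 71 times the whole code space, for a single base character.
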
From one of my libs, BMP only:
>>> import fourbiunicode5
>>> print(len(fourbiunicode5.AllCombiningMarks))
240
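
For anyone without that lib, a stdlib-only approximation, as a sketch. It
is an assumption that either definition below matches what
AllCombiningMarks contains, and the counts depend on the Unicode version
bundled with the interpreter, so they need not come out as 240.

```python
import unicodedata

# Every BMP code point, U+0000 through U+FFFF.
bmp = [chr(i) for i in range(0x10000)]

# Definition 1: non-zero canonical combining class.
nonzero_class = [c for c in bmp if unicodedata.combining(c)]

# Definition 2: general category Mn, Mc or Me ("Mark").
mark_category = [c for c in bmp if unicodedata.category(c).startswith('M')]

print(unicodedata.unidata_version, len(nonzero_class), len(mark_category))
```

The two definitions disagree: U+0903 DEVANAGARI SIGN VISARGA, for example,
is a spacing mark (Mc) with combining class 0.
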
jmf