Python Unicode handling wins again -- mostly

Mon Dec 2 17:56:57 EST 2013

Ned Batchelder <ned at nedbatchelder.com> writes:

> This is where my knowledge about Unicode gets fuzzy.  Isn't it the
> case that some grapheme clusters (or whatever the right word is) can't
> be normalized down to a single code point?  Characters can accept many
> accents, for example.

That's true, but it doesn't affect the point being made: that one can
have a “sequence of Unicode code points” in Python's ‘unicode’ (now
‘str’) type, and also deal with the “sequence of text the reader will
see”.
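
To make the distinction concrete, here is a minimal sketch (Python 3,
standard library only). The variable names are mine, purely for
illustration. It shows one cluster that NFC composes down to a single
code point, and one that stays as two code points because Unicode has
no precomposed form for it:

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT; a precomposed form (U+00E9) exists.
e_acute = "e\u0301"
# "x" followed by COMBINING CIRCUMFLEX ACCENT; no precomposed form exists.
x_circumflex = "x\u0302"

composed_e = unicodedata.normalize("NFC", e_acute)
composed_x = unicodedata.normalize("NFC", x_circumflex)

# Code-point counts before and after normalisation.
print(len(e_acute), len(composed_e))
print(len(x_circumflex), len(composed_x))
```

Either way the ‘str’ value is still a plain sequence of code points; the
reader just sees one character in both cases.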

> In that case, you can't always normalize and use the existing string
> methods, but would need more specialized code.

Specialised code may not be needed. It will at least be true that “any
two code-point sequences which normalise to the same value will be
visually the same for the reader”, which is an important assertion for
addressing the complaints from Mortoray's article.
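
A rough sketch of that assertion, again Python 3 with only the standard
library. The helper name `same_text` is hypothetical, not anything from
the article or from Python itself, and NFC is just one reasonable choice
of normalisation form:

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after normalising both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Precomposed versus decomposed spelling of the same word.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# Two accents on one base letter, typed in different orders:
# COMBINING DOT ABOVE (U+0307) and COMBINING DOT BELOW (U+0323).
above_then_below = "q\u0307\u0323"
below_then_above = "q\u0323\u0307"

print(precomposed == decomposed)                      # raw code points differ
print(same_text(precomposed, decomposed))             # equal once normalised
print(above_then_below == below_then_above)           # raw code points differ
print(same_text(above_then_below, below_then_above))  # equal once normalised
```

The second pair never collapses to a single code point, yet the ordinary
‘==’ comparison still works once both sides are normalised.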

-- 
 \       “Pray, v. To ask that the laws of the universe be annulled in |
  `\     behalf of a single petitioner confessedly unworthy.” —Ambrose |
_o__)                           Bierce, _The Devil's Dictionary_, 1906 |
Ben Finney