'Straße' ('Strasse') and Python 2

Terry Reedy tjreedy at udel.edu
Wed Jan 15 19:27:35 EST 2014


On 1/15/2014 11:55 AM, Robin Becker wrote:

> The fact that unicoders want to take over the meaning of encoding is not
> relevant.

I agree with you that 'encoding' should not be limited to 'byte encoding 
of a (subset of) unicode characters'. For instance, .jpg and .png are 
byte encodings of images. On the other hand, it is common in human 
discourse to omit qualifiers in particular contexts. 'Computer virus' 
gets condensed to 'virus' in computer contexts.

The problem with graphemes is that there is no fixed set of unicode 
graphemes. Which is to say, the effective set of graphemes is 
context-specific. Just limiting ourselves to English, 'fi' is usually 2 
graphemes when printing to screen, but often just one when printing to 
paper. This is why the Unicode consortium punted 'graphemes' to 
'application' code.
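
To make the grapheme point concrete, here is a minimal sketch, assuming 
Python 3.3+ and only the standard unicodedata module. The variable names 
are mine, chosen for illustration. It shows that what a reader counts as 
'one character' need not match the number of code points, and that the 
answer depends on which normalization the application picks.

```python
import unicodedata

# 'fi' as a single ligature code point versus two ordinary letters.
ligature = "\ufb01"   # U+FB01 LATIN SMALL LIGATURE FI
plain = "fi"          # U+0066 U+0069

print(len(ligature), len(plain))

# Compatibility normalization (NFKC) folds the ligature into two letters.
print(unicodedata.normalize("NFKC", ligature) == plain)

# An accented letter can be one code point or two, yet it is drawn as
# one grapheme either way.
decomposed = "e\u0301"                               # 'e' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # U+00E9

print(len(decomposed), len(composed))
```

Whether such sequences count as one unit or two is a decision the 
application has to make; str itself only counts code points.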

> I'm not anti unicode, that's just an assignment of identity to some
> symbols. Coding the values of the ids is a separate issue. It's my
> belief that we don't need more than the byte level encoding to represent
> unicode. One of the claims made for python3 unicode is that it somehow
> eliminates the problems associated with other encodings eg utf8,

The claim is true for the following problems of the way-too-numerous 
unicode byte encodings.

Subsetting: only a subset of characters can be encoded.

Shifting: the meaning of a byte depends on a preceding shift character, 
which might be back at the beginning of the sequence.

Varying size: the number of bytes to encode a character depends on the 
character.

Both of the last two problems can turn O(1) operations, such as indexing 
the n-th character, into O(n) operations. Python 3.3+ eliminates all 
these problems.

-- 
Terry Jan Reedy



