flaming vs accuracy [was Re: Performance of int/long in Python 3]

Benjamin Kaplan benjamin.kaplan at case.edu
Thu Mar 28 16:29:23 EDT 2013


On Thu, Mar 28, 2013 at 10:48 AM, jmfauth <wxjmfauth at gmail.com> wrote:
> On 28 mar, 17:33, Ian Kelly <ian.g.ke... at gmail.com> wrote:
>> On Thu, Mar 28, 2013 at 7:34 AM, jmfauth <wxjmfa... at gmail.com> wrote:
>> > The flexible string representation takes the problem from the
>> > other side, it attempts to work with the characters by using
>> > their representations and it (can only) fails...
>>
>> This is false.  As I've pointed out to you before, the FSR does not
>> divide characters up by representation.  It divides them up by
>> codepoint -- more specifically, by the *bit-width* of the codepoint.
>> We call the internal format of the string "ASCII" or "Latin-1" or
>> "UCS-2" for conciseness and a point of reference, but fundamentally
>> all of the FSR formats are simply byte arrays of *codepoints* -- you
>> know, those things you keep harping on.  The major optimization
>> performed by the FSR is to consistently truncate the leading zero
>> bytes from each codepoint when it is possible to do so safely.  But
>> regardless of to what extent this truncation is applied, the string is
>> *always* internally just an array of codepoints, and the same
>> algorithms apply for all representations.
>
> -----
>
> You know, we can discuss this ad nauseam. What is important
> is Unicode.
>
> You have transformed Python back into an ASCII-oriented product.
>
> If Python had implemented Unicode correctly, there would
> be no difference in using an "a", "é", "€" or any character,
> which is what the narrow builds did.
>
> If I am practically the only one who speaks about / discusses
> this, I can assure you that it has been noticed.
>
> Now, it's time to prepare the asparagus, the "jambon cru"
> and a good bottle of dry white wine.
>
> jmf
>
>
You have yet to explain how Python's string representation is
wrong, only how it isn't optimal for one specific case. Here's how I
understand it:

1) Strings are sequences of stuff. Generally, we talk about strings as
either sequences of bytes or sequences of characters.

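To make that concrete, here's a quick Python 3 session (a minimal
sketch; the byte values shown assume UTF-8):

    >>> s = "café"               # a sequence of characters
    >>> b = s.encode("utf-8")    # a sequence of bytes
    >>> len(s), len(b)           # 4 characters, 5 bytes ('é' takes two)
    (4, 5)
    >>> s[3]                     # indexing a str gives a character
    'é'
    >>> b[4]                     # indexing bytes gives an integer
    169
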
2) Unicode is a standard that assigns a numeric code point to each
character. Therefore, Unicode strings are character strings, not byte
strings.

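For instance, in Python 3 every element of a str is one Unicode code
point, whatever the character:

    >>> ord("a"), ord("é"), ord("€")    # code points, not bytes
    (97, 233, 8364)
    >>> chr(0x1F435)                    # and any code point is a character
    '🐵'
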
3) Encodings are functions that map characters to bytes. They
typically also define an inverse function that converts the bytes
back to characters.

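In Python terms, str.encode is such a function and bytes.decode is
its inverse:

    >>> "é".encode("utf-8")             # characters -> bytes
    b'\xc3\xa9'
    >>> b'\xc3\xa9'.decode("utf-8")     # bytes -> characters
    'é'
    >>> "é".encode("latin-1")           # different encoding, different bytes
    b'\xe9'
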
4) UTF-8 IS NOT UNICODE. It is an encoding, one of those functions I
mentioned in the previous point. It happens to be one of the five
standard encodings defined for all characters in the Unicode standard
(the others being the little- and big-endian variants of UTF-16 and
UTF-32).

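You can see the distinction by running one character string through
several of those encodings; the string is the same, the bytes are
not:

    >>> "é".encode("utf-8")
    b'\xc3\xa9'
    >>> "é".encode("utf-16-le")
    b'\xe9\x00'
    >>> "é".encode("utf-32-le")
    b'\xe9\x00\x00\x00'
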
5) The internal representation of a character string DOES NOT MATTER.
All that matters is that the API presents it as a string of
characters, regardless of the representation. We could implement
character strings by putting the Unicode code points in binary-coded
decimal and it would still be a Unicode character string.

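As a throwaway sketch (the class and its storage choice are made up
purely for illustration), here is a "character string" that stores
plain ints internally yet still presents characters at the API level:

    >>> class CodepointString:
    ...     """Toy character string; internal format is a list of ints."""
    ...     def __init__(self, text):
    ...         self._cps = [ord(c) for c in text]
    ...     def __len__(self):
    ...         return len(self._cps)        # length in characters
    ...     def __getitem__(self, i):
    ...         return chr(self._cps[i])     # indexing yields a character
    ...
    >>> s = CodepointString("a\u20ac\U0001F435")
    >>> len(s)                               # 3, whatever the storage
    3
    >>> s[1]
    '€'
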
6) The String type that .NET and Java (and the unicode type in
Python narrow builds) use is not a character string. It is a string
of shorts, each of which is a UTF-16 code unit rather than a
character. I know this is the case because in all of these, the
length of "\U0001F435" (U+1F435, a single character outside the
Basic Multilingual Plane) is 2 even though it is only one character.

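If you have a narrow build handy, you can watch the surrogate pair
leak through (a Python 2.7 narrow-build session, where sys.maxunicode
is 65535):

    >>> import sys; sys.maxunicode
    65535
    >>> s = u'\U0001f435'
    >>> len(s)               # one character, two UTF-16 code units
    2
    >>> s[0], s[1]           # the surrogate pair is exposed
    (u'\ud83d', u'\udc35')
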
7) The new string representation in Python 3.3 can successfully
represent all characters in the Unicode standard. The actual number
of bytes that each character consumes is invisible to the user.

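For comparison, the same string under 3.3 (exact sizes vary by
platform, so the sketch below only checks the relative ordering):

    >>> import sys
    >>> len('\U0001f435')    # one character, even outside the BMP
    1
    >>> # Storage is picked per string: roughly 1, 2 or 4 bytes per
    >>> # character, depending on the widest code point present.
    >>> (sys.getsizeof('aaaa') < sys.getsizeof('aaa\u20ac')
    ...         < sys.getsizeof('aaa\U0001f435'))
    True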

