flaming vs accuracy [was Re: Performance of int/long in Python 3]

MRAB python at mrabarnett.plus.com
Thu Mar 28 17:50:14 EDT 2013


On 28/03/2013 21:11, jmfauth wrote:
> On 28 mar, 21:29, Benjamin Kaplan <benjamin.kap... at case.edu> wrote:
>> On Thu, Mar 28, 2013 at 10:48 AM, jmfauth <wxjmfa... at gmail.com> wrote:
>> > On 28 mar, 17:33, Ian Kelly <ian.g.ke... at gmail.com> wrote:
>> >> On Thu, Mar 28, 2013 at 7:34 AM, jmfauth <wxjmfa... at gmail.com> wrote:
>> >> > The flexible string representation takes the problem from the
>> >> > other side: it attempts to work with characters through their
>> >> > representations, and it can only fail...
>>
>> >> This is false.  As I've pointed out to you before, the FSR does not
>> >> divide characters up by representation.  It divides them up by
>> >> codepoint -- more specifically, by the *bit-width* of the codepoint.
>> >> We call the internal format of the string "ASCII" or "Latin-1" or
>> >> "UCS-2" for conciseness and a point of reference, but fundamentally
>> >> all of the FSR formats are simply byte arrays of *codepoints* -- you
>> >> know, those things you keep harping on.  The major optimization
>> >> performed by the FSR is to consistently truncate the leading zero
>> >> bytes from each codepoint when it is possible to do so safely.  But
>> >> regardless of to what extent this truncation is applied, the string is
>> >> *always* internally just an array of codepoints, and the same
>> >> algorithms apply for all representations.
>>
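(A quick way to see this, assuming CPython 3.3 or later: the absolute
sizes reported by sys.getsizeof include per-object overhead and vary
between builds, but the per-codepoint deltas below are exactly the
1, 1, 2 and 4 bytes being described.)

>>> import sys
>>> sys.getsizeof('a' * 1001) - sys.getsizeof('a')            # ASCII
1000
>>> sys.getsizeof('é' * 1001) - sys.getsizeof('é')            # Latin-1
1000
>>> sys.getsizeof('€' * 1001) - sys.getsizeof('€')            # UCS-2
2000
>>> sys.getsizeof('\U0001F435' * 1001) - sys.getsizeof('\U0001F435')  # UCS-4
4000
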
>> > -----
>>
>> > You know, we can discuss this ad nauseam. What is important
>> > is Unicode.
>>
>> > You have transformed Python back into an ASCII-oriented product.
>>
>> > If Python had implemented Unicode correctly, there would
>> > be no difference between using an "a", an "é", a "€" or any
>> > other character, which is what the narrow builds did.
>>
>> > If I am practically the only one who speaks about / discusses
>> > this, I can assure you, it has been noticed.
>>
>> > Now, it's time to prepare the asparagus, the "jambon cru"
>> > and a good bottle of dry white wine.
>>
>> > jmf
>>
>> You have yet to explain how Python's string representation is
>> wrong, only how it isn't optimal for one specific case. Here's how I
>> understand it:
>>
>> 1) Strings are sequences of stuff. Generally, we talk about strings as
>> either sequences of bytes or sequences of characters.
>>
>> 2) Unicode is a standard for representing characters. Therefore,
>> Unicode strings are character strings, not byte strings.
>>
>> 3) Encodings are functions that map characters to bytes. They
>> typically also define an inverse function that converts from bytes
>> back to characters.
>>
>> 4) UTF-8 IS NOT UNICODE. It is an encoding, one of those functions I
>> mentioned in the previous point. It happens to be one of the five
>> standard encodings defined for all characters in the Unicode
>> standard (the others being the little- and big-endian variants of
>> UTF-16 and UTF-32).
>>
>> 5) The internal representation of a character string DOES NOT MATTER.
>> All that matters is that the API presents it as a string of
>> characters, regardless of the representation. We could implement
>> character strings by putting the Unicode codepoints in binary-coded
>> decimal and it would still be a Unicode character string.
>>
>> 6) The String type that .NET and Java (and the unicode type in Python
>> narrow builds) use is not a character string. It is a string of
>> shorts, each of which corresponds to a UTF-16 code unit. I know this
>> is the case because in all of these, the length of the single
>> character U+1F435 is reported as 2.
>>
>> 7) The new string representation in Python 3.3 can successfully
>> represent all characters in the Unicode standard. The actual number of
>> bytes that each character consumes is invisible to the user.
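
(To make point 3 concrete, in any Python 3: str.encode maps characters
to bytes under a named encoding, and bytes.decode is its inverse.)

>>> 'café'.encode('utf-8')
b'caf\xc3\xa9'
>>> b'caf\xc3\xa9'.decode('utf-8')
'café'
>>> 'café'.encode('utf-16-le')
b'c\x00a\x00f\x00\xe9\x00'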
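
(And points 6 and 7 side by side, using the single character U+1F435,
MONKEY FACE. On CPython 3.3:

>>> len('\U0001F435')
1

while a 2.7 narrow build stores, and counts, the two surrogates:

>>> len(u'\U0001F435')
2
)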
>
> ----------
>
>
> I have shown enough examples. As soon as you are using non-Latin-1
> chars, your "optimization" just becomes irrelevant, and not only
> that: you are penalized.
>
> I'm sorry, but saying that Python now covers the whole Unicode
> range is not a valid excuse. I would prefer a "correct" version with
> a narrower range of chars, especially if this range represents
> the "daily used chars".
>
> I can go a step further: if I wish to write an application for
> Western European users, I'm better served by using a coding
> scheme covering all these languages/scripts. What about cp1252 [*]?
> Does this not remind you of something?
>
> Python could do better; it only succeeds in doing worse!
>
> [*] yes, I know, internally ....
>
If you're that concerned about it, why don't you modify the source code so
that the string representation chooses between only 2 bytes and 4 bytes per
codepoint, and then see whether you prefer that situation? How do
the memory usage and speed compare?
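
If anyone actually wants the numbers, here is a minimal sketch of that
measurement (Python 3.3 or later; the sizes and timings it prints are
entirely machine- and build-dependent, so run it yourself):

# Compare memory use and replace() speed for one string per FSR width.
import sys
import timeit

for ch in ('a', '\xe9', '\u20ac', '\U0001F435'):  # ASCII, Latin-1, UCS-2, UCS-4
    s = ('abc' + ch) * 10000
    size = sys.getsizeof(s)           # bytes, including object overhead
    t = timeit.timeit(lambda: s.replace('b', 'B'), number=1000)
    print('U+%05X  %7d bytes  %.3f s' % (ord(ch), size, t))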


