flaming vs accuracy [was Re: Performance of int/long in Python 3]

jmfauth wxjmfauth at gmail.com
Thu Mar 28 17:33:03 EDT 2013


On 28 mar, 22:11, jmfauth <wxjmfa... at gmail.com> wrote:
> On 28 mar, 21:29, Benjamin Kaplan <benjamin.kap... at case.edu> wrote:
>
> > On Thu, Mar 28, 2013 at 10:48 AM, jmfauth <wxjmfa... at gmail.com> wrote:
> > > On 28 mar, 17:33, Ian Kelly <ian.g.ke... at gmail.com> wrote:
> > >> On Thu, Mar 28, 2013 at 7:34 AM, jmfauth <wxjmfa... at gmail.com> wrote:
> > >> > The flexible string representation takes the problem from the
> > >> > other side, it attempts to work with the characters by using
> > >> > their representations and it (can only) fails...
>
> > >> This is false.  As I've pointed out to you before, the FSR does not
> > >> divide characters up by representation.  It divides them up by
> > >> codepoint -- more specifically, by the *bit-width* of the codepoint.
> > >> We call the internal format of the string "ASCII" or "Latin-1" or
> > >> "UCS-2" for conciseness and a point of reference, but fundamentally
> > >> all of the FSR formats are simply byte arrays of *codepoints* -- you
> > >> know, those things you keep harping on.  The major optimization
> > >> performed by the FSR is to consistently truncate the leading zero
> > >> bytes from each codepoint when it is possible to do so safely.  But
> > >> regardless of to what extent this truncation is applied, the string is
> > >> *always* internally just an array of codepoints, and the same
> > >> algorithms apply for all representations.
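> > >> A minimal sketch of this point (assuming CPython 3.3+; the exact
> > >> sizes printed vary by build, so treat them as illustrative):
> > >>
> > >>     import sys
> > >>
> > >>     # Same length, four internal widths: the FSR picks the
> > >>     # narrowest unit that holds the largest codepoint present.
> > >>     for s in ["abcd", "abc\xe9", "abc\u20ac", "abc\U0001f435"]:
> > >>         print(repr(s), len(s), sys.getsizeof(s))
> > >>
> > >>     # len() is 4 for all of them; getsizeof grows with the unit
> > >>     # width (1, 1, 2 and 4 bytes: ASCII, Latin-1, UCS-2, UCS-4).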
>
> > > -----
>
> > > You know, we can discuss this ad nauseam. What is important
> > > is Unicode.
>
> > > You have transformed Python back into an ASCII-oriented product.
>
> > > If Python had implemented Unicode correctly, there would
> > > be no difference in using an "a", "é", "€" or any character,
> > > which is what the narrow builds did.
>
> > > If I am practically the only one who speaks about / discusses
> > > this, I can assure you, this has been noticed.
>
> > > Now, it's time to prepare the asparagus, the "jambon cru"
> > > and a good bottle of dry white wine.
>
> > > jmf
>
> > You have yet to explain how Python's string representation is
> > wrong, just how it isn't optimal for one specific case. Here's how I
> > understand it:
>
> > 1) Strings are sequences of stuff. Generally, we talk about strings as
> > either sequences of bytes or sequences of characters.
>
> > 2) Unicode is a format used to represent characters. Therefore,
> > Unicode strings are character strings, not byte strings.
>
> > 3) Encodings are functions that map characters to bytes. They
> > typically also define an inverse function that converts from bytes
> > back to characters.
>
> > 4) UTF-8 IS NOT UNICODE. It is an encoding, one of those functions I
> > mentioned in the previous point. It happens to be one of the five
> > standard encodings that is defined for all characters in the Unicode
> > standard (the others being the little and big endian variants of
> > UTF-16 and UTF-32).
>
> > 5) The internal representation of a character string DOES NOT MATTER.
> > All that matters is that the API represents it as a string of
> > characters, regardless of the representation. We could implement
> > character strings by putting the Unicode code-points in binary-coded
> > decimal and it would be a Unicode character string.
>
> > 6) The String type that .NET and Java (and the unicode type in Python
> > narrow builds) use is not a character string. It is a string of
> > shorts, each of which corresponds to a UTF-16 code unit. I know this
> > is the case because in all of these, the length of "\U0001f435" is 2
> > even though it only consists of one character.
>
> > 7) The new string representation in Python 3.3 can successfully
> > represent all characters in the Unicode standard. The actual number of
> > bytes that each character consumes is invisible to the user.
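> > A short sketch of points 3, 4 and 6 (assuming CPython 3.3+; on a
> > narrow build the final print would show 2, not 1):
> >
> >     # Encoding maps characters to bytes; decoding is its inverse.
> >     s = "\U0001f435"            # one character outside the BMP
> >     b = s.encode("utf-8")       # characters -> bytes
> >     assert b.decode("utf-8") == s
> >     print(len(s))               # 1 on 3.3+, 2 on narrow builds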
>
> ----------
>
> I have shown enough examples. As soon as you are using non-latin-1
> chars, your "optimization" becomes irrelevant, and not only that: you
> are penalized.
>
> I'm sorry, but saying that Python now covers the whole Unicode
> range is not a valid excuse. I prefer a "correct" version with
> a narrower range of chars, especially if this range represents
> the "daily used chars".
>
> I can go a step further: if I wish to write an application for
> Western European users, I am better served using a coding
> scheme covering all these languages/scripts. What about cp1252 [*]?
> Does this not remind you of something?
>
> Python can do better; it only succeeds in doing worse!
>
> [*] yes, I know, internally ...
>
> jmf

-----

Addendum.

And you know what? Py34 will suffer from the same disease.
You are spending your time improving chunks of bytes,
when the problem is elsewhere.
In fact you are working for peanuts, e.g. the replace method.


If you are not satisfied with my examples, just pick up
the examples of GvR (ascii-string) on the bug tracker, "timeit"
them and you will see there is already a problem.

Better still, "timeit" them after having replaced his ascii-strings
with non-ascii characters...
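Here is one way to run such a comparison with timeit (the literals
below are stand-ins, not GvR's exact tracker examples):

    import timeit

    # An ascii-only haystack vs. the same haystack widened by a
    # single non-ascii (euro) character, timing str.replace on both.
    stmt = 's.replace("b", "B")'
    ascii_setup = 's = "abcd" * 1000'
    wide_setup = 's = "abcd" * 1000 + "\\u20ac"'

    print(timeit.timeit(stmt, ascii_setup, number=10000))
    print(timeit.timeit(stmt, wide_setup, number=10000))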

jmf



