Chardet, file, ... and the Flexible String Representation

Tue Sep 10 00:58:41 EDT 2013

On Mon, 09 Sep 2013 11:05:44 -0600, Michael Torrie wrote:

> On 09/09/2013 08:28 AM, wxjmfauth at gmail.com wrote:
>> Comment: Such differences never happen with utf.
> 
> But with utf, slicing strings is O(n) (well that's a simplification as
> someone showed an algorithm that is log n), whereas a fixed-width
> encoding (Latin-1, UCS-2, UCS-4) is O(1).  

UTF-32 is fixed-width. UTF-16 is not, but if you limit yourself to only 
characters in the Basic Multilingual Plane, it is functionally equivalent 
to UCS-2 and therefore fixed-width.
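
To make that concrete, here is a minimal sketch that counts the bytes each encoding needs for a single character. The helper name encoded_widths and the four sample characters are my own choices for illustration; the explicit "-le" codecs are used only so that no byte order mark is counted.

```python
# Sample characters: ASCII, Latin-1 range, elsewhere in the BMP, outside the BMP.
samples = ["a", "\xe9", "\u20ac", "\U0001F600"]

def encoded_widths(ch):
    """Return the number of bytes ch occupies in each encoding (illustrative helper)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for ch in samples:
    print("U+%04X" % ord(ch), encoded_widths(ch))
```

UTF-32 should report four bytes for every sample, UTF-16 two bytes for the BMP characters and four for the last one (a surrogate pair), and UTF-8 anything from one to four.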

> Do you understand what this means?

Talking about "utf" in general as JMF does is a good sign that he 
doesn't. Which UTF? I know of at least eight:

UTF-1
UTF-7
UTF-8
UTF-9  # this one is a joke, but it does work
UTF-16  # in two varieties, big-endian and little-endian
UTF-18  # another joke
UTF-32  # likewise two varieties
UTF-EBCDIC

although only three (UTF-8, UTF-16 and UTF-32), or perhaps four if you 
include UTF-7, are in common use.

[...]
> I don't even know that much about unicode yet it's clear you're either
> deliberately muddying the waters with your stupid and pointless
> arguments against FSR or you don't really understand the difference
> between unicode and byte encoding.  Which is it?

I have been watching JMF get a mad-on about the flexible string 
representation since he first noticed it, and in my opinion, his 
complaints are based entirely on resentment that ASCII users save more 
memory than non-ASCII users. Even if it means everyone is worse off, he 
is utterly opposed to giving ASCII users any benefit.

Of course, he neglects to consider that *every single Python user* is an 
ASCII user, since most strings in Python are pure ASCII. Names of 
builtins, standard library modules, variables, attributes, most of them 
are ASCII.
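
Anyone who wants to check this for themselves can try something like the following rough sketch. It assumes CPython 3.3 or later, where the flexible string representation is used; other implementations may store strings differently. The helper bytes_per_char is a made-up name for this example, and sys.getsizeof is only an approximation of real memory use.

```python
import builtins
import sys

# Names exposed by the builtins module, and how many of them are pure ASCII.
names = dir(builtins)
ascii_names = [n for n in names if all(ord(c) < 128 for c in n)]
print(len(ascii_names), "of", len(names), "builtin names are pure ASCII")

def bytes_per_char(ch, n=1000):
    """Estimate the storage per character for a string made only of ch.

    Compares two strings of different lengths so that the fixed
    per-object overhead cancels out (illustrative helper).
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

for ch in ("a", "\xe9", "\u20ac", "\U0001F600"):
    print("U+%04X" % ord(ch), bytes_per_char(ch))
```

On CPython with the flexible string representation I would expect the ASCII and Latin-1 samples to cost one byte per character, the other BMP sample two, and the non-BMP sample four.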

-- 
Steven