[Python-ideas] Python 3000 TIOBE -3%

Sat Feb 11 21:08:38 CET 2012

Masklinn, 11.02.2012 20:46:
> On 2012-02-11, at 20:35 , Stefan Behnel wrote:
>>
>>> Yes, but now instead of just ignoring that stuff you have to actively and
>>> knowingly lie to Python to get it to shut up.
>>
>> The advantage is that it becomes explicit what you are doing. In Python 2,
>> without any encoding, you are implicitly assuming that the encoding is
>> Latin-1, because that's how you are processing it. You're just not spelling
>> it out anywhere, thus leaving it to the innocent reader to guess what's
>> happening. In Python 3, and in better Python 2 code (using codecs.open(),
>> for example), you'd make it clear right in the open() call that Latin-1 is
>> the way you are going to process the data.
> 
> I'm not sure going from "ignoring it" to "explicitly lying about it" is a
> great step forward. latin-1 is not "the way you are going to process the data"
> in this case, it's just the easiest way to get Python to shut up and open the
> damn thing.
> 
>>>> Besides, it's perfectly possible to process bytes in Python 3. You just
>>>> have to open the file in binary mode and do the processing at the byte
>>>> string level.
>>>
>>> I think that's the route which should be taken
>>
>> Oh, absolutely not. When it's text, it's best to process it as Unicode.
> 
> Except it's not processed as text, it's processed as "stuff with ascii
> characters in it". Might just as well be cp-1252, or UTF-8, or Shift JIS

Well, you are still processing it as text because you are (again,
implicitly) assuming those ASCII characters to be just that: ASCII-encoded
characters. You couldn't apply the same byte processing algorithm to UCS-2
encoded text or a gzip-compressed file, for example, at least not with a
useful outcome.
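
To illustrate, here is a minimal sketch of such a byte-level algorithm. The
helper count_matching_lines() and the sample data are made up for this
example, and "utf-16-le" stands in for UCS-2 (Python has no separate UCS-2
codec; for these characters the two produce the same bytes). The search
works for any ASCII-compatible encoding and silently finds nothing
otherwise.

```python
def count_matching_lines(data: bytes, keyword: bytes) -> int:
    """Count lines containing an ASCII keyword, working purely on bytes."""
    return sum(1 for line in data.split(b"\n") if keyword in line)


# Sample text with one non-ASCII character (illustrative only).
text = "error: caf\u00e9\nok\nerror: timeout\n"

# ASCII-compatible encodings: the ASCII bytes of "error" appear unchanged.
for encoding in ("utf-8", "latin-1", "cp1252"):
    print(encoding, count_matching_lines(text.encode(encoding), b"error"))

# Two bytes per character: "error" becomes b"e\x00r\x00r\x00o\x00r\x00",
# so the same algorithm no longer finds the keyword.
print("utf-16-le", count_matching_lines(text.encode("utf-16-le"), b"error"))
```

The first three calls report 2 matching lines each, the last one reports 0.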

Mind you, I'm not talking about text semantics here. I'm not considering
whether the decoded data results in French, Danish, German or other
human words, or in completely incomprehensible garbage. That's not
relevant. What is relevant is that the program needs an identity mapping
from 1 byte to 1 character in order to work correctly, which, speaking in
Unicode terms, implies Latin-1 decoding. Hence my advice to make that
assumption explicit.
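
A small sketch of what that identity mapping means in practice. The function
drop_comment_lines() and the sample input are hypothetical; the point is only
that Latin-1 maps every byte value 0..255 to the code point with the same
number, so decoding, processing the ASCII parts and encoding again leaves all
other bytes untouched, whatever the real encoding was.

```python
# Latin-1 maps byte value N to code point N, for all 256 byte values.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
assert [ord(ch) for ch in decoded] == list(range(256))
assert decoded.encode("latin-1") == all_bytes  # lossless round trip


def drop_comment_lines(data: bytes) -> bytes:
    """Remove lines starting with '#', treating the data as Latin-1 text."""
    text = data.decode("latin-1")
    # split("\n") rather than splitlines(): the latter also splits on
    # characters such as U+0085, which may be ordinary payload bytes here.
    kept = [line for line in text.split("\n") if not line.startswith("#")]
    return "\n".join(kept).encode("latin-1")


# The input is really UTF-8, but only the ASCII '#' and '\n' are inspected.
raw = "# comment\nname=caf\u00e9\n".encode("utf-8")
result = drop_comment_lines(raw)
assert result == "name=caf\u00e9\n".encode("utf-8")
```

When reading from a file, the same assumption can be spelled out directly as
open(path, encoding="latin-1").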

Stefan