Python usage numbers

Sun Feb 12 17:30:32 EST 2012

On Sun, 12 Feb 2012 05:11:30 -0600, Andrew Berg wrote:

> On 2/12/2012 3:12 AM, Steven D'Aprano wrote:
>> NTFS by default uses the UTF-16 encoding, which means the actual bytes
>> written to disk are \x1d\x040\x04\xe5\x042\x04 (possibly with a leading
>> byte-order mark \xff\xfe).
>
> That's what I meant. Those bytes will be interpreted consistently across
> all locales.

Right. But that's not Unicode; it is an encoding of Unicode. Terminology
is important -- if we don't call things by the "right" names (or at least
agreed-upon names), how can we communicate?
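
To make the distinction concrete, here is a minimal sketch in Python 3. The
variable names are mine and purely illustrative. It shows that the text is a
sequence of code points, and that the bytes quoted above are just one
encoding of it (little-endian UTF-16):

```python
import codecs

# The directory name from the quoted text, spelled as code points so the
# example does not depend on how this source file is encoded.
name = "\u041d\u0430\u04e5\u0432"

# Unicode is the abstract sequence of code points...
code_points = [hex(ord(c)) for c in name]

# ...and little-endian UTF-16 is one way of turning them into bytes.
utf16_bytes = name.encode("utf-16-le")
print(utf16_bytes)    # b'\x1d\x040\x04\xe5\x042\x04'

# The optional byte-order mark is simply prepended to those bytes.
with_bom = codecs.BOM_UTF16_LE + utf16_bytes

# UTF-8 is a different encoding of the very same text.
utf8_bytes = name.encode("utf-8")
assert utf8_bytes != utf16_bytes
assert utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le") == name
```

Two different byte strings, one piece of text: that is why "Unicode" and
"an encoding of Unicode" need separate names.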

>> Windows has two separate APIs, one for "wide" characters, the other for
>> single bytes. Depending on which one you use, the directory will appear
>> to be called Наӥв or 0å2.
>
> Yes, and AFAIK, the wide API is the default. The other one only exists
> to support programs that don't support the wide API (generally, such
> programs were intended to be used on older platforms that lack that
> API).

I'm not sure that "default" is the right word, since (as far as I know)
the two APIs are spelled differently and the coder has to choose whether
to call function X or function Y. Perhaps you mean that Microsoft
encourages the wide API and makes the single-byte API available for
legacy reasons?
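
For anyone wondering where the two names in the quoted text come from, the
following sketch may help. It does not call either Windows API; it only
reads the same eight bytes in two ways. Using Latin-1 for the
one-byte-per-character reading is my assumption, chosen for illustration:

```python
# The UTF-16 bytes from the quoted text.
raw = b"\x1d\x040\x04\xe5\x042\x04"

# Read as 16-bit units, the bytes give back the original four characters.
as_wide = raw.decode("utf-16-le")
assert as_wide == "\u041d\u0430\u04e5\u0432"

# Read one byte per character, the same bytes are mostly control
# characters with three printable ones scattered among them.
as_single = raw.decode("latin-1")
visible = "".join(ch for ch in as_single if ch.isprintable())
assert visible == "0\xe52"    # the digit 0, a-with-ring, the digit 2
```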

>> But in any case, we're not talking about the file name encoding. We're
>> talking about the contents of files.
>
> Okay then. As I stated, this has nothing to do with the OS since
> programs are free to interpret bytes any way they like.

Yes, but my point was that even if the developer thinks he can avoid the
problem by staying away from "Unicode files" coming from Linux and OS X,
he can't avoid dealing with multiple code pages on Windows.
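
A small sketch of the code page problem, with no Linux or OS X involved.
The sample word and the choice of code pages are hypothetical, picked only
because Python ships codecs for all three:

```python
# A hypothetical file written on a Russian-locale Windows machine, where
# the legacy code page is cp1251.
data = "\u041d\u0430\u0438\u0432".encode("cp1251")

# The same four bytes, read under three different code pages.
readings = {}
for page in ("cp1251", "cp1252", "cp437"):
    readings[page] = data.decode(page)

# Only the reader who guesses the right code page gets the original text;
# the others get mojibake without any error being raised.
assert readings["cp1251"] == "\u041d\u0430\u0438\u0432"
assert readings["cp1252"] == "\xcd\xe0\xe8\xe2"
assert len(set(readings.values())) == 3
```

Nothing in the bytes themselves says which code page was intended, so the
program has to be told or has to guess.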

You are absolutely correct that this is *not* a cross-platform issue
caused by the OS, but some people may think it is.

-- 
Steven