Is there really a default source encoding?

Alexander Schmolck a.schmolck at gmx.net
Thu Jan 23 16:52:27 EST 2003


Brian Quinlan <brian at sweetapp.com> writes:

> I don't understand the goal of your proposal. The goal of the current
> plan is to support existing non-ASCII source files but warn the user
> that they will need to add an encoding comment in the future. 

This is a laudable goal. I just didn't understand why, after slowly and
carefully moving away from a eurocentric default (latin-1), one would want to
go even further back to an anglocentric one (ascii) instead of opting for a
truly international, anglo-neutral one (utf-8). That seemed a bit like one
step forward, two steps back. Sure, you could always explicitly request utf-8
with some comment kludge, but why not choose a default that simply handles
*all* cases and is fully upward compatible with what Martin said would become
the default encoding (ascii)? Now your next statement sheds some light on the
issue:

> 
> UTF-8 is a supported encoding. In fact, if a UTF-8 BOM is present then
> the encoding comment is not necessary.
> 

Great. But are you sure that BOMs are such a good idea?

Quote (from http://www.cl.cam.ac.uk/~mgk25/unicode.html):

    It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF) as a
    signature to mark the beginning of a UTF-8 file. This practice should
    definitely not be used on POSIX systems for several reasons:
    
    On POSIX systems, the locale and not magic file type codes define the encoding
    of plain text files. Mixing the two concepts would add a lot of complexity and
    break existing functionality.  Adding a UTF-8 signature at the start of a file
    would interfere with many established conventions such as the kernel looking
    for "#!" at the beginning of a plaintext executable to locate the appropriate
    interpreter.  Handling BOMs properly would add undesirable complexity even to
    simple programs like cat or grep that mix contents of several files into one.

The '#!'-bit would seem especially relevant.  
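
To make the '#!' point concrete, here is a minimal sketch of the kind of check
the quoted text describes: the first two bytes of the file are compared against
"#!", so a leading UTF-8 BOM hides the interpreter line. The helper name
starts_with_shebang and the sample script are made up for illustration, and
the check is only an approximation of what a real kernel does.

```python
import codecs


def starts_with_shebang(data: bytes) -> bool:
    """Approximate the kernel's check: the first two bytes must be '#!'."""
    return data[:2] == b"#!"


# A hypothetical script, once as plain bytes and once with a UTF-8 BOM in front.
script = b"#!/usr/bin/env python\nprint('hello')\n"
with_bom = codecs.BOM_UTF8 + script  # BOM_UTF8 is the bytes EF BB BF

plain_ok = starts_with_shebang(script)   # '#!' is at offset 0
bom_ok = starts_with_shebang(with_bom)   # '#!' is now at offset 3
```

Under these assumptions the plain file passes the check and the BOM-prefixed
one does not, which is the interference the quote warns about.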

I don't pretend to be a great Unicode expert, and maybe the above is outdated,
flawed or irrelevant, but it still isn't clear to me why .py files shouldn't
just be assumed to be utf-8 (after the transitory latin-1 period), with or
without a BOM. My cursory rereading of PEP 263 didn't make it clear to me
either.
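
As a rough illustration of why utf-8 looks upward compatible with ascii but
not with latin-1, the sketch below decodes two made-up source snippets. The
variable names and sample strings are assumptions chosen for the example; it
shows only the byte-level behaviour, not how the interpreter reads source
files.

```python
# Pure-ASCII source: every byte is below 0x80.
ascii_source = b"print('hello')\n"

# Source containing a latin-1 encoded e-acute (byte 0xE9).
latin1_source = "s = 'caf\u00e9'\n".encode("latin-1")

# ASCII bytes decode to the same text under both codecs.
same_text = ascii_source.decode("ascii") == ascii_source.decode("utf-8")

# A lone 0xE9 followed by an ASCII byte is not a valid UTF-8 sequence.
try:
    latin1_source.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
```

Existing ascii files would keep working under a utf-8 default, while files
with latin-1 bytes would fail to decode, which is the case a transitory period
would have to cover.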

alex







