Proposal: require 7-bit source str's

Hallvard B Furuseth h.b.furuseth at usit.uio.no
Fri Aug 6 15:59:20 EDT 2004


Martin v. Löwis wrote:
>Hallvard B Furuseth wrote:
>>>"Written by Martin v. Löwis"
>> 
>> So if the file has -*- coding: iso-8859-1 -*-, how does that doc string
>> look to someone using an iso-8859-2 locale?
> 
> Let's start all over. I'm referring to a time when there was no encoding
> declaration, and PEP 263 was not written yet. At that time, I thought
> that a proper encoding declaration (i.e. a statement) would be the
> best thing to do. So in my example, there is no -*- coding: iso-8859-1 
> -*- in the file. Instead, there is a directive.
> 
> About the unrelated question: How should a docstring be displayed
> to a user working in a different locale? Well, in theory, the docstring
> should be converted from its source encoding to the encoding where
> it is displayed. In practice, this is difficult to implement, and
> requires access to the original source code. However, Francois Pinard
> has suggested adding an __encoding__ attribute to each module,
> which could be used to recode the docstring.

Sounds OK for normal use.  It's not reliable, though: if files f1 and
f2 have different 'coding:'s and f1 does execfile(f2), f2's doc strings
won't match sys.modules[<something from f2>.__module__].__encoding__.

(Or maybe the other way around:  I notice that the execfile sets
f1.__doc__ = <f2's doc string>.  But I'll report that as a bug.)
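
To make that mismatch concrete, here is a minimal Python 2 sketch; the
file names and contents are made up, and __encoding__ is of course only
the suggested attribute, not something that exists today:

  # Generate f2.py here just to keep the sketch self-contained.
  open('f2.py', 'w').write(
      '# -*- coding: iso-8859-5 -*-\n'
      '"""\xb0\xb1\xb2 doc string, meaningful only as iso-8859-5"""\n')

  # Now imagine this module itself declares -*- coding: iso-8859-1 -*-:
  execfile('f2.py')   # f2's doc string is stored into *this* namespace
  # __doc__ now holds iso-8859-5 bytes, while the suggested per-module
  # __encoding__ would still claim iso-8859-1: exactly the mismatch.
  print repr(__doc__)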

> About your literal question: In the current implementation, the string
> looks just fine, as this docstring is codepoint-by-codepoint identical
> in iso-8859-1 and iso-8859-2.

Whoops.  Please pretend I said iso-8859-5 or something.  I was thinking
of ø, not ö.  I had just written about that in another posting.

>> Just like a str7bit directive, in whatever form, would not catch the
>> missing u in front of the doc string.
> 
> Not necessarily. It would be possible to go back and find all strings
> that fail to meet the requirement.

That sounds like it could have a severe performance impact.  However,
maybe the compiler can set a flag if there are any such strings when it
converts parsed strings from Unicode back to the file's encoding.  Then
the str7bit directive can warn that the file contains one or more bad
strings.  Or, if the directive is executed while the file is being
parsed, it can catch such strings below the directive and fall back to
the less informative warning for any such strings above the directive.

I can't say I like the idea, though.  It assumes Python retains the
internal implementation of 'coding:' described in PEP 263: convert the
source code to Unicode, then convert string literals back to the source
character set.
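
For reference, that pipeline in rough code form (a sketch of what
PEP 263 describes, not CPython's actual implementation; the file name
and encoding are arbitrary):

  declared = 'iso-8859-1'                        # from the coding: line
  text = open('mod.py').read().decode(declared)  # whole source -> Unicode
  # ...the compiler tokenizes and parses the Unicode text...
  # ...then each plain str literal is converted back, roughly:
  literal = u'bl\xe5'.encode(declared)   # Unicode -> source charset bytes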

> Notice that your approach only works for languages with single-byte
> character sets anyway. Many multi-byte character sets use only
> bytes < 128, and still they should get the warning you want to produce.

They will.  That's why I specified that this be done after conversion
to Unicode.  But I notice my spec was unclear about that point.

New spec:

  After the source file has been converted to Unicode, cause a
  parse error if a non-u'' string contains a converted character
  whose Unicode code point is >= 128.
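
As a rough standalone sketch of that check (Python 2, tokenize-based
rather than the compiler hook itself; check_str7bit and its arguments
are my invention for illustration):

  import tokenize

  def check_str7bit(filename, encoding):
      """Warn about plain str literals with code points >= 128."""
      f = open(filename)
      for tok in tokenize.generate_tokens(f.readline):
          toknum, tokval, row = tok[0], tok[1], tok[2][0]
          if toknum != tokenize.STRING:
              continue
          i = 0
          while tokval[i] not in '\'"':   # skip any u/r prefix
              i += 1
          if 'u' in tokval[:i].lower():
              continue                    # u'' literals are exempt
          for c in tokval.decode(encoding):
              if ord(c) >= 128:
                  print '%s:%d: non-7-bit str literal' % (filename, row)
                  break
      f.close()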

Except...

None of this properly addresses encodings that are not ASCII supersets
(or subsets), like EBCDIC.  Both Python and many Python programs seem to
make the assumption that the character set is ASCII-based, so plain
strings (with type str) can be output without conversion, while Unicode
strings must be converted to the output device's character set.
E.g. from Info node 'File Objects':

  `encoding'
     The encoding that this file uses. When Unicode strings are written
     to a file, they will be converted to byte strings using this
     encoding.

Nothing there about converting 'str' strings.  Solving that seems far
out of scope for this PEP-to-be, so my proposal inherits the above
assumption.  Still, the problem may have to be discussed in order to
get a str7bit feature which does not get in the way of a clean solution
for character sets like EBCDIC.
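
The asymmetry in code form (a Python 2 sketch; the file names are
arbitrary):

  import codecs

  u = codecs.open('u.txt', 'w', encoding='iso-8859-1')
  u.write(u'bl\xe5')    # unicode: converted via the file's encoding
  u.close()

  s = open('s.txt', 'w')
  s.write('bl\xe5')     # str: the bytes pass through unconverted
  s.close()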

>>>(of course, requiring that people use escape sequences for
>>>them might be acceptable).
>> 
>> Argh!  Please, no.
> 
> Think again. There absolutely is a need to represent byte arrays
> in Python source code, e.g. for libraries that manipulate binary
> data, e.g. generate MPEG files and so on. They do have a legitimate
> need to represent arbitrary bytes in source code, with no intention
> of these bytes being interpreted as characters.

Sure.  I wasn't protesting against people using escape sequences.
I was protesting against requiring that people use them.
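
That is, both forms should remain available for their respective jobs
(a sketch; assume the file declares iso-8859-1):

  header = '\x00\x00\x01\xba'   # byte data: escapes are the natural form
  name = 'Blåbær'               # text: the readable literal form which a
                                # requirement to escape would take away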

-- 
Hallvard


