[Python-3000] Support for PEP 3131

Sat Jun 2 19:48:49 CEST 2007

"Rauli Ruohonen" <rauli.ruohonen at gmail.com> wrote:
> On 6/2/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> > """
> > If a comment in the first or second line of the Python script matches
> > the regular expression coding[=:]\s*([-\w.]+), this comment is processed
> > as an encoding declaration; the first group of this expression names the
> > encoding of the source code file.
> > """
> >
> > Your suggestion would unnecessarily change the semantics of the encoding
> > declarations.  I would call this gratuitous breakage.
> 
> Depending on what the regular expression for the declarations is, the
> difference may not be big. Current code can also reliably be converted
> with an automated tool, so this isn't a big deal for py3k.

Whether or not there exists a tool to convert from Python 2.6 to Python
3.0 (2to3), every tool that currently handles Python source code
encodings via the method specified in the documentation (just about
every Python-centric editor I know) would need to be changed.  Further,
not all code will be passed through the 2.6 to 3.0 converter, as the
tool is meant as a sort of "I don't want to go through all the trouble
of converting yet, but I want to support Python 3.0".  And even if it
*were* all passed through, the output of the converter is not meant for
future editing and consumption; it is meant as a stopgap.  People who
really want to support Python 3.0 should be doing the conversion by hand,
possibly with guidance from the converter.
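
For reference, the check those tools perform is small.  Below is a
minimal sketch, assuming only the documented regular expression quoted
above.  The function name declared_encoding and the "ascii" fallback are
choices made for this illustration, not part of any standard library API.

```python
import re

# The pattern from the documentation quoted above.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")

def declared_encoding(source_lines, default="ascii"):
    """Return the encoding named in a comment on line 1 or 2, else `default`.

    Both the function name and the `default` fallback are hypothetical.
    """
    for line in source_lines[:2]:
        # Only a comment can carry the declaration.  A '#' inside a string
        # literal would fool this simple test; good enough for a sketch.
        comment_start = line.find("#")
        if comment_start == -1:
            continue
        match = CODING_RE.search(line, comment_start)
        if match:
            return match.group(1)
    return default
```

Any editor that follows the documentation does something equivalent to
this, which is why changing the declaration's semantics touches all of
them.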

> It may be that the change is unnecessary. Reading Guido's writings, he
> seems to be of the opinion that the Java way (no restrictions at all) is
> right here, and anything else can be delegated to pylint and similar
> tools.

Perhaps, but there is a growing contingent here that is of the opposite
opinion.  And even though this contingent is of differing opinions on
whether unicode identifiers should even be allowed, we all agree that if
they are allowed, they shouldn't be the default.

> > Sounds like the application of vim settings as a solution to a whole
> > bunch of completely unrelated "problems" in Python (especially with 4
> > space indents being the "one true way to indent" and the encoding
> > declaration already being established).  Please keep your vim out of my
> > Python ;) .
> 
> The encoding declaration stays mostly the same, I'm just suggesting adding
> similar declarations for the identifier/string character sets and making
> them deception-proof. You're probably right about the indentation stuff.
> If you got rid of all indentation-related options and simply forbade
> mixture of tabs and spaces, I'd just say good riddance.

Python 2.x has a -t option that warns people about inconsistent
tab/space usage.  In 3.0, from what I understand, that option is
automatically enabled and may result in errors instead of warnings.
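
This is easy to check on whatever interpreter is at hand.  A small
sketch (the helper name rejects_mixed_indentation is made up for this
example): it compiles a snippet whose first indented line uses a tab and
whose second uses eight spaces, and reports whether the compiler refuses
it with TabError.

```python
# One tab on the first indented line, eight spaces on the second.
SNIPPET = "if True:\n\tx = 1\n        y = 2\n"

def rejects_mixed_indentation(source=SNIPPET):
    """Return True if this interpreter refuses `source` with TabError."""
    try:
        compile(source, "<mixed-indent>", "exec")
    except TabError:
        return True
    return False

print(rejects_mixed_indentation())
```

If the understanding above is right, a 3.0 interpreter should refuse the
snippet, while 2.x run without -tt would be expected to accept it.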

> > And as stated by basically everyone, the only *sane* default is ascii
> > identifiers.  Since the vast majority of users will have no use for
> > unicode identifiers in the short or long term, making them the default
> > is overzealous at best.
> 
> "Basically everyone" is not true, because it does not include Guido, who
> matters the most. Some quotes from his latest posts on the topic:

Guido doesn't always overrule everyone.  There is quite a long history
of him changing his mind after having seen good reasoning about an issue. 
Most recently, see the dynamic attribute access thread about the o.{a}
syntax.

And when I say "basically everyone", I'm offering everyone who has
offered their opinion recently the opportunity to be in that camp.
Please see the writings of Baptiste Carvello, Jim Jewett, Ka-Ping Yee,
Stephen Howell, Ivan Krstic, and myself.

If you want to completely ignore the general consensus that was reached
by people on both sides of the issue, that's fine.  But pardon me if I
ignore you from here on out.

> Guido van Rossum (May 25):
> :I still think such a command-line switch (or switches) is the wrong
> :approach. What if I have *one* module that uses Cyrillic legitimately.
> :A command-line switch would enable Cyrillic in *all* modules.

I'm not personally a really big fan of the command-line argument
approach, but that doesn't mean that the only two solutions are
in-module with your syntax and command-line.  There are other solutions
(global registry of individual module allowed identifiers, in-module
with a different syntax, etc.). I'm just saying that I don't like *your*
solution.

> Guido van Rossum (May 25):
> :On 5/24/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> :> Where else in Python have we made the default
> :> behavior only desired or useful to 5% of our users?
> :
> :Where are you getting that statistic? This seems an extremely
> :backwards, US-centric worldview.

You will note that I actually responded to this, as have others.  The
use of unicode identifiers will be rare, and your pressure to try to
make them the default won't change that; but it will confuse the hell
out of the large numbers of users who have no use for unicode, and whose
tools are not prepared for unicode.

> Guido van Rossum (May 25):
> :A more useful approach would seem to be a set of auditing tools that
> :can be applied routinely to all new contributions (e.g. as a
> :pre-commit hook when using a source control system), or to all code in
> :a given directory, download, etc. I don't see this as all that
> :different from using e.g. PyChecker or PyLint.
> :
> :While I routinely perform visual code inspections [...], I certainly don't see
> :this as a security audit [...]. Scanning for stray non-ASCII characters is best
> :left to automated tools.

Others have also responded to this.  Adding a tool to an arbitrarily
large or small previously existing toolchain, so that the majority of
users can verify that their code doesn't contain characters that
shouldn't be allowed in the first place, isn't a very good solution.
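
To be concrete about what such a tool amounts to, here is a sketch using
the standard tokenize module.  The function name non_ascii_identifiers
is hypothetical, and the sketch assumes an interpreter whose tokenizer
already accepts non-ascii names (that is, one with PEP 3131 implemented).

```python
import io
import tokenize

def non_ascii_identifiers(source):
    """Return (line, column, name) for each identifier with a non-ascii character."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # NAME tokens cover identifiers and keywords; string literals and
        # comments are separate token types and are not reported.
        if tok.type == tokenize.NAME and any(ord(ch) > 127 for ch in tok.string):
            found.append((tok.start[0], tok.start[1], tok.string))
    return found
```

Run over a file's text, this gives a list that a pre-commit hook could
use to refuse the change.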

> Guido van Rossum (May 23):
> :In particular very helpful was a couple of reports from the Java
> :world, where Unicode letters in identifiers have been legal for a long
> :time now. (JavaScript also supports this BTW.) The Java world has not
> :fallen apart,

And we reported about this.  They are rarely used, and the vast
majority of code that *does* have unicode identifiers is closed-source.
As someone else has already discussed, do we want to encourage open source
(with which the only sane identifiers are ascii), or do we want to
encourage closed-source and the 'ghettoization' of Python source code?

> Guido van Rossum (May 17):
> :As I mentioned before, I don't expect either of these will be much of
> :a concern. I guess tools like pylint could optionally warn if
> :non-ascii characters are used.
> :
> :On 5/16/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> :> (1)  Security concerns.
> :> (2)  Obscure bugs.
> 
> Summary of what I think Guido's saying (involves some interpretation):
>  - always having no restrictions (the Java way) is not a problem in practice
>  - because having no restrictions has worked well with Java, Python
>    should follow

Only because it is so rarely used that no one really runs into unicode
identifiers.  As such, the only sane position is to require the explicit
enabling of unicode identifiers.  Also please see Nick Coghlan's
discussion about *why* this isn't as much an issue with statically typed
declarative languages as it is with Python.

>  - any concerns can be adequately dealt with solely by external tools

And having to rely on *additional* tools to verify that what the vast
majority of users want is actually happening is silly.  I'll ask again,
because you don't seem to have been paying attention to the messages you
cited: where else in Python has the tiny minority defined the
defaults for the vast majority of users?

>  - command line switches are a bad implementation of restriction management

That's the only argument that is worth listening to.  But command line
switches aren't our only option here.

[snip]
> This isn't really anything more than a countermeasure against Ka-Ping's
> tricky.py -exploit and addition of a real charset restriction method instead of
> abusing the coding declaration for that (that would force you to use legacy
> codings just to restrict the charsets, as pointed out a lot earlier here).

Thankfully, no one who has bothered to think for more than a few minutes
about this issue has seriously considered using legacy encodings.  So
it's a non-issue.

> One more thing which might be removed from the suggestion is the command
> line option and its associated site.py default. Such checking is more
> appropriate for pylint, and is probably of little use anyway. Either you
> trust the files you're importing, in which case the characters they use do
> not make any difference, or you don't, in which case you shouldn't be
> importing them at all and checking their character sets will not help you
> at all. For audit purposes the comment directives are enough as they can't
> deceive, and if you want to be extra paranoid you can use pylint to catch
> any surreptitious patches like in Guillaume's post.

Adding Pylint to verify that I don't have characters that shouldn't be
allowed in the first place, when Python should tell me *the moment*
modules are being compiled, is silly.  Now, you have had the opportunity
to go through the hundreds of posts on the matter and compose a message,
yet you still don't understand that ascii is the only sane default. 
Please read posts in the 3131 thread from the authors I list above, and
please try to inform yourself on the content of postings from people
that are not Guido.

 - Josiah