2.3 encoding parsing bug

Wed Feb 18 02:23:57 EST 2004

Edward K. Ream wrote:
> Is there any chance of modifying the re to reduce the possibility of
> confusion and "false matches"?  For example, matching only 'coding' and
> 'fileencoding' (as words).

Certainly. Propose a change to the specification, and suggest it to
python-dev. If the proposed change is acceptable, and somebody
volunteers to provide an implementation, it will get implemented in
2.4. There is no chance of changing 2.3 in an incompatible way.
And there is, of course, no chance of changing the copies of
Python 2.3 that have already been installed.

> Thanks for your clarification of the situation.  I suppose I'll have to look
> more closely at PEP's in the future.  These over-general encoding
> declarations seem like a pretty low blow.

I personally would have preferred a proper statement to declare the
encoding, such as

pragma encoding "iso-8859-1"

However, this approach was rejected as too intrusive, and a stealth
declaration in comments was considered more appropriate.

> This was just a really bad idea, put forward in stealth, buried in an re.
> Having a _restricted_ kind of special-purpose comment is one thing:  having
> a way-too-general kind of special-purpose comment is wrong, wrong, wrong.
> It needlessly invalidates comments that _should_ have been none of Python's
> business.

OTOH, LEO _should_ not have come up with its own syntax to specify an
encoding. Instead, LEO should have used established conventions, such
as

   -*- coding: <codingname> -*-

> My guess is that I could have read this many times without having the
> slightest hint of danger: the re bears almost no relation to the English
> words.

That is not true. The English text gives specific, recommended
examples. Users (i.e. Python programmers) should use the recommended
syntax, instead of coming up with their own syntax that still matches
the regular expression.

The regular expression is introduced with the words "more precisely",
which always should make readers of formal specifications cautious.
In particular, this aspect is directed at developers of tools that
edit Python source, as this is the regular expression they need to
use to determine the encoding of the file. If LEO can read Python
files, this regular expression should have been used ever since
support for coding declarations was implemented.
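
As a rough illustration of what such a tool might do, here is a minimal
sketch. It assumes the pattern is the one I recall from PEP 263,
coding[:=]\s*([-\w.]+), applied to the first two lines of the file; the
function name and the sample header line are made up for the example, and
this is not the interpreter's actual implementation.

```python
import re

# Assumption: the pattern as given in PEP 263 (quoted from memory).
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")


def declared_encoding(source_text):
    """Return the encoding named in the first two lines, or None.

    Hypothetical helper: it only looks at text after a '#', since the
    declaration is supposed to live in a comment.
    """
    for line in source_text.splitlines()[:2]:
        hash_pos = line.find("#")
        if hash_pos < 0:
            continue  # no comment on this line
        match = CODING_RE.search(line, hash_pos)
        if match:
            return match.group(1)
    return None


# The recommended form is picked up as intended.
print(declared_encoding("# -*- coding: iso-8859-1 -*-\nx = 1\n"))

# A home-grown header (made up here) also matches, because "encoding="
# contains "coding=", and the trailing dot becomes part of the name.
print(declared_encoding("#@+tool-encoding=iso-8859-1.\nx = 1\n"))
```

The second call shows how a comment that was never meant as a declaration
can still be read as one, which is why a tool that writes its own header
comments needs to check them against the same pattern.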

Regards,
Martin