Language change and code breaks

Fri Jul 20 13:54:41 EDT 2001

Some observations and data from someone who has yet to participate in this
discussion.

I have seen at least one argument that, for interfaces to external packages,
it is useful to be able to map easily onto the identifiers in those external
systems.  Most of those external systems tend to be case-sensitive, because
they tend to be written in C or C++.  What effect would case-insensitivity
in the language have on packages like Boost, SWIG or Jython?  I imagine that
at the least, they would have to develop some sort of name mapping.  One of
the advantages of automatic wrapper generators is that they preserve (most
of) the value of existing documentation.  Case-insensitivity would probably
hurt somewhat there.

I don't think you can add a switch to the language that lets the user make
it case-insensitive on the fly.  Users would not be able to reliably use
external packages that had been coded expecting to be used in only one
environment or the other.  For the sanity of all involved, I think
the language is either going to have to be strictly case-sensitive or
strictly case-insensitive.  A schizophrenic interpreter would probably only
induce psychoses in its users. ;-)

I wrote a simple script using the tokenize module to classify the names in
the Python sources in the current distribution (cvs up'd from the
descr-branch this morning) and spit out potential name clashes if those
names were compared in a case-insensitive fashion.  I ran the script from my
build directory under .../dist/src as:

    find .. -name '*.py' | xargs ./python ~/tmp/spittokens.py

The output looks like:

    ../Demo/classes/Rat.py
       rat,Rat
    ...
    ../Lib/distutils/sysconfig.py
       PREFIX,prefix
       EXEC_PREFIX,exec_prefix
       TextFile,text_file

The script skips names that begin with an underscore.  For all other names
it elides underscores before classifying them without regard to case.  For
example "my_dog", "mydog", "MyDog", and "my_dog__" would all be classified
the same.  My assumption was that in the absence of capitalization as a way
to distinguish names, underscores would be used more than they are today and
so "my_dog" and "MyDog" should be classified the same, because the most
likely way to prevent "mydog" and "MyDog" from clashing in a
case-insensitive world would be to rewrite the former as "my_dog".  This
obviously errs on the high side when considering what might need to be done
to the Python core libraries.  It also completely ignores the core C source
code, the names of the source files themselves, and potential name clashes
across module boundaries (all sources of plenty of potentially clashing
identifiers).
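
To make the classification concrete, here is a minimal sketch of the idea.
It is not the actual spittokens.py; it is written for Python 3, the names
fold and find_clashes are made up for illustration, and skipping keywords is
my assumption rather than something stated above.

```python
import io
import keyword
import tokenize
from collections import defaultdict


def fold(name):
    """Classification key: drop underscores, then ignore case."""
    return name.replace("_", "").lower()


def find_clashes(source):
    """Return sorted groups of distinct names in source that share a key."""
    groups = defaultdict(set)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.NAME:
            continue
        name = tok.string
        # Assumption: keywords and names with a leading underscore are skipped.
        if keyword.iskeyword(name) or name.startswith("_"):
            continue
        groups[fold(name)].add(name)
    # Only keys reached by more than one spelling are potential clashes.
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        with tokenize.open(path) as f:
            clashes = find_clashes(f.read())
        if clashes:
            print(path)
            for names in clashes:
                print("   " + ",".join(names))
```

Run over a list of files, this prints each file that has at least one group
of clashing names, followed by the groups, in roughly the layout shown
earlier.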

With those caveats, it identified 504 Python source files (out of a total of
1335 .py files) in the current distribution that might have name conflicts
(or would at least have to be checked) if Python became case-insensitive.
The directories with the most files that would have to be checked are:

     64 ../Lib
     34 ../Lib/test
     32 ../Demo/tkinter/matt
     25 ../Tools/idle
     21 ../Mac/Tools/IDE
     21 ../Demo/tkinter/guido
     21 ../Demo/sgi/video
     15 ../Mac/Lib
     14 ../Lib/distutils
     13 ../Mac/Lib/test
     12 ../Lib/lib-tk
     10 ../Tools/scripts
     10 ../Tools/pynche
     10 ../Mac/scripts
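
A per-directory tally like the one above can be produced from the list of
flagged files in a few lines.  This is only a sketch of one way to do it;
the function name tally_by_directory and the example paths are hypothetical.

```python
import os
from collections import Counter


def tally_by_directory(paths):
    """Count flagged files per containing directory, largest count first."""
    counts = Counter(os.path.dirname(p) for p in paths)
    # Sort by descending count, then by directory name for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical input: paths of files that had at least one clash.
    flagged = ["pkg/a.py", "pkg/b.py", "demo/c.py"]
    for directory, n in tally_by_directory(flagged):
        print(f"{n:7d} {directory}")
```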

The source and the output are available at

    http://musi-cal.mojam.com/~skip/spittokens.py
    http://musi-cal.mojam.com/~skip/spittokens.out

respectively.

Finally, I ran the script over my personal library of Python source files.
It identified 75 files out of 360 with potential name conflicts.  I suspect
if the language is changed I will adapt without much difficulty, but it will
be tedious.

-- 
Skip Montanaro (skip at pobox.com)
http://www.mojam.com/
http://www.musi-cal.com/