Refactoring in a large code base

Chris Angelico rosuav at gmail.com
Fri Jan 22 07:28:05 EST 2016


On Fri, Jan 22, 2016 at 10:54 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Chris Angelico <rosuav at gmail.com>:
>
>> On Fri, Jan 22, 2016 at 9:19 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> So what do you do with a huge program?
>
> Modularize. Treat each module as a separate product with its own release
> cycle, documentation, APIs, ownership, etc.
>
> What is a reasonable size of a module? It is something you would
> consider replacing with a new implementation with a moderate effort
> (say, in a single quarter).
>
>> CPython is a large and complex program. How do you propose doing it
>> "right"?
>
> I don't know CPython specifically to give solid recommendations, but I
> would imagine the core language engine should be in a repository
> separate from the standard library, and most standard library modules
> should be in their respective repositories and have their individual
> internal release cycles.
>
> A CPython release would then weave the package together from the
> components that were previously (internally) released.

Okay. So let's suppose we strip out huge slabs of the standard library
and make an absolutely minimal "base library", with an "extended
library" that can run on its own separate release cycle. (This has
already been discussed; the biggest problems with the idea aren't
technical, but logistical - not just for the Python devs but for
everyone who has to get approval for software upgrades.) Let's suppose
the base library consists of just the modules necessary for a basic
invocation:

rosuav at sikorsky:~$ python3 -c 'import sys; print(sorted(sys.modules.keys()))'
['__main__', '_codecs', '_collections_abc', '_frozen_importlib',
'_frozen_importlib_external', '_imp', '_io', '_signal',
'_sitebuiltins', '_stat', '_sysconfigdata', '_thread', '_warnings',
'_weakref', '_weakrefset', 'abc', 'builtins', 'codecs', 'encodings',
'encodings.aliases', 'encodings.latin_1', 'encodings.utf_8', 'errno',
'genericpath', 'io', 'marshal', 'os', 'os.path', 'posix', 'posixpath',
'site', 'stat', 'sys', 'sysconfig', 'zipimport']
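
The same check can be scripted, which makes it easier to compare
interpreters. This is only a rough sketch: the helper name
startup_modules is made up for illustration, and the list it returns
will differ between Python versions and platforms, so don't expect it
to match the output above exactly.

```python
import subprocess
import sys


def startup_modules(python=sys.executable):
    """Return the sorted names of modules a fresh interpreter has loaded.

    `python` is the interpreter to launch; it defaults to the one
    running this script (an assumption made for convenience).
    """
    # Take the snapshot before importing anything else in the child,
    # so the measurement does not inflate the list.
    code = "import sys; print('\\n'.join(sorted(sys.modules)))"
    result = subprocess.run(
        [python, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


if __name__ == "__main__":
    names = startup_modules()
    print(len(names), "modules loaded before any user code runs")
    print(names)
```

Passing a different interpreter path as the argument gives a quick
comparison of how the minimal set has changed between versions.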

Alright. Can you rewrite all of those modules in three months? Not
even digging into the language itself, just the base library. This is
the bare minimum to get a viable Python execution environment going
(you might be able to cut it down a bit, but not much), so it can't be
modularized into separate projects.

And then there's the language itself. The cpython/Python directory has
58 .c files, many of which are closely tied to each other. The
cpython/Objects directory has another 39, representing specific object
types (bytes, tuple, range, method) that are implemented in C. And
cpython/Parser has 17 more just to handle the language parser. Edits
often affect multiple files and must be kept in sync. How would you
modularize that out? Which part would you spin off as a separate
project with its own release cycle? The garbage collector? The string
object? The peephole optimizer? The import machinery? Each of these is
already too big to rewrite in three months, plus they're fairly
tightly linked to all the other modules. All that code represents the
accumulation of hundreds of thousands of fixes to prevent tens of
millions of bugs (some of which will be visible on bugs.python.org,
but most would have been found and prevented during early testing);
throwing the code away means throwing all that away.

http://www.joelonsoftware.com/articles/fog0000000069.html

I don't agree with everything Joel says, but seriously, do not waste
your time with a full rewrite - even in theory. And I can say this
from hard experience on both sides. I have an active project for a MUD
server, which was originally deployed as a byte-oriented service (it
took ASCII-compatible bytes from clients and sent those same octets
out to other clients). When I decided that the server should work with
Unicode text internally (expecting and transmitting UTF-8), I kept on
coming across stupid problems where the code had been written with
faulty assumptions, and I had to keep on fixing those. Would it have
been better to throw the code away and start over? Well, let me tell
you, it would certainly have made the Unicode handling a lot easier,
so if you're looking at starting your own project, make sure you learn
from my hassles and bake in Unicode support from the start! But that
would have meant throwing away all the bugfixes for all the bugs that
I'd noticed across the years, such as:

1) On login, typing "quit" when prompted for a user name or password
would log you out. The "passwd" (change password) command had to also
prevent you from *setting* your password to "quit", because that would
effectively lock your account against login.

2) Some clients send backspace as 08; others send 7F (DEL); some send
08 20 08. Cope with them all.

3) If a bug prevents the admin account from working, there needs to be
a way to diagnose and fix that code using shell access to the back-end
server, without needing the actual admin account.

Etcetera, etcetera, etcetera. There's no way to "rewrite but keep all
the bug fixes". That's called not rewriting.
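
To give a feel for how small each of these fixes is in isolation, here
is a minimal sketch of fixes 1 and 2 above. It is not the actual server
code: the names decode_line, password_allowed, ERASE_CHARS and
RESERVED_WORDS are hypothetical, the choice of 08 and 7F as the erase
bytes is an assumption, and so is treating "quit" case-insensitively.

```python
# Characters treated as "erase the previous character".
# Assumption: BS (0x08) and DEL (0x7F); adjust for your clients.
ERASE_CHARS = {"\x08", "\x7f"}

# Words that act as commands at the login prompt, and so must never be
# accepted as a password.
RESERVED_WORDS = {"quit"}


def decode_line(raw: bytes) -> str:
    """Decode one line of client input and apply any backspaces in it."""
    # Decode first, so that a backspace removes a whole character
    # rather than one byte of a multi-byte UTF-8 sequence.
    text = raw.decode("utf-8", errors="replace")
    kept = []
    for ch in text:
        if ch in ERASE_CHARS:
            if kept:  # a backspace on an empty line is ignored
                kept.pop()
        else:
            kept.append(ch)
    return "".join(kept)


def password_allowed(candidate: str) -> bool:
    """Reject passwords that the login prompt would read as a command."""
    # Assumption: the login prompt matches "quit" case-insensitively.
    return candidate.strip().casefold() not in RESERVED_WORDS
```

The 08 20 08 sequence needs no special case here: the first 08 erases
the previous character, the space is appended, and the second 08 erases
the space again. Each rule is a few lines; the hard part was finding
out that it was needed.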

So how can you rewrite *any* large project in three months?

ChrisA


