Refactoring in a large code base

Fri Jan 22 09:08:06 EST 2016

On Sat, Jan 23, 2016 at 12:30 AM, Rustom Mody <rustompmody at gmail.com> wrote:
> You just gave a graphic vivid description...
> of the same thing Marko is describing: ;-) viz.
> A full-size language parser is something that you - an experienced developer -
> make a point of avoiding.

It's worth noting that "experienced developer" covers a huge range of
skills. There are quite a few other areas that I do not tinker with
(crypto, CPU-level optimizations, and such), not because they're
impossible to understand, but because *I* have not the skill to
understand and improve them. This does mean they're complicated
(they're beyond the "one weekend of tinkering" effort that any
serious geek should be able to invest), but I'm sure there are
language nerds out there who are so familiar with the grammar of
<insert language here> that they'll pick up CPython's grammar and make
a change with confidence that it'll do what they expect.

> So then the question comes down to this: Is this the order of nature?
> Or is it man-made disorder?
> Jury's out on that one for lexers/parsers specifically.

Lexers/parsers are as complicated as the grammars they parse. A lexer
for a simple structured text file can be pretty easy to implement; for
instance, JSON is pretty straightforward, with only a handful of
cases (insignificant whitespace, three keywords (true, false, null),
two recursive structures that start with specific characters ('{' and
'['), strings (which start with '"'), and numbers (which start with a
digit or a hyphen)), so a parser need only look for those few
possibilities and it knows exactly what to fetch next. I could probably write a JSON
parser in a fairly short space of time, and wouldn't be scared of
digging into the internals of someone else's. It's when the grammar
adds complexities to deal with the real-world issues of full-size
programming languages that it becomes hairier. The CPython grammar is
only ~150 lines of fairly readable directives, but the parser that
implements it is ~3500 lines of C code. Pike merges the two into a
YACC file of nearly 5000 lines of highly optimized code (it has
different grammar paths for things a human would consider the same, in
order to produce distinct code). That's where I'm ubercautious.
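
To make the JSON case above concrete, here is a minimal sketch in Python of a
parser that decides what to read purely from the first character of each
value. All of the names (parse_json, parse_value and so on) are made up for
illustration, it is more lenient than the spec in places, and it deliberately
skips \uXXXX escapes; for real work, use the standard json module.

```python
import re

# Hypothetical helper names throughout; a sketch, not a production parser.
_WS = re.compile(r"[ \t\n\r]*")
_NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")
_KEYWORDS = {"true": True, "false": False, "null": None}
_ESCAPES = {'"': '"', "\\": "\\", "/": "/", "b": "\b", "f": "\f",
            "n": "\n", "r": "\r", "t": "\t"}


def _skip_ws(s, i):
    return _WS.match(s, i).end()


def _expect(s, i, ch):
    if s[i:i + 1] != ch:
        raise ValueError(f"expected {ch!r} at index {i}")
    return i + 1


def parse_value(s, i):
    """Parse one value at or after s[i]; return (value, next_index)."""
    i = _skip_ws(s, i)
    c = s[i:i + 1]
    if c == "{":
        return parse_object(s, i + 1)
    if c == "[":
        return parse_array(s, i + 1)
    if c == '"':
        return parse_string(s, i + 1)
    if c and c in "-0123456789":
        m = _NUMBER.match(s, i)
        if m is None:
            raise ValueError(f"bad number at index {i}")
        text = m.group()
        return (float(text) if set(text) & set(".eE") else int(text)), m.end()
    for word, value in _KEYWORDS.items():
        if s.startswith(word, i):
            return value, i + len(word)
    raise ValueError(f"unexpected input at index {i}")


def parse_string(s, i):
    """Parse a string body; i points just past the opening quote."""
    out = []
    while i < len(s):
        c = s[i]
        if c == '"':
            return "".join(out), i + 1
        if c == "\\":
            # Sketch-level: \uXXXX escapes are deliberately not handled.
            c = _ESCAPES.get(s[i + 1:i + 2])
            if c is None:
                raise ValueError(f"unsupported escape at index {i}")
            i += 1
        out.append(c)
        i += 1
    raise ValueError("unterminated string")


def parse_array(s, i):
    """Parse array items; i points just past the opening '['."""
    items = []
    i = _skip_ws(s, i)
    if s[i:i + 1] == "]":
        return items, i + 1
    while True:
        value, i = parse_value(s, i)
        items.append(value)
        i = _skip_ws(s, i)
        if s[i:i + 1] == "]":
            return items, i + 1
        i = _expect(s, i, ",")


def parse_object(s, i):
    """Parse object members; i points just past the opening '{'."""
    obj = {}
    i = _skip_ws(s, i)
    if s[i:i + 1] == "}":
        return obj, i + 1
    while True:
        key, i = parse_string(s, _expect(s, _skip_ws(s, i), '"'))
        i = _expect(s, _skip_ws(s, i), ":")
        value, i = parse_value(s, i)
        obj[key] = value
        i = _skip_ws(s, i)
        if s[i:i + 1] == "}":
            return obj, i + 1
        i = _expect(s, i, ",")


def parse_json(s):
    value, i = parse_value(s, 0)
    if _skip_ws(s, i) != len(s):
        raise ValueError(f"trailing data at index {i}")
    return value
```

The whole grammar fits in one dispatch function plus one helper per structure,
which is roughly why a JSON parser is a weekend-sized job and a full
programming language parser is not.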

> For arbitrary code in general, the problem that it may be arbitrarily and unboundedly
> complex/complicated is the oldest problem in computer science: the halting problem.
>
> IOW anyone who thinks that *arbitrary* complexity can *always* be tamed either
> has a visa to utopia or needs to re-evaluate (or get) a CS degree

Exactly.

ChrisA