Python advocacy in scientific computation

Robert Kern robert.kern at gmail.com
Sun Mar 5 00:00:15 EST 2006


sturlamolden wrote:
> Michael Tobis skrev:
> 
> Being a scientist, I can tell you that your not getting it right. If
> you speak computer science or business talk no scientist are going to
> listen. Lets just see how you argue:

I see we've forgone the standard conventions of politeness and gone straight for
the unfounded assumptions. Fantastic.

>>These include: source and version control and audit trails for runs,
>>build system management, test specification, deployment testing (across
>>multiple platforms), post-processing analysis, run-time and
>>asynchronous visualization, distributed control and ensemble
>>management.
> 
> At this point, no scientist will no longer understand what the heck you
> are talking about. All have stopped reading and are busy doing
> experiments in the laboratory instead. Perhaps it sound good to a CS
> geek, but not to a busy researcher.
> 
> Typically a scientist need to:
> 
> 1. do a lot of experiments
> 
> 2. analyse the data from experiments
> 
> 3. run a simulation now and then

Being a one-time scientist, I can tell you that you're not getting it right. You
have an extremely myopic view of what a scientist does. You seem to be under the
impression that all scientists do exactly what you do. When I was in the
geophysics program at Scripps Institution of Oceanography, almost no one was
doing experiments. The closest were the people who were building and deploying
instruments. In our department, typically a scientist would:

1. Write grant proposals.

2. Advise and teach students.

3. Analyze the data from the last research cruise/satellite passover/earthquake.

4. Do some simulations.

5. Write a lot of code to do #3 and #4.

There are whole branches of science where the typical scientist usually spends a
lot of his time in #5. Michael Tobis is in one of those branches, and his
article was directed to his peers. As he clearly stated.

You are not from one of those branches, and you have different needs. That's
fine, but please don't call the kettle black.

> Thus, we need something that is "easy to program" and "runs fast
> enough" (and by fast enough we usually mean extremely fast). The tools
> of choice seems to be Fortran for the older professors (you can't teach
> old dogs new tricks) and MATLAB (perhaps combined with plain C) for the
> younger ones (that would e.g. be yours truly). Hiring professional
> programmers are usually futile, as they don't understand the problems
> we are working with. They can't solve problems they don't understand.

I call shenanigans. Believe me, I would love it if it were true. I make my
living writing scientific software. For a company where half of us have science
degrees (myself included) and the other half have CS degrees, it would be great
advertising to say that none of those other companies could ever understand the
problems scientists face. But it's just not true.

Scientists are an important part of the process, certainly. They're called
"customers." Their needs drive the whole process. The depth and breadth of their
knowledge of the field and the particular problem are necessary to write good
scientific software. But it usually doesn't take particularly deep or broad
knowledge to write a specific piece of software. Once the scientist can reduce
the problem to a set of requirements, the CS guys are a perfect fit. That's what
a good professional programmer does: take requirements and produce software that
fulfills those requirements. They do the same thing regardless of the field. In
my company, everyone pulls their weight, even the person with the philosophy degree.

At that point, the CS skillset is perfectly suited to writing good scientific
software. Or at least, any given CS-degree person is no less likely to have the
appropriate skillset than a science-degree person. Frequently, they have a much
broader and deeper skillset that is actually useful to writing scientific
software. Most of the scientists I know couldn't write robust floating-point
code to save their lives. Or their careers.
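
To make "robust floating-point code" a little more concrete, here is a minimal
sketch of one classic pitfall: summing values of very different magnitude. It
compares a plain left-to-right loop with the standard library's math.fsum(),
which tracks the rounding error as it goes. The helper name naive_sum and the
sample data are invented for illustration; they are not from any real code base.

```python
import math

def naive_sum(values):
    """Add values left to right, the way a hand-written loop would."""
    total = 0.0
    for v in values:
        total += v
    return total

# One huge value, a thousand small ones, then the huge value taken away again.
# Mathematically the answer is 1000.
data = [1e16] + [1.0] * 1000 + [-1e16]

naive = naive_sum(data)   # every +1.0 is rounded away next to 1e16
exact = math.fsum(data)   # error-tracking summation keeps the small terms

print(naive, exact)       # prints: 0.0 1000.0
```

Near 1e16 adjacent doubles are 2 apart, so each added 1.0 is lost to rounding
and the naive loop ends at exactly zero.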

> What you really ned to address is something very simple:
> 
>     Why is Python better a better Matlab than Matlab?
> 
> The programs we need to write typically falls into one of three
> categories:
> 
> 1. simulations
> 2. data analysis
> 3. experiment control and data aquisition
> 
> (that are words that scientists do know)
> 
> In addition, there are 10 things you should know about scientific
> programming:
> 
> 1. Time is money. Time is the only thing that a scientist cannot afford
> to lose. Licensing fees for Matlab is not an issue. If we can spend
> $1,000,000 on specialised equipment we can pay whatever Mathworks or
> Lahey charges as well. However, time spent programming are an issue.
> (As are time time spend learning a new language.)
> 
> 2. We don't need fancy GUIs. GUI coding is a waste of time we don't
> have. We don't care if Python have fancy GUI frameworks or not.

Uh, time is money? Fighting unusable interfaces, GUI or otherwise, is a waste of
resources. My brother works in biostatistics at the NIH. Every once in a while,
the doctors he works for will ask him to do a particular analysis which requires
him to use a particularly unusable piece of software. Every time, he has to
spend half a day setting up the problem. This is why he's the one who gets to do
it instead of the doctors.

Now, he's considering rewriting the program in Python with a GUI that will
essentially provide a Big Red Go Button (TM) so the doctors can do the analysis
in a fraction of the time it takes now.

> 3. We do need fancy data plotting and graphing. We do need fancy
> plotting and graphing that are easy to use - as in Matlab or S-PLUS.
> 
> 4. Anything that has to do with website development or enterprise class
> production quality control are crap that we don't care about.

There are quite a few scientists who are managing gigantic amounts of data, and
run experiments/observations/what-have-you so large that they need
multi-institutional participation. Sharing that data in an efficient manner
*does* require good dynamic websites and enterprise class software backing it up.

There are more kinds of scientist in Heaven and Earth than are dreamt of in your
philosophy.

> 5. Versioning control? For each program there is only one developer and
> a single or a handful users.

I used to think like that up until two seconds before I entered this gem:

  $ rm `find . -name "*.pyc"`

Okay, I didn't type it exactly like that; I was missing one character. I'll let
you guess which.
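
For comparison, the same cleanup can be sketched in Python with a dry-run step,
so the list of files can be inspected before anything is deleted. This is only
a sketch using the standard library's pathlib; the function name remove_pyc and
its dry_run parameter are made up for this example.

```python
from pathlib import Path

def remove_pyc(root=".", dry_run=True):
    """Collect *.pyc files under root; delete them only when dry_run is False."""
    targets = sorted(p for p in Path(root).rglob("*.pyc") if p.is_file())
    for path in targets:
        if not dry_run:
            path.unlink()
    return targets
```

Calling remove_pyc() only reports what would go; remove_pyc(dry_run=False)
actually deletes. A pattern that matches too much then shows up in the report
before it shows up as missing source files.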

This is one thing that a lot of people seem to get wrong: version control is not
a burden on software development. It is a great enabler of software development.
It helps you get your work done faster and easier even if you are a development
team of one. You can delete code (intentionally!) because it's no longer used
in your code, but you won't lose it. You can always look at your history and get
it again. You can make sweeping changes to your code, and if that experiment
fails, you can go back to what was working before. Now, you can do this by making
copies of your code, but that's annoying, clumsy, and more effort than it's
worth. Version control makes the process easier and lets you do more interesting
things.

I would go so far as to say that version control enables the application of the
scientific method to software development. When you are in lab, do you say to
yourself, "Nah, I won't write anything in my lab notebook. If the experiment
works at the end of the day, only that result matters"?

> 6. The prototype is the final version. We are not making software for a
> living, we are doing research.

I have lots of research code on my hard drive with decade-long changelogs that
give the lie to that statement. If the code is useful now, it will probably
still be useful in a few years. People will add to it, make suggestions, build
on your work.

This is how science is supposed to work. Practices which encourage this behavior
are good things for science.

> 7. "My simulation is running to slowly" is the number ONE complaint.
> Speed of excecution is an issue, regardless of what computer science
> folks try to tell you. That is why we spend disproportionate amount of
> time learning to vectorize Matlab code.
> 
> 8. "My simulation is running of of memory" is the number TWO complaint.
> Matlab is notoriously known for leaking memory and fragmenting the
> heap.
> 
> 9. What are algorithms and data structures? Very few of us knows how to
> use a datastructure more complicated than an array. That is why we like
> Matlab and Fortran so much.

Yes, and this is why you will keep saying, "My simulation is running too
slowly," and "My simulation is running out of memory." All the vectorization you
do won't make a quadratic algorithm run in O(n log(n)) time. Knowing the right
algorithm and the right data structures to use will save you programming time
and execution time. Time is money, remember, and every hour you spend tweaking
Matlab code to get an extra 5% of speed is just so much grant money down the drain.
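
A small, hedged illustration of the point about algorithms: finding the
smallest gap between any two measurements. The first version compares every
pair and is quadratic; the second sorts once and looks only at neighbours,
which is O(n log(n)). Both function names and the sample values are invented
for this sketch.

```python
def min_gap_quadratic(values):
    """Compare every pair of values: O(n**2) comparisons."""
    best = float("inf")
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            best = min(best, abs(values[i] - values[j]))
    return best

def min_gap_sorted(values):
    """Sort once, then compare adjacent values only: O(n log n)."""
    ordered = sorted(values)
    gaps = (b - a for a, b in zip(ordered, ordered[1:]))
    return min(gaps, default=float("inf"))

samples = [7.5, 1.25, 9.0, 3.0, 1.0, 6.0]
print(min_gap_quadratic(samples), min_gap_sorted(samples))  # prints: 0.25 0.25
```

No amount of vectorizing the first version changes how its cost grows with the
number of values; switching to the second does.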

That said, we have an excellent array object far superior to Matlab's.

  http://numeric.scipy.org/

> 10. We are novice programmers. We are not passionate programmers. We
> take no pride in our work. The easier hack the better. We don't care if
> we are doing OOP or not. However, we do hate complicated APIs or APIs
> that look funny. We are used to seeing sin(x) in our calculus textbooks
> and because of that we don't find Math.Sin(x) particularly elegant --
> even though Math.Sin(x) is more OOP and sin(x) clutters the global
> namespace.
> 
> Now please go ahead and tell me how Python can help me become a better
> scientist. And try to steer clear of the computer science buzzwords
> that don't mean anyting to me.

1. You will probably spend less time writing and running software.

2. If you play your cards right, more people will be able to use and improve
your software.

-- 
Robert Kern
robert.kern at gmail.com

"In the fields of hell where the grass grows high
 Are the graves of dreams allowed to die."
  -- Richard Harter



