Scientific Software: Get More Done with Less Pain

June 22, 2010

Greg Wilson

Anyone reading SIAM News knows that computers are as important to modern science as telescopes and test tubes. From analyzing climate data to modeling the interior of a cell, computers allow scientists to study problems that are too big, too small, too fast, too slow, too expensive, or too dangerous to tackle in the lab.

Readers will also know that scientists are almost never taught how to use computers effectively. After a generic first-year programming course, most science students are expected to figure out the rest of software engineering on their own. This is as unfair and ineffective as teaching students addition and long division, then expecting them to figure out calculus without any more help.

The now-infamous HARRY_READ_ME.txt file from the Climatic Research Unit at the University of East Anglia, which chronicles one researcher's struggles with legacy software, is one example of the hidden cost of scientists' lack of skills. Another is the all-too-familiar problem of research code that's thrown away because a grad student has moved on and no one knows where things are or how to make them work. These costs have never been quantified, but it's clear that they are large and growing.

We clearly need to do better. Even setting aside the economic savings to be gained from improving the way we work, if we want to call the things we do "science," we need to teach people better ways of creating, validating, sharing, and tracking code and results.

Unfortunately, many educational efforts are driven by people who believe that "scientific computing" is synonymous with "high-performance computing." With a few laudable exceptions (such as The MathWorks), most companies and funding agencies prefer to focus on high-end projects at the expense of smaller-scale work that is less glamorous, but many times more common. As a result, scientists who have never used a version control system or unit testing framework are told that they should be learning MPI or OpenMP. This is as fair as putting brand new drivers in 18-wheelers, setting them loose on the highway, and then blaming them when they crash. Based on a 2008 survey of how scientists use computers [1,2], it's clear that the real "grand challenge" in scientific computing isn't petascale this or cloud that; it's giving scientists with problems of all scales a solid foundation so that they can tackle leading-edge problems without heroic effort.
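For readers who have never met a unit testing framework, the sketch below shows what one looks like in practice, using Python's standard unittest module. The function being tested is a made-up illustration, not something from the course: the point is that each check is a small, named, repeatable experiment on the code.

```python
import unittest

def running_mean(values):
    """Return the cumulative mean of a sequence of numbers."""
    means, total = [], 0.0
    for i, v in enumerate(values, start=1):
        total += v
        means.append(total / i)
    return means

class TestRunningMean(unittest.TestCase):
    def test_empty_input_gives_empty_output(self):
        self.assertEqual(running_mean([]), [])

    def test_constant_sequence_has_constant_mean(self):
        self.assertEqual(running_mean([2.0, 2.0, 2.0]), [2.0, 2.0, 2.0])

    def test_increasing_sequence(self):
        self.assertEqual(running_mean([1.0, 2.0, 3.0]), [1.0, 1.5, 2.0])

# Run the checks with:  python -m unittest this_file.py
```

Once tests like these exist, they can be re-run automatically after every change, which is exactly the kind of safety net most scientists are never shown how to build.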

But where is the time for this to come from? Every undergraduate science curriculum is already over-full, and graduate school is no better. If we want to teach geologists more about computing, what do we drop to make room: thermodynamics or mineralogy? "We'll just work a little into each course" is a fudge: Five minutes out of each lecture in a standard four-year program still works out to four courses. The only way to make this work is to demonstrate that investing time in computing skills will reliably save at least as much time later on.
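The back-of-envelope arithmetic behind "four courses" can be spelled out. The program size and lecture lengths below are assumptions about a typical four-year degree, not figures from the article:

```python
# Back-of-envelope check of "five minutes per lecture adds up to four courses".
# Every parameter here is an assumption for illustration.
courses_in_program = 40       # courses in a four-year degree (assumed)
lectures_per_course = 36      # three lectures a week for a 12-week term (assumed)
minutes_borrowed = 5          # time taken from each lecture

total_hours = courses_in_program * lectures_per_course * minutes_borrowed / 60
hours_per_course = lectures_per_course * 50 / 60   # a 50-minute lecture "hour" (assumed)
equivalent_courses = total_hours / hours_per_course

print(equivalent_courses)  # → 4.0
```

Under these assumptions, five borrowed minutes per lecture really does total about four full courses' worth of lecture time.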

Since 1997, my colleagues and I have been doing exactly that with a course called Software Carpentry, the aim of which is to give scientists the concepts and skills they need to use computers more effectively in their research. (We call it "carpentry" rather than "engineering" because it teaches the equivalent of putting an extension on the house, rather than digging the Channel Tunnel.) This training has consistently had an immediate and dramatic impact on participants' productivity: it makes their current work less onerous, and allows them to tackle larger and more complicated problems than they ever could before.

The course materials are available online under an open license [3], and have been viewed by more than 140,000 people from 70 countries. While the course has changed shape several times, the topics currently covered include:

* Program design
* Version control
* Task automation
* Agile development
* Provenance and reproducibility
* Maintenance and integration
* User interface construction
* Testing and validation
* Working with text, XML, binary, and relational data


Despite the popularity of the course, some of this material is now out of date. Additionally, it was designed for lecture-style delivery, rather than for asynchronous self-study on the web. The good news is that a major upgrade is under way: Thanks to support from organizations in Canada, the U.S., and Europe [4], a complete makeover started in May 2010. The aim is to allow 90% of graduate students in science and engineering to do 90% of the course on their own, while giving them an online community to turn to when they run into roadblocks. The lessons are mapped out in [5]; a description of the kinds of people the course is designed to help can be found in [6].

Software Carpentry was always meant to be by the scientific computing community, as well as for it. If you would like to help us help scientists worldwide spend less time wrestling with software, and more time doing research, please contact us at software@software-carpentry.org.

References
[1] J.E. Hannay, H.P. Langtangen, C. MacLeod, D. Pfahl, J. Singer, and G. Wilson, How do scientists develop and use scientific software?, Proceedings of the Second International Workshop on Software Engineering for Computational Science and Engineering, May 2009.
[2] G. Wilson, "How Do Scientists Really Use Computers?" American Scientist, September/October 2009.
[3] http://software-carpentry.org.
[4] This new round of work on Software Carpentry has been made possible by the generosity of the Indiana University School of Computing and Informatics, the Gene Expression in Disease and Development Focus Group at Michigan State University, MITACS, the Centre for Digital Music at Queen Mary University of London, SciNet, SHARCNET, and the UK Met Office. We would also like to thank The MathWorks, the University of Toronto, the Python Software Foundation, and Los Alamos National Laboratory, whose support has allowed us to help thousands of scientists use computers more productively.
[5] http://softwarecarpentry.wordpress.com/a-fresh-start/.
[6] http://softwarecarpentry.wordpress.com/user-profiles.

With a PhD in computer science from the University of Edinburgh, Greg Wilson has worked in high-performance scientific computing, data visualization, and computer security.



