Super Computing and BioInformatics

Gene Myers, University of California, Berkeley

We give an overview of the whole-genome shotgun sequencing approach to determining the DNA sequence of a species. For a genome the size of human or mouse, this is a large combinatorial problem that involves millions of sequencing reads and billions of bases and requires thousands of CPU hours to solve.

Most computation in bioinformatics is embarrassingly parallel. Despite this, computer clusters and associated disk systems that appropriately optimize data distribution are not commonly available. We expose the basic issue, present a cluster we have configured at Berkeley to optimally distribute files, and discuss its performance on a number of compute-intensive bioinformatics problems: whole-genome assembly and whole-genome comparisons.

Now that we have the genomes of human and mouse, and with many more on the way, the challenge becomes the interpretation of these genomes with the goal of understanding the first-order functioning of the cell. We outline a coherent program, based on the model organism Drosophila, that promises to deliver an empirically verified and exhaustive annotation of the transcripts and cis-regulatory elements in the genome. Building and simulating models that capture this data and elucidate the behavior of the cell (including development, differentiation, and cell signaling) is a major computational challenge.

Created: 9/29/03
Modified: 10/2/03