April 01, 1997
How Perl Saved the Human Genome ProjectAdding information to a pipe-based I/O streamLincoln Stein
Lincoln is the director of informatics at the MIT Center for Genome Research, and the author of How to Set Up and Maintain a World Wide Web Site (Addison-Wesley, 1996). He can be contacted at lstein@genome.wi.mit.edu.
The human genome project was inaugurated at the beginning of the decade as an ambitious international effort to determine the complete DNA sequence of human beings and several experimental animals. The justification for this undertaking is both scientific and medical. By understanding the genetic makeup of an organism in excruciating detail, we hope to better understand how organisms develop from single eggs into complex multicellular beings, how food is metabolized and transformed into the constituents of the body, and how the nervous system assembles itself into a smoothly functioning ensemble. From the medical point of view, the wealth of knowledge that will come from knowing the complete DNA sequence will greatly accelerate the process of finding the causes of, and potential cures for, human diseases.
Six years after its inception, the genome project is ahead of schedule. Detailed maps of humans and experimental animals have been completed (mapping out the DNA using a series of landmarks is an obligatory first step before determining the complete DNA sequence). The sequence of the smallest model organismyeastis nearly complete, and the sequence of the next smallesta tiny soil-dwelling wormisn't far behind. Large-scale sequencing efforts for human DNA started at several centers several months ago and will be in full swing within the year.
The scale of the human DNA sequencing project is enough to send your average UNIX system administrator running for cover. From a descriptive standpoint, DNA is a very long string consisting of the four letters G, A, T, and C. These letters are abbreviations for the four chemical units that form the "rungs" of the DNA double-helix ladder. The goal of the project is to determine the order of letters in the string. The size of the string is impressive, but not particularly mind-boggling: 3*109 letters long, or some three gigabytes of storage space if you use one byte to store each letter and you use no compression.
Three gigabytes is substantial, but certainly manageable by today's standards. Unfortunately, this is just the storage space for the finished data. The storage requirements for the experimental data needed to determine this sequence is comparatively vast. The essential problem is that DNA sequencing technology is currently limited to reading stretches of, at most, 500 contiguous letters. In order to determine longer sequences, the DNA must be sequenced as small overlapping fragments called "reads." Then the jigsaw puzzle must be reassembled by algorithms that look for areas where the sequences match.
Because the DNA sequence is not random (similar, but not entirely identical, motifs appear many times throughout the genome) and because DNA sequencing technology is noisy and error-prone, each region of DNA must be sequenced five to ten times in order to reliably assemble the reads into the true sequence. This increases the amount of data to manage by an order of magnitude. In addition, all the associated information that goes along with laboratory work must be stored: who performed the experiment, when it was performed, which section of the genome was sequenced, the identity and version of the software used to assemble the sequence, any comments to be attached to the experiment, and so forth. Generally, one also wants to store the raw output from the machine that performs the sequencing itself. Each 500 letters of sequence generates a data file that's 2030 KB long!
That's not the whole of it. It's not enough just to determine the sequence of the DNA. Within the sequence are functional areas scattered among long stretches of nonfunctional areas. There are genes, control regions, structural regions, and even a few viruses that got entangled in human DNA long ago and persist as fossilized remnants. Because the genes and control regions are responsible for health and disease, they have to be identified and marked as the DNA sequence is assembled. This type of annotation generates even more data that needs to be tracked and stored.
Altogether, it is estimated that some one to ten terabytes of information will be generated by the conclusion of the human genome project.
So What's Perl Got to Do With It?From the beginning, researchers realized that informatics would have to play a large role in the genome project. An informatics core formed an integral part of every genome center that was created. The mission of this core was twofold: to provide computer support and database services for their affiliated laboratories, and to develop data analysis and management software for use by the genome community as a whole.
Initial results of the informatics groups' efforts were mixed. Things were slightly better on the laboratory-management side of the coin. Some groups attempted to build large, monolithic systems on top of complex relational databases. However, they were thwarted time and again by the highly dynamic nature of biological research. By the time the software engineers had designed, implemented, and debugged a system that could deal with the ins and outs of a complex laboratory protocol, new technology had superseded the protocol, and the software engineers had to go back to the drawing board.
Most groups, however, learned to build modular, loosely coupled systems with parts that could be swapped in and out without retooling the entire system. In my group, for example, we discovered that many data-analysis tasks involve a sequence of semi-independent steps.
Consider the steps that may be performed on a bit of newly sequenced DNA. First, there's a basic quality check on the sequence: Is it long enough, and are the number of ambiguous letters below the maximum limit? Then, there's the "vector check." For technical reasons, the human DNA must be passed through a bacterium before it can be sequenced (this is the process of "cloning"). Not infrequently, the human DNA gets lost somewhere in the process, and the sequence that's read consists entirely of the bacterial vector. The vector check ensures that only human DNA gets into the database.
Next, there's a check for repetitive sequences. Human DNA is full of repetitive elements that make fitting the sequencing jigsaw puzzle together challenging. The repetitive-sequence check tries to match the new sequence against a library of known repetitive elements. The penultimate step is to attempt to match the new sequence against other sequences in a large community database of DNA sequences. Often, a match at this point will provide a clue to the function of the new DNA sequence. After performing all these checks, the sequence (along with the information that's been gathered about it along the way) is loaded into the local laboratory database.
The process of passing a DNA sequence through these independent, analytic steps looks like a pipeline, and we realized that a UNIX pipe could handle the job. We developed a simple Perl-based data-exchange format called "boulderio" that allowed loosely coupled programs to add information to a pipe-based I/O stream. Boulderio is based on tag/value pairs. A Perl module makes it easy for programs to reach into the input stream, pull out only the tags it is interested in, do something with them, and drop new tags into the output stream. Tags that the program isn't interested in are just passed through to standard output so that other programs in the pipeline can get to them.
Using this type of scheme, the process of analyzing a new DNA sequence looks something like Example 1. This is not exactly the set of scripts that we use, but it's close enough. A file containing the new DNA sequence is processed by the Perl script name_sequence.pl; its only job is to give the sequence a new unique name and to put it into boulderio format. Figure 1 shows sample output from name_sequence.pl.
Example 1: Using several scripts to analyze a new DNA sequence.
name_sequence.pl < new.dna | quality_check.pl | vector_check.pl |\
find_repeats.pl | search_big_database.pl | load_lab_database.pl
Figure 1: Sample output from name_sequence.pl.
use Boulder::Stream;
$stream = new Boulder::Stream;
while ($record = $stream->read_record('NAME','SEQUENCE')) {
$name = $record->get('NAME');
$sequence = $record->get('SEQUENCE');
.
[continue processing]
.
$record->add(QUALITY_CHECK=>"OK");
$stream->write_record($record);
}
One interesting occurrence is that multiple informatics groups independently converged on solutions that were similar to the boulderio idea. For example, several groups involved in the worm sequencing project began using a data-exchange format called ".ace." Although this format was initially designed as the data dump and reload format for the ACE database (a specialized database for biological data), it happens to use a tag/value format that's very similar to boulderio. Soon, .ace files were being processed by Perl script pipelines and loaded into the ACE database at the very last step.
Perl was also useful in other aspects of laboratory management. For example, many centers (including my own) use web-based interfaces for displaying the status of projects and allowing researchers to take action. Perl scripts are the perfect engine for web CGI scripts. Similarly, Perl scripts run e-mail database query servers, supervise cron jobs, prepare nightly reports summarizing laboratory activity, create instruction files to control robots, and handle almost every other information management task that a busy genome center needs.
As far as laboratory management went, the informatics cores were reasonably successful. The systems-integration picture, however, was not so rosy.
The Challenge of Systems IntegrationThe problem will be familiar to anyone who has worked on a large, loosely organized software project. Despite the best intentions, the project begins to drift. Programmers go off to work on ideas that interest them, modules that need to interface with one another are designed independently, and the same problems get solved several times in different, mutually incompatible ways. When the time comes to put all the parts together, nothing works. This happened in the genome project. Although everyone was working on the same problems, no two groups took exactly the same approach. Programs to solve a given problem were written multiple times, duplicating effort. While a given piece of software wasn't guaranteed to work better than its counterpart developed elsewhere, you could always count on it to sport its own idiosyncratic user interface and data format. A typical example is the central algorithm that assembles thousands of short DNA reads into an ordered set of overlaps. At last count, there were at least six different programs in widespread use, and no two of them use the same data input or output formats. This lack of interchangeability presents a terrible dilemma for the genome centers. Without interchangeability, an informatics group is locked into using the software that it developed in-house. If another genome center develops a better software tool to attack the same problem, a tremendous effort is required by the first center to retool their system in order to use that tool. The long-range solution to this problem is to come up with uniform data-interchange standards that genome software must adhere to. This would allow common modules to be swapped in and out easily. However, standards require time to develop, and while the various groups are involved in discussion and negotiation, there is still an urgent need to adapt existing software to the immediate needs of the genome centers. This is where Perl again came to the rescue. In February 1996, a summit was called in Cambridge, England, in part to deal with the data-interchange problem. Despite the fact that the two groups involved were close collaborators and superficially seemed to be using the same tools to solve the same problems, on closer inspection, nothing they were doing was exactly the same. The main software components in a DNA-sequencing project are:
When the time came to interchange data, however, the two groups hit a snag. Between them, they were now using two trace editors, three assemblers, two assembly editors, and (thankfully) one database. If two Perl scripts were required for each pair of components (one for each direction), as many as 62 different scripts would be needed to handle all the possible interconversion tasks. Each time the input or ouput format of one of these modules changed, 14 scripts might need to be examined and fixed.
Figure 6 shows the solution that was worked out during this meeting. The two groups decided to adopt a common data-exchange format known as CAF (the exact designation of this acronym was forgotten over the course of the meeting). CAF would contain a superset of the data that each of the analysis and editing tools needed. For each module, two Perl scripts would be responsible for converting from CAF into whatever format Module A expected ("CAF2ModuleA") and converting Module A's output back into CAF ("ModuleA2CAF"). This simplified the programming and maintenance task considerably. Now there were only 16 Perl scripts to write; when one module changed, only two scripts would need to be examined.
Figure 6: A common data-exchange format (CAF) solution.
This episode is not unique. Perl has been the solution of choice for genome centers whenever they need to exchange data or to retrofit one center's software module to work with another center's system.
ConclusionPerl has become the software mainstay for computation within genome centers as well as the glue that binds them together. Although genome informatics groups are constantly tinkering with other "high-level" languages such as Python, Tcl, and recently Java, nothing comes close to Perl's popularity. How has Perl achieved this remarkable position? Several factors are responsible.
($name,$start_position,$length) = split("\t");
Then one day, the input file's format changes without warning. Someone bumps the field count up by one by sticking a comment field between the name and the first integer. Now the unknowing script assigns a string to a variable that's expected to be numeric and silently discards the last field on the line. Rather than crashing or returning an error code, the script merrily performs integer arithmetic on a string, assuming a value of zero for the string (unless it happens to start with a digit). Although the calculation is meaningless, the output may look perfectly good, and the error may not be caught until much later.
In short, when the genome project was foundering in a sea of incompatible data formats, rapidly changing techniques, and monolithic data-analysis programs, Perl saved the day. Though it's not perfect, Perl seems to meet the needs of the genome centers remarkably well, and is usually the first tool we turn to when we have a problem to solve.
|
|
|||||||||||||||||||||
|
|
|
|