Johns Hopkins Magazine -- April 2000


There are 3 billion bits of data in the human genome. Steven Salzberg's quest: zeroing in on the particular stretches that make up a gene.
Pioneers of Promise

Gleaning Genetic Gold
By Melissa Hendricks

One day recently when Steven Salzberg was driving to work, he heard a science reporter on the radio opine that genetics would be the science story of the future. Salzberg wholeheartedly agrees. He is a Hopkins research professor in computer science but spends most of his time developing software that searches for genes.

"The decoding of genomes will be the basis for a sweeping revolution in science and human health," says Salzberg. "The tools that we'll develop to treat diseases, both acute (such as infections) and chronic (such as aging-related problems), will be incredibly precise and effective compared to what we've had in the past. It will be like the difference between a relatively dull knife and a laser."

But before this revolution is realized, researchers must develop tools that find genes within the wealth of DNA sequences being generated at record pace by the world's DNA sequencing machines.

The amount of DNA sequence data is doubling every 15 months, and that pace is expected to accelerate. But some experts say that extracting meaning from the data--finding the genes--is the most challenging problem in molecular biology.

Consider that the human genome consists of 3 billion bits of data, with each "bit" being one of four different nucleotide bases (abbreviated A, T, C, and G). But only certain stretches of those 3 billion nucleotides are genes, or sequences that contain the code for the production of a protein. The rest is extra stuff, playing a supporting role. Computational biologists like Salzberg try to separate the genetic wheat from the chaff, so to speak. This task is made all the more difficult by the fact that genes often are not continuous. Portions of a gene might be scattered throughout the genome.

When he is not teaching computational biology at Hopkins, Salzberg directs the bioinformatics division at The Institute for Genomic Research (TIGR), a nonprofit research organization in Rockville, Maryland. He and his colleagues have built several "gene finders," or software programs tailored for panning for genes in a variety of organisms, particularly in pathogens that afflict people.

He and Arthur Delcher, PhD '90, a computer scientist at Loyola College, in Baltimore, developed a bacterial gene finder called Glimmer. Colleagues have used Glimmer to find genes in the organisms that cause Lyme disease, syphilis, tuberculosis, and other diseases. It appears to be able to find more than 99 percent of the genes in any bacterium, Salzberg reports.

With Hopkins graduate student Mihaela Pertea, he also wrote a gene finder for Plasmodium falciparum, the parasite that causes malaria. Researchers have used the tool to identify genes for use in a novel DNA-based malaria vaccine.

Salzberg's group has posted these gene-finding tools on the World Wide Web and made them available free of charge to nonprofit organizations.

In creating his software, Salzberg uses several different computational strategies. One technique, "sequence alignment," relies upon the fact that many gene sequences are conserved across species, and uses the sequences of known genes to search for similar sequences in other organisms. It's something like using a "Rosetta stone" for genes, says Salzberg.
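To make the idea of sequence alignment concrete, here is a minimal Python sketch of one textbook approach, Smith-Waterman local alignment, which scores the best-matching stretch shared by two DNA strings. It is an illustration only, not the method used in Salzberg's software; the function name and the scoring values (match, mismatch, gap) are assumptions chosen for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b.

    The scoring values are illustrative assumptions, not tuned parameters.
    """
    cols = len(b) + 1
    prev = [0] * cols          # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            # Extend the alignment diagonally (match or mismatch) ...
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # ... or open a gap in one sequence; never drop below zero,
            # which lets a new local alignment start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Made-up sequences: a "known gene" fragment and two candidate stretches.
known = "ACGT"
similar = "GGACGTGG"    # contains the known fragment
unrelated = "TTTTTTTT"  # shares almost nothing with it

score_similar = local_alignment_score(known, similar)
score_unrelated = local_alignment_score(known, unrelated)
```

A stretch that resembles the known gene scores higher than one that does not, which is the sense in which a known sequence can serve as a "Rosetta stone" for finding its relatives.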

Other gene-finding programs employ "statistical pattern recognition," which Salzberg explains by way of analogy. Suppose someone who speaks only English is asked to distinguish spoken French from gibberish. Most people could do it, says Salzberg. They could tell that what they were hearing was French, even though they would not understand the meaning of the French words. "Language has patterns. There are numerous regularities."

The same is true for DNA. Genes contain idiosyncratic stretches of sequence that distinguish them from the surrounding blur of the total genome. These blips of sequence (usually between 2 and 12 nucleotides long) occur at certain frequencies. The frequency and distribution pattern of these sequences are clues to the presence of a gene, says Salzberg.
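As a rough illustration of how such frequency clues might be used, the Python sketch below counts short subsequences (often called k-mers) in example "gene" and "background" DNA, then scores a new stretch by how much more gene-like than background-like its k-mers are. This is a deliberately simplified stand-in, not the statistical model inside Glimmer or any other tool mentioned here; the function names, the choice of k = 3, the add-one smoothing, and the toy sequences are all assumptions made for the example.

```python
from collections import Counter
from math import log


def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def log_odds_score(window, gene_counts, background_counts, k):
    """Sum of log(P(kmer | gene) / P(kmer | background)) over the window.

    Add-one smoothing over the 4**k possible k-mers keeps unseen
    k-mers from producing a zero probability.
    """
    gene_total = sum(gene_counts.values()) + 4 ** k
    bg_total = sum(background_counts.values()) + 4 ** k
    score = 0.0
    for i in range(len(window) - k + 1):
        kmer = window[i:i + k]
        p_gene = (gene_counts[kmer] + 1) / gene_total
        p_bg = (background_counts[kmer] + 1) / bg_total
        score += log(p_gene / p_bg)
    return score


# Toy training data (made up): real tools learn from many known genes.
K = 3
gene_model = kmer_counts(["GCCGCCGCCGCC"], K)
background_model = kmer_counts(["ATATATATATAT"], K)

gene_like = log_odds_score("GCCGCC", gene_model, background_model, K)
background_like = log_odds_score("ATATAT", gene_model, background_model, K)
```

A positive score suggests the stretch looks more like the gene examples, and a negative score suggests it looks more like the background. Like the listener who recognizes French without understanding it, the score says nothing about what a candidate gene actually does.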

Just as being able to recognize French is not the same as comprehending it, locating genes is far from understanding their function. That job, explains Salzberg, he leaves to the world's biologists.