Johns Hopkins Gazette | October 6, 2003


The newspaper of The Johns Hopkins University | October 6, 2003 | Vol. 33 No. 6

Searchable database of human proteins unveiled

International science team says online catalog will change way biology is done

By Joanna Downer
Johns Hopkins Medicine

Like expert curators who verify and create catalogs of the world's great art collections, an international team of scientists has developed a human protein database they say will change the way biology is done. The team unveils the online Human Protein Reference Database in the October issue of Genome Research.

The database, which currently contains scientist-compiled entries on the 3,000 most-studied human proteins, including their known roles in health and disease, is expected to hold comprehensive information on 10,000 human proteins by year's end. Importantly, this database includes known interactions between proteins, creating a web that ties separate discoveries together.

"This is the real beginning of systems biology in the human," said principal investigator Akhilesh Pandey, assistant professor in the McKusick-Nathans Institute of Genetic Medicine at Johns Hopkins. "We wanted to make the best human protein database ever, so research could go faster and available information could be easier to find and easier to organize."

Pandey said advances in technology have made getting data much easier, but processing it and interpreting observations are now the big hurdles in laboratories.

"It has remained difficult to put together a big picture of biology, to see how one set of observations intersects with and complements others," he said. "With this single database, biologists now will be able to quickly review what is known about the proteins and how they interact, speeding the creation of new hypotheses to test in the lab."

The 3,000 proteins currently in the database are known to interact with anywhere from tens to hundreds of other proteins. Online, a user can pull up a visual web of protein-protein interactions with just the click of a mouse.

"The entries have been critically reviewed, making the information in the database as accurate and complete as possible," Pandey said. "Scientists can even link directly to the scientific paper behind an item, to judge for themselves its validity."

To create the database entries, dozens of trained biologists, most at the Institute of Bioinformatics in India, started with the database Online Mendelian Inheritance in Man, the offspring of a paper catalog of disease genes started in 1966 by Victor A. McKusick, University Professor of Medical Genetics at Johns Hopkins.

Focusing on these genes' proteins, the scientists critically reviewed hundreds of thousands of scientific papers, making connections between papers and resolving inconsistencies--something automated computer programs cannot do, Pandey said. They also pulled information from smaller, existing databases to complete each protein's entry.

"We believe that manual curation--lots of scientists poring through the literature--is the key to building a more accurate and more complete database," said Pandey, who serves as chief scientific adviser to the Institute of Bioinformatics. "Eventually, we hope the database will be managed by the larger community of scientists because it will be most useful if those who know these proteins best take responsibility for keeping entries up-to-date and accurate."

The database currently contains everything that's known about proteins involved in diseases, such as the products of the so-called breast cancer genes BRCA1 and BRCA2, and proteins in key pathways, such as families of enzymes that modify other proteins. It includes only experimentally proven or widely accepted facts about the proteins, without mixing in computer-generated predictions the way some other databases do, Pandey said.

The online database is also easy to use, in large part because those who designed it are experts in both computer science and biology, he added. A biologist looking for information about BRCA1, for example, can search by any of its names and get a single entry that contains everything--its alternative names, structure, function and sequence, how it's modified, other proteins with which it interacts, where it's found in cells, where it's found in the body and links to the papers that say so.

Aravinda Chakravarti, director of the McKusick-Nathans Institute and a co-author on the paper, said, "The richness of the database is astounding, since it was created in such a short time by expert reviews of individual publications. This would have been impossible without scientists to review the literature and computational biologists to make a database that is truly easy to use."

Academic researchers will have free access to the database. Johns Hopkins Licensing and Technology Development is currently establishing licensing criteria for companies interested in using the database. The database has been active for five months and has elicited almost 2 million hits, simply from word of mouth and presentations at scientific meetings, Pandey said.

The Human Protein Reference Database was built using freely available computer code, so-called open source, from ZOPE (Z Object Publishing Environment), which experts at the Institute of Bioinformatics adjusted to fit the project's needs. One of the benefits of using an object-oriented structure like ZOPE, Pandey said, is that there's no limit on the number of entries (i.e., proteins) or the number of characteristics that can be included.
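The idea of an object-oriented structure with no fixed limit on entries or characteristics can be sketched in plain Python. This does not use ZOPE and is not the project's code; the class names, method names and the characteristic keys are hypothetical.

```python
class ProteinRecord:
    """Hypothetical record whose characteristics are open-ended."""

    def __init__(self, name: str) -> None:
        self.name = name
        # No fixed schema: any characteristic can be attached by key.
        self.characteristics: dict[str, object] = {}

    def set(self, key: str, value: object) -> None:
        self.characteristics[key] = value


class ProteinStore:
    """Hypothetical container that grows with each new protein."""

    def __init__(self) -> None:
        self._records: dict[str, ProteinRecord] = {}

    def get_or_create(self, name: str) -> ProteinRecord:
        # Create the record on first use, then reuse it afterwards.
        if name not in self._records:
            self._records[name] = ProteinRecord(name)
        return self._records[name]

    def __len__(self) -> int:
        return len(self._records)


store = ProteinStore()
record = store.get_or_create("BRCA1")
# Keys and values here are illustrative placeholders only.
record.set("localization", "placeholder value")
record.set("interacts_with", ["placeholder partner"])
print(len(store), sorted(record.characteristics))
```

A new kind of characteristic is added by choosing a new key, without changing the record class, which is the flexibility the paragraph above attributes to an object-oriented design.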

Authors from Johns Hopkins on the paper are Pandey, Chakravarti, Suraj Peri, Daniel Navarro, Ramars Amanchy, Troels Kristiansen, Chandra Kiran Jonnalagadda, Mads Gronborg, Nieves Ibarrola, Chi Dang, Joe Garcia, Jonathan Pevsner and Ada Hamosh. Pandey serves as chief scientific adviser to the Institute of Bioinformatics. The terms of this arrangement are being managed by The Johns Hopkins University in accordance with its conflict of interest policies.
