Like expert curators who verify and catalog the world's
great art collections, an international team of
scientists has developed a human protein database they say
will change the way biology is done. The team unveils the
online Human Protein Reference Database in the October
issue of Genome Research.
The database, which currently contains
scientist-compiled entries on the 3,000 most-studied human
proteins, including their known roles in health and
disease, is expected to hold comprehensive information on
10,000 human proteins by year's end. Importantly, this
database includes known interactions between proteins,
creating a web that ties separate discoveries together.
"This is the real beginning of systems biology in the
human," said principal investigator Akhilesh Pandey,
assistant professor in the
McKusick-Nathans Institute of Genetic
Medicine at Johns Hopkins. "We wanted to make the best
human protein database ever, so research could go faster
and available information could be easier to find and
easier to organize."
Pandey said advances in technology have made getting
data much easier, but processing it and interpreting
observations are now the big hurdles in laboratories.
"It has remained difficult to put together a big
picture of biology, to see how one set of observations
intersects with and complements others," he said. "With
this single database, biologists now will be able to
quickly review what is known about the proteins and how
they interact, speeding the creation of new hypotheses to
test in the lab."
The 3,000 proteins currently in the database are known
to interact with anywhere from tens to hundreds of other
proteins. Online, a user can pull up a visual web of
protein-protein interactions with just the click of a
mouse.
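As a rough illustration of the "web" idea, protein-protein interactions can be modeled as an undirected graph in which each protein points to the set of its partners. The sketch below is not the database's own code. The class `InteractionWeb`, its methods, and the protein names `PROTEIN_A` through `PROTEIN_D` are all made up for the example and carry no biological meaning.

```python
from collections import defaultdict


class InteractionWeb:
    """Undirected graph of protein-protein interactions (illustrative only)."""

    def __init__(self):
        # Maps a protein name to the set of proteins it interacts with.
        self._partners = defaultdict(set)

    def add_interaction(self, a, b):
        # Interactions are symmetric, so record the edge in both directions.
        self._partners[a].add(b)
        self._partners[b].add(a)

    def partners(self, protein):
        # .get avoids creating an empty entry for an unknown protein.
        return sorted(self._partners.get(protein, set()))

    def web_around(self, protein, depth=1):
        """Return every protein reachable within `depth` interaction steps."""
        seen = {protein}
        frontier = {protein}
        for _ in range(depth):
            reached = set()
            for current in frontier:
                reached |= self._partners.get(current, set())
            frontier = reached - seen
            seen |= frontier
        return sorted(seen - {protein})


# Placeholder data, not real interactions.
web = InteractionWeb()
web.add_interaction("PROTEIN_A", "PROTEIN_B")
web.add_interaction("PROTEIN_A", "PROTEIN_C")
web.add_interaction("PROTEIN_C", "PROTEIN_D")

print(web.partners("PROTEIN_A"))             # ['PROTEIN_B', 'PROTEIN_C']
print(web.web_around("PROTEIN_A", depth=2))  # ['PROTEIN_B', 'PROTEIN_C', 'PROTEIN_D']
```

Raising `depth` widens the view from a protein's direct partners to partners of partners, which is one plausible way a visual web could be expanded step by step.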
"The entries have been critically reviewed, making the
information in the database as accurate and complete as
possible," Pandey said. "Scientists can even link directly
to the scientific paper behind an item, to judge for
themselves its validity."
To create the database entries, dozens of trained
biologists, most at the Institute of Bioinformatics in
India, started with the database Online Mendelian
Inheritance in Man, the offspring of a paper catalog of
disease genes started in 1966 by Victor A. McKusick,
University Professor of Medical Genetics at Johns
Hopkins.
Focusing on these genes' proteins, the scientists
critically reviewed hundreds of thousands of scientific
papers, making connections between papers and resolving
inconsistencies--something automated computer programs
cannot do, Pandey said. They also pulled information from
smaller, existing databases to complete each protein's
entry.
"We believe that manual curation--lots of scientists
poring through the literature--is the key to building a
more accurate and more complete database," said Pandey, who
serves as chief scientific adviser to the Institute of
Bioinformatics. "Eventually, we hope the database will be
managed by the larger community of scientists because it
will be most useful if those who know these proteins best
take responsibility for keeping entries up-to-date and
accurate."
The database currently contains everything that's
known about proteins involved in diseases, such as
so-called breast cancer genes BRCA1 and BRCA2, and proteins
in key pathways, such as families of enzymes that modify
other proteins. It includes only experimentally proven or
widely accepted facts about the proteins, without mixing in
computer-generated predictions the way some other databases
do, Pandey said.
The online database is also easy to use, in large part
because those who designed it are experts in both computer
science and biology, he added. A biologist looking for
information about BRCA1, for example, can search by any of
its names and get a single entry that contains
everything--its alternative names, structure, function and
sequence, how it's modified, other proteins with which it
interacts, where it's found in cells, where it's found in
the body and links to the papers that say so.
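One simple way to get "search by any name, receive a single entry" is to index every known name of a protein to the same record. The following is a minimal sketch under that assumption and does not describe how the database is actually implemented. `ProteinEntry`, `ProteinIndex`, and the alias strings are hypothetical; the aliases are placeholders, not real alternative names for BRCA1.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinEntry:
    """One consolidated record per protein (fields chosen for illustration)."""
    primary_name: str
    alternative_names: list = field(default_factory=list)
    function: str = ""
    interactions: list = field(default_factory=list)
    references: list = field(default_factory=list)


class ProteinIndex:
    """Maps every name of a protein to the same entry object."""

    def __init__(self):
        self._by_name = {}

    @staticmethod
    def _normalize(name):
        # Ignore case and surrounding whitespace when matching names.
        return name.strip().lower()

    def add(self, entry):
        for name in [entry.primary_name, *entry.alternative_names]:
            self._by_name[self._normalize(name)] = entry

    def search(self, name):
        # Returns None when the name is not known.
        return self._by_name.get(self._normalize(name))


index = ProteinIndex()
brca1 = ProteinEntry(
    primary_name="BRCA1",
    alternative_names=["EXAMPLE_ALIAS_1", "example alias 2"],  # placeholders
)
index.add(brca1)

# Any registered name leads to the same single entry.
print(index.search("brca1") is index.search("Example Alias 2"))  # True
```

Because every name resolves to the same object, a correction made to the entry is visible no matter which name a user searched by.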
Aravinda Chakravarti, director of the McKusick-Nathans
Institute and a co-author on the paper, said, "The richness
of the database is astounding, since it was created in such
a short time by expert reviews of individual publications.
This would have been impossible without scientists to
review the literature and computational biologists to make
a database that is truly easy to use."
Academic researchers will have free access to the
database. Johns Hopkins Licensing and Technology
Development is currently establishing licensing criteria
for companies interested in using the database. The
database has been active for five months and has elicited
almost 2 million hits, simply from word of mouth and
presentations at scientific meetings, Pandey said.
The Human Protein Reference Database was built using
freely available computer code, so-called open source, from
ZOPE (Z Object Publishing Environment), which experts at
the Institute of Bioinformatics adjusted to fit the
project's needs. One of the benefits of using an
object-oriented structure like ZOPE, Pandey said, is that
there's no limit on the number of entries (i.e., proteins)
or the number of characteristics that can be included.
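The point about open-ended entries and characteristics can be illustrated with a small object model in which a record accepts any named characteristic rather than a fixed set of columns. This is a generic Python sketch, not ZOPE's API and not the project's code. `ProteinRecord`, `annotate`, and the sample values are assumptions made for the example.

```python
class ProteinRecord:
    """A record whose set of characteristics is open-ended (illustrative)."""

    def __init__(self, name):
        self.name = name
        # Characteristic name -> list of (value, source) pairs.
        self._characteristics = {}

    def annotate(self, characteristic, value, source=None):
        # A new characteristic needs no schema change; it is created on first use.
        # The source is stored alongside the value so it can be checked later.
        self._characteristics.setdefault(characteristic, []).append((value, source))

    def get(self, characteristic):
        return [value for value, _ in self._characteristics.get(characteristic, [])]

    def characteristic_names(self):
        return sorted(self._characteristics)


# Placeholder values, not real annotations.
record = ProteinRecord("PROTEIN_A")
record.annotate("localization", "example compartment", source="example citation 1")
record.annotate("modification", "example modification", source="example citation 2")
record.annotate("localization", "another compartment")

print(record.characteristic_names())  # ['localization', 'modification']
print(record.get("localization"))     # ['example compartment', 'another compartment']
```

In a fixed-column table, adding a new kind of characteristic would mean changing the table definition; here it only means calling `annotate` with a new name.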
Authors from Johns Hopkins on the paper are Pandey,
Chakravarti, Suraj Peri, Daniel Navarro, Ramars Amanchy,
Troels Kristiansen, Chandra Kiran Jonnalagadda, Mads
Gronborg, Nieves Ibarrola, Chi Dang, Joe Garcia, Jonathan
Pevsner and Ada Hamosh. Pandey serves as chief scientific
adviser to the Institute of Bioinformatics. The terms of
this arrangement are being managed by The Johns Hopkins
University in accordance with its conflict of interest
policies.