The Johns Hopkins Gazette: September 27, 1999
VOL. 29, NO. 5


Harnessing the Flood of Scientific Data

NSF gives Hopkins-led collaboration a grant to improve data storage, use

By Michael Purdy

Johns Hopkins Gazette Online Edition

The fountain of information at the heart of science has become a fire hose, and an increase to riverlike volumes is on the way. A Hopkins-led collaboration has been working to develop new ways to handle this flood of information; this month, the value of their goals was recognized by the National Science Foundation, which awarded them a $2.5 million grant to support their innovative work.

To give the problem a sense of scale, Alexander Szalay, Alumni Centennial Professor of Physics and Astronomy and leader of the collaborative, notes that the CERN particle collider in Geneva, Switzerland, currently produces more than 1 petabyte, or about 1,000,000,000,000,000 bytes, of information every year. The words and other text in all the books in the Library of Congress, in contrast, add up to only about one-thousandth of that information, or 1 terabyte (1 trillion bytes). And CERN is just one example of the tremendous information-generating powers of modern science.

"Our current ways of doing science are very much based on the concept that our data sets are so small that we can sort of 'eyeball' the whole thing and locate the interesting data," says Szalay. "And with the data sets we are getting in an increasing number of areas of science, this is just not going to be feasible. So we have to do something drastically different."

Participants in the Hopkins-led collaboration, which will develop new ways to store, access and search large volumes of data, include scientists from Caltech, the U.S. Department of Energy's Fermilab and Microsoft Corp.

"This problem is of course much bigger than astronomy or particle physics," Szalay says. "I think this is actually becoming more a problem for the whole society. We are choking on information, and we have to sort out the relevant from the irrelevant. So I think what we're doing is a very interesting test bed for experimenting with new technologies that could have broader applications elsewhere."

Particle physicists were among the first to have to deal with huge quantities of information. Their work to manage that information led to the development of tools and techniques that found uses beyond the realm of the physics lab, notes Aihud Pevsner, Jacob P. Hain Professor of Physics and Astronomy and a member of the collaborative.

"To help work with large data sets at CERN, Tim Berners-Lee invented in 1989 what later became the World Wide Web," says Pevsner. "He did it because the tools that they had at the time were inadequate for the distribution of the data sets they were working with."

Pevsner, a particle physicist, will be one of 500 American physicists working at the Large Hadron Collider at CERN, the world's most powerful particle collider. The LHC is expected to produce 100-petabyte data sets.

Szalay is a researcher for the Sloan Digital Sky Survey, an effort he calls the "cosmic genome project," which will map everything visible in several large chunks of the northern and southern sky. SDSS starts next year, and before it is over he estimates that it will produce 40 terabytes of data with a two-terabyte catalog.

Such a high volume of data reduces the chances that astronomers will miss gathering important information, but it also makes it harder to find that information. "When you have so much data that it chokes you, you have to keep breaking it up into smaller chunks until it no longer chokes you," Szalay says.

Developing better ways to break down large quantities of information is the first major component of research under the NSF grant. The SDSS information, for example, might be broken up both by the area of the sky that the data comes from and by the color of the objects observed in the sky. The challenge, though, is to make sure that this process of partitioning the data improves the scientists' abilities to see important patterns and irregularities in the data.

"We want to try to make it possible for data that will be of interest to the same kinds of queries to be 'located' close together so they are easier to find," says Ethan Vishniac, director of the Johns Hopkins Center for Astrophysical Sciences, also a collaborative member.
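As a rough illustration of the partitioning idea described above, the following Python sketch groups catalog objects into buckets keyed by a coarse patch of sky and a color range, so that a query about one patch and one color range only has to look in one bucket. The field names (ra, dec, color), the cell size and the bin width are hypothetical choices made for this example; they are not details of the SDSS design or of the collaboration's software.

```python
from collections import defaultdict
import math

CELL_DEG = 10.0   # assumed width of one sky cell, in degrees
COLOR_BIN = 0.5   # assumed width of one color bin, in magnitudes


def partition_key(obj):
    """Return (ra_cell, dec_cell, color_bin) for one catalog object."""
    ra_cell = math.floor(obj["ra"] / CELL_DEG)
    dec_cell = math.floor(obj["dec"] / CELL_DEG)
    color_bin = math.floor(obj["color"] / COLOR_BIN)
    return (ra_cell, dec_cell, color_bin)


def build_partitions(objects):
    """Group objects so that similar sky position and color share a bucket."""
    partitions = defaultdict(list)
    for obj in objects:
        partitions[partition_key(obj)].append(obj)
    return partitions


def query(partitions, ra_cell, dec_cell, color_bin):
    """Look only in the one bucket that can hold matching objects."""
    return partitions.get((ra_cell, dec_cell, color_bin), [])


# Made-up sample objects, purely for illustration.
catalog = [
    {"id": 1, "ra": 12.3, "dec": 5.1, "color": 0.7},
    {"id": 2, "ra": 14.9, "dec": 7.8, "color": 0.9},
    {"id": 3, "ra": 200.0, "dec": -30.0, "color": 0.2},
]
partitions = build_partitions(catalog)
same_patch = query(partitions, 1, 0, 1)
```

In this toy catalog the first two objects fall in the same bucket, so a query for that patch and color range never has to examine the third.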

Another concern is that these huge chunks of information will probably be stored at geographically different locations. Some next-generation science projects involve so much information, according to Szalay, that it cannot be brought to researchers across computer networks. Arranging ways to simultaneously access data in these different locations without ever bringing it together in one database, a technique called "distributed processing," is the second major component of research supported by the NSF grant.
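One simple way to picture working with data that never leaves its home site is sketched below in Python: each site reduces its own data to a tiny summary, and only those summaries are combined. The site names, the sample values and the choice of an average as the quantity of interest are assumptions made for this example, not a description of the collaboration's actual method.

```python
def local_summary(values):
    """Runs where the data lives: reduce it to a small (count, total) pair."""
    return (len(values), sum(values))


def combine(summaries):
    """Runs at the researcher's end: merge the small summaries into one mean."""
    count = sum(c for c, _ in summaries)
    total = sum(t for _, t in summaries)
    return total / count if count else None


# Hypothetical measurements held at three separate sites.
sites = {
    "site_a": [18.0, 20.0],
    "site_b": [22.0],
    "site_c": [19.0, 21.0, 20.0],
}

# Only two numbers per site would need to cross the network.
summaries = [local_summary(values) for values in sites.values()]
overall_mean = combine(summaries)
```

The result is the same as if all six values had been gathered in one place, but the raw data stays put.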

The third component of the NSF grant will improve a technique called "parallel" querying. This involves searching in different locations at the same time, not unlike sending out an army of librarians to search or work in several different, large libraries at once. Researchers will strive to make these search agents smarter and more independent by improving the software they use.
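The "army of librarians" picture can be sketched in a few lines of Python using the standard library's thread pool: the same question is sent to several archives at once, and the answers are collected as they come back. The archive names, the records and the brightness cutoff are invented for this example, and the in-memory lookup stands in for what would really be a query to a remote system.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical archives; in practice each would be a separate remote database.
ARCHIVES = {
    "archive_a": [{"id": 1, "mag": 17.5}, {"id": 2, "mag": 21.0}],
    "archive_b": [{"id": 3, "mag": 16.2}],
    "archive_c": [{"id": 4, "mag": 22.4}, {"id": 5, "mag": 15.9}],
}


def search_archive(name, max_mag):
    """Search one archive for objects at least as bright as max_mag."""
    return [row["id"] for row in ARCHIVES[name] if row["mag"] <= max_mag]


def parallel_search(max_mag):
    """Send the same query to every archive at once and gather the answers."""
    with ThreadPoolExecutor(max_workers=len(ARCHIVES)) as pool:
        futures = {
            name: pool.submit(search_archive, name, max_mag)
            for name in ARCHIVES
        }
        return {name: future.result() for name, future in futures.items()}


bright = parallel_search(18.0)
```

Because the searches run side by side, the total wait is set by the slowest archive rather than by the sum of all of them.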

To test their efforts at dealing with these challenges, researchers will use data from the SDSS, from the CERN particle collider and from GALEX, a sky-mapping survey that covers the same areas as SDSS but measures different forms of radiation.

"Data sets that are astronomical in every sense of that word are great test beds for computer scientists to experiment with to develop novel techniques for visualizing, organizing and querying information," says Michael Goodrich, professor of computer science and a member of the collaborative.

Additional collaborators include physicist Harvey Newman, research scientist Julian Bond and astronomer Chris Martin of Caltech; physicist Thomas Nash of Fermilab; computer scientist Jim Gray of Microsoft; and astronomers Ani Thakar and Peter Kunszt of Hopkins.

The $2.5 million NSF grant is one of 31 announced by NSF as part of a new effort to support "knowledge and distributed intelligence" projects. The grants are focused on efforts to apply new computer technology across multidisciplinary areas in science and engineering.

Related Web site:
Alexander Szalay's page on scalable computing efforts