Database

A Conversation with Jim Gray

By Philippe Lourier, February 05, 2007

A computing great takes time to discuss transaction theory, databases, and more in this 2001 interview.

DDJ: We were talking a bit about performance and time. How about space? That is another dimension. You are working on Microsoft Terra Server, a huge terabyte database.

JG: That is an interesting story. The story is that we wanted to put together a big database just to see how Microsoft products work. We also wanted to be on the web and be up all the time so we could show people that NT doesn't crash all the time. If you treat it with some respect, it will stay up all the time. The Terra Server. . . We have operated it now for about a year and a half -- www.terraserver.microsoft.com. It has all the aerial imagery that we have gotten from the USGS. We now have about 50% coverage of the U.S. at one-meter resolution imagery. It was supposed to go up this week but, by the time you see this, we will also have all the topo maps of the United States. We will have 100% coverage of the U.S. with topographic maps at two-meter resolution and then. . . So, this ends up being about 15 terabytes worth of imagery. Because we compress it, and tile it, and so on, it boils down to something like 5 or 6 terabytes worth of imagery.

DDJ: Does that matter -- the size of the database. . . The transactions per second? This could be a huge database.

JG: There are a couple of things that matter. Let's imagine that you want to back up a terabyte. Let's imagine that you have a tape and you start your backup. The tape runs at 10 megabytes a second. It will take a day and a half to back the system up. We are talking 10 terabytes here. It is going to take two weeks to back up your 10-terabyte database. Whenever you start working with large quantities of data, you are forced into doing things in parallel. Any utilities that you have must run in parallel, have to be restartable, and have to be incremental. The problems you face are much, much more demanding. One strategy is to say, "Oh, it hurts when you go like that. Don't go like that. Don't have terabyte databases. Who on earth has terabyte databases anyway?" Lots of people do. The web is creating data at a huge rate. People are going out and searching the web, and making archival copies of it. People are keeping track of the click stream data associated with their web site. Transaction data associated with the web, and the raw data associated with it -- the images, sounds, and videos -- easily give rise to very large data sets. There is a flip side of this, which is that the market price for a terabyte today is about $10,000-$12,000. That is to say that if you go shopping and you don't go to one of these expensive disk vendors, but you go to a no-name storage vendor, you can get a terabyte for about $10,000 or $15,000.
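
The backup arithmetic above can be checked with a small sketch. It assumes decimal units (1 terabyte = 10**12 bytes, 1 megabyte = 10**6 bytes) and an even split of the work across tape drives; the function name and the choice of ten parallel streams are illustrative and not from the interview. The computed figures land a little below the rounded numbers quoted in conversation, but in the same range.

```python
# Back-of-the-envelope backup times, assuming decimal units
# (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
SECONDS_PER_DAY = 86_400


def backup_days(size_tb, tape_mb_per_s=10.0, streams=1):
    """Days needed to stream size_tb terabytes to tape.

    Assumes the data is split evenly over `streams` drives, each
    running at tape_mb_per_s megabytes per second (hypothetical model).
    """
    if streams < 1:
        raise ValueError("streams must be at least 1")
    total_bytes = size_tb * 10**12
    bytes_per_second = tape_mb_per_s * 10**6 * streams
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    for size_tb in (1, 10):
        for streams in (1, 10):
            days = backup_days(size_tb, streams=streams)
            print(f"{size_tb:>2} TB, {streams:>2} stream(s): {days:.2f} days")
```

With one stream, the 10-terabyte case takes well over a week, which is why utilities at this scale need to run in parallel; ten streams bring it back to roughly the single-terabyte time.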

DDJ: So, this is really gearing up for the near future really where there will be this amount of data.

JG: Almost any company of, let's say, 100 employees can afford to have terabytes worth of data.

DDJ: This reminds me of the story. . . The early PCs had 6 to 40K of memory.

JG: Data sources exist for giving you lots and lots of data, both image data and transaction data. The image data is the thing that gives you orders of magnitude larger records.

DDJ: These are static images. They are not even video or multi-media type content. . .

JG: The Terra Server is, in fact, fairly static. The fact is that people are putting up satellites and taking pictures of the earth. Every two weeks you get a new picture. Every two weeks you get another 10 terabytes. We have been working with the Sloan Digital Sky Survey, which is like the Terra Server looking in the other direction. Every night of observation they get 200 gigabytes worth of images of the stars. They want to process that information. After a little bit of processing, it turns into a terabyte. Here are some guys who can give us a terabyte a day, if we are interested.

DDJ: How about web search engines? How do those work? What are the back ends of those?

JG: As you may know, Brewster Kahle... Have you talked to Brewster? Brewster Kahle is running the Internet Archive over here at the Presidio [in San Francisco]. The Internet, if you scan it. . . As far as we can tell, the fraction of the web that gets crawled, at this point, is below 20%. In fact, the fraction has been shrinking. Even if you crawl that 20%, you come up with a quarter to a half a terabyte of HTML data. If you go off and start picking up the images, the GIFs, the MP3s, the AVI files, and so on, then you are talking about a lot more bytes than that. If you go off and grab all of the HTML, you don't have quite a terabyte worth of data. There are people, lots of people, who are indexing the web, going off and grabbing this information, and then building textual indices on it. That is one of the sources.

DDJ: ... SQL databases from the back end.

JG: You know, they're not. They may be. That is not the typical way. Certainly, Brewster Kahle is not using SQL. These people are mostly dealing with text data. The first question they have is -- Is this English, French, Russian, etc.? Is this XML structured data or is it unstructured data? They are trying then to pull out the URLs and build a graph, which is the graph of the internet. They are doing studies trying to decide. . . Google is an interesting case where they go out and crawl the web. They try and figure out -- Who points to this page? If a lot of people point to this page, this is a good page. If not very many people point to this page, it is probably not such a good page. If this is a good page, and it points to your page, then your page is probably pretty good. There is lots and lots of data analysis and mining going on among people who are making copies of the web and of the web graph. I would say, overall, the place where SQL is being used is mostly in the e-commerce space and not so much in the analysis of the web.
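
The "who points to this page?" idea can be illustrated with a simplified link-ranking sketch in the style of the published PageRank iteration. This is not a description of Google's production system: the damping factor of 0.85, the fixed iteration count, the function name, and the four-page toy graph are all assumptions made for illustration.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score pages by who links to them (simplified PageRank-style sketch).

    links: dict mapping a page to the list of pages it links to.
    damping and iterations are illustrative defaults, not from the interview.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on to the pages it points to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical web graph: three pages point to "c"; "c" points to "a".
    graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(rank_pages(graph).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {score:.3f}")
```

In the toy graph, "c" scores highest because many pages point to it, and "a" outranks "b" and "d" because the well-regarded page "c" points to it.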

DDJ: What about search engines like Alta Vista or where you put in a key word and it comes up with the. . .

JG: Most of the search engines are using proprietary indexing mechanisms -- Infoseek, Yahoo, Alta Vista. There is a standard text indexing mechanism that comes with Microsoft -- IIS (the web server). There is a standard text indexing engine that comes with Netscape. Both of those are special-purpose database systems which are built to handle text. They are not general-purpose SQL systems. They do have an SQL interface to the Microsoft text search engines. I don't know if they have one to the Netscape. You can ask SQL queries of them. That is not the typical way that they are accessed, I believe.
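
The interview does not describe how these text engines work internally. A common structure for keyword search is an inverted index, which maps each term to the documents containing it. The following is a minimal sketch under that assumption; the tokenizer, the function names, and the sample documents are hypothetical and do not represent any of the products named above.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    docs = {
        1: "Terabyte databases need parallel backup",
        2: "Parallel utilities must be restartable",
        3: "Aerial imagery of the United States",
    }
    index = build_index(docs)
    print(search(index, "parallel"))
    print(search(index, "parallel backup"))
```

A keyword lookup here is a set intersection over precomputed term lists rather than a scan of the documents, which is one reason such engines are typically built separately from a general-purpose SQL system.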
