Dr. Dobb's is part of the Informa Tech Division of Informa PLC




A Conversation with Jim Gray


DDJ: We were talking a bit about performance and time. How about space? That is another dimension. You are working on Microsoft Terra Server. A huge terabyte database.

JG: That is an interesting story. The story is that we wanted to put together a big database just to see how Microsoft products work. We also wanted to be on the web and be up all the time so we could show people that NT doesn't crash all the time. If you treat it with some respect, it will stay up all the time. The Terra Server. . . We have operated it now for about a year and a half -- www.terraserver.microsoft.com. It has all the aerial imagery that we have gotten from the USGS. We now have about 50% coverage of the U.S. at one-meter resolution. It was supposed to go up this week but, by the time you see this, we will also have all the topo maps of the United States. We will have 100% coverage of the U.S. with topographic maps at two-meter resolution and then. . . So, this ends up being about 15 terabytes worth of imagery. Because we compress it, and tile it, and so on, it boils down to something like 5 or 6 terabytes worth of imagery.

DDJ: Does that matter -- the size of the database. . . The transactions per second? This could be a huge database.

JG: There are a couple of things that matter. Let's imagine that you want to back up a terabyte. Let's imagine that you have a tape and you start your backup. The tape runs at 10 megabytes a second. It will take a day and a half to back the system up. We are talking 10 terabytes here. It is going to take two weeks to back up your 10-terabyte database. Whenever you start working with large quantities of data, you are forced into doing things in parallel. Any utilities that you have must run in parallel, have to be restartable, and have to be incremental. The problems you face are much, much more demanding. One strategy is to say, "Oh, it hurts when you go like that. Don't go like that. Don't have terabyte databases. Who, on earth, has terabyte databases anyway?" Lots of people do. The web is creating data at a huge rate. People are going out and searching the web, and making archival copies of it. People are keeping track of the clickstream data associated with their web site. Transaction data associated with the web, and the raw data associated with it -- the images, sounds, and videos -- easily give rise to very large data sets. There is a flip side of this, which is that the market price for a terabyte today is about $10,000-$12,000. That is to say that if you go shopping and you don't go to one of these expensive disk vendors, but you go to a no-name storage vendor, you can get a terabyte for about $10,000 or $15,000.
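
[Editor's sketch] The backup arithmetic above can be checked with a few lines of Python. This is a rough estimate, not a model of any real backup tool: it assumes decimal units (1 terabyte = 10^12 bytes, 1 megabyte = 10^6 bytes), a tape that sustains its full rate for the whole run, and perfect scaling when several tapes run in parallel. The function name `backup_days` and its `streams` parameter are made up for illustration. Under those assumptions the raw numbers come out slightly below the rounder figures quoted in the interview, which is plausible given that real backups rarely sustain peak streaming rate.

```python
# Back-of-the-envelope backup time for sequential tape streams.
# Assumptions: decimal units, sustained peak rate, perfect parallel scaling.

SECONDS_PER_DAY = 24 * 60 * 60


def backup_days(terabytes: float, mb_per_second: float, streams: int = 1) -> float:
    """Days needed to copy `terabytes` of data at `mb_per_second` per stream,
    with `streams` tapes running in parallel."""
    total_bytes = terabytes * 10**12
    bytes_per_second = mb_per_second * 10**6 * streams
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"1 TB, 1 stream:    {backup_days(1, 10):.1f} days")
    print(f"10 TB, 1 stream:   {backup_days(10, 10):.1f} days")
    print(f"10 TB, 10 streams: {backup_days(10, 10, streams=10):.1f} days")
```

Ten parallel streams bring the 10-terabyte case back to the single-terabyte time, which is the point about being forced into parallel utilities.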

DDJ: So, this is really gearing up for the near future, where there will be this amount of data.

JG: Almost any company of, let's say, 100 employees can afford to have terabytes worth of data.

DDJ: This reminds me of the story. . . The early PCs had 6 to 40K of memory.

JG: Data sources exist for giving you lots and lots of data, both image data and transaction data. The image data is the thing that gives you orders of magnitude larger records.

DDJ: These are static images. They are not even video or multi-media type content. . .

JG: The Terra Server is, in fact, fairly static. The fact is that people are putting up satellites and taking pictures of the earth. Every two weeks you get a new picture. Every two weeks you get another 10 terabytes. We have been working with the Sloan Digital Sky Survey, which is like the Terra Server looking in the other direction. Every night of observation they get 200 gigabytes worth of images of the stars. They want to process that information. After a little bit of processing, it turns into a terabyte. Here are some guys who can give us a terabyte a day, if we are interested.

DDJ: How about web search engines? How do those work? What are the back ends of those?

JG: As you may know, Brewster Kahle... Have you talked to Brewster? Brewster Kahle is running the Internet Archive over here at the Presidio [in San Francisco]. The Internet, if you scan it. . . As far as we can tell, the fraction of the web that gets crawled, at this point, is below 20%. In fact, the fraction has been shrinking. Even if you crawl that 20%, you come up with a quarter to half a terabyte of HTML data. If you go off and start picking up the images, the GIFs, and MP3s, and the AVI files, and so on, then you are talking about a lot more bytes than that. If you go off and grab all of the HTML, you don't have quite a terabyte worth of data. There are people, lots of people, who are indexing the web, going off and grabbing this information, and then building textual indices on it. That is one of the sources.

DDJ: ... SQL databases from the back end.

JG: You know, they're not. They may be. That is not the typical way. Certainly, Brewster Kahle is not using SQL. These people are mostly dealing with text data. The first question they have is -- Is this English, French, Russian, etc.? Is this XML structured data or is it unstructured data? They are trying then to pull out the URLs and build a graph, which is the graph of the internet. They are doing studies trying to decide. . . Google is an interesting case where they go out and crawl the web. They try and figure out -- Who points to this page? If a lot of people point to this page, this is a good page. If not very many people point to this page, it is probably not such a good page. If this is a good page, and it points to your page, then your page is probably pretty good. There is lots and lots of data analysis and mining going on with people who are making copies of the web and of the web graph. I would say, overall, the place where SQL is being used is mostly in the e-commerce space and not so much in the analysis of the web.
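
[Editor's sketch] The "a good page that points to your page makes your page good" idea can be illustrated with a simplified link-ranking loop in the style of PageRank. This is not Google's implementation, which the interview does not describe. The function name `rank_pages`, the damping factor of 0.85, the iteration count, and the four-page example graph are all illustrative assumptions.

```python
# Simplified link-based ranking by repeated score propagation (PageRank-style).
# The damping factor, iteration count, and sample graph are assumptions.

def rank_pages(links: dict[str, list[str]], damping: float = 0.85,
               iterations: int = 50) -> dict[str, float]:
    """Return a score per page; the scores sum to 1.

    `links` maps each page to the list of pages it points to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    score = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_score = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly over its out-links.
                share = damping * score[page] / len(targets)
                for target in targets:
                    new_score[target] += share
            else:
                # A page with no out-links spreads its score over all pages.
                share = damping * score[page] / n
                for p in pages:
                    new_score[p] += share
        score = new_score
    return score


if __name__ == "__main__":
    # Hypothetical four-page web: three pages point to "hub".
    web = {
        "a": ["hub"],
        "b": ["hub"],
        "c": ["hub", "a"],
        "hub": ["a"],
    }
    for page, s in sorted(rank_pages(web).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {s:.3f}")
```

In the sample graph, "hub" ranks highest because many pages point to it, and "a" ranks next because the highly ranked "hub" points to it.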

DDJ: What about search engines like Alta Vista, where you put in a keyword and it comes up with the. . .

JG: Most of the search engines are using proprietary indexing mechanisms -- Infoseek, Yahoo, Alta Vista. There is a standard text indexing mechanism that comes with Microsoft IIS (the web server). There is a standard text indexing engine that comes with Netscape. Both of those are special-purpose database systems which are built to handle text. They are not general-purpose SQL systems. They do have an SQL interface to the Microsoft text search engines. I don't know if they have one to the Netscape. You can ask SQL queries of them. That is not the typical way that they are accessed, I believe.
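
[Editor's sketch] The interview does not say how these engines store their indexes. A common structure for keyword search over text is an inverted index, which maps each word to the documents that contain it. The sketch below is a minimal in-memory version under that assumption; the function names `build_index` and `search`, the tokenizing rule, and the sample documents are hypothetical and do not reflect any of the products named above.

```python
# Minimal in-memory inverted index: word -> set of document ids.
# Tokenization (lowercase runs of letters and digits) is a simplifying assumption.
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map every word to the ids of the documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in TOKEN.findall(text.lower()):
            index[word].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return ids of documents containing every word of the query."""
    words = TOKEN.findall(query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    # Hypothetical documents, for illustration only.
    docs = {
        "page1": "Aerial imagery of the United States",
        "page2": "Topographic maps of the United States",
        "page3": "Images of the stars",
    }
    index = build_index(docs)
    print(sorted(search(index, "united states")))
    print(sorted(search(index, "maps")))
```

A keyword lookup then touches only the entries for the query words rather than scanning every document, which is why text engines tend to be built around structures like this rather than general-purpose tables.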

