Architecture Blog: The DBMS vs. Map/Reduce: Is That Really Competition?
IF YOU BUILD IT

... Will they Come?

by Arnon Rotem-Gal-Oz
January 22, 2008

The DBMS vs. Map/Reduce: Is That Really Competition?

David DeWitt and Michael Stonebraker write about MapReduce in The Database Column. Now, I usually like what Michael Stonebraker writes (e.g., his piece on the RDBMS Demise, which I also wrote about). However, I can't say the same this time around.

David and Michael write that MapReduce is a big step backwards. Before I discuss what they write, here is a (very high-level) reminder of what Map/Reduce is.

As Google's Jeffrey Dean and Sanjay Ghemawat explain, MapReduce is a way to get automatic parallelization and distribution, along with fault tolerance, monitoring, and I/O scheduling, for tasks that need to work on complete datasets. MapReduce uses two functions:

  • Map - multiple instances of which run in parallel, each processing a key/value pair and producing a set of grouping keys and intermediate values.
  • Reduce - which runs once per grouping key and merges the intermediate values for that key into a set of merged outputs (usually one).
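
To make the two functions concrete, here is a minimal single-process sketch in Python using the classic word-count example. The names (map_word_count, reduce_word_count, run_mapreduce) are mine, made up for illustration; this is not Google's API. A real MapReduce framework would run the map and reduce calls in parallel across many machines and handle the fault tolerance and scheduling; the toy driver below only shows the data flow.

```python
from collections import defaultdict


def map_word_count(doc_id, text):
    """Map: take one key/value pair (doc id, text) and emit (word, 1) pairs."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word, counts):
    """Reduce: merge all intermediate values of one grouping key into one output."""
    return sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy sequential driver (hypothetical): map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return {k: reduce_fn(k, values) for k, values in groups.items()}


docs = [
    ("d1", "the quick brown fox"),
    ("d2", "the lazy dog and the fox"),
]
word_counts = run_mapreduce(docs, map_word_count, reduce_word_count)
print(word_counts["the"], word_counts["fox"], word_counts["dog"])
```

Note that nothing here needs a schema or an index: the map function sees every input pair, which is exactly the "work on complete datasets" case.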

David and Michael claim that MapReduce is:

  • A step backwards because it doesn't build on a schema.
  • A poor implementation because it doesn't use indexes.
  • Not new.
  • Missing features - like bulk load, indexing, updates, transactions, integrity constraints, referential integrity, views.
  • Incompatible with DBMS tools - like report writers, BI tools, replication tools, design tools.

Well, if anything, it seems that David and Michael don't really understand what MapReduce is. As noted above, MapReduce is a way to go over complete datasets in an efficient, distributed manner. In fact, it can even be used to build the index of a traditional RDBMS. It isn't really competing with databases, relational or otherwise. Yep, comparing MapReduce and databases is an apples and oranges thing...
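
As a rough illustration of the index-building point, here is a sketch (again in Python, again with made-up names and made-up sample rows) that scans a whole table and produces a simple value-to-row-ids lookup for one column. This is only a toy secondary index built with a map step and a reduce step; I'm not claiming any real RDBMS builds its indexes this way.

```python
from collections import defaultdict


def map_index(row_id, row, column):
    """Map: emit (column value, row id) for the column being indexed."""
    yield row[column], row_id


def reduce_index(value, row_ids):
    """Reduce: merge all row ids that share one column value into a sorted list."""
    return sorted(row_ids)


def build_index(rows, column):
    """Toy sequential driver (hypothetical): full scan, group by value, reduce."""
    groups = defaultdict(list)
    for row_id, row in rows:
        for value, rid in map_index(row_id, row, column):
            groups[value].append(rid)
    return {value: reduce_index(value, ids) for value, ids in groups.items()}


# Illustrative sample data, keyed by row id.
customers = [
    (1, {"name": "Ann", "city": "Haifa"}),
    (2, {"name": "Bob", "city": "Boston"}),
    (3, {"name": "Cid", "city": "Haifa"}),
]
city_index = build_index(customers, "city")
print(city_index)
```

The output of the job is the index itself, which is why the two technologies look complementary to me rather than competing.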

I guess they might have meant to talk about another Google tool, BigTable, which is at least sort of a column database (Michael's company also makes a column database) for storing structured data in a highly distributed, high-performance way. However, David and Michael would still be wrong. BigTable is proprietary and targeted at a specific purpose, so it isn't supposed to solve the same problems as a general-purpose database. Not to mention that it is highly scalable (ever heard of Google's search engine ;) ) and does support things like indexes, updates, etc.

Also, as mentioned in the "RDBMS Is Dead" post, the Internet proved that RDBMS features (like transactions, etc.) can only scale so much. While databases focus on the Consistency and Availability parts of the CAP conjecture and on the ACID tenets, Internet-scale systems pick Partition tolerance and Availability, and the BASE tenets, instead.

Posted by Arnon Rotem-Gal-Oz at 02:43 PM




 