MapReduce vs Traditional RDDBMS

Traditional RDBMS MapReduce
Max Data Size per DB many gigabytes many petabytes
Access Method interactive & batch batch
Updates read & update many times read many times — write once
Schema Define before inital load of data, and difficult to make (and test?) extensive schema changesAlso has structured data, in which field entries are in a defined format such as XML or the type of a db column.

The keys/values are defined at the the time of data insert.

No schema required, and if schema then it can be changed dynamically without regression impact.Fields may be unstructured (free text, images) or semi-structured (spreadsheet or even this table, in which some description is implied by the row and column headings).

The keys/values are defined at the time of processing.

System Integrity/Availability Generally designed to be high.Recovery made more difficult due to multiple autonomous components (SAN, database cluster, applications server farm), that must communicate with each other. Assumed to be low.Failure is assumed to have occured, so computations occur on multiple redundant nodes whose results are sorted and merged, or are rescheduled.

Tasks generally do not have dependencies on each other in a shared-nothing architecture.

Scalability non-linear (cannot simply add additional cloned servers to the cluster forever) linear (can scale by factor of 10, 100, or even 1000 cloned nodes)
Normalization Database performs better if data is normalized. Database performs better is data is de-normalized since the data is scattered across multiple nodes. As de-normalized, all information is available within each block as it is read. Example is a webserver log, in which hostnames are specified in full each time.
Data locality Data is typically stored on a SAN, and access speed limited by fibre bandwidth. Additional processing may be done on an application server, in which bandwidth is limited by the datacenter LAN. Data is processed by CPU on the same computer which hosts the data on internal drives. Access speed is limited by internal bus bandwidth. This model is called “lack of data motion”.



Comments are closed.