Processing data with Spark

The world before Spark


Parallelism is about speeding up computations by using multiple processors.

  • Task parallelism: Different computations performed on the same data
  • Data parallelism: Apply the same computation on dataset partitions

When processing big data, data parallelism is used to move computations close to the data, instead of moving data across the network.
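To make the idea concrete, here is a minimal local sketch of data parallelism: the same function is applied to every partition, and the partial results are combined at the end. All names (`partition`, `partial_sum`) are illustrative, and a thread pool stands in for worker nodes.

```python
# Data parallelism in miniature: one function, many partitions.
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Split data into n roughly equal, contiguous partitions."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_sum(chunk):
    # The per-partition computation; on a real cluster this would run
    # on the node that already stores the chunk.
    return sum(chunk)

data = list(range(1, 101))
parts = partition(data, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, parts))
total = sum(partials)  # combine the per-partition results
print(total)  # 5050
```

Note that only the small partial results travel between workers and the combiner; the bulk data never moves.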

Issues with data parallelism

  • Latency: Disk operations are 1,000x and network operations 1,000,000x slower than accessing data in memory
  • (Partial) failures: Computations on 100s of machines may fail at any time

This means that our programming model and execution should hide (but not forget!) those.


Map/Reduce is a general computation framework, loosely based on functional programming[1]. It assumes that data exists in a K/V store

  • map((K, V), f: (K, V) -> List[(X, Y)]): List[(X, Y)]
  • reduce((X, List[Y]), g: (X, List[Y]) -> List[(X, Y)]): List[(X, Y)]
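The two signatures can be exercised with the classic word-count job. The sketch below is a single-process illustration; `map_fn`, `reduce_fn`, and `map_reduce` are hypothetical names, not a real framework API.

```python
# Word count in the Map/Reduce style: map emits (word, 1) pairs,
# the shuffle groups them by word, and reduce sums each group.
from collections import defaultdict

def map_fn(key, value):
    # (K, V) -> List[(X, Y)]: emit (word, 1) for every word in the line
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # (X, List[Y]) -> List[(X, Y)]: sum the counts for one word
    return [(key, sum(values))]

def map_reduce(records, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for k, v in records:                 # map phase
        for x, y in map_fn(k, v):
            intermediate[x].append(y)    # shuffle: group values by key
    results = []
    for k, vs in sorted(intermediate.items()):  # reduce phase
        results.extend(reduce_fn(k, vs))
    return results

records = [(0, "the quick brown fox"), (1, "the lazy dog")]
print(map_reduce(records, map_fn, reduce_fn))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```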

Map/Reduce as a system was proposed by Dean & Ghemawat [2], along with the Google File System (GFS).

Hadoop: OSS Map/Reduce

Hadoop cluster

Hadoop: Pros and Cons

What is good about Hadoop: Fault tolerance

Hadoop was the first framework to enable computations to run reliably on 1000s of commodity computers: failed tasks are detected and transparently re-executed on other machines.

What is wrong: Performance

  • After each Map/Reduce job, Hadoop writes its results back to HDFS, and the next job must re-read them from disk
  • Hard to express iterative problems in M/R
    • Most machine learning algorithms are iterative
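To see why this hurts iterative jobs, the toy sketch below runs three passes of a trivial computation, forcing each pass to round-trip through a file, much as chained Hadoop jobs round-trip through HDFS. The file-based setup is purely illustrative.

```python
# Each "pass" must fully materialize its output to storage before the
# next pass can read it -- the cost an iterative M/R pipeline pays on
# every iteration.
import json, os, tempfile

def run_pass(in_path, out_path):
    with open(in_path) as f:
        data = json.load(f)              # re-read previous results
    data = [x + 1 for x in data]         # the actual computation
    with open(out_path, "w") as f:
        json.dump(data, f)               # forced storage round-trip

tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "iter0.json")
with open(path, "w") as f:
    json.dump([0, 10, 20], f)

for i in range(3):                       # 3 iterations = 3 disk round-trips
    nxt = os.path.join(tmp, f"iter{i + 1}.json")
    run_pass(path, nxt)
    path = nxt

with open(path) as f:
    print(json.load(f))  # [3, 13, 23]
```

Keeping the working set in memory between iterations is exactly the optimization Spark's design targets.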

The Hadoop Workflow


Microsoft’s DryadLINQ combined the Dryad distributed execution engine with LINQ, .NET’s language-integrated query syntax, for defining queries.