Processing data with Spark

The world before Spark

Parallelism

Parallelism is about speeding up computations by utilising clusters of machines.

  • Task parallelism: Distribute computations across different processors
  • Data parallelism: Apply the same computation on different data sets

Distributed data parallelism involves splitting the data over several distributed nodes, where nodes work in parallel, and combine the individual results to come up with a final one.

Issues with data parallelism

  • Latency: Operations are 1.000x (disk) or 1.000.000x (network) slower than accessing data in memory
  • (Partial) failures: Computations on 100s of machines may fail at any time

This means that our programming model and execution should hide (but not forget!) those.

Hadoop: Pros and Cons

What is good about Hadoop: Fault tolerance

Hadoop was the first framework to enable computations to run on 1000s of commodity computers.

What is wrong: Performance

  • Before each Map or Reduce step, Hadoop writes to HDFS
  • Hard to express iterative problems in M/R
    • All machine learning problems are iterative

The Hadoop Workflow

DryadLINQ

Microsoft’s DryadLINQ combined the Dryad distributed execution engine with the LINQ language for defining queries.