Languages for Big Data processing

Scala and Python

The de facto languages of Big Data and data science are

  • Scala: mostly used for data-intensive systems
  • Python: mostly used for data analytics tasks

Other languages include

  • Java: the “assembly” of big data systems; the language that most big data infrastructure is written in.
  • R: the statistician’s tool of choice. Great selection of libraries for serious data analytics, great plotting tools.

In our course, we will be using Scala and Python.

Scala and Python from 10k feet

  • Both support object orientation (OO), functional programming (FP) and imperative programming (IP)
    • Scala’s strong point is the combination of FP and OO
    • Python’s strong point is the combination of OO and IP
  • Python is interpreted, Scala is compiled

Hello world


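object Hello extends App {
    println("Hello, world")
    for (i <- 1 to 10) {
        // Loop body assumed for illustration: calls a JVM library method directly
        System.out.println("Hello")
    }
}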
  • Scala is compiled to JVM bytecode
  • Can interoperate with JVM libraries
  • Scala is not sensitive to spaces/tabs. Blocks are denoted by { and }
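
For comparison, here is a minimal Python sketch of the same program. The function name `greetings` and the exact greeting text are illustrative assumptions, not part of the course material. The points to notice are that the script runs directly with the interpreter and that blocks are delimited by indentation rather than braces.

```python
def greetings(n):
    """Return the lines this script prints: one header plus n greetings."""
    lines = ["Hello, world"]
    # Blocks are delimited by indentation, not by { and }
    for i in range(1, n + 1):
        lines.append(f"Hello {i}")
    return lines


if __name__ == "__main__":
    # Runs directly with the interpreter, no separate compile step
    for line in greetings(10):
        print(line)
```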



Declaring variables


val a: Int = 5
val b = 5
b = 6 // Compile error: reassignment to val

// Type of foo is inferred
val foo = new ImportantClass(...)

var c = "Foo"
c = "Bar"
c = 4 // Compile error: type mismatch
  • Type inference is used extensively
  • Two types of variables: vals are single-assignment, vars can be reassigned
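
As a contrast, a short Python sketch of the same ideas. Python has no val/var distinction: every name can be re-bound, even to a value of a different type. The names `name` and `LIMIT` are illustrative assumptions. `typing.Final` is the closest analogue to a val, but it is only honoured by static checkers such as mypy, not by the interpreter.

```python
from typing import Final

a: int = 5   # the annotation is optional and not enforced at runtime
b = 5
b = 6        # allowed: every Python name can be re-bound

name = "Foo"
name = "Bar"
name = 4     # also allowed: a name can be re-bound to a different type

# Final marks a name as single-assignment, but only for static checkers;
# the interpreter itself would not reject a later re-assignment.
LIMIT: Final = 10
```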

Declaring functions


def max(x: Int, y: Int): Int = 
  if (x >= y) x else y
  • Statically typed
  • Evaluated expressions have types
  • If the return type is omitted, the compiler infers it as the most specific common supertype of all result expressions
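
A possible Python counterpart, as a sketch. The name `max_of` is chosen here only to avoid shadowing Python's built-in `max`. The type hints document the intent but, unlike in Scala, nothing checks them when the program runs.

```python
def max_of(x: int, y: int) -> int:
    # A conditional expression, like Scala's if/else, evaluates to a value
    return x if x >= y else y


print(max_of(3, 7))      # prints 7
# Type hints are not checked at runtime, so this call also succeeds
print(max_of("a", "b"))  # prints b
```

A static checker such as mypy would flag the second call; the interpreter does not.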

Higher order functions


def bigger(x: Int, y: Int,
  f: (Int,Int) => Boolean) =
  f(x, y)

bigger(1, 2, (x, y) => (x < y))
bigger(1, 2, (x, y) => (x > y))
// Compile error: x => x is not an (Int, Int) => Boolean
bigger(1, 2, x => x)

bigger is a higher-order (HO) function, i.e. a function whose behaviour is parametrised by another function. f is a function parameter. To call a HO function, we need to pass a function with the appropriate argument types. In Scala, the compiler checks this.
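
The same higher-order function can be sketched in Python as follows. The `Callable` annotation plays the role of Scala's `(Int, Int) => Boolean`, with the difference that a mismatching function is only detected when it is actually called. The variable `error_seen` is an illustrative helper, not part of the original example.

```python
from typing import Callable


def bigger(x: int, y: int, f: Callable[[int, int], bool]) -> bool:
    # f is a function parameter: the caller decides how x and y are compared
    return f(x, y)


print(bigger(1, 2, lambda x, y: x < y))  # prints True
print(bigger(1, 2, lambda x, y: x > y))  # prints False

# No compile-time check: the wrong arity only fails when f is called
error_seen = False
try:
    bigger(1, 2, lambda x: x)
except TypeError:
    error_seen = True
```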

Declaring classes


class Foo(val x: Int,
          var y: Double = 0.0)

// Type of a is inferred
val a = new Foo(1, 4.0)
println(a.x) // x is read-only
println(a.y) // y is read-write
a.y = 10.0
println(a.y) // prints 10.0
a.y = "Foo"  // Compile error: type mismatch, y is a Double
  • val means a read-only attribute; var is read-write
  • The class parameters define the primary constructor; no separate constructor needs to be written
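
A rough Python equivalent, as a sketch rather than a one-to-one translation. Python has no val/var keywords, so a read-only attribute is usually approximated with a property that has no setter. The variable `x_is_read_only` is an illustrative helper.

```python
class Foo:
    def __init__(self, x: int, y: float = 0.0):
        self._x = x  # leading underscore: private by convention only
        self.y = y   # plain attribute: read-write

    @property
    def x(self) -> int:
        # A property without a setter behaves like a read-only attribute
        return self._x


a = Foo(1, 4.0)
a.y = 10.0

x_is_read_only = False
try:
    a.x = 2  # raises AttributeError because the property has no setter
except AttributeError:
    x_is_read_only = True

a.y = "Foo"  # no error: attribute types are not checked at runtime
```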

Object-Oriented programming


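class Foo(val x: Int,
          var y: Double = 0.0)

class Bar(x: Int, y: Int, z: Int)
  extends Foo(x, y)

trait Printable {
  val s: String
  def asString(): String
}

class Baz(x: Int, y: Double, private val z: Int)
  extends Foo(x, y)
  with Printable {
  // A concrete class must implement the trait's abstract members;
  // the values below are illustrative
  val s: String = "Baz"
  def asString(): String = s + "(" + z + ")"
}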

In both languages, the traditional rules of method overriding apply. Traits in Scala are similar to Java interfaces with default methods (Java 8 and later); in addition, they can include attributes (state).
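
The class hierarchy above could look as follows in Python. This is a sketch under the assumption that an abstract base class is an acceptable stand-in for a Scala trait; the method name `as_string` and the string format are illustrative choices.

```python
from abc import ABC, abstractmethod


class Foo:
    def __init__(self, x: int, y: float = 0.0):
        self.x = x
        self.y = y


class Bar(Foo):
    def __init__(self, x: int, y: int, z: int):
        super().__init__(x, y)  # call the parent constructor explicitly
        self.z = z


class Printable(ABC):
    @abstractmethod
    def as_string(self) -> str:
        """Subclasses must provide a string representation."""


class Baz(Foo, Printable):  # multiple inheritance mixes Printable in
    def __init__(self, x: int, y: float, z: int):
        super().__init__(x, y)
        self._z = z  # private by convention

    def as_string(self) -> str:
        return f"Baz({self.x}, {self.y}, {self._z})"


print(Baz(1, 2.0, 3).as_string())  # prints Baz(1, 2.0, 3)
```

Instantiating `Printable` directly raises a `TypeError`, much as an abstract trait member in Scala must be implemented before a class can be instantiated.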

Data classes


case class Address(street: String, 
  number: Int)
case class Person(name: String, 
  address: Address)

val p = new Person("G", 
  new Address("a", 2))

Data classes are blueprints for immutable objects. We use them to represent data records. Both languages implement equals (or __eq__) for them, so we can compare objects directly.
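
In Python, the standard `dataclasses` module offers the closest equivalent. The sketch below uses `frozen=True` to get immutability, which case classes provide by default. The names `q` and `mutation_blocked` are illustrative.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Address:
    street: str
    number: int


@dataclass(frozen=True)
class Person:
    name: str
    address: Address


p = Person("G", Address("a", 2))
q = Person("G", Address("a", 2))

# __eq__ is generated: two distinct objects with equal fields compare equal
print(p == q)  # prints True
print(p is q)  # prints False

mutation_blocked = False
try:
    p.name = "H"  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    mutation_blocked = True
```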

Pattern matching in Scala

Pattern matching is if..else on steroids

// Code for demo only, won't compile

value match {
  // Match on a value, like if
  case 1 => "One"
  // Match on the contents of a list
  case x :: xs => "The remaining contents are " + xs
  // Match on a case class, extract values
  case Email(addr, title, _) => s"New email: $title..."
  // Match on the type
  case xs : List[_] => "This is a list"
  // With a pattern guard
  case xs : List[Int] if xs.head == 5 => "This is a list of integers"
  case _ => "This is the default case"
}

Reading ahead

This is by no means a complete introduction to either programming language. Please read more here


G. Hutton, “A tutorial on the universality and expressiveness of fold,” Journal of Functional Programming, vol. 9, no. 4, pp. 355–372, 1999.