Languages for Big Data processing

Scala and Python

The de facto languages of Big Data and data science are

  • Scala: mostly used for data-intensive systems
  • Python: mostly used for data analytics tasks

Other languages include

  • Java: the “assembly” of big data systems; the language in which most big data infrastructure is written.
  • R: the statistician’s tool of choice. Great selection of libraries for serious data analytics, great plotting tools.

In our course, we will be using Scala and Python.

Scala and Python from 10k feet

  • Both support object orientation, functional programming and imperative programming
    • Scala’s strong point is the combination of functional (FP) and object-oriented (OO) programming
    • Python’s strong point is the combination of object-oriented (OO) and imperative (IP) programming
  • Python is interpreted, Scala is compiled
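
To make the paradigm mix concrete, here is a minimal sketch that sums a list twice in Scala, once imperatively and once functionally. The object and method names (SumDemo, sumImperative, sumFunctional) are made up for illustration.

```scala
object SumDemo {
  // Imperative style: a mutable accumulator updated inside a loop
  def sumImperative(xs: List[Int]): Int = {
    var total = 0
    for (x <- xs) {
      total += x
    }
    total
  }

  // Functional style: fold the list with a function, no mutation
  def sumFunctional(xs: List[Int]): Int =
    xs.foldLeft(0)((acc, x) => acc + x)

  def main(args: Array[String]): Unit = {
    val xs = List(1, 2, 3, 4)
    println(sumImperative(xs)) // prints 10
    println(sumFunctional(xs)) // prints 10
  }
}
```

Both versions live inside an object, which is the OO part of the mix.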

Hello world

Scala

object Hello extends App {
  println("Hello, world")
  for (i <- 1 to 10) {
    System.out.println("Hello")
  }
}
  • Scala is compiled to JVM bytecode
  • Can interoperate with JVM libraries
  • Scala is not sensitive to spaces/tabs. Blocks are denoted by { and }
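
As a small illustration of JVM interoperability, the sketch below uses java.util.ArrayList directly from Scala. The names InteropDemo and javaList are hypothetical; the Java class and its methods are the standard ones.

```scala
import java.util.{ArrayList => JArrayList}

object InteropDemo {
  // Build and fill a java.util.ArrayList, just as Java code would
  def javaList(): JArrayList[String] = {
    val xs = new JArrayList[String]()
    xs.add("Hello")
    xs.add("world")
    xs
  }

  def main(args: Array[String]): Unit = {
    val xs = javaList()
    println(xs.size())                   // prints 2
    println(xs.get(0) + ", " + xs.get(1)) // prints Hello, world
  }
}
```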

Declarations

Scala

val a: Int = 5
val b = 5
b = 6 // compile error: reassignment to val

// Type of foo is inferred
val foo = new ImportantClass(...)

var c = "Foo"
c = "Bar"
c = 4 // compile error: type mismatch, c is a String
  • Type inference used extensively
  • Two types of variables: vals are single-assignment, vars are multiple assignment
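
The following sketch shows the same rules in a form that compiles, with the failing statements commented out. DeclarationsDemo and total are illustrative names only.

```scala
object DeclarationsDemo {
  def total(): Int = {
    val count = 5              // inferred as Int
    val names = List("a", "b") // inferred as List[String]
    var sum = 0                // var: can be reassigned, but stays an Int

    sum = sum + count
    sum = sum + names.length

    // count = 6    // would not compile: reassignment to val
    // sum = "six"  // would not compile: type mismatch

    sum
  }

  def main(args: Array[String]): Unit =
    println(total()) // prints 7
}
```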

Declaring functions

Scala

def max(x: Int, y: Int): Int = {
  if (x >= y)
    x
  else
    y
}
  • Scala is statically typed
  • The value of the last evaluated expression is the function's result. If the return type annotation is omitted, the compiler infers it from that expression
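
Because if/else is itself an expression, the function above can be shortened. A minimal sketch follows; describe is a made-up helper used only to show that the last expression of a block becomes its value.

```scala
object ExpressionDemo {
  // Single-expression body; the return type Int is inferred
  def max(x: Int, y: Int) = if (x >= y) x else y

  // The last expression of the block is the function's result
  def describe(n: Int): String = {
    val sign = if (n < 0) "negative" else "non-negative"
    s"$n is $sign"
  }

  def main(args: Array[String]): Unit = {
    println(max(3, 7))    // prints 7
    println(describe(-2)) // prints -2 is negative
  }
}
```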

Scala

def bigger(x: Int, y: Int,
  f: (Int,Int) => Boolean) = {

  f(x, y)
}

bigger (1, 2, (x, y) => (x < y))
bigger (1, 2, (x, y) => (x > y))
// Compile error: x => x is not an (Int, Int) => Boolean
bigger (1, 2, x => x)

bigger is a higher-order function, i.e. a function whose behaviour is parametrised by another function; f is its function parameter. To call a higher-order function, we need to pass a function with the appropriate signature. In Scala, the compiler checks this.
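
A self-contained sketch of the same idea: bigger is repeated from above with an explicit return type, lessThan is a hypothetical named function value, and evensTimesTen shows that the standard collection methods filter and map are higher-order functions too.

```scala
object HigherOrderDemo {
  def bigger(x: Int, y: Int, f: (Int, Int) => Boolean): Boolean =
    f(x, y)

  // A named function value with the signature (Int, Int) => Boolean
  val lessThan: (Int, Int) => Boolean = (x, y) => x < y

  // filter and map each take a function as their argument
  def evensTimesTen(xs: List[Int]): List[Int] =
    xs.filter(x => x % 2 == 0).map(x => x * 10)

  def main(args: Array[String]): Unit = {
    println(bigger(1, 2, lessThan))          // prints true
    println(bigger(1, 2, _ > _))             // prints false
    println(evensTimesTen(List(1, 2, 3, 4))) // prints List(20, 40)
  }
}
```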

Declaring classes

Scala

class Foo(val x: Int,
          var y: Double = 0.0)

// Type of a is inferred
val a = new Foo(1, 4.0)
println(a.x) // x is read-only
println(a.y) // y is read-write
a.y = 10.0
println(a.y) // prints 10.0
a.y = "Foo"  // Type mismatch, y is a Double
  • val declares a read-only attribute; var declares a read-write one
  • The class parameters define the primary constructor, so no separate constructor needs to be written
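
The default value for y also means the constructor can be called with fewer arguments, or with named ones. A minimal sketch, wrapping Foo in a hypothetical ClassDemo object so that it compiles on its own:

```scala
object ClassDemo {
  class Foo(val x: Int, var y: Double = 0.0)

  def main(args: Array[String]): Unit = {
    val a = new Foo(1)              // y falls back to its default, 0.0
    val b = new Foo(x = 2, y = 3.5) // arguments can be passed by name

    a.y = a.y + 1.0 // allowed, y is a var
    // a.x = 5      // would not compile: x is a val

    println(a.y) // prints 1.0
    println(b.x) // prints 2
  }
}
```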

Object-Oriented programming

Scala

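class Foo(val x: Int,
          var y: Double = 0.0)

class Bar(x: Int, y: Int, z: Int)
  extends Foo(x, y)

trait Printable {
  val s: String
  def asString(): String
}

// A private attribute must be declared with val (or var)
class Baz(x: Int, y: Double, private val z: Int)
  extends Foo(x, y)
  with Printable {
  // A concrete class must implement the trait's abstract members
  val s: String = "Baz"
  def asString(): String = s + "(" + z + ")"
}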

The traditional rules of method overriding apply. Traits in Scala are similar to interfaces in Java; in addition, they may implement the methods they declare and may carry attributes (state). A concrete class that mixes in a trait must implement the trait's abstract members.
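
The sketch below illustrates a trait that carries state and a concrete method, plus a subclass that overrides part of it. Greeter, English and Dutch are hypothetical names chosen for the example.

```scala
object TraitDemo {
  trait Greeter {
    val greeting: String = "Hello"            // state with a default value
    def name: String                          // abstract: left to the class
    def greet(): String = s"$greeting, $name" // implemented in the trait
  }

  // Supplying name as a val implements the abstract member
  class English(val name: String) extends Greeter

  // Overriding replaces the inherited value
  class Dutch(val name: String) extends Greeter {
    override val greeting: String = "Hallo"
  }

  def main(args: Array[String]): Unit = {
    println(new English("Ann").greet()) // prints Hello, Ann
    println(new Dutch("Jan").greet())   // prints Hallo, Jan
  }
}
```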

Case classes in Scala

case class Address(street: String, number: Int)
case class Person(name: String, address: Address)

// Case classes can be instantiated without new
val a1 = Address("Mekelweg", 4)
val p1 = Person("Georgios", a1)

val p2 = Person("Georgios", a1)

p1 == p2 // true

Case classes are blueprints for immutable objects. We use them to represent data records. Scala automatically implements hashCode and equals for them, so we can compare them directly.
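
A short sketch of what the generated methods give us in practice. The classes are repeated from above inside a hypothetical CaseClassDemo object; the name "Maria" is invented for the example.

```scala
object CaseClassDemo {
  case class Address(street: String, number: Int)
  case class Person(name: String, address: Address)

  def main(args: Array[String]): Unit = {
    val a1 = Address("Mekelweg", 4)
    val p1 = Person("Georgios", a1)

    // copy creates a modified copy; p1 itself is left unchanged
    val p2 = p1.copy(name = "Maria")

    println(p1)       // prints Person(Georgios,Address(Mekelweg,4))
    println(p1 == p2) // prints false

    // equals and hashCode make structurally equal records collapse in a Set
    println(Set(p1, Person("Georgios", a1)).size) // prints 1
  }
}
```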

Pattern matching in Scala

Pattern matching is if..else on steroids

// Code for demo only, won't compile

value match {
  // Match on a value, like if
  case 1 => "One"
  // Match on the contents of a list
  case x :: xs => "The remaining contents are " + xs
  // Match on a case class, extract values
  case Email(addr, title, _) => s"New email: $title..."
  // Match on the type
  case xs : List[_] => "This is a list"
  // With a pattern guard (List[Int] is unchecked at runtime due to type erasure)
  case xs : List[Int] if xs.head == 5 => "This is a list of integers"
  case _ => "This is the default case"
}
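
A version that does compile could look like the sketch below. It splits the demo into two functions so that each match has a sensible input type. The Email field names (addr, title, body) and the function names are assumptions made for illustration.

```scala
object MatchDemo {
  // Hypothetical record type; the three fields are assumed
  case class Email(addr: String, title: String, body: String)

  def describe(value: Any): String = value match {
    // Match on a value
    case 1 => "One"
    // Match on a case class, extract values
    case Email(_, title, _) => s"New email: $title"
    // Match on the type, with a pattern guard
    case n: Int if n < 0 => "A negative integer"
    // Match on the type only
    case _: List[_] => "This is a list"
    case _ => "This is the default case"
  }

  def describeList(xs: List[Int]): String = xs match {
    case Nil => "Empty"
    // Match on the first element
    case 5 :: _ => "Starts with 5"
    // Split into head and tail
    case x :: rest => s"Head $x, ${rest.length} remaining"
  }
}
```

Cases are tried from top to bottom, so the more specific patterns are listed before the general ones.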

Reading ahead

This is by no means a complete introduction to either programming language. Please read more here

Pick one and become good at it!

  • BSc student? Pick Scala
  • Minor student? Pick Python
