Big and Fast Data

What is big data?

An overloaded, fuzzy term

“Data too large to be efficiently processed on a single computer”

“Massive amounts of diverse, unstructured data produced by high-performance applications”

How big is “Big?”

Typical numbers associated with Big Data

2.5 Exabytes (1 EB = \(10^6\) TB) produced daily
IoT: 8.4 Billion devices with internet access
Amazon: 600 orders per second (2016) up from 35 in 2012
Alibaba in 2015: 400 orders per second
Google in 2016: 2+ trillion searches per year, or about 65k per second
- Each query involves > 1000 machines
- Each search touches 200+ services
Alibaba on Singles Day 2018: $1 billion in revenue in 1 min and 25 secs.
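
The per-second figures above are simple unit conversions. A minimal sketch of the arithmetic, assuming decimal (SI) units (1 EB = 10^18 bytes, 1 TB = 10^12 bytes), a 365-day year, and exactly 2 trillion searches as a lower bound; the helper name `per_second` is made up for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: non-leap year

BYTES_PER_TB = 10**12  # SI terabyte
BYTES_PER_EB = 10**18  # SI exabyte


def per_second(total: float, period_seconds: int) -> float:
    """Average rate per second for a total spread over a period."""
    return total / period_seconds


# 2 trillion searches per year, expressed as searches per second.
google_qps = per_second(2e12, SECONDS_PER_YEAR)

# 2.5 EB per day, expressed in TB per day and TB per second.
tb_per_day = 2.5 * BYTES_PER_EB / BYTES_PER_TB
tb_per_second = per_second(tb_per_day, SECONDS_PER_DAY)

print(f"{google_qps:,.0f} searches/sec")
print(f"{tb_per_day:,.0f} TB/day, {tb_per_second:.1f} TB/sec")
```

With "2+ trillion" searches the true rate is somewhat higher than the lower bound computed here.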

How big is “Big?” – Instagram

Instagram

1B daily users, clicking around the app
95M photos daily
Most followed user: 181M followers

How big is “Big?” – Facebook

Facebook

Warning: numbers (\(\Uparrow\)) from 2014! Today on FB:

2 Billion users
1.32 Billion active users per day
350 million photos per day (about 243k/min)
Every min: 510k comments, 293k status updates

The many Vs of Big data

Main Vs, by Doug Laney

Volume: large amounts of data
Variety: data comes in many different forms from diverse sources
Velocity: the content is changing quickly

More Vs

Value: data alone is not enough; how can value be derived from it?
Veracity: can we trust the data? How accurate is it?
Validity: ensure that the interpreted data is sound
Visibility: data from diverse sources need to be stitched together

Volume

We call Big Data big because it is really big:

90% of all the data ever created was generated in the last 2 years
By 2020, each person will generate 1.7MB per sec
The Big data / data analytics industry will be worth €200 Billion in 2020

Data growth rate

Variety

Structured data: SQL tables, images, format is known
Semi-structured data: JSON, XML
Unstructured data: Text, mostly

We often need to combine various data sources of different types to come up with a result
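
As a rough illustration of combining sources of different types, the sketch below joins a structured CSV table with a semi-structured JSON document and pulls order numbers out of free text. All data, field names, and function names are invented for the example:

```python
import csv
import io
import json
import re

# Structured: fixed columns, known format.
ORDERS_CSV = "order_id,user_id,amount\n1,u1,20.5\n2,u2,5.0\n3,u1,7.5\n"
# Semi-structured: records may carry optional fields (e.g. "tags").
USERS_JSON = '[{"id": "u1", "name": "Ada", "tags": ["vip"]}, {"id": "u2", "name": "Bob"}]'
# Unstructured: plain text.
REVIEW_TEXT = "Order 3 arrived late, but order 1 was great."


def total_per_user(orders_csv: str, users_json: str) -> dict:
    """Join orders (CSV) with users (JSON) and sum the amounts per user name."""
    users = {u["id"]: u for u in json.loads(users_json)}
    totals = {}
    for row in csv.DictReader(io.StringIO(orders_csv)):
        name = users[row["user_id"]]["name"]
        totals[name] = totals.get(name, 0.0) + float(row["amount"])
    return totals


def mentioned_orders(text: str) -> list:
    """Extract order numbers mentioned in free text, in order of appearance."""
    return [int(n) for n in re.findall(r"order (\d+)", text, flags=re.IGNORECASE)]


print(total_per_user(ORDERS_CSV, USERS_JSON))
print(mentioned_orders(REVIEW_TEXT))
```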

Velocity

Data is not just big; it is generated and needs to be processed fast. Think of:

Datacenters writing to log files
IoT sensors reporting temperatures around the globe
Twitter: 500 million tweets a day (or 6k/sec)
Stock Markets: high-frequency trading (latency costs money)
Online advertising

Data needs to be processed with soft or hard real-time guarantees

Big Data processing

The ETL cycle
- Extract: Convert raw or semi-structured data into structured data
- Transform: Convert units, join data sources, clean up, etc.
- Load: Load the data into another system for further processing
Big data engineering is concerned with building pipelines
Big data analytics is concerned with discovering patterns
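
The three ETL steps can be sketched as a tiny pipeline. This is only an illustration under assumed inputs: the sensor records, the Fahrenheit-to-Celsius conversion, and the `readings` table are made up, and an in-memory SQLite database stands in for the target system:

```python
import json
import sqlite3

# Hypothetical raw input: one JSON document per line, some of it malformed.
RAW_LINES = [
    '{"sensor": "a", "temp_f": 68.0}',
    '{"sensor": "b", "temp_f": 212.0}',
    "not json",
]


def extract(lines):
    """Extract: parse raw lines into records, skipping lines that do not parse."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def transform(records):
    """Transform: convert units (Fahrenheit to Celsius) and keep needed fields."""
    for record in records:
        celsius = round((record["temp_f"] - 32) * 5 / 9, 2)
        yield (record["sensor"], celsius)


def load(rows, conn):
    """Load: write the cleaned rows into another system (here, SQLite)."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


def run_pipeline(lines):
    conn = sqlite3.connect(":memory:")
    load(transform(extract(lines)), conn)
    return conn


conn = run_pipeline(RAW_LINES)
print(conn.execute("SELECT sensor, temp_c FROM readings ORDER BY sensor").fetchall())
```

Each stage is a generator, so records flow through the pipeline one at a time rather than being materialised between steps.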

How to process all this data?

Batch processing: All data exists in some data store, a program processes the whole dataset at once
Stream processing: Data is processed as it arrives in the system
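
A minimal sketch of the difference, using a running mean as the computation. The function names are illustrative; a real system would read from a data store or a message queue rather than from in-memory lists:

```python
from typing import Iterable, Iterator, Sequence


def batch_mean(dataset: Sequence[float]) -> float:
    """Batch: the whole dataset is available, so compute one final answer."""
    return sum(dataset) / len(dataset)


def stream_mean(events: Iterable[float]) -> Iterator[float]:
    """Stream: update the answer incrementally as each event arrives."""
    count, total = 0, 0.0
    for value in events:
        count += 1
        total += value
        yield total / count  # an up-to-date result after every event


readings = [2.0, 4.0, 6.0]
print(batch_mean(readings))
print(list(stream_mean(readings)))
```

The streaming version keeps only a small amount of state (a count and a total), which is what lets it run over inputs that never end.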

Two basic approaches to distributing data processing operations across many machines

Divide the data in chunks, apply the same algorithm on all chunks (data parallelism)
Divide the problem in chunks, run it on a cluster of machines (task parallelism)
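
Both approaches can be sketched on a single machine, with a thread pool standing in for a cluster (an assumption made purely for illustration; the function names are hypothetical). The first function splits the data and runs the same computation on every chunk; the second splits the work into different computations that run side by side:

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(data, n):
    """Split data into at most n contiguous chunks of roughly equal size."""
    size = max(1, (len(data) + n - 1) // n)
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk):
    return sum(x * x for x in chunk)


def split_the_data(data, workers=4):
    """Same algorithm on every chunk; partial results are combined at the end."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(sum_of_squares, chunks(data, workers))
        return sum(partials)


def split_the_problem(data):
    """Different sub-tasks over the same data, each submitted to its own worker."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {
            "min": pool.submit(min, data),
            "max": pool.submit(max, data),
            "sum": pool.submit(sum, data),
        }
        return {name: future.result() for name, future in futures.items()}


numbers = list(range(10))
print(split_the_data(numbers))
print(split_the_problem(numbers))
```

Splitting the data only works this simply when partial results can be merged, as with a sum.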

Large-scale computing

Not a new discipline:

Cray-1 appeared in the late ’70s
Physicists used supercomputers for simulations in the ’80s
Shared-memory designs still in large scale use (e.g. TOP500 supercomputers)

What is new?

Large scale processing on distributed, commodity computers, enabled by advanced software using elastic resource allocation.

Software (not HW!) is what drives the Big Data industry

A brief history of Big Data tech

2003: Google publishes the Google File System paper, describing a large-scale distributed file system
2004: Google publishes the Map/Reduce paper, a distributed data processing abstraction
2006: Yahoo creates and open sources Hadoop, inspired by the Google papers
2006: Amazon launches its Elastic Compute Cloud, offering cheap, elastic resources
2007: Amazon publishes the Dynamo paper, sketching the blueprint of a cloud-native database
2009 – onwards: The NoSQL movement. Schema-less, distributed databases defy the SQL way of storing data
2010: Matei Zaharia et al. publish the Spark paper, bringing functional programming (FP) to in-memory computations
2012: Both Spark Streaming and Apache Flink appear, able to handle very high-volume stream processing
2012: Alex Krizhevsky et al. publish their deep learning image classification paper re-igniting interest in neural networks and solidifying the value of big data

The Big Data Tech Landscape 2017

The big data landscape

Progress is mostly industry-driven

D: Most advances in Big Data technologies came from industry. Universities only started contributing late. Why?

Data is the new oil

Typical problems solved with Big Data

Modeling: What factors influence particular outcomes/behaviours?
Information retrieval: Finding needles in haystacks, aka search engines
Collaborative filtering: Recommending items based on items other users with similar tastes have chosen
Outlier detection: Discovering anomalous transactions
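
As one concrete instance of the last item, the sketch below flags outlying transaction amounts with a modified z-score based on the median absolute deviation (MAD). This is just one common rule of thumb, not a method prescribed here; the constant 0.6745, the threshold 3.5, and the sample amounts are assumptions for illustration:

```python
from statistics import median


def find_outliers(amounts, threshold=3.5):
    """Return the amounts whose modified z-score exceeds the threshold."""
    centre = median(amounts)
    mad = median(abs(x - centre) for x in amounts)
    if mad == 0:
        # More than half the values are identical; this rule cannot rank them.
        return []
    return [x for x in amounts if 0.6745 * abs(x - centre) / mad > threshold]


transactions = [12.0, 15.0, 11.0, 14.0, 13.0, 950.0]
print(find_outliers(transactions))
```

Using the median rather than the mean keeps a single extreme value from inflating the scale it is measured against.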

Image credits

Data is the new oil picture (c) the Economist

Bibliography

[1]

J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” in Proceedings of the 6th conference on symposium on operating systems design & implementation - volume 6, 2004, pp. 10–10.

[2]

S. Ghemawat, H. Gobioff, and S.-T. Leung, “The google file system,” in Proceedings of the nineteenth ACM symposium on operating systems principles, 2003, pp. 29–43.

[3]

G. DeCandia et al., “Dynamo: Amazon’s highly available key-value store,” ACM SIGOPS operating systems review, vol. 41, no. 6, pp. 205–220, 2007.

[4]

M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets.” HotCloud, vol. 10, no. 10–10, p. 95, 2010.

[5]

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in neural information processing systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.

[6]

A. Halevy, P. Norvig, and F. Pereira, “The unreasonable effectiveness of data,” IEEE Intelligent Systems, vol. 24, no. 2, pp. 8–12, 2009.

Big and Fast Data

Georgios Gousios

09 September 2021
