An overloaded, fuzzy term
“Data too large to be efficiently processed on a single computer”
“Massive amounts of diverse, unstructured data produced by high-performance applications”
Typical numbers associated with Big Data
Warning: the numbers above are from 2014! Today on FB:
Main Vs, by Doug Laney
More Vs
We call Big Data big because it is really big:
We often need to combine various data sources of different types to come up with a result
Data is not just big; it is generated and needs to be processed fast. Think of:
Data needs to be processed with soft or hard real-time guarantees
The ETL cycle
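ETL stands for extract, transform, load. The following is a minimal sketch of one pass through that cycle, not a production pipeline. It assumes a small inline CSV string as the raw source and an in-memory SQLite database as the target; the data, the `payments` table, and the function names are all hypothetical and chosen only for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would read from files, APIs, or logs.
RAW_CSV = """name,country,amount
alice,NL,10.50
bob,nl,3.25
carol,GR,not-a-number
dave,GR,7.00
"""


def extract(raw_text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: normalise country codes and drop rows with invalid amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip malformed records
        clean.append((row["name"], row["country"].upper(), amount))
    return clean


def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (name TEXT, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", records)
    conn.commit()


# Run the three stages in order, then query the loaded data.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
totals = conn.execute(
    "SELECT country, SUM(amount) FROM payments GROUP BY country ORDER BY country"
).fetchall()
print(totals)
```

Keeping the three stages as separate functions means each one can be tested, replaced, or scaled on its own, which is the property larger pipelines rely on.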
Big data engineering is concerned with building pipelines
Big data analytics is concerned with discovering patterns
Two basic approaches to distributing data processing operations across many machines
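One widely used way to distribute processing is data parallelism: split the data into chunks, apply the same operation to every chunk independently, and combine the partial results. The sketch below illustrates the idea with a word count. It is a toy under stated assumptions: threads on one computer stand in for separate machines, and the input lines and helper names (`split_into_chunks`, `count_words`) are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a non-empty list into at most n_chunks contiguous parts."""
    size = -(-len(items) // n_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_words(lines):
    """The operation applied independently to every chunk."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


lines = [
    "big data is big",
    "data needs processing",
    "processing is distributed",
    "big clusters process data",
]

chunks = split_into_chunks(lines, n_chunks=2)

# Threads stand in for machines here (an assumption for illustration only).
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_words, chunks))

# Combine the per-chunk results into a single answer.
total = sum(partial_counts, Counter())
print(total.most_common(3))
```

Because `count_words` never looks outside its own chunk, the chunks can be processed in any order and on any worker; only the final combine step needs to see all partial results.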
Not a new discipline:
What is new?
Large scale processing on distributed, commodity computers, enabled by advanced software using elastic resource allocation.
Software (not HW!) is what drives the Big Data industry
D: Most advances in Big Data technologies came from industry. Universities only started contributing late. Why?
Data is the new oil
Figure by Banko and Brill, 2001. They showed that, given enough training data, simple algorithms can match or outperform more complex ones.
D: Identify a big data system you interact with every day. Try to answer the following questions:
Use this URL: http://bit.ly/big-data-course-1
Work with the people around you for 5 minutes
This work is (c) 2017 - onwards by TU Delft and Georgios Gousios and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.