Data Science is similar to physics: it attempts to create theories of reality based on a formalism that another science provides. For physics that formalism was mathematics; for data science it is computer science. Data has grown enormously in recent years, reaching amounts that, laid out in metres, would be comparable to the distance to Jupiter. For scale, the observable universe is on the order of 400 Yottametres, and a Yottametre is $10^{24}$ metres, i.e. a 1 followed by $3 \cdot 8 = 24$ zeros. So quite a lot. We don’t know if the amount of data will keep growing this fast, but we certainly need to be prepared for the case that it does.
A Short History of Information Systems
We can pinpoint three main historical developments of information systems that coincide with revolutions in human history. Before all of them, humans simply stored information in their brains: stories and culture were mainly transmitted orally from person to person. This is similar to the point made by professor Harari in (Harari 2024): information networks allowed humans to create large-scale alliances.
- Humans invent writing: people first needed storage for economic transactions, which created the need for durable tablets on which to record this information.
- Humans invent the printing press: Gutenberg’s innovation made storing and duplicating information much cheaper than manual copying and writing. This allowed ideas such as the Christian religion to spread even further and have a much greater impact.
- Humans invent silicon-based processors: this innovation enabled massive storage and processing as well as ultra-fast communication, having another profound effect on humanity as a whole.
Some milestones: the Ishango bone, roughly 20,000 years old, is one of the first known data storage artifacts; around 250 BC there was the Library of Alexandria. With physical books it was very difficult to extract higher-level trends; now we can analyze hundreds of thousands or even millions of documents very quickly.
Evaluating an Information system
Velocity 🟩
For velocity we care about capacity, throughput and latency. Capacity is how much you can store, throughput is how fast you can read, and latency is how long you have to wait until the first byte of data arrives. One of the first hard drives, in 1956, had a capacity of 5 MB, a throughput of 12.5 kB/s and a latency of 600 ms. A 2024 hard drive has a capacity of 26 TB, a throughput of 261 MB/s and a latency of 4 ms.
The important thing to observe is that
- Capacity has exploded, by a factor of more than five million (about six to seven orders of magnitude)!
- Throughput has also increased, but only by about 4 orders of magnitude.
- Latency has hardly advanced at all (a factor of about 150).
The important consequence is that if we want to process data at the same pace at which we can store it, we have to parallelize reads across many devices.
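To make the comparison concrete, here is a small back-of-the-envelope calculation in Python, using only the 1956 and 2024 figures quoted above (the numbers are illustrative, not authoritative benchmarks):

```python
import math

# Figures quoted above: a 1956 drive vs. a 2024 hard drive.
old = {"capacity_bytes": 5e6, "throughput_Bps": 12.5e3, "latency_s": 0.600}
new = {"capacity_bytes": 26e12, "throughput_Bps": 261e6, "latency_s": 0.004}

for metric in old:
    # For latency, "improvement" means getting smaller, so invert the ratio.
    ratio = old[metric] / new[metric] if metric == "latency_s" else new[metric] / old[metric]
    print(f"{metric}: factor {ratio:,.0f} (~{math.log10(ratio):.1f} orders of magnitude)")

# Time for one sequential full scan of the drive: this is why parallelism matters.
print(f"Full scan in 1956: {old['capacity_bytes'] / old['throughput_Bps'] / 60:.1f} minutes")
print(f"Full scan in 2024: {new['capacity_bytes'] / new['throughput_Bps'] / 3600:.1f} hours")
```

With these figures, capacity grew by a factor of about five million, throughput by about twenty thousand, and latency only by about 150; a sequential scan of a full modern drive takes more than a day, which is exactly why reads have to be spread over many devices.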
Volume 🟩
How big does data have to be to be considered Big Data? First we need to learn the scales :D and orders of magnitude! These should be learned by heart:
- Kilo - 1.000 ($10^3$)
- Mega - 1.000.000 ($10^6$)
- Giga - 1.000.000.000 ($10^9$)
- Tera - 1.000.000.000.000 ($10^{12}$)
- Peta - 1.000.000.000.000.000 ($10^{15}$)
- Exa - 1.000.000.000.000.000.000 ($10^{18}$)
- Zetta - 1.000.000.000.000.000.000.000 ($10^{21}$)
- Yotta - 1.000.000.000.000.000.000.000.000 ($10^{24}$)
- Ronna - 1.000.000.000.000.000.000.000.000.000 ($10^{27}$)
- Quetta - 1.000.000.000.000.000.000.000.000.000.000 ($10^{30}$)
Going in the other direction, towards the sub-multiples, we have:
- Milli ($10^{-3}$)
- Micro ($10^{-6}$)
- Nano ($10^{-9}$)
- Pico ($10^{-12}$)
- Femto ($10^{-15}$)
- Atto ($10^{-18}$)
- Zepto ($10^{-21}$)
- Yocto ($10^{-24}$)
All by heart! The threshold for Big Data is currently around the Petabyte, because that is more than can be stored on a single computer.
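As a small memory aid, here is a sketch of a helper that maps a byte count to the decimal prefixes listed above (the function name and output format are my own choices, not from the course material):

```python
# Decimal (SI) prefixes, as listed above, from Kilo up to Quetta.
PREFIXES = ["", "Kilo", "Mega", "Giga", "Tera", "Peta",
            "Exa", "Zetta", "Yotta", "Ronna", "Quetta"]

def human_size(num_bytes: float) -> str:
    """Express a byte count with the largest fitting decimal prefix."""
    exponent = 0
    while num_bytes >= 1000 and exponent < len(PREFIXES) - 1:
        num_bytes /= 1000
        exponent += 1
    return f"{num_bytes:.2f} {PREFIXES[exponent]}bytes"

print(human_size(5e6))    # 5.00 Megabytes  (a 1956 hard drive)
print(human_size(26e12))  # 26.00 Terabytes (a 2024 hard drive)
print(human_size(50e15))  # 50.00 Petabytes (CERN per year) -> Big Data territory
```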
Variety 🟩
Data can come in different shapes; it’s important for the exam that you learn these shapes by heart (a small sketch after this list illustrates some of them):
- Graphs
- Cubes
- Unstructured
- Trees
- Tables
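To make these shapes concrete, here is a tiny illustrative sketch (my own example, not from the lecture) of the same information represented as a table, a tree and a graph in plain Python:

```python
# Table: a list of flat rows with fixed columns.
table = [
    {"id": 1, "name": "Alice", "city": "Zurich"},
    {"id": 2, "name": "Bob",   "city": "Geneva"},
]

# Tree: nested documents, which is what JSON/XML naturally express.
tree = {
    "Zurich": {"people": [{"id": 1, "name": "Alice"}]},
    "Geneva": {"people": [{"id": 2, "name": "Bob"}]},
}

# Graph: nodes plus edges, e.g. "Alice knows Bob".
nodes = {1: "Alice", 2: "Bob"}
edges = [(1, 2, "knows")]

print(table[0]["name"], tree["Zurich"]["people"][0]["name"], edges[0])
```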
A definition of Big Data 🟩
Big Data is a portfolio of technologies that were designed to store, manage and analyze data that is too large to fit on a single machine, while accommodating the growing discrepancy between capacity, throughput and latency.
This is related to the definitions of data, information, knowledge and wisdom, which you can find here.
Usage examples 🟩–
There are real-life companies and environments where storing many gigabytes of data every day is completely routine, for example:
- CERN produces 50PB of data every year.
- The Sloan Digital Sky Survey (SDSS), which attempts to map every part of the sky, produces 200 GB of data every day and has the most detailed 3D map of the sky to date.
- Also biology: DNA itself can be seen as a data storage device.
Read-intensive and write-intensive systems
Read-intensive systems are called OLAP (Online Analytical Processing) systems, while write-intensive systems are called OLTP (Online Transaction Processing) systems.
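As an illustration of the two workloads, here is a minimal sketch using SQLite (which stands in for whatever real OLTP/OLAP system one would actually use): the OLTP part runs many small write transactions, the OLAP part runs one large read query that scans and aggregates the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, amount REAL)")

# OLTP-style: many small write transactions (one order at a time).
with conn:  # each block commits a short transaction
    conn.execute("INSERT INTO sales (product, amount) VALUES (?, ?)", ("book", 12.5))
with conn:
    conn.execute("INSERT INTO sales (product, amount) VALUES (?, ?)", ("pen", 1.2))

# OLAP-style: one large read query aggregating over all rows.
total, count = conn.execute("SELECT SUM(amount), COUNT(*) FROM sales").fetchone()
print(f"{count} sales, total revenue {total}")
```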
Evolution of the data stack
We have 10 layers, instead of the 7 layers of the ISO OSI networking stack (see Architettura e livelli 1, 2).
We will rebuild the whole data stack and understand how each layer works together with the others to handle Big Data.
For each part we will link some important notes:
- Storage: Cloud Storage, Wide Column Storage, Distributed file systems
- Encoding and Syntax: HTML and Markup
- Data models and Validation: Data Models and Validation
- Processing: Massive Parallel Processing
CAP Theorem
The CAP theorem states that we can only have two of the following three properties:
- Consistency (the answer does not depend on which machine answers your request).
- Availability (the system always answers something).
- Partition tolerance (the system continues to function even if the network linking its machines is occasionally partitioned).
In the Big Data setting we no longer have full ACID guarantees (see Advanced SQL). So we have 3 possible scenarios, corresponding to the 3 pairs of properties that we can keep.
For example, suppose the network gets partitioned; then there are two options: the system becomes unavailable until the network is connected again, but all machines keep returning the same data; or the two parts remain available to users but may answer differently (this is usually called eventual consistency, because once the network is reconnected the system returns to a consistent state).
When network partitions happen we need to choose which property we want to keep, so we have three possible cases: CP, AP or CA (the latter only if partitions never actually occur). Services like the Dynamo key-value store (see Cloud Storage#Key-value stores) choose AP and thus offer eventual consistency.
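A toy model of the choice a partitioned system faces (entirely my own sketch; the `Replica` class and the `partitioned` flag are made-up names used only for illustration, and real CP systems are more nuanced than "refuse every request"):

```python
class Replica:
    def __init__(self):
        self.value = "v0"

a, b = Replica(), Replica()
partitioned = True  # the network link between a and b is down

# A write reaches replica a but cannot be propagated to b.
a.value = "v1"

def read_cp(replica):
    # CP choice: refuse to answer rather than risk an inconsistent answer.
    if partitioned:
        raise RuntimeError("unavailable during partition")
    return replica.value

def read_ap(replica):
    # AP choice: always answer, possibly with stale data (eventual consistency).
    return replica.value

print(read_ap(a), read_ap(b))  # 'v1 v0' -> the replicas disagree but stay available
try:
    read_cp(b)
except RuntimeError as e:
    print("CP replica:", e)
```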
Vector Clocks
Sometimes when we have a network partition we lose the linear (total) ordering of updates, so the version history becomes a directed acyclic graph.
Each version carries a vector with one counter per node; a node increments its own entry whenever it performs an update. When diverging versions meet, some node merges them and then updates the vector again.
The merge happens in the following manner: just take, entry by entry, the maximum value among the vectors being merged. This restores consistency, but some data could be lost.
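Here is a minimal sketch of vector-clock bookkeeping in Python (the node names and the dictionary representation are my own assumptions, not the course’s notation):

```python
# A vector clock maps each node id to the number of updates that node has made.
def increment(clock: dict, node: str) -> dict:
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(a: dict, b: dict) -> dict:
    # Entry-wise maximum over all node ids: the merge rule described above.
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def happened_before(a: dict, b: dict) -> bool:
    # a -> b iff every entry of a is <= the corresponding entry of b, and a != b.
    nodes = a.keys() | b.keys()
    return all(a.get(n, 0) <= b.get(n, 0) for n in nodes) and a != b

# Two nodes update concurrently while partitioned...
v1 = increment({}, "node_A")  # {'node_A': 1}
v2 = increment({}, "node_B")  # {'node_B': 1}
print(happened_before(v1, v2), happened_before(v2, v1))  # False False -> concurrent

# ...and once the partition heals, one node merges the two versions and updates again.
merged = increment(merge(v1, v2), "node_A")
print(merged)  # {'node_A': 2, 'node_B': 1}
```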
PACELC Theorem
This is a generalization of the CAP theorem, but I have not understood it yet and it is not currently present in the book or slides.
References
[1] Y. N. Harari, "Nexus: A Brief History of Information Networks from the Stone Age to AI", Random House, 2024.