Data Science is similar to physics: it attempts to create theories of reality based on a formalism that another science provides. For physics that formalism was mathematics; for data science it is computer science. The amount of data has grown extremely fast in recent years and has reached a size that, expressed in metres, would be the distance to Jupiter. The observable universe is on the order of magnitude of 400 Yottametres, and a Yottametre is $10^{24}$ metres, i.e. a 1 followed by $3 \cdot 8 = 24$ zeros. So quite a lot. We don’t know whether data will keep growing this fast, but we certainly need to be prepared for that case.
A Brief History of Information Systems 🟨–
We can pinpoint three main historical developments of information systems that coincide with revolutions in human history, preceded by an earlier oral stage:
- Oral transmission: at first humans stored information only in their brains; stories and culture were transmitted mainly orally, from person to person. This matches the point made by professor Harari in (Harari 2024): such networks of information are what allowed humans to create large-scale alliances.
- Humans invent writing: writing first arose because people needed to record economic transactions, which created the need for durable tablets on which to store this information.
- Humans invent the printing press: Gutenberg’s innovation made storing and duplicating information much cheaper than the historical manual copying and writing. This allowed ideas, such as the Christian religion, to spread even further and have a much greater impact.
- Humans invent silicon-based processors: this innovation enabled far greater storage and processing, as well as ultra-fast communication, having another deep effect on humanity as a whole.
The Ishango bone, roughly 20,000 years old, is one of the first known recording devices. Around 250 BC the Library of Alexandria already held a huge collection of documents. With physical books it was very difficult to extract higher-level trends; now we can analyze hundreds of thousands, even millions, of documents very quickly.
Codd’s Data Independence 🟩
Edgar Codd suggested that a usable database management system should hide all the physical complexity from the user and expose instead a simple, clean model.
See (Codd 1970). The other important contribution of the paper is the suggestion to model data as tables, which gave birth to relational algebra and relational query languages.
As in Architettura e livelli 1, 2, data systems are divided into interoperating layers that communicate with each other through interfaces. This makes it easy to change the underlying layer without the layer above noticing.
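As a minimal sketch of this layering idea (the `Storage` interface and class names below are invented for illustration, not taken from any real system), the upper layer programs only against an interface, so the physical implementation underneath can be swapped without the upper layer noticing:

```python
from abc import ABC, abstractmethod


class Storage(ABC):
    """Physical layer: the upper layer only ever sees this interface."""

    @abstractmethod
    def get(self, key: str) -> str: ...

    @abstractmethod
    def put(self, key: str, value: str) -> None: ...


class InMemoryStorage(Storage):
    """One possible physical implementation: a plain dictionary."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> str:
        return self._data[key]

    def put(self, key: str, value: str) -> None:
        self._data[key] = value


def logical_layer(store: Storage) -> str:
    """Logical layer: works with any Storage, unaware of its internals."""
    store.put("lecture", "Big Data")
    return store.get("lecture")


print(logical_layer(InMemoryStorage()))  # swapping the backend requires no change here
```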
Evaluating an Information system
Velocity 🟩
For velocity we care about capacity, throughput and latency. Capacity is how much you can store, throughput is how fast you can read, and latency is how long you have to wait until the first byte of data arrives. One of the first hard drives, in 1956, had a capacity of 5 MB (in a $1.7\,\mathrm{m} \times 1.5\,\mathrm{m}$ cabinet), a throughput of 12.5 kB/s and a latency of 600 ms. A modern drive in 2024 has a capacity of 26 TB (in a $14.7\,\mathrm{cm} \times 2.6\,\mathrm{cm}$ enclosure), a throughput of 261 MB/s and a latency of 4 ms.
The important things to observe are:
- Capacity has exploded very fast, by a factor of more than a million (over six orders of magnitude)!
- Throughput has also increased, but only by about four orders of magnitude.
- Latency has barely improved, by roughly two orders of magnitude.
The important consequence is that, since capacity grows much faster than throughput, the only way to keep processing all of our data in a reasonable time is to parallelize, so that we can read from many devices at once.
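A quick back-of-the-envelope calculation with the figures above makes the point concrete: the modern drive holds vastly more data, but scanning it end to end through its own interface takes far longer than in 1956, so reading from many drives in parallel is the only way to keep scan times reasonable. A rough sketch:

```python
# Full-drive scan time, then and now (figures taken from the section above).
old_capacity_bytes = 5 * 10**6          # 5 MB (1956 drive)
old_throughput_bps = 12.5 * 10**3       # 12.5 kB/s

new_capacity_bytes = 26 * 10**12        # 26 TB (2024 drive)
new_throughput_bps = 261 * 10**6        # 261 MB/s

old_scan_s = old_capacity_bytes / old_throughput_bps
new_scan_s = new_capacity_bytes / new_throughput_bps

print(f"1956: {old_scan_s:.0f} s (~{old_scan_s / 60:.1f} minutes) to read the whole drive")
print(f"2024: {new_scan_s:.0f} s (~{new_scan_s / 3600:.1f} hours) to read the whole drive")

# Parallelization is the way out: if the same total volume is spread over
# N drives read at once, the scan time drops by a factor of N.
for n_drives in (1, 10, 100):
    print(f"{n_drives:>3} drives in parallel: ~{new_scan_s / n_drives / 3600:.2f} h")
```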
Volume 🟩-
How big does data have to be to count as big data? We first need to learn something about scales and orders of magnitude :D. These should be learned by heart:
- Kilo - 1000
- Mega - 1.000.000
- Giga - 1.000.000.000
- Tera - 1.000.000.000.000
- Peta - 1.000.000.000.000.000
- Exa - 1.000.000.000.000.000.000
- Zetta - 1.000.000.000.000.000.000.000
- Yotta - 1.000.000.000.000.000.000.000.000
- Ronna - 1.000.000.000.000.000.000.000.000.000
- Quetta - 1.000.000.000.000.000.000.000.000.000.000
When we go in the other direction (negative powers of ten) we have:
- Milli
- Micro
- Nano
- Pico
- Femto
- Atto
- Zepto
- Yocto
All by heart! The threshold for big data is currently the petabyte, because a petabyte can no longer be stored on a single computer.
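A rough calculation (assuming the 26 TB / 261 MB/s drive figures from the Velocity section) illustrates why the petabyte is a sensible threshold: one petabyte already needs dozens of drives, and scanning it through a single drive's interface would take well over a month.

```python
import math

# How a single petabyte relates to a single modern drive
# (26 TB capacity, 261 MB/s throughput, as in the Velocity section).
petabyte = 10**15                       # bytes
drive_capacity = 26 * 10**12            # bytes
drive_throughput = 261 * 10**6          # bytes per second

drives_needed = math.ceil(petabyte / drive_capacity)
scan_days = petabyte / drive_throughput / 86_400

print(f"Drives needed to store 1 PB: {drives_needed}")
print(f"Days to scan 1 PB through one drive: ~{scan_days:.0f}")
```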
Variety 🟩
Data can come in different shapes; it’s important for the exam that you learn these shapes by heart (a small illustration follows the list):
- Graphs
- Cubes
- Unstructured (text is an often-cited example of this).
- Trees
- Tables
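As a tiny illustration (the records below are invented), the same kind of information can be given several of these shapes; which shape fits best depends on the data and on the queries we want to run:

```python
# The same bibliographic information given different shapes.

# Table: fixed columns, one row per record.
table = [
    {"id": 1, "title": "A Relational Model of Data", "year": 1970},
    {"id": 2, "title": "MapReduce", "year": 2004},
]

# Tree: nested structure, as JSON or XML would encode it.
tree = {
    "library": {
        "papers": [
            {"title": "A Relational Model of Data", "author": "Codd"},
            {"title": "MapReduce", "author": "Dean"},
        ]
    }
}

# Cube: a measure (here, a paper count) indexed by several dimensions.
cube = {("1970", "CACM"): 1, ("2004", "OSDI"): 1}

# Graph: nodes plus edges, here an authorship relation.
nodes = ["Codd", "Dean", "A Relational Model of Data", "MapReduce"]
edges = [("Codd", "A Relational Model of Data"), ("Dean", "MapReduce")]

# Unstructured: plain text with no schema at all.
text = "In 1970 Codd proposed the relational model; MapReduce followed in 2004."
```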
A definition of Big Data 🟩
Big Data is a portfolio of technologies that were designed to store, manage and analyze data that is too large to fit on a single machine, while accommodating the growing discrepancy between capacity, throughput and latency.
This has some links with the definitions of data, information, knowledge and wisdom, which you can find here.
Usage examples 🟩–
There are some real-life companies and environments where storing many gigabytes of data every day is completely routine, for example:
- CERN produces 50 PB of data every year, and most of this data needs to be analyzed, see Massive Parallel Processing.
- The Sloan Digital Sky Survey (SDSS), which attempts to map every part of the sky, produces 200 GB of data every day and has produced the most detailed 3D map of the sky so far.
- Biological DNA can also be seen as a data storage device.
Read-intensive and Write-intensive systems 🟩
These are called respectively OLAP (Online Analytical Processing) for the read-intensive case and OLTP (Online Transaction Processing) for the write-intensive one. This has been explained in more detail in Data Cubes.
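A minimal sketch of the difference, in plain Python with an invented toy dataset rather than a real database: an OLTP workload performs small, targeted writes, while an OLAP workload reads through all the records to compute aggregates.

```python
# Toy "orders" dataset.
orders = [
    {"id": 1, "customer": "alice", "amount": 30.0},
    {"id": 2, "customer": "bob", "amount": 12.5},
]

# OLTP-style operation: a small, targeted write (one new transaction).
def place_order(order_id: int, customer: str, amount: float) -> None:
    orders.append({"id": order_id, "customer": customer, "amount": amount})

# OLAP-style operation: a read-heavy aggregation over the whole dataset.
def revenue_per_customer() -> dict[str, float]:
    totals: dict[str, float] = {}
    for order in orders:
        totals[order["customer"]] = totals.get(order["customer"], 0.0) + order["amount"]
    return totals

place_order(3, "alice", 7.5)
print(revenue_per_customer())  # {'alice': 37.5, 'bob': 12.5}
```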
Techniques for Big Data
We single out two main techniques used to handle big amounts of data; both arise from the ever-growing amount of data in modern times:
- Parallelization: if we have many processors, we can read many pages at the same time. This is what Massive Parallel Processing leverages.
- Batch Processing: due to the discrepancy between throughput and latency, it’s often better to read a lot of data at once and then process it as a batch. This is also leveraged in systems like MapReduce, introduced in Massive Parallel Processing. A small sketch of both techniques follows below.
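A minimal sketch of both ideas in plain Python, with an invented in-memory dataset standing in for data that a real system such as MapReduce would read from a distributed file system:

```python
from concurrent.futures import ProcessPoolExecutor

# A toy "dataset": in a real system these records would sit on disk,
# spread over many machines.
records = list(range(1_000_000))


def make_batches(data, batch_size):
    """Batch processing: cut the data into large chunks so that we pay
    the per-request latency once per batch, not once per record."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def process_batch(batch):
    """Work done on one batch, e.g. a partial aggregate."""
    return sum(batch)


if __name__ == "__main__":
    batches = list(make_batches(records, batch_size=100_000))
    # Parallelization: several workers process different batches at once.
    with ProcessPoolExecutor(max_workers=4) as pool:
        partial_sums = list(pool.map(process_batch, batches))
    print(sum(partial_sums) == sum(records))  # True: same result, computed in parallel batches
```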
Evolution of the data stack
The data stack has 10 layers, instead of the 7 layers of the ISO/OSI networking stack (Architettura e livelli 1, 2).
We will rebuild the whole data stack and understand how each layer works together with the others to handle big data. For each layer we link some important notes:
- Storage: Cloud Storage, Wide Column Storage, Distributed file systems
- Encoding and Syntax: Markup
- Data models and Validation: Data Models and Validation
- Processing: Massive Parallel Processing
References
[1] Harari, Y. N. “Nexus: A Brief History of Information Networks from the Stone Age to AI”, Random House, 2024.
[2] Codd, E. F. “A Relational Model of Data for Large Shared Data Banks”, Communications of the ACM, Vol. 13(6), pp. 377–387, 1970.