Data Science is similar to physics: it attempts to create theories of reality based on a formalism that another science provides. For physics that formalism was mathematics; for data science it is computer science. Data has grown extremely fast in recent years and has reached an amount that, expressed in metres, would correspond to the distance to Jupiter. For scale, the observable universe is on the order of 400 yottametres, where a yottametre is $10^{24}$ metres, i.e. a 1 followed by $3 \cdot 8 = 24$ zeros. So quite a lot. We don’t know whether data will keep growing this fast, but we certainly need to be prepared for that case.

Early versions

A Short History of Information Systems 🟨++

We can pinpoint three main historical developments of information systems that coincide with revolutions in human history (plus a stage zero).

  0. At first, humans simply stored information in their brains: stories and culture were mainly transmitted orally from person to person. This matches the point made by professor Harari in (Harari 2024): it allowed humans to create networks of information that sustained large-scale alliances.

  1. Humans invent writing: people first needed storage for economic transactions, which created the need for durable tablets on which to record this information.
  2. Humans invent the printing press: Gutenberg’s innovation made storing and duplicating information much cheaper compared to the historical manual copying and writing. This empowered ideas such as the Christian religion to spread even further and have a much larger impact.
  3. Invention of silicon-based processors. This innovation enabled far greater storage, more processing, and ultra-fast communication, having another deep effect on humanity as a whole.

The Ishango bone, roughly 20,000 years old, is one of the first known artifacts used to record data. Around 250 BC there was the Library of Alexandria. With physical books it was very difficult to extract higher-level trends; now we can analyze hundreds of thousands, even millions, of documents very quickly.

Codd’s Data Independence 🟩

Edgar Codd suggested that a usable database management system should hide all the physical complexity from the user and expose instead a simple, clean model.

See (Codd 1970). The other important contribution of the paper is the suggestion to use tables, which gave birth to relational languages and relational algebra.

As in Architettura e livelli 1, 2, data systems are divided into inter-operating levels that communicate with each other through interfaces. This makes it easy to update the underlying level without the upper one noticing.

Database management systems

A database management system can be viewed as a four-layer stack (a small sketch follows the list):

  • A logical query language with which the user can query data;
  • A logical model for the data;
  • A physical compute layer that processes the query on an instance of the model;
  • A physical storage layer where the data is physically stored.
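As a small illustration of this separation, here is a sketch using Python’s built-in sqlite3 module (the table and file names are made up): the user only ever touches the logical layers (SQL and tables); how SQLite lays the rows out on disk inside the file is invisible to them, which is exactly Codd’s data independence.

```python
import sqlite3

# Logical layers: we declare a table and query it in SQL,
# without ever specifying how bytes are laid out on disk.
conn = sqlite3.connect("example.db")  # physical storage: a single file managed by SQLite
conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("Ada",))
conn.commit()

# The physical compute layer (SQLite's engine) decides how to execute this query;
# we only state *what* we want, not *how* to retrieve it.
for row in conn.execute("SELECT id, name FROM users"):
    print(row)

conn.close()
```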

Evaluating an Information system

Velocity 🟩

For velocity we care about capacity, throughput and latency. Capacity is how much you can store, throughput is how fast you can read, and latency is how long you have to wait until the first byte of data arrives. One of the first devices, in 1956, had a capacity of 5 MB $(1.7m \times 1.5m)$, a throughput of 12.5 kB/s and a latency of 600 ms. In 2024 a hard drive has a capacity of 26 TB $(14.7cm \times 2.6cm)$, a throughput of 261 MB/s and a latency of 4 ms.

The important thing to observe is that

  • Capacity has exploded: it has increased by a factor of more than a million (over six orders of magnitude)!
  • Throughput has also increased, but only by about 4 orders of magnitude.
  • Latency has not advanced much (only about two orders of magnitude).

The important consideration is that if we want to process these growing amounts of data, it is much more important to parallelize, so that we can read from many devices at the same time.
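A quick back-of-the-envelope calculation, using the 2024 drive figures quoted above, shows why (a sketch in Python; the drive counts are purely illustrative):

```python
# Scanning a full 26 TB drive at 261 MB/s, sequentially vs. in parallel.
capacity_bytes = 26e12          # 26 TB
throughput_bps = 261e6          # 261 MB/s

sequential_hours = capacity_bytes / throughput_bps / 3600
print(f"one drive:   {sequential_hours:.1f} hours")   # roughly 28 hours

for n_drives in (10, 100, 1000):
    # Idealized: data split evenly across n drives read in parallel.
    print(f"{n_drives:4d} drives: {sequential_hours / n_drives:.2f} hours")
```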

Image from the Big Data introduction course, Ghislain Fourny

Volume 🟩-

How big does data have to be to be considered big data? First we need to learn something about scales :D and orders of magnitude! These should be learned by heart:

  • Kilo - 1.000 ($10^{3}$)
  • Mega - 1.000.000 ($10^{6}$)
  • Giga - 1.000.000.000 ($10^{9}$)
  • Tera - 1.000.000.000.000 ($10^{12}$)
  • Peta - 1.000.000.000.000.000 ($10^{15}$)
  • Exa - 1.000.000.000.000.000.000 ($10^{18}$)
  • Zetta - 1.000.000.000.000.000.000.000 ($10^{21}$)
  • Yotta - 1.000.000.000.000.000.000.000.000 ($10^{24}$)
  • Ronna - 1.000.000.000.000.000.000.000.000.000 ($10^{27}$)
  • Quetta - 1.000.000.000.000.000.000.000.000.000.000 ($10^{30}$)

Going the other way (submultiples), we have:

  • Milli - $10^{-3}$
  • Micro - $10^{-6}$
  • Nano - $10^{-9}$
  • Pico - $10^{-12}$
  • Femto - $10^{-15}$
  • Atto - $10^{-18}$
  • Zepto - $10^{-21}$
  • Yocto - $10^{-24}$

All by heart! The threshold for big data is currently around the petabyte, because that amount can no longer be stored on a single computer.
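As a small aid (a sketch, not from the course), here is a helper that formats byte counts using the SI prefixes listed above:

```python
# Format a byte count with the SI prefixes above (powers of 1000).
PREFIXES = ["", "kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta", "ronna", "quetta"]

def si_format(n_bytes: float) -> str:
    for prefix in PREFIXES:
        if n_bytes < 1000:
            return f"{n_bytes:.1f} {prefix}bytes"
        n_bytes /= 1000
    return f"{n_bytes:.1f} quettabytes"

print(si_format(26e12))   # 26.0 terabytes  (one 2024 hard drive)
print(si_format(50e15))   # 50.0 petabytes  (CERN's yearly output, see below)
```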

Variety 🟩

Data can come in different shapes; it is important for the exam that you learn these shapes by heart (a small sketch of some of them follows the list):

  • Graphs
  • Cubes
  • Unstructured (text is an often-cited example of this).
  • Trees
  • Tables
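Here is a tiny sketch of three of these shapes in plain Python, with toy data, just to make them concrete:

```python
# Table: fixed columns, one dict per row.
table = [
    {"id": 1, "name": "Ada",   "city": "Zurich"},
    {"id": 2, "name": "Grace", "city": "Geneva"},
]

# Tree: nested data of arbitrary depth (what document stores handle well).
tree = {
    "name": "Ada",
    "address": {"city": "Zurich", "coordinates": {"lat": 47.37, "lon": 8.54}},
    "orders": [{"item": "disk", "qty": 2}, {"item": "rack", "qty": 1}],
}

# Graph: nodes plus edges, here as an adjacency list.
graph = {
    "Ada":   ["Grace"],
    "Grace": ["Ada", "Edgar"],
    "Edgar": [],
}
```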

A definition of Big Data 🟩

Big Data is a portfolio of technologies that were designed to store, manage and analyze data that is too large to fit on a single machine, while accommodating the growing discrepancy between capacity, throughput and latency.

This has some links with the definition of data, information, knowledge and wisdom, that you can find here.

Usage examples 🟩–

There are real-life companies and environments where storing many gigabytes of data every day is completely routine, for example:

  • CERN produces 50 PB of data every year, and most of this data needs to be analyzed; see Massive Parallel Processing.
  • The Sloan Digital Sky Survey (SDSS), which attempts to map every part of the sky, produces 200 GB of data every day and has built the most detailed 3D map of the sky. Also, biological DNA can be seen as a data storage device.

Reading and Writing intensive systems 🟩

Read-intensive systems are called OLAP (Online Analytical Processing) and write-intensive systems OLTP (Online Transaction Processing). This is explained in more detail in Data Cubes.

Techniques for Big Data

We single out two main techniques used to handle big amounts of data; both arise from the ever-growing amount of data in modern times (a small sketch of both follows the list):

  1. Parallelization: if we have many processors, it is easy to read many pages at the same time. This is what is leveraged in Massive Parallel Processing.
  2. Batch Processing: due to the discrepancy between throughput and latency, it’s often better to read a lot of data at the same time, and then process it in a batch. This is also leveraged in systems like MapReduce introduced in Massive Parallel Processing.
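Here is a minimal sketch of both ideas, using Python’s standard multiprocessing module on made-up “pages” of data (this is not the MapReduce API, just the underlying intuition):

```python
from multiprocessing import Pool

def process_batch(batch):
    # Batch processing: handle a whole chunk of records in one call
    # instead of paying the per-record overhead every single time.
    return sum(len(record) for record in batch)

if __name__ == "__main__":
    # Pretend these are pages/blocks read from storage.
    pages = [["record"] * 1000 for _ in range(64)]

    # Parallelization: several workers process different pages at the same time.
    with Pool(processes=4) as pool:
        results = pool.map(process_batch, pages)

    print("total characters processed:", sum(results))
```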

Paradigms of data storage

ETL framework 🟩

This is the classical database approach: we load the data into the database and let the underlying system handle it. This method comes with an added cost: Extracting, Transforming and Loading the data into an optimized internal format, so that it can later be used for views and other queries.
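A toy end-to-end sketch of the three ETL steps, using only the standard library (the file names and schema are made up):

```python
import csv
import sqlite3

# Extract: read the raw data from its source (here a hypothetical CSV file).
with open("sales.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Transform: clean and convert it into the shape the database expects.
rows = [(r["product"], float(r["price"]), int(r["quantity"])) for r in raw_rows]

# Load: store it in the database, which from now on manages the physical layout.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (product TEXT, price REAL, quantity INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
conn.commit()
conn.close()
```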

Data Lakes 🟩

We usually speak of Data Lakes when we store our data in Distributed file systems or in Cloud Storage: cheap ways to dump the data without worrying about modifying it later.

We typically store data in the filesystem, where it is viewed simply as files. This approach works well when we only need to read the data. It’s often referred to as in situ storage because there is no need to extract the data first. However, the drawback arises when we need to modify the data, as it can lead to numerous inconsistencies.
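A sketch of what querying a data lake in situ looks like (the directory layout and record format are hypothetical): we read the raw files where they are, with no loading step, but nothing protects us if another process modifies them in the meantime.

```python
import glob
import json

# Read every JSON file dumped under a "data lake" directory, as-is, in place.
total = 0
for path in glob.glob("datalake/events/*.json"):
    with open(path) as f:
        for line in f:                 # one JSON record per line (a common dump format)
            event = json.loads(line)
            if event.get("type") == "click":
                total += 1

print("clicks:", total)
```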

Another significant limitation of filesystems is that they cannot scale to manage billions of files efficiently.

This is what the professors said in the first databases course: filesystems are not the perfect technology to handle this type of load; see Introduction to databases.

Scaling principles

Usually the best thing is to first make things work on a single computer; beyond that, it is cheaper to scale horizontally (adding more machines) than to scale vertically (adding better hardware).

Current Limits

When a relational table grows too large, a single system can have difficulty handling it. Common limits nowadays are:

  1. Millions of rows.
    1. Cloud Storage, Distributed file systems and Massive Parallel Processing usually handle data with many rows (samples sharing the same columns) well.
  2. More than 256 columns: here we usually use Wide Column Storage.
  3. A lot of nested data.
    1. Here we use Document Stores, like MongoDB, where the markup is quite important.

There are also problems for file-systems:

  1. Difficulty collaborating on the same file within the same Filesystem.
  2. They don’t scale beyond billions of files: standard filesystems are not designed to handle that many.

Scaling vertically 🟩

There are two ways of scaling: vertically or horizontally. Scaling up (vertical scaling) consists of building better machines, designing better algorithms and being more efficient with what we already have.

Scaling horizontally 🟩–

Scaling horizontally (scaling out) is the simple idea of adding more units: more computers, more RAM, more disks, more CPUs. But these have physical limits that we need to keep track of.

There is a physical limit on the number of computers in a data center (1k to 100k, where 100k seems to be the hard limit imposed by energy and cooling requirements). Zürich’s datacenter consumes as much energy as an airport. And a single computer in a datacenter has about 1-200 cores.

We also have limits for RAM and local storage: respectively about 0.016-24 TB of RAM and 1-30 TB of storage. It is remarkable that RAM can now reach the same order of magnitude as local storage; this is why in-memory databases are becoming more common (they are usually faster, with lower latency).

We also have a bandwidth of 1-200Gbit/s for a single server on an Ethernet cable.

Servers (or storage modules, if the module is just for storage), storage and routers come in standardized rack units and are usually connected together by switches and similar networking equipment. A server usually takes up 1-4 rack units.

Type | Range
Computers in a data center | 1k-100k (100k is a hard limit for electricity/cooling designs)
Cores in a single computer | 1-200
RAM | 0.016-24 TB
Network bandwidth | 1-200 Gbit/s
Rack units per server (module) | 1-4
HDD storage | 26 TB
HDD throughput | 261 MB/s
HDD latency | 4 ms
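Putting some of these numbers together, here is a rough back-of-the-envelope estimate (the drives-per-server figure is a made-up assumption):

```python
# How many 26 TB drives, and roughly how many servers, does one petabyte need?
petabyte = 1e15
drive_capacity = 26e12            # one 2024 HDD
drives_needed = petabyte / drive_capacity
print(f"drives for 1 PB: {drives_needed:.0f}")        # ~38 drives

# Assuming (hypothetically) about 10 such drives per storage server:
drives_per_server = 10
print(f"servers for 1 PB: {drives_needed / drives_per_server:.0f}")  # ~4 servers
```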

Analysis of bottlenecks 🟩

We have already said that one way to remove bottlenecks from our systems is better code: using our resources more effectively. The easier (but more expensive) way is to buy more resources. The important bottlenecks in our context are CPU, memory (RAM), disk I/O and network. We can tell which of these bottlenecks we have by monitoring real-time resource usage, as in the sketch below.
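A minimal monitoring sketch, assuming the third-party psutil package is installed (in practice one would use proper cluster monitoring tools):

```python
import psutil

# Sample the four usual suspects: CPU, memory, disk I/O, and network.
print("CPU usage:      ", psutil.cpu_percent(interval=1), "%")
print("Memory usage:   ", psutil.virtual_memory().percent, "%")

disk = psutil.disk_io_counters()
print("Disk read/write:", disk.read_bytes, "/", disk.write_bytes, "bytes since boot")

net = psutil.net_io_counters()
print("Net sent/recv:  ", net.bytes_sent, "/", net.bytes_recv, "bytes since boot")
```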

Disk I/O bottlenecks can be addressed with frameworks such as MapReduce and Spark, or by using a format like Parquet instead of JSON.

In fact, you should always first try to improve your code before scaling out. The vast majority of data processing use cases fit on a single machine, and you can save a lot of money as well as get a faster system by squeezing the data on a machine (possibly compressed) and writing very efficient code.

For example, an easy way to address a memory bottleneck is to look for classes that are instantiated very many times but whose fields are rarely used or hold repeated information. Scaling up should be the last resort!

For CPU, we should pay attention to places with deep class hierarchies, excessive type checking, useless loops, overridden methods (they have to dynamically retrieve the function!) and too many method calls that are not easily inlinable.

For disk I/O, we use better formats, compression, and parallel processing frameworks.

For the network, try to get rid of data that does not need to be transmitted, filtering and projecting as early as possible (this is called push-down), and put the data in a format that compresses well. Also use batch processing.
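A small illustration of the Parquet and push-down points, assuming pandas with a Parquet engine such as pyarrow is installed and a hypothetical file events.parquet: because Parquet is columnar, asking only for the columns we need means less data is read from disk and less data has to cross the network.

```python
import pandas as pd

# Projection push-down: only the two columns we actually need are read
# from the columnar file, instead of the whole dataset.
df = pd.read_parquet("events.parquet", columns=["user_id", "timestamp"])

# Filter as early as possible too, so less data flows through the rest of the pipeline.
recent = df[df["timestamp"] >= "2024-01-01"]
print(len(recent), "recent events")
```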

Evolution of the data stack

The big data stack has 10 layers, instead of the 7 ISO OSI layers of networking (see Architettura e livelli 1, 2). Introduction to Big Data-20240924145135862

We will rebuild the whole data stack and understand how the layers work together to handle big data.

For each part we will link the relevant notes.

References

[1] E. F. Codd, “A Relational Model of Data for Large Shared Data Banks”, Communications of the ACM, Vol. 13(6), pp. 377–387, 1970.

[2] Y. N. Harari, “Nexus: A Brief History of Information Networks from the Stone Age to AI”, Random House, 2024.