Paradigms of data storage
ETL framework 🟩
This is the classical database approach: We load the data in the database and let the underlying system handle it. This method needs some added cost in extracting, transforming and loading the data that we have stored previously in an optimized format so that it can be used for views, or else.
Data Lakes 🟩
We usually refer to Data Lakes when we store our data with Distributed file systems or using Cloud Storage: cheap ways to dump the data without caring about the possibility of modifying them.
We typically store data in the filesystem, where it is viewed simply as files. This approach works well when we only need to read the data. It’s often referred to as in situ storage because there is no need to extract the data first. However, the drawback arises when we need to modify the data, as it can lead to numerous inconsistencies.
Another significant limitation of filesystems is that they cannot scale to manage billions of files efficiently.
This is what the professors in the first database course says Filesystem: are not the perfect technology to handle this type of load, see Introduction to databases.
Object storage design principles 🟨–
We don’t want the hierarchy that is common in Filesystems, so we need to simplify that and have these four principles:
- Black-box objects
- Flat and global key-value model (trivial model, easy to access, without the need to trasverse a file hieararchy).
- Flexible metadata
- Commodity hardware (the battery idea of Tesla until 2017).
Scaling principles
Usually the best thing is to make things work in a single computer, then it’s cheaper to scale horizontally, and then to scale vertically, adding better hardware.
Current Limits
When A relational table grows too much, a single system could have difficulty in handling it. Common limits nowadays are:
- Millions of rows
- Cloud Storage, Distributed file systems, Massive Parallel Processing, usually handle well data with lots of rows (samples, with the same columns)
- More than 256 columns, we usually use Wide Column Storage.
- With a lot of nested data.
- We use Document Stores, like MongoDB, where the Markup is quite important.
Scaling vertically 🟩
There are two ways of scaling, scaling vertically or horizontally. Scaling Up concerns in building better machines, building better algorithms and being more efficient with what exactly we have.
Scaling horizontally 🟩–
Scaling horizontally is the simple idea of adding more things, it could be more computers, more ram, more disk, more CPUs. But these have some physical limits that we should need to keep track of.
There is a physical limit of number of computers in a data-center (1k to 100k which seems to be the hard limit constrained by energy and cooling requirements). Zürich’s datacenter consumes as much energy as an airport. And we have about 1-200 cores in a single computer of a datacenter.
We also have a limit for RAM and local storage. Respectively about 0.016-24TB of RAM and 1-30TB of storage. Its unthinkable that the RAM memory has the same order of magnitude of local storage. This is because in memory databases are becoming more common (they are usually faster, lower latency).
We also have a bandwidth of 1-200Gbit/s for a single server on an Ethernet cable.
We have standardized rack units for every server (or storage if the module is just for storage), storage and routers, they are usually connected together by switches and similar networking thingies. Usually we have 1-4 rack units for a server
Type | Range |
---|---|
Computers in Data Center | 1k-100k (100k hard limit for electricity designs) |
Cores in a Single computer | 200 |
RAM | 0.016-24TB |
Network Bandwidth | 1-200 Gbit |
Racks per server (module) | 1-4 |
HDD Storage | 26TB |
Throughput | 261MB/s |
Latency | 4 ms |
Analysis of bottlenecks 🟩
We already have said that a way to improve on the possible bottlenecks of our systems is having better code: using our resources in a better manner. The easier way is choosing to buy more resources. Important bottlenecks in our context are CPU, Memory, RAM, and network. We can know if have one of these bottlenecks by monitoring the real-time resource usages.
Disk-IO -> MapReduce and Spark, or using Parquet instead of JSON can
In fact, you should always first try to improve your code before scaling out. The vast majority of data processing use cases fit on a single machine, and you can save a lot of money as well as get a faster system by squeezing the data on a machine (possibly compressed) and writing very efficient code.
Object Stores
Object storage usages 🟩
Object storages are useful to store things that are usually read-intensive. Some examples are
- Static websites
- Images, Videos or video chunks
- Big files that are usually only to read (e.g. datasets).
Service Level Agreements (4) 🟨++
We will talk now about Service Level Agreements which are important to understand the contract part of using cloud services. So, if a company does not satisfy these requirements, one can sue them for breach of contract.
-
Scalability We have 100 buckets per account that could be extended on request.
-
Durability We lose 1 in a $10^{11}$ objects during a year (which is 99.999999999% of durability), which is quite strong (so there is still a possibility that the object is lost). This is useful for the lawyers because they can offer a guarantee. If this is not respected then people can sue them.
-
Availability We have an availability of 99.99% which means maximum of 1h for a year. Usually it’s useful to remember the percentages of availability:
99% 4 days/year
99.9% 9 hours/year
99.99% 53 minutes/year
99.999% 6 minutes/year
99.9999% 32 seconds/year
99.99999% 4 seconds/year
- Response time Legally it’s hard to guarantee average speeds, so what they do is actually count the times where the service is guaranteed to be below a certain threshold! So they guarantee that in $99.9\%$ of the cases they have a response lower than $300 ms$, in other cases its higher. However, as it is usually difficult to satisfy a latency requirements which is often geographically dependent answer, S3 offers guarantees based on system throughput, which is how many reads and writes it can handle without any problem.
Amazon S3
Amazon used to sell books, then they started to rent cloud services because they had too many machines that most of the time they were not using. This was a big win from an economical point of view. They created amazon web services. Now they are selling these abstractions: Amazon S3 stands for simple storage service.
S3 Document Identification 🟩
With S3 every document is identified by a bucket id and a object id, which could be maximum 5 TB (probably physical constraints of the file), it is only possible to upload an object in a single chunk if it is less than 5 GB. The buckets can uniquely identify that in the world. The maximum amount of buckets that a user can have is 100 by default.
We don’t know how S3 works underneath, they have not published it.
Storage Classes 🟨–
The cost of the storage service changes with the frequency of the access. For example
- Amazon Glacier has a very high latency (hours to get the files), but its cost is quite low. This should be used for example for backups! With these applications we don’t care if the answer is in seconds. Hours is fine.
- Standard with infrequent access: we have a cost of retrieving, with less availability
- Standard: it’s just the standard S3 structure.
Accessing a resource 🟩
We use Uniform Resource Identifier to identify the resource we want to access, and the usually send a REST request to modify, delete or get it.
This is an example of a S3 bucket [http://bucket.s3.amazonaws.com/object-name
](http://bucket.s3.amazonaws.com/object-name`) (if you want to access the bucket, just remove the object identifier!).
We can have operations like PUT, GET, DELETE for Buckets and objects.
Usage Examples 🟩
Most common usage of buckets is storing read intensive data, like static websites or dataset shards (which then become useful for systems like MapReduce or Spark). The performance is nice, the professor reports about 300ms for the website to load. Usually you can see a hierarchy on the UI for such systems, but that is just for interface, under the hood, we just have a flat key-value store.
It is common to place a content delivery network (CDN) service on top of the storage bucket of a website in order to accelerate and cache these files at multiple places on the planet.
Azure Blob Storage
In this section we will describe the service provided by Azure cloud on cloud storage.
Azure Document Identification 🟩–
Azure blob storage needs 3 id to identify a document:
- Account
- Container (bucket equivalent), which is sometimes called partition in literature.
- Blob (document id equivalent) 195 GB for an Append Blob to 190.7 TB for a Block Blob. The maximum storage size is different!
Object APIs 🟨
And we can divide the files into blocks which support a higher size.
- Blocks are for data. Maximum size of 50k 4GB blocks, about 190.7 TB.
- Append are for logs because they are optimized to append stuff, of maximum size of 195 GB. Which corresponds to 50k 4MB blocks.
- Page are for in memory virtual machines maximum size of 8 terabytes.
Stamps 🟩–
Azure Blob Storage is organized into the so called storage stamps. These stamps are 10-20 racks with 18 storage nodes each (maximum storage for a stamp is about 30PB of data). Usually kept below 80% data (if not they’ll have a warning), else they could buy more storage or move data elsewhere.
Replication (2) 🟨–
They have two types of replications:
- Intra-stamp: which is synchronous way of replicating (immediately replicated).
- Inter-stamp: asynchronously replicating to other stamps.
Regions (2) 🟩–
As with AWS, they have regions to optimize for latency and be resistant to natural catastrophes. The latency part is intuitive: if a data center is physically closer to you, then it’s more probable that the data will be served faster. Natural catastrophes are mitigated by having the data in different regions, so if one is destroyed, then the data is still safe.
Key-value stores
S3 is far more slower than typical database systems to store and query the data. In the order of hundreds of 100ms against 1-9ms for key-value stores. So two orders of difference! Too high latency for some uses!.
Key value stores (aka associative arrays, we use map data structure) solve this problem and can be adapted to be used as a database system, the cost is that the type of objects we can store is far smaller, in the order of few hundreds of kilobytes.
Characteristics of Key-value stores 🟨
This is designed for performance and scalability, but it gives up consistency for eventual consistency. This is usually a simpler version, but it doesn’t have the same features as more complex data storage systems, as relational database management systems.
Usually these are used for shopping carts for a very large online shop, another is for storing likes and comments on social media
Design principles of Key-value stores (4) 🟥++
-
Incremental stability New nodes can join the system at any time, and nodes can leave the system at any time, sometimes gracefully, sometimes in a sudden crash. But we don’t need to have too many nodes to crash, which is not the scope of our course.
-
Symmetry No node is particular in any way, the nodes are similar to one and another. They run the same code.
-
Decentralization There is no leader or orchestrator in the network. Note that symmetric protocols could elect a leader nonetheless, while in this case we have the strict requirement not to have a leader.
-
Heterogeneity the nodes may have different CPU power, amounts of memory, etc.
Comparison With ObjectStorage
Limitations compared to S3 and Azure Blob Storage 🟩
Yet, this brings some drawbacks: We have the values to be far smaller, about 400kb of data (this is what Dynamo does) and we cannot store metadata It has a simplified APIs that just supports get, put or delete the document, with a given key or value (in reality there is also a context variable). This is why they are often more suitable for real-time data, simplicity and speed.
What are Contexes?
These values are the state of the updates for every node, this is used to merge the vector clocks in case of need. Providing a context to get or put, allows the key-value store to correctly update the context so that later in case of partitions, it could be exactly solved.
Common Usage Patterns
Particular usages are for example shopping carts, likes and comments on social media, and so on.
One of the typical use cases for a key-value store is storing shopping carts for a very large online shop, another is for storing likes and comments on social media.
best seller lists, shopping carts, customer preferences, session management, sales rank, and product catalog, from (DeCandia et al. 2007).
Chord protocol
The amazon paper (DeCandia et al. 2007) describes this system.
This is based on a distributed hashing protocol. We use hashes because they have some properties to be robust against failures of some kind, which I have not understood.
With this protocol, every node has an ID. For example a code in $2^{128}$ (Dynamo indeed uses 128 bytes). If we have a key $k$ this is assigned to a position on the ring. Then from this position we follow the ring clockwise until we find a node, this node should handle the value of this key.
Consistent Hashing
The ring structure of the Chord protocol is called consistent hashing. It has been designed to minimize the amount of transfer of data in distributed systems that join and leave frequently. See Tabelle di hash for a primer of how hashing works.
Other systems use this type of hashing, like the CassandraDB or CDN services. It acts as an automatic load balancing system.
The principle advantage of consistent hashing is that departure or arrival of a node only affects its immediate neighbors and other nodes remain unaffected.
Join, Leave and Crash 🟩
When a node joins, the should take the responsibility of part of the data in front of him in a clockwise fashion. When a node leaves, it should give responsibility of part of the data to the node in front of him. Clearly this is not possible when we have a crash, this is why we need redundancy, which is easily done by having 2-range redundancy, but it can be set to any value.
One thing that should be noted with the updates is how it handles the replication: only single node modifies, but after it propagates the update to the replicas and receives acks. Then you use vector clocks to choose the highest update number. A nice parallel with vector clocks is the parallel between these and the Newtonian physics vs Structure of spacetime and different timelines.
Reads on Dinamo 🟨–
A client periodically picks a random Dynamo node and downloads its current view of Dynamo membership state. Using this information the client can determine which set of nodes form the preference list for any given key.
A preference list is just a map of each object and the node that is currently storing that object. With the classical Chord protocol the so called finger tables where used where you could employ binary search to find the node that is responsible for the key. With finger tables, each node knows what the following nodes to the power of two have, so that the search could be employed.
Every time a node wants to look up a key $k$, it will pass the query to the closest successor or predecessor (depending on the finger table) of $k$ in its finger table (the “largest” one on the circle whose ID is smaller than $k$), until a node finds out the key is stored in its immediate successor.
Probably this approach has been dismissed because it is a little bit more burgeoning to update the finger table on joins or leaves, and preference lists are a better practical solution.
This is called the client driven approach, and from their tests, it seems twice as fast as the server driven approach.
Preference lists 🟩
This solves the problem of finding the actual node that has the data we are querying for. It is just a table with a key -> and nodes that have it. We have distributed algorithms that make the nodes agree on this preference list which is just knowing what machine is responsible for what interval range. For each key range, the first node on the list is called the coordinator node, it this fails, the random node that has gotten request attempt to contact every node until one answers. $R$ is the minimum number of nodes from which the value needs to be retrieved for it to be considered consistent. $W$ is the minimum number of nodes for which an ack should be received to consider it towards a successful read. We have the theoretical necessity that $R + W > N$ to set up a quorum system, whose acknowledgement is based on the majority of notes that have accepted a certain modification. Sometimes the system is also configured to values less than $N$ for a better latency, which is configurable by the developer.
Tuning the values for $R$ and $W$ scale the performance stability of reads and writes. For example read-intensive applications, would prefer to have $R=1$ and $W=N$ so that it is fast to read, but could potentially read some inconsistent data. The most common configuration is $N=3, R=2, W=2$ for Dynamo.
Advantages and disadvantages 🟨+
Pros:
- Highly scalable
- Robust to failure (because we replicate data with the ones in front of us).
- Self-organizing. I don’t know if the above properties are exclusive, but they are nice
Cons:
- Only lookup, no search (obvious)
- No data-integrity (we don’t have a way to check for integrity constraints)
- Security issues (Need to understand this)
Virtual Nodes 🟩
Two problems arise with the Chord protocol: there could be large gaps in the Circle and the underlying machines could have different hardware proposals. The idea is to have some nodes take other nodes in a way proportional to its hardware. In this manner, if a node has more resources, it has more nodes, which implies it is expected to handle more data (which is fine because it has more resources).
CAP Theorem
Statement of CAP 🟩
We can only have two of the following properties:
- Consistency (doesn’t depend on the machine that answers to your request).
- Availability (it should answer something).
- Partition tolerance (the system continues to function even if the network linking its machines is occasionally partitioned.)
We don’t have ACID anymore (see Advanced SQL) in the case of Big Data. So now we have 3 possible scenarios, which correspond to the 3 couples that is possible to have with these properties.
For example, let’s say we have a partition of the network then we have two cases: not available until the network is connected again, but we still have the same data. Or we have two parts that answer differently (this is usually called eventual consistency, because after the network is connected then it will return to the consistent state), but are still available to the users.
When network partitions happen we need to choose what property we want to keep, so we have three possible cases: CP, CA or AP. Services like Dynamo Key value store (see Cloud Storage#Key-value stores) choose AP and thus have eventual consistency.
Vector Clocks 🟨–
Sometimes when we have a network partition we lose the linear timing of the system and so we have directed acyclic graphs
We have vectors for updates by different nodes. The merge is done by a certain node and updates the vector again.
The merge happens in the following manner: just choose the maximum value for each resource that has been modified. This grants consistency, but could lose some data.
PACELC Theorem 🟩–
PACELC is a generalization of the CAP theorem. The acronym PACELC stands for:
- P: Partition tolerance (as in the CAP theorem).
- A: Availability.
- C: Consistency.
- E: Else.
- L: Latency.
- C: Consistency.
PACELC integrates the latency in normal running cases. We can tradeoff between being low latency but accepting some inconsistencies, or being consistent with a little higher latency.
References
[1] DeCandia et al. “Dynamo: Amazon’s Highly Available Key-Value Store” ACM SIGOPS Operating Systems Review Vol. 41(6), pp. 205–220 2007