Paradigms of data storage
ETL framework 🟩
This is the classical database approach: we load the data into the database and let the underlying system handle it. We will not talk much about this, but this is the general concept!
Data Lakes 🟩
We typically store data in the filesystem, where it is viewed simply as files. This approach works well when we only need to read the data. It’s often referred to as in situ storage because there is no need to extract the data first. However, the drawback arises when we need to modify the data, as it can lead to numerous inconsistencies.
Another significant limitation of filesystems is that they cannot scale to manage billions of files efficiently.
This is what professors say in a first database course: filesystems are not the right technology to handle this type of workload (see Introduction to Databases).
Object storage design principles 🟨
We don’t want the hierarchy that is common in filesystems, so we simplify it and follow these four principles (a toy sketch follows the list):
- Black-box objects
- Flat and global key-value model (trivial model, easy to access)
- Flexible metadata
- Commodity hardware (the battery idea of Tesla until 2017).
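As a toy illustration of these principles (not how any real object store is implemented), the flat key-value model with flexible metadata can be sketched as one global map from keys to opaque byte blobs plus a free-form metadata dictionary; all names here are made up:

```python
# Toy sketch of the flat key-value model: one global namespace,
# opaque (black-box) byte values, and free-form metadata per object.
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes                                   # black-box: the store never interprets it
    metadata: dict = field(default_factory=dict)  # flexible, schema-less metadata


class ToyObjectStore:
    def __init__(self):
        self._objects = {}  # flat key space: no directories, no hierarchy

    def put(self, key: str, data: bytes, **metadata):
        self._objects[key] = StoredObject(data, metadata)

    def get(self, key: str) -> StoredObject:
        return self._objects[key]


store = ToyObjectStore()
# The "/" below is just a character in the key, not a directory.
store.put("videos/cat.mp4", b"...", content_type="video/mp4")
print(store.get("videos/cat.mp4").metadata)  # {'content_type': 'video/mp4'}
```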
Scaling principles
Usually the best approach is to first make things work on a single computer; after that, it is cheaper to scale horizontally, and only then to scale vertically by adding better hardware.
Scaling vertically 🟩
There are two ways of scaling: vertically or horizontally. Scaling up (vertically) means building better machines, building better algorithms, and being more efficient with exactly what we have.
Scaling horizontally 🟩
Scaling horizontally is the simple idea of adding more of everything: more computers, more RAM, more disks, more CPUs. But these have physical limits that we need to keep track of.
There is a physical limit on the number of computers in a data center: roughly 1,000 to 100,000, where the upper bound seems to be a hard limit imposed by energy and cooling requirements. Zürich’s data center consumes as much energy as an airport. A single data-center computer has about 1-200 cores.
We also have limits for RAM and local storage: respectively about 0.016-24 TB of RAM and 1-30 TB of storage. It is remarkable that RAM has reached the same order of magnitude as local storage; this is related to in-memory databases becoming more common (they are usually faster, with lower latency).
We also have a bandwidth of 1-200 Gbit/s for a single server on an Ethernet cable.
Servers, storage and routers come in standardized rack units and are usually connected together by switches and similar networking equipment. A server usually takes 1-4 rack units.
Analysis of bottlenecks
We have already said that one way to relieve bottlenecks in our systems is to write better code, i.e. to use our resources more efficiently. The easier way is to buy more resources. Important bottlenecks in our context are CPU, memory (RAM), disk I/O, and network. We can tell whether we are hitting one of these bottlenecks by monitoring real-time resource usage.
For disk I/O, for example, frameworks such as MapReduce and Spark, or using Parquet instead of JSON, can help.
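As a rough illustration of the disk-I/O point, we can compare the on-disk size of the same table written as JSON and as Parquet (this assumes pandas with pyarrow installed; file names and numbers are illustrative and vary with the data):

```python
import os

import pandas as pd  # assumes pandas + pyarrow are installed

# A small synthetic table; real datasets show the gap even more clearly.
df = pd.DataFrame({
    "user_id": range(100_000),
    "clicks": [i % 7 for i in range(100_000)],
})

df.to_json("events.json", orient="records", lines=True)  # row-oriented text
df.to_parquet("events.parquet")                          # columnar, compressed binary

print("JSON bytes:   ", os.path.getsize("events.json"))
print("Parquet bytes:", os.path.getsize("events.parquet"))
```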
Object Stores
Object storage usages 🟩
Object stores are useful for storing data that is mostly read-intensive. Some examples are:
- Static websites
- Images, Videos or video chunks
- Big files that are usually only read (e.g. datasets).
Service Level Agreements (4) 🟨+
We now discuss Service Level Agreements (SLAs), which are important for understanding the contractual side of using cloud services.
- Scalability: There are 100 buckets per account, which can be extended on request.
- Durability: At most 1 in $10^{11}$ objects is lost per year, which is quite strong (but there is still a small possibility that an object is lost). This is useful for the lawyers, because it is a guarantee they can offer in a contract; if it is not respected, customers can sue.
- Availability: The availability is 99.99%, which means at most about one hour of downtime per year. It is useful to remember the downtime corresponding to each availability percentage (a quick re-derivation follows this list):
  - 99%: 4 days/year
  - 99.9%: 9 hours/year
  - 99.99%: 53 minutes/year
  - 99.999%: 6 minutes/year
  - 99.9999%: 32 seconds/year
  - 99.99999%: 4 seconds/year
- Response time: Legally it is hard to guarantee average speeds, so the guarantee is that in $99.9\%$ of cases the response time is below $300\,\mathrm{ms}$; in the remaining cases it is higher.
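A quick way to re-derive the downtime figures above (the values in the list are rounded):

```python
# Allowed downtime per year for a given availability percentage.
SECONDS_PER_YEAR = 365 * 24 * 3600

for availability in [99, 99.9, 99.99, 99.999, 99.9999, 99.99999]:
    downtime = (1 - availability / 100) * SECONDS_PER_YEAR
    if downtime >= 86_400:
        print(f"{availability}% -> {downtime / 86_400:.1f} days/year")
    elif downtime >= 3_600:
        print(f"{availability}% -> {downtime / 3_600:.1f} hours/year")
    elif downtime >= 60:
        print(f"{availability}% -> {downtime / 60:.1f} minutes/year")
    else:
        print(f"{availability}% -> {downtime:.1f} seconds/year")
```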
Amazon S3
Amazon used to sell books; then they started renting out cloud services because they had many machines that were idle most of the time. This was a big win from an economic point of view, and so Amazon Web Services was created. Now they sell these abstractions: Amazon S3 stands for Simple Storage Service.
S3 Document Identification 🟩
With S3, every object is identified by a bucket ID and an object ID. An object can be at most 5 TB (probably a physical constraint), and it can only be uploaded in a single chunk if it is smaller than 5 GB. Bucket names uniquely identify the bucket in the whole world.
We don’t know how S3 works underneath; Amazon has not published it.
Storage Classes 🟩-
The cost of the storage service changes with the frequency of access. For example, Amazon Glacier has very high latency (hours to retrieve the files), but its cost is quite low. It should be used, for example, for backups: for such applications we don’t care whether the answer arrives in seconds; hours is fine.
Accessing a resource 🟩
We use a Uniform Resource Identifier (URI) to identify the resource we want to access, and then usually send a REST request to modify, delete, or get it.
This is an example of an S3 object URI: [http://bucket.s3.amazonaws.com/object-name](http://bucket.s3.amazonaws.com/object-name) (if you want to access the bucket itself, just remove the object identifier!).
We can have operations like PUT, GET, DELETE for Buckets and objects.
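A minimal sketch of these operations with the boto3 SDK (the bucket name and key are placeholders, and credentials are assumed to be configured in the environment):

```python
import boto3

s3 = boto3.client("s3")  # credentials and region are taken from the environment

bucket = "my-example-bucket"  # placeholder: bucket names are globally unique
key = "object-name"

# PUT: upload an object (a single-chunk upload, so the body must be < 5 GB)
s3.put_object(Bucket=bucket, Key=key, Body=b"hello object storage")

# GET: read it back
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
print(body)

# DELETE: remove the object
s3.delete_object(Bucket=bucket, Key=key)
```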
Azure Blob Storage
In this section we describe the cloud storage service provided by Azure.
Azure Document Identification 🟩
Azure Blob Storage needs three IDs to identify a document:
- Account
- Container (bucket equivalent)
- Blob (document ID equivalent). The maximum size differs by blob type: from 195 GB for an Append Blob up to 190.7 TB for a Block Blob.
Object APIs 🟩-
We can also divide files into blocks, which allows for larger sizes. There are three blob types (a short SDK sketch follows the list):
- Block blobs are for data.
- Append blobs are for logs, because they are optimized for appending.
- Page blobs are for random-access workloads such as virtual machine disks.
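A sketch of block vs. append blobs using the azure-storage-blob (v12) Python SDK; the connection string, container and blob names are placeholders:

```python
from azure.storage.blob import BlobServiceClient  # assumes azure-storage-blob v12

service = BlobServiceClient.from_connection_string("<connection-string>")  # placeholder
container = service.get_container_client("my-container")                   # placeholder

# Block blob: ordinary data, uploaded as a whole (or as blocks for large files).
container.get_blob_client("report.csv").upload_blob(b"a,b\n1,2\n", overwrite=True)

# Append blob: optimized for appending, e.g. log files.
log = container.get_blob_client("app.log")
log.create_append_blob()
log.append_block(b"first log line\n")
log.append_block(b"second log line\n")
```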
Stamps 🟩-
Azure has stamps, which are clusters of 10-20 racks with 18 storage nodes each (around 30 PB of data). A stamp is usually kept below 80% utilization (otherwise a warning is raised); above that, more storage is bought or data is moved elsewhere.
Replication (2) 🟨-
The data is replicated within a stamp (intra-stamp replication) and then asynchronously replicated to other stamps.
Regions (2) 🟩-
As with AWS, they have regions to optimize for latency and be resistant to natural catastrophes.
Key-value stores
S3 is far slower than typical database systems at storing and querying data: around 100 ms against 1-9 ms, roughly two orders of magnitude of difference, so the latency is too high. Key-value stores (a.k.a. associative arrays; we use a map data structure) solve this problem and can be adapted to be used as a database system.
Limitations compared to S3 and Azure Blob Storage 🟩
Yet, this brings some drawbacks: the values have to be far smaller, about 400 KB of data (this is the limit Dynamo uses), and we cannot store metadata. The API is basic: put and get a document for a given key and value (in reality there is also a context parameter). This is why key-value stores are often more suitable for real-time data.
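A minimal sketch of this kind of API (the 400 KB limit and the context parameter mirror the Dynamo description; the class itself is purely illustrative):

```python
MAX_VALUE_SIZE = 400 * 1024  # Dynamo-style cap on value size (~400 KB)


class ToyKeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key: str, context, value: bytes):
        # 'context' carries versioning information (vector clocks in Dynamo);
        # this toy version accepts it but ignores it.
        if len(value) > MAX_VALUE_SIZE:
            raise ValueError("value too large for this key-value store")
        self._data[key] = value

    def get(self, key: str):
        # Returns the value together with a (dummy) context.
        return self._data.get(key), {"version": None}


kv = ToyKeyValueStore()
kv.put("cart:alice", None, b'["book", "laptop"]')
print(kv.get("cart:alice"))
```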
Characteristics of Key-value stores 🟨
Key-value stores are designed for performance and scalability, but they give up strong consistency in favor of eventual consistency. They are usually a simpler kind of system and do not have the same features as more complex data storage systems, such as relational database management systems.
Typical uses are shopping carts for a very large online shop, or storing likes and comments on social media.
Design principles of Key-value stores (4) 🟨
- Incremental scalability: new nodes can join the system at any time, and nodes can leave at any time, sometimes gracefully and sometimes in a sudden crash. We assume that not too many nodes crash at once; handling that case is beyond the scope of the course.
- Symmetry: no node is special in any way; the nodes are all similar to one another and run the same code.
- Decentralization: there is no leader or orchestrator in the network. Note that a symmetric protocol could still elect a leader; here we have the strict requirement of not having one.
- Heterogeneity: the nodes may have different CPU power, amounts of memory, etc.
Chord protocol
The Amazon Dynamo paper (DeCandia et al. 2007) describes this system.
This is based on a distributed hashing protocol (consistent hashing). We use hashes because they spread keys roughly uniformly over the identifier space, which makes the assignment robust to nodes joining, leaving, or failing: only the keys in the affected neighbourhood of the ring need to move.
With this protocol, every node has an ID, for example an integer in $[0, 2^{128})$. A key $k$ is hashed to a position on the ring; from this position we follow the ring clockwise until we find a node, and that node is responsible for the key.
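A minimal sketch of the ring lookup, using MD5 to get a 128-bit position (node names are placeholders, and real systems differ in many details):

```python
import hashlib
from bisect import bisect_right


def ring_position(name: str) -> int:
    # 128-bit position on the ring, matching the 2^128 identifier space above.
    return int(hashlib.md5(name.encode()).hexdigest(), 16)


nodes = ["node-A", "node-B", "node-C"]               # placeholder node names
ring = sorted((ring_position(n), n) for n in nodes)  # nodes placed on the ring
positions = [p for p, _ in ring]


def responsible_node(key: str) -> str:
    # Walk clockwise from the key's position to the first node (wrapping around).
    i = bisect_right(positions, ring_position(key)) % len(ring)
    return ring[i][1]


print(responsible_node("cart:alice"))
```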
Join, Leave and Crash 🟩
When a node joins, it takes over responsibility for part of the data from the node in front of it (clockwise). When a node leaves, it hands responsibility for its part of the data to the node in front of it. Clearly this hand-over is not possible when a node crashes, which is why we need redundancy; this is easily achieved with 2-range redundancy, where each node also stores the range of its predecessor.
One thing to note about updates is how replication is handled: a single coordinator node applies the modification, then propagates the update to the replicas and waits for their acknowledgements. Vector clocks are then used to order the versions: a version is newer if its vector clock dominates the other componentwise; otherwise the two versions are concurrent and need to be reconciled. A nice analogy is Newtonian absolute time versus the structure of spacetime, where some events are simply not comparable in time.
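A small sketch of the vector-clock comparison used to decide whether one version supersedes another (node IDs and clock values are made up):

```python
def dominates(a: dict, b: dict) -> bool:
    # a dominates b if a >= b on every node's counter and a > b on at least one.
    nodes = set(a) | set(b)
    ge_everywhere = all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
    gt_somewhere = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    return ge_everywhere and gt_somewhere


v1 = {"node-A": 2, "node-B": 1}
v2 = {"node-A": 1, "node-B": 1}
v3 = {"node-A": 1, "node-B": 2}

print(dominates(v1, v2))                     # True: v1 is a newer version than v2
print(dominates(v1, v3), dominates(v3, v1))  # False False: concurrent, must be reconciled
```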
Reads on Dynamo
A client periodically picks a random Dynamo node and downloads its current view of Dynamo membership state. Using this information the client can determine which set of nodes form the preference list for any given key.
This is called the client driven approach, and from their tests, it seems twice as fast as the server driven approach.
Preference lists 🟩
This solves the problem of finding the actual node that holds the data we are querying for. Distributed algorithms make the nodes agree on this preference list, which is simply the knowledge of which machine is responsible for which interval of the ring. With $N$ replicas per key, a write waits for acknowledgements from $W$ replicas and a read queries $R$ replicas; requiring $R + W > N$ guarantees that every read quorum overlaps every write quorum, so a read always reaches at least one replica holding the latest write.
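A tiny brute-force check of the quorum-intersection argument for $N = 3$, $R = W = 2$ (purely illustrative, not part of Dynamo):

```python
from itertools import combinations

N, R, W = 3, 2, 2
replicas = range(N)

# Every R-sized read quorum must share at least one replica with every W-sized write quorum.
ok = all(set(read) & set(write)
         for read in combinations(replicas, R)
         for write in combinations(replicas, W))
print(ok)  # True, because R + W > N
```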
Advantages and disadvantages 🟨
Pros:
- Highly scalable
- Robust to failure (because we replicate data on the nodes in front of us).
- Self-organizing (the ring adapts automatically as nodes join and leave).
Cons:
- Only lookup, no search (obvious)
- No data-integrity (we don’t have a way to check for integrity constraints)
- Security issues (harder to address in a fully decentralized system with no central authority).
Virtual Nodes
Two problems arise with the Chord protocol: there can be large gaps on the circle, and the underlying machines can have different hardware. The idea is to let each physical machine host several virtual nodes, in a number proportional to its hardware. In this manner, if a machine has more resources, it hosts more virtual nodes, which implies it is expected to handle more data (which is fine because it has more resources).
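Extending the earlier ring sketch, each physical machine can be hashed onto the ring several times, with the number of virtual nodes proportional to its (hypothetical) capacity:

```python
import hashlib
from bisect import bisect_right


def ring_position(name: str) -> int:
    return int(hashlib.md5(name.encode()).hexdigest(), 16)


# Hypothetical machines with different capacities (more capacity -> more virtual nodes).
capacities = {"big-machine": 4, "small-machine": 1}

ring = sorted((ring_position(f"{machine}#{i}"), machine)
              for machine, weight in capacities.items()
              for i in range(weight))
positions = [p for p, _ in ring]


def responsible_machine(key: str) -> str:
    i = bisect_right(positions, ring_position(key)) % len(ring)
    return ring[i][1]


# The big machine covers more of the ring, so on average it receives more keys.
keys = [f"key-{k}" for k in range(1000)]
counts = {m: sum(responsible_machine(k) == m for k in keys) for m in capacities}
print(counts)
```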
References
[1] DeCandia et al., "Dynamo: Amazon's Highly Available Key-Value Store", ACM SIGOPS Operating Systems Review, Vol. 41(6), pp. 205-220, 2007.