Apache Spark

Apache Spark is a framework that is faster than MapReduce (see Massive Parallel Processing). It is written in Scala and takes a more functional approach to programming. Spark extends the earlier MapReduce framework to a generic distributed dataflow, properly modeled as a DAG. There are further benefits of using Spark instead of the MapReduce framework: Spark processes data in memory, avoiding the disk I/O overhead of MapReduce, which makes it significantly faster; and Spark uses a DAG to optimize the entire workflow, reducing data shuffling and stage count. But MapReduce sometimes has its advantages: ...
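
A minimal PySpark sketch of this lazy DAG model (assuming the `pyspark` package is available; the app name and input strings are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-example").getOrCreate()
sc = spark.sparkContext

# Each transformation (flatMap, map, reduceByKey) only extends the DAG;
# nothing runs until the collect() action triggers the optimized plan.
counts = (
    sc.parallelize(["spark is fast", "spark uses a dag"])
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)
print(counts.collect())  # e.g. [('spark', 2), ('is', 1), ...]
spark.stop()
```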

9 min · Xuanqiang 'Angelo' Huang

Cloud Storage

Object Stores Object storage design principles 🟨++ We do not want the hierarchy that is common in filesystems, so we simplify it down to these four principles: black-box objects; a flat and global key-value model (a trivial model, easy to access, with no need to traverse a file hierarchy); flexible metadata; commodity hardware (the battery idea of Tesla until 2017). Object storage usages 🟩 Object storage is useful for storing things that are usually read-intensive. Some examples are ...
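
An illustrative sketch of the flat key-value model (not a real object-store client; the store, keys, and metadata fields are made up):

```python
# Flat, global key-value model: black-box values plus flexible metadata.
store: dict[str, tuple[bytes, dict]] = {}

def put_object(key: str, data: bytes, **metadata) -> None:
    # No directories to traverse: the key is the only address.
    # Slashes in the key are just a naming convention, not a hierarchy.
    store[key] = (data, metadata)

def get_object(key: str) -> bytes:
    data, _metadata = store[key]
    return data

put_object("backups/2024/db.dump", b"...", content_type="application/octet-stream")
print(get_object("backups/2024/db.dump"))
```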

15 min · Xuanqiang 'Angelo' Huang

Data Cubes

Data cubes are a data format especially useful for heavy reads. They were popularized in business environments where the main use of data was producing reports (many reads). This also links to the OLAP (Online Analytical Processing) vs. OLTP (Online Transaction Processing) distinction, where one is optimized for reads and the other for writes. The main driver behind data cubes was business intelligence. While traditional relational database systems focus on the day-to-day business of a company and on record keeping (customers placing orders, inventories kept up to date, etc.), business intelligence focuses on producing high-level reports to support C-level executives in making informed decisions. ...
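
A hedged sketch of such a read-heavy aggregation using pandas (the sales data and column names are hypothetical):

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US"],
    "product": ["A",  "B",  "A",  "B"],
    "revenue": [100,  150,  200,  50],
})

# Slice the "cube": total revenue per (region, product) cell,
# the typical OLAP-style query behind a business report.
cube = sales.pivot_table(values="revenue", index="region",
                         columns="product", aggfunc="sum")
print(cube)
```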

4 min · Xuanqiang 'Angelo' Huang

Data Models and Validation

A data model is an abstract view over the data that hides the way it is stored physically, the same idea as in (Codd 1970). This is why we should not modify data directly, but go through an abstraction that maintains the properties of that specific data model. Data Models Tree view 🟩 We can view all JSON and XML data, as presented in Markup, as trees. This structure is usually quite evident, as it is inherent in their design. Converting the in-memory tree into its textual representation is known as serialization, while the reverse process is called parsing. ...
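
A small sketch with the standard library showing the same JSON tree moving between its two forms (the document content is made up):

```python
import json

text = '{"name": "Ada", "skills": ["math", "logic"]}'
tree = json.loads(text)            # parsing: text -> in-memory tree
tree["skills"].append("code")      # manipulate via the data model, not the raw bytes
print(json.dumps(tree, indent=2))  # serialization: tree -> text
```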

10 min · Xuanqiang 'Angelo' Huang

Distributed file systems

We want to know how to handle systems that hold large amounts of data. In a previous lesson we saw how to quickly access and build scalable systems of huge dimensions, see Cloud Storage. Object storage can store billions of files; here we want to handle millions of files of petabyte size. Designing DFSs The Use Case Remember that file sizes were heavily limited in Cloud Storage. The physical limitation was the capacity of a single hard disk, usually on the order of terabytes. Here, we would like to easily store petabytes of data in a single file, for example big datasets. Another feature that should be easily supported is highly concurrent access to the filesystem, and, last but not least, the ability to set up permissions in the system. ...
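
An illustrative sketch of block-based storage in a DFS: a file larger than any single disk is split into fixed-size chunks that can live on different machines (the chunk size here reuses the 10 GB figure mentioned in the Massive Parallel Processing entry below, as an assumption):

```python
CHUNK_SIZE = 10 * 1024**3  # assumed 10 GB chunk size

def split_into_chunks(path: str):
    """Yield (chunk_index, bytes) pairs for a file of arbitrary size."""
    with open(path, "rb") as f:
        index = 0
        while chunk := f.read(CHUNK_SIZE):
            yield index, chunk  # each chunk can be placed on a different node
            index += 1
```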

10 min · Xuanqiang 'Angelo' Huang

Document Stores

Document stores provide a native database management system for semi-structured data. Document stores also scale to gigabytes or terabytes of data, and typically millions or billions of records (a record being a JSON object or an XML document). Introduction to Document Stores A document store, unlike a data lake, manages the data directly, and users do not see the physical layout. Unlike data lakes, using a document store prevents us from breaking data independence by reading the data files directly: it offers an automatic management service for semi-structured data that we need to ingest and read quickly. ...
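
A hedged sketch using pymongo (assumes a MongoDB server on localhost; the database, collection, and document contents are made up):

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
people = client["demo_db"]["people"]  # hypothetical database and collection

people.insert_one({"name": "Ada", "age": 36, "tags": ["math"]})

# Query by content, not by physical layout: the store manages the files.
for doc in people.find({"age": {"$gt": 30}}):
    print(doc)
```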

6 min · Xuanqiang 'Angelo' Huang

Graph Databases

We first cited the graph data model in the Introduction to Big Data note. Until now we have explored many aspects of relational databases, but now we change the data model completely. The main reason driving this discussion is the limitations of classical relational databases: queries such as traversing a large number of relationships, reverse traversal (which also requires indexing foreign keys; you need a double index, since an index only works in one direction for relationship traversal, so if you need both directions you must build one index for the forward key and one for the backward key), and looking for patterns in the relationships are especially expensive in ordinary databases. We improved over the join problem of relational databases using Document Stores with tree data structures, but trees cannot have cycles. The alternative is called index-free adjacency: we use physical memory pointers to store the graph, as sketched below. ...
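
A minimal sketch of index-free adjacency (the node names are made up): each node keeps direct references to its neighbours in both directions, so traversal is pointer chasing rather than index lookups or joins.

```python
class Node:
    def __init__(self, name: str):
        self.name = name
        self.out_edges: list["Node"] = []  # forward traversal
        self.in_edges:  list["Node"] = []  # reverse traversal, no second index needed

def link(src: Node, dst: Node) -> None:
    src.out_edges.append(dst)
    dst.in_edges.append(src)

alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
link(alice, bob)
link(bob, carol)

# Two-hop traversal: O(degree) pointer chasing, no joins.
print([n.name for friend in alice.out_edges for n in friend.out_edges])  # ['carol']
```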

7 min · Xuanqiang 'Angelo' Huang

Introduction to Big Data

Data Science is similar to physics: it attempts to create theories of reality based on a formalism that another science brings. For physics that formalism was mathematics; for data science it is computer science. Data has grown rapidly in recent years and, laid out in metres, has reached the distance to Jupiter. The galaxy is on the order of magnitude of 400 yottametres, which has $3 \cdot 8 = 24$ zeros following it (a yottametre is $10^{24}$ metres). So quite a lot. We do not know whether data will keep growing this fast, but we certainly need to be able to face this case. ...

10 min · Xuanqiang 'Angelo' Huang

Markup

Introduction to the functions of markup 🟩 The semantics of a word is characterized by my choice (a design decision about its meaning). That by itself does not tell us much, so let us try to say something more. We define markup as any means of making a particular interpretation of a text explicit. In particular, it is a way of making some meaning explicit (a bit like punctuation, which gives extra information beyond the individual words and makes the use of the text clearer). ...

8 min · Xuanqiang 'Angelo' Huang

Massive Parallel Processing

We have a group of mappers that partition the keys for a set of reducers, which then actually process each group of data. The bottleneck is the assigning part: when the mappers finish, they need to hand the data over to the reducers, as sketched below. Introduction Common input formats 🟨 You need to know these well: shards; textual input; binary formats (Parquet and similar); CSV and similar. Sharding 🟩 It is common practice to divide a big dataset into chunks (or shards), smaller parts which, recomposed, give back the original dataset. For example, in Cloud Storage settings we often divide big files into chunks, while in Distributed file systems the system automatically divides big files into native files of at most 10 GB. ...
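
A small sketch of that assigning step (the shuffle): mapper output keys are hash-partitioned so that every key lands at exactly one reducer (the key-value pairs and reducer count are made up):

```python
from collections import defaultdict

NUM_REDUCERS = 3

def partition(key: str) -> int:
    # Same key -> same reducer. Real systems use a stable hash,
    # not Python's per-process salted hash; this is only illustrative.
    return hash(key) % NUM_REDUCERS

mapper_output = [("spark", 1), ("dag", 1), ("spark", 1)]

reducer_inputs: dict[int, list] = defaultdict(list)
for key, value in mapper_output:
    reducer_inputs[partition(key)].append((key, value))

print(dict(reducer_inputs))
```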

14 min · Xuanqiang 'Angelo' Huang