Apache Spark

This is a new framework that is faster than MapReduce (see Massive Parallel Processing). It is written in Scala and takes a more functional approach to programming. Spark generalizes the MapReduce framework to a generic distributed dataflow, modeled as a DAG (directed acyclic graph). Spark has other benefits over MapReduce: it processes data in memory, avoiding the disk I/O overhead of MapReduce, which makes it significantly faster, and it uses the DAG to optimize the entire workflow, reducing data shuffling and stage count. But MapReduce sometimes has its advantages: ...

December 27, 2024 · Reading Time: 9 minutes · By Xuanqiang Angelo Huang

Data Cubes

Data cubes are a data format especially useful for read-heavy workloads. They were popularized in business environments where the main use of data was to produce reports (many reads). This also links to the OLAP (Online Analytical Processing) vs OLTP (Online Transaction Processing) distinction, where the former is optimized for reads and the latter for writes. The main driver behind data cubes was business intelligence. While traditional relational database systems focus on the day-to-day business of a company and on record keeping (customers placing orders, inventories kept up to date, etc.), business intelligence focuses on producing high-level reports that support C-level executives in making informed decisions. ...

December 20, 2024 · Reading Time: 4 minutes · By Xuanqiang Angelo Huang

Introduction to Big Data

Data Science is similar to physics: it attempts to create theories of reality based on a formalism that another science provides. For physics it was mathematics; for data science it is computer science. Data has grown rapidly in recent years and has reached a size that, expressed in metres, is the distance to Jupiter. The galaxy is on the order of 400 Yottametres, where the yotta prefix stands for $3 \cdot 8$ zeros. So quite a lot. We don't know whether data will keep growing this fast, but we certainly need to be able to face that case. ...

December 20, 2024 · Reading Time: 10 minutes · By Xuanqiang Angelo Huang

Querying Denormalized Data

TODO: write the introduction to the note. JSONiq purports to be an easy query language that can run everywhere. It attempts to solve common problems in SQL, namely the lack of support for nested data structures and for JSON data types. A nice thing about JSONiq is that it is functional, which makes its queries quite powerful and flexible. It is also declarative and set-based, traits it shares with SQL. ...

November 26, 2024 · Reading Time: 6 minutes · By Xuanqiang Angelo Huang

Character Encoding

Introduction to encoding. Here we cover methods for encoding the characters of human languages, such as ASCII, UCS, and UTF. Digitizing information means encoding it in a system that can be stored on an electronic storage device. Obviously we cannot keep the information as it is; we want to store an equivalent form that is easier for a computer to manipulate. We therefore create a mapping, or isomorphism, between each original atomic value and its encoding, usually a numeric value. ...

January 15, 2025 · Reading Time: 9 minutes · By Xuanqiang Angelo Huang

Database Normalization

Introduction to normalization. Why normalize? To improve the quality of our database: normalization resolves anomalies that can arise within it. These anomalies matter mostly for write-intensive systems, where we want to keep our data in good shape. It is not rare, however, that we only want to read. In those cases systems such as cloud storage or distributed file systems may be more effective. ...

January 5, 2025 · Reading Time: 6 minutes · By Xuanqiang Angelo Huang

Uniform Resource Identifier

URIs were THE real invention of Berners-Lee, mentioned in Storia del web. The problem is having a way to uniquely identify a resource on the internet. Introduction. The resource. A resource is any structure that is exchanged between applications within the World Wide Web. A resource can be anything, not only a file! The notion is therefore agnostic to the content and to how the data is stored. Here too it becomes clear how important it is to have standards that enable communication ...

January 28, 2025 · Reading Time: 6 minutes · By Xuanqiang Angelo Huang

Structured Query Language

A little history. SQL was invented in the early 1970s at IBM's research laboratory in San Jose (now Almaden) by Don Chamberlin and Raymond Boyce, for the first relational database, called System R. Because of trademark issues it could not be called SEQUEL, so it was branded SQL. SQL is a declarative language. With declarative languages there is a separation between what I call the intentionality and the actual process. In a declarative language we just say what we want the result to be, and don't care what the actual implementation is like. This allows queries to be executed and optimized in different ways, even if the query on the surface is the same ...

December 20, 2024 · Reading Time: 7 minutes · By Xuanqiang Angelo Huang