Cache Optimization

Locality principles Remember the two locality principles in Memoria. Temporal locality and spatial locality. Temporal Locality Some elements just are accessed many times in time. This is an example of a temporal locality. Spatial locality Some elements are accessed close to each other, this is an idea of spatial locality. In modern architectures, the a line of cache is usually 64 bytes. For example consider this snippet: sum = 0; for (i = 0; i < n; i++) sum += a[i]; return sum; Sum is an example of temporal locality as the same memory location (or register) is accessed many times, and the access of the array a is an example of spatial locality. loops cycle through the same instructions, this is an example of temporal locality. ...

2 min · Xuanqiang 'Angelo' Huang

Compiler Limitations

On Compiler Adding compilation flags to gcc not always makes it faster, it just enables a specific set of optimization methods. It’s also good to turn on platform specific flags to turn on some specific optimization methods to that architecture. Remember that compilers are conservative, meaning they do not apply that optimization if they think it does not always apply. What are they good at Compilers are good at: mapping program to machine ▪ register allocation ▪ instruction scheduling ▪ dead code elimination ▪ eliminating minor inefficiencies ...

2 min · Xuanqiang 'Angelo' Huang

Fast Linear Algebra

Many problems in scientific computing include: Solving linear equations Eigenvalue computations Singular value decomposition LU/Cholesky/QR decompositions etc… And the userbase is quite large for this types of computation (number of scientists in the world is growing exponentially ) Quick History of Performance Computing Early seventies it was EISPACK and LINPACK. Then In similar years Matlab was invented, which simplified a lot compared to previous systems. LAPACK redesigned the algorithms in previous libraries to have better block-based locality. BLAS are kernel functions for each computer, while LAPACK are the higher level functions build on top of BLAS (1, 2,3). Then another innovation was ATLAS, which automatically generates the code for BLAS for each architecture. This is called autotuning because it does a search of possible enumerations and chooses the fastest one. Now autotuning has been done a lot for NN systems. ...

4 min · Xuanqiang 'Angelo' Huang

Skylake Microprocessor

The Skylake processor is a 2015 Intel processor. The Intel Processor In 1978 Intel made the choice to have retrocompatibility for every processor. At that time they had the 8086 processor, with some number of memory bits. For backwards compatibility intructions have usually just grown. They used geographic locations because these are not suable. If we want new code to run for old processors, we should need to put specific flags. ...

4 min · Xuanqiang 'Angelo' Huang

Sparse Matrix Vector Multiplication

Algorithms for Sparse Matrix-Vector Multiplication Compressed Sporse Row This is an optimized way to store rows for sparse matrices: Sparse MVM using CSR void smvm(int m, const double* values, const int* col_idx, const int* row_start, double* x, double* y) { int i, j; double d; /* Loop over m rows */ for (i = 0; i < m; i++) { d = y[i]; /* Scalar replacement since reused */ /* Loop over non-zero elements in row i */ for (j = row_start[i]; j < row_start[i + 1]; j++) { d += values[j] * x[col_idx[j]]; } y[i] = d; } } Let’s analyze this code: Spatial locality: with respect to row_start, col_idx and values we have spatial locality. Temporal locality: with respect to y we have temporal locality. (Poor temporal with respect to $x$) Good storage efficiency for the sparse matrix. But it is 2x slower than the dense matrix multiplication when the matrix is dense. Block CSR But we cannot do block optimizations for the cache with this storage method. ...

3 min · Xuanqiang 'Angelo' Huang

Recurrent Neural Networks

Recurrent Neural Networks allows us to model arbitrarily long sequence dependencies, at least in theory. This is very handy, and has many interesting theoretical implication. But here we are also interested in the practical applicability, so we may need to analyze common architectures used to implement these models, the main limitation and drawbacks, the nice properties and some applications. These network can bee seen as chaotic systems (non-linear dynamical systems), see Introduction to Chaos Theory. ...

5 min · Xuanqiang 'Angelo' Huang

Cloud Reliability

Reliability is the ability of a system to remain operational over time, i.e., to offer the service it was designed for. Cloud Hardware and software fails. In this note, we will try to find methods to analyze and predict when components fail, and how we can prevent this problem. Defining the vocabulary Availability $$ \text{Availability} = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}} $$MTTF: Mean Time To Failure $$ \text{MTTF} = \frac{1}{r} $$ This definition does not include repair time, and assumes the failures are independent with each other. ...

6 min · Xuanqiang 'Angelo' Huang

Devices OS

Devices Categorizzazione (6)🟨- Trasferimento dei dati Accesso al device sinfonia del trasferimento condivisone fra processi Velocità del trasferimento I/O direction (scrittura o lettura) Vediamo che molte caratteristiche sono riguardo il trasferimento Slide categorizzazione I/O Blocchi o caratteri 🟩- Slide devices blocchi o caratteri Tecniche di gestione devices (4) 🟨- Buffering Possiamo mettere un buffer per favorire la comunicazione fra i devices. la cosa migliore che fa è creare maggiore efficienza. Un altro motivo è la velocità diversa di consumo. ...

6 min · Xuanqiang 'Angelo' Huang

Queueing Theory

Queueing theory is the theory behind what happens when you have lots of jobs, scarce resources, and subsequently long queues and delays. It is literally the “theory of queues”: what makes queues appear and how to make them go away. This is basically what happens in clusters, where you have a limited number of workers that need to execute a number of jobs. We need some little maths to model the stochastic process of request arrivals. ...

6 min · Xuanqiang 'Angelo' Huang

Birdsong and Song System

How are inputs made into motion? We analyze feedback systems in auditory systems in birds. Motivation Birds are very good at producing and reproducing songs by moving their vocal cords complexly (sensory motor learning). Birds and Humans do not have much of a common ancestry (last one was fishes). 71% of the birds, both female and male birds sing, for Zebra finch it is a mating behaviour, so only male sing. ...

5 min · Xuanqiang 'Angelo' Huang