Notes

Optimizations for DNN

Mixture of Experts: a gate activates a subset of the experts, and the output is the weighted sum of the outputs of those experts. The weights are computed by a gating network. One problem is load balancing, since inputs are not assigned uniformly across experts. There is also a lot of communication overhead when the experts are placed on different devices. LoRA (Low-Rank Adaptation): we fine-tune only a small part of the network, called LoRA adapters, not the whole thing. Each adapter consists of two matrices, A and B, which resemble a kind of autoencoder (a projection down to a low rank and back up), applied to the Q and V matrices of every attention layer in the LLM. The nice thing is that there is little extra inference cost if the adapters are merged post training: ...
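As a rough illustration of the LoRA merging claim, here is a minimal sketch in pure Python. It assumes the common formulation in which the adapted layer computes W x + B (A x); the sizes `d` and `r`, the random initialization, and the helper names are illustrative assumptions, and the usual scaling factor is omitted for brevity.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1] (stand-in for trained weights)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, r = 8, 2  # hidden size and adapter rank (illustrative values)

W = rand_matrix(d, d, rng)  # frozen pretrained weight
A = rand_matrix(r, d, rng)  # adapter down-projection, d -> r
B = rand_matrix(d, r, rng)  # adapter up-projection, r -> d
x = rand_matrix(d, 1, rng)  # one input column vector

# While fine-tuning: frozen path plus the low-rank adapter path.
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)))

# After training: fold the adapter into the weight once, then use one matmul.
W_merged = add(W, matmul(B, A))
y_merged = matmul(W_merged, x)

max_diff = max(abs(a[0] - b[0]) for a, b in zip(y_adapter, y_merged))
trainable_params = 2 * d * r  # entries of A and B
full_params = d * d           # entries of W

print(max_diff, trainable_params, full_params)
```

The two outputs agree up to floating-point rounding, which is why a merged adapter adds no extra matrix multiplication at inference time.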

Architecture of the Brain

First, the brain is organized into functionally specific areas; second, neurons in different parts of the vertebrate nervous system, indeed in all nervous systems, are quite similar. Small comparison with computers: a gross observation when comparing a computer’s transistors with human neurons is that there is a big difference in numbers: trillions of transistors vs billions of neurons. There is a 6 orders of magnitude difference in frequency (GHz versus 1 kHz for neurons). There are many neuron types and different types of connections. There are also the digital vs analog and chemical modes of communication, parallel processing abilities, and fixed vs plastic architectures. But this compares transistors with a higher-level object, so the comparison might not be completely fair. They are very different from this point of view. And only some brain areas are similar to real neural networks. ...

Systems for Artificial Intelligence

At the time of writing, the compute requirements for machine learning models and artificial intelligence are growing at a staggering rate of 200% every 3.5 months. Interest in the area can be quantified as 10k papers per month on the topic, while dollar investments in compute (energy, cooling, sustainability of compute in general) have had a hard time keeping up with the continuous new requests. ...

RL Function Approximation

These algorithms scale well to large state spaces, but not to large action spaces. The Gradient Idea: recall Temporal difference learning and Q-Learning, two model-free policy evaluation techniques explored in Tabular Reinforcement Learning. A simple parametrization: the idea here is to parametrize the value estimation function so that similar inputs get similar values, akin to the Parametric Modeling estimation we have done in other courses. In this manner, we don’t need to explicitly explore every single state in the state space. ...
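To make the parametrization idea concrete, here is a minimal sketch of semi-gradient TD(0) with a linear value function. The chain environment, the feature map `features`, and the step sizes are all hypothetical choices made for illustration and are not taken from the original notes.

```python
def features(state, n_states):
    """Hypothetical feature map: normalized position plus a bias term."""
    return [state / (n_states - 1), 1.0]

def value(w, phi):
    """Linear value estimate: dot product of weights and features."""
    return sum(wi * pi for wi, pi in zip(w, phi))

def td0_update(w, phi, reward, phi_next, done, alpha=0.1, gamma=0.9):
    """One semi-gradient TD(0) step; the gradient of a linear value is phi."""
    target = reward + (0.0 if done else gamma * value(w, phi_next))
    td_error = target - value(w, phi)
    return [wi + alpha * td_error * pi for wi, pi in zip(w, phi)]

# Toy chain: states 0..4, the agent always moves right, and it receives
# reward 1 when it enters the terminal state 4 (assumed environment).
N_STATES = 5
w = [0.0, 0.0]
for _ in range(500):  # episodes
    state = 0
    while state < N_STATES - 1:
        next_state = state + 1
        done = next_state == N_STATES - 1
        reward = 1.0 if done else 0.0
        w = td0_update(w, features(state, N_STATES), reward,
                       features(next_state, N_STATES), done)
        state = next_state

v_start = value(w, features(0, N_STATES))
v_near_goal = value(w, features(3, N_STATES))
print(v_start, v_near_goal)
```

Only two weights are learned, yet every state gets a value estimate: states close to each other share features and therefore receive similar values, which is the generalization the parametrization is meant to buy.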

Proximal Policy Optimization

(Schulman et al. 2017) is one of the main papers that practically started the field. This is also a good resource for policy gradients: https://lilianweng.github.io/posts/2018-04-08-policy-gradient/ See RL Function Approximation, this document is deprecated. References [1] Schulman et al. “Proximal Policy Optimization Algorithms” arXiv preprint arXiv:1707.06347 2017

Group Relative Policy Optimization

https://hlfshell.ai/posts/grpo/

Proximal Policy Optimization

This document is DEPRECATED, please see RL Function Approximation. This document attempts to briefly present the algorithm and some experiments found online about it. The following repo seems to be a good resource: here. Usually, PPO is explained as an actor-critic framework. This means there is an actor (the agent) that acts on the environment, and a critic that evaluates those actions using the feedback collected from the environment. The main idea is to select a new policy that is similar to the current one, so that it is less probable that a bad policy, one very different from the original, is selected. This is achieved by clipping the probability ratio between the new and old policies that multiplies the advantage. And then ...
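The clipping step can be sketched for a single sample as follows. This follows the clipped surrogate objective from Schulman et al. (2017), where the clip is applied to the probability ratio between the new and old policies; the function name and the default `eps=0.2` are illustrative choices.

```python
import math

def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate objective for one (state, action) sample.

    logp_new / logp_old are log-probabilities of the taken action under
    the new and old policies; advantage is the critic-based estimate.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum keeps the pessimistic value, which removes the
    # incentive to move the policy far away from the old one.
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps.
capped_gain = ppo_clip_term(math.log(1.5), 0.0, advantage=1.0)

# Negative advantage: the objective is capped once the ratio drops below 1 - eps.
capped_loss = ppo_clip_term(math.log(0.5), 0.0, advantage=-1.0)

# Unchanged policy: the term is just the advantage.
unchanged = ppo_clip_term(0.0, 0.0, advantage=2.0)

print(capped_gain, capped_loss, unchanged)
```

In training, this term would be averaged over a batch and maximized with respect to the new policy's parameters.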

Classical Cyphers

An introduction to cryptography, based on Christof Paar's cryptography course on YouTube, with additions from the Unibo course. Classifications and definitions: cryptography nowadays has many applications, and it is an increasingly important field. Cryptology (2): the field commonly referred to as cryptography is mainly divided into two branches, cryptography and cryptanalysis: the former tries to create new methods to encrypt messages, while the latter tries to attack these messages by recovering the original message. ...
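The split between the two branches can be illustrated with a toy example. The shift (Caesar) cipher is chosen here as an assumed stand-in for a classical cipher, since the excerpt does not name one; `brute_force` and the known word it searches for are hypothetical helpers.

```python
import string

ALPHABET = string.ascii_lowercase

def caesar_encrypt(plaintext, key):
    """Cryptography side: shift every lowercase letter by `key` positions."""
    out = []
    for ch in plaintext:
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + key) % 26])
        else:
            out.append(ch)  # leave spaces and punctuation untouched
    return "".join(out)

def caesar_decrypt(ciphertext, key):
    """Decryption is encryption with the opposite shift."""
    return caesar_encrypt(ciphertext, -key)

def brute_force(ciphertext, known_word):
    """Cryptanalysis side: try all 26 keys, keep those revealing known_word."""
    return [k for k in range(26) if known_word in caesar_decrypt(ciphertext, k)]

ciphertext = caesar_encrypt("attack at dawn", 3)
candidate_keys = brute_force(ciphertext, "attack")
print(ciphertext, candidate_keys)
```

With only 26 possible keys, the attacker recovers the message by exhaustive search, which is why such ciphers are of historical interest only.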

Bayesian Optimization

While Active Learning looks for the most informative points to recover a true underlying function, Bayesian Optimization is only interested in finding the maximum of that function. In Bayesian Optimization, we ask for the best way to sequentially choose a set of points $x_{1}, \dots, x_{n}$ in order to find $\max_{x \in \mathcal{X}} f(x)$ for a certain unknown function $f$. This is what the whole thing is about. Definitions: first we will introduce some useful definitions in this context. These were also partly introduced in N-Bandit Problem, which is one of the classical optimization problems we can find in the literature. ...

Datacenter Hardware

We want to optimize the parts of the datacenter hardware so that the cost of operating the datacenter as a whole is lower; we need to think about it as a whole. Datacenter CPUs. Desktop CPU vs Cloud CPU. Isolation: desktop CPUs have low isolation, since they are used by a single user. Cloud CPUs have high isolation, since they are shared among different users. Workload and performance: usually high workloads that move a lot of data around. Cloud CPUs come in a spectrum of low-end and high-end cores: if you have high parallelism you can use lower-end cores, while for resource-intensive tasks it's better to have high-end cores, especially for latency-critical tasks. ...