Counterfactual Invariance

Machine learning models cannot by themselves distinguish causal features from environment features. Shortcut Learning Often we observe shortcut learning: the model learns dataset-dependent shortcuts (e.g. which machine was used to take the X-ray) to make its predictions, but this is brittle and usually fails to generalize. Shortcut learning happens when the model exploits correlations between causal and non-causal features that hold in the training set but break at test time. In most cases our object of interest, not the surrounding environment, should be the focus: a camel on grassland should still be recognized as a camel, not a cow. One solution is to engineer invariant representations that are independent of the environment, i.e. to have an encoder that produces such representations. ...
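One common way to state this desideratum formally (my paraphrase of the standard definition, e.g. Veitch et al. 2021, not a quote from the post): writing $X(e)$ for the counterfactual input obtained by setting the environment variable to $e$, a predictor $f$ is counterfactually invariant if

$$ f(X(e)) = f(X(e')) \quad \text{almost surely, for all environments } e, e', $$

so changing the environment alone, with the causal content held fixed, never changes the prediction.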

January 18, 2025 · Reading Time: 9 minutes ·  By Xuanqiang Angelo Huang

Low-Rank Adaptation

LoRA: Low-Rank Adaptation for Fine-Tuning Large Models Motivation & Problem Setting Full fine-tuning of modern foundation models updates all $|\Theta|$ parameters, where $|\Theta|$ can reach $10^{11}$. Each downstream task produces a full-sized checkpoint, making per-task storage, distribution, and serving infeasible. We want a method that (i) reduces trainable parameters by orders of magnitude, (ii) does not add inference latency, and (iii) matches full fine-tuning quality. LoRA — introduced by Hu et al. (2021, Microsoft) — is currently the dominant answer. ...
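A minimal sketch of the mechanism, assuming PyTorch (the class and parameter names here are mine, not from the post): the pretrained weight stays frozen and a trainable rank-$r$ update $\Delta W = BA$ is added on top, so only $r(d_{\text{in}} + d_{\text{out}})$ parameters per layer are trained, and $BA$ can be merged into $W$ after training so inference latency is unchanged.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA layer: y = x W^T + (alpha/r) * x A^T B^T.

    Only A and B are trainable; the pretrained weight W stays frozen.
    (Hypothetical sketch, not the reference implementation.)"""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False  # freeze pretrained weights
        # Low-rank factors: B starts at zero so training begins exactly at W.
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768, r=8)
y = layer(torch.randn(4, 768))  # trainable params: 2 * 8 * 768 vs 768^2
```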

Reading Time: 16 minutes ·  By Xuanqiang Angelo Huang

Sobolev Spaces

Sobolev Spaces Motivation & Setup PDE theory and the calculus of variations require function spaces in which (i) differentiation makes sense for non-smooth functions, (ii) the space is complete under an $L^p$-flavored norm, and (iii) one can embed into $L^q$ or Hölder spaces. The classical $C^k$ spaces fail (ii); pure $L^p$ fails (i). Sobolev spaces $W^{k,p}$ are the fix: they replace pointwise differentiation with the weak derivative and complete $C^k$ under the natural $L^p$-Sobolev norm. ...
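Concretely, the two definitions the excerpt leans on (standard statements, not quoted from the post): $v = D^{\alpha}u$ is the weak derivative of $u$ if

$$ \int_{\Omega} u \, D^{\alpha}\varphi \, dx = (-1)^{|\alpha|} \int_{\Omega} v\,\varphi \, dx \quad \text{for all } \varphi \in C_c^{\infty}(\Omega), $$

and $W^{k,p}(\Omega)$ collects the $u \in L^p(\Omega)$ all of whose weak derivatives up to order $k$ lie in $L^p(\Omega)$, complete under the norm

$$ \|u\|_{W^{k,p}(\Omega)} = \Big( \sum_{|\alpha| \leq k} \|D^{\alpha}u\|_{L^p(\Omega)}^{p} \Big)^{1/p}, \qquad 1 \leq p < \infty. $$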

Reading Time: 1 minute ·  By Xuanqiang Angelo Huang

Spectral Theorem

The Spectral Theorem: A Theorem-Chain Construction Strategic Overview The spectral theorem is not one theorem but a family. Here we build the finite-dimensional versions — both for self-adjoint operators on real inner product spaces and normal operators on complex inner product spaces — and end by sketching the path to the infinite-dimensional (bounded / unbounded / compact) generalizations. Two genuinely different proof routes are possible. The thing to remember is that the complex case is cleaner because the Fundamental Theorem of Algebra hands you an eigenvalue for free; the real case requires an extra trick (complexification or a quadratic-factor argument) to extract an eigenvalue from a self-adjoint operator. ...
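For reference, the two finite-dimensional statements being built (standard forms; my summary rather than the post's wording): if $A = A^{\top} \in \mathbb{R}^{n \times n}$ is symmetric, there exist an orthogonal $Q$ and a real diagonal $\Lambda$ with

$$ A = Q \Lambda Q^{\top}, \qquad Q^{\top} Q = I, $$

and $A \in \mathbb{C}^{n \times n}$ is unitarily diagonalizable, $A = U \Lambda U^{*}$, if and only if it is normal, $A A^{*} = A^{*} A$.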

Reading Time: 11 minutes ·  By Xuanqiang Angelo Huang

Universal Composability

Universal Composability (UC) Framework Motivation: The Composition Problem The fundamental issue UC addresses: protocols proven secure in isolation can fail catastrophically when run concurrently with other protocols. Classical security definitions (e.g., standalone simulation-based security) do not guarantee that security properties are preserved under arbitrary composition. Real-world systems always run multiple protocols simultaneously, sharing state, keys, randomness, and communication channels. The Composition Problem A protocol $\pi$ proven secure in a standalone setting may become insecure when executed concurrently with arbitrary protocols $\pi_1, \dots, \pi_n$, even if each $\pi_i$ is itself secure in isolation. ...
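The emulation notion the framework is built on, in its standard shape (my summary, hedged): $\pi$ UC-realizes an ideal functionality $\mathcal{F}$ if for every adversary $\mathcal{A}$ there exists a simulator $\mathcal{S}$ such that no environment $\mathcal{Z}$ can distinguish the real execution from the ideal one,

$$ \forall \mathcal{A}\; \exists \mathcal{S}\; \forall \mathcal{Z}: \quad \mathrm{EXEC}_{\pi,\mathcal{A},\mathcal{Z}} \approx \mathrm{EXEC}_{\mathcal{F},\mathcal{S},\mathcal{Z}}. $$

The quantifier order, with the environment quantified last, is what lets security survive arbitrary composition.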

Reading Time: 9 minutes ·  By Xuanqiang Angelo Huang

Multi-Objective Gradient Descent

Multi-Objective Gradient Descent (MGDA) Problem Formulation Brief context: standard gradient descent optimizes a single scalar loss. MGDA generalizes this to $T$ tasks, seeking Pareto-optimal solutions rather than a single weighted compromise. Multi-Objective Optimization Setup Single-objective baseline: minimize $\mathcal{L}(\theta)$ — one gradient, one direction. Multi-objective generalization: given $T$ task losses $\{\mathcal{L}_t(\theta)\}_{t=1}^T$, there is generally no $\theta$ minimizing all simultaneously. The target shifts to the Pareto front. Pareto optimality: $\theta^*$ is Pareto-optimal if there exists no $\theta'$ such that $\mathcal{L}_t(\theta') \leq \mathcal{L}_t(\theta^*)$ for all $t$, with strict inequality for at least one $t$. ...
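For $T = 2$ the min-norm subproblem at the heart of MGDA has a closed form (Sener & Koltun, 2018); a minimal NumPy sketch, with function names of my own choosing:

```python
import numpy as np

def mgda_two_task_direction(g1, g2, eps=1e-12):
    """Min-norm element of the convex hull of two task gradients.

    Closed form (my transcription of Sener & Koltun, 2018):
        alpha* = clip(<g2 - g1, g2> / ||g1 - g2||^2, 0, 1)
    Stepping along the negative of the returned vector decreases every
    task loss, unless the vector is zero (Pareto-stationary point)."""
    diff = g1 - g2
    denom = np.dot(diff, diff)
    if denom < eps:  # gradients (nearly) identical: any alpha works
        alpha = 0.5
    else:
        alpha = np.clip(np.dot(g2 - g1, g2) / denom, 0.0, 1.0)
    return alpha * g1 + (1.0 - alpha) * g2

# Toy check: conflicting gradients yield a compromise direction.
g = mgda_two_task_direction(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(g)  # [0.5 0.5]: the minimum-norm point on the segment between g1, g2
```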

Reading Time: 5 minutes ·  By Xuanqiang Angelo Huang

Topology Crash Course

A Crash Course in Topology This is a tour of the landmarks beyond Topological Spaces and Metric Spaces. Order of climb: first see how topologies are generated, then ascend the axiom ladder (separation, countability, compactness), then meet the invariants (algebraic topology, Euler characteristic), and finally connect back to fixed-point arguments — the bridge to game theory and mechanism design. Generating Topologies Listing all open sets is intractable. We generate them. Basis A basis $\mathcal{B}$ for a topology on $X$ is a family of subsets satisfying: ...
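The two axioms the excerpt truncates are standard, so for completeness (my wording, not the post's): (1) $\mathcal{B}$ covers $X$, i.e. $\bigcup_{B \in \mathcal{B}} B = X$; and (2) for all $B_1, B_2 \in \mathcal{B}$ and every $x \in B_1 \cap B_2$ there is some $B_3 \in \mathcal{B}$ with

$$ x \in B_3 \subseteq B_1 \cap B_2. $$

The topology generated by $\mathcal{B}$ then consists of all unions of basis elements.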

Reading Time: 14 minutes ·  By Xuanqiang Angelo Huang

Entropy

This was introduced by Shannon in 1948 (Shannon 1948). The notion is based on probability: rare events are more informative than events that happen often. Introduction to Entropy The Shannon Information Content $$ h(x = a_{i}) = \log_{2}\frac{1}{P(x = a_{i})} $$ We will see that entropy is a weighted average of the information content, i.e. the expected information content of a distribution. Kolmogorov complexity is a different way to define complexity. Related: Neural Networks#Kullback-Leibler Divergence. ...
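The weighted average announced above is the standard definition (my completion, hedged): entropy is the expected Shannon information content over the ensemble,

$$ H(X) = \sum_{i} P(x = a_{i}) \log_{2} \frac{1}{P(x = a_{i})} = \mathbb{E}\,[h(x)]. $$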

September 20, 2024 · Reading Time: 15 minutes ·  By Xuanqiang Angelo Huang

Distributional Reinforcement Learning

Distributional Reinforcement Learning Motivation: Why Bother With the Whole Distribution? Standard value-based RL collapses the random return into a single scalar via expectation: $Q(s,a) = \mathbb{E}[Z(s,a)]$. The distributional perspective (Bellemare, Dabney, Munos, 2017) argues that this is information-destructive: two policies with identical means can have wildly different return distributions (bimodal vs unimodal, heavy-tailed vs concentrated), and modelling the full distribution yields: Auxiliary learning signal — richer targets stabilise representation learning, even when only $\mathbb{E}[Z]$ is used for control. Empirically the biggest reason it works. Risk-sensitivity — CVaR, distortion measures, robust planning all need $F_Z$, not just $\mathbb{E}[Z]$. Better gradient information for function approximators (denser supervision than a single scalar regression target). Note (the empirical surprise): Distributional RL was originally motivated by risk-sensitivity, but the headline result of C51 was that it improved risk-neutral (mean-greedy) control on Atari. Modelling the distribution is a regulariser / representation booster, not just a risk tool. ...
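The object replacing $Q$ is the random return $Z$, which satisfies a distributional Bellman equation (standard form, my summary), where $\overset{D}{=}$ denotes equality in distribution:

$$ Z(s,a) \overset{D}{=} R(s,a) + \gamma\, Z(S', A'), \qquad S' \sim p(\cdot \mid s,a), \; A' \sim \pi(\cdot \mid S'). $$

Taking expectations on both sides recovers the ordinary Bellman equation for $Q(s,a)$.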

Reading Time: 8 minutes ·  By Xuanqiang Angelo Huang

Ad-hoc Teamwork

Ad-Hoc Teamwork (AHT) in Reinforcement Learning Problem Setting & Motivation Ad-hoc teamwork concerns agents that must cooperate effectively with previously unknown teammates without any prior coordination, communication protocol, or shared learning history. This is the multi-agent analogue of zero-shot generalization: the agent is dropped into a team and must immediately contribute productively. The setting departs from self-play / centralized training in that the partner distribution at test time is exogenous and possibly non-stationary. ...

Reading Time: 7 minutes ·  By Xuanqiang Angelo Huang