LoRA: Low-Rank Adaptation for Fine-Tuning Large Models
Motivation & Problem Setting
Full fine-tuning of modern foundation models updates all $|\Theta|$ parameters, where $|\Theta|$ can reach $10^{11}$. Each downstream task produces a full-sized checkpoint, making per-task storage, distribution, and serving infeasible. We want a method that (i) reduces trainable parameters by orders of magnitude, (ii) does not add inference latency, and (iii) matches full fine-tuning quality. LoRA — introduced by Hu et al. (2021, Microsoft) — is currently the dominant answer.
Why not just freeze + linear probe?
Linear probing is too weak for generative tasks (no internal representations adapt). We need a method that adapts internal computations of every transformer block, but cheaply.
Why not adapters?
Adapters (Houlsby et al., 2019; Pfeiffer et al.) insert small bottleneck MLPs $\sigma(xW_{\text{down}})W_{\text{up}}$ inside each transformer block. They work, but the inserted modules are sequentially on the forward path → inference latency overhead, especially painful at batch size 1 in autoregressive decoding. LoRA’s key engineering win is that its reparameterization can be algebraically merged into the base weight at inference time.
Core Mathematical Formulation
The LoRA Reparameterization
Let $W_0 \in \mathbb{R}^{d \times k}$ be a frozen pre-trained weight matrix (e.g., $W_Q$, $W_V$ in attention). Standard fine-tuning learns
$$ W = W_0 + \Delta W, \quad \Delta W \in \mathbb{R}^{d \times k}. $$
LoRA constrains $\Delta W$ to be low-rank:
$$ \Delta W = B A, \quad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}, \quad r \ll \min(d, k). $$
The forward pass becomes:
$$ h = W_0 x + \Delta W\, x = W_0 x + B(Ax). $$
[!note] Algebraic Merge at Inference Once trained, you compute $W' = W_0 + BA$ once and serve $W'$. Zero added latency, zero added memory at inference. This is the defining advantage over adapters and prefix tuning.
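A minimal sketch of the merge, assuming plain tensors shaped as above (the $\alpha/r$ scaling introduced later folds in identically):

```python
import torch

@torch.no_grad()
def merge_lora(W0: torch.Tensor, B: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    """Fold a trained LoRA update into the frozen base weight.
    W0: (d, k), B: (d, r), A: (r, k). Serve W' with zero extra latency."""
    return W0 + B @ A
```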
Parameter Count
- Full FT of one matrix: $dk$ params.
- LoRA: $r(d+k)$ params.
- For $d=k=4096$, $r=8$: $\frac{r(d+k)}{dk} = \frac{8 \cdot 8192}{16{,}777{,}216} \approx 0.0039$ — a ~256× reduction for that matrix.
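As a sanity check, the same arithmetic in code (a hypothetical helper, just restating the formula):

```python
def lora_param_ratio(d: int, k: int, r: int) -> tuple[int, float]:
    """Trainable params for one LoRA'd matrix, and the ratio vs. full FT."""
    lora, full = r * (d + k), d * k
    return lora, lora / full

print(lora_param_ratio(4096, 4096, 8))  # (65536, 0.00390625) -> ~256x smaller
```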
Initialization Scheme
$$ A \sim \mathcal{N}(0, \sigma^2), \quad B = \mathbf{0}. $$
Hence at step 0: $\Delta W = B A = 0$, so the model behaves exactly like the frozen base. Training starts at the pre-trained solution and moves outward.
[!question] Why not initialize both to small random? Two reasons. (1) Pre-trained start: if both are random, $\Delta W$ is nonzero at step 0 and the model is perturbed before any data is seen — you lose the pre-trained init guarantee. (2) Gradient structure: with $B=0$, the gradient $\partial \mathcal{L}/\partial A \propto B^\top (\cdot) = 0$ at the very first step, so only $B$ updates first. This staggered update actually helps stability — see LoRA+ below for the asymmetric-LR analysis.
Scaling Factor $\alpha/r$
LoRA introduces a scalar $\alpha$ and uses:
$$ \Delta W = \frac{\alpha}{r} BA. $$
The original paper sets $\alpha$ once and varies $r$, claiming this approximately removes the need to retune the learning rate when sweeping $r$. The intuition: as $r$ grows, $\|BA\|$ tends to grow proportionally; dividing by $r$ keeps the effective update magnitude scale-invariant.
[!tip] Practical rule of thumb Many practitioners fix $\alpha = 2r$ (e.g. $r=8, \alpha=16$). Some fix $\alpha = r$. The choice is empirical, but be aware that changing $\alpha$ at fixed $r$ is equivalent to scaling the LR for the LoRA branch — they are not independent knobs.
Theoretical Foundations
The Intrinsic-Rank Hypothesis
LoRA’s central conjecture: the update matrix $\Delta W$ acquired during fine-tuning has low intrinsic rank, even though it sits in an ambient space of dimension $dk$.
Empirical evidence in the paper:
- Restricting the update to rank $r=1$ or $r=2$ sometimes nearly matches $r=64$.
- The top singular directions of $\Delta W$ amplify features that are already present but underused in $W_0$ (i.e., $\Delta W$ correlates with task-relevant singular directions of $W_0$ that have small singular values).
This builds directly on Aghajanyan et al. 2020 “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning” which showed that one can re-parameterize the entire fine-tuning trajectory in a $\sim 200$-dimensional subspace and recover ≥90% of full-FT performance for BERT-base. Larger pre-trained models have smaller intrinsic dimension — the opposite of what naive parameter-counting would suggest.
[!note] Why low intrinsic rank is plausible Pre-training already learns generic features. Adapting to a downstream task should mostly require amplifying or suppressing a small number of feature directions — not learning new bases from scratch. This connects to Linear Mode Connectivity and the observation that fine-tuned and pre-trained models are connected by low-loss paths.
Expressivity of Rank-$r$ Updates
Any matrix in $\mathbb{R}^{d \times k}$ of rank $\le r$ can be written as $BA$ with the given shapes. So LoRA’s hypothesis class is exactly the rank-$\le r$ matrices. This is a strict subset of $\mathbb{R}^{d \times k}$. When $r = \min(d,k)$, LoRA recovers full fine-tuning capacity (but loses parameter savings).
The implicit prior: $\Delta W$ should be a low-rank perturbation. This is a strong inductive bias and is also a source of LoRA’s limitations (see Biderman et al. critique below).
Gradient Dynamics
For loss $\mathcal{L}$:
$$ \begin{align*} \frac{\partial \mathcal{L}}{\partial B} &= \frac{\partial \mathcal{L}}{\partial \Delta W} A^\top \\ \frac{\partial \mathcal{L}}{\partial A} &= B^\top \frac{\partial \mathcal{L}}{\partial \Delta W} \end{align*} $$
Note the asymmetry: $A$’s gradient is left-multiplied by $B^\top$, which is small early in training (init at 0). $B$’s gradient depends on $A^\top$, which is full-rank random Gaussian. This creates a feature-extraction/feature-projection split: $A$ identifies useful input directions, $B$ projects them into output space. The asymmetry motivates LoRA+ (Hayou et al. 2024), which argues $B$ should have a much larger LR than $A$ (often $\lambda_B = 16 \lambda_A$) to be in the feature-learning regime rather than the lazy regime.
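These identities are easy to verify with autograd; a minimal numerical check (shapes illustrative, $G$ stands in for $\partial\mathcal{L}/\partial\Delta W$):

```python
import torch

d, k, r = 6, 5, 2
A = torch.randn(r, k, requires_grad=True)
B = torch.zeros(d, r, requires_grad=True)   # LoRA init: B = 0
G = torch.randn(d, k)                       # plays the role of dL/d(DeltaW)

loss = ((B @ A) * G).sum()                  # makes dL/d(DeltaW) equal G exactly
loss.backward()

assert torch.allclose(B.grad, G @ A.T)      # dL/dB = dL/dDeltaW @ A^T
assert torch.allclose(A.grad, B.T @ G)      # dL/dA = B^T @ dL/dDeltaW
print(A.grad.abs().max())                   # tensor(0.): B = 0 blocks A at step 0
```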
Which Weights to Adapt?
Standard Application
The original paper applies LoRA only to attention weight matrices, and within attention, finds:
| Targets | Quality |
|---|---|
| $W_Q$ only | Weak |
| $W_V$ only | Decent |
| $W_Q + W_V$ | Strong (best param-efficient) |
| All of $\{W_Q, W_K, W_V, W_O\}$ | Marginally better, doubled params |
| FFN layers | Often substantial gains in modern practice |
In modern instruction tuning (LLaMA, Mistral), it is now standard to apply LoRA to all linear projections (attention + MLP up/gate/down). HuggingFace PEFT exposes this via `target_modules="all-linear"`.
[!tip] Heuristic If you have any compute budget, apply LoRA to all linear layers. The marginal cost is small compared to base model forward/backward, and FFN adaptation often matters more than the original paper suggested for generative tasks.
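In code this is one flag. A sketch with HuggingFace PEFT, assuming a recent `peft` release that supports the `"all-linear"` shorthand (model id is a placeholder):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules="all-linear",       # attention + MLP projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()     # a fraction of a percent of the total
```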
Embeddings and Output Heads
LoRA is usually not applied to token embeddings or the LM head. If your task requires new tokens (e.g., role markers <|user|>), you typically unfreeze and train the new embedding rows directly — they’re cheap.
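A hedged sketch of the new-token workflow, continuing the PEFT config above; `modules_to_save` is a real `LoraConfig` field, but the module names `embed_tokens`/`lm_head` are LLaMA-specific and vary by architecture:

```python
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|user|>", "<|assistant|>"]}
)
model.resize_token_embeddings(len(tokenizer))   # new rows start randomly initialized

config = LoraConfig(
    r=16, lora_alpha=32, target_modules="all-linear",
    modules_to_save=["embed_tokens", "lm_head"],  # trained fully, not LoRA'd
    task_type="CAUSAL_LM",
)
```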
Implementation Sketch (PyTorch)
```python
import math

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, r: int, alpha: float, dropout: float = 0.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weight (and bias)

        in_f, out_f = base_linear.in_features, base_linear.out_features
        self.A = nn.Parameter(torch.empty(r, in_f))   # down-projection, random init
        self.B = nn.Parameter(torch.zeros(out_f, r))  # up-projection, zero init
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scale = alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = self.base(x)  # frozen base path
        # LoRA path: x -> A^T -> B^T, scaled by alpha/r
        out = out + (self.dropout(x) @ self.A.T @ self.B.T) * self.scale
        return out
```
Two design choices worth noting:
- Dropout on the LoRA branch only (regularizes the low-rank update).
- The base path and the LoRA path are summed in a single graph, but only the LoRA parameters require gradients — so gradient and optimizer state exist only for $A$ and $B$.
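For completeness, a hypothetical helper (continuing the sketch above) that applies the wrapper model-wide; PEFT does the equivalent via module-name matching:

```python
def apply_lora(module: nn.Module, r: int = 8, alpha: float = 16.0) -> None:
    """Recursively replace every nn.Linear in `module` with a LoRALinear wrapper."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, r=r, alpha=alpha))
        else:
            apply_lora(child, r=r, alpha=alpha)

# apply_lora(model)
# trainable = [p for p in model.parameters() if p.requires_grad]  # only A and B
```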
Major Variants
QLoRA
Dettmers et al. 2023. Combines:
- 4-bit NormalFloat (NF4) quantization of the frozen base $W_0$.
- Double quantization of the quantization constants themselves.
- Paged optimizers to handle memory spikes from gradient checkpointing.
- Standard LoRA on top of the quantized base.
Forward pass: dequantize $W_0$ block-wise on-the-fly, compute $W_0 x$, add $BAx$. Backward pass: gradients flow only into $B, A$ (the quantized base is non-differentiable).
[!note] Why QLoRA matters QLoRA reduced the memory cost of fine-tuning a 65B model from ~780 GB to <48 GB, enabling single-A100 fine-tuning of LLaMA-65B. This single paper made LoRA the universal default for open-source LLM adaptation.
NF4 is theoretically the information-optimal 4-bit datatype for normally distributed values, based on quantile quantization — and pre-trained LLM weight matrices are empirically close to zero-centered normal.
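Loading a 4-bit NF4 base for QLoRA is a few lines with `transformers` + `bitsandbytes`; a sketch (model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 datatype
    bnb_4bit_use_double_quant=True,      # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # placeholder model id
    quantization_config=bnb,
)
# then attach a standard LoraConfig via get_peft_model(), as in the earlier sketch
```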
DoRA — Weight-Decomposed LoRA
Liu et al. 2024. Decomposes $W$ into magnitude and direction:
$$ W = m \cdot \frac{V}{|V|_c}, \quad m \in \mathbb{R}^{1 \times k},, V \in \mathbb{R}^{d \times k} $$where $|\cdot|_c$ is column-wise L2 norm. Then $V$ is LoRA-decomposed ($V_0 + BA$) and the magnitude $m$ is learned directly. Empirically narrows the gap to full FT, especially at low rank. Interpretation: full FT adjusts both magnitude and direction; vanilla LoRA only adjusts a low-rank combination of both, which is sub-optimal when the task primarily requires magnitude recalibration.
AdaLoRA
Zhang et al. 2023. Instead of fixing rank $r$ uniformly, parameterize $\Delta W = P \Lambda Q$ (SVD form) and prune $\Lambda$’s singular values during training via an importance score. Allocates more rank to layers/heads that need it. The bookkeeping is non-trivial and gains are modest; less popular in production than DoRA.
LoRA+
Hayou et al. 2024. Theory-grounded: in the infinite-width limit, $A$ and $B$ require different learning rates to stay in the feature-learning regime. Empirically, $\lambda_B / \lambda_A \in [4, 16]$ improves convergence speed and final quality, especially for harder tasks. Drop-in compatible with any LoRA codebase.
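Wiring this up is just optimizer parameter groups. A sketch assuming PEFT-style parameter names containing `lora_A`/`lora_B`:

```python
import torch

def loraplus_optimizer(model, lr: float = 2e-4, ratio: float = 16.0):
    """Separate LRs for the A and B factors, per LoRA+."""
    groups = [
        {"params": [p for n, p in model.named_parameters()
                    if p.requires_grad and "lora_A" in n], "lr": lr},
        {"params": [p for n, p in model.named_parameters()
                    if p.requires_grad and "lora_B" in n], "lr": lr * ratio},
    ]
    return torch.optim.AdamW(groups)
```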
VeRA — Vector-based Random Matrix Adaptation
Kopiczko et al. 2024. Freezes $A, B$ as random projections shared across layers and only learns per-layer scaling vectors $b, d$:
$$ \Delta W_\ell = \operatorname{diag}(b_\ell)\, B\, \operatorname{diag}(d_\ell)\, A $$
with $A, B$ random and identical across layers. Reduces trainable params by another order of magnitude vs. LoRA, with surprisingly small quality loss. Echoes the Lottery Ticket Hypothesis family — most of the work can be done by selecting/scaling random features.
LoHa, LoKr
LyCORIS family (originally from diffusion fine-tuning). Replace the $BA$ outer product with Hadamard (element-wise) or Kronecker products of two low-rank factorizations. Higher effective rank at the same param count. More common in image generation (Stable Diffusion LoRAs) than in LLM land.
Comparison Table — PEFT Methods
| Method | Trainable params | Inference latency | Mergeable | Modifies architecture |
|---|---|---|---|---|
| Full FT | $\vert\Theta\vert$ (100%) | 0 | n/a | No |
| Adapter (Houlsby) | ~3% | ↑ | No | Yes (inserts modules) |
| Prefix tuning | ~0.1% | ↑ (longer seq) | No | No (input-side) |
| Prompt tuning | ~0.01% | ↑ slight | No | No |
| BitFit | ~0.05% (biases only) | 0 | Trivially | No |
| LoRA | ~0.1%–1% | 0 | Yes | No |
| QLoRA | ~0.1%–1% | 0 | Yes (after dequant) | No (base is quantized) |
| DoRA | ~LoRA + small | 0 | Yes | No |
| VeRA | ~0.01% | 0 | Yes | No |
[!note] The “merge” property dominates in production Most companies serving fine-tuned LLMs to many tenants use a base model + per-tenant LoRA swap: keep $W_0$ shared, hot-swap $\{B, A\}$ pairs per request via S-LoRA or similar serving stacks. This is far less practical for adapters or prefix tuning.
Limitations & Recent Critiques
LoRA Learns Less and Forgets Less
Biderman et al. 2024. Controlled study on code and math fine-tuning. Headline findings:
- LoRA underperforms full FT on out-of-distribution generalization for hard domains (e.g., code generation, math).
- LoRA preserves the base model’s behavior on tasks outside the adaptation domain better than full FT — i.e., less catastrophic forgetting.
- Both effects come from the same source: the low-rank constraint limits how far the model can move from $W_0$.
[!question] When should you NOT use LoRA? When the target distribution is very far from pre-training (large domain shift, new capabilities like new programming languages or specialized math). For instruction tuning on data close to pre-training distribution, LoRA matches full FT. For capability acquisition, full FT or large-rank LoRA may be required.
Effective Rank ≠ Nominal Rank
Empirical analyses show the effective rank of LoRA-learned $\Delta W$ (top-$k$ singular values containing 95% of energy) is often much smaller than nominal $r$. So setting $r=64$ vs $r=8$ may not give you a meaningfully more expressive update — the optimizer simply doesn’t use the additional capacity. Implication: simply cranking $r$ up isn’t a free lunch.
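Measuring this is cheap. A sketch of the effective-rank computation under the 95%-energy definition above:

```python
import torch

def effective_rank(B: torch.Tensor, A: torch.Tensor, energy: float = 0.95) -> int:
    """Smallest k such that the top-k singular values of BA capture
    `energy` of the total squared singular-value mass."""
    s2 = torch.linalg.svdvals(B @ A) ** 2
    cum = torch.cumsum(s2, dim=0) / s2.sum()
    return int((cum < energy).sum().item()) + 1
```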
Catastrophic Interference in Multi-Task LoRA
Composing multiple LoRAs (e.g., averaging two task adapters’ $\Delta W$) often produces a model that is worse at both tasks than either individually. The space of low-rank updates is not closed under addition in a task-preserving sense. See LoRA Hub (Huang et al.), Mixture-of-LoRAs, and the broader Model Merging literature (Task Arithmetic, TIES Merging, DARE).
Connection to Adjacent Topics
vs. Sparse Fine-Tuning
Sparse FT (e.g., LT-SFT, FISH Mask): learn $\Delta W$ as a sparse matrix rather than a low-rank one. Same goal (parameter-efficient), different inductive bias. Sparse FT is harder to implement efficiently on GPU because sparse matmul kernels are slower than dense ones, despite the smaller flop count. This is why low-rank (which uses dense small matmuls) dominates in practice — hardware bias, not algorithmic superiority.
vs. RLHF Fine-Tuning
RLHF (or DPO) on top of LoRA is now standard: SFT-LoRA → DPO-LoRA on a frozen base. Memory savings compound, but be aware that DPO’s reference model is the SFT-LoRA-merged model, which adds bookkeeping. The reward model itself is usually a separate model (often LoRA-tuned too).
This is directly relevant to safety fine-tuning: many RLHF and Constitutional AI pipelines train safety-relevant heads or behaviors via LoRA, which means the safety modification is a low-rank perturbation and thus potentially fragile — see the literature on LoRA jailbreaks and the Shadow Alignment paper (Yang et al. 2023) showing that 100 examples of bad LoRA training can undo safety alignment. For your line of work on safe-by-construction AI: the fact that alignment lives in a low-rank subspace that can be easily perturbed by another low-rank update is a structural alignment fragility worth taking seriously.
vs. Mechanistic Interpretability
Recent work (e.g., Sharkey et al., Bushnaq et al. on circuits) is starting to ask: what subspace does LoRA modify? If you can characterize the low-rank update’s singular directions in terms of pre-existing model features (SAE directions, attention head functions), you can predict generalization. This connects to Linear Representation Hypothesis and the broader project of treating fine-tuning as a tractable, low-dimensional intervention rather than an opaque parameter shift.
Practical Recipe (Default Settings, May 2026)
[!tip] LoRA Defaults That Just Work
- Base: 4-bit NF4 quantization (QLoRA).
- Targets: all linear layers (`target_modules="all-linear"`).
- Rank: $r = 16$ for instruction tuning, $r = 64$–$128$ for capability acquisition.
- Alpha: $\alpha = 2r$.
- Dropout: $0.05$.
- LR: $2 \times 10^{-4}$ for $A$, $3 \times 10^{-3}$ for $B$ (LoRA+ ratio ≈ 16).
- Schedule: cosine with warmup over ~3% of steps.
- Optimizer: paged AdamW 8-bit if memory-constrained, else AdamW.
- Precision: bf16 mixed-precision for LoRA weights and activations.
- Variant: vanilla LoRA → DoRA if the gap to full FT matters.
Final Synthesis
LoRA’s enduring impact comes from a clean factorization of two concerns: (1) the inductive-bias claim that fine-tuning lives in a low-rank subspace, which is theoretically grounded in intrinsic-dimension work, and (2) the engineering claim that low-rank + mergeable at inference are the two properties that matter for production. Variants (QLoRA, DoRA, VeRA, LoRA+) refine each axis without abandoning the framework. The most interesting open questions are not “can we make $r$ smaller” but “what does the low-rank update actually represent, and what behaviors are categorically out of reach for any low-rank intervention?” — and the latter is where the alignment/safety community is now circling.
Memory Cheatsheet: Training vs Inference (with LoRA)
Quick mental model: in inference you only pay for weights + activations + KV cache. In training you also pay for gradients, optimizer states, and activations have to be kept around for backward. That last bit is the killer.
The base unit: bytes per parameter
| Precision | Bytes/param |
|---|---|
| fp32 | 4 |
| fp16 / bf16 | 2 |
| int8 | 1 |
| int4 / NF4 | 0.5 |
So a 7B model in bf16 weights = $7 \times 10^9 \times 2 = 14$ GB. Memorize this — everything else is multipliers on top.
Inference memory
Roughly three buckets:
Total ≈ Weights + KV Cache + Activations(small)
Weights: params × bytes_per_param. 7B bf16 → 14 GB. 70B bf16 → 140 GB.
KV cache: $2 \times L \times n_{\text{heads}} \times d_{\text{head}} \times \text{seq\_len} \times \text{batch} \times \text{bytes}$. The 2 is for K and V. For LLaMA-2-7B (L=32, 32 heads, d_head=128, bf16) at seq_len=4096, batch=1: $2 \times 32 \times 32 \times 128 \times 4096 \times 1 \times 2 \approx 2.1$ GB. Doubles every time you double context.
Quick trick: for a 7B-class model, KV ≈ 0.5 MB per token per batch in bf16. So 8k context, batch 4 = ~16 GB just in KV. This is why GQA / MQA exist.
Activations: tiny at inference (you discard them layer by layer). Ignore unless you’re being pedantic.
Training memory — the 4× rule
Rule of thumb for full fine-tuning with AdamW in mixed precision:
Total ≈ Weights + Gradients + Optimizer + Activations
≈ 2P + 2P + (8P + 4P fp32 master) + activations
≈ ~16 bytes/param + activations
Where the 8P for Adam = two moments (m, v) in fp32 = 4+4 bytes, plus often a fp32 master copy of weights = another 4. People bookkeep this slightly differently, but ~16–20 bytes/param is the right ballpark.
$$ \text{act} \sim \text{batch} \times \text{seq\_len} \times L \times d_{\text{model}} \times \text{(some multiplier ~10–30)} $$
So a 7B model needs ~112 GB just for the static stuff — already over an A100-80GB. Add activations and you’re toast. This is why nobody full-FTs 7B on one GPU without tricks.
The multiplier depends on what gets saved for backward (attention scores hurt the most — quadratic in seq_len without FlashAttention). With gradient checkpointing, you trade ~30% compute for activations going from $O(L)$ down to $O(\sqrt{L})$ — typically cuts activation memory by 5–10×.
The LoRA cheat
LoRA changes the picture dramatically because only the adapter params get gradients + optimizer states. Let $P$ = base params, $P_{\text{LoRA}}$ = LoRA params (typically 0.1–1% of $P$).
LoRA training ≈ 2P (frozen weights) ← bf16, no grads
+ 16 × P_LoRA ← grads + Adam on tiny matrices
+ activations ← still depend on full forward!
Concrete: 7B in bf16, LoRA with r=16 on all linears, ~40M trainable params:
- Frozen weights: 14 GB
- LoRA grads + optimizer: $16 \times 40\text{M} \approx 0.6$ GB
- Activations: maybe 5–15 GB depending on batch/seq
Total ~20–30 GB. Fits an A6000 or even a 3090 with checkpointing.
QLoRA — the further cheat
Quantize the frozen base to 4-bit NF4:
- Frozen weights: $7\text{B} \times 0.5 = 3.5$ GB (was 14)
- LoRA grads + optimizer: ~0.6 GB
- Activations: same
Total ~10–15 GB. This is why a 65B model fits on one A100-80GB — it goes from ~130 GB frozen to ~32.5 GB frozen.
Quick estimation workflow
When someone asks “will this fit?”:
- Weights: params × bytes (2 for bf16, 0.5 for 4-bit).
- Mode:
- Inference → add KV cache (≈ 0.5 MB/token for 7B-class, scales with model size).
- Full FT → multiply weights cost by ~8 (so bf16 weights × 8 ≈ training static cost).
- LoRA → just weights cost + ~1 GB for adapters.
- Activations: rough guess `batch × seq × 0.5 MB` for 7B-class without checkpointing; divide by ~5 with checkpointing.
- Add 10–20% slop for fragmentation, CUDA workspace, etc. (A sketch implementing this whole workflow follows the list.)
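The same workflow as a back-of-envelope function; the constants are the 7B-class heuristics from this cheatsheet, so treat the output as a sanity check rather than a capacity plan:

```python
def estimate_gb(params_b: float, mode: str, seq: int = 2048, batch: int = 1,
                weight_bits: int = 16, checkpointing: bool = False) -> float:
    """Rough GPU memory in GB. params_b = parameter count in billions."""
    weights = params_b * weight_bits / 8                 # 1B params = 1 GB per byte
    kv = 0.5e-3 * seq * batch * (params_b / 7)           # ~0.5 MB/token, 7B-class
    act = 0.5e-3 * seq * batch / (5 if checkpointing else 1)
    if mode == "inference":
        total = weights + kv
    elif mode == "full_ft":
        total = params_b * 16 + act                      # ~16 bytes/param static
    elif mode == "lora":                                 # for QLoRA: weight_bits=4
        total = weights + 1.0 + act                      # ~1 GB adapter grads + Adam
    else:
        raise ValueError(f"unknown mode: {mode}")
    return total * 1.15                                  # 10-20% slop

print(estimate_gb(7, "lora", batch=4))                   # ~22 GB (table: ~25)
print(estimate_gb(7, "lora", batch=4, weight_bits=4))    # ~10 GB (table: ~12)
```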
Sanity check examples
| Setup | Memory |
|---|---|
| 7B inference, bf16, seq=2k, bs=1 | ~16 GB |
| 7B inference, bf16, seq=32k, bs=1 | ~30 GB (KV dominates) |
| 7B full FT, bf16, bs=4, seq=2k | ~110 GB + activations → multi-GPU |
| 7B LoRA, bf16, bs=4, seq=2k | ~25 GB → single 3090/A6000 |
| 7B QLoRA, bs=4, seq=2k | ~12 GB → single 4090 |
| 70B QLoRA, bs=1, seq=2k | ~45 GB → single A100-80GB |
The mental shortcut I actually use: bf16 inference ≈ 2× params in GB. Full FT ≈ 16× params in GB. LoRA training ≈ inference cost + a bit. QLoRA ≈ inference cost / 4.