Generative Adversarial Networks (GANs) were introduced in 2014 by Ian Goodfellow (at that time the generated images were still grayscale). Image quality has since improved enormously with Diffusion Models. Yann LeCun described this idea as one of the most important in the field. Nowadays (2025) GANs are still used for super-resolution and other applications, but they retain some limitations (mainly training stability) and now face strong competition from other model families. The resolution achieved by GANs is much higher than that of VAEs (see Autoencoders#Variational Autoencoders). Adversarial training is also an easy plugin to improve the results of other models (VAE, flow, Diffusion). ChatGPT also uses some sort of adversarial learning, though not in the same sense as described here.

General Idea

Here we have two main networks that are jointly trained:

  • Generator: this is a neural network that takes a random vector as input and generates a fake image. The goal of the generator is to produce images that are indistinguishable from real images.
  • Discriminator: this is a neural network that takes an image as input and predicts whether it is real or fake. The goal of the discriminator is to correctly classify images as real or fake.
  • Adversarial Loss: the generator and discriminator are trained in an adversarial manner. The generator tries to fool the discriminator, while the discriminator tries to correctly classify images. This creates a game-like scenario where both networks improve over time, somewhat like the natural coevolution of a predator and a prey, each adapting to surpass the other's strategy.

We can define this more formally:

Generator: $G: \mathbb{R}^{L} \to \mathbb{R}^{Q}$ where $Q \gg L$, and the discriminator is a function $D: \mathbb{R}^{Q} \to [0, 1]$.

Training a Generative Adversarial Network.

Training Process 🟩

$$ \min_{G} \max_{D} V(G,D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))] $$

This is the original loss function, but it has some problems (like vanishing gradients). The generator tries to minimize this function, while the discriminator tries to maximize it.

Theory shows that the stable point is $p_{\text{model}} = p_{\text{data}}$, but only if $G$ and $D$ have infinite capacity (they can match any data distribution, which is a quite strong assumption). In practice, optimizing both jointly is computationally expensive. Usually you do $k$ iterations for $D$ per iteration of $G$, because we want the discriminator to provide an informative signal to $G$. This idea is quite similar to the Actor-Critic model, see RL Function Approximation.

Problems with Likelihood

Given a generative model, we ask here: is the likelihood assigned to a sample a good indicator of sample quality?

  • Sometimes we have poor samples yet high likelihood: e.g. a model that outputs noise 99% of the time still satisfies $\log(0.01\,p(x)) = \log p(x) - \log 100$; since $\log p(x)$ grows proportionally to the dimension $d$ (for independent dimensions), the constant $\log 100$ penalty becomes negligible in high dimensions, so the likelihood stays high despite poor sample quality (this is the example from Eq. 11 of https://arxiv.org/pdf/1511.01844).
  • Sometimes we have low likelihood yet good samples: a simple example is test-data evaluation after over-fitting: the model gives low likelihood to good samples simply because they were not seen during training. This means that likelihood alone is not a good metric to compare models.
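A quick numerical illustration of the first point (assuming, for concreteness, i.i.d. standard-normal dimensions): the $\log 100$ penalty from mixing in 99% noise is a constant, while $\log p(x)$ grows linearly with the dimension $d$, so in high dimensions the penalty becomes negligible.

```python
import numpy as np

# Log-density of the all-zeros sample under a d-dimensional standard normal:
# log p(0) = -(d/2) * log(2*pi), which scales linearly with d.
def log_p(d):
    return -0.5 * d * np.log(2 * np.pi)

# Mixing in 99% noise only subtracts a constant:
# log(0.01 * p(x)) = log p(x) - log 100.
penalty = np.log(100)
for d in (1, 100, 10_000):
    print(d, log_p(d), penalty / abs(log_p(d)))  # relative penalty shrinks with d
```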

GAN Training Algorithm 🟩

While not converged do:

  1. Draw $N$ real samples $\{x^{(1)}, ..., x^{(N)}\}$ from the data and $N$ noise samples $\{z^{(1)}, ..., z^{(N)}\}$ from $p(z)$. Update the discriminator by ascending its stochastic gradient:

  $$\nabla_{\Theta_D} \frac{1}{N} \sum_{i=1}^{N} [\log(D(x^{(i)})) + \log(1 - D(G(z^{(i)})))]$$

    • $\Theta_D$ represents the parameters of the discriminator.
    • $D(x)$ is the discriminator’s output for a real sample $x$ (the probability that $x$ is real).
    • $G(z)$ is the generator’s output for a noise sample $z$ (a generated fake sample).
    • The goal is to maximize this objective, making the discriminator better at distinguishing real from fake samples.

  2. Freeze the discriminator $D$. Draw $N$ fresh noise samples $\{z^{(1)}, ..., z^{(N)}\}$ from $p(z)$.

  3. Update the generator by descending its stochastic gradient:

  $$\nabla_{\Theta_G} \frac{1}{N} \sum_{i=1}^{N} [\log(1 - D(G(z^{(i)})))]$$

    • $\Theta_G$ represents the parameters of the generator.
    • The goal is to minimize this objective, making the generator better at fooling the discriminator into thinking its generated samples are real.

Explanation:

  • The discriminator tries to learn to distinguish between real data and fake data generated by the generator.
  • The generator tries to learn to produce fake data that is indistinguishable from real data, thus fooling the discriminator.

The process typically involves alternating between training the discriminator for $k$ steps and then training the generator for one step (although other ratios are possible). This ensures that the discriminator doesn’t become too strong too quickly, preventing the generator from learning useful gradients. The training continues until a convergence criterion is met, ideally when the generator produces samples that the discriminator can no longer reliably classify as fake.
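The alternating scheme above can be sketched end-to-end on a toy 1-D problem. This is a minimal sketch, not the original setup: the "data" is $\mathcal{N}(3, 1)$, the generator is a hypothetical affine map $G(z) = wz + b$, the discriminator is a single logistic unit $D(x) = \sigma(ux + c)$, the generator uses the non-saturating update, and all gradients are written out by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

# Data distribution p_data = N(3, 1); noise prior p(z) = N(0, 1).
# Generator G(z) = w*z + b, discriminator D(x) = sigmoid(u*x + c).
w, b = 1.0, 0.0          # generator parameters (Theta_G)
u, c = 0.0, 0.0          # discriminator parameters (Theta_D)
lr, N, k = 0.05, 128, 1  # k discriminator steps per generator step

for step in range(2000):
    for _ in range(k):   # 1. ascend the discriminator objective
        xr = rng.normal(3.0, 1.0, N)
        z = rng.normal(0.0, 1.0, N)
        xf = w * z + b
        dr, df = sigmoid(u * xr + c), sigmoid(u * xf + c)
        u += lr * (np.mean((1 - dr) * xr) - np.mean(df * xf))
        c += lr * (np.mean(1 - dr) - np.mean(df))
    # 2-3. freeze D, update G with the non-saturating loss: ascend log D(G(z));
    # d/ds log sigmoid(s) = 1 - sigmoid(s), chain rule through s = u*G(z) + c.
    z = rng.normal(0.0, 1.0, N)
    xf = w * z + b
    g = (1 - sigmoid(u * xf + c)) * u
    w += lr * np.mean(g * z)
    b += lr * np.mean(g)

fake_mean = np.mean(w * rng.normal(0.0, 1.0, 10_000) + b)
print(f"generator mean after training: {fake_mean:.2f} (target 3.0)")
```

Under these assumptions the generator's output mean drifts toward the data mean of 3; a linear discriminator cannot see higher moments, which hints at why richer discriminators are needed in practice.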

Optimal behaviour requirements

For the theory below to hold, $D$ must be optimal for the current generator; maximizing $V(G, D)$ pointwise in $x$ gives $D^*(x) = \frac{p_d(x)}{p_d(x) + p_m(x)}$.
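The closed form of the optimal discriminator follows from a pointwise maximization: for fixed densities $a = p_d(x)$ and $b = p_m(x)$, the integrand $a \log D + b \log(1 - D)$ is maximized at $D = a/(a+b)$. A quick grid-search check (the two density values below are arbitrary examples):

```python
import numpy as np

# Pointwise, the inner max of V(G, D) maximizes
#   f(D) = p_d(x) * log D + p_m(x) * log(1 - D)
# whose closed-form maximizer is D*(x) = p_d(x) / (p_d(x) + p_m(x)).
pd, pm = 0.7, 0.2                       # example density values at a fixed x
D = np.linspace(1e-6, 1 - 1e-6, 1_000_000)
f = pd * np.log(D) + pm * np.log(1 - D)
D_num = D[np.argmax(f)]                 # numerical maximizer
D_star = pd / (pd + pm)                 # closed form
print(D_num, D_star)
```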

GAN optimizes for Jensen-Shannon Divergence 🟩

$$ D_{JS}(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)$$

where $M = \frac{1}{2}(P + Q)$. One can prove that the min max game above is the same as optimizing for the JS divergence.

The value function of the optimal discriminator $D^*$ in a Generative Adversarial Network (GAN) is given by:

$$V(G, D^*) = \mathbb{E}_{x \sim p_d} \left[ \log \left( \frac{p_d(x)}{p_d(x) + p_m(x)} \right) \right] + \mathbb{E}_{x \sim p_m} \left[ \log \left( \frac{p_m(x)}{p_d(x) + p_m(x)} \right) \right]$$

Rewriting this expression shows that minimizing it over $G$ is equivalent to minimizing the JS divergence:

$$ \begin{align*} V(G, D^*) &= \mathbb{E}_{x \sim p_d} \left[ \log \left( \frac{p_d(x)}{(p_d(x) + p_m(x))/2} \cdot \frac{1}{2} \right) \right] + \mathbb{E}_{x \sim p_m} \left[ \log \left( \frac{p_m(x)}{(p_d(x) + p_m(x))/2} \cdot \frac{1}{2} \right) \right] \\ &= \mathbb{E}_{x \sim p_d} \left[ \log \left( \frac{2 p_d(x)}{p_d(x) + p_m(x)} \right) - \log(2) \right] + \mathbb{E}_{x \sim p_m} \left[ \log \left( \frac{2 p_m(x)}{p_d(x) + p_m(x)} \right) - \log(2) \right] \\ &= -\log(2) + \mathbb{E}_{x \sim p_d} \left[ \log \left( \frac{2 p_d(x)}{p_d(x) + p_m(x)} \right) \right] - \log(2) + \mathbb{E}_{x \sim p_m} \left[ \log \left( \frac{2 p_m(x)}{p_d(x) + p_m(x)} \right) \right] \\ &= -2\log(2) + \int_x p_d(x) \log \left( \frac{2 p_d(x)}{p_d(x) + p_m(x)} \right) dx + \int_x p_m(x) \log \left( \frac{2 p_m(x)}{p_d(x) + p_m(x)} \right) dx \\ &= -2\log(2) + \int_x p_d(x) \log \left( \frac{p_d(x)}{(p_d(x) + p_m(x))/2} \right) dx + \int_x p_m(x) \log \left( \frac{p_m(x)}{(p_d(x) + p_m(x))/2} \right) dx \\ &= -2\log(2) + D_{KL} \left( p_d(x) \middle\| \frac{p_d(x) + p_m(x)}{2} \right) + D_{KL} \left( p_m(x) \middle\| \frac{p_d(x) + p_m(x)}{2} \right) \\ &= -2\log(2) + 2 D_{JS}(p_d(x) \| p_m(x)) \end{align*} $$

Where:

  • $p_d(x)$ is the true data distribution.
  • $p_m(x)$ is the distribution of the generated samples (implicitly defined by the generator $G$).
  • $D^*$ is the optimal discriminator.
  • $\mathbb{E}_{x \sim p}$ denotes the expectation over the distribution $p$.
  • $D_{KL}(p \| q)$ is the Kullback-Leibler divergence between distributions $p$ and $q$.
  • $D_{JS}(p \| q)$ is the Jensen-Shannon divergence between distributions $p$ and $q$.

Key Takeaway:

The maximum value of the GAN’s discriminator loss is related to the Jensen-Shannon divergence between the real data distribution and the distribution of the generated samples. Minimizing the GAN loss (for the generator) corresponds to minimizing the Jensen-Shannon divergence between these two distributions, ideally leading $p_m(x)$ to become equal to $p_d(x)$.
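The identity derived above, $V(G, D^*) = -2\log 2 + 2\,D_{JS}(p_d \| p_m)$, can be verified numerically for discrete distributions (the two example distributions below are arbitrary):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two example discrete distributions standing in for p_d and p_m.
pd = np.array([0.5, 0.3, 0.2])
pm = np.array([0.2, 0.2, 0.6])

# V(G, D*) computed directly from its definition ...
V = np.sum(pd * np.log(pd / (pd + pm))) + np.sum(pm * np.log(pm / (pd + pm)))
# ... matches -2 log 2 + 2 * D_JS(pd || pm):
print(V, -2 * np.log(2) + 2 * js(pd, pm))
```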

Training Issues 🟩

  • Vanishing Gradients (the same problem we have seen in Recurrent Neural Networks).
    • Mitigated by using the non-saturating loss.
  • Mode Collapse: the generator produces a limited variety of samples, leading to a lack of diversity in the generated images. This can happen when the generator finds a small set of images that fool the discriminator but doesn’t explore other possibilities.
    • Unrolled GANs: the generator update looks ahead through several (unrolled) discriminator update steps, so it anticipates how the discriminator will react.
    • When the discriminator is too good, the signal for the generator is very weak, leading to slow convergence.
      • One idea to circumvent this is to smooth the discriminator’s decision boundary so that the signal to the generator is stronger (see Mao et al. 2016 or Sønderby et al. 2016).
      • Another is the non-saturating loss: the generator maximizes $\log D(G(z))$ instead of minimizing $\log(1 - D(G(z)))$, which changes the error signal by a lot!
  • Training Instability: GANs can be sensitive to hyperparameters, and small changes in the learning rate or architecture can lead to large changes in performance. This can make training GANs difficult and unpredictable.
    • Gradient penalty: adding a penalty on the gradient of the discriminator’s output with respect to its input, which helps stabilize training. This is used in Wasserstein GANs (WGAN-GP), since it is inspired by that distance.
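The difference between the saturating and non-saturating generator loss is easy to see numerically. Writing $D = \sigma(s)$ for the discriminator's logit $s$ on a fake sample: when the discriminator confidently rejects fakes ($s \ll 0$), the gradient of $\log(1 - \sigma(s))$ vanishes, while the non-saturating loss $-\log \sigma(s)$ keeps a gradient near $-1$:

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

# s is the discriminator's logit on fake samples; early in training the
# discriminator confidently rejects fakes, so s is very negative.
s = np.array([-8.0, -4.0, 0.0])

grad_saturating = -sigmoid(s)           # d/ds log(1 - sigmoid(s))
grad_non_saturating = sigmoid(s) - 1.0  # d/ds of -log(sigmoid(s))
print(grad_saturating)       # vanishes as s -> -inf
print(grad_non_saturating)   # stays close to -1
```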

GAN applications

StyleGAN 🟨–

(Karras et al. 2019). The model is trained progressively: a stacked model grows from low resolution to high resolution, with generator and discriminator grown jointly, and the original dataset is downsampled to train at each intermediate resolution.

$$ c' = \gamma \, \frac{c - \mu(c)}{\sigma(c)} + \beta $$

This is called AdaIN (adaptive instance normalization with feature modulation): each feature map $c$ is normalized with per-instance statistics $\mu(c), \sigma(c)$ and then modulated by the style-dependent scale $\gamma$ and shift $\beta$. It is a different kind of normalization, close to batch normalization but with statistics computed per sample.
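A minimal sketch of AdaIN on a single (channels, height, width) feature map, assuming per-channel style parameters (the `adain` helper and the shapes are illustrative, not StyleGAN's actual implementation):

```python
import numpy as np

def adain(c, gamma, beta, eps=1e-5):
    """Adaptive instance normalization on a (C, H, W) feature map:
    normalize each channel over its spatial dimensions, then apply a
    style-dependent scale gamma and shift beta per channel."""
    mu = c.mean(axis=(1, 2), keepdims=True)
    sigma = c.std(axis=(1, 2), keepdims=True)
    return gamma[:, None, None] * (c - mu) / (sigma + eps) + beta[:, None, None]

rng = np.random.default_rng(0)
c = rng.normal(5.0, 2.0, (8, 16, 16))             # feature maps
gamma, beta = rng.normal(size=8), rng.normal(size=8)  # style parameters
out = adain(c, gamma, beta)
```

After modulation each channel has mean ≈ $\beta$ and standard deviation ≈ $|\gamma|$, which is exactly how the style vector controls the feature statistics.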


Image from StyleGAN paper

Image to Image translation 🟩--

Introduced in Pix2Pix (Isola et al. 2018). Here we want to start from one kind of image, like a segmentation map, and output the real image that could have produced it, or go the other way, from a real image to its segmentation.

We add an L1 reconstruction loss to the original adversarial loss. One drawback is that we need paired images; CycleGAN solves this problem, see (Zhu et al. 2020): an added loss guarantees that we can map back to the original image. It was then extended to BicycleGAN (one-to-many instead of one-to-one mappings), and by others to video-to-video translation and autoregressive modelling.

$$ \mathcal{L}(G, F, D_{X}, D_{Y}) = \mathcal{L}_{GAN}(G, D_{Y}, X, Y) + \mathcal{L}_{GAN}(F, D_{X}, Y, X) + \lambda \mathcal{L}_{cyc}(G, F) $$

Where we have a cycle consistent loss, and $X$ and $Y$ are start and finish images. The idea is that when a starting image is translated to an end image and back, they should be close to each other.

Regarding CycleGAN: here we have two generator-discriminator pairs, one per image domain: $G: X \to Y$ with discriminator $D_{Y}$, and $F: Y \to X$ with discriminator $D_{X}$.
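The cycle-consistency term can be sketched with two toy "generators" standing in for the networks (the linear maps below are illustrative only; in CycleGAN $G$ and $F$ are convolutional networks):

```python
import numpy as np

# Toy generators: G maps domain X to Y, F maps Y back to X.
G = lambda x: 2.0 * x + 1.0
F = lambda y: (y - 1.0) / 2.0   # here F happens to invert G exactly

def cycle_loss(x, y):
    """L_cyc(G, F) = E[|F(G(x)) - x|_1] + E[|G(F(y)) - y|_1]."""
    return np.mean(np.abs(F(G(x)) - x)) + np.mean(np.abs(G(F(y)) - y))

rng = np.random.default_rng(0)
x, y = rng.normal(size=100), rng.normal(size=100)
print(cycle_loss(x, y))   # near 0, since F inverts G
```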

References

[1] Karras et al. “A Style-Based Generator Architecture for Generative Adversarial Networks” arXiv preprint arXiv:1812.04948 2019

[2] Isola et al. “Image-to-Image Translation with Conditional Adversarial Networks” arXiv preprint arXiv:1611.07004 2018

[3] Zhu et al. “Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks” arXiv preprint arXiv:1703.10593 2020