The perceptron is a fundamental binary linear classifier introduced by Rosenblatt (1958) [1]. It maps an input vector $\mathbf{x} \in \mathbb{R}^n$ to an output $y \in \{0,1\}$ using a weighted sum followed by a threshold function.

Introduction to the Perceptron

A Mathematical Model

Given an input vector $\mathbf{x} = (x_1, x_2, \dots, x_n)$ and a weight vector $\mathbf{w} = (w_1, w_2, \dots, w_n)$, the perceptron computes:

$$ z = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b $$

$$ y = f(z) = \begin{cases} 1, & \text{if } z \geq 0 \\ 0, & \text{otherwise} \end{cases} $$
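As a concrete illustration, here is a minimal sketch of this forward pass in Python (NumPy); the function name `predict` and the variable names are illustrative, not from the original:

```python
import numpy as np

def predict(x, w, b):
    """Perceptron forward pass: weighted sum followed by a hard threshold."""
    z = np.dot(w, x) + b       # z = w^T x + b
    return 1 if z >= 0 else 0  # step function f(z)
```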

Learning Rule

Given a labeled dataset $\{ (\mathbf{x}^{(i)}, y^{(i)}) \}_{i=1}^{m}$, the perceptron uses the following weight update rule for misclassified samples ($y^{(i)} \neq f(\mathbf{w}^\top \mathbf{x}^{(i)} + b)$):

$$ \mathbf{w} \leftarrow \mathbf{w} + \eta (y^{(i)} - f(z^{(i)})) \mathbf{x}^{(i)} $$

$$ b \leftarrow b + \eta (y^{(i)} - f(z^{(i)})) $$

where $\eta > 0$ is the learning rate and $z^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b$. Updates are repeated over the training set until no sample is misclassified; if the data is linearly separable, this procedure terminates after a finite number of updates.
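A minimal sketch of this training loop, assuming 0/1 labels and NumPy arrays; the function name `train_perceptron`, the default `eta`, and the epoch cap are illustrative choices, not part of the original algorithm statement:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=1000):
    """Train a perceptron with the mistake-driven update rule above."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) + b >= 0 else 0
            if y_hat != y_i:                    # update only on misclassified samples
                w += eta * (y_i - y_hat) * x_i  # w <- w + eta (y - f(z)) x
                b += eta * (y_i - y_hat)        # b <- b + eta (y - f(z))
                errors += 1
        if errors == 0:                         # stop once a full pass has no errors
            break
    return w, b
```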

Figure: The Perceptron Model.

Observe that, whenever a sample is misclassified, the update adds the input $\mathbf{x}^{(i)}$, scaled by the learning rate and by the sign of the error $y^{(i)} - f(z^{(i)})$, to the weight vector.

Key Properties

  • Linear separability: The perceptron converges if and only if the data is linearly separable (perceptron convergence theorem).
  • Limitations: It cannot solve problems like XOR, since it can only represent linearly separable functions (see the sketch after this list).
  • Extension: The multi-layer perceptron (MLP) overcomes this limitation using hidden layers and nonlinear activation functions.
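To illustrate the XOR limitation, the sketch below reuses the hypothetical `train_perceptron` from the earlier sketch on the four XOR points; because no single hyperplane separates the two classes, the learned weights still misclassify at least one point when the epoch cap is reached:

```python
import numpy as np

# XOR: no line in the plane separates the 1-labelled points from the 0-labelled ones.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

w, b = train_perceptron(X, y, eta=1.0, max_epochs=1000)  # never reaches zero errors
preds = [1 if np.dot(w, x) + b >= 0 else 0 for x in X]
print("predictions:", preds, "targets:", y.tolist())     # at least one prediction is wrong
```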

Perceptron Convergence Theorem

The perceptron learning algorithm converges in a finite number of updates if the training data is linearly separable.

Setup and Notation

  • Let the training set be $\{ (\mathbf{x}^{(i)}, y^{(i)}) \}_{i=1}^{m}$, where $\mathbf{x}^{(i)} \in \mathbb{R}^n$ and $y^{(i)} \in \{-1, +1\}$.
  • The perceptron updates its weight vector $\mathbf{w}$ as follows for each misclassified point $(\mathbf{x}^{(i)}, y^{(i)})$: $$ \mathbf{w} \leftarrow \mathbf{w} + y^{(i)} \mathbf{x}^{(i)} $$ where we assume the bias is absorbed into $\mathbf{x}$ by appending an extra dimension with $x_0 = 1$.

We assume the learning rate is $1$; the same argument goes through for any positive learning rate (a code sketch of this update form appears at the end of this setup).

  • Assumption (Linear Separability): There exists a unit weight vector $\mathbf{w}^*$ (with $\|\mathbf{w}^*\| = 1$) and a margin $\gamma > 0$ such that $y^{(i)} (\mathbf{w}^* \cdot \mathbf{x}^{(i)}) \geq \gamma$ for all training points. This is the same margin condition used for Support Vector Machines.
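Under these conventions (labels in $\{-1, +1\}$, bias absorbed by appending $x_0 = 1$, learning rate $1$), the mistake-driven update reduces to a single line. A minimal sketch, with illustrative names and assuming the inputs are already augmented:

```python
import numpy as np

def perceptron_pm1(X_aug, y_pm1, max_epochs=1000):
    """Mistake-driven perceptron with labels in {-1, +1} and bias absorbed into X."""
    w = np.zeros(X_aug.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X_aug, y_pm1):
            if y_i * np.dot(w, x_i) <= 0:  # misclassified (or on the decision boundary)
                w += y_i * x_i             # w <- w + y x
                mistakes += 1
        if mistakes == 0:
            break
    return w
```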

Proof

We first lower-bound the alignment of $\mathbf{w}_t$ with the separator $\mathbf{w}^*$. For an update on a misclassified point $(\mathbf{x}^{(i)}, y^{(i)})$:

$$ \begin{align} \mathbf{w}_{t+1} \cdot \mathbf{w}^* &= (\mathbf{w}_t + y^{(i)} \mathbf{x}^{(i)}) \cdot \mathbf{w}^* \\ &= \mathbf{w}_t \cdot \mathbf{w}^* + y^{(i)} (\mathbf{x}^{(i)} \cdot \mathbf{w}^*) \\ &\geq \mathbf{w}_t \cdot \mathbf{w}^* + \gamma \end{align} $$

Starting from $\mathbf{w}_0 = \mathbf{0}$ and applying this over $T$ updates gives

$$ \mathbf{w}_T \cdot \mathbf{w}^* \geq T \gamma $$

where $T$ is the total number of updates made so far. Next, we upper-bound $\|\mathbf{w}_T\|^2$. Each update changes the squared norm as:

$$ \begin{align} \|\mathbf{w}_{t+1}\|^2 &= \|\mathbf{w}_t + y^{(i)} \mathbf{x}^{(i)}\|^2 \\ &= \|\mathbf{w}_t\|^2 + 2 y^{(i)} (\mathbf{w}_t \cdot \mathbf{x}^{(i)}) + \|\mathbf{x}^{(i)}\|^2 \end{align} $$

Since the update is applied only to misclassified points, $y^{(i)} (\mathbf{w}_t \cdot \mathbf{x}^{(i)}) \leq 0$, so dropping this term gives:

$$ \|\mathbf{w}_{t+1}\|^2 \leq \|\mathbf{w}_t\|^2 + R^2 $$

where $R = \max_i \|\mathbf{x}^{(i)}\|$. Iterating over $T$ updates,

$$ \|\mathbf{w}_T\|^2 \leq T R^2 $$

We now combine the two bounds. By the Cauchy–Schwarz inequality,

$$ T \gamma \leq \mathbf{w}_T \cdot \mathbf{w}^* \leq \lVert \mathbf{w}_T \rVert \, \lVert \mathbf{w}^* \rVert = \lVert \mathbf{w}_T \rVert \leq \sqrt{T R^2} = \sqrt{T}\, R $$

$$ T \gamma \leq \sqrt{T}\, R \implies \sqrt{T} \leq \frac{R}{\gamma} \implies T \leq \frac{R^2}{\gamma^2} $$

Since $R$ and $\gamma$ are constants of the dataset, the perceptron makes at most $\frac{R^2}{\gamma^2}$ updates, which proves convergence after $\mathcal{O}\!\left(\frac{R^2}{\gamma^2}\right)$ mistakes.
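As a quick sanity check of this bound, the sketch below counts perceptron mistakes on a tiny, hand-made linearly separable dataset and compares the count against $R^2/\gamma^2$ computed from a hand-chosen unit separator $\mathbf{w}^*$; the dataset and all names are illustrative:

```python
import numpy as np

# Toy separable data in augmented form (the last component is the constant bias feature 1).
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-2.0, -1.0, 1.0], [-1.0, -2.0, 1.0]])
y = np.array([1, 1, -1, -1])

# Hand-chosen unit-norm separator w*; gamma is the smallest margin, R the largest input norm.
w_star = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
gamma = np.min(y * (X @ w_star))
R = np.max(np.linalg.norm(X, axis=1))

# Run the mistake-driven perceptron (learning rate 1, w_0 = 0) and count updates.
w, mistakes = np.zeros(3), 0
for _ in range(1000):
    errors = 0
    for x_i, y_i in zip(X, y):
        if y_i * np.dot(w, x_i) <= 0:
            w += y_i * x_i
            errors += 1
    mistakes += errors
    if errors == 0:
        break

print(f"mistakes = {mistakes}, bound R^2/gamma^2 = {R**2 / gamma**2:.2f}")
```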

References

[1] Rosenblatt, F. (1958). "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain." Psychological Review, 65(6), 386–408.