The perceptron is a fundamental binary linear classifier introduced by Rosenblatt (1958) [1]. It maps an input vector $\mathbf{x} \in \mathbb{R}^n$ to an output $y \in \{0,1\}$ using a weighted sum followed by a threshold function.

Mathematical Definition

Given an input vector $\mathbf{x} = (x_1, x_2, \dots, x_n)$ and a weight vector $\mathbf{w} = (w_1, w_2, \dots, w_n)$, the perceptron computes:

$$ z = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b $$

where $b$ is the bias term. The output is determined by the Heaviside step function:

$$ y = f(z) = \begin{cases} 1, & \text{if } z \geq 0 \\ 0, & \text{otherwise} \end{cases} $$
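
A minimal NumPy sketch of this forward pass (the function name and the numeric values below are illustrative, not taken from any particular library):

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Perceptron forward pass: weighted sum followed by the Heaviside step."""
    z = np.dot(w, x) + b          # z = w^T x + b
    return 1 if z >= 0 else 0     # Heaviside step: 1 if z >= 0, else 0

# Hypothetical 2-D example with hand-picked weights.
x = np.array([0.5, -1.0])
w = np.array([2.0, 1.0])
b = -0.1
print(perceptron_predict(x, w, b))   # prints 0, since z = 1.0 - 1.0 - 0.1 = -0.1 < 0
```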

Learning Rule

Given a labeled dataset $\{ (\mathbf{x}^{(i)}, y^{(i)}) \}_{i=1}^{m}$, the perceptron uses the following weight update rule for misclassified samples ($y^{(i)} \neq f(\mathbf{w}^\top \mathbf{x}^{(i)} + b)$):

$$ \mathbf{w} \leftarrow \mathbf{w} + \eta (y^{(i)} - f(z^{(i)})) \mathbf{x}^{(i)} $$

$$ b \leftarrow b + \eta (y^{(i)} - f(z^{(i)})) $$

where $\eta > 0$ is the learning rate. Updates continue, cycling through the data, until no training example is misclassified.
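
A sketch of this training loop under the same $\{0, 1\}$ label convention (the function and parameter names are hypothetical; the `max_epochs` cap is added because termination is only guaranteed for linearly separable data, as discussed below):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=1000):
    """Perceptron learning rule: update w and b on each misclassified sample.

    X: (m, n) array of inputs; y: (m,) array of labels in {0, 1}.
    """
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            pred = 1 if np.dot(w, x_i) + b >= 0 else 0
            if pred != y_i:
                # w <- w + eta * (y - f(z)) * x,  b <- b + eta * (y - f(z))
                w += eta * (y_i - pred) * x_i
                b += eta * (y_i - pred)
                errors += 1
        if errors == 0:     # stop once a full pass makes no mistakes
            break
    return w, b
```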

Key Properties

  • Linear separability: If the training data is linearly separable, the learning algorithm terminates after a finite number of updates (perceptron convergence theorem, below); if it is not, the updates never stop.
  • Limitations: It cannot solve problems like XOR, because a single linear threshold unit cannot represent functions that are not linearly separable (see the sketch after this list).
  • Extension: The multi-layer perceptron (MLP) overcomes this limitation using hidden layers and nonlinear activation functions.
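
To illustrate the XOR limitation concretely, the snippet below reuses the hypothetical `train_perceptron` sketch from above on the four XOR points. Since no hyperplane separates the two classes, no returned $(\mathbf{w}, b)$ can classify all four points correctly:

```python
import numpy as np

# XOR truth table: no single hyperplane separates the class-1 points from the class-0 points.
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([0, 1, 1, 0])

w, b = train_perceptron(X_xor, y_xor, max_epochs=100)
preds = [1 if np.dot(w, x) + b >= 0 else 0 for x in X_xor]
print(preds, y_xor.tolist())   # at least one prediction always disagrees with the labels
```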

Perceptron Convergence Theorem

The perceptron learning algorithm converges in a finite number of updates if the training data is linearly separable.

Setup and Notation

  • Let the training set be $\{ (\mathbf{x}^{(i)}, y^{(i)}) \}_{i=1}^{m}$, where $\mathbf{x}^{(i)} \in \mathbb{R}^n$ and $y^{(i)} \in \{-1, +1\}$.

  • The perceptron updates its weight vector $\mathbf{w}$ as follows for each misclassified point $(\mathbf{x}^{(i)}, y^{(i)})$:

    $$ \mathbf{w} \leftarrow \mathbf{w} + y^{(i)} \mathbf{x}^{(i)} $$

    where we assume the bias is absorbed into $\mathbf{x}$ by appending an extra dimension with $x_0 = 1$.

We assume a learning rate of $1$; the same argument goes through for any positive learning rate.

  • Assumption (Linear Separability): There exists a weight vector $\mathbf{w}^*$ and a margin $\gamma > 0$ such that for all training points,

    $$ y^{(i)} (\mathbf{w}^* \cdot \mathbf{x}^{(i)}) \geq \gamma $$

    where $\|\mathbf{w}^*\| = 1$.
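
Under these conventions (labels in $\{-1, +1\}$, unit learning rate, bias absorbed via an augmented coordinate), the mistake-driven update can be sketched as follows; the function name and the pass cap are my own additions for illustration:

```python
import numpy as np

def perceptron_mistake_driven(X, y, max_passes=1000):
    """Mistake-driven perceptron for labels y in {-1, +1} with unit learning rate.

    The bias is absorbed by adding a coordinate x_0 = 1 to every input. Returns
    the final weight vector and the total number of updates (mistakes) made.
    """
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # add the x_0 = 1 coordinate
    w = np.zeros(X_aug.shape[1])
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for x_i, y_i in zip(X_aug, y):
            if y_i * np.dot(w, x_i) <= 0:       # misclassified (or exactly on the boundary)
                w += y_i * x_i                   # w <- w + y * x
                mistakes += 1
                clean_pass = False
        if clean_pass:                           # a full pass with no mistakes: converged
            break
    return w, mistakes
```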

Proof

We first bound the growth of $\mathbf{w} \cdot \mathbf{w}^*$. Define $\mathbf{w}_t$ as the weight vector after $t$ updates, with $\mathbf{w}_0 = \mathbf{0}$. Each update modifies $\mathbf{w}$ as:

$$ \mathbf{w}_{t+1} = \mathbf{w}_t + y^{(i)} \mathbf{x}^{(i)} $$

Taking the dot product with $\mathbf{w}^*$,

$$ \begin{align} \mathbf{w}_{t+1} \cdot \mathbf{w}^* &= (\mathbf{w}_t + y^{(i)} \mathbf{x}^{(i)}) \cdot \mathbf{w}^* \\ &= \mathbf{w}_t \cdot \mathbf{w}^* + y^{(i)} (\mathbf{x}^{(i)} \cdot \mathbf{w}^*) \end{align} $$

Since $y^{(i)} (\mathbf{x}^{(i)} \cdot \mathbf{w}^*) \geq \gamma$, summing over all updates gives

$$ \mathbf{w}_T \cdot \mathbf{w}^* \geq T \gamma $$

where $T$ is the number of updates made so far. We then bound $\|\mathbf{w}_t\|^2$: the squared norm of $\mathbf{w}$ evolves as:

$$ \begin{align} \|\mathbf{w}_{t+1}\|^2 &= \|\mathbf{w}_t + y^{(i)} \mathbf{x}^{(i)}\|^2 \\ &= \|\mathbf{w}_t\|^2 + 2 y^{(i)} (\mathbf{w}_t \cdot \mathbf{x}^{(i)}) + \|\mathbf{x}^{(i)}\|^2 \end{align} $$

Since the update happens only for misclassified points, $y^{(i)} (\mathbf{w}_t \cdot \mathbf{x}^{(i)}) \leq 0$, so dropping this term gives:

$$ \|\mathbf{w}_{t+1}\|^2 \leq \|\mathbf{w}_t\|^2 + R^2 $$

where $R = \max_i \|\mathbf{x}^{(i)}\|$. Iterating over $T$ updates,

$$ \|\mathbf{w}_T\|^2 \leq T R^2 $$

Finally, we combine the two bounds. By the Cauchy–Schwarz inequality and $\|\mathbf{w}^*\| = 1$,

$$ \mathbf{w}_T \cdot \mathbf{w}^* \leq \lVert \mathbf{w}_T\rVert \lVert \mathbf{w}^*\rVert \implies T \gamma \leq \lVert \mathbf{w}_T \rVert \leq \sqrt{T R^2} $$

Dividing both sides of $T \gamma \leq \sqrt{T} R$ by $\sqrt{T}$,

$$ \sqrt{T} \gamma \leq R \implies T \leq \frac{R^2}{\gamma^2} $$

Since $R$ and $\gamma$ are constants, the perceptron makes at most $\frac{R^2}{\gamma^2}$ updates, which proves convergence after $\mathcal{O}\!\left(\frac{R^2}{\gamma^2}\right)$ mistakes.
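
As a sanity check of the bound, the sketch below builds a small separable toy set (all values hypothetical), computes $R$ and the margin $\gamma$ achieved by a chosen unit-norm $\mathbf{w}^*$, and compares the observed number of updates from the `perceptron_mistake_driven` sketch above against $R^2 / \gamma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical separable toy set: label points by the sign of a fixed unit-norm w_star.
w_star = np.array([0.3, 0.6, -0.8])            # first coordinate plays the role of the bias
w_star /= np.linalg.norm(w_star)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # add the x_0 = 1 coordinate
margins = X_aug @ w_star
keep = np.abs(margins) > 0.1                   # discard points too close to the boundary
X, X_aug, y = X[keep], X_aug[keep], np.sign(margins[keep])

R = np.max(np.linalg.norm(X_aug, axis=1))      # R = max_i ||x^(i)|| (augmented inputs)
gamma = np.min(y * (X_aug @ w_star))           # margin gamma realized by w_star

w, mistakes = perceptron_mistake_driven(X, y)
print(mistakes, "<=", (R / gamma) ** 2)        # the mistake count should respect the bound
```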

References

[1] Rosenblatt, F. (1958). “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review, 65(6), 386–408.