The perceptron is a fundamental binary linear classifier introduced by Rosenblatt [1]. It maps an input vector $\mathbf{x} \in \mathbb{R}^n$ to an output $y \in \{0,1\}$ using a weighted sum followed by a threshold function.
The Mathematical Definition
Given an input vector $\mathbf{x} = (x_1, x_2, \dots, x_n)$ and a weight vector $\mathbf{w} = (w_1, w_2, \dots, w_n)$, the perceptron computes:
$$ z = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b $$

where $b$ is the bias term. The output is determined by the Heaviside step function:
$$ y = f(z) = \begin{cases} 1, & \text{if } z \geq 0 \\ 0, & \text{otherwise} \end{cases} $$
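For example, with weights $\mathbf{w} = (2, -1)$, bias $b = -0.5$, and input $\mathbf{x} = (1, 1)$, the perceptron computes $z = 2 \cdot 1 + (-1) \cdot 1 - 0.5 = 0.5 \geq 0$, so it outputs $y = 1$; the input $\mathbf{x} = (0, 1)$ instead gives $z = -1.5 < 0$ and $y = 0$.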
Learning Rule

Given a labeled dataset $\{ (\mathbf{x}^{(i)}, y^{(i)}) \}_{i=1}^{m}$, the perceptron uses the following weight update rule for misclassified samples ($y^{(i)} \neq f(\mathbf{w}^\top \mathbf{x}^{(i)} + b)$):
$$ \mathbf{w} \leftarrow \mathbf{w} + \eta (y^{(i)} - f(z^{(i)})) \mathbf{x}^{(i)} $$

$$ b \leftarrow b + \eta (y^{(i)} - f(z^{(i)})) $$

where $\eta > 0$ is the learning rate. Updates continue, cycling through the training set, until no sample is misclassified.
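A minimal sketch of this training loop in Python with NumPy (the helper names, the learning rate, and the toy AND dataset below are illustrative choices, not part of the original description):

```python
import numpy as np

def heaviside(z):
    # Step activation: 1 if z >= 0, else 0
    return 1 if z >= 0 else 0

def train_perceptron(X, y, eta=0.1, max_epochs=100):
    # X: (m, n) inputs, y: length-m labels in {0, 1}
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for i in range(m):
            y_hat = heaviside(w @ X[i] + b)
            if y_hat != y[i]:
                # Update only on misclassified samples
                w += eta * (y[i] - y_hat) * X[i]
                b += eta * (y[i] - y_hat)
                errors += 1
        if errors == 0:  # stop once a full pass has no mistakes
            break
    return w, b

# Toy linearly separable dataset: logical AND
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(w, b)  # parameters of a separating hyperplane for AND
```

Because AND is linearly separable, the loop reaches an error-free pass and stops, mirroring the stopping condition above.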
Key Properties
- Linear separability: If the training data is linearly separable, the perceptron learning algorithm converges after a finite number of updates (perceptron convergence theorem); if it is not, the algorithm never stops making updates.
- Limitations: A single perceptron can only represent linearly separable functions, so it cannot solve problems such as XOR (a short derivation follows this list).
- Extension: The multi-layer perceptron (MLP) overcomes this limitation using hidden layers and nonlinear activation functions.
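To see why XOR is out of reach, write out the constraints that a perceptron with weights $w_1, w_2$ and bias $b$ would have to satisfy on the four XOR inputs, using the threshold convention above ($y = 1$ exactly when $z \geq 0$):

$$ \begin{align} b &< 0 && (x_1, x_2) = (0,0),\; y = 0 \\ w_2 + b &\geq 0 && (x_1, x_2) = (0,1),\; y = 1 \\ w_1 + b &\geq 0 && (x_1, x_2) = (1,0),\; y = 1 \\ w_1 + w_2 + b &< 0 && (x_1, x_2) = (1,1),\; y = 0 \end{align} $$

Adding the two middle inequalities gives $w_1 + w_2 + 2b \geq 0$, while adding the first and last gives $w_1 + w_2 + 2b < 0$, a contradiction, so no choice of weights and bias realizes XOR.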
Perceptron Convergence Theorem
The perceptron learning algorithm converges in a finite number of updates if the training data is linearly separable.
Setup and Notation
- Let the training set be $\{ (\mathbf{x}^{(i)}, y^{(i)}) \}_{i=1}^{m}$, where $\mathbf{x}^{(i)} \in \mathbb{R}^n$ and $y^{(i)} \in \{-1, +1\}$ (the $\{0,1\}$ labels used above are remapped to $\{-1, +1\}$ for the proof).
- The perceptron updates its weight vector $\mathbf{w}$ as follows for each misclassified point $(\mathbf{x}^{(i)}, y^{(i)})$:

$$ \mathbf{w} \leftarrow \mathbf{w} + y^{(i)} \mathbf{x}^{(i)} $$

where we assume the bias is absorbed into $\mathbf{x}$ by appending an extra dimension with $x_0 = 1$ (a small sketch of this preprocessing appears after this list), and the learning rate is $1$; the same argument goes through for any fixed learning rate $\eta > 0$.
- Assumption (Linear Separability): There exists a weight vector $\mathbf{w}^*$ and a margin $\gamma > 0$ such that for all training points,

$$ y^{(i)} (\mathbf{w}^* \cdot \mathbf{x}^{(i)}) \geq \gamma $$

where $\|\mathbf{w}^*\| = 1$.
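As a small illustration of this setup, here is a minimal Python sketch (assuming NumPy and the $\{0,1\}$-labeled data from the earlier sections; the helper name `to_convergence_form` is hypothetical):

```python
import numpy as np

def to_convergence_form(X, y01):
    # Absorb the bias: add a constant feature x_0 = 1 (placed first here)
    # so that the corresponding weight plays the role of the bias b.
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    # Remap labels {0, 1} -> {-1, +1}, as assumed in the convergence proof.
    y_pm = 2 * np.asarray(y01) - 1
    return X_aug, y_pm
```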
Proof
We first bound the growth of $\mathbf{w} \cdot \mathbf{w}^*$. Define $\mathbf{w}_t$ as the weight vector after $t$ updates, and initialize $\mathbf{w}_0 = \mathbf{0}$. Each update modifies $\mathbf{w}$ as:
$$ \mathbf{w}_{t+1} = \mathbf{w}_t + y^{(i)} \mathbf{x}^{(i)} $$

Taking the dot product with $\mathbf{w}^*$,

$$ \begin{align} \mathbf{w}_{t+1} \cdot \mathbf{w}^* &= (\mathbf{w}_t + y^{(i)} \mathbf{x}^{(i)}) \cdot \mathbf{w}^* \\ &= \mathbf{w}_t \cdot \mathbf{w}^* + y^{(i)} (\mathbf{x}^{(i)} \cdot \mathbf{w}^*) \end{align} $$

Since $y^{(i)} (\mathbf{x}^{(i)} \cdot \mathbf{w}^*) \geq \gamma$, every update increases $\mathbf{w} \cdot \mathbf{w}^*$ by at least $\gamma$. Summing over all updates, starting from $\mathbf{w}_0 = \mathbf{0}$,

$$ \mathbf{w}_T \cdot \mathbf{w}^* \geq T \gamma $$

where $T$ is the number of updates made so far.

We next bound $\|\mathbf{w}_t\|^2$. The squared norm of $\mathbf{w}$ evolves as:
$$ \begin{align} \|\mathbf{w}_{t+1}\|^2 &= \|\mathbf{w}_t + y^{(i)} \mathbf{x}^{(i)}\|^2 \\ &= \|\mathbf{w}_t\|^2 + 2 y^{(i)} (\mathbf{w}_t \cdot \mathbf{x}^{(i)}) + \|\mathbf{x}^{(i)}\|^2 \end{align} $$

using $(y^{(i)})^2 = 1$. Since the update happens only for misclassified points, $y^{(i)} (\mathbf{w}_t \cdot \mathbf{x}^{(i)}) \leq 0$, so we can drop this term to get:
$$ \|\mathbf{w}_{t+1}\|^2 \leq \|\mathbf{w}_t\|^2 + R^2 $$

where $R = \max_i \|\mathbf{x}^{(i)}\|$. Iterating over $T$ updates,
$$ \|\mathbf{w}_T\|^2 \leq T R^2 $$

We wrap up with the convergence bound. Using the Cauchy-Schwarz inequality and $\|\mathbf{w}^*\| = 1$,
$$ \mathbf{w}_T \cdot \mathbf{w}^* \leq \lVert \mathbf{w}_T \rVert \lVert \mathbf{w}^* \rVert = \lVert \mathbf{w}_T \rVert \implies T \gamma \leq \lVert \mathbf{w}_T \rVert \leq \sqrt{T} R $$

Dividing both sides of $T \gamma \leq \sqrt{T} R$ by $\sqrt{T} \gamma$,

$$ \sqrt{T} \leq \frac{R}{\gamma} \implies T \leq \frac{R^2}{\gamma^2} $$

Since $R$ and $\gamma$ are fixed constants of the dataset, the perceptron makes at most $\frac{R^2}{\gamma^2}$ updates, i.e. it converges after $\mathcal{O}\!\left(\frac{R^2}{\gamma^2}\right)$ updates.
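As an empirical sanity check of this bound, the following minimal Python sketch counts updates on a toy separable dataset and compares them against $R^2/\gamma^2$; the dataset and the particular choice of $\mathbf{w}^*$ (the normalized final weight vector) are illustrative assumptions, since any unit-norm separator supplies a valid margin $\gamma$:

```python
import numpy as np

# Toy linearly separable data with labels in {-1, +1}; bias absorbed via x_0 = 1.
X = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [1.0, -1.0, -1.5], [1.0, -2.0, -0.5]])
y = np.array([1, 1, -1, -1])

# Run the perceptron (unit learning rate) and count updates until a clean pass.
w = np.zeros(X.shape[1])
updates = 0
while True:
    mistakes = 0
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i) <= 0:  # misclassified (ties count as mistakes)
            w += y_i * x_i
            updates += 1
            mistakes += 1
    if mistakes == 0:
        break

# Bound: pick a unit-norm separator w* and take gamma = min_i y_i (w* . x_i).
w_star = w / np.linalg.norm(w)  # the learned direction separates this data
gamma = np.min(y * (X @ w_star))
R = np.max(np.linalg.norm(X, axis=1))
print(f"updates = {updates}, bound R^2/gamma^2 = {(R / gamma) ** 2:.2f}")
```

The printed update count should not exceed the printed bound, though the bound may be far from tight.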
References
[1] F. Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review, vol. 65, no. 6, pp. 386-408, 1958.