The Perceptron Model

Huang, Xuanqiang Angelo

May 31, 2025 · Reading Time: 5 minutes · By Xuanqiang Angelo Huang

Table of Contents

Introduction to the Perceptron
Perceptron Convergence Theorem
  - Setup and Notation
  - Proof

The perceptron is a fundamental binary linear classifier introduced by Rosenblatt (1958). It maps an input vector $x \in \mathbb{R}^{n}$ to an output $y \in \{0, 1\}$ using a weighted sum followed by a threshold function.

Introduction to the Perceptron

A mathematical model

Given an input vector $x = (x_{1}, x_{2}, \dots, x_{n})$ and a weight vector $w = (w_{1}, w_{2}, \dots, w_{n})$ , the perceptron computes:

$$z = w^{⊤} x + b = \sum_{i = 1}^{n} w_{i} x_{i} + b$$

where $b$ is the bias term. The output is determined by the Heaviside step function:

$$y = f (z) = \begin{cases} 1, & \text{if } z \geq 0 \\ 0, & \text{otherwise} \end{cases}$$
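The two formulas above can be sketched in a few lines of plain Python. The function names `heaviside` and `predict`, and the hand-picked weights for logical AND, are illustrative assumptions rather than part of the original model description.

```python
def heaviside(z):
    """Step function f(z): 1 if z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0


def predict(w, b, x):
    """Perceptron output for input x with weights w and bias b."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b  # z = w^T x + b
    return heaviside(z)


# Hypothetical weights, chosen by hand, that implement logical AND on {0, 1}^2.
w_and, b_and = [1.0, 1.0], -1.5
and_outputs = [predict(w_and, b_and, x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Note that a pre-activation of exactly $z = 0$ is mapped to class $1$, following the $z \geq 0$ convention of the step function.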

Learning Rule

Given a labeled dataset $\{(x^{(i)}, y^{(i)})\}_{i = 1}^{m}$ , the perceptron uses the following weight update rule for misclassified samples ( $y^{(i)} \neq f (w^{⊤} x^{(i)} + b)$ ):

$$w \leftarrow w + η (y^{(i)} - f (z^{(i)})) x^{(i)}$$

$$b \leftarrow b + η (y^{(i)} - f (z^{(i)}))$$

where $z^{(i)} = w^{⊤} x^{(i)} + b$ and $η > 0$ is the learning rate. You keep cycling through the dataset and updating until no sample is misclassified. If the data is linearly separable, this happens after a finite number of updates.

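Observe that whenever a sample is misclassified, the update adds the input $x^{(i)}$ to the weight vector $w$, scaled by the learning rate $η$ and with sign $y^{(i)} - f (z^{(i)}) \in \{- 1, + 1\}$. Correctly classified samples leave $w$ and $b$ unchanged.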
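A minimal sketch of the learning rule as a training loop, in plain Python. The function name `train_perceptron`, the zero initialization, and the `max_epochs` safety cap are assumptions made for this example; the rule as stated has no such cap.

```python
def predict(w, b, x):
    """Heaviside output: 1 if w^T x + b >= 0, otherwise 0."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if z >= 0 else 0


def train_perceptron(samples, labels, eta=1.0, max_epochs=100):
    """Apply the update rule until one full pass produces no errors.

    max_epochs is a safety cap added for this sketch.
    Returns (w, b, converged).
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(samples, labels):
            delta = y - predict(w, b, x)  # 0 if correct, +1 or -1 if wrong
            if delta != 0:
                w = [w_i + eta * delta * x_i for w_i, x_i in zip(w, x)]
                b += eta * delta
                errors += 1
        if errors == 0:
            return w, b, True
    return w, b, False


# Logical OR is linearly separable, so training stops with no errors.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y_or = [0, 1, 1, 1]
w, b, converged = train_perceptron(X, y_or)
```

On this toy dataset the loop stops after a handful of passes, and the learned `w` and `b` reproduce the OR truth table.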

Key Properties

Linear separability: The perceptron converges if and only if the data is linearly separable (perceptron convergence theorem).
Limitations: It cannot solve problems like XOR, because no single hyperplane separates the two classes and a perceptron can only represent linear decision boundaries.
Extension: The multi-layer perceptron (MLP) overcomes this limitation using hidden layers and nonlinear activation functions.
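The first two properties can be checked on the four Boolean inputs. The sketch below is an illustration, not a proof: the helper `fits` and its `max_epochs` cap are assumptions of this example, and "did not converge within the cap" is used as a stand-in for "never converges".

```python
def fits(labels, max_epochs=1000):
    """Return True if the perceptron rule reaches zero errors on the four
    Boolean inputs within max_epochs (a cap chosen for this sketch)."""
    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(inputs, labels):
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else 0
            if pred != y:
                step = y - pred  # +1 or -1, learning rate fixed to 1
                w = [w[0] + step * x[0], w[1] + step * x[1]]
                b += step
                errors += 1
        if errors == 0:
            return True
    return False


and_ok = fits([0, 0, 0, 1])  # AND: linearly separable
xor_ok = fits([0, 1, 1, 0])  # XOR: not linearly separable
```

For AND the loop terminates with zero errors. For XOR every pass contains at least one mistake, since no linear boundary classifies all four points correctly, so the cap is always reached.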

Perceptron Convergence Theorem

The perceptron learning algorithm converges in a finite number of updates if the training data is linearly separable.

Setup and Notation

Let the training set be $\{(x^{(i)}, y^{(i)})\}_{i = 1}^{m}$ , where $x^{(i)} \in \mathbb{R}^{n}$ and $y^{(i)} \in \{- 1, + 1\}$ .
The perceptron updates its weight vector $w$ as follows for each misclassified point $(x^{(i)}, y^{(i)})$ : $w \leftarrow w + y^{(i)} x^{(i)}$ where we assume the bias is absorbed into $x$ by appending an extra dimension with $x_{0} = 1$ .

We assume the learning rate is $1$ ; the same argument works for any positive learning rate.
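One way to see why the rate does not matter, sketched under the assumption that the weights start at zero: write $w_{t}^{[η]}$ (notation introduced only for this remark) for the weights after $t$ updates with rate $η$, made on the points $i_{1}, \dots, i_{t}$. Then

$$w_{t}^{[η]} = η \sum_{k = 1}^{t} y^{(i_{k})} x^{(i_{k})} = η \, w_{t}^{[1]}$$

so $w_{t}^{[η]} \cdot x$ and $w_{t}^{[1]} \cdot x$ have the same sign for every $x$. The same points are therefore misclassified at every step, and the number of updates does not depend on $η$.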

Assumption (Linear Separability): There exists a weight vector $w^{*}$ and a margin $γ > 0$ such that for all training points $y^{(i)} (w^{*} \cdot x^{(i)}) \geq γ$ where $∥ w^{*} ∥ = 1$ . This is the same condition we use for Support Vector Machines.
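The assumption can be checked numerically for a candidate separator. The sketch below computes the smallest value of $y^{(i)} (w^{*} \cdot x^{(i)})$ after rescaling the candidate to unit norm; the function name `margin` and the toy dataset (with the bias coordinate $x_{0} = 1$ written first) are made up for illustration.

```python
import math


def margin(samples, labels, w_star):
    """Smallest y * (u . x) over the data, where u = w_star / ||w_star||.

    A strictly positive result means w_star separates the data with
    that margin; a result <= 0 means it does not.
    """
    norm = math.sqrt(sum(c * c for c in w_star))
    u = [c / norm for c in w_star]  # enforce ||w*|| = 1
    return min(
        y * sum(u_i * x_i for u_i, x_i in zip(u, x))
        for x, y in zip(samples, labels)
    )


# Hypothetical toy data; the first coordinate is the absorbed bias x_0 = 1.
samples = [(1, 2, 0), (1, 3, 1), (1, -2, 0), (1, -3, -1)]
labels = [+1, +1, -1, -1]

gamma_good = margin(samples, labels, w_star=(0, 2, 0))  # separates the data
gamma_bad = margin(samples, labels, w_star=(0, 0, 1))   # does not separate
```

Any candidate with a positive margin certifies linear separability, though it need not be the maximum-margin separator.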

Proof

We first bound the growth of $w \cdot w^{*}$ . Define $w_{t}$ as the weight vector after $t$ updates, with $w_{0} = 0$ initially. Each update modifies $w$ as $w_{t + 1} = w_{t} + y^{(i)} x^{(i)}$ . Taking the dot product with $w^{*}$ , we have:

$$w_{t + 1} \cdot w^{*} = (w_{t} + y^{(i)} x^{(i)}) \cdot w^{*} = w_{t} \cdot w^{*} + y^{(i)} (x^{(i)} \cdot w^{*}) \geq w_{t} \cdot w^{*} + γ$$

The inequality holds because $y^{(i)} (x^{(i)} \cdot w^{*}) \geq γ$ by the separability assumption. Since $w_{0} = 0$ , summing over all updates gives

$$w_{T} \cdot w^{*} \geq T γ$$

where $T$ is the number of updates made so far. We then bound $∥ w_{t} ∥^{2}$ . The squared norm of $w$ evolves as:

$$∥ w_{t + 1} ∥^{2} = ∥ w_{t} + y^{(i)} x^{(i)} ∥^{2} = ∥ w_{t} ∥^{2} + 2 y^{(i)} (w_{t} \cdot x^{(i)}) + ∥ x^{(i)} ∥^{2}$$

Since the update happens only for misclassified points, $y^{(i)} (w_{t} \cdot x^{(i)}) \leq 0$ , so we can drop the middle term to get:

$$∥ w_{t + 1} ∥^{2} \leq ∥ w_{t} ∥^{2} + ∥ x^{(i)} ∥^{2} \leq ∥ w_{t} ∥^{2} + R^{2}$$

where $R = \max_{i} ∥ x^{(i)} ∥$ . Iterating over $T$ updates starting from $w_{0} = 0$ ,

$$∥ w_{T} ∥^{2} \leq T R^{2}$$

We wrap up with the convergence bound. Combining the lower bound on $w_{T} \cdot w^{*}$ with the Cauchy-Schwarz inequality and $∥ w^{*} ∥ = 1$ ,

$$T γ \leq w_{T} \cdot w^{*} \leq ∥ w_{T} ∥ ∥ w^{*} ∥ = ∥ w_{T} ∥ \leq \sqrt{T} R$$

From this we conclude:

$$T γ \leq \sqrt{T} R ⟹ \sqrt{T} \leq \frac{R}{γ} ⟹ T \leq \frac{R ^{2}}{γ ^{2}}$$

Since $R$ and $γ$ are constants, this implies that the perceptron makes at most $\frac{R ^{2}}{γ ^{2}}$ updates, proving convergence in $O (\frac{R ^{2}}{γ ^{2}})$ .
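The bound can be sanity-checked empirically. The sketch below runs the $\pm 1$ update rule from $w_{0} = 0$ with rate $1$, counts the updates $T$ , and compares against $R^{2} / γ^{2}$ computed from a hand-picked unit-norm separator. The dataset, the separator, and the `max_epochs` cap are assumptions of this example. A point with $y (w \cdot x) = 0$ is treated as misclassified so that the first update from the zero vector can happen.

```python
import math


def perceptron_updates(samples, labels, max_epochs=1000):
    """Run w <- w + y x on misclassified points, starting from w = 0.

    Returns (w, number_of_updates). max_epochs is a safety cap for
    this sketch only.
    """
    w = [0.0] * len(samples[0])
    updates = 0
    for _ in range(max_epochs):
        changed = False
        for x, y in zip(samples, labels):
            # y (w . x) <= 0 counts as a mistake (ties included).
            if y * sum(w_i * x_i for w_i, x_i in zip(w, x)) <= 0:
                w = [w_i + y * x_i for w_i, x_i in zip(w, x)]
                updates += 1
                changed = True
        if not changed:
            break
    return w, updates


# Hypothetical toy data; the first coordinate is the absorbed bias x_0 = 1.
samples = [(1, 2, 0), (1, 3, 1), (1, -2, 0), (1, -3, -1)]
labels = [+1, +1, -1, -1]
w_star = (0.0, 1.0, 0.0)  # unit-norm separator chosen by hand

gamma = min(
    y * sum(a * c for a, c in zip(w_star, x)) for x, y in zip(samples, labels)
)
R = max(math.sqrt(sum(x_i * x_i for x_i in x)) for x in samples)
bound = R ** 2 / gamma ** 2

w, T = perceptron_updates(samples, labels)
```

Any separator with positive margin gives a valid bound; a separator with a larger margin gives a tighter one.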

References

[1] Rosenblatt, F. “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review, 1958.
