The perceptron is a fundamental binary linear classifier introduced by (Rosenblatt 1958). It maps an input vector to an output using a weighted sum followed by a threshold function.

Introduction to the Perceptron

A mathematical model

Given an input vector and a weight vector , the perceptron computes:

where is the bias term. The output is determined by the Heaviside step function:

Learning Rule

Given a labeled dataset , the perceptron uses the following weight update rule for misclassified samples ():

where is the learning rate. You continue to update until there are no more errors. If the data is linearly separable, this converges in finite time.

The Perceptron Model-20250519144755347

You can observe the following: If there is an error, then you basically add the value of the scaled by the learning part to the theta.

Key Properties

  • Linear separability: The perceptron converges if and only if the data is linearly separable (perceptron convergence theorem).
  • Limitations: It cannot solve problems like XOR due to its inability to learn non-linearly separable functions.
  • Extension: The multi-layer perceptron (MLP) overcomes this limitation using hidden layers and nonlinear activation functions.

Perceptron Convergence Theorem

The perceptron learning algorithm converges in a finite number of updates if the training data is linearly separable.

Setup and Notation

  • Let the training set be , where and .
  • The perceptron updates its weight vector as follows for each misclassified point : where we assume the bias is absorbed into by appending an extra dimension with .

We assume the update rate is , the same argument can be done with any update rate.

  • Assumption (Linear Separability): There exists a weight vector and a margin such that for all training points where . This is the same condition we use for Support Vector Machines.

Proof

We first bound the Growth of Define as the weight vector after updates. Initially, let . Each update modifies as Taking the dot product with we have:

Since , summing over all updates,

Where is the number of updates till now. We then bound : The norm squared of evolves as:

Since the update happens only for misclassified points, , so we drop it to get:

where . Iterating over updates,

We wrap up with the convergence bound. Using the Cauchy-Schwarz inequality,

From this we conclude:

Since and are constants, this implies that the perceptron makes at most updates, proving convergence in .

References

[1] Rosenblatt “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” 1958