The perceptron is a fundamental binary linear classifier introduced by (Rosenblatt 1958). It maps an input vector to an output using a weighted sum followed by a threshold function.
Introduction to the Perceptron
A mathematical model
Given an input vector and a weight vector , the perceptron computes:
where is the bias term. The output is determined by the Heaviside step function:
Learning Rule
Given a labeled dataset , the perceptron uses the following weight update rule for misclassified samples ():
where is the learning rate. You continue to update until there are no more errors. If the data is linearly separable, this converges in finite time.

You can observe the following: If there is an error, then you basically add the value of the scaled by the learning part to the theta.
Key Properties
- Linear separability: The perceptron converges if and only if the data is linearly separable (perceptron convergence theorem).
- Limitations: It cannot solve problems like XOR due to its inability to learn non-linearly separable functions.
- Extension: The multi-layer perceptron (MLP) overcomes this limitation using hidden layers and nonlinear activation functions.
Perceptron Convergence Theorem
The perceptron learning algorithm converges in a finite number of updates if the training data is linearly separable.
Setup and Notation
- Let the training set be , where and .
- The perceptron updates its weight vector as follows for each misclassified point : where we assume the bias is absorbed into by appending an extra dimension with .
We assume the update rate is , the same argument can be done with any update rate.
- Assumption (Linear Separability): There exists a weight vector and a margin such that for all training points where . This is the same condition we use for Support Vector Machines.
Proof
We first bound the Growth of Define as the weight vector after updates. Initially, let . Each update modifies as Taking the dot product with we have:
Since , summing over all updates,
Where is the number of updates till now. We then bound : The norm squared of evolves as:
Since the update happens only for misclassified points, , so we drop it to get:
where . Iterating over updates,
We wrap up with the convergence bound. Using the Cauchy-Schwarz inequality,
From this we conclude:
Since and are constants, this implies that the perceptron makes at most updates, proving convergence in .
References
[1] Rosenblatt “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” 1958