Over the past week I have been studying the patterns of crime and punishment in the Italian penal code. The most common form of punishment I found there was imprisonment: physically placing the criminal in a separate zone. Imprisonment serves several purposes: deterrence through containment (separating criminals from non-criminals), neutralization (removing the dangerous tools that a criminal has), rehabilitation (hoping that the criminal won't commit crimes again once released), and denunciation (assuming the agent has an identity, harming its reputation).
Imprisonment Equivalent for AI Agents?
But which of these properties of classical punishments for humans make sense in the context of AI agents? In many contexts, neutralizing and containing an AI agent makes it no longer useful. Furthermore, if an irreparable harm has been committed, as is common for crimes, containing the model after the fact would do little to repair that harm. It would still be eye-for-an-eye justice, whose meaning is unclear for AI agents because of their ephemeral memory. Denunciation works only insofar as it harms the brand of the developer company, since an AI has no identity of its own, and rehabilitation becomes just further or more specific training of the model to mark the harmful behaviour as off-limits. Most probably, for AI agents as they currently are, this form of punishment does not make much sense.
Yet many other forms of punishment exist. Imprisonment is just one form of deprivation of liberty. The closest analogue could be placing stricter monitors and guardrails on LLMs that are known to commit the equivalent of crimes in this context, which include, but are not limited to, scheming, sandbagging, reward tampering, instrumental convergence, and collusion.
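To make the "stricter monitors and guardrails" idea a bit more concrete, here is a minimal sketch of an escalation policy in which each detected offence moves an agent to a more restrictive monitoring tier. Everything in it is an assumption made for illustration: the tier names, the `AgentRecord` class, the one-step escalation rule, and the idea that violations can be attributed to a stable `agent_id` (in practice this might be a model version or a deployment rather than an individual agent).

```python
from dataclasses import dataclass, field

# Hypothetical offence labels, borrowed from the list in the text.
OFFENCES = {"scheming", "sandbagging", "reward_tampering", "collusion"}

# Hypothetical monitoring tiers, ordered from least to most restrictive.
TIERS = ["standard", "logged", "human_review", "suspended"]


@dataclass
class AgentRecord:
    agent_id: str  # assumed stable identifier (e.g. a model version or deployment)
    violations: list = field(default_factory=list)

    @property
    def tier(self) -> str:
        # Each recorded violation moves the agent one tier up, capped at the last tier.
        return TIERS[min(len(self.violations), len(TIERS) - 1)]

    def report(self, offence: str) -> str:
        """Record a detected offence and return the resulting monitoring tier."""
        if offence not in OFFENCES:
            raise ValueError(f"unknown offence: {offence}")
        self.violations.append(offence)
        return self.tier


record = AgentRecord("agent-1")
print(record.tier)                         # standard
print(record.report("sandbagging"))        # logged
print(record.report("reward_tampering"))   # human_review
```

Note that the last tier, `suspended`, is plain containment again, so this kind of policy only differs from imprisonment in the intermediate tiers, where the agent stays useful but is watched more closely.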
Financial Punishments
In other contexts, punishment is a much more viable model: namely, when the agent has some financial role, so that the currency given to the model is directly tied to some notion of "wellbeing" for that model. When this holds, fines for crimes are life-threatening for the model. Yet this first requires that agents always operate with currency, which may be something we don't want, and that we endow them with some notion of identity, so that they behave as coherently as possible over time. Neither condition holds in general, but both could hold in specific settings, depending on the environments we create and embed the agents in. For example, blockchain systems treat currency and operations on it as primitives.
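As a rough illustration of the two assumptions above (a persistent identity and a currency tied to the agent's "wellbeing"), here is a toy ledger. The `AgentAccount` class, its fields, and the rule that an agent with zero balance can no longer act are all hypothetical, not a description of any existing system.

```python
from dataclasses import dataclass


@dataclass
class AgentAccount:
    agent_id: str  # assumes the agent has a persistent identity
    balance: int   # hypothetical currency standing in for "wellbeing"

    def fine(self, amount: int) -> int:
        """Deduct a fine, never going below zero, and return the amount actually paid."""
        if amount < 0:
            raise ValueError("fine must be non-negative")
        paid = min(amount, self.balance)
        self.balance -= paid
        return paid

    @property
    def solvent(self) -> bool:
        # In this toy economy, an agent with no currency left can no longer act.
        return self.balance > 0


account = AgentAccount("agent-1", balance=100)
print(account.fine(30), account.balance, account.solvent)   # 30 70 True
print(account.fine(500), account.balance, account.solvent)  # 70 0 False
```

The second call shows why a fine can be "life-threatening" in this setting: a large enough fine drains the balance and, under the assumed rule, removes the agent's ability to act at all.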
Emergence of Norms
Norms have two main components: an individual element (a belief) and a social element (the belief is shared within a group), which together lead to a certain regularity of intentional behavior (Chopra et al. 2018). In other words, the rule is followed by the specific actor and is common knowledge within the society of agents in which that actor is embedded. I don't have a clear picture of the history of legal codification, but from general reading online, it seems to have usually followed this progression:
- Implicit norms when interacting with other humans
- Codification of the rules (among the earliest was the lex talionis in the Code of Hammurabi), with legitimization of some form (in the past from God, now from the legitimacy of the chosen government), allowing it to apply sanctions to specific
- Evolution of the existing code in response to circumstances. In this sense, the code evolves to follow the actual needs of the community. For example, if a crime is impossible to commit, it will not be codified. A clearer example: internet crimes did not exist before the invention of the internet. Notions of ownership that apply to physical objects have much less meaning in that sphere, where the same thing can be copied at zero cost.
I think we are currently at the stage of mapping out what kinds of crimes language-model-based agents can commit. Writing specific codes to detect and deter such behaviours only needs time, so that we as humans can observe these novel cases and decide which ones should be allowed and at what cost.
References
[1] Chopra et al. “Handbook of Normative Multiagent Systems”, 2018.