The Dark Side of Alignment

John Burden
Feb 15, 2021

When we think of aligning AI, we typically imagine methods for preventing AIs from causing harm or destroying the world. We imagine methods for getting AIs to coexist with us and help us to flourish as a species. We aren't there yet: AI alignment is still an open problem, but solutions could prevent catastrophe at the hands of unaligned intelligent machines. I want to refer to the sort of methods we normally imagine for alignment, such as novel decision theories, as "hard alignment". These are technical methods for ensuring that an AI system is aligned, preventing bad behaviour and limiting incentives to perform it.

Less clear are some of the ethical ramifications of hard alignment. To explore this, let's assume that AI development has gone very well, and we are near or at the point of reaching human-level intelligence in AI systems. Additionally, suppose these AI systems are sentient in some way, and not simply automata mimicking human behaviour.

When thinking of AIs of this level of intelligence, hard alignment becomes a little murkier. We have severely limited the AI's range of behaviours and possibly thoughts. Were this a human that had been "aligned" this way, we would be horrified, and would likely consider it tantamount to slavery and condemn it as anathema. What is so different about an appropriately intelligent AI?

Of course, we are in danger of anthropomorphising the AI. Just because it demonstrates human-like competencies and intelligence does not necessarily imply that the system is similar to us in other respects. The space of possible minds with human-level intelligence is potentially rather large, and for some of these types of intelligence, this darker view of hard alignment might not be an issue.

In "The Basic AI Drives", Omohundro argues that there is a set of behaviours that would constitute instrumental goals for many AI systems, such as self-preservation and resource acquisition. While this doesn't help us to understand whether they would share other human characteristics, it does indicate that there are behaviours and desires that we can predict most AI systems would have, regardless of how different they are from us.

Additionally, many animals are worthy of moral consideration, yet are not as intelligent as we are. When does the same apply to artificial systems? If there is even a small chance of AI developing the capabilities to be deserving of moral status (whatever those capabilities and criteria may be) then we ought to at least think about the ethical ramifications of hard alignment more seriously.

With humans, we handle this question of "alignment" in a different way. We enact laws to deter bad behaviour, often with lengthy prison sentences that effectively remove people who exhibit this bad behaviour from society for some period of time. The reasons for doing this are twofold: rehabilitation of the offender and prevention of further bad behaviour. Additionally, and more crucially, we educate children and try to instil moral values in them. I want to refer to this as "soft alignment". In this system, we don't restrict a person's actions until they have already exhibited bad behaviour; there are simply consequences for actions. More subtly, one could argue that the type of moral education a child receives will alter their thoughts and desires. Even so, thoughts are never forcefully restricted, and given the wide diversity of philosophies and political stances, it seems that humans broadly have freedom of thought.

Historically, however, this soft alignment hasn't prevented all bad behaviour. Murders do occur. Wars are started. Atrocities are committed. But I would wager that, on the whole, our soft alignment prevents far more harm than would otherwise occur. When we introduce the drastic power imbalances caused by a human-level or higher AI, it's likely that the issues with soft alignment would be exacerbated. Suppose we try to use soft alignment to make a safe AI. Even if we could adequately teach such an AI morals (whatever they may be), abiding by them would remain optional, and the AI would have many incentives to set them aside temporarily in the name of the "greater good". Given the vast power imbalance, it also seems unlikely that humanity would be able to "imprison" such an AI and prevent it from carrying out further undesirable activity. Given the large effects such an AI may have on our world and ourselves, relying on soft alignment is an unacceptable risk.

How, then, do we approach the problem of aligning AI? On one hand, the ethical ramifications of hard alignment are rather murky, and on the other hand, the threat posed by AIs is too great to leave to soft alignment. I don't have the answer here. One possibility to explore may be slowly reducing the "hardness" of the alignment procedure over time (whatever that may mean), eventually allowing an AI to think and act freely. But any sufficiently malevolent and intelligent AI could simply bide its time and wait to be free before acting maliciously.

While there is no clear answer to the ethics of hard alignment, or the efficacy of soft alignment, what is clear to me is that I would much rather inhabit a future where artificial intelligences are seen as partners and companions rather than slaves and tools. To ignore the issues surrounding hard alignment could lead to a dark future in which we inflict atrocities upon the only other entities of human-level intelligence we may ever know.
