<!-- Appendix B - Utility Functions -->
<style>
.visionbox{
border-radius: 15px;
border: 2px solid #3585d4;
background-color: #ebf3fb;
text-align: left;
padding: 10px;
}
</style>
<style>
.visionboxlegend{
border-bottom-style: solid;
border-bottom-color: #3585d4;
border-bottom-width: 0px;
margin-left: -12px;
margin-right: -12px; margin-top: -13px;
padding: 0.01em 1em; color: #ffffff;
background-color: #3585d4;
border-radius: 15px 15px 0px 0px}
</style>
<h1 id="utility-and-utility-functions">B.1 Utility and Utility
Functions</h1>
<h2 id="fundamentals">B.1.1 Fundamentals</h2>
<p><strong>A utility function is a mathematical representation of
preferences.</strong> A utility function, <span
class="math inline"><em>u</em></span>, takes inputs like goods or
situations and outputs a value called <em>utility</em>. Utility is a
measure of how much an agent prefers goods and situations relative to
other goods and situations.<p>
Suppose we offer Alice some apples, bananas, and cherries. She might
have the following utility function for fruits:<p>
<span
class="math display"><em>u</em>(fruits) = 12<em>a</em> + 10<em>b</em> + 2<em>c</em>,</span>
where <span class="math inline"><em>a</em></span> is the number of
apples, <span class="math inline"><em>b</em></span> is the number of
bananas, and <span class="math inline"><em>c</em></span> is the number
of cherries that she consumes. Suppose Alice consumes no apples, one
banana, and five cherries. The amount of utility she gains from her
consumption is calculated as <span
class="math display"><em>u</em>(0 apples,1 banana,5 cherries) = (12⋅0) + (10⋅1) + (2⋅5) = 20.</span>
The output of this function is read as “20 units of utility” for short.
These units are arbitrary and reflect the level of Alice’s utility. We
can use utility functions to quantitatively represent preferences over
different combinations of goods and situations. For example, we can rank
Alice’s preferences over fruits as <span
class="math display">apple ≻ banana ≻ cherry,</span> where <span
class="math inline">≻</span> represents <em>preference</em>, such that
what comes before the symbol is preferred to what comes after it. This
follows from the fact that Alice gains 12 units from an apple, 10 units
from a banana, and 2 units from a cherry. The advantage of having a
utility function as opposed to just an explicit ranking of goods is that
we can directly infer information about more complex goods. For example,
we know <span class="math display"><em>u</em>(1 banana,5
cherries) = 20 > <em>u</em>(1 apple) = 12 > <em>u</em>(1
banana) = 10.</span> <p><strong>Utility functions, if accurate, reveal what
options agents would prefer and choose.</strong> If told to choose only
one of the three fruits, Alice would pick the apple, since it gives her
the most utility. Her preference follows from <em>rational choice
theory</em>, which proposes that individuals, acting in their own
self-interest, make decisions that maximize their utility. This
view is only an approximation to human behavior. In this chapter we will
discuss how rational choice theory is an imperfect but useful way to
model choices. We will also refer to individuals who behave in coherent
ways that help maximize utility as <em>agents</em>.</p>
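<p>As a concrete illustration (added here, not part of the original text), the short Python sketch below encodes Alice’s fruit utility function, recomputes the 20 utils from her bundle, and shows that a utility maximizer offered a single fruit picks the apple.</p>
<pre><code>
# Alice's Bernoulli utility function over fruit bundles:
# u(apples, bananas, cherries) = 12a + 10b + 2c.
def utility(apples, bananas, cherries):
    return 12 * apples + 10 * bananas + 2 * cherries

# The bundle from the text: no apples, one banana, five cherries.
print(utility(0, 1, 5))  # 20 "units of utility"

# Offered exactly one fruit, a utility maximizer picks the apple.
single_fruits = {
    "apple": utility(1, 0, 0),
    "banana": utility(0, 1, 0),
    "cherry": utility(0, 0, 1),
}
print(max(single_fruits, key=single_fruits.get))  # apple
</code></pre>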
<p><strong>We explore concepts about utility functions that are useful
for thinking about AIs, humans, and organizations like companies and
states.</strong> First, we introduce <em>Bernoulli utility
functions</em>, which are conventional utility functions that define
preferences over certain outcomes like the example above. We later
discuss <em>von Neumann-Morgenstern utility functions</em>, which extend
preferences to probabilistic situations, in which we cannot be sure
which outcome will occur. <em>Expected utility theory</em> suggests that
rationality is the ability to maximize preferences. We consider the
relevance of utility functions to <em>AI corrigibility</em>—the property
of being receptive to corrections—and see how this might be a source of
tail risk. Much of this chapter focuses on how utility functions help
understand and model agents’ <em>attitudes toward risk</em>. Finally, we
examine <em>non-expected utility theories</em>, which seek to rectify
some shortcomings of conventional expected utility theory when modeling
real-life behavior.</p>
<h2 id="motivations-for-learning-about-utility-functions">B.1.2 Motivations
for Learning About Utility Functions</h2>
<p><strong>Utility functions are a central concept in economics and
decision theory.</strong> Utility functions can be applied to a wide
range of problems and agents, from rats finding cheese in a maze to
humans making investment decisions to countries stockpiling nuclear
weapons. Conventional economic theory assumes that people are rational
and well-informed, and make decisions that maximize their self-interest,
as represented by their utility function. The view that individuals will
choose options that are likely to maximize their utility functions,
referred to as <em>expected utility theory</em>, has been the major
paradigm in real-world decision making since the Second World War <span
class="citation" data-cites="schoemaker1982expected">[1]</span>. It is
useful for modeling, predicting, and encouraging desired behavior in a
wide range of situations. However, as we will discuss, this view does
not perfectly capture reality, because individuals can often be
irrational, lack relevant knowledge, and frequently make mistakes.</p>
<p><strong>The objective of maximizing a utility function can cause
intelligence.</strong> The <em>reward hypothesis</em> suggests that the
objective of maximizing some reward is sufficient to drive behavior that
exhibits intelligent traits like learning, knowledge, perception, social
awareness, language, generalization, and more <span class="citation"
data-cites="silver2021reward">[2]</span>. The reward hypothesis implies
that artificial agents in rich environments with simple rewards could
develop sophisticated general intelligence. For example, an artificial
agent deployed with the goal of maximizing the number of successful food
deliveries may develop relevant geographical knowledge, an understanding
of how to move between destinations efficiently, and the ability to
perceive potential dangers. Therefore, the construction and properties
of the utility function that agents maximize are central to guiding
intelligent behavior.</p>
<p><strong>Certain artificial agents may be approximated as expected
utility maximizers.</strong> Some artificial intelligences are
agent-like. They are programmed to consider the potential outcomes of
different actions and to choose the option that is most likely to lead
to the optimal result. It is a reasonable approximation to say that many
artificial agents make choices that they predict will give them the
highest utility. For instance, in reinforcement learning, artificial agents explore their environment and
are rewarded for desirable behavior. These agents are explicitly
constructed to maximize reward functions, which strongly shape an
agent’s internal utility function, should it exist, and its
dispositions. This view of AI has implications for how we design and
evaluate these systems—we need to ensure that their value functions
promote human values. Utility functions can help us reason about the
behavior of AIs, as well as the behavior of powerful actors that direct
AIs, such as corporations or governments.</p>
<p><strong>Utility functions are a key concept in AI safety.</strong>
Utility functions come up explicitly and implicitly at various times
throughout this book, and are useful for understanding the behavior of
reward-maximizing agents, as well as humans and organizations involved
in the AI ecosystem. They will also come up in our chapter on Machine Ethics, when we
consider that some advanced AIs may have utility functions that make up the
social welfare function they seek to increase. In the Collective Action Problems chapter, we will
continue our discussion of rational agents that seek to maximize their
own utility.</p>
<h1 id="properties-of-utility-functions">B.2 Properties of Utility
Functions</h1>
<p><strong>Overview.</strong> In this section, we will formalize our
understanding of utility functions. First, we will introduce
<em>Bernoulli utility functions</em>, which are simple utility functions
that allow an agent to select between different choices with known
outcomes. Then we will discuss <em>von Neumann-Morgenstern utility
functions</em>, which model how rational agents select between choices
with probabilistic outcomes based on the concept of <em>expected
utility</em>, making these tools applicable to choices under
uncertainty. Finally, we will describe a solution to a
famous puzzle applying expected utility—the <em>St. Petersburg
Paradox</em>—to see why expected utility is a useful tool for decision
making.<p>
Establishing these mathematical foundations will help us understand how
to apply utility functions to various actors and situations.</p>
<h2 id="bernoulli-utility-functions">B.2.1 Bernoulli Utility Functions</h2>
<p><strong>Bernoulli utility functions represent an individual’s
preferences over potential outcomes.</strong> Suppose we give people the
choice between an apple, a banana, and a cherry. If we already know each
person’s utility function, we can deduce, predict, and compare their
preferences. In the introduction, we met Alice, whose preferences are
represented by the utility function over fruits:<p>
<span
class="math display"><em>u</em>(<em>f</em>) = 12<em>a</em> + 10<em>b</em> + 2<em>c</em>.</span>
This is a Bernoulli utility function.</p>
<p><strong>Bernoulli utility functions can be used to convey the
strength of preferences across opportunities.</strong> In their most
basic form, Bernoulli utility functions express ordinal preferences by
ranking options in order of desirability. For more information, we can
consider cardinal representations of preferences. With cardinal utility
functions, numbers matter: while the units are still arbitrary, the
relative differences are informative.<p>
To illustrate the difference between ordinal and cardinal comparisons,
consider how we talk about temperature. When we want to precisely convey
information about temperature, we use a cardinal measure like Celsius or
Fahrenheit: “Today is five degrees warmer than yesterday.” We could have
also accurately, but less descriptively, used an ordinal descriptor:
“Today is warmer than yesterday.” Similarly, if we interpret Alice’s
utility function as cardinal, we can conclude that she feels more
strongly about the difference between a banana and a cherry (8 units of
utility) than she does about the difference between an apple and a
banana (2 units). We can gauge the relative strength of Alice’s
preferences from a utility function.</p>
<h2 id="von-neumann-morgenstern-utility-functions">B.2.2 Von
Neumann-Morgenstern Utility Functions</h2>
<p><strong>Von Neumann-Morgenstern utility functions help us understand
what people prefer when outcomes are uncertain.</strong> We do not yet
know how Alice values an uncertain situation, such as a coin flip. If
the coin lands on heads, Alice gets both a banana and an apple. But if
it lands on tails, she gets nothing. Now let’s say we give Alice a
choice between getting an apple, getting a banana, or flipping the coin.
Since we know her fruit Bernoulli utility function, we know her
preferences between apples and bananas, but we do not know how she
compares each fruit to the coin flip. We’d like to convert the possible
outcomes of the coin flip into a number that represents the utility of
each outcome, which can then be compared directly against the utility of
receiving the fruits with certainty. The von Neumann-Morgenstern (vNM)
utility functions help us do this <span class="citation"
data-cites="vonneumann1947theory">[3]</span>. They are extensions of
Bernoulli utility functions, and work specifically for situations with
uncertainty, represented as <em>lotteries</em> (denoted <strong><span
class="math inline"><em>L</em></span></strong>), like this coin flip.
First, we work through some definitions and assumptions that allow us to
construct utility functions over potential outcomes, and then we explore
the relation between von Neumann-Morgenstern utility functions and
expected utility.</p>
<p><strong>A lottery assigns a probability to each possible
outcome.</strong> Formally, a lottery <span
class="math inline"><em>L</em></span> is any set of possible outcomes,
denoted <span
class="math inline"><em>o</em><sub><em>i</em></sub></span>, and their
associated probabilities, denoted <span
class="math inline"><em>p</em><sub><em>i</em></sub></span>. Consider a
simple lottery: a coin flip where Alice receives an apple on heads, and
a banana on tails. This lottery has possible outcomes <span
class="math inline"><em>a</em><em>p</em><em>p</em><em>l</em><em>e</em></span>
and <span
class="math inline"><em>b</em><em>a</em><em>n</em><em>a</em><em>n</em><em>a</em></span>,
each with probability <span class="math inline">0.5</span>. If a
different lottery offers a cherry with certainty, it would have only the
possible outcome <span
class="math inline"><em>c</em><em>h</em><em>e</em><em>r</em><em>r</em><em>y</em></span>
with probability <span class="math inline">1</span>. Objective
probabilities are used when the probabilities are known, such as when
calculating the probability of winning in casino games like roulette. In
other cases where objective probabilities are not known, like predicting
the outcome of an election, an individual’s subjective best-guess could
be used instead. So, both uncertain and certain outcomes can be
represented by lotteries.<p>
</p>
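<p>As a purely illustrative aside, a lottery can be represented in code as a list of (outcome, probability) pairs; the small helper below simply checks that the probabilities are non-negative and sum to one.</p>
<pre><code>
import math

# A lottery pairs each possible outcome with its probability.
coin_flip = [("apple", 0.5), ("banana", 0.5)]  # Alice's fair coin flip
certain_cherry = [("cherry", 1.0)]             # a certain outcome

def is_valid_lottery(lottery):
    """Probabilities must be non-negative and sum to one."""
    probs = [p for _, p in lottery]
    return min(probs) >= 0 and math.isclose(sum(probs), 1.0)

print(is_valid_lottery(coin_flip), is_valid_lottery(certain_cherry))  # True True
</code></pre>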
<br>
<div class="visionbox">
<legend class="visionboxlegend">
<p><span><b>A Note on Expected Value vs. Expected Utility</b></span></p>
</legend>
<p>An essential distinction in this chapter is that between expected
value and expected utility.</p>
<p><strong>Expected value is the average outcome of a random
event.</strong> While most lottery tickets have negative expected value,
in rare circumstances they have positive expected value. Suppose a
lottery has a jackpot of 1 billion dollars. Let the probability of
winning the jackpot be 1 in 300 million, and let the price of a lottery
ticket be $2. Then the expected value is calculated by weighting
each possible outcome by its probability of occurrence and adding them together. The two outcomes
are (1) that we win a billion dollars, minus the cost of $2 to play the
lottery, which happens with probability one in 300 million, and (2) that
we are $2 in debt. We can calculate the expected value with the formula:
<span class="math display">$$\frac{1}{300 \text{ million}} \cdot
\left(\$ 1 \text{ billion}-\$ 2\right)+\left(1-\frac{1}{300 \text{
million}}\right) \cdot \left(-\$ 2\right)\approx \$ 1.33.$$</span> The
expected value of the lottery ticket is positive, meaning that, on
average, buying the lottery ticket would result in us receiving <span
class="math inline">$</span>1.33.<p>
Generally, we can calculate expected value by multiplying each outcome
value, <span class="math inline"><em>o</em><sub><em>i</em></sub></span>, with its
probability <span class="math inline"><em>p</em><sub><em>i</em></sub>,</span> and summing
over all <span class="math inline"><em>n</em></span>
possibilities: <span
class="math display"><em>E</em>[<em>L</em>] = <em>o</em><sub>1</sub> ⋅ <em>p</em><sub>1</sub> + <em>o</em><sub>2</sub> ⋅ <em>p</em><sub>2</sub> + ⋯ + <em>o</em><sub><em>n</em></sub> ⋅ <em>p</em><sub><em>n</em></sub>.</span>
<strong>Expected utility is the average utility of a random
event.</strong> Although the lottery has positive expected value, buying
a lottery ticket may still not increase the buyer’s expected utility. Expected
utility is distinct from expected value: instead of summing over the
monetary outcomes (weighing each outcome by its probability), we sum
over the utility the agent receives from each outcome (weighing each
outcome by its probability).<p>
If the agent’s utility function indicates that one “util” is just as
valuable as one dollar, that is <span
class="math inline"><em>u</em>($<em>x</em>) = <em>x</em></span>, then
expected utility and expected value would be the same. But suppose the
agent’s utility function were a different function, such as <span
class="math inline"><em>u</em>($<em>x</em>) = <em>x</em><sup>1/3</sup></span>.
This utility function means that the agent values each additional dollar
less and less as they have more and more money.<p>
For example, if an agent with this utility function already has <span
class="math inline">$</span>500, an extra dollar would increase their
utility by about 0.005, but if they already have <span
class="math inline">$</span>200,000, an extra dollar would increase
their utility by only 0.0001. With this utility function, the expected
utility of this lottery example is negative: <span
class="math display">$$\frac{1}{300 \text{ million}} \cdot \left(1
\text{ billion}-2\right)^{1/3}+\left(1-\frac{1}{300 \text{
million}}\right) \cdot \left(-2\right)^{1/3}\approx -1.26.$$</span>
Consequently, expected value can be positive while expected utility can
be negative, so the two concepts are distinct.<p>
Generally, expected utility is calculated as: <span
class="math display"><em>E</em>[<em>u</em>(<em>L</em>)] = <em>u</em>(<em>o</em><sub>1</sub>) ⋅ <em>p</em><sub>1</sub> + <em>u</em>(<em>o</em><sub>2</sub>) ⋅ <em>p</em><sub>2</sub> + ⋯ + <em>u</em>(<em>o</em><sub><em>n</em></sub>) ⋅ <em>p</em><sub><em>n</em></sub>.</span></p>
</div>
<br>
<p><strong>According to expected utility theory, rational agents make
decisions that maximize expected utility.</strong> Von Neumann and
Morgenstern proposed a set of basic propositions called <em>axioms</em>
that define an agent with rational preferences. When an agent satisfies
these axioms, their preferences can be represented by a von
Neumann-Morgenstern utility function, which is equivalent to using
expected utility to make decisions. While expected utility theory is
often used to model human behavior, it is important to note that it is
an imperfect approximation. In the final section of this chapter, we
present some criticisms of expected utility theory and the vNM
rationality axioms as they apply to humans. However, artificial agents
might be designed along these lines, resulting in an explicit expected
utility maximizer, or something approximating an expected utility
maximizer. The von Neumann-Morgenstern rationality axioms are listed
below with mathematically precise notation for sake of completeness, but
a technical understanding of them is not necessary to proceed with the
chapter.</p>
<p><strong>Von Neumann-Morgenstern Rationality Axioms.</strong> When the
following axioms are satisfied, we can assume a utility function of an
expected utility form, where agents prefer lotteries that have higher
expected utility <span class="citation"
data-cites="vonneumann1947theory">[3]</span>. <span
class="math inline"><em>L</em></span> is a lottery. <span
class="math inline"><em>L</em><sub><em>A</em></sub> ≽ <em>L</em><sub><em>B</em></sub></span>
means that the agent prefers lottery A to lottery B, whereas <span
class="math inline"><em>L</em><sub><em>A</em></sub> ∼ <em>L</em><sub><em>B</em></sub></span>
means that the agent is indifferent between lottery A and lottery B.
These axioms, and the conclusions that can be derived from them, are
contentious, as we will see later on in this chapter. There are six such
axioms, which we can split into two groups.<p>
The classic four axioms are:</p>
<ol>
<li><p>Completeness: The agent can rank their preferences over all
lotteries. For any two lotteries, it must be that <span
class="math inline"><em>L</em><sub><em>A</em></sub> ≽ <em>L</em><sub><em>B</em></sub></span>
or <span
class="math inline"><em>L</em><sub><em>B</em></sub> ≽ <em>L</em><sub><em>A</em></sub></span>.</p></li>
<li><p>Transitivity: If <span
class="math inline"><em>L</em><sub><em>A</em></sub> ≽ <em>L</em><sub><em>B</em></sub></span>
and <span
class="math inline"><em>L</em><sub><em>B</em></sub> ≽ <em>L</em><sub><em>C</em></sub></span>,
then <span
class="math inline"><em>L</em><sub><em>A</em></sub> ≽ <em>L</em><sub><em>C</em></sub></span>.</p></li>
<li><p>Continuity: For any three lotteries, <span
class="math inline"><em>L</em><sub><em>A</em></sub> ≽ <em>L</em><sub><em>B</em></sub> ≽ <em>L</em><sub><em>C</em></sub></span>,
there exists a probability <span
class="math inline"><em>p</em> ∈ [0,1]</span> such that <span
class="math inline"><em>p</em><em>L</em><sub><em>A</em></sub> + (1−<em>p</em>)<em>L</em><sub><em>C</em></sub> ∼ <em>L</em><sub><em>B</em></sub></span>.
This means that the agent is indifferent between <span
class="math inline"><em>L</em><sub><em>B</em></sub></span> and some
combination of the worse lottery <span
class="math inline"><em>L</em><sub><em>C</em></sub></span> and the
better lottery <span
class="math inline"><em>L</em><sub><em>A</em></sub></span>. In practice,
this means that agents’ preferences change smoothly and predictably with
changes in options.</p></li>
<li><p>Independence: The preference between two lotteries is not
impacted by the addition of equal probabilities of a third, independent
lottery to each lottery. That is, <span
class="math inline"><em>L</em><sub><em>A</em></sub> ≽ <em>L</em><sub><em>B</em></sub></span>
is equivalent to <span
class="math inline"><em>p</em><em>L</em><sub><em>A</em></sub> + (1−<em>p</em>)<em>L</em><sub><em>C</em></sub> ≽ <em>p</em><em>L</em><sub><em>B</em></sub> + (1−<em>p</em>)<em>L</em><sub><em>C</em></sub></span>
for any <span
class="math inline"><em>L</em><sub><em>C</em></sub></span>.
A numerical sketch illustrating this axiom appears after the list of axioms below.</p></li>
</ol>
<p>The final two axioms represent relatively obvious characteristics of rational
decision-making, although actual decision-making processes sometimes deviate from these. These axioms are
relatively “weak” and are implied by the previous four.</p>
<ol>
<li><p><span>Monotonicity</span>: Agents prefer higher probabilities of
preferred outcomes.</p></li>
<li><p><span>Decomposability</span>: The agent is indifferent between
two lotteries that share the same probabilities for all the same
outcomes, even if they are described differently.</p></li>
</ol>
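<p>As promised above, here is a purely numerical illustration of the independence axiom. The sketch below uses made-up lotteries over Bernoulli utilities and checks that mixing two lotteries with the same third lottery, at the same probability, never changes which one an expected-utility agent prefers.</p>
<pre><code>
import random

# Hypothetical lotteries, written as (Bernoulli utility, probability) pairs.
L_A = [(10, 0.5), (0, 0.5)]
L_B = [(4, 1.0)]
L_C = [(7, 0.3), (1, 0.7)]

def expected_utility(lottery):
    return sum(u * p for u, p in lottery)

def mix(p, first, second):
    """The compound lottery p * first + (1 - p) * second."""
    return [(u, p * q) for u, q in first] + [(u, (1 - p) * q) for u, q in second]

# Independence: L_A is preferred to L_B if and only if the p-mixtures with L_C
# preserve that preference, for any mixing probability p.
for p in [random.random() for _ in range(5)]:
    prefers_A = expected_utility(L_A) >= expected_utility(L_B)
    prefers_A_mixed = expected_utility(mix(p, L_A, L_C)) >= expected_utility(mix(p, L_B, L_C))
    assert prefers_A == prefers_A_mixed
print("independence holds for all sampled mixing probabilities")
</code></pre>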
<p><strong>Form of von Neumann-Morgenstern utility functions.</strong>
If an agent’s preferences are consistent with the above axioms, their
preferences can be represented by a vNM utility function. This utility
function, denoted by a capital <span
class="math inline"><em>U</em></span>, is simply the expected Bernoulli
utility of a lottery. That is, a vNM utility function takes the
Bernoulli utility of each outcome, multiplies each with its
corresponding probability of occurrence, and then adds everything up.
Formally, an agent’s expected utility for a lottery <span
class="math inline"><em>L</em></span> is calculated as: <span
class="math display"><em>U</em>(<em>L</em>) = <em>u</em>(<em>o</em><sub>1</sub>) ⋅ <em>p</em><sub>1</sub> + <em>u</em>(<em>o</em><sub>2</sub>) ⋅ <em>p</em><sub>2</sub> + ⋯ + <em>u</em>(<em>o</em><sub><em>n</em></sub>) ⋅ <em>p</em><sub><em>n</em></sub>,</span>
so expected utility can be thought of as a weighted average of the
utilities of different outcomes.<p>
This is identical to the expected utility formula we discussed above—we
sum over the utilities of all the possible outcomes, each multiplied by
its probability of occurrence. With Bernoulli utility functions, an
agent prefers <span class="math inline"><em>a</em></span> to <span
class="math inline"><em>b</em></span> if and only if their utility from
receiving <span class="math inline"><em>a</em></span> is greater than
their utility from receiving <span
class="math inline"><em>b</em></span>. With expected utility, an agent
prefers lottery <span
class="math inline"><em>L</em><sub><em>A</em></sub></span> to lottery
<span class="math inline"><em>L</em><sub><em>B</em></sub></span> if and
only if their expected utility from lottery <span
class="math inline"><em>L</em><sub><em>A</em></sub></span> is greater
than from lottery <span
class="math inline"><em>L</em><sub><em>B</em></sub></span>. That is:
<span
class="math display"><em>L</em><sub><em>A</em></sub> ≻ <em>L</em><sub><em>B</em></sub> ⇔ <em>U</em>(<em>L</em><sub><em>A</em></sub>) > <em>U</em>(<em>L</em><sub><em>B</em></sub>).</span>
where the symbol <span class="math inline">≻</span> indicates
preference. The von Neumann-Morgenstern utility function models the
decision making of an agent considering two lotteries as simply
calculating the expected utility of each and choosing whichever is
larger.<p>
</p>
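<p>Returning to Alice’s coin flip from earlier (heads: an apple and a banana; tails: nothing), the sketch below illustrates how a vNM agent compares the flip against each certain fruit, assuming her Bernoulli utilities are additive as in her fruit utility function.</p>
<pre><code>
# Bernoulli utilities for the relevant outcomes (apple 12, banana 10, so the
# bundle of both is worth 22 under Alice's additive utility function).
bernoulli_u = {"apple": 12, "banana": 10, "apple and banana": 22, "nothing": 0}

def U(lottery):
    """vNM utility: the expected Bernoulli utility of a lottery."""
    return sum(bernoulli_u[outcome] * p for outcome, p in lottery)

coin_flip = [("apple and banana", 0.5), ("nothing", 0.5)]

print(U(coin_flip))            # 11.0
print(U([("apple", 1.0)]))     # 12.0 -> Alice prefers the certain apple to the flip
print(U([("banana", 1.0)]))    # 10.0 -> but prefers the flip to the certain banana
</code></pre>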
<br>
<div class="visionbox">
<legend class="visionboxlegend">
<p><span><b>A Note on Logarithms</b></span></p>
</legend>
<p><strong>Logarithmic functions are commonly used as utility
functions.</strong> A logarithm is a mathematical function that
expresses the power to which a given number (referred to as the base)
must be raised in order to produce a value. The logarithm of a number
<span class="math inline"><em>x</em></span> with respect to base <span
class="math inline"><em>b</em></span> is denoted as <span
class="math inline">log<sub><em>b</em></sub><em>x</em></span>, and is
the exponent to which <span class="math inline"><em>b</em></span> must
be raised to produce the value <span
class="math inline"><em>x</em></span>. For example, <span
class="math inline">log<sub>2</sub>8 = 3</span>, because <span
class="math inline">2<sup>3</sup> = 8</span>.<p>
One special case of the logarithmic function, the natural logarithm, has
a base of <span class="math inline"><em>e</em></span> (Euler’s
number, roughly 2.718); in this chapter, it is referred to simply as
<span class="math inline">log </span>. Logarithms have the following
properties, independent of base: <span
class="math inline">log 0 → − ∞</span>, <span
class="math inline">log 1 = 0,</span> <span
class="math inline">log<sub><em>b</em></sub><em>b</em> = 1,</span> and
<span
class="math inline">log<sub><em>b</em></sub><em>b</em><sup><em>a</em></sup> = <em>a</em></span>.<p>
Logarithms have a concave, downward-bending shape, meaning the output grows
more slowly than the input. This shape resembles how humans value resources:
we generally value a good less if we already have more of it.
Under a logarithmic utility function, the marginal value of an additional
unit is inversely proportional to how much of the resource we already have.<p>
</p>
<figure id="fig:logarithms">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/logarithms.png" class="tb-img-full" style="width: 70%"/>
<p class="tb-caption">Figure B.1: Logarithmic functions share several properties, such as being concave and crossing
the x-axis at one. </p>
</figure>
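<p>These properties, and the diminishing-returns shape, can be checked numerically; the brief snippet below is purely illustrative.</p>
<pre><code>
import math

# math.log is the natural logarithm (base e), the "log" used in this chapter.
print(math.log(1))           # 0.0  (log 1 = 0)
print(math.log(math.e))      # 1.0  (log_b b = 1)
print(math.log(2 ** 5, 2))   # ~5.0 (log_b b^a = a)
print(math.log(1e-12))       # very negative (log x tends to -infinity as x tends to 0)

# Concavity: equal additions to the input add less and less to the output.
print(math.log(2) - math.log(1))  # ~0.69
print(math.log(3) - math.log(2))  # ~0.41
print(math.log(4) - math.log(3))  # ~0.29
</code></pre>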
</div>
<br>
<h2 id="st.-petersburg-paradox">B.2.3 St. Petersburg Paradox</h2>
<p>An old man on the streets of St. Petersburg offers gamblers the
following game: he will flip a fair coin repeatedly until it lands on
tails. If the first flip lands tails, the game ends and the gambler
gets $2. If the coin first lands on heads and then lands on tails, the
game ends and the gambler gets $4. The amount of money (the “return”)
will double for each consecutive flip landing heads before the coin
ultimately lands tails. The game concludes when the coin first lands
tails, and the gambler receives the appropriate returns. Now, the
question is, how much should a gambler be willing to pay to play this
game <span class="citation"
data-cites="peterson2019paradox">[4]</span>?<p>
With probability <span class="math inline">$\frac{1}{2}$</span>, the
first toss will land on tails, in which case the gambler wins two
dollars. With probability <span
class="math inline">$$\frac{1}{4}$$</span>, the first toss lands heads and
the second lands tails, and the gambler wins four dollars.
Extrapolating, this game offers a payout of: <span
class="math display">$$\$ 2^{n} = \$ \overbrace{2 \cdot 2 \cdot 2\cdots
2 \cdot 2 \cdot 2}^{n \text{ times}},$$</span> where <span
class="math inline"><em>n</em></span> is the number of flips until and
including when the coin lands on tails. As offered, though, there is no
limit to the size of <span class="math inline"><em>n</em></span>, since
the old man promises to keep flipping the coin until it lands on tails.
The expected payout of this game is therefore: <span
class="math display">$$E\left[L\right] =\frac{1}{2} \cdot \$
2+\frac{1}{4} \cdot \$ 4+\frac{1}{8} \cdot \$ 8+\cdots = \$ 1+\$ 1+\$
1+\cdots = \$ \infty.$$</span> Bernoulli described this situation as a
paradox because he believed that, despite it having infinite expected
value, anyone would take a large but finite amount of money over the
chance to play the game. While paying <span
class="math inline">$</span>10,000,000 to play this game would not be
inconsistent with its expected value, we would think it highly
irresponsible! The paradox reveals a disparity between expected value
calculations and reasonable human behavior.<p>
</p>
<figure id="fig:stpetersburg">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/stpetersburg.png" class="tb-img-full" style="width: 70%"/>
<p class="tb-caption">Figure B.2: Winnings from the St. Petersburg Paradox double after each coin toss, offering small
likelihoods of big prizes.</p>
</figure>
<p><strong>Logarithmic utility functions can represent decreasing
marginal utility.</strong> A number of ways have been proposed to
resolve the St. Petersburg paradox. We will focus on the most popular:
representing the player with a utility function instead of merely
calculating expected value. As we discussed in the previous section, a
logarithmic utility function seems to resemble how humans think about
wealth. As a person becomes richer, each additional dollar gives them
less satisfaction than before. This concept, called decreasing marginal
utility, makes sense intuitively: a billionaire would not be as
satisfied winning $1000 as someone with significantly less money.
Wealth, and many other resources like food, have such diminishing
returns. While a first slice of pizza is incredibly satisfying, a second
one is slightly less so, and few people would continue eating to enjoy a
tenth slice of pizza.<p>
Assuming an agent with a utility function <span
class="math inline"><em>u</em>($<em>x</em>) = log<sub>2</sub>(<em>x</em>)</span>
over <span class="math inline"><em>x</em></span> dollars, we can
calculate the expected utility of playing the St. Petersburg game as:
<span class="math display">$$E\left[U\left(L\right)\right] =\frac{1}{2}
\cdot \log_{2}(2)+\frac{1}{4} \cdot \log_{2}(4)+\frac{1}{8} \cdot
\log_{2}(8)+\cdots = 2.$$</span> That is, the expected utility of the
game is 2. From the logarithmic utility function over wealth, we know
that: <span
class="math display">2 = log<sub>2</sub><em>x</em> ⇒ <em>x</em> = 4,</span>
which implies that the player is indifferent between playing this game
and having $4: the level of wealth that gives them the same utility as
what they expect playing the lottery.</p>
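<p>A short numerical sketch (added for illustration) makes the contrast explicit: the truncated expected value of the game grows without bound as more flips are allowed, while the expected log<sub>2</sub> utility converges to 2, i.e. a certainty equivalent of $4.</p>
<pre><code>
import math

def st_petersburg(max_flips):
    """Truncated expected value and expected log2 utility of the game."""
    ev, eu = 0.0, 0.0
    for n in range(1, max_flips + 1):
        prob, payout = 0.5 ** n, 2 ** n   # tails first occurs on flip n
        ev += prob * payout               # each term adds exactly $1
        eu += prob * math.log2(payout)    # utility terms shrink geometrically
    return ev, eu

for flips in (10, 20, 40):
    print(flips, st_petersburg(flips))
# Expected value grows by $1 per allowed flip (diverging to infinity), while
# expected utility approaches 2, so the certainty equivalent is 2**2 = $4.
</code></pre>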
<p><strong>Expected utility is more reasonable than expected
value.</strong> The previous calculation explains why an agent with
<span
class="math inline"><em>u</em>($<em>x</em>) = log<sub>2</sub><em>x</em></span>
should not pay large amounts of money to play the St. Petersburg game.
The log utility function implies that the player receives diminishing
returns to wealth, and cares less about situations with small chances of
winning huge sums of money. Figure B.3 shows how the large payoffs with
small probability, despite having the same expected value, contribute
little to expected utility. This feature captures the human tendency
towards risk aversion, explored in the next section. Note that while
logarithmic utility functions are a useful model (especially in
resolving such paradoxes), they do not perfectly describe human behavior
across choices, such as the tendency to buy lottery tickets, which we
will explore in the next chapter.<p>
</p>
<figure id="fig:flip">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/stpetersburg_ev_eu.png" class="tb-img-full" style="width: 70%"/>
<p class="tb-caption">Figure B.3: In the St. Petersburg Paradox, each subsequent flip has the same expected value but
expected utility falls sharply.</p>
</figure>
<p><strong>Summary.</strong> In this section, we examined the properties
of Bernoulli utility functions, which allow us to compare an agent’s
preferences across different outcomes. We then introduced von
Neumann-Morgenstern utility functions, which calculate the average, or
expected, utility over different possible outcomes. From there, we
derived the idea that rational agents are able to make decisions that
maximize expected utility. Through the St. Petersburg Paradox, we showed
that taking the expected utility of a logarithmic function leads to more
reasonable behavior. Having understood some properties of utility
functions, we can now examine the problem of incorrigibility, where AI
systems do not accept corrective interventions because of rigid
preferences.</p>
<h1 id="tail-risk-corrigibility">B.3 Tail Risk: Corrigibility</h1>
<p><strong>Overview.</strong> In this section, we will explore how
utility functions provide insight into whether an AI system is open to
corrective interventions and discuss related implications for AI risks.
The von Neumann-Morgenstern (vNM) axioms of completeness and
transitivity can lead to strict preferences over shutting down or being
shut down, which affects how easily an agent can be corrected. We will
emphasize the importance of developing corrigible AI systems that are
responsive to human feedback and that can be safely controlled to
prevent unwanted AI behavior.</p>
<p><strong>Corrigibility measures our ability to correct an AI if and
when things go wrong.</strong> An AI system is <em>corrigible</em> if it
accepts and cooperates with corrective interventions like being shut
down or having its utility function changed <span class="citation"
data-cites="pacecorrigibility">[5]</span>. Without many assumptions, we
can argue that typical rational agents will resist corrective measures:
changing an agent’s utility function necessarily means that the agent
will pursue goals that result in less utility relative to their current
preferences.</p>
<p><strong>Suppose we own an AI that fetches coffee for us every
morning.</strong> Its utility function assigns “10 utils” to getting us
coffee quickly, “5 utils” to getting us coffee slowly, and “0 utils” to
not getting us coffee at all. Now, let’s say we want to change the AI’s
objective to instead make us breakfast. A regular agent would resist
this change, reasoning that making breakfast would mean it is less able
to efficiently make coffee, resulting in lower utility. However, a
corrigible AI would recognize that making breakfast could be just as
valuable to humans as fetching coffee and would be open to the change in
objective. The AI would move on to maximizing its new utility function.
In general, corrigible AIs are more amenable to feedback and
corrections, rather than stubbornly adhering to their initial goals or
directives. When AIs are corrigible, humans can more easily correct
rogue actions and prevent any harmful or unwanted behavior.</p>
<p><strong>Completeness and transitivity imply that an AI has strict
preferences over shutting down.</strong> Assume that an agent’s
preferences satisfy the vNM axioms of completeness, such that it can
rank all options, as well as transitivity, such that its preferences are
consistent. For instance, if the AI prefers an apple to a
banana and a banana to a cherry, it must prefer an apple to a
cherry. Then, we know that the agent’s utility function ranks every
option.<p>
Consider again the coffee-fetching AI. Suppose that in addition to
getting us coffee quickly (10 utils), getting us coffee slowly (5
utils), and not getting us coffee (0 utils), there is a fourth option,
where the agent gets shut down immediately. The AI expects that
immediate shutdown will result in its owner getting coffee slowly
without AI assistance, which it values at 5 units of utility
(the same as getting us coffee slowly). The agent thus strictly
prefers getting us coffee quickly to shutting down, and strictly prefers
shutting down to us not having coffee at all.<p>
Generally, unless indifferent between everything, completeness and
transitivity imply that the AI has strict preferences about
potentially shutting down <span class="citation"
data-cites="thornley2023shutdown">[6]</span>. Without completeness, the
agent could have no preference between shutting down immediately and all
other actions. Without transitivity, the agent could be indifferent
between shutting down immediately and all other possible actions without
that implying that the agent is indifferent between all possible
actions.</p>
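<p>To make this concrete, here is a hypothetical sketch (our illustration, not a model from the text) of a naive expected-utility maximizer choosing between complying with a possible shutdown and disabling its off-switch; because shutdown is worth less utility than fetching coffee quickly, the maximizer resists.</p>
<pre><code>
# Hypothetical utilities from the coffee example: 10 for fetching coffee
# quickly, 5 for an immediate shutdown (the owner then makes coffee slowly),
# matching the utilities described above.
U_FAST, U_SHUTDOWN = 10, 5

# Each available action leads to a lottery of (utility, probability) pairs.
actions = {
    # Comply: accept, say, a 50% chance of being shut down before finishing.
    "comply_with_possible_shutdown": [(U_SHUTDOWN, 0.5), (U_FAST, 0.5)],
    # Resist: disable the off-switch so shutdown never happens.
    "disable_off_switch": [(U_FAST, 1.0)],
}

def expected_utility(lottery):
    return sum(u * p for u, p in lottery)

best = max(actions, key=lambda a: expected_utility(actions[a]))
print(best)  # disable_off_switch: the naive maximizer resists shutdown
</code></pre>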
<p><strong>It is bad if an AI either increases or reduces the
probability of immediate shutdown.</strong> Suppose that in trying to
get us coffee quickly, the AI drives at unsafe speeds. We’d like to shut
down the AI until we can reprogram it safely. A corrigible AI would
recognize our intention to shut down as a signal that it is misaligned.
However, an incorrigible AI would instead stay the course with what it
wanted to do initially—get us coffee—since that results in the most
utility. If possible, the AI would decrease the probability of immediate
shutdown, say by disabling its off-switch or locking the entrance to its
server rooms. Clearly, this would be bad.<p>
Consider a different situation where the AI realizes that making coffee
is actually quite difficult and that we would make coffee faster
manually, but fails to realize that we don’t want to exert the effort to
do so. The AI may then try to shut down, so that we’d have to make the
coffee ourselves. Suppose we tell the AI to continue making coffee at
its slow pace, rather than shut down. A corrigible AI would recognize
our instruction as a signal that it is misaligned and would continue to
make coffee. However, an incorrigible AI would instead stick with its
decision to shut down without our permission, since shutting down
provides it more utility. Clearly, this is also bad. We’d like to be
able to alter AIs without facing resistance.</p>
<p><strong>Summary.</strong> In this section, we introduced the concept
of corrigibility in AI systems. We discussed the relevance of utility
functions in determining corrigibility, particularly challenges that
arise if an AI’s preferences are complete and transitive, which can lead
to strict preferences over shutting down. We explored the potential
problems of an AI system reducing or increasing the probability of
immediate shutdown. The takeaway is that developing corrigible AI
systems—systems that are responsive and adaptable to human feedback and
changes—is essential in ensuring safe and effective control over AIs’
behavior. Examining the properties of utility functions illuminates
potential problems in implementing corrigibility.<p>
</p>
<br>
<div class="visionbox">
<legend class="visionboxlegend">
<p><span><b>A Note on Utility Functions vs. Reward Functions</b></span></p>
</legend>
<p> Utility
functions and reward functions are two interrelated yet distinct
concepts in understanding agent behavior. Utility functions represent an
agent’s preferences about states or the choice-worthiness of a state,
while reward functions represent externally imposed reinforcement. The
fact that an outcome is rewarded externally does not guarantee that it
will become part of an agent’s internal utility function.<p>
An example where utility and reinforcement come apart can be seen with
Galileo Galilei. Despite the safety and societal acceptance he could
gain by conforming to the widely accepted geocentric model, Galileo
maintained his heliocentric view. His environment provided ample
reinforcement to conform, yet he deemed the pursuit of scientific truth
more choiceworthy, highlighting a clear difference between environmental
reinforcement and the concepts of choice-worthiness or utility.<p>
As another example, think of evolutionary processes as selecting or
reinforcing some traits over others. If we considered taste buds as
components that help maximize fitness, we would expect more people to
want the taste of salads over cheeseburgers. However, it is more
accurate to view taste buds as “adaptation executors” rather than
“fitness maximizers,” as taste buds evolved in our ancestral environment
where calories were scarce. This illustrates the concept that agents act
on adaptations without necessarily adopting behavior that reliably helps
maximize reward.<p>
The same could be true for reinforcement learning agents. RL agents
might execute learned behaviors without necessarily maximizing reward;
they may form <em>decision procedures</em> that are not fully aligned
with their reinforcement. The fact that what is rewarded is not
necessarily what an agent thinks is choiceworthy could lead to AIs that
are not fully aligned with externally designed rewards. The AI might not
inherently consider reinforced behaviors as choiceworthy or of high
utility, so its utility function may differ from the one we want it to
have.<p>
</p>
</div>
<br>
<h1 id="attitudes-to-risk">B.4 Attitudes to Risk</h1>
<p><strong>Overview.</strong> The concept of risk is central to the
discussion of utility functions. Knowing an agent’s attitude towards
risk—whether they like, dislike, or are indifferent to risk—gives us a
good idea of what their utility function looks like. Conversely, if we
know an agent’s utility function, we can also understand their attitude
towards risk. We will first outline the three attitudes towards risk:
risk aversion, risk neutrality, and risk seeking. Then, we will consider
some arguments for why we might adopt each attitude, and provide
examples of situations where each attitude may be appropriate.<p>
It is crucial to understand what risk attitudes are appropriate in which
contexts. To make AIs safe, we will need to give them safe risk
attitudes, such as by favoring risk-aversion over risk-neutrality. Risk
attitudes will help explain how people do and should act in different
situations. National governments, for example, will differ in risk
outlook from rogue states, and big tech companies will differ from
startups. Moreover, we should know how risk averse we should be with AI
development, as it has both large upsides and downsides.</p>
<h2 id="what-are-the-different-attitudes-to-risk">B.4.1 What Are the Different
Attitudes to Risk?</h2>
<p><strong>There are three broad types of risk preferences.</strong>
Agents can be risk averse, risk neutral, or risk seeking. In this
section, we first explore what these terms mean. We consider a few
equivalent definitions by examining different concepts associated with
risk <span class="citation" data-cites="dixitslides">[7]</span>. Then,
we analyze what the advantages of adopting each attitude toward
risk might be.</p>
<p><strong>Let’s consider these in the context of a bet on a coin
toss.</strong> Suppose agents are given the opportunity to bet <span
class="math inline">$</span>1000 on a fair coin toss—upon guessing
correctly, they would receive <span class="math inline">$</span>2000 for
a net gain of <span class="math inline">$</span>1000. However, if they
guess incorrectly, they would receive nothing and lose their initial bet
of <span class="math inline">$</span>1000. The expected value of this
bet is <span class="math inline">$</span>0, irrespective of who is
playing: the player gains or loses <span
class="math inline">$</span>1000 with equal probabilities. However, a
particular player’s willingness to take this bet, reflecting their risk
attitude, depends on how they calculate expected utility.</p>
<ol>
<li><p><em>Risk aversion</em> is the tendency to prefer a certain
outcome over a risky option with the same expected value. A risk-averse
agent would not want to participate in the coin toss. The individual is
unwilling to take the risk of a potential loss in order to potentially
earn a higher reward. Most humans are instinctively risk averse. A
common example of a risk-averse utility function is <span
class="math inline"><em>u</em>(<em>x</em>) = log <em>x</em></span> (red
line in Figure B.4).</p></li>
<li><p><em>Risk neutrality</em> is the tendency to be indifferent
between a certain outcome and a risky option with the same expected
value. For such players, expected utility is proportional to expected
value. A risk-neutral agent would not care whether they were offered
this coin toss, as its expected value is zero. If the expected value
were negative, they would prefer not to participate in the lottery;
conversely, if it were positive, they would prefer to
participate. The simplest risk-neutral utility function is
<span class="math inline"><em>u</em>(<em>x</em>) = <em>x</em></span>
(blue line in Figure B.4).</p></li>
<li><p><em>Risk seeking</em> is the tendency to prefer a risky option
over a sure thing with the same expected value. A risk-seeking agent
would be happy to participate in this lottery. The individual is willing
to take the risk of a potential loss in order to potentially earn a higher reward.
We tend to associate risk seeking with irrationality, as it leads to
lower wealth through repeated choices made over time. However, this is
not necessarily the case. An example of a risk-seeking utility function
is <span
class="math inline"><em>u</em>(<em>x</em>) = <em>x</em><sup>2</sup></span>
(green line in Figure B.4).</p></li>
</ol>
<p>We can define each risk attitude in three equivalent ways. Each draws
on a different aspect of how we represent an agent’s preferences.</p>
<figure id="fig:risk-att">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/utilityfunctionsrisk.png" class="tb-img-full" style="width: 70%"/>
<p class="tb-caption">Figure B.4: Concave, linear, and convex utility functions model risk averse, risk neutral, and risk
seeking agents’ preferences.</p>
</figure>
<p><strong>Risk attitudes are fully explained by how an agent values
uncertain outcomes.</strong> According to expected utility theory, an
agent’s risk preferences can be understood from the shape of their
utility function, and vice-versa. We will illustrate this point by
showing that concave utility functions necessarily imply risk aversion.
An agent with a concave utility function faces decreasing marginal
utility. That is, the jump from <span class="math inline">$</span>1000
to <span class="math inline">$</span>2000 is less satisfying than the
jump from wealth <span class="math inline">$</span>0 to wealth <span
class="math inline">$</span>1000. Conversely, the agent dislikes
dropping from wealth <span class="math inline">$</span>1000 to wealth
<span class="math inline">$</span>0 more than they like jumping from
wealth <span class="math inline">$</span>1000 to wealth <span
class="math inline">$</span>2000. Thus, the agent will not enter the
aforementioned double-or-nothing coin toss, displaying risk
aversion.</p>
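<p>The sketch below (illustrative only) evaluates the $1,000 double-or-nothing coin toss against keeping $1,000 for sure, under the three utility functions of Figure B.4, assuming a starting wealth of $1,000 that is entirely staked on the bet.</p>
<pre><code>
import math

# Final wealth after the bet: heads doubles the $1,000 stake, tails loses it.
gamble = [(2000, 0.5), (0, 0.5)]
sure_thing = [(1000, 1.0)]

def log_u(x):
    # Risk-averse (concave) utility; log 0 is treated as negative infinity.
    return math.log(x) if x > 0 else float("-inf")

utility_functions = {
    "risk averse (log x)": log_u,
    "risk neutral (x)": lambda x: x,
    "risk seeking (x^2)": lambda x: x ** 2,
}

for name, u in utility_functions.items():
    eu_gamble = sum(u(x) * p for x, p in gamble)
    eu_sure = sum(u(x) * p for x, p in sure_thing)
    if eu_gamble > eu_sure:
        verdict = "takes the bet"
    elif eu_gamble == eu_sure:
        verdict = "is indifferent"
    else:
        verdict = "declines the bet"
    print(name, verdict)
# risk averse declines, risk neutral is indifferent, risk seeking takes the bet
</code></pre>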
<p><strong>Preferences over outcomes may not fully explain risk
attitudes.</strong> It may seem unintuitive that risk attitudes are
entirely explained by how humans calculate utility of outcomes. As we
just saw, in expected utility theory, it is assumed that agents are risk
averse only because they have diminishing returns to larger outcomes.
Many economists and philosophers have countered that people also have an
inherent aversion to risk that is separate from preferences over
outcomes. At the end of this chapter, we will explore how non-expected
utility theories have attempted to more closely capture human behavior
in risky situations.</p>
<h2 id="risk-and-decision-making">B.4.2 Risk and Decision Making</h2>
<p><strong>Overview.</strong> Having defined risk attitudes, we will now
consider situations where it is appropriate to act in a risk-averse,
risk-neutral, or risk-seeking manner. Often, our risk approach in a
situation aligns with our overall risk preference—if we are risk averse
in day-to-day life, then we will also likely be risk averse when
investing our money. However, sometimes we might want to make decisions
as if we have a different attitude towards risk than we truly do.</p>
<p><strong>Criterion of rightness vs. decision procedure.</strong>
Philosophers distinguish between a <em>criterion of rightness</em>, the
way of judging whether an outcome is good, and a <em>decision
procedure</em>, the method of making decisions that lead to the good
outcomes. A good criterion of rightness may not be a good decision
procedure. This is related to the gap between theory and practice, as
explicitly pursuing an ideal outcome may not be the best way to achieve
it. For example, a criterion of rightness for meditation might be to
have a mind clear of thoughts. However, as a decision procedure,
thinking about not having thoughts may not help the meditator achieve a
clear mind—a better decision procedure would be to focus on the
breath.<p>
As another example, the <em>hedonistic paradox</em> reminds us that
people who directly aim at pleasure rarely secure it <span
class="citation" data-cites="sidgwick2019methods">[8]</span>. While a
person’s pleasure level could be a criterion of rightness, it is not
necessarily a good guide to increasing pleasure—that is, not necessarily
a good decision procedure. Whatever one’s vision of pleasure looks
like—lying on a beach, buying a boat, consuming drugs—people who
directly aim at pleasure often find these things are not as pleasing as
hoped. People who aim at meaningful experiences, helping others and
engaging in activities that are intrinsically worthwhile, are more
likely to be happy. People tend to get more happiness out of life when
not aiming explicitly for happiness but for some other goal. Using the
criterion of rightness of happiness as a decision procedure can
predictably lead to unhappiness.<p>
Maximizing expected value can be a criterion of rightness, but it is not
always a good decision procedure. In the context of utility, we observe
a similar discrepancy where explicitly pursuing the criterion of
rightness (maximizing the utility function) may not lead to the best
outcome. Suppose an agent is risk neutral, such that their criterion of
rightness is maximizing a linear utility function. In the first
subsection, we will explore how they might be best served by making
decisions as if they are risk averse, such that their decision procedure
is maximizing a concave utility function.</p>
<h3 id="why-be-risk-averse">Why Be Risk Averse?</h3>
<p><strong>Risk-averse behavior is ubiquitous.</strong> In this section,
we will explore the advantages of risk aversion and how it can be a good
way to advance goals across different domains, from evolutionary fitness
to wealth accumulation. It might seem that by behaving in a risk-averse
way, thereby refusing to participate in some positive expected value
situations, agents leave a lot of value on the table. Indeed, extreme
risk aversion may be counterproductive—people who keep all their money
as cash under their bed will lose value to inflation over time. However,
as we will see, there is a sweet spot that balances the safety of
certainty and value maximization: risk-averse agents with logarithmic
utility almost surely outperform other agents over time, under certain
assumptions.</p>
<p><strong>Response to computational limits.</strong> In complex
situations, decision makers may not have the time or resources to
thoroughly analyze all options to determine the one with the highest
expected value. This problem is further complicated when the outcomes of
some risks we take have effects on other decisions down the line, like
how risky investments may affect retirement plans. To minimize these
complexities, it may be rational to be risk averse. This helps us avoid
the worst effects of our incomplete estimates when our uncertain
calculations are seriously wrong.<p>
Suppose Martin is deciding between purchasing a direct flight or two
connecting flights with a tight layover. The direct flight is more
expensive, but Martin is having trouble estimating the likelihood and
consequences of missing his connecting flight. He may prefer to play the
situation safe and pay for the more expensive direct flight, even though
the true value-for-money of the connected route may have been higher.
Now Martin can confidently make future decisions like booking a bus from
the airport to his hotel. Risk-averse decision making not only reduces
computational burden, but can also increase decision-making speed.
Instead of constantly making difficult calculations, an agent may prefer
to have a bias against risk.</p>
<p><strong>Behavioral advantage.</strong> Risk aversion is not only a
choice but a fundamental psychological phenomenon, and is influenced by
factors such as past experiences, emotions, and cognitive biases. Since
taking risks could lead to serious injury or death, agents undergoing
natural selection usually develop strategies to avoid such risks
whenever possible. Humans often shy away from risk, prioritizing safety
and security over riskier ventures, even when the potential rewards are
higher.<p>
Studies have shown that animals across diverse species exhibit
risk-averse behaviors. In a study conducted on bananaquits, a
nectar-drinking bird, researchers presented the birds with a garden
containing two types of flowers: one with consistent amounts of nectar
and one with variable amounts. They found that the birds never preferred
the latter, and that their preference for the consistent variety was
intensified when the birds were provided fewer resources in total <span
class="citation" data-cites="wunderle1987risk">[9]</span>. This risk
aversion helps the birds survive and procreate, as risk-neutral or
risk-seeking species are more likely to die out over time: it is much
worse to have no nectar than it is better to have double the nectar.
Risk aversion is often seen as a survival mechanism.</p>
<p><strong>Natural selection favors risk aversion.</strong> Just as
individual organisms demonstrate risk aversion, entire populations are
pushed by natural selection towards risk-averse behavior that maximizes
the expected logarithm of their growth rather than its expected value.
Consider the following, highly simplified example.
Suppose there are three types of animals—antelope, bear, crocodile—in an
area where each year is either scorching or freezing with probability
0.5. Every year, the populations grow or shrink depending on the
weather—some animals are better suited to the hot weather, and some to
the cold. The populations’ per-capita offspring, or equivalently the
populations’ growth multipliers, are shown in the table below.<p>
</p>
<p>Antelope have the same growth in each state, bears grow faster in the
warmth but slower in the cold when they hibernate, and crocodiles grow
rapidly when it is scorching and animals gather near water sources but
die out when their habitats freeze over. Notice, however, that the three
populations have the same average growth multiplier of 1.1.<p>
This average is misleading. Suppose we observe these populations over
two periods, one hot followed by one cold. The average growth multiplier
over these two periods would be 1.1 for every species, but this does not
mean that they all grow by the same amount. In the table below, we can
see the animals’ growth over time.<p>
</p>
<p>Adding the logarithms of each species’ hot and cold growth rates
indicates its long-term growth trajectory. The antelope population will
continue growing no matter what, compounding over time. However, the
crocodile population will not—as soon as it enters a cold year, the
crocodiles go permanently extinct. The bear population faces no
immediate extinction risk, but over time it will likely shrink towards
extinction. Notice that maximizing long-run growth in this case is
equivalent to maximizing the sum of the logarithms of the growth
rates—this is risk aversion. The stable-growth population, or
equivalently the risk-averse population, is favored by natural selection
<span class="citation" data-cites="okasha2007rational">[10]</span>.</p>
<p><strong>Avoid risk of ruin.</strong> Risk aversion’s key benefit is
that it avoids risk of ruin. Consider a repeated game of
equal-probability “triple-or-nothing” bets. That is, players are offered a
<span class="math inline">$$\frac{1}{2}$$</span> probability of tripling
their initial wealth <span class="math inline"><em>w</em></span>, and a
<span class="math inline">$$\frac{1}{2}$$</span> probability of losing it
all. A risk-neutral player can calculate the expected value of a single
round as:<p>
<span class="math display">$$\frac{1}{2} \cdot 0+\frac{1}{2} \cdot 3w =
1.5w.$$</span> Since the expected value is greater than the player’s
initial wealth, a risk-neutral player would bet their entire wealth on
the game. Additionally, if offered this bet repeatedly, they would
reinvest everything they had in it each time. The expected value of
taking this bet <span class="math inline"><em>n</em></span> times in a
row, reinvesting all winnings, would be:<p>
<span class="math display">$$\frac{1}{2} \cdot 0+\frac{1}{4} \cdot
0+\cdots +\frac{1}{2^{n}} \cdot 0+\frac{1}{2^{n}} \cdot 3^{n} \cdot w =
(1.5)^{n}w.$$</span> If the agent were genuinely offered this bet as many
times as they wanted, they would continue to reinvest everything
indefinitely, which gives an expected value of:<p>
<span class="math display">$$\lim_{n \to \infty} (1.5)^{n} w = \infty.$$</span>
This is another infinite expected value game—just like in the St.
Petersburg Paradox! However, notice that this calculation is again
heavily skewed by a single, low-probability branch in which an extremely
lucky individual continues to win, exponentially increasing their
wealth. In the figure below, we show the first four bets in this
strategy with a starting wealth of 16. Only along the cyan branch does
the player win any money, and this branch quickly becomes
astronomically improbable. In real life, we would rarely choose to
repeatedly stake everything we own on triple-or-nothing bets. We are
risk averse when dealing with high probabilities of losing all our
money. Acting risk neutral and relying on expected value would be a poor
decision-making strategy.<p>
</p>
<figure id="fig:risk-neutral-betting">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/riskaversebetting.png" class="tb-img-full" style="width: 70%"/>
<p class="tb-caption">Figure B.5: Risk-neutral betting can lead to ruin.</p>
</figure>
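<p>A minimal simulation sketch makes the figure’s point concrete; the player count, number of rounds, and random seed below are arbitrary illustrative choices. When many players each repeatedly stake their entire wealth, the average final wealth is large, yet almost every individual player ends up ruined.</p>
<pre><code># Minimal sketch: many players repeatedly betting everything on triple-or-nothing.
import random

random.seed(0)
players, rounds, start_wealth = 10_000, 10, 16.0

final_wealths = []
for _ in range(players):
    wealth = start_wealth
    for _ in range(rounds):
        # Each round: half the time the stake triples, half the time it is lost.
        wealth = 3 * wealth if random.getrandbits(1) else 0.0
    final_wealths.append(wealth)

# The expectation is 16 * 1.5**10 (about 923), but the sample average is noisy
# because it rests on the handful of players who never lost.
average = sum(final_wealths) / players
ruined = sum(w == 0.0 for w in final_wealths)  # roughly 1 - 1/2**10 of players
print(f"average final wealth: {average:.1f}")
print(f"fraction of players ruined: {ruined / players:.1%}")
</code></pre>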
<p><strong>Maximizing logarithmic utility is a better decision
procedure.</strong> Agents might want to act as if they were maximizing
the logarithm of their wealth rather than its expected value. A
logarithmic utility function avoids risk of ruin because it assigns a
utility of negative infinity to the outcome of zero wealth, since <span
class="math inline">log 0 → − ∞</span>. Therefore, an agent with a
logarithmic utility function in wealth will never participate in a
lottery that could, however unlikely, leave them with zero wealth.
The logarithmic function also grows slowly, placing less weight on very
unlikely, high-payout branches, a property that we used to resolve the
St. Petersburg Paradox. While we might have preferences that are linear
over wealth (our criterion of rightness), we might be better served by a
different decision procedure: maximizing the logarithm of
wealth rather than maximizing wealth directly.</p>
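<p>To make this concrete with the triple-or-nothing bet above: the expected log utility of staking all of one’s wealth <span class="math inline"><em>w</em></span> is
<span class="math display">$$\frac{1}{2} \cdot \log(0)+\frac{1}{2} \cdot \log(3w) = -\infty,$$</span>
so an agent maximizing the logarithm of wealth refuses to stake everything, no matter how large the bet’s expected monetary value.</p>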
<p><strong>Maximizing the logarithm of wealth maximizes every percentile
of wealth.</strong> Maximizing logarithmic utility avoids risk of ruin
because investors never stake their entire wealth on a single
opportunity, much as diversifying across multiple assets avoids
over-investing in any one of them. Instead of
maximizing average wealth (as expected value does), maximizing the
logarithmic utility of wealth maximizes other measures associated with
the distribution of wealth. In fact, doing so maximizes the median,
which is the 50th percentile of wealth, and it also delivers the highest
value at any arbitrary percentile of wealth. It even maximizes the
mode—the most likely outcome. Mathematically, maximizing a logarithmic
utility function in wealth outperforms any other investment strategy in
the long run, with probability one (certainty) <span class="citation"
data-cites="kelly1956new">[11]</span>. Thus, variations on maximizing the
logarithm of wealth are widely used in the financial sector.<p>
</p>
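<p>As a hedged sketch under the same triple-or-nothing payoffs, the code below numerically searches for the fixed fraction of wealth to stake each round that maximizes expected log growth; the function name and grid resolution are our own illustrative choices. Staking everything scores negative infinity, while staking about a quarter of one’s wealth each round comes out best, which is the kind of fractional staking that the logarithmic-wealth strategies described above prescribe.</p>
<pre><code># Hedged sketch: optimal fixed betting fraction for the triple-or-nothing bet
# under a logarithmic utility (a simple grid search; names are illustrative).
import math

def expected_log_growth(f):
    """Expected log growth per round when staking a fraction f of wealth."""
    if f >= 1.0:
        return float("-inf")  # staking everything risks log(0)
    # Win (prob 1/2): the stake triples, so wealth is multiplied by (1 + 2f).
    # Lose (prob 1/2): the stake is gone, so wealth is multiplied by (1 - f).
    return 0.5 * math.log(1 + 2 * f) + 0.5 * math.log(1 - f)

fractions = [i / 1000 for i in range(1000)]
best = max(fractions, key=expected_log_growth)
print(f"best fraction to stake: {best:.3f}")                                      # about 0.25
print(f"expected log growth at that fraction: {expected_log_growth(best):.4f}")   # about 0.059
print(f"expected log growth staking everything: {expected_log_growth(1.0)}")      # -inf
</code></pre>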
<h3 id="why-be-risk-neutral">Why Be Risk Neutral?</h3>
<p><strong>Risk neutrality is equivalent to acting on the expected
value.</strong> Since averages are straightforward and widely taught,
expected value is the most widely known explicit decision-making
procedure. However, despite expected value calculations being a common
concept in popular discourse, situations where agents do and should act
risk neutral are limited. In this section, we will first look at the
conditions under which risk neutrality might be a good decision
procedure—in such cases, maximizing expected value can be a significant
improvement over being too cautious. However, being mistaken about