The two-envelope paradox and the limitations of Bayesian inference

January 8, 2017

In this post, I use the two-envelope paradox as a tool to explore high-level properties of human probabilistic inference.


I was getting drinks with some friends of mine in the AI department the other day when the infamous two-envelope paradox came up. Most people were stumped by it, and I got to thinking about how the inherent difficulty of the problem for humans presents a good opportunity to analyze properties of human probabilistic inference, much as optical illusions allow us to investigate properties of human vision [1]. But before we explore what the problem tells us about human probabilistic inference, let's see the problem itself:

The Problem

You're in middle school and your neighbor asks you to come by and help him mow his lawn. When you show up, he's not there, and you realize you'd forgotten to negotiate a price up front. Nonetheless, you decide to get started before he comes back rather than sit idly, and he arrives just about the time you finish. When you ask for payment, he decides to get clever. He sets out two indistinguishable white envelopes on a table, and tells you that you can pick one and take the cash in it. Having already done the work, you are in no position to negotiate, and walk towards the table to pick up one of the envelopes. But as you start walking to the table, your neighbor says, "I will tell you that one envelope has twice the amount of cash as the other. After you open the first one, you can switch if you want". You open the envelope on the left and find $10. But now you start eyeing the other envelope and imagine all the things you could do if it contained $20. But of course, you would be quite disappointed to leave with only $5. Assuming that you want to maximize your expected reward, the question is, should you switch?

The Paradox

Argument a: The argument for switching says that you had a 50% chance of picking up an envelope with the greater sum of money, and a 50% chance of picking up the lesser envelope. Thus, there is a 50% chance you will get $20 by switching, and a 50% chance you'll get $5. That means the expected value if you switch is $12.50. If you stay with your current envelope, you'll leave with a guaranteed $10 - so switching is a no-brainer, right?

Argument b: But something goes wrong if we assume that argument (a) is true. Imagine that instead, you had picked up the envelope on the right first, and opened it to find X dollars. You note that argument (a) had no dependence on the exact value of X, so it would apply just as well and you would rationally choose to switch. But now things get suspicious - if you didn't even care what X was for the right envelope, maybe you could have decided to switch without opening it at all - and then you'd be right back where you started, opening the left envelope first, and switching to the right. If we're never using any information from the opened envelope, switching can't have higher expected value than not switching, and we might as well stick with our first choice. Switching is an asymmetric strategy that can't always be the right answer due to the inherent symmetry of the problem.
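The symmetry claim in argument (b) is easy to check numerically. Below is a minimal sketch; the prior over the lesser amount (uniform on $5, $10, $20) is an arbitrary choice of mine for illustration, not part of the problem, and the function name is hypothetical. Blind switching and staying earn the same average payout under any such prior.

```python
import random

def simulate(strategy, trials=100_000, seed=0):
    """Average payout of an observation-blind strategy.

    The lesser amount is drawn from an illustrative prior (uniform over
    $5, $10, $20 -- an assumption, not from the problem statement); the
    greater envelope always holds double.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        lesser = rng.choice([5, 10, 20])
        envelopes = [lesser, 2 * lesser]
        rng.shuffle(envelopes)      # which envelope we grab first is random
        first, other = envelopes
        total += other if strategy == "switch" else first
    return total / trials
```

Because blind switching ignores the observed value, it receives a draw from the same distribution as staying, so the two averages agree (here both are near $17.50, i.e. 1.5 times the mean lesser amount).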

So which is it? Does switching increase the expected value of your payday over sticking to your guns?

You'll appreciate what follows more if you try to answer this question yourself. I didn't figure out what was going on with this paradox for more than a month after I first heard it in undergrad, so don't worry if you're less than 100% confident in your answer.

Spoiler below

The above is actually a trick question. While "not switching" is in some sense a better answer, because it at least avoids the flawed argument for switching, the real answer is that the problem itself is not well posed - there is simply not enough information to make a choice.

To see why, let's add some information and see how the problem would change. Suppose you had prior knowledge that one envelope contained $5, and the other $10. After opening the left envelope (with $10 in it), you would know the other envelope contains $5 and choose not to switch. With this simple prior, the problem becomes easy. But there are many other prior distributions one could imagine as well. Prior knowledge could have told you that the lesser envelope holds either $10 or $20 with equal probability, and the greater envelope, correspondingly, $20 or $40. In this case, conditioned on observing that the first envelope opened had $20, you would switch (why?). If it had $40, you wouldn't switch.
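A quick Monte Carlo sketch makes the "(why?)" concrete for this second prior (the function name and trial count are illustrative choices of mine, not from the text):

```python
import random

def value_of_switching(observed, trials=200_000, seed=1):
    """Estimate E[other envelope | first envelope == observed] under the
    prior: lesser envelope is $10 or $20 with equal probability, greater
    envelope holds double."""
    rng = random.Random(seed)
    payoffs = []
    for _ in range(trials):
        lesser = rng.choice([10, 20])
        envelopes = [lesser, 2 * lesser]
        rng.shuffle(envelopes)
        first, other = envelopes
        if first == observed:       # condition on the observed amount
            payoffs.append(other)
    return sum(payoffs) / len(payoffs)
```

Conditioned on seeing $20, the other envelope is $40 or $10 with equal posterior probability, averaging $25, so you switch; conditioned on seeing $40, the other envelope is always $20, so you stay.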

Given either of these specific priors, we can see that the decision to switch depends on the observed value. When we opened the left envelope and found X dollars in the original problem, we would use this value to compute a posterior probability that we had opened the lesser or greater envelope, p = Pr[e = lesser | value = X], (1 - p) = Pr[e = greater | value = X]. If we have opened the lesser envelope, switching will gain us X dollars. If we have opened the greater envelope, switching will lose us 0.5 * X dollars. Thus, we will switch whenever p \cdot X \geq (1 - p) \cdot 0.5 X \Leftrightarrow 2p \geq (1 - p) \Leftrightarrow p \geq 1/3, noting that we can compute p using Bayes' law (the equal 1/2 prior probabilities of having picked either envelope cancel from numerator and denominator) according to:

p = Pr[e = lesser | value = X] = \frac{Pr[value = X | e = lesser]}{Pr[value = X | e = lesser] + Pr[value = X | e = greater]}

Thus, given any prior distribution Pr[value = X | e = lesser] over the value of the lesser envelope, we can compute the distribution over the greater envelope using a change of variables [2]. With both distributions, given any observed value of the first envelope we open, we can decide whether or not to switch based on the posterior probability, p, that we have chosen the lesser envelope.
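The decision procedure just described can be sketched as a small routine. Representing the prior as a dictionary from lesser-envelope amounts to probabilities is my own assumption, and the helper names are hypothetical:

```python
def posterior_lesser(prior, observed):
    """Posterior probability that the observed envelope is the lesser one.

    `prior` maps each possible lesser-envelope amount to its probability, so
      Pr[value = X | e = lesser]  = prior.get(X, 0)
      Pr[value = X | e = greater] = prior.get(X / 2, 0)  # change of variables
    The equal 1/2 chances of having picked either envelope cancel out.
    `observed` must have nonzero probability under at least one hypothesis.
    """
    like_lesser = prior.get(observed, 0.0)
    like_greater = prior.get(observed / 2, 0.0)
    return like_lesser / (like_lesser + like_greater)

def should_switch(prior, observed):
    # Switch when the expected gain p * X beats the expected loss
    # (1 - p) * X / 2, i.e. when p >= 1/3.
    return posterior_lesser(prior, observed) >= 1 / 3
```

With the prior from the example above, {10: 0.5, 20: 0.5}, this rule switches on an observed $10 or $20 and stays on $40, matching the earlier reasoning.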

So the biggest problem with argument (a) appears to be that it uses the prior probabilities of having chosen the lesser or the greater envelope, rather than the posterior probabilities. But how is it that the brain, which is supposed to be so smart, can make such a simple mistake? The answer is that the brain does not make this mistake, but instead gets tripped up on something more subtle - it uses a common inference technique for dealing with unspecified priors, one that happens to go horribly wrong for this adversarially designed problem.

The implicit uniform prior

The technique for handling missing information is this: if one is told nothing about the distribution over the amount of money in the lesser envelope, one simply assumes that it gives equal probability to all values it can take. No value is said to be more likely than any other, so this seems inherently reasonable, and taken alone it is quite an innocuous assumption. But looking at the greater envelope's distribution, we realize that we know nothing about it either, and make the same assumption - that it assigns all its values equal probability. And this is where everything goes wrong - as we will now see, it is impossible to impose the uniform prior on both of these intricately related distributions simultaneously.

To keep things simple, we'll assume our probability distributions are discrete, but a slight modification of the following analysis would also work for continuous distributions. From our uniform prior assumption, and the fact that both distributions range over the same number of distinct values, we find that all observable values are equally likely under either the lesser or the greater distribution, i.e. c = Pr[value = X | e = lesser] = Pr[value = X | e = greater] for all values X occurring with nonzero probability. If some value X occurs in the lesser distribution, we know from the problem description that 2X occurs in the greater distribution. But then, because both distributions assign equal probability to all values, 2X must occur in the lesser distribution as well. And herein lies the problem - the lesser distribution must then contain an infinite sequence of discrete values [X, 2X, 4X, ...], each occurring with probability c > 0, and the total probability of this sequence cannot sum to 1. There exists no pair of distributions satisfying our seemingly innocuous assumptions [3], and the Bayesian inference used to compute p = \frac{c}{c + c} = 0.5 regardless of the value of X is unfounded.
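The contradiction in the last step can be stated in one line: if every value in the support carries the same probability c > 0, then

```latex
\sum_{k=0}^{\infty} \Pr[\text{value} = 2^k X \mid e = \text{lesser}]
  \;=\; \sum_{k=0}^{\infty} c \;=\; \infty \;\neq\; 1,
```

violating the requirement that probabilities sum to 1.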

So the brain fails because it applies the implicit uniform prior to two related distributions at once. Under this model, it does the usual Bayesian inference, but the nonsense result occurs because the uniform-prior technique is inappropriate in this probabilistic illusion - the scenario is specifically constructed to play games with techniques for dealing with missing information that normally work quite well.

The big picture

Stepping back, what does this failure of inference tell us about probabilistic inference in the brain? First, we can observe that were we actually the kid in the real-world scenario, we wouldn't face such a paradox at all. We'd be able to incorporate all the other available evidence about the going rate for mowing a lawn or the stinginess of our neighbor. We'd even be able to factor in the likelihood of the other envelope containing change if the first has an odd number of dollars. Maybe we wouldn't be able to theoretically justify our intuition to switch or stay, but the problem would be at least well posed enough for us to make an informed decision. And this is the first lesson that we can draw from this problem - that constructing good priors is somewhat of an art, and that whenever we apply Bayesian statistics to real-world scenarios, we should expect the process of forming priors to get messy (the same applies to predicting the probability of life on other planets or figuring out whether we're all living in one of Elon Musk's simulations).

Second, we can see that in spite of our natural tendency to incorporate intuition from context into problems, we can actually take natural-language instructions and formulate more specific notions of what a prior should be. When posed with the above problem, almost no one makes the argument to switch or stay based on the nuances of being a middle schooler mowing a lawn. We actually parse the instructions as telling us that there is a particular type of prior, and try to use that to solve the problem. Parsing priors that others specify through natural language is clearly a valuable skill, and we don't have anything quite like it yet in artificial intelligence (although see similar work that has explored related issues). Other well-known inference illusions in psychology have even exploited the same idea, creating problems that trick the brain into falsely forming uniform priors over hypotheses that actually have logical constraints implied by their descriptions.

Finally, the fact that the brain can be tricked here at all implies that it is not using traditional Bayesian statistics in the ordinary computer-science sense of instantiating a prior, computing a posterior (even if approximately), and then producing an answer. The existence of more abstract or symbolic shortcuts to proper Bayesian inference means that many problems might be more efficiently approached by more direct discriminative methods. Just because Bayesian inference is correct doesn't mean it's computationally efficient enough to be useful. The most appropriate representation of data and computation for probabilistic inference in AI systems thus remains a largely open question. But the two-envelope paradox illuminates at least some of the challenges that we will face in constructing machines with the ability to solve higher level inference tasks.


[1] See Galileo's illusion, or the Checker shadow illusion, for example.
[2] Computing the distribution over the greater envelope given a distribution over the lesser envelope has a little nuance. Given a discrete distribution over the lesser envelope, we can simply say that for every value X in the lesser distribution, we get a new value at 2X in the greater distribution having the same probability. But if the lesser distribution is a continuous probability density function, e.g. a uniform density l(x) = 1 between $0 and $1, then we must compute the greater distribution according to the rules for change of variables in PDFs, i.e. g(x) = \frac{1}{2} l(x / 2), so that g(x) = \frac{1}{2} between $0 and $2.
[3] Although, surprisingly, there do exist proper distributions that motivate always switching. One can prove, though, that all such distributions have an infinite expected value.