Monday, 11 February 2019

Coherent behaviour in the real world is an incoherent concept

(Edit: I'm no longer confident that the two definitions I used below are useful. I still stand by the broad thrust of this post, but am in the process of rethinking the details).
Rohin Shah has recently criticised Eliezer’s argument that “sufficiently optimised agents appear coherent”, on the grounds that any behaviour can be rationalised as maximisation of the expectation of some utility function. In this post I dig deeper into this disagreement, concluding that Rohin is broadly correct, although the issue is more complex than he makes it out to be. Here’s Eliezer’s summary of his original argument:
Violations of coherence constraints in probability theory and decision theory correspond to qualitatively destructive or dominated behaviors. Coherence violations so easily computed as to be humanly predictable should be eliminated by optimization strong enough and general enough to reliably eliminate behaviors that are qualitatively dominated by cheaply computable alternatives. From our perspective this should produce agents such that, ceteris paribus, we do not think we can predict, in advance, any coherence violation in their behavior.
First we need to clarify what Eliezer means by coherence. He notes that there are many formulations of coherence constraints: restrictions on preferences which imply that an agent which obeys them is maximising the expectation of some utility function. I’ll take the standard axioms of VNM utility as one representative set of constraints. In this framework, we consider a set O of disjoint outcomes. A lottery is some assignment of probabilities to the elements of O such that they sum to 1. For any pair of lotteries, an agent can either prefer one to the other, or be indifferent between them; let P be the function (from pairs of lotteries to a choice between them) defined by these preferences. The agent is incoherent if P violates any of the following axioms: completeness, transitivity, continuity, and independence. Eliezer gives several examples of how an agent which violates these axioms can be money-pumped, which is an example of the “destructive or dominated” behaviour he mentions in the quote above. And any agent which doesn’t violate these axioms has behaviour which corresponds to maximising the expectation of some utility function over O (a function mapping the outcomes in O to real numbers).
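(As a quick formal reminder, in my own notation: the VNM theorem says that a preference relation over lotteries satisfies these four axioms if and only if there is some utility function u: O → ℝ such that

$$ L \succeq M \iff \sum_{o \in O} L(o)\,u(o) \;\ge\; \sum_{o \in O} M(o)\,u(o), $$

where L(o) is the probability that lottery L assigns to outcome o; moreover, u is unique up to positive affine transformation.)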

It’s crucial to note that, in this setup, coherence is a property of an agent’s preferences at a single point in time. The outcomes that we are considering are all mutually exclusive, so an agent’s preferences over other outcomes are irrelevant after one outcome has already occurred. In addition, preferences are not observed but rather hypothetical: since outcomes are disjoint, we can’t actually observe the agent choosing a lottery and receiving a corresponding outcome (more than once).¹ But Eliezer’s argument above makes use of a concept of coherence which differs in two ways: it is a property of the observed behaviour of agents (rather than their hypothetical preferences), and that behaviour plays out over time (rather than at a single instant). VNM coherence is not well-defined in this setup, so if we want to formulate a rigorous version of this argument, we’ll need to specify a new definition of coherence which extends the standard instantaneous-hypothetical one. Here are two possible ways of doing so:
  • Definition 1: Let O be the set of all possible “snapshots” of the state of the universe at a single instant (which I shall call world-states). At each point in time when an agent chooses between different actions, that can be interpreted as a choice between lotteries over states in O. Its behaviour is coherent iff the set of all preferences revealed by those choices is consistent with some coherent preference function P over all pairs of lotteries over O AND there is a corresponding utility function which assigns values to each state that are consistent with the relevant Bellman equations. In other words, an agent’s observed behaviour is coherent iff there’s some utility function such that the utility of each state is some fixed value assigned to that state + the expected value of the best course of action starting from that state, and the agent has always chosen the action with the highest expected utility (a rough equation for this condition is sketched just after this list).²
  • Definition 2: Let O be the set of all possible ways that the entire universe could play out from beginning to end (which I shall call world-trajectories). Again, at each point in time when an agent chooses between different actions, that can be interpreted as a choice between lotteries over O. However, in this case no set of observed choices can ever be “incoherent” - because, as Rohin notes, there is always a utility function which assigns maximal utility to all and only the world-trajectories in which those choices were made.
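As flagged in definition 1, here’s a rough equation for that Bellman-consistency condition (my own notation, and glossing over the infinite-horizon technicalities discussed in footnote 2). Write u(s) for the fixed value assigned to world-state s, T(s, a) for the distribution over next states when taking action a in state s, and V for the agent’s overall utility function over states. Then coherence requires that

$$ V(s) = u(s) + \max_{a} \; \mathbb{E}_{s' \sim T(s,a)}\big[V(s')\big], $$

and that every action the agent is actually observed to take in state s is one attaining that maximum.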
To be clear on the difference between them, under definition 1 an outcome is a world-state, one of which occurs every timestep, and a coherent agent makes every choice without reference to any past events (except insofar as they provide information about its current state or future states). Whereas under definition 2 an outcome is an entire world-trajectory (composed of a sequence of world-states), only one of which ever occurs, and a coherent agent’s future actions may depend on what happened in the past in arbitrary ways.

To see how this difference plays out in practice, consider the following example of non-transitive travel preferences: an agent which pays $50 to go from San Francisco to San Jose, then $50 to go from San Jose to Berkeley, then $50 to go from Berkeley to San Francisco (note that the money in this example is just a placeholder for anything the agent values). Under definition 2, this isn’t evidence that the agent is incoherent, but rather just an indication that it assigns more utility to world-trajectories in which it travels round in a circle than to other available world-trajectories. Since Eliezer uses this situation as an example of incoherence, he clearly doesn’t intend to interpret behaviour as a choice between lotteries over world-trajectories. So let’s examine definition 1 in more detail. But first note that there is no coherence theorem which says that an agent’s utility function needs to be defined over world-states instead of world-trajectories, and so it’ll take additional arguments to demonstrate that sufficiently optimised agents will care about the former instead of the latter. I’m not aware of any particularly compelling arguments for this conclusion - indeed, as I’ll explain later, I think it’s more plausible to model humans as caring about the latter.
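Before turning to definition 1, here’s a minimal Python sketch (a toy construction of my own, just restating Rohin’s point in code) of how the circular travel above gets rationalised under definition 2: assign maximal utility to exactly those world-trajectories consistent with the observed choices, and the behaviour is utility-maximising by construction.

```python
# Toy illustration of definition 2: any observed behaviour can be rationalised
# by some utility function over entire world-trajectories.

def rationalising_utility(observed_choices):
    """Return a utility function over trajectories under which the observed
    behaviour maximises utility, by construction."""
    observed = tuple(observed_choices)

    def utility(trajectory):
        # Maximal utility iff the trajectory matches the choices actually made.
        return 1.0 if tuple(trajectory) == observed else 0.0

    return utility

# The non-transitive traveller: pays $50 at each step to travel in a circle.
circular = [("San Francisco", "pay $50, go to San Jose"),
            ("San Jose", "pay $50, go to Berkeley"),
            ("Berkeley", "pay $50, go to San Francisco")]

u = rationalising_utility(circular)
print(u(circular))                              # 1.0: the trajectory the agent actually chose
print(u([("San Francisco", "stay put")] * 3))   # 0.0: any alternative trajectory
```

Of course, a utility function constructed this way does no predictive work at all, which is exactly why coherence under definition 2 is vacuous.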

Okay, so what about definition 1? This is a more standard interpretation of having preferences over time: an agent makes choices under uncertainty which move it between different states, making this setup very similar to the POMDPs often used in reinforcement learning. It would be natural to now interpret the non-transitive travel example as follows: let F, J and B be the states of being in San Francisco, San Jose and Berkeley respectively. Then paying to go from F to J to B to F demonstrates incoherent preferences over states (assuming there’s also an option to just stay put in any of those states).
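To spell out the incoherence (a simplified sketch which treats the $50 as a direct utility cost and ignores all other differences between states - exactly the simplification I’ll push back on below): if there were a value function V over states which rationalised the agent’s choices, then preferring to pay and move rather than stay put at each step would require

$$ V(J) - 50 \ge V(F), \qquad V(B) - 50 \ge V(J), \qquad V(F) - 50 \ge V(B). $$

Summing the three inequalities gives $-150 \ge 0$, a contradiction: no assignment of values to the three states can make the round trip look optimal.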

First problem with this argument: there are no coherence theorems saying that an agent needs to maintain the same utility function over time. In fact, there are plenty of cases where you might choose to change your utility function (or have that change thrust upon you). I like Nate Soares’ example of wanting to become a rockstar; other possibilities include being blackmailed to change it, or sustaining brain damage. However, it seems unlikely that a sufficiently intelligent AGI will face these particular issues - and in fact the more capable it is of implementing its utility function, the more valuable it will consider the preservation of that utility function.³ So I’m willing to accept that, past a certain high level of intelligence, changes significant enough to affect what utility function a human would infer from that AGI’s behaviour are unlikely.

Here’s a more important problem, though: we’ve now ruled out some preferences which seem to be reasonable and natural ones. For example, suppose you want to write a book which is so timeless that at least one person reads it every year for the next thousand years. There is no single point at which the state of the world contains enough information to determine whether you’ve succeeded or failed in this goal: in any given year there may be no remaining record of whether somebody read it in a previous year (or the records could have been falsified, etc). This goal is fundamentally a preference over world-trajectories.⁴ In correspondence, Rohin gave me another example: a person whose goal is to play a great song in its entirety, and who isn’t satisfied with the prospect of playing the final note while falsely believing that they’ve already played the rest of the piece.⁵ More generally, I think that virtue-ethicists and deontologists are more accurately described as caring about world-trajectories than world-states - and almost all humans use these theories to some extent when choosing their actions. Meanwhile Eric Drexler’s CAIS framework relies on services which are bounded in time taken and resources used - another constraint which can’t be expressed just in terms of individual world-states.

There’s a third issue with this framing: in examples like non-transitive travel, we never actually end up in quite the same state we started in. Perhaps we’ve gotten sunburned along the journey. Perhaps we spent a few minutes editing our next blog post. At the very least, we’re now slightly older, and we have new memories, and the sun’s position has changed a little. So really we’ve ended up in state F’, which differs in many ways from F. You can presumably see where I’m going with this: just like with definition 2, no series of choices can ever demonstrate incoherent revealed preferences in the sense of definition 1, since every choice actually made is between a different set of possible world-state outcomes. (At the very least, they differ in the agent’s memories of which path it took to get there.⁶ And note that outcomes which are identical except for slight differences in memories should sometimes be treated in very different ways, since having even a few bits of additional information from exploration can be incredibly advantageous.)

Now, this isn’t so relevant in the human context because we usually abstract away from the small details. For example, if I offer to sell you an ice-cream and you refuse it, and then I offer it again a second later and you accept, I’d take that as evidence that your preferences are incoherent - even though technically the two offers are different because accepting the first just leads you to a state where you have an ice-cream, while accepting the second leads you to a state where you both have an ice-cream and remember refusing the first offer. Similarly, I expect that you don’t consider two outcomes to be different if they only differ in the precise pattern of TV static or the exact timing of leaves rustling. But again, there are no coherence constraints saying that an agent can’t consider such factors to be immensely significant, enough to totally change their preferences over lotteries when you substitute in one such outcome for the other.

So for the claim that sufficiently optimised agents appear coherent to be non-trivially true under my first definition of coherence, we’d need to clarify that such coherence is only with respect to outcomes when they’re categorised according to the features which humans consider important, except for the ones which are intrinsically temporally extended. But then the standard arguments from coherence constraints no longer apply. At this point I think it’s better to abandon the whole idea of formal coherence as a predictor of real-world behaviour, and replace it with Rohin’s notion of “goal-directedness”, which is more upfront about being inherently subjective, and doesn’t rule out any of the goals that humans actually have.


Thanks to Tim Genewein, Ramana Kumar, Victoria Krakovna and Rohin Shah for discussions which led to this post, and helpful comments.

[1] Disjointness of outcomes makes this argument more succinct, but it’s not actually a necessary component, because once you’ve received one outcome, your preferences over all other outcomes are allowed to change. For example, having won $1000000, the value you place on other financial prizes will very likely go down. This is related to my later argument that you never actually have multiple paths to ending up in the “same” state.

[2] Technical note: I’m assuming an infinite time horizon and no discounting, because removing either of those conditions leads to weird behaviour which I don’t want to dig into in this post. In theory this leaves open the possibility of states with infinite expected utility, as well as lotteries over infinitely many different states, but I think we can just stipulate that neither of those possibilities arises without changing the core idea behind my argument. The underlying assumption here is something like: whether we model the universe as finite or infinite shouldn’t significantly affect whether we expect AI behaviour to be coherent over the next few centuries, for any useful definition of coherent.

[3] Consider the two limiting cases: if I have no power to implement my utility function, then it doesn’t make any difference what it changes to. By comparison, if I am able to perfectly manipulate the world to fulfil my utility function, then there is no possible change in it which will lead to better outcomes, and many which will lead to worse (from the perspective of my current utility function).

[4] At this point you could object on a technicality: from the unitarity of quantum mechanics, it seems as if the laws of physics are in fact reversible, and so the current state of the universe (or multiverse, rather) actually does contain all the information you theoretically need to deduce whether or not any previous goal has been satisfied. But I’m limiting this claim to macroscopic-level phenomena, for two reasons. Firstly, I don’t think our expectations about the behaviour of advanced AI should depend on very low-level features of physics in this way; and secondly, if the objection holds, then preferences over world-states have all the same problems as preferences over world-trajectories.

[5] In a POMDP, we don’t usually include an agent’s memories (i.e. a subset of previous observations) as part of the current state. However, it seems to me that in the context of discussing coherence arguments it’s necessary to do so, because otherwise going from a known good state to a known bad state and back in order to gain information is an example of incoherence. So we could also formulate this setup as a belief MDP. But I prefer talking about it as a POMDP, since that makes the agent seem less Cartesian - for example, it makes more sense to ask what happens after the agent “dies” in a POMDP than a belief MDP.

[6] Perhaps you can construct a counterexample involving memory loss, but this doesn’t change the overall point, and if you’re concerned with such technicalities you’ll also have to deal with the problems I laid out in footnote 4.

Friday, 8 February 2019

Arguments for moral indefinability

Epistemic status: I endorse the core intuitions behind this post, but am only moderately confident in the specific claims made. Also, while I do have a degree in philosophy, I am not a professional ethicist, and I’d appreciate feedback on how these ideas relate to existing literature.

Moral indefinability is the term I use for the idea that there is no ethical theory which provides acceptable solutions to all moral dilemmas, and which also has the theoretical virtues (such as simplicity, precision and non-arbitrariness) that we currently desire. I think this is an important and true perspective on ethics, and in this post will explain why I hold it, with the caveat that I'm focusing more on airing these ideas than constructing a watertight argument.

Here’s another way of explaining moral indefinability: let’s think of ethical theories as procedures which, in response to a moral claim, either endorse it, reject it, or do neither. Moral philosophy is an attempt to find the theory whose answers best match our intuitions about what answers ethical theories should give us (e.g. don’t cause unnecessary suffering), and whose procedure for generating answers best matches our meta-level intuitions about what ethical theories should look like (e.g. they should consistently apply impartial principles rather than using ad-hoc, selfish or random criteria). None of these desiderata are fixed in stone, though - in particular, we sometimes change our intuitions when it’s clear that the only theories which match those intuitions violate our meta-level intuitions. My claim is that eventually we will also need to change our meta-level intuitions in important ways, because it will become clear that the only theories which match them violate key object-level intuitions. In particular, this might lead us to accept theories which occasionally evince properties such as the following (a toy sketch of this procedural framing comes just after the list):
  • Incompleteness: for some claim A, the theory neither endorses nor rejects either A or ~A, even though we believe that the choice between A and ~A is morally important.
  • Vagueness: the theory endorses an imprecise claim A, but rejects every way of making it precise.
  • Contradiction: the theory endorses both A and ~A. (This is a somewhat provocative way of framing the property, since we can always add arbitrary ad-hoc exceptions to remove the contradictions. So perhaps a better term is arbitrariness of scope: when we have both a strong argument for A and a strong argument for ~A, the theory specifies in which situations each conclusion applies, based on criteria which we would consider arbitrary and unprincipled. For example: when there are fewer than N lives at stake, use one set of principles; otherwise use a different set.)
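As mentioned above, here’s a toy Python sketch of the procedural framing - purely illustrative, with a lookup table standing in for a real ethical theory - just to show where properties like incompleteness live in this picture:

```python
from enum import Enum

class Verdict(Enum):
    ENDORSE = "endorse"
    REJECT = "reject"
    NEITHER = "neither"   # the theory stays silent on this claim

# A toy "ethical theory": in this framing, just a mapping from moral claims to verdicts.
toy_theory = {
    "we should not cause unnecessary suffering": Verdict.ENDORSE,
    "it is fine to cause unnecessary suffering": Verdict.REJECT,
    # Incompleteness: no verdict on a claim we consider morally important.
    "failing to create a happy person is as bad as killing one": Verdict.NEITHER,
}

def evaluate(theory, claim):
    """Return the theory's verdict on a claim, defaulting to silence."""
    return theory.get(claim, Verdict.NEITHER)

print(evaluate(toy_theory, "we should not cause unnecessary suffering"))  # Verdict.ENDORSE
```

Contradiction (or arbitrariness of scope) would correspond to the table endorsing both a claim and its negation, or switching between principles based on criteria like the N-lives threshold above; vagueness would correspond to endorsing a claim while rejecting every precisification of it.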

Why take moral indefinability seriously? The main reason is that ethics evolved to help us coordinate in our ancestral environment, and did so not by giving us a complete decision procedure to implement, but rather by ingraining intuitive responses to certain types of events and situations. There were many different and sometimes contradictory selection pressures driving the formation of these intuitions - and so, when we construct generalisable principles based on our intuitions, we shouldn't expect those principles to automatically give useful or even consistent answers to very novel problems. Unfortunately, the moral dilemmas which we grapple with today have in fact "scaled up" drastically in at least two ways. Some are much greater in scope than any problems humans have dealt with until very recently. And some feature much more extreme tradeoffs than ever come up in our normal lives, e.g. because they have been constructed as thought experiments to probe the edges of our principles.

Of course, we're able to adjust our principles so that we are more satisfied with their performance on novel moral dilemmas. But I claim that in some cases this comes at the cost of those principles conflicting with the intuitions which make sense on the scales of our normal lives. And even when it's possible to avoid that, there may be many ways to make such adjustments whose relative merits are so divorced from our standard moral intuitions that we have no good reason to favour one over the other. I'll give some examples shortly.

A second reason to believe in moral indefinability is the fact that human concepts tend to be open-textured: there is often no unique "correct" way to rigorously define them. For example, we all know roughly what a table is, but it doesn’t seem like there’s an objective definition which gives us a sharp cutoff between tables and desks and benches and a chair that you eat off and a big flat rock on stilts. A less trivial example is our inability to rigorously define what entities qualify as being "alive": edge cases include viruses, fires, AIs and embryos. So when moral intuitions are based on these sorts of concepts, trying to come up with an exact definition is probably futile. This is particularly true when it comes to very complicated systems in which tiny details matter a lot to us - like human brains and minds. It seems implausible that we’ll ever discover precise criteria for when someone is experiencing contentment, or boredom, or many of the other experiences that we find morally significant.

I would guess that many anti-realists are sympathetic to the arguments I’ve made above, but still believe that we can make morality precise without changing our meta-level intuitions much - for example, by grounding our ethical beliefs in what idealised versions of ourselves would agree with, after long reflection. My main objection to this view is, broadly speaking, that there is no canonical “idealised version” of a person, and different interpretations of that term could lead to a very wide range of ethical beliefs. I explore this objection in much more detail in this post. (In fact, the more general idea that humans aren’t really “utility maximisers”, even approximately, is another good argument for moral indefinability.) And even if idealised reflection is a coherent concept, it simply passes the buck to your idealised self, who might then believe my arguments and decide to change their meta-level intuitions.

So what are some pairs of moral intuitions which might not be simultaneously satisfiable under our current meta-level intuitions? Here’s a non-exhaustive list - the general pattern being clashes between small-scale perspectives, large-scale perspectives, and the meta-level intuition that they should be determined by the same principles:
  • Person-affecting views versus non-person-affecting views. Small-scale views: killing children is terrible, but not having children is fine, even when those two options lead to roughly the same outcome. Large-scale view: extinction is terrible, regardless of whether it comes about from people dying or people not being born.
  • The mere addition paradox, aka the repugnant conclusion. Small-scale views: adding happy people and making people more equal can't make things worse. Large-scale view: a world consisting only of people whose lives are barely worth living is deeply suboptimal (see the illustrative arithmetic just after this list). (Note also Arrhenius' impossibility theorems, which show that you can't avoid the repugnant conclusion without making even greater concessions).
  • Weighing theories under moral uncertainty. I personally find OpenPhil's work on cause prioritisation under moral uncertainty very cool, and the fundamental intuitions behind it seem reasonable, but some of it (e.g. variance normalisation) has reached a level of abstraction where I feel almost no moral force from their arguments, and aside from an instinct towards definability I'm not sure why I should care.
  • Infinite and relativistic ethics. Same as above. See also this LessWrong post arguing against applying the “linear utility hypothesis” at vast scales.
  • Whether we should force future generations to have our values. On one hand, we should be very glad that past generations couldn't do this. But on the other, the future will probably disgust us, like our present would disgust our ancestors. And along with "moral progress" there'll also be value drift in arbitrary ways - in fact, I don't think there's any clear distinction between the two.
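As flagged in the second bullet above, here’s the arithmetic behind the repugnant conclusion, with purely illustrative numbers: compare world A, containing $10^6$ people each at welfare level 100 (total welfare $10^8$), with world Z, containing $10^{11}$ people each at welfare level 0.01 (total welfare $10^9$). Total utilitarianism ranks Z strictly above A, even though every life in Z is barely worth living - and for any positive welfare level, however low, a large enough population yields the same ranking.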

I suspect that many readers share my sense that it'll be very difficult to resolve all of the dilemmas above in a satisfactory way, but also have a meta-level intuition that they need to be resolved somehow, because it's important for moral theories to be definable. But perhaps at some point it's this very urge towards definability which will turn out to be the weakest link. I do take seriously Parfit's idea that secular ethics is still young, and there's much progress yet to be made, but I don't see any principled reason why we should be able to complete ethics, except by raising future generations without whichever moral intuitions are standing in the way of its completion (and isn't that a horrifying thought?). From an anti-realist perspective, I claim that perpetual indefinability would be better. That may be a little more difficult to swallow from a realist perspective, of course. My guess is that the core disagreement is whether moral claims are more like facts, or more like preferences or tastes - if the latter, moral indefinability would be analogous to the claim that there’s no (principled, simple, etc) theory which specifies exactly which foods I enjoy.

There are two more plausible candidates for moral indefinability which were the original inspiration for this post, and which I think are some of the most important examples:
  • Whether to define welfare in terms of preference satisfaction or hedonic states.
  • The problem of "maximisation" in utilitarianism.
I've been torn for some time over the first question, slowly shifting towards hedonic utilitarianism as problems with formalising preferences piled up. While this isn't the right place to enumerate those problems (see here for a previous relevant post), I've now become persuaded that any precise definition of which preferences it is morally good to satisfy will lead to conclusions which I find unacceptable. After making this update, I can either reject a preference-based account of welfare entirely (in favour of a hedonic account), or else endorse a "vague" version of it which I think will never be specified precisely.

The former may seem the obvious choice, until we take into account the problem of maximisation. Consider that a true (non-person-affecting) hedonic utilitarian would kill everyone who wasn't maximally happy if they could replace them with people who were (see here for a comprehensive discussion of this argument). And that for any precise definition of welfare, they would search for edge cases where they could push it to extreme values. In fact, reasoning about a "true utilitarian" feels remarkably like reasoning about an unsafe AGI. I don't think that's a coincidence: psychologically, humans just aren't built to be maximisers, and so a true maximiser would be fundamentally adversarial. And yet many of us also have strong intuitions that there are some good things, and it's always better for there to be more good things, and it’s best if there are as many good things as possible.

How to reconcile these problems? My answer is that utilitarianism is pointing in the right direction, which is “lots of good things”, and in general we can move in that direction without moving maximally in that direction. What are those good things? I use a vague conception of welfare that balances preferences and hedonic experiences and some of my own parochial criteria - importantly, without feeling like it's necessary to find a perfect solution (although of course there will be ways in which my current position can be improved). In general, I think that we can often do well enough without solving fundamental moral issues - see, for example, this LessWrong post arguing that we’re unlikely to ever face the true repugnant dilemma, because of empirical facts about psychology.

To be clear, this still means that almost everyone should focus much more on utilitarian ideas, like the enormous value of the far future, because in order to reject those ideas it seems like we’d need to sacrifice important object- or meta-level moral intuitions to a much greater extent than I advocate above. We simply shouldn’t rely on the idea that such value is precisely definable, nor that we can ever identify an ethical theory which meets all the criteria we care about.