Monday, 11 February 2019

Coherent behaviour in the real world is an incoherent concept

(Edit: I'm no longer confident that the two definitions I used below are useful. I still stand by the broad thrust of this post, but am in the process of rethinking the details).
Rohin Shah has recently criticised Eliezer’s argument that “sufficiently optimised agents appear coherent”, on the grounds that any behaviour can be rationalised as maximisation of the expectation of some utility function. In this post I dig deeper into this disagreement, concluding that Rohin is broadly correct, although the issue is more complex than he makes it out to be. Here’s Eliezer’s summary of his original argument:
Violations of coherence constraints in probability theory and decision theory correspond to qualitatively destructive or dominated behaviors. Coherence violations so easily computed as to be humanly predictable should be eliminated by optimization strong enough and general enough to reliably eliminate behaviors that are qualitatively dominated by cheaply computable alternatives. From our perspective this should produce agents such that, ceteris paribus, we do not think we can predict, in advance, any coherence violation in their behavior.
First we need to clarify what Eliezer means by coherence. He notes that there are many formulations of coherence constraints: restrictions on preferences which imply that an agent which obeys them is maximising the expectation of some utility function. I’ll take the standard axioms of VNM utility as one representative set of constraints. In this framework, we consider a set O of disjoint outcomes. A lottery is some assignment of probabilities to the elements of O such that they sum to 1. For any pair of lotteries, an agent can either prefer one to the other, or be indifferent between them; let P be the function (from pairs of lotteries to a choice between them) defined by these preferences. The agent is incoherent if P violates any of the following axioms: completeness, transitivity, continuity, and independence. Eliezer gives several examples of how an agent which violates these axioms can be money-pumped, which is an example of the “destructive or dominated” behaviour he mentions in the quote above. And any agent which doesn’t violate these axioms has behaviour which corresponds to maximising the expectation of some utility function over O (a function mapping the outcomes in O to real numbers).
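To make the money-pump point concrete, here’s a minimal sketch (my own illustration, with made-up outcomes and dollar amounts) of how an agent with intransitive preferences can be drained of value through a series of trades, each of which it individually prefers:

```python
# Toy money pump: an agent with cyclic (intransitive) preferences A > B > C > A
# will pay a small fee for each "upgrade", and so can be cycled indefinitely.

# Hypothetical preference relation, written as (preferred, dispreferred) pairs.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # violates transitivity

def accepts_trade(current, offered, fee, wealth):
    """The agent trades whenever it strictly prefers the offered outcome
    and can afford the fee."""
    return (offered, current) in prefers and wealth >= fee

wealth, holding, fee = 100.0, "B", 1.0

for step in range(6):
    # The pumper always offers whichever outcome the agent currently prefers.
    offer = next(x for x in "ABC" if (x, holding) in prefers)
    if accepts_trade(holding, offer, fee, wealth):
        wealth -= fee
        holding = offer
        print(f"step {step}: traded up to {holding}, wealth now {wealth}")

# After six individually-preferred trades the agent holds B again, but is $6
# poorer - the kind of qualitatively dominated behaviour referred to above.
```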

It’s crucial to note that, in this setup, coherence is a property of an agent’s preferences at a single point in time. The outcomes that we are considering are all mutually exclusive, so an agent’s preferences over other outcomes are irrelevant after one outcome has already occurred. In addition, preferences are not observed but rather hypothetical: since outcomes are disjoint, we can’t actually observe the agent choosing a lottery and receiving a corresponding outcome (more than once).¹ But Eliezer’s argument above makes use of a concept of coherence which differs in two ways: it is a property of observed rather than hypothetical behaviour, and of behaviour over time rather than at a single instant. VNM coherence is not well-defined in this setup, so if we want to formulate a rigorous version of this argument, we’ll need to specify a new definition of coherence which extends the standard instantaneous-hypothetical one. Here are two possible ways of doing so:
  • Definition 1: Let O be the set of all possible “snapshots” of the state of the universe at a single instant (which I shall call world-states). At each point in time when an agent chooses between different actions, that can be interpreted as a choice between lotteries over states in O. Its behaviour is coherent iff the set of all preferences revealed by those choices is consistent with some coherent preference function P over all pairs of lotteries over O AND there is a corresponding utility function which assigns values to each state that are consistent with the relevant Bellman equations. In other words, an agent’s observed behaviour is coherent iff there’s some utility function such that the utility of each state is some fixed value assigned to that state + the expected value of the best course of action starting from that state, and the agent has always chosen the action with the highest expected utility.² (One way of writing this down formally appears just after these definitions.)
  • Definition 2: Let O be the set of all possible ways that the entire universe could play out from beginning to end (which I shall call world-trajectories). Again, at each point in time when an agent chooses between different actions, that can be interpreted as a choice between lotteries over O. However, in this case no set of observed choices can ever be “incoherent” - because, as Rohin notes, there is always a utility function which assigns maximal utility to all and only the world-trajectories in which those choices were made.
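For concreteness, here is one way of writing down the condition in definition 1. The notation, and the assumption of a deterministic per-state reward R and transition kernel P, are mine rather than anything from the original argument; I’m also using the infinite-horizon, undiscounted setting discussed in footnote 2:

```latex
% One possible formalisation of definition 1 (notation and simplifications mine):
% a fixed value R(s) for each state, transition kernel P, no discounting.
V(s) = R(s) + \max_a \, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[ V(s') \right]
% and coherence requires that every observed action a_t satisfies
a_t \in \arg\max_a \, \mathbb{E}_{s' \sim P(\cdot \mid s_t,a)}\!\left[ V(s') \right]
```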
To be clear on the difference between them, under definition 1 an outcome is a world-state, one of which occurs every timestep, and a coherent agent makes every choice without reference to any past events (except insofar as they provide information about its current state or future states). Whereas under definition 2 an outcome is an entire world-trajectory (composed of a sequence of world-states), only one of which ever occurs, and a coherent agent’s future actions may depend on what happened in the past in arbitrary ways. To see how this difference plays out in practice, consider the following example of non-transitive travel preferences: an agent which pays $50 to go from San Francisco to San Jose, then $50 to go from San Jose to Berkeley, then $50 to go from Berkeley to San Francisco (note that the money in this example is just a placeholder for anything the agent values). Under definition 2, this isn’t evidence that the agent is incoherent, but rather just an indication that it assigns more utility to world-trajectories in which it travels round in a circle than to other available world-trajectories. Since Eliezer uses this situation as an example of incoherence, he clearly doesn’t intend to interpret behaviour as a choice between lotteries over world-trajectories. So let’s examine definition 1 in more detail. But first note that there is no coherence theorem which says that an agent’s utility function needs to be defined over world-states instead of world-trajectories, and so it’ll take additional arguments to demonstrate that sufficiently optimised agents will care about the former instead of the latter. I’m not aware of any particularly compelling arguments for this conclusion - indeed, as I’ll explain later, I think it’s more plausible to model humans as caring about the latter.

Okay, so what about definition 1? This is a more standard interpretation of having preferences over time: an agent’s choices under uncertainty move it between different states, making this setup very similar to the POMDPs (partially observable Markov decision processes) often used in reinforcement learning. It would be natural to now interpret the non-transitive travel example as follows: let F, J and B be the states of being in San Francisco, San Jose and Berkeley respectively. Then paying to go from F to J to B to F demonstrates incoherent preferences over states (assuming there’s also an option to just stay put in any of those states).
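Here’s a minimal sketch of that claim, under simplifying assumptions that are mine rather than the original argument’s: the agent’s utility decomposes as u(current city) + money held, moves are deterministic, and staying put is always a free option. Paying $50 to move from X to Y then reveals the constraint u(Y) - 50 ≥ u(X), and a brute-force search over a coarse grid of candidate utilities finds no assignment consistent with all three observed moves:

```python
# A sketch of the travel example under definition 1, assuming (simplifications
# mine) that utility decomposes as u(current city) + money held, that moves are
# deterministic, and that staying put is always a free option.
# Paying $50 to move from X to Y then reveals the constraint u(Y) - 50 >= u(X).

from itertools import product

revealed = [("F", "J"), ("J", "B"), ("B", "F")]  # F -> J -> B -> F, $50 each

def rationalisable(choices, fee=50, grid=range(0, 201, 10)):
    """Brute-force search for a utility assignment over the three cities
    (on a coarse grid) consistent with every revealed constraint."""
    for uF, uJ, uB in product(grid, repeat=3):
        u = {"F": uF, "J": uJ, "B": uB}
        if all(u[dst] - fee >= u[src] for src, dst in choices):
            return u
    return None

print(rationalisable(revealed))      # None: summing the constraints gives -150 >= 0
print(rationalisable(revealed[:2]))  # e.g. {'F': 0, 'J': 50, 'B': 100}
```

Any two legs of the journey can be rationalised on their own; it’s only the full cycle that can’t be, which is the sense in which these choices look incoherent under definition 1.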

First problem with this argument: there are no coherence theorems saying that an agent needs to maintain the same utility function over time. In fact, there are plenty of cases where you might choose to change your utility function (or have that change thrust upon you). I like Nate Soares’ example of wanting to become a rockstar; other possibilities include being blackmailed to change it, or sustaining brain damage. However, it seems unlikely that a sufficiently intelligent AGI will face these particular issues - and in fact the more capable it is of implementing its utility function, the more valuable it will consider the preservation of that utility function.³ So I’m willing to accept that, past a certain high level of intelligence, changes significant enough to affect which utility function a human would infer from the AGI’s behaviour are unlikely.

Here’s a more important problem, though: we’ve now ruled out some preferences which seem to be reasonable and natural ones. For example, suppose you want to write a book which is so timeless that at least one person reads it every year for the next thousand years. There is no single point at which the state of the world contains enough information to determine whether you’ve succeeded or failed in this goal: in any given year there may be no remaining record of whether somebody read it in a previous year (or the records could have been falsified, etc). This goal is fundamentally a preference over world-trajectories.⁴ In correspondence, Rohin gave me another example: a person whose goal is to play a great song in its entirety, and who isn’t satisfied with the prospect of playing the final note while falsely believing that they’ve already played the rest of the piece.⁵ More generally, I think that virtue-ethicists and deontologists are more accurately described as caring about world-trajectories than world-states - and almost all humans use these theories to some extent when choosing their actions. Meanwhile Eric Drexler’s CAIS framework relies on services which are bounded in time taken and resources used - another constraint which can’t be expressed just in terms of individual world-states.

There’s a third issue with this framing: in examples like non-transitive travel, we never actually end up in quite the same state we started in. Perhaps we’ve gotten sunburned along the journey. Perhaps we spent a few minutes editing our next blog post. At the very least, we’re now slightly older, and we have new memories, and the sun’s position has changed a little. So really we’ve ended up in state F’, which differs in many ways from F. You can presumably see where I’m going with this: just like with definition 2, no series of choices can ever demonstrate incoherent revealed preferences in the sense of definition 1, since every choice actually made is between a different set of possible world-state outcomes. (At the very least, they differ in the agent’s memories of which path it took to get there.⁶ And note that outcomes which are identical except for slight differences in memories should sometimes be treated in very different ways, since having even a few bits of additional information from exploration can be incredibly advantageous.)

Now, this isn’t so relevant in the human context because we usually abstract away from the small details. For example, if I offer to sell you an ice-cream and you refuse it, and then I offer it again a second later and you accept, I’d take that as evidence that your preferences are incoherent - even though technically the two offers are different because accepting the first just leads you to a state where you have an ice-cream, while accepting the second leads you to a state where you both have an ice-cream and remember refusing the first offer. Similarly, I expect that you don’t consider two outcomes to be different if they only differ in the precise pattern of TV static or the exact timing of leaves rustling. But again, there are no coherence constraints saying that an agent can’t consider such factors to be immensely significant, enough to totally change their preferences over lotteries when you substitute in one such outcome for the other.

So for the claim that sufficiently optimised agents appear coherent to be non-trivially true under my first definition of coherence, we’d need to clarify that such coherence is only with respect to outcomes when they’re categorised according to the features which humans consider important, except for the ones which are intrinsically temporally extended. But then the standard arguments from coherence constraints no longer apply. At this point I think it’s better to abandon the whole idea of formal coherence as a predictor of real-world behaviour, and replace it with Rohin’s notion of “goal-directedness”, which is more upfront about being inherently subjective, and doesn’t rule out any of the goals that humans actually have.


Thanks to Tim Genewein, Ramana Kumar, Victoria Krakovna and Rohin Shah for discussions which led to this post, and helpful comments.

[1] Disjointness of outcomes makes this argument more succinct, but it’s not actually a necessary component, because once you’ve received one outcome, your preferences over all other outcomes are allowed to change. For example, once you’ve won $1000000, the value you place on other financial prizes will very likely go down. This is related to my later argument that you never actually have multiple paths to ending up in the “same” state.

[2] Technical note: I’m assuming an infinite time horizon and no discounting, because removing either of those conditions leads to weird behaviour which I don’t want to dig into in this post. In theory this leaves open the possibility of states with infinite expected utility, as well as lotteries over infinitely many different states, but I think we can just stipulate that neither of those possibilities arises without changing the core idea behind my argument. The underlying assumption here is something like: whether we model the universe as finite or infinite shouldn’t significantly affect whether we expect AI behaviour to be coherent over the next few centuries, for any useful definition of coherent.

[3] Consider the two limiting cases: if I have no power to implement my utility function, then it doesn’t make any difference what it changes to. By comparison, if I am able to perfectly manipulate the world to fulfil my utility function, then there is no possible change in it which will lead to better outcomes, and many which will lead to worse (from the perspective of my current utility function).

[4] At this point you could object on a technicality: from the unitarity of quantum mechanics, it seems as if the laws of physics are in fact reversible, and so the current state of the universe (or multiverse, rather) actually does contain all the information you theoretically need to deduce whether or not any previous goal has been satisfied. But I’m limiting this claim to macroscopic-level phenomena, for two reasons. Firstly, I don’t think our expectations about the behaviour of advanced AI should depend on very low-level features of physics in this way; and secondly, if the objection holds, then preferences over world-states have all the same problems as preferences over world-trajectories.

[5] In a POMDP, we don’t usually include an agent’s memories (i.e. a subset of previous observations) as part of the current state. However, it seems to me that in the context of discussing coherence arguments it’s necessary to do so, because otherwise going from a known good state to a known bad state and back in order to gain information is an example of incoherence. So we could also formulate this setup as a belief MDP. But I prefer talking about it as a POMDP, since that makes the agent seem less Cartesian - for example, it makes more sense to ask what happens after the agent “dies” in a POMDP than in a belief MDP.

[6] Perhaps you can construct a counterexample involving memory loss, but this doesn’t change the overall point, and if you’re concerned with such technicalities you’ll also have to deal with the problems I laid out in footnote 4.

Friday, 8 February 2019

Arguments for moral indefinability

Epistemic status: I endorse the core intuitions behind this post, but am only moderately confident in the specific claims made. Also, while I do have a degree in philosophy, I am not a professional ethicist, and I’d appreciate feedback on how these ideas relate to existing literature.

Moral indefinability is the term I use for the idea that there is no ethical theory which provides acceptable solutions to all moral dilemmas, and which also has the theoretical virtues (such as simplicity, precision and non-arbitrariness) that we currently desire. I think this is an important and true perspective on ethics, and in this post I will explain why I hold it, with the caveat that I'm focusing more on airing these ideas than constructing a watertight argument.

Here’s another way of explaining moral indefinability: let’s think of ethical theories as procedures which, in response to a moral claim, either endorse it, reject it, or do neither. Moral philosophy is an attempt to find the theory whose answers best match our intuitions about what answers ethical theories should give us (e.g. don’t cause unnecessary suffering), and whose procedure for generating answers best matches our meta-level intuitions about what ethical theories should look like (e.g. they should consistently apply impartial principles rather than using ad-hoc, selfish or random criteria). None of these desiderata are fixed in stone, though - in particular, we sometimes change our intuitions when it’s clear that the only theories which match those intuitions violate our meta-level intuitions. My claim is that eventually we will also need to change our meta-level intuitions in important ways, because it will become clear that the only theories which match them violate key object-level intuitions. In particular, this might lead us to accept theories which occasionally evince properties such as:
  • Incompleteness: for some claim A, the theory neither endorses nor rejects either A or ~A, even though we believe that the choice between A and ~A is morally important.
  • Vagueness: the theory endorses an imprecise claim A, but rejects every way of making it precise.
  • Contradiction: the theory endorses both A and ~A (note that this is a somewhat provocative way of framing this property, since we can always add arbitrary ad-hoc exceptions to remove the contradictions. So perhaps a better term is arbitrariness of scope: when we have both a strong argument for A and a strong argument for ~A, the theory can specify in which situations each conclusion should apply, based on criteria which we would consider arbitrary and unprincipled. Example: when there are fewer than N lives at stake, use one set of principles; otherwise use a different set).

Why take moral indefinability seriously? The main reason is that ethics evolved to help us coordinate in our ancestral environment, and did so not by giving us a complete decision procedure to implement, but rather by ingraining intuitive responses to certain types of events and situations. There were many different and sometimes contradictory selection pressures driving the formation of these intuitions - and so, when we construct generalisable principles based on our intuitions, we shouldn't expect those principles to automatically give useful or even consistent answers to very novel problems. Unfortunately, the moral dilemmas which we grapple with today have in fact "scaled up" drastically in at least two ways. Some are much greater in scope than any problems humans have dealt with until very recently. And some feature much more extreme tradeoffs than ever come up in our normal lives, e.g. because they have been constructed as thought experiments to probe the edges of our principles.

Of course, we're able to adjust our principles so that we are more satisfied with their performance on novel moral dilemmas. But I claim that in some cases this comes at the cost of those principles conflicting with the intuitions which make sense on the scales of our normal lives. And even when it's possible to avoid that, there may be many ways to make such adjustments whose relative merits are so divorced from our standard moral intuitions that we have no good reason to favour one over the other. I'll give some examples shortly.

A second reason to believe in moral indefinability is the fact that human concepts tend to have open texture: there is often no unique "correct" way to rigorously define them. For example, we all know roughly what a table is, but it doesn’t seem like there’s an objective definition which gives us a sharp cutoff between tables and desks and benches and a chair that you eat off and a big flat rock on stilts. A less trivial example is our inability to rigorously define what entities qualify as being "alive": edge cases include viruses, fires, AIs and embryos. So when moral intuitions are based on these sorts of concepts, trying to come up with an exact definition is probably futile. This is particularly true when it comes to very complicated systems in which tiny details matter a lot to us - like human brains and minds. It seems implausible that we’ll ever discover precise criteria for when someone is experiencing contentment, or boredom, or many of the other experiences that we find morally significant.

I would guess that many anti-realists are sympathetic to the arguments I’ve made above, but still believe that we can make morality precise without changing our meta-level intuitions much - for example, by grounding our ethical beliefs in what idealised versions of ourselves would agree with, after long reflection. My main objection to this view is, broadly speaking, that there is no canonical “idealised version” of a person, and different interpretations of that term could lead to a very wide range of ethical beliefs. I explore this objection in much more detail in this post. (In fact, the more general idea that humans aren’t really “utility maximisers”, even approximately, is another good argument for moral indefinability.) And even if idealised reflection is a coherent concept, it simply passes the buck to your idealised self, who might then believe my arguments and decide to change their meta-level intuitions.

So what are some pairs of moral intuitions which might not be simultaneously satisfiable under our current meta-level intuitions? Here’s a non-exhaustive list - the general pattern being clashes between small-scale perspectives, large-scale perspectives, and the meta-level intuition that they should be determined by the same principles:
  • Person-affecting views versus non-person-affecting views. Small-scale views: killing children is terrible, but not having children is fine, even when those two options lead to roughly the same outcome. Large-scale view: extinction is terrible, regardless of whether it comes about from people dying or people not being born.
  • The mere addition paradox, aka the repugnant conclusion. Small-scale views: adding happy people and making people more equal can't make things worse. Large-scale view: a world consisting only of people whose lives are barely worth living is deeply suboptimal. (Note also Arrhenius' impossibility theorems, which show that you can't avoid the repugnant conclusion without making even greater concessions).
  • Weighing theories under moral uncertainty. I personally find OpenPhil's work on cause prioritisation under moral uncertainty very cool, and the fundamental intuitions behind it seem reasonable, but some of it (e.g. variance normalisation) has reached a level of abstraction where I feel almost no moral force from their arguments, and aside from an instinct towards definability I'm not sure why I should care.
  • Infinite and relativistic ethics. Same as above. See also this LessWrong post arguing against applying the “linear utility hypothesis” at vast scales.
  • Whether we should force future generations to have our values. On one hand, we should be very glad that past generations couldn't do this. But on the other, the future will probably disgust us, like our present would disgust our ancestors. And along with "moral progress" there'll also be value drift in arbitrary ways - in fact, I don't think there's any clear distinction between the two.

I suspect that many readers share my sense that it'll be very difficult to resolve all of the dilemmas above in a satisfactory way, but also have a meta-level intuition that they need to be resolved somehow, because it's important for moral theories to be definable. But perhaps at some point it's this very urge towards definability which will turn out to be the weakest link. I do take seriously Parfit's idea that secular ethics is still young, and there's much progress yet to be made, but I don't see any principled reason why we should be able to complete ethics, except by raising future generations without whichever moral intuitions are standing in the way of its completion (and isn't that a horrifying thought?). From an anti-realist perspective, I claim that perpetual indefinability would be better. That may be a little more difficult to swallow from a realist perspective, of course. My guess is that the core disagreement is whether moral claims are more like facts, or more like preferences or tastes - if the latter, moral indefinability would be analogous to the claim that there’s no (principled, simple, etc) theory which specifies exactly which foods I enjoy.

There are two more plausible candidates for moral indefinability which were the original inspiration for this post, and which I think are some of the most important examples:
  • Whether to define welfare in terms of preference satisfaction or hedonic states.
  • The problem of "maximisation" in utilitarianism.
I've been torn for some time over the first question, slowly shifting towards hedonic utilitarianism as problems with formalising preferences piled up. While this isn't the right place to enumerate those problems (see here for a previous relevant post), I've now become persuaded that any precise definition of which preferences it is morally good to satisfy will lead to conclusions which I find unacceptable. After making this update, I can either reject a preference-based account of welfare entirely (in favour of a hedonic account), or else endorse a "vague" version of it which I think will never be specified precisely.

The former may seem the obvious choice, until we take into account the problem of maximisation. Consider that a true (non-person-affecting) hedonic utilitarian would kill everyone who wasn't maximally happy if they could replace them with people who were (see here for a comprehensive discussion of this argument). And that for any precise definition of welfare, they would search for edge cases where they could push it to extreme values. In fact, reasoning about a "true utilitarian" feels remarkably like reasoning about an unsafe AGI. I don't think that's a coincidence: psychologically, humans just aren't built to be maximisers, and so a true maximiser would be fundamentally adversarial. And yet many of us also have strong intuitions that there are some good things, and it's always better for there to be more good things, and it’s best if there are as many good things as possible.

How to reconcile these problems? My answer is that utilitarianism is pointing in the right direction, which is “lots of good things”, and in general we can move in that direction without moving maximally in that direction. What are those good things? I use a vague conception of welfare that balances preferences and hedonic experiences and some of my own parochial criteria - importantly, without feeling like it's necessary to find a perfect solution (although of course there will be ways in which my current position can be improved). In general, I think that we can often do well enough without solving fundamental moral issues - see, for example, this LessWrong post arguing that we’re unlikely to ever face the true repugnant dilemma, because of empirical facts about psychology.

To be clear, this still means that almost everyone should focus much more on utilitarian ideas, like the enormous value of the far future, because in order to reject those ideas it seems like we’d need to sacrifice important object- or meta-level moral intuitions to a much greater extent than I advocate above. We simply shouldn’t rely on the idea that such value is precisely definable, nor that we can ever identify an ethical theory which meets all the criteria we care about.

Saturday, 26 January 2019

Orpheus in Drag: two theatre reviews

I've been to a couple of interesting and thought-provoking musicals lately - in particular, Everybody's talking about Jamie and Hadestown. I'd give each of them slightly over 4 stars: excellent in most ways, but with a few clear flaws. The former is the story of a 16-year-old from Sheffield who wants to become a drag queen, and his struggle to wear a dress to the school prom. The latter is a jazz and blues style retelling of the Greek myth of Orpheus and Eurydice, and how he journeys into the underworld to rescue her from Hades. An interesting contrast indeed. (Watch out: spoilers ahead).

The thing I liked most about Jamie was the superb dialogue and characterisation. The musical took place in working-class Sheffield, and while that’s not a setting I’m particularly familiar with, the voice that came through felt very authentic. Jamie's classmates were racially diverse enough to actually look like a typical British state school (half the girls wore hijabs), and they talked to each other in a way which could have been straight out of my own high school (cringe-inducing innuendo included). The main difference was that they were often unrealistically witty, with too many hilarious one-liners - a forgivable sin. Even Jamie's three nemeses were portrayed with nuance and care: one with a schoolyard bully's bluster, one a prim teacher hiding behind rules and propriety, and one (his father) with rough but not overstated bigotry. Meanwhile Jamie’s drag queen mentor wasn't (just) a caricature, but also an ageing man somewhat shocked to see how young they're starting these days.

Anyway, it was fascinating to see a proper representation of a slice of real life that is so rarely shown in the theatre. But the real show-stealer was Jamie’s mother. Her struggle and love and sheer bloody-minded devotion in raising him solo shone through, especially in her ballad He’s my boy. She supported him, she questioned herself, they fought, they reconciled - haven’t we all been exactly there, on one side or the other? At least that’s how it feels, watching them. No saint either of them, but it was their relationship which made the show.

The one weak point, overall? Well, Jamie himself - in some ways. His acting was fabulous, his singing angelic - and yet somehow I couldn't buy into his bildungsroman. Partly, at the start, it was because he was so incredibly camp that it was difficult to see the personality underneath. As the show went on, that wore off. But the reasons why we should care about his quest were still murky. Jamie didn’t seem to have an explanation for it either - for him, cross-dressing wasn’t a sexual thing, it was “just fun”, just...part of who he was? He seemed as uncertain as the audience. I think I believe, on an intellectual level, that desires like these can be very important, psychologically and symbolically. But emotionally speaking, it’s difficult to empathise with such a strong desire to put on women’s clothes. And actually there were several ways in which the script made it harder to do so. Jamie was already happily out and his gayness accepted by almost everyone - so it felt like this was a struggle distinct from his sexual orientation. Of course drag has been a key element of gay culture for decades, but it wasn’t really portrayed that way in the musical (perhaps the link was clearer to the older members of the audience - I, and many other younger audience members, have primarily been exposed to more “mainstream” gay culture). Then, at the end of the first act, Jamie put on a successful drag show! He’s already on track to be a queen, there’s just one thing he needs to do first - go to his school prom in a dress? It sounds fun, sure, but not a climactic victory or a source of fulfilment. The closest he got to linking it to something deeper was talking about his underlying desire to be “beautiful”. But if anything that’s unhealthy, not empowering. So at the end, I’m happy for Jamie, but I’d have been just as happy if he’d skipped the prom and done another drag show instead.

Ironically enough, my overall impression of Hadestown was in many ways the exact opposite. The details were a little disappointing, and the lyrics were trying so hard to rhyme that they ended up rather facile, but the overall force and weight of the story swept over all of that like the Styx bursting its banks. That was particularly true in the second half (for which the first half was mostly forgettable setup). What happened in the second half, then? A bunch of things that I thought wouldn't be possible:
  1. For the first half of the show, Hades had been portrayed as a ruthless industrialist-style villain - a factory and mine owner who abused his workers and scorned his wife Persephone. It’s a clever tack to take, and makes a strong impression, but at the expense of turning him into something of a caricature. In the second half, though, we get a deeper exploration of their marriage, which helps to humanise him and paves the way for Orpheus’ success. (Hades also had an amazing contrabass - it’s just a pity that the score tried so hard to exploit it that he ended up croaking a lot. Two tones up and he’d have been able to hit some really powerful notes).
  2. In the myth, Orpheus sings to Hades so beautifully that everyone who hears it is touched, and he wins Eurydice's freedom from the underworld. In general, it's bloody hard to write a show based around a transcendentally beautiful song (or painting, or dance, or poem) because you either have to hide it, which is disappointing, or show it, and risk even more disappointment (the portrayal of Jamie’s pivotal drag show fell in the former category). Well, here they showed it, and it really was that good. Orpheus was mostly portrayed as a hipster guitarist type, but on the high notes his voice soared beautifully.
  3. In the myth, Hades is persuaded to let Eurydice go - but for his own mysterious reasons, adds the condition that Orpheus must not look back on the way out, or else Eurydice will be trapped forever. In myths you accept that the gods are whimsical and capricious, but I didn't think the scriptwriters could come up with an explanation for why a realistic Hades would be soft-hearted enough to let them go, but cold-hearted enough to make it conditional. I was wrong.
  4. In the myth, Orpheus almost makes it out of the underworld with Eurydice, but then looks back. I didn't think the musical could portray that without making Orpheus seem like an idiot. I was wrong (for reasons I’ll explain shortly).
  5. Eurydice ends up trapped in the underworld forever, and that's the end of the story. I didn't think they could end the show with that tragedy without it being very downbeat. Again, I was wrong, and will explain why - although this one will take quite a bit of exposition.

The most important divergence from the original story was that in the musical, instead of being sent to the underworld because she died, Eurydice accepted an invitation from Hades because she was hungry and desperate and Orpheus was neglecting her; she was then tricked into signing herself into slavery (there were also hints that Hades seduced her). Orpheus neglected her because he was working on a song that would “bring the spring back” and presumably solve the problems of hunger and hardship on a wider scale. Combined, these changes make Eurydice a much less sympathetic character, although she’s played so well that it’s difficult to blame her.

But I have a bone to pick with the overall message. In the original, Orpheus and Eurydice’s love is pure, and the enemy is Death, humanity's eternal nemesis. Now, the enemy is… Orpheus’ obsession with improving the world, and businesspeople who trick desperate workers into signing binding contracts? It’s not quite the same. As Orpheus and Eurydice leave for the surface, Hades’ other workers sing to them “If you can do it, so can we!” But they can’t, can they? What will they do, learn the lyre in between mine shifts? This is a strategy that doesn’t scale. Insofar as the plot catalyst in this version was Eurydice’s poverty, industrialisation is actually the correct way to solve it, despite the hardships which it causes during the transition period. By contrast, Orpheus’ hope of singing the spring back is never really fleshed out, and got totally derailed by the quest for Eurydice, despite (from an altruistic perspective) being much more important.

There is at least one improvement, though. In the original, Orpheus looked back because he was worried that Hades had tricked him - which sort of makes sense, but isn’t particularly narratively resonant to us today. In the musical, though, he is primarily plagued by self-doubt - is he really the sort of person who deserves to have Eurydice follow him? Who is he to expect such devotion? I’ve seen people follow that line of thought often enough that it really does make sense that it could drive him to ruin everything - especially when it’s conveyed by three Fates circling and taunting him. In that sense, it’s a musical for modern times.

And yet it’s a story that’s over two thousand years old, a fact that kept hitting me throughout the performance. The weight of history pressed down, the loves and fears of people separated from me by millennia. At the end, the narrator brings up the theme of eternal recurrence, to soften the ending: this is an old song, a tragedy we’ve sung before. “Here’s the thing: to know how it ends, and still to begin to sing it again - as if it might turn out this time…” can be optimistic, can be hopeful. God, isn’t that exactly it, though? We know how all the quests to rescue ourselves and our loved ones from the underworld have ended - we know what happened to Gilgamesh, and what happened to Eurydice, and what happened to the alchemists seeking the philosopher’s stone, and what happened to the monks who spent their lives in prayer and preparation for the afterlife. And yet death is a technical problem, and we can solve technical problems, and so maybe - maybe - it’ll be us who sing that song, and change the ending. I work for that.

Topics on my mind: December 2018

I recently went on holiday to Italy and Egypt. Visiting such ancient countries has made me wonder whether historians will engage with our own civilisation in the same way in the future. From that perspective, it feels like we don’t spend much effort creating monuments intended to last indefinitely. Skyscrapers are objectively pretty impressive, but I have no idea if any would stay up if left unattended for 500 or 1000 years. To be clear, I’m perfectly happy with us building few or no big vanity projects, and indeed think we’re spending far too much effort preserving historical buildings, especially those from the past two centuries. But the more seriously we take the long-term future, the better - and so it’d be nice to normalise planning on the timeframe of centuries.

On the other hand, we shouldn’t necessarily take the presence of durable historical monuments as evidence that our ancestors were more focused on the long term than we are. Firstly, there’s a selection effect whereby we primarily remember the civilisations that left lasting evidence of their presence. Secondly, history is long, and so even very infrequent construction could leave us with a cornucopia of riches. Thirdly, ancient civilisations might well have traded away longevity for other qualities if they’d had the options we do - we build less of our infrastructure out of concrete than the Romans did, because we now have access to materials which are better overall, albeit less durable. And fourthly, even when they intended to leave a lasting legacy, it was often more focused on the next world than this one.

Speaking of that, another question on my mind is whether our society pays less attention to death than previous ones. The pyramids - the most memorable monuments in existence - were tombs after all. And while the Epic of Gilgamesh is just one data point, I find it very interesting that the very first great work of literature is about a quest for immortality. Then there’s Renaissance art, with its memento mori and danses macabre. And of course Christianity in general has always been a little bit obsessed with death. This was probably spurred by the atrociously high mortality rates of almost every era in human history, which made death a part of everyday life. So now that Western societies have become more secular, lifespans have increased dramatically, and religions are less fire-and-brimstone, discussions of death have dwindled. Some humanists encourage an acceptance of death as a normal part of life, although usually only in passing (I am reminded of Pullman’s afterlife in His Dark Materials, where shades of the dead joyously “loosen and float apart”).* To me, this is a prime example of the naturalistic fallacy. Transhumanists strongly reject the inevitability of death, but make up only a tiny proportion of the population. And yes, there are outcries about lives cut short by disease or violence, and medical spending is through the roof - but when it comes to the overall concept of death, all I see is a sort of blank numbness, some trite aphorisms, and a sweeping under the rug. I think we would all benefit from better discussions on this - I particularly like Bostrom’s Fable of the Dragon Tyrant as a way of making events on a vast scale emotionally comprehensible.


* This passage from Pullman is strikingly beautiful, and equally misguided:

"All the atoms that were them, they’ve gone into the air and the wind and the trees and the earth and all the living things. They’ll never vanish. They’re just part of everything...

Even if it means oblivion, friends, I’ll welcome it, because it won’t be nothing. We’ll be alive again in a thousand blades of grass, and a million leaves; we’ll be falling in the raindrops and blowing in the fresh breeze; we’ll be glittering in the dew under the stars and the moon out there in the physical world, which is our true home and always was."

Monday, 21 January 2019

Disentangling arguments for the importance of AI safety

I recently attended the 2019 Beneficial AGI conference organised by the Future of Life Institute. I’ll publish a more complete write-up later, but I was particularly struck by how varied attendees' reasons for considering AI safety important were. Before this, I’d observed a few different lines of thought, but interpreted them as different facets of the same idea. Now, though, I’ve identified at least 6 distinct serious arguments for why AI safety is a priority. By distinct I mean that you can believe any one of them without believing any of the others - although of course the particular categorisation I use is rather subjective, and there’s a significant amount of overlap. In this post I give a brief overview of my own interpretation of each argument (note that I don’t necessarily endorse them myself). They are listed roughly from most specific and actionable to most general. I finish with some thoughts on what to make of this unexpected proliferation of arguments. Primarily, I think it increases the importance of clarifying and debating the core ideas in AI safety.
  1. Maximisers are dangerous. Superintelligent AGI will behave as if it’s maximising the expectation of some utility function, since doing otherwise can be shown to be irrational. Yet we can’t write down a utility function which precisely describes human values, and optimising very hard for any other function will lead to that AI rapidly seizing control (as a convergent instrumental subgoal) and building a future which contains very little of what we value (because of Goodhart’s law and the complexity and fragility of values). We won’t have a chance to notice and correct misalignment because an AI which has exceeded human level will improve its intelligence very quickly (either by recursive self-improvement or by scaling up its hardware), and then prevent us from modifying it or shutting it down.
    1. This was the main thesis advanced by Yudkowsky and Bostrom when founding the field of AI safety. Here I’ve tried to convey the original line of argument, although some parts of it have been strongly critiqued since then. In particular, Drexler and Shah have disputed the relevance of expected utility maximisation (the latter suggesting the concept of goal-directedness as a replacement), while Hanson and Christiano disagree that AI intelligence will increase in a very fast and discontinuous way.
    2. Most of the arguments in this post originate from or build on this one in some way. This is particularly true of the next two arguments - nevertheless, I think that there’s enough of a shift in focus in each to warrant separate listings.
  2. The target loading problem. Even if we knew exactly what we wanted a superintelligent agent to do, we don’t currently know (even in theory) how to make an agent which actually tries to do that. In other words, if we were to create a superintelligent AGI before solving this problem, the goals we would ascribe to that AGI (by taking the intentional stance towards it) would not be the ones we had intended to give it. As a motivating example, evolution selected humans for their genetic fitness, yet humans have goals which are very different from just spreading their genes. In a machine learning context, while we can specify a finite number of data points and their rewards, neural networks may then extrapolate from these rewards in non-humanlike ways (see the toy sketch after this list).
    1. This is a more general version of the “inner optimiser problem”, and I think it captures the main thrust of the latter while avoiding the difficulties of defining what actually counts as an “optimiser”. I’m grateful to Nate Soares for explaining the distinction, and arguing for the importance of this problem.
  3. The prosaic alignment problem. It is plausible that we build “prosaic AGI”, which replicates human behaviour without requiring breakthroughs in our understanding of intelligence. Shortly after they reach human level (or possibly even before), such AIs will become the world’s dominant economic actors. They will quickly come to control the most important corporations, earn most of the money, and wield enough political influence that we will be unable to coordinate to place limits on their use. Due to economic pressures, corporations or nations who slow down AI development and deployment in order to focus on aligning their AI more closely with their values will be outcompeted. As AIs exceed human-level intelligence, their decisions will become too complex for humans to understand or provide feedback on (unless we develop new techniques for doing so), and eventually we will no longer be able to correct the divergences between their values and ours. Thus the majority of the resources in the far future will be controlled by AIs which don’t prioritise human values. This argument was explained in this blog post by Paul Christiano.
    1. More generally, aligning multiple agents with multiple humans is much harder than aligning one agent with one human, because value differences might lead to competition and conflict even between agents that are each fully aligned with some humans. (As my own speculation, it’s also possible that having multiple agents would increase the difficulty of single-agent alignment - e.g. the question “what would humans want if I didn’t manipulate them” would no longer track our values if we would counterfactually be manipulated by a different agent).
  4. The human safety problem. This line of argument (which Wei Dai has recently highlighted) claims that no human is “safe” in the sense that giving them absolute power would produce good futures for humanity in the long term, and therefore that building an AI which extrapolates and implements the values of even a very altruistic human is insufficient. A prosaic version of this argument emphasises the corrupting effect of power, and the fact that morality is deeply intertwined with social signalling - however, I think there’s a stronger and more subtle version. In everyday life it makes sense to model humans as mostly rational agents pursuing their goals and values. However, this abstraction breaks down badly in more extreme cases (e.g. addictive superstimuli, unusual moral predicaments), implying that human values are somewhat incoherent. One such extreme case is running my brain for a billion years, after which it seems very likely that my values will have shifted or distorted radically, in a way that my original self wouldn’t endorse. Yet if we want a good future, this is the process we’re relying on to go well: a human (or a succession of humans) needs to maintain broadly acceptable and coherent values for astronomically long time periods.
    1. An obvious response is that we shouldn’t entrust the future to one human, but rather to some group of humans following a set of decision-making procedures. However, I don’t think any currently-known institution is actually much safer than individuals over the sort of timeframes we’re talking about. Presumably a committee of several individuals would have lower variance than just one, but as that committee grows you start running into well-known problems with democracy. And while democracy isn’t a bad system, it seems unlikely to be robust on the timeframe of millennia or longer. (Alex Zhu has made the interesting argument that the problem of an individual maintaining coherent values is roughly isomorphic to the problem of a civilisation doing so, since both are complex systems composed of individual “modules” which often want different things.)
    2. While AGI amplifies the human safety problem, it may also help solve it if we can use it to decrease the value drift that would otherwise occur. Also, while it’s possible that we need to solve this problem in conjunction with other AI safety problems, it might be postponable until after we’ve achieved civilisational stability.
    3. Note that I use “broadly acceptable values” rather than “our own values”, because it’s very unclear to me which types or extent of value evolution we should be okay with. Nevertheless, there are some values which we definitely find unacceptable (e.g. having a very narrow moral circle, or wanting your enemies to suffer as much as possible) and I’m not confident that we’ll avoid drifting into them by default.
  5. Misuse and vulnerabilities. These might be catastrophic even if AGI always carries out our intentions to the best of its ability:
    1. AI which is superhuman at science and engineering R&D will be able to invent very destructive weapons much faster than humans can. Humans may well be irrational or malicious enough to use such weapons even when doing so would lead to our extinction, especially if they’re invented before we improve our global coordination mechanisms. It’s also possible that we invent some technology which destroys us unexpectedly, either through unluckiness or carelessness. For more on the dangers from technological progress in general, see Bostrom’s paper on the vulnerable world hypothesis.
    2. AI could be used to disrupt political structures, for example via unprecedentedly effective psychological manipulation. In an extreme case, it could be used to establish very stable totalitarianism, with automated surveillance and enforcement mechanisms ensuring an unshakeable monopoly on power for leaders.
    3. AI could be used for large-scale projects (e.g. climate engineering to prevent global warming, or managing the colonisation of the galaxy) without sufficient oversight or verification of robustness. Software or hardware bugs might then induce the AI to make unintentional yet catastrophic mistakes.
    4. People could use AIs to hack critical infrastructure (including the other AIs which manage the aforementioned large-scale projects). In addition to exploiting standard security vulnerabilities, hackers might induce mistakes using adversarial examples or ‘data poisoning’.
  6. Argument from large impacts. Even if we’re very uncertain about what AGI development and deployment will look like, it seems likely that AGI will have a very large impact on the world in general, and that further investigation into how to direct that impact could prove very valuable.
    1. Weak version: development of AGI will be at least as big an economic jump as the industrial revolution, and therefore affect the trajectory of the long-term future. See Ben Garfinkel’s talk at EA Global London 2018. Ben noted that to consider work on AI safety important, we also need to believe the additional claim that there are feasible ways to positively influence the long-term effects of AI development - something which may not have been true for the industrial revolution. (Personally my guess is that since AI development will happen more quickly than the industrial revolution, power will be more concentrated during the transition period, and so influencing its long-term effects will be more tractable.)
    2. Strong version: development of AGI will make humans the second most intelligent species on the planet. Given that it was our intelligence which allowed us to control the world to the large extent that we do, we should expect that entities which are much more intelligent than us will end up controlling our future, unless there are reliable and feasible ways to prevent it. So far we have not discovered any.
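Returning to argument 2 (and to Goodhart’s law from argument 1), here is a toy sketch of how a model fitted to a finite set of reward labels can extrapolate in non-humanlike ways once something optimises hard against it. Everything here - the bump-shaped “true” reward, the polynomial standing in for a neural network, the search range - is my own illustrative assumption:

```python
# Toy illustration (hypothetical setup and numbers): a reward model fitted to
# finitely many labelled points, then optimised hard over a wider space.

import numpy as np

rng = np.random.default_rng(0)

# "Human" reward labels on a narrow slice of behaviour-space: a gentle bump
# around x = 0.5, only ever labelled for x in [0, 1].
x_train = np.linspace(0.0, 1.0, 20)
r_train = np.exp(-((x_train - 0.5) ** 2) / 0.05) + rng.normal(0, 0.02, x_train.shape)

# Fit an expressive model (a degree-7 polynomial standing in for a neural net).
learned_reward = np.poly1d(np.polyfit(x_train, r_train, deg=7))

# The "agent" searches a much wider space of behaviours for maximal learned reward.
x_search = np.linspace(-5.0, 5.0, 10001)
best = x_search[np.argmax(learned_reward(x_search))]

print(f"highest labelled reward was at x = {x_train[np.argmax(r_train)]:.2f}")
print(f"learned reward is maximised at x = {best:.2f}")
print(f"learned reward there: {learned_reward(best):.2e}")
# Typically the maximiser sits at the edge of the search range, where the
# polynomial's extrapolation explodes - nothing like the behaviour the labels
# were meant to encourage.
```

The point isn’t the specific model class, but that the highest-scoring point under the learned reward can lie far outside the region the labels were ever meant to describe.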

What should we think about the fact that there are so many arguments for the same conclusion? As a general rule, the more arguments support a statement, the more likely it is to be true. However, I’m inclined to believe that quality matters much more than quantity - it’s easy to make up weak arguments, but you only need one strong one to outweigh all of them. And this proliferation of arguments is (weak) evidence against their quality: if the conclusions of a field remain the same but the reasons given for holding those conclusions change, that’s a warning sign for motivated cognition (especially when those beliefs are considered socially important). This problem is exacerbated by a lack of clarity about which assumptions and conclusions are shared between arguments, and which aren’t.

On the other hand, superintelligent AGI is a very complicated topic, and so perhaps it’s natural that there are many different lines of thought. One way to put this in perspective (which I credit to Beth Barnes) is to think about the arguments which might have been given for worrying about nuclear weapons, before they had been developed. Off the top of my head, there are at least four:
  1. They might be used deliberately.
  2. They might be set off accidentally.
  3. They might cause a nuclear chain reaction much larger than anticipated.
  4. They might destabilise politics, either domestically or internationally.

And there are probably more which would have been credible at the time, but which seem silly now due to hindsight bias. So if there’d been an active anti-nuclear movement in the 1930s or early 1940s, the motivations of its members might well have been as disparate as those of AI safety advocates today. Yet the overall concern would have been (and still is) totally valid and reasonable.

I think the main takeaway from this post is that the AI safety community as a whole is still confused about the very problem we are facing. The only way to dissolve this tangle is to have more communication and clarification of the fundamental ideas in AI safety, particularly in the form of writing which is made widely available. And while it would be great to have AI safety researchers explaining their perspectives more often, I think there is still a lot of explicatory work which can be done regardless of technical background. In addition to analysis of the arguments discussed in this post, I think it would be particularly useful to see more descriptions of deployment scenarios and corresponding threat models. It would also be valuable for research agendas to highlight which problem they are addressing, and the assumptions they require to succeed.

This post has benefited greatly from feedback from Rohin Shah, Alex Zhu, Beth Barnes, Adam Marblestone, Toby Ord, and the DeepMind safety team. All opinions are my own.

Saturday, 12 January 2019

Comments on Comprehensive AI Services

Over the last few months I’ve talked with Eric Drexler a number of times about his Comprehensive AI Services (CAIS) model of AI development, and read most of his technical report on the topic. I think these are important ideas which are well worth engaging with, despite personally being skeptical about many of the conclusions. Below I’ve summarised what I see as the core components of Eric’s view, followed by some of my own arguments. Note that these are only my personal opinions. I did make some changes to the summary based on Eric’s comments on early drafts, to better reflect his position - however, there are likely still ways I’ve misrepresented him. Also note that this was written before reading Rohin’s summary of the same report, although I do broadly agree with most of Rohin’s points.

One useful piece of context for this model is Eric's background in nanotechnology, and his advocacy for the development of nanotech as "atomically precise manufacturing" rather than self-replicating nanomachines. The relationship between these two frameworks has clear parallels with the relationship between CAIS and a recursively self-improving superintelligence.

The CAIS model:
  1. The standard arguments in AI safety are concerned with the development of a single AGI agent doing open-ended optimisation. Before we build such an entity (if we do so at all), we will build AI services which each perform a bounded task with bounded resources, and which can be combined to achieve superhuman performance on a wide range of tasks. 
  2. AI services may or may not be “agents”. However, under CAIS there will be no entity optimising extremely hard towards its goals in the way that most AI safety researchers have been worrying about, because: 
    1. Each service will be relatively specialised and myopic (focused on current episodic performance, not maximisation over the whole future). This is true of basically all current AI applications, e.g. image classifiers or Google Translate.
    2. Although agents satisfying the standard rationality axioms can be proved to behave like utility-maximisers, the same is not necessarily true of systems of rational agents (see the sketch just after this list). Most such systems are fundamentally different in structure from individual rational agents - for example, agents within the system can compete with or criticise each other. And since AI services aren’t “rational agents” in the first place, a system composed of them is even less likely to implement a utility-maximiser. 
    3. There won't be very much demand for unified AIs which autonomously carry out large-scale tasks requiring general capabilities, because systems of AI services will be able to perform those tasks just as well or better. 
  3. Early AI services could do things like massively disrupt financial markets, increase the rate of scientific discovery, help run companies, etc. Eventually they should be able to do any task that humans can, at our level or higher. 
    1. They could also be used to recursively improve AI technologies and to develop AI applications, but usually with humans in the loop - in roughly the same way that science allows us to build better tools with which to do better science. 
  4. Our priorities in doing AI safety research can and should be informed by this model: 
    1. A main role for technical AI safety researchers should be to look at the emergent properties of systems of AI services, e.g. which combinations of architectures, tasks and selection pressures could lead to risky behaviour, as well as the standard problems of specifying bounded tasks. 
    2. AI safety experts can also give ongoing advice and steer the development of AI services. AI safety researchers shouldn't think of safety as a one-shot problem, but rather a series of ongoing adjustments. 
    3. AI services will make it much easier to prevent the development of unbounded agent-like AGI through methods like increasing coordination and enabling surveillance, if the political will can be mustered. 
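
One standard way to see why point 2.2 can hold (this is my illustration, not an example from Eric's report) is the Condorcet paradox: agents with individually coherent preferences can, when aggregated by majority voting, produce a cyclic collective preference, which no utility function can represent. Here's a minimal sketch in Python, under the assumption that the "system" chooses between options by pairwise majority vote:

```python
# A minimal sketch of why a system of individually rational agents need not
# behave like a single utility-maximiser. Each of the three agents below has
# transitive preferences over outcomes A, B and C, but pairwise majority voting
# yields a cyclic group preference (the Condorcet paradox).

agents = [
    ["A", "B", "C"],  # agent 1 prefers A > B > C
    ["B", "C", "A"],  # agent 2 prefers B > C > A
    ["C", "A", "B"],  # agent 3 prefers C > A > B
]

def group_prefers(x, y):
    """Return True if a strict majority of agents rank x above y."""
    votes_for_x = sum(ranking.index(x) < ranking.index(y) for ranking in agents)
    return votes_for_x > len(agents) / 2

for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"Group prefers {x} over {y}: {group_prefers(x, y)}")

# All three lines print True: the group "prefers" A to B, B to C, and C to A.
# An intransitive cycle can't be rationalised as maximising any utility
# function over {A, B, C}, even though every individual agent is coherent.
```

Of course, real systems of AI services won't literally aggregate preferences by majority vote; the point is just that coherence of the parts doesn't automatically transfer to the whole.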

I'm broadly sympathetic to the empirical claim that we'll develop AI services which can replace humans at most cognitively difficult jobs significantly before we develop any single superhuman AGI (one unified system that can do nearly all cognitive tasks as well as or better than any human). One plausible mechanism is that deep learning continues to succeed on tasks where there's lots of training data, but doesn't learn how to reason in general ways - e.g. it could learn from court documents how to imitate lawyers well enough to replace them in most cases, without being able to understand law in the way humans do. Self-driving cars are another pertinent example. If that pattern repeats across most human professions, we might see massive societal shifts well before AI becomes dangerous in the adversarial way that’s usually discussed in the context of AI safety.

If I had to sum up my objections to Eric’s framework in one sentence, it would be: “the more powerful each service is, the harder it is to ensure it’s individually safe; the less powerful each service is, the harder it is to combine them in a way that’s competitive with unified agents.” I’ve laid out my arguments in more detail below.

Richard’s view:
  1. Open-ended agentlike AI seems like the most likely candidate for the first strongly superhuman AGI system. 
    1. As a basic prior, our only example of general intelligence so far is ourselves - a species composed of agentlike individuals who pursue open-ended goals. So it makes sense to expect AGIs to be similar - especially if you believe that our progress in artificial intelligence is largely driven by semi-random search with lots of compute (like evolution was) rather than principled intelligent design. 
      • In particular, the way we trained on the world - both as a species and as individuals - was by interacting with it in a fairly unconstrained way. Many machine learning researchers believe that we’ll get superhuman AGI via a similar approach, by training RL agents in simulated worlds. Even if we then used such agents as “services”, they wouldn’t be bounded in the way predicted by CAIS. 
    2. Many complex tasks don’t easily decompose into separable subtasks. For instance, while writing this post I had to keep my holistic impression of Eric’s ideas in mind most of the time. This impression was formed through having conversations and reading essays, but was updated frequently as I wrote this post, and also draws on a wide range of my background knowledge. I don’t see how CAIS would split the task of understanding a high-level idea between multiple services, or (if it were done by a single service) how that service would interact with an essay-writing service, or an AI-safety-research service. 
      • Note that this isn’t an argument against AGI being modular, but rather an argument that requiring the roles of each module and the ways they interface with each other to be human-specified or even just human-comprehensible will be very uncompetitive compared with learning them in an unconstrained way. Even on today’s relatively simple tasks, we already see end-to-end training outcompeting other approaches, and learned representations outperforming human-made representations. The basic reason is that we aren’t smart enough to understand how the best cognitive structures or representations work. Yet it’s key to CAIS that each service performs a specific known task, rather than just doing useful computation in general - otherwise we could consider each lobe of the human brain to be a “service”, and the combination of them to be unsafe in all the standard ways. 
      • It’s not clear to me whether this is also an argument against IDA. I think that it probably is, but to a lesser extent, because IDA allows multiple layers of task decomposition which are incomprehensible to humans before bottoming out in subtasks which we can perform. 
    3. Even if task decomposition can be solved, humans reuse most of the same cognitive faculties for most of the tasks that we can carry out. If many AI services end up requiring similar faculties to each other, it would likely be more efficient to unify them into a single entity. It would also be more efficient if that entity could pick up new tasks in the same rapid way that humans do, because then you wouldn’t need to keep retraining. At that point, it seems like you no longer have an AI service but rather the same sort of AGI that we’re usually worried about. (In other words, meta-learning is very important but doesn’t fit naturally into CAIS). 
    4. Humans think in terms of individuals with goals, and so even if there's an equally good approach to AGI which doesn't conceive of it as a single goal-directed agent, researchers will be biased against it. 
  2. Even assuming that the first superintelligent AGI is in fact a system of services as described by the CAIS framework, it will be much more like an agent optimising for an open-ended goal than Eric claims. 
    1. There'll be significant pressure to reduce the extent to which humans are in the loop of AI services, for efficiency reasons. E.g. when a CEO can't improve on the strategic advice given to them by one AI, or on its implementation by another, there's no reason to keep that CEO around. Then we’ll see consolidation of narrow AIs into one overall system which makes decisions and takes actions, and may well be given an unbounded goal like "maximise shareholder value". (Eric agrees that this is dangerous, and considers it more relevant than other threat models). 
    2. Even if we have lots of individually bounded-yet-efficacious modules, the task of combining them to perform well on new tasks seems difficult, and will require a broad understanding of the world. An overseer service trained to combine those modules for arbitrary tasks may be dangerous, because if it is goal-oriented it can use those modules to pursue its own goals (this assumes that for most complex tasks, some combination of modules performs well - if not, we’ll be using a different approach anyway). 
      • While I accept that many services can be trained in a way which makes them naturally bounded and myopic, this is much less clear to me in the case of an overseer which is responsible for large-scale allocation of other services. In addition to superhuman planning capabilities and world-knowledge, it would probably require arbitrarily long episodes so that it can implement and monitor complex plans. My guess is that Eric would argue that this overseer would itself be composed of bounded services, in which case the real disagreement is how competitive that decomposition would be (which relates to point 1.2 above). 
  3. Even assuming that the first superintelligent AGI is in fact a system of services as described by the CAIS framework, focusing on superintelligent agents which pursue unbounded goals is still more useful for technical researchers. (Note that I’m less confident in this claim than the others). 
    1. Eventually we’ll have the technology to build unified agents doing unbounded maximisation. Once built, such agents will eventually overtake CAIS superintelligences because they’ll have more efficient internal structure and will be optimising harder for self-improvement. We shouldn’t rely on global coordination to prevent people from building unbounded optimisers, because it’s hard and humans are generally bad at it. 
    2. Conditional on both sorts of superintelligences existing, I think (and I would guess that Eric agrees) that CAIS superintelligences are significantly less likely to cause existential catastrophe. And in general, it’s easier to reduce the absolute likelihood of an event the more likely it is (even a 10% reduction of a 50% risk is more impactful than a 90% reduction of a 5% risk - the arithmetic is spelled out just below this list). So unless we think that technical research to reduce the probability of CAIS catastrophes is significantly more tractable than other technical AI safety research, it shouldn’t be our main focus.
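
(To spell out the arithmetic behind that parenthetical: a 10% reduction of a 50% risk averts 0.10 × 0.50 = 5 percentage points of absolute risk, whereas a 90% reduction of a 5% risk averts only 0.90 × 0.05 = 4.5 percentage points.)
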
As a more general note, I think that one of the main strengths of CAIS lies in forcing us to be more specific about what tasks we envisage AGI being used for, rather than picturing it divorced from development and deployment scenarios. However, I worry that the fuzziness of the usual concept of AGI has now been replaced by a fuzzy notion of “service” which makes sense in our current context, but may not in the context of much more powerful AI technology. So while CAIS may be a good model of early steps towards AGI, I think it is a worse model of the period I’m most worried about. I find CAIS most valuable in its role as a research agenda (as opposed to a predictive framework): it seems worth further investigating the properties of AIs composed of modular and bounded subsystems, and the ways in which they might be safer (or more dangerous) than alternatives.


Many thanks to Eric for the time he spent explaining his ideas and commenting on drafts. I also particularly appreciated feedback from Owain Evans, Rohin Shah and Jan Leike.