Wednesday, 27 July 2022

Moral strategies at different capability levels

Let’s consider three ways you can be altruistic towards another agent:

  • You care about their welfare: some metric of how good their life is (as defined by you). I’ll call this care-morality - it endorses things like promoting their happiness, reducing their suffering, and hedonic utilitarian behavior (if you care about many agents).

  • You care about their agency: their ability to achieve their goals (as defined by them). I’ll call this cooperation-morality - it endorses things like honesty, fairness, deontological behavior towards others, and some virtues (like honor).

  • You care about obedience to them. I’ll call this deference-morality - it endorses things like loyalty, humility, and respect for authority.

I think a lot of the unresolved tension in ethics comes from seeing these types of morality as in opposition to each other, when they’re actually complementary:

  1. Care-morality mainly makes sense as an attitude towards agents who are much less capable than you, and/or can't make decisions for themselves - for example animals, future people, and infants.
    1. In these cases, you don’t have to think much about what the other agents are doing, or what they think of you; you can just aim to produce good outcomes in the world. Indeed, trying to be cooperative or deferential towards these agents is hard, because their thinking may be much less sophisticated than yours, and you might even get to choose what their goals are.

    2. Applying only care-morality in multi-agent contexts can easily lead to conflict with other agents around you, even when you care about their welfare, because:

      1. You each value (different) other things in addition to their welfare.

      2. They may have a different conception of welfare than you do.

      3. They can’t fully trust your motivations.

    3. Care-morality doesn’t focus much on the act-omission distinction. Arbitrarily scalable care-morality looks like maximizing resources until the returns to further investment are low, then converting them into happy lives.

  2. Cooperation-morality mainly makes sense as an attitude towards agents whose capabilities are comparable to yours - for example others around us who are trying to influence the world.

    1. Cooperation-morality can be seen as the “rational” thing to do even from a selfish perspective (e.g. as discussed here), but in practice it’s difficult to robustly reason through the consequences of being cooperative without relying on ingrained cooperative instincts, especially when using causal decision theories. Functional decision theories make it much easier to rederive many aspects of intuitive cooperation-morality as optimal strategies (as discussed further below).

    2. Cooperation-morality tends to uphold the act-omission distinction, and a sharp distinction between those within versus outside a circle of cooperation. It doesn’t help very much with population ethics - naively maximizing the agency of future agents would involve ensuring that they only have very easily-satisfied preferences, which seems very undesirable.
    3. Arbitrarily scalable cooperation-morality looks like forming a central decision-making institution which then decides how to balance the preferences of all the agents that participate in it.

    4. A version of cooperation-morality can also be useful internally: enhancing your own agency by cultivating virtues which facilitate cooperation between different parts of yourself, or versions of yourself across time.

  3. Deference-morality mainly makes sense as an attitude towards trustworthy agents who are much more capable than you - for example effective leaders, organizations, communities, and sometimes society as a whole.

    1. Deference-morality is important for getting groups to coordinate effectively - soldiers in armies are a central example, but it also applies to other organizations and movements to a lesser extent. When individuals try to figure out strategies for themselves, they undermine predictability and group coordination, especially if the group strategy is more sophisticated than the ones the individuals would generate.

    2. In practice, it seems very easy to overdo deference-morality - compared to our ancestral environment, it seems much less useful today. Also, whether or not deference-morality makes sense depends on how much you trust the agents you’re deferring to - but it’s often difficult to gain trust in agents more capable than you, because they’re likely better at deception than you. Cult leaders exploit this.

    3. Arbitrarily-scalable deference-morality looks like an intent-aligned AGI. One lens on why intent alignment is difficult is that deference-morality is inherently unnatural for agents who are much more capable than the others around them.

Cooperation-morality and deference-morality have the weakness that they can be exploited by the agents we hold those attitudes towards; and so we also have adaptations for deterring or punishing this (which I’ll call conflict-morality). I’ll mostly treat conflict-morality as an implicit part of cooperation-morality and deference-morality; but it’s worth noting that a crucial feature of morality is the coordination of coercion towards those who act immorally.

Morality as intrinsic preferences versus morality as instrumental preferences

I’ve mentioned that many moral principles are rational strategies for multi-agent environments even for selfish agents. So when we’re modeling people as rational agents optimizing for some utility function, it’s not clear whether we should view those moral principles as part of their utility functions, versus as part of their strategies. Some arguments for the former:

  • We tend to care about principles like honesty for their own sake (because that was the most robust way for evolution to actually implement cooperative strategies).

  • Our cooperation-morality intuitions are only evolved proxies for ancestrally-optimal strategies, and so we’ll probably end up finding that the actual optimal strategies in other environments violate our moral intuitions in some ways. For example, we could see love as a cooperation-morality strategy for building stronger relationships, but most people still care about having love in the world even if it stops being useful.

Some arguments for the latter:

  • It seems like caring intrinsically about cooperation, and then also being instrumentally motivated to pursue cooperation, is a sort of double-counting.

  • Insofar as cooperation-morality principles are non-consequentialist, it’s hard to formulate them as positive components of a utility function over outcomes. E.g. it doesn’t seem particularly desirable to maximize the amount of honesty in the universe.

The rough compromise which I use here is to:

  • Care intrinsically about the welfare of all agents which currently exist or might in the future, with a bias towards myself and the people close to me.

  • Care intrinsically about the agency of existing agents to the extent that they're capable enough to be viewed as having agency (e.g. excluding trees), with a bias towards myself and the people close to me.
    • In other words, I care about agency in a person-affecting way; and more specifically in a loss-averse way which prioritizes preserving existing agency over enhancing agency.
  • Define welfare partly in terms of hedonic experiences (particularly human-like ones), and partly in terms of having high agency directed towards human-like goals.

    • You can think of this as a mixture of hedonism, desire, and objective-list theories of welfare.

  • Apply cooperation-morality and deference-morality instrumentally in order to achieve the things I intrinsically care about.

    • Instrumental applications of cooperation-morality and deference-morality lead me to implement strong principles. These are partly motivated by being in an iterated game within society, but also partly motivated by functional decision theories.

Rederiving morality from decision theory

I’ll finish by elaborating on how different decision theories endorse different instrumental strategies. Causal decision theories only endorse the same actions as our cooperation-morality intuitions in specific circumstances (e.g. iterated games with indefinite stopping points). By contrast, functional decision theories do so in a much wider range of circumstances (e.g. one-shot prisoner’s dilemmas) by accounting for logical connections between your choices and other agents’ choices. Functional decision theories follow through on commitments you previously made; and sometimes follow through on commitments that you would have made. However, the question of which hypothetical commitments they should follow through with depends on how updateless they are.
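To make the contrast concrete, here’s a minimal sketch (my own toy example, using standard prisoner’s dilemma payoffs) of a one-shot dilemma played against an exact copy of yourself: CDT defects because defection dominates against a causally independent opponent, while FDT cooperates because its choice and the copy’s choice are outputs of the same function.

```python
# A toy sketch (my own illustration, not from the post): a one-shot prisoner's
# dilemma against an exact copy of yourself, with standard payoffs.

PAYOFF = {  # (my_move, their_move) -> my payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def cdt_choice(p_opponent_cooperates=0.5):
    # CDT treats the copy's move as causally independent of mine, so it scores
    # each move against a fixed belief about the opponent. Defection dominates
    # (5 > 3 and 1 > 0), so CDT defects whatever that belief is.
    p = p_opponent_cooperates
    def ev(me):
        return p * PAYOFF[(me, "C")] + (1 - p) * PAYOFF[(me, "D")]
    return max("CD", key=ev)

def fdt_choice():
    # FDT notes that the copy runs the same decision function, so choosing a
    # move logically fixes both moves: only (C,C) and (D,D) are attainable.
    return max("CD", key=lambda move: PAYOFF[(move, move)])

print(cdt_choice())  # 'D' -> both copies defect, each gets 1
print(fdt_choice())  # 'C' -> both copies cooperate, each gets 3
```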

Updatelessness can be very powerful - it’s essentially equivalent to making commitments behind a veil of ignorance, which provides an instrumental rationale for implementing cooperation-morality. But it’s very unclear how to reason about how justified different levels of updatelessness are. So although it’s tempting to think of updatelessness as a way of deriving care-morality as an instrumental goal, for now I think it’s mainly just an interesting pointer in that direction. (In particular, I feel confused about the relationship between single-agent updatelessness and multi-agent updatelessness like the original veil of ignorance thought experiment; I also don’t know what it looks like to make commitments “before” having values.)

Lastly, I think deference-morality is the most straightforward to derive as an instrumentally-useful strategy, conditional on fully trusting the agent you’re deferring to - epistemic deference intuitions are pretty common-sense. If you don’t fully trust that agent, though, then it seems very tricky to reason about how much you should defer to them, because they may be manipulating you heavily. In such cases the approach that seems most robust is to diversify worldviews using a meta-rationality strategy which includes some strong principles.

Friday, 22 July 2022

Which values are stable under ontology shifts?

Here's a rough argument which I've been thinking about lately:

We have coherence theorems which say that, if you’re not acting like you’re maximizing expected utility over outcomes, you’d make payments which predictably lose you money. But in general I don't see any principled distinction between “predictably losing money” (which we view as incoherent) and “predictably spending money” (to fulfill your values): it depends on the space of outcomes over which you define utilities, which seems pretty arbitrary. You could interpret an agent being money-pumped as a type of incoherence, or as an indication that it enjoys betting and is willing to pay to do so; similarly you could interpret an agent passing up a “sure thing” bet as incoherence, or just a preference for not betting which it’s willing to forgo money to satisfy. Many humans have one of these preferences!
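For reference, here’s a toy sketch (my own illustration) of what being money-pumped looks like: an agent with cyclic preferences pays a small fee for each “upgrade”, and so ends up back where it started, predictably poorer after every cycle.

```python
# A toy money-pump (my own illustration): cyclic preferences A > B > C > A,
# plus willingness to pay a small fee for anything preferred to what you hold,
# means a trader can cycle you around and drain your money indefinitely.

prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # (x, y): x is preferred to y
fee = 1                      # what the agent pays to swap to a preferred item
holding, money = "A", 100

for _ in range(3):                   # three full cycles of offered trades
    for offer in ("C", "B", "A"):    # the trader offers each item in turn
        if (offer, holding) in prefers:
            holding, money = offer, money - fee

print(holding, money)  # 'A' 91: back to the starting item, 9 units poorer
```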

Now, these preferences are somewhat odd ones, because you can think of every action under uncertainty as a type of bet. In other words, “betting” isn't a very fundamental category in an ontology which has a sophisticated understanding of reasoning under uncertainty. Then the obvious follow-up question is: which human values will naturally fit into much more sophisticated ontologies? I worry that not many of them will:

  • In a world where minds can be easily copied, our current concepts of personal identity and personal survival will seem very strange. You could think of those values as “predictably losing money” by forgoing the benefits of temporarily running multiple copies. (This argument was inspired by this old thought experiment from Wei Dai.)

  • In a world where minds can be designed with arbitrary preferences, our values related to “preference satisfaction” will seem very strange, because it’d be easy to create people with meaningless preferences that are by default satisfied to an arbitrary extent.

  • In a world where we understand minds very well, our current concepts of happiness and wellbeing may seem very strange. In particular, if happiness is understood in a more sophisticated ontology as caused by positive reward prediction error, then happiness is intrinsically in tension with having accurate beliefs. And if we understand reward prediction error in terms of updates to our policy, then deliberately invoking happiness would be in tension with acting effectively in the world.

    • If there’s simply a tradeoff between them, we might still want to sacrifice accurate beliefs and effective action for happiness. But what I’m gesturing towards is the idea that happiness might not actually be a concept which makes much sense given a complete understanding of minds - as implied by the buddhist view of happiness as an illusion, for example.

  • In a world where people can predictably influence the values of their far future descendants, and there’s predictable large-scale growth, any non-zero discounting will seem very strange, because it predictably forgoes orders of magnitude more resources in the future.

    • This might result in the strategy described by Carl Shulman of utilitarian agents mimicking selfish agents by spreading out across the universe as fast as they can to get as many resources as they can, and only using those resources to produce welfare once the returns to further expansion are very low. It does seem possible that we design AIs which spend millions or billions of years optimizing purely for resource acquisition, and then eventually use all those resources for doing something entirely different. But it seems like those AIs would need to have minds that are constructed in a very specific and complicated way to retain terminal values which are so unrelated to most of their actions.

A more general version of these arguments: human values are generalizations of learned heuristics for satisfying innate drives, which in turn are evolved proxies for maximizing genetic fitness. In theory, you can say “this originated as a heuristic/proxy, but I terminally value it”. But in practice, heuristics tend to be limited, messy concepts which don't hold up well under ontology improvement. So they're often hard to continue caring about once you deeply understand them - kinda like how it’s hard to endorse “not betting” as a value once you realize that everything is a kind of bet, or endorse faith in god as a value if you no longer believe that god exists. And they're especially hard to continue caring about at scale.

Given all of this, how might future values play out? Here are four salient possibilities:

  1. Some core notion of happiness/conscious wellbeing/living a flourishing life is sufficiently “fundamental” that it persists even once we have a very sophisticated understanding of how minds work.

  2. No such intuitive notions are strongly fundamental, but we decide to ignore that fact, and optimize for values that seem incoherent to more intelligent minds. We could think of this as a way of trading away the value of consistency.

  3. We end up mainly valuing something like “creating as many similar minds as possible” for its own sake, as the best extrapolation of what our other values are proxies for.

  4. We end up mainly valuing highly complex concepts which we can’t simplify very easily - like “the survival and flourishing of humanity”, as separate from the survival and flourishing of any individual human. In this world, asking whether an outcome is good for individuals might feel like asking whether human actions are good or bad for individual cells - even if we can sometimes come up with a semi-coherent answer, that’s not something we care about very much.

Thursday, 21 July 2022

Making decisions using multiple worldviews

Tl;dr: the problem of how to make decisions using multiple (potentially incompatible) worldviews (which I'll call the problem of meta-rationality) comes up in a range of contexts, such as epistemic deference. Applying a policy-oriented approach to meta-rationality, and evaluating worldviews by the quality of their advice, dissolves several undesirable consequences of the standard "epistemic" approach to deference.

Meta-rationality as the limiting case of separate worldviews

When thinking about the world, we’d ideally like to be able to integrate all our beliefs into a single coherent worldview, with clearly-demarcated uncertainties, and use that to make decisions. Unfortunately, in complex domains, this can be very difficult. Updating our beliefs about the world often looks less like filling in blank parts of our map, and more like finding a new worldview which reframes many of the things we previously believed. Uncertainty often looks less like a probability distribution over a given variable, and more like a clash between different worldviews which interpret the same observations in different ways.

By “worldviews” I include things like ideologies, scientific paradigms, moral theories, perspectives of individual people, and sets of heuristics. The key criterion is that each worldview has “opinions” about the world which can be determined without reference to any other worldview. Although of course different worldviews can have overlapping beliefs, in general their opinions can be incompatible with those of other worldviews - for example:
  • Someone might have severe uncertainty about far-reaching empirical claims, or about which moral theory to favor.
  • A scientist might be investigating a phenomenon during a crisis period where there are multiple contradictory frameworks which purport to explain the phenomenon.
  • Someone might buy into an ideology which says that nothing else matters except adherence to that ideology, but then feel a “common-sense” pull towards other perspectives.
  • Someone might have a strong “inside view” on the world, but also want to partly defer to the worldviews of trusted friends.
  • Someone might have a set of principles which guides their interactions with the world.
  • Someone might have different parts of themselves which care about different things.

I think of “intelligence” as the core ability to develop and merge worldviews; and “rationality” as the ability to point intelligence in the most useful directions (i.e. taking into account where intelligence should be applied). Ideally we’d like to always be able to combine seemingly-incompatible worldviews into a single coherent perspective. But we usually face severe limitations on our ability to merge worldviews together (due to time constraints, cognitive limitations, or lack of information). I’ll call the skill of being able to deal with multiple incompatible worldviews, when your ability to combine them is extremely limited, meta-rationality. (Analogously, the ideal of emotional intelligence is to have integrated many different parts of yourself into a cohesive whole. But until you’ve done so, it’s important to have the skill of facilitating interactions between them. I won’t talk much about internal parts as an example of clashing worldviews throughout this post, but I think it’s a useful one to keep in mind.)

I don’t think there’s any sharp distinction between meta-rationality and rationality. But I do think meta-rationality is an interesting limiting case to investigate. The core idea I’ll defend in this post is that, when our ability to synthesize worldviews into a coherent whole is very limited, we should use each worldview to separately determine an overall policy for how to behave, and then combine those policies at a high level (for example by allocating a share of resources to each). I’ll call this the policy approach to meta-rationality; and I’ll argue that it prevents a number of problems (such as over-deference) which arise when using other approaches, particularly the epistemic approach of combining the credences of different worldviews directly.

Comparing the two approaches

Let’s consider one central example of meta-rationality: taking into account other people’s disagreements with us. In some simple cases, this is straightforward - if I vaguely remember a given statistic, but my friend has just looked it up and says I’m wrong, I should just defer to them on that point, and slot their correction into my existing worldview. But in some cases, other people have worldviews that clash with our own on large-scale questions, and we don’t know how to (or don’t have time to) merge them together without producing a frankenstein worldview with many internal inconsistencies.

How should we deal with this case, or other cases involving multiple inconsistent worldviews? The epistemic approach suggests:
  1. Generating estimates of how accurate we think each worldview’s claims are, based on its track record.
  2. Using these estimates to evaluate important claims by combining each worldview’s credences to produce our “all-things-considered” credence for that claim.
  3. Using our all-things-considered credences when making decisions.
This seems sensible, but leads to a few important problems:
  1. Merging estimates of different variables into all-things-considered credences might lead to very different answers depending on how it’s done, since after each calculation you lose information about the relationships between different worldviews’ answers to different questions. For example: worldview A might think that you’re very likely to get into MIT if you apply, but that if you attend you’ll very likely regret it. And worldview B might think you have almost no chance of getting in, but that if you do attend you’ll very likely be happy you went. Calculating all-things-considered credences separately would conclude that you have a medium-good chance of a medium-good opportunity, which is much more optimistic than either worldview individually (see the worked numbers just after this list).
    1. One might respond that this disparity only arises when you’re applying the epistemic approach naively, and that the non-naive approach would be to only combine worldviews’ final expected utility estimates. But I think that the naive approach is actually the most commonly-used version of the epistemic approach - e.g. Ben Garfinkel talks about using deference to calculate likelihoods of risk in this comment; and Greg Lewis defends using epistemic modesty for “virtually all” beliefs. Also, combining expected utility estimates isn’t very workable either, as I’ll discuss in the next point.
  2. Worldviews might disagree deeply on which variables we should be estimating, with no clear way to combine those variables into a unified decision, due to:
    1. Differences in empirical beliefs which lead to different focuses. E.g. if one worldview thinks that cultural change is the best way to influence the future, and another thinks that technical research works better, it may be very difficult to convert impacts on those two areas into a “common unit” for expected value comparisons. Even if we manage to do so, empirical disagreements can lead one worldview to dominate another - e.g. if one falls prey to Pascal’s mugging, then its expected values could skew high enough that it effectively gets to make all decisions from then on, even if other worldviews are ignoring the mugging.
    2. Deep-rooted values disagreements. If worldviews don’t share the same values, we can’t directly compare their expected value estimates. Even if every worldview can formulate its values in terms of a utility function, there’s no canonical way to merge utility estimates across worldviews with different values, for the same reason that there’s no canonical way to compare utilities across people: utility functions are equivalent up to positive affine transformations (multiplying one person’s utilities by 1000 doesn’t change their choices, but does change how much they’d influence decisions if we added different people’s utilities together).
      1. One proposed solution is variance normalization, where worldviews’ utilities are normalized to have the same mean and variance. But that can allow small differences in how we differentiate the options to significantly affect how a worldview’s utilities are normalized. (For example, “travel to Paris this weekend” could be seen as one plan, or be divided into many more detailed plans: “travel to Paris today by plane”, “travel to Paris tomorrow by train”, etc.) It’s also difficult to figure out what distribution to normalize over (as I’ll discuss later).
  3. Some worldviews might contain important action-guiding insights without being able to generate accurate empirical predictions - for example, a worldview which tells us that following a certain set of principles will tend to go well, but doesn’t say much about which good outcomes will occur.
  4. Some worldviews might be able to generate accurate empirical predictions without containing important action-guiding insights. For example, a worldview which says “don’t believe extreme claims” will be right much more often than it’s wrong. But since extreme claims are the ones most likely to lead to extreme opportunities, it might only need to be wrong once or twice for the harm of listening to it to outweigh the benefits of doing so. Or a worldview might have strong views about the world in general, but offer little advice for your specific situation.
  5. Since there are many other people whose worldviews are, from an outside perspective, just as trustworthy as ours (or more so), many other worldviews should be given comparable weight to our own. But then when we average them all together, our all-things-considered credences will be dominated by other people’s opinions, and we should basically never make important decisions based on our own opinions. Greg Lewis bites that bullet in this post, but I think most people find it pretty counterintuitive.
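To make the information loss in the college-admissions example concrete, here is a toy calculation (the probabilities are made up for illustration): each worldview individually assigns low expected value to applying, but averaging the credences first makes applying look moderately attractive.

```python
# Toy numbers (made up, not from the post) for the MIT example. Say attending
# and being happy there is worth 1, and every other outcome is worth 0.
def expected_value(p_admit, p_happy_if_attend):
    return p_admit * p_happy_if_attend

worldview_a = (0.9, 0.1)   # very likely to get in, very likely to regret it
worldview_b = (0.05, 0.9)  # almost no chance of getting in, but would love it

print(expected_value(*worldview_a))   # 0.09  -> "don't bother applying"
print(expected_value(*worldview_b))   # 0.045 -> "don't bother applying"

# Epistemic approach: average each credence first, then compute expected value.
avg_admit = (0.9 + 0.05) / 2   # 0.475
avg_happy = (0.1 + 0.9) / 2    # 0.5
print(expected_value(avg_admit, avg_happy))   # 0.2375 -> looks worth pursuing
```

Averaging the two worldviews’ final expected values instead would give 0.0675, preserving the “don’t bother” conclusion - which is the non-naive response discussed under the first problem above.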
The key problem which underlies these different issues is that the epistemic approach evaluates and merges the beliefs of different worldviews too early in the decision-making process, before the worldviews have used their beliefs to evaluate different possible strategies. By contrast, the policy approach involves:
  1. Generating estimates of how useful we think each worldview’s advice is, based on its track record.
  2. Getting each worldview to identify the decisions that it cares most about influencing.
  3. Combining the worldviews’ advice to form an overall strategy (or, in reinforcement learning terminology, a policy), based both on how useful we think the worldview is and also on how much the worldview cares about each part of the strategy.
One intuitive description of how this might occur is the parliamentary approach. Under this approach, each worldview is treated as a delegate in a parliament, with a number of votes proportional to how much weight is placed on that worldview; delegates can then spread their votes over possible policies, with the probability of a policy being chosen proportional to how many votes are cast for it.
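As a concrete toy version of that mechanism (the worldview names, weights, and ballots below are all made up), the vote-counting and sampling step might look like this:

```python
# A toy parliamentary step (my own sketch): each worldview gets votes in
# proportion to its weight, spreads them over candidate policies, and a policy
# is then sampled with probability proportional to the votes it receives.
import random

weights = {"inside_view": 0.5, "deference": 0.3, "heuristics": 0.2}

# How each delegate splits its votes across candidate policies (rows sum to 1).
ballots = {
    "inside_view": {"ambitious_plan": 0.8, "default_plan": 0.2},
    "deference":   {"ambitious_plan": 0.1, "default_plan": 0.9},
    "heuristics":  {"ambitious_plan": 0.5, "default_plan": 0.5},
}

totals = {}
for worldview, ballot in ballots.items():
    for policy, share in ballot.items():
        totals[policy] = totals.get(policy, 0.0) + weights[worldview] * share

policies, votes = zip(*totals.items())
chosen = random.choices(policies, weights=votes, k=1)[0]
print(totals)   # ambitious_plan gets ~0.53 of the vote, default_plan ~0.47
print(chosen)   # one policy, sampled in proportion to those totals
```

Sampling in proportion to votes, rather than always picking the plurality winner, is one way to stop the most heavily-weighted worldview from dictating every decision.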

The policy approach largely solves the problems I identified previously:
  1. Since each worldview separately calculates the actions it recommends in a given domain, no information is lost by combining estimates of variables before using them to make decisions. In the college admissions example, each worldview will separately conclude “don’t put too much effort into college admissions”, and so any reasonable combined policy will follow that advice.
  2. The policy approach doesn’t require us to compare utilities across worldviews, since it’s actions not utilities that are combined across policies. Policies do need to prioritize some decisions over others - but unlike in the epistemic case, this doesn’t depend on how decisions are differentiated, since policies get to identify for themselves how to differentiate the decisions. (However, this introduces other problems, which I’ll discuss shortly.)
    1. Intuitively speaking, this should result in worldviews negotiating for control over whichever parts of the policy they care about most, in exchange for giving up control over parts of the policy they care about least. (This might look like different worldviews controlling different domains, or else multiple worldviews contributing to a compromise policy within a single domain.) Worldviews which care equally about all decisions would then get to make whichever decisions the other worldviews care about least.
  3. Worldviews which can generate good advice can be favored by the policy approach even if they don’t produce accurate predictions.
  4. Worldviews which produce accurate predictions but are overall harmful to give influence over your policy will be heavily downweighted by the policy approach.
  5. Intuitively speaking, the reason we should pay much more attention to our own worldview than to other people’s is that, in the long term, it pays to develop and apply a unified worldview we understand very well, rather than being pulled in different directions by our incomplete understanding of others’ views. The policy approach captures this intuition: a worldview might be very reasonable but unable to give us actionable advice (for example if we don’t know how to consistently apply the worldview to our own situation). Under the policy approach, such worldviews either get lower weight in the original estimate, or else aren’t able to identify specific decisions they care a lot about influencing, and therefore have little impact on our final strategy. Whereas our own worldview tends to be the most actionable, especially in the scenarios where we’re most successful at achieving our goals.

I also think that the policy approach is much more compatible with good community dynamics than the epistemic approach. I’m worried about cycles where everyone defers to everyone else’s opinion, which is formed by deferring to everyone else’s opinion, and so on. Groupthink is already a common human tendency even in the absence of explicit epistemic-modesty-style arguments in favor of it. By contrast, the policy approach eschews calculating or talking about all-things-considered credences, which pushes people towards talking about (and further developing) their own worldviews, which has positive externalities for others who can now draw on more distinct worldviews to make their own decisions.

Problems with the policy approach

Having said all this, there are several salient problems with the policy approach; I’ll cover four, but argue that none of them are strong objections.

Firstly, although we have straightforward ways to combine credences on different claims, in general it can be much harder to combine different policies. For example, if two worldviews disagree on whether to go left or right (and both think it’s a very important decision) then whatever action is actually taken will seem very bad to at least one of them. However, I think this is mainly a problem in toy examples, and becomes much less important in the real world. In the real world, there are almost always many different strategies available to us, rather than just two binary options. This means that there’s likely a compromise policy which doesn’t differ too much from any given worldview’s policy on the issues it cares about most. Admittedly, it’s difficult to specify a formal algorithm for finding that compromise policy, but the idea of fairly compromising between different recommendations is one that most humans find intuitive to reason about. A simple example: if two policies disagree on many spending decisions, we can give each a share of our overall budget and let it use that money how it likes. Then each policy will be able to buy the things it cares about most: getting control over half the money is usually much more than half as valuable as getting control over all the money.
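A toy illustration of that last claim, with made-up valuations: if a worldview spends its share on its highest-value items first, then half the budget captures most of the value it would have got from the whole budget.

```python
# Made-up numbers illustrating diminishing returns to budget share: a worldview
# buys its most-valued items first, so its first dollars do most of the work.
def value_of_budget(item_values, budget):
    # Each item costs 1 unit; buy the most valuable items first.
    return sum(sorted(item_values, reverse=True)[:budget])

wants = [10, 8, 6, 3, 2, 1, 1, 1]   # one worldview's valuations of 8 purchases
full = value_of_budget(wants, 8)     # 32: value of controlling the whole budget
half = value_of_budget(wants, 4)     # 27: value of controlling half of it
print(half / full)                   # ~0.84, far more than 0.5
```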

Secondly, it may be significantly harder to produce a good estimate of the value of each worldview’s advice than the accuracy of each worldview’s predictions, because we tend to have much less data about how well personalized advice works out. For example, if a worldview tells us what to do in a dozen different domains, but we only end up entering one domain, it’s hard to evaluate the others. Whereas if a worldview makes predictions about a dozen different domains, it’s easier to evaluate all of them in hindsight. (This is analogous to how credit assignment is much harder in reinforcement learning than in supervised learning.)

However, even if in practice we end up mostly evaluating worldviews based on their epistemic track record, I claim that it’s still valuable to consider the epistemic track record as a proxy for the quality of their advice, rather than using it directly to evaluate how much we trust each worldview. For example: suppose that a worldview is systematically overconfident. Using a direct epistemic approach, this would be a big hit to its trustworthiness. However, the difference between being overconfident and being well-calibrated plausibly changes the worldview’s advice very little, e.g. because it doesn’t change that worldview’s relative ranking of options. Another example: predictions which many people disagree with can allow you to find valuable neglected opportunities, even if conventional wisdom is more often correct. So when we think of predictions as a proxy for advice quality, we should place much more weight on whether predictions were novel and directionally correct than whether they were precisely calibrated.

Thirdly, the policy approach as described thus far doesn’t allow worldviews to have more influence over some individuals than others - perhaps individuals who have skills that one worldview cares about far more than any other; or perhaps individuals in worlds where one worldview’s values can be fulfilled much more easily than others’. Intuitively speaking, we’d like worldviews to be able to get more influence in those cases, in exchange for having less influence in other cases. In the epistemic approach, this is addressed via variance normalization across many possible worlds - but as discussed above, this could be significantly affected by how you differentiate the possibilities (and also what your prior is over those worlds). I think the policy approach can deal with this in a more principled way: for any set of possible worlds (containing people who follow some set of worldviews) you can imagine the worldviews deciding on how much they care about different decisions by different people in different possible worlds before they know which world they’ll actually end up in. In this setup, worldviews will trade away influence over worlds they think are unlikely and people they think are unimportant, in exchange for influencing the people who will have a lot of influence over more likely worlds (a dynamic closely related to negotiable reinforcement learning).

This also allows us a natural interpretation of what we’re doing when we assign weights to worldviews: we’re trying to rederive the relative importance weights which worldviews would have put on the branch of reality we actually ended up in. However, the details of how one might construct this “updateless” original position are an open problem.

One last objection: hasn’t this become far too complicated? “Reducing” the problem of epistemic deference to the problem of updateless multi-agent negotiation seems very much like a wrong-way reduction - in particular because in order to negotiate optimally, delegates will need to understand each other very well, which is precisely the work that the whole meta-rationality framing was attempting to avoid. (And given that they understand each other, they might try adversarial strategies like threatening other worldviews, or choosing which decisions to prioritize based on what they expect other worldviews to do.)

However, even if finding the optimal multi-agent bargaining solution is very complicated, the question that this post focuses on is how to act given severe constraints on our ability to compare and merge worldviews. So it’s consistent to believe that, if worldviews are unable to understand each other, they’ll do better by merging their policies than merging their beliefs. One reason to favor this idea is that multi-agent negotiation makes sense to humans on an intuitive level - which hasn’t proved to be true for other framings of epistemic modesty. So I expect this “reduction” to be pragmatically useful, especially when we’re focusing on simple negotiations over a handful of decisions (and given some intuitive notion of worldviews acting “in good faith”).

I also find this framing useful for thinking about the overall problem of understanding intelligence. Idealized models of cognition like Solomonoff induction and AIXI treat hypotheses (aka worldviews) as intrinsically distinct. By contrast, thinking of these as models of the limiting case in which we have no ability to combine worldviews naturally points us towards the question of what models of intelligence that do merge worldviews might look like. This motivates me to keep a hopeful eye on various work on formal models of ideal cognition using partial hypotheses which could be merged together, like finite factored sets (see also the paper) and infra-Bayesianism. I also note a high-level similarity between the approach I’ve advocated here and Stuart Armstrong’s anthropic decision theory, which dissolves a number of anthropic puzzles by converting them into decision problems. The core insight in both cases is that confusion about how to form beliefs can arise from losing track of how those beliefs should relate to our decisions - a principle which may well help address other important problems.

Wednesday, 25 May 2022

Science-informed normativity

The debate over moral realism is often framed in terms of a binary question: are there ever objective facts about what’s moral to do in a given situation? The broader question of normative realism is also framed in a similar way: are there ever objective facts about what’s rational to do in a given situation? But I think we can understand these topics better by reframing them in terms of the question: how much do normative beliefs converge or diverge as ontologies improve? In other words: let’s stop thinking about whether we can derive normativity from nothing, and start thinking about how much normativity we can derive from how little, given that we continue to improve our understanding of the world. The core intuition behind this approach is that, even if a better understanding of science and mathematics can’t directly tell us what we should value, it can heavily influence how our values develop over time.

Values under ontology improvements


By “ontology” I mean the set of concepts which we use to understand the world. Human ontologies are primarily formulated in terms of objects which persist over time, and which have certain properties and relationships. The details have changed greatly throughout history, though. To explain fire and disease, we used to appeal to spirits and curses; over time we removed them and added entities like phlogiston and miasmas; now we’ve removed those in turn and replaced them with oxidation and bacteria. In other cases, we still use old concepts, but with an understanding that they’re only approximations to more sophisticated ones - like absolute versus relative space and time. In other cases, we’ve added novel entities - like dark matter, or complex numbers - in order to explain novel phenomena.


I’d classify all of these changes as “improvements” to our ontologies. What specifically counts as an improvement (if anything) is an ongoing debate in the philosophy of science. For now, though, I’ll assume that readers share roughly common-sense intuitions about ontology improvement - e.g. the intuition that science has dramatically improved our ontologies over the last few centuries. Now imagine that our ontologies continue to dramatically improve as we come to better understand the world; and that we try to reformulate moral values from our old ontologies in terms of our new ontologies in a reasonable way. What might happen?


Here are two extreme options. Firstly, very similar moral values might end up in very different places, based on the details of how that reformulation happens, or just because the reformulation is quite sensitive to initial conditions. Or alternatively, perhaps even values which start off in very different places end up being very similar in the new ontology - e.g. because they turn out to refer to different aspects of the same underlying phenomenon. These, plus intermediate options between them, define a spectrum of possibilities. I’ll call the divergent end of this spectrum (which I’ve defended elsewhere) the “moral anti-realism” end, and the convergent end the “moral realism” end.


This will be much clearer with a few concrete examples (although note that these are only illustrative, because the specific beliefs involved are controversial). Consider two people with very different values: an egoist who only cares about their own pleasure, and a hedonic utilitarian. Now suppose that each of them comes to believe Parfit’s argument that personal identity is a matter of degree, so that now the concept of their one “future self” is no longer in their ontology. How might they map their old values to their new ontology? Not much changes for the hedonic utilitarian, but a reasonable egoist will start to place some value on the experiences of people who are “partially them”, who they previously didn’t care about at all. Even if the egoist’s priorities are still quite different from the utilitarian’s, their values might end up significantly closer together than they used to be.


An example going the other way: consider two deontologists who value non-coercion, and make significant sacrifices to avoid coercing others. Now consider an ontological shift where they start to think about themselves as being composed of many different subagents which care about different things - career, relationships, morality, etc. The question arises: does it count as “coercion” when one subagent puts a lot of pressure on the others, e.g. by inducing a strong feeling of guilt? It’s not clear that there’s a unique reasonable answer here. One deontologist might reformulate their values to only focus on avoiding coercion of others, even when they need to “force themselves” to do so. The other might decide that internal coercion is also something they care about avoiding, and reduce the extent to which they let their “morality” subagent impose its will on the others. So, from a very similar starting point, they’ve diverged significantly under (what we’re assuming is an) ontological improvement.


Other examples of big ontological shifts: converting from theism to atheism; becoming an illusionist about consciousness; changing one’s position on free will; changing one’s mind about the act-omission distinction (e.g. because the intuitions for why it’s important fall apart in the face of counterexamples); starting to believe in a multiverse (which has implications for infinite ethics); and many others which we can’t imagine yet. Some of these shifts might be directly prompted by moral debate - but I think that most “moral progress” is downstream of ontological improvements driven by scientific progress. Here I’m just defining moral progress as reformulating values into a better ontology, in any reasonable way - where a person on the anti-realist side of the spectrum expects that there are many possible outcomes of moral progress; but a person on the realist side expects there are only a few, or perhaps just one.


Normative realism


So far I’ve leaned heavily on the idea of a “reasonable” reformulation. This is necessary because there are always some possible reformulations which end up very divergent from others. (For example, consider the reformulation “given a new ontology, just pretend to have the old ontology, and act according to the old values”.) So in order for the framework I’ve given so far to not just collapse into anti-realism, we need some constraints on what’s a “reasonable” or “rational” way to shift values from one ontology to another.


Does this require that we commit to the existence of facts about what’s rational or irrational? Here I’ll just apply the same move as I did in the moral realism case. Suppose that we have a set of judgments or criteria about what counts as rational, in our current ontology. For example, our current ontology includes “beliefs”, “values”, “decisions”, etc; and most of us would classify the claim “I no longer believe that ‘souls’ are a meaningful concept, but I still value people’s souls” as irrational. But our ontologies improve over time. For example, Kahneman and Tversky’s work on dual process theory (as well as the more general distinction between conscious and unconscious processing) clarifies that “beliefs” aren’t a unified category - we have different types of beliefs, and different types of preferences too. Meanwhile, the ontological shifts I mentioned before (about personal identity, and internal subagents) also have ramifications for what we mean when talking about beliefs, values, etc. If we try to map our judgements of what’s reasonable into our new ontology in a reflectively consistent way (i.e. a way that balances between being “reasonable” according to our old criteria, and “reasonable” according to our new criteria), what happens? Do different conceptions of rationality converge, or diverge? If they strongly converge (the “normative realist” position) then we can just define reasonableness in terms of similarity to whatever conception of rationality we’d converge to under ontological improvement. If they strongly diverge, then…well, we can respond however we’d like; anything goes!


I’m significantly more sympathetic to normative realism as a whole than moral realism, in particular because of various results in probability theory, utility theory, game theory, decision theory, machine learning, etc, which are providing increasingly strong constraints on rational behavior (e.g. by constructing different types of dutch books). In the next section, I’ll discuss one theory which led me to a particularly surprising ontological shift, and made me much more optimistic about normative realism. Having said that, I’m not as bullish on normative realism as some others; my best guess is that we’ll make some discoveries which significantly improve our understanding of what it means to be rational, but others which show us that there’s no “complete” understanding to be had (analogous to mathematical incompleteness theorems).


Functional decision theory as an ontological shift


There’s one particular ontological shift which inspired this essay, and which I think has dragged me significantly closer to the moral/normative realist end of the spectrum. I haven’t mentioned it so far, since it’s not very widely-accepted, but I’m confident enough that there’s something important there that I’d like to discuss it now. The ontological shift is the one from Causal Decision Theory (CDT) to Functional Decision Theory (FDT). I won’t explain this in detail, but in short: CDT tells us to make decisions using an ontology based on the choices of individual agents. FDT tells us to make decisions using an ontology based on the choices of functions which may be implemented in multiple agents (and by expanding the concepts of causation and possible worlds to include logical causation and counterpossible worlds).


Because of these shifts, a “selfish” agent using FDT can end up making choices more similar to the choices of an altruistic CDT agent than a selfish CDT agent, for reasons closely related to the traditional moral intuition of universalizability. FDT is still a very incomplete theory, but I find this a very surprising and persuasive example of how ontological improvements might drive convergence towards some aspects of morality, which made me understand for the first time how moral realism might be a coherent concept! (Another very interesting but more speculative point: one axis on which different versions of FDT vary is how “updateless” they are. Although we don’t know how to precisely specify updatelessness, increasingly updateless agents behave as if they’re increasingly altruistic, even towards other agents who could never reciprocate.)


Being unreasonable


Suppose an agent looks at a reformulation to a new ontology, and just refuses to accept it - e.g. “I no longer believe that ‘souls’ are a meaningful concept, but I still value people’s souls”. Well, we could tell them that they were being irrational; and most such agents care enough about rationality that this is a forceful objection. I think the framing I’ve used in this document makes this argument particularly compelling - when you move to a new ontology in which your old concepts are clearly inadequate or incoherent, then it’s pretty hard to defend the use of those old concepts. (This is a reframing of the philosophical debate on motivational internalism.)


But what if they said “I believe that I am being irrational, but I just refuse to stop being irrational”; how could we respond then? The standard answer is that we say “you lose” - we explain how we’ll be able to exploit them (e.g. via dutch books). Even when abstract “irrationality” is not compelling, “losing” often is. Again, that’s particularly true under ontology improvement. Suppose an agent says “well, I just won’t take bets from Dutch bookies”. But then, once they’ve improved their ontology enough to see that all decisions under uncertainty are a type of bet, they can’t do that - or at least they need to be much more unreasonable to do so.


None of this is particularly novel. But one observation that I haven’t seen before: the “you lose” argument becomes increasingly compelling the bigger the world is. Suppose you and I only care about our wealth, but I use a discount rate 1% higher than yours. You tell me “look, in a century’s time I’ll end up twice as rich as you”. It might not be that hard for me to say “eh, whatever”. But suppose you tell me “we’re going to live for a millennium, after which I’ll predictably end up 20,000 times richer than you” - now it feels like a wealth-motivated agent would need to be much more unreasonable to continue applying high discounts. Or suppose that I’m in a Pascal’s mugging scenario where I’m promised very high rewards with very low probability. If I just shrug and say “I’m going to ignore all probabilities lower than one in a million”, then it might be pretty tricky to exploit me - a few simple heuristics might be able to prevent myself being dutch-booked. But suppose now that we live in a multiverse where every possible outcome plays out, in proportion to how likely it is. Now ignoring small probabilities could cause you to lose a large amount of value in a large number of multiverse branches - something which hooks into our intuitive sense of “unreasonableness” much more strongly than the idea of “ignoring small probabilities” does in the abstract. (Relatedly, I don’t think it’s a coincidence that utilitarianism has become so much more prominent in the same era where we’ve become so much more aware of the vastness of the universe around us.)
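As a rough check on the compounding involved (under the simplifying assumption that a 1% gap in discount rates shows up as a 1-percentage-point gap in annual wealth growth):

```python
# Rough compounding arithmetic for the example above (simplified model: the 1%
# difference in discount rates becomes a 1-point difference in annual growth).
gap = 0.01
print((1 + gap) ** 100)    # ~2.7     (same ballpark as "twice as rich" after a century)
print((1 + gap) ** 1000)   # ~21,000  (the "20,000 times richer" after a millennium)
```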


Why am I talking so much about “reasonableness” and moral persuasion? After all, agents which are more rational will tend to survive more often, acquire more resources, and become more influential: in the long term, evolution will do the persuasion for us. But it’s not clear that the future will be shaped by evolutionary pressures - it might be shaped by the decisions of goal-directed agents. Our civilization might be able to “lock in” certain constraints - like enough centralization of decision-making that the future is steered by arguments rather than evolution. And thinking about convergence towards rationality also gives us a handle for reasoning about artificial intelligence. In particular, it would be very valuable to know how much applying a minimal standard of reasonableness to their decisions would affect how goal-directed they’ll be, and how aligned their goals will be with our own.


How plausible is this reasoning?


I’ve been throwing around a lot of high-level concepts here, and I wouldn’t blame readers for feeling suspicious or confused. Unfortunately, I don’t have the time to make them clearer. In lieu of that, I’ll briefly mention three intuitions which contribute towards my belief that the position I’ve sketched in this document is a useful one.


Firstly, I see my reframing as a step away from essentialism, which seems to me to be the most common mistake in analytic philosophy. Sometimes it’s pragmatically useful to think in terms of clear-cut binary distinctions, but in general we should almost always aim to be able to ground out those binary distinctions in axes of continuous variation, to avoid our standard bias towards essentialism. In particular, the moral realism debate tends to focus on a single binary question (do agents converge to the same morality given no pre-existing moral commitments?), whereas I think it’d be much more insightful to focus on a less binary question (how small or large is the space of pre-existing moral commitments which will converge?).


Secondly, there’s a nice parallel between the view of morality which I’ve sketched out here, and the approach some mathematicians take, of looking at different sets of axioms to see whether they lead to similar or different conclusions. In our case, we’d like to understand whether similar starting intuitions and values will converge or diverge under a given approach to ontological reformulation. (I discuss the ethics-mathematics analogy in more detail here.) If we can make progress on meta-ethics by actually answering object-level questions like “how would my values change if I believed X?”, that helps address another common mistake in philosophy - failing to link abstract debates to concrete examples which can be deeply explored to improve philosophers’ intuitions about the problem.


And thirdly, I think this framing fits well with our existing experiences. Our values are strongly determined by evolved instincts and emotions, which operate using a more primitive ontology than the rest of our brains. So we’ve actually got plenty of experience of struggling to shift various values from one ontology to another - of the ways in which some people manage to do so, while others remain unreasonable throughout. We just need to imagine this process continuing as we come to understand the world far better than we do today.

Friday, 15 April 2022

Three intuitions about effective altruism: responsibility, scale, self-improvement

This is a post about three intuitions for how to think about the effective altruism community.

Part 1: responsibility

The first intuition is that, in a global sense, there are no “adults in the room”. Before covid I harboured a hope that, despite the incessant political squabbling we see worldwide, in the face of a major crisis with global implications, there were serious people who would come out of the woodwork to ensure that it went well. There weren’t. And that’s not just a national phenomenon, that’s a global phenomenon. Even countries like New Zealand, which handled covid incredibly well, weren’t taking responsibility in the global way I’m thinking about - they looked after their own citizens, but didn’t try to speed up vaccine distribution overall (e.g. by allowing human challenge trials), or fix everyone else’s misunderstandings.

Others developed the same “no adults in the room” intuition by observing failures on different issues. For some, AI risk; for others, climate change; for others, policies like immigration or housing reform. I don’t think covid is a bigger failure than any of these, but I think it comes much closer to creating common knowledge that the systems we have in place aren’t capable of steering through global crises. This naturally points us towards a long-term goal for the EA community: to become the adults in the room, the people who are responsible enough and capable enough to steer humanity towards good outcomes.

By this I mean something different from just “being in charge” or “having a lot of power”. There are many large power structures, containing many competent people, which try to keep the world on track in a range of ways. What those power structures don’t have is the ability to absorb novel ideas and take novel actions in response. In other words, the wider world solves large problems via OODA loops that take decades. In the case of climate change, decades of advocacy led to public awareness which led to large-scale policies, plus significant reallocation of talent. I think this will be enough to avoid catastrophic outcomes, but that’s more from luck than skill. In the case of covid, the OODA loop on substantially changing vaccine regulations was far too long to make a difference (although maybe it’ll make a difference to the next pandemic).

The rest of the world has long OODA loops because people on the inside of power structures don’t have strong incentives to fix problems; and because people on the outside can’t mobilise people, ideas and money quickly. But EA can. I don’t think there’s any other group in the world which can allocate as much talent as quickly as EA has; I don’t think there’s any other group which can identify and propagate important new ideas as quickly as EA can; and there are few groups which can mobilise as much money as flexibly.

Having said all that, I don’t think we’re currently the adults in the room, or else we would have made much more of a difference during covid. While covid wasn’t itself a central EA concern, it’s closely related to one of our central concerns, and would have been worth addressing for reputational reasons alone. But I do think we were closer to being the adults in the room than almost any other group - particularly in terms of long-term warnings about pandemics, short-term warnings about covid in particular, and converging quickly towards accurate beliefs. We should reflect on what would have been needed for us to convert those advantages into much more concrete impact.

I want to emphasise, though, that being the adults in the room doesn’t require each individual to take on a feeling of responsibility towards the world. Perhaps a better way to think about it: every individual EA should take responsibility for the EA community functioning well, and the EA community should take responsibility for the world functioning well. (I’ve written a little about the first part of that claim in point four of this post.)

Part 2: scale, not marginalism

Historically, EA has thought primarily about the marginalist question of how to do the most good per unit of resources. An alternative, which is particularly natural in light of part 1, is to simply ask: how can we do the most good overall? In some sense these are tautologically equivalent, given finite resources. But a marginalist mindset makes it harder to be very ambitious - it cuts against thinking at scale. For the most exciting projects, the question is not “how effectively are we using our resources”, but rather “can we make it work at all?” - where if it does work it’ll be a huge return on any realistic amount of investment we might muster. This is basically the startup investor mindset; and the mindset that focuses on megaprojects.

Marginalism has historically focused on evaluating possible projects to find the best one. Being scale-focused should nudge us towards focusing more on generating possible projects. On a scale-focused view, the hardest part is finding any lever which will have a big impact on the world. Think of a scientist noticing an anomaly which doesn’t fit into their existing theories. If they tried to evaluate whether the effects of understanding the anomaly will be good or bad, they’d find it very difficult to make progress, and maybe stop looking. But if they approach it in a curious way, they’re much more likely to discover levers on the world which nobody else knows about; and then this allows them to figure out what to do.

There are downsides of scaling, though. Right now, EA has short OODA loops because we have a very high concentration of talent, a very high-trust environment, and a small enough community that coordination costs are low. As we try to do more large-scale things, these advantages will slowly diminish; how can we maintain short OODA loops regardless? I’m very uncertain; this is something we should think more about. (One wild guess: we might be the one group best-placed to leverage AI to solve internal coordination problems.)

Part 3: self-improvement and growth mindset

In order to do these ambitious things, we need great people. Broadly speaking, there are two ways to get great people: recruit them, or create them. The tradeoff between these two can be difficult - focusing too much on the former can create a culture of competition and insecurity; focusing too much on the latter can be inefficient and soak up a lot of effort.

In the short term, it seems like there are still low-hanging fruit when it comes to recruitment. But in the longer term, my guess is that EA will need to focus on teaching the skillsets we’re looking for - especially when recruiting high school students or early undergrads. Fortunately, I think there’s a lot of room to do better than existing education pipelines. Part of that involves designing specific programs (like MLAB or AGI safety fundamentals), but probably the more important part involves the culture of EA prioritising learning and growth.

One model for how to do this is the entrepreneurship community. That’s another place where returns are very heavy-tailed, and people are trying to pick extreme winners - and yet it’s surprisingly non-judgemental. The implicit message I get from them is that anyone can be a great entrepreneur, if they try hard enough. That creates a virtuous cycle, because it’s not just a good way to push people to upskill - it also creates the sort of community that attracts ambitious and growth-minded people. I do think learning to be a highly impactful EA is harder in some ways than learning to be a great entrepreneur - we don’t get feedback on how we’re doing at anywhere near the same rate entrepreneurs do, so the strategy of trying fast and failing fast is much less helpful. But there are plenty of other ways to gain skills, especially if you’re in a community which gives you support and motivation to continually improve.