Monday, 27 November 2017

A New State Solution?

A few days ago I attended a fascinating talk by two officers in the Israel Defence Forces, Brigadier-General Amir Avivi and Sergeant Benjamin Anthony. They were representing the Miryam Institute, a recently-founded Israeli think-tank which advocates for the "New State" solution to the Israel-Palestine conflict. For those unfamiliar with the geopolitical situation, a very rough summary is that there are two separate Palestinian territories, the Gaza Strip (controlled by Hamas) and the West Bank (controlled by the Palestinian Authority). The Gaza Strip is less than 10% of the size of the West Bank, and much poorer, but far more densely populated (around 2 million people in the Gaza Strip vs 3 million in the West Bank). Both are occupied by Israel, with significant economic repercussions, particularly in the Gaza Strip. There are also frequent outbreaks of violence between Palestinians and Israelis, including several uprisings ("intifadas"), crackdowns by the Israel Defence Forces, and regular terrorist and rocket attacks on Israel. Given Israel's technological and military advantages, there are many more casualties amongst Palestinians; how many of these are active combatants is disputed.


Proposed resolutions to the situation can generally be classified as either "one-state" or "two-state" solutions; in arguing for the New State solution, Avivi and Anthony aim to break out of those two categories (note that I follow their lead in using "two-state solution" to refer to the establishment of a Palestinian state roughly based on current borders, and autonomous from Israel). They started by making a number of claims, all of which sound fairly plausible:
  1. When considering possible solutions to the Israel-Palestine conflict, it's important for a Palestinian state, if formed, to be independent and sovereign.
  2. Controlling the West Bank is crucial to Israel's long-term security. Its western border is high ground overlooking flat, densely-populated portions of Israel. Snipers or artillery based there could easily terrorise most of Israel's population; defending against a large-scale assault launched from there would be incredibly difficult. In the long term, without Israel securing the border, Jordan would likely be unable to prevent militants from ISIS or Iran from entering the West Bank.
  3. Israel, Gaza and the West Bank are tiny. Israel, at its narrowest point, is only 9 miles wide. Planes from Europe landing at Ben-Gurion Airport need to cross over almost the entire West Bank just to get the right landing angle. There's no way, in a two-state solution, that both could maintain effective sovereign airspace.
  4. The West Bank is not economically promising territory. It has no significant natural resources and no sea access.
  5. A disconnected Palestinian state comprising the West Bank and the Gaza Strip would also struggle economically; and even if the two territories could be connected (e.g. via a tunnel), the connection could not realistically be defended.
Taken together, these points suggest that no two-state solution can remove Palestine from Israeli control without leading to significant future conflict, or else to an effective client-state status for Palestine. There are also other reasons why the standard proposals are undesirable:
  1. Most proposed two-state solutions also involve the forced removal of hundreds of thousands of people from their homes; even if this is a necessary step, it still breaches their rights.
  2. All proposed one-state solutions would involve either Jews eventually losing control of the Israeli government (since Palestinians have many more children) or Palestinians being second-class citizens without voting rights.
  3. While there is a lot of focus on the West Bank, it is the Gaza Strip which is the most urgent humanitarian crisis; most proposed solutions unfairly treat it as a side issue.
  4. Given the chaos throughout the Middle East, and the threat of Iran, many nearby countries are or want to be on relatively good terms with Israel, and no longer find the Palestine situation useful for drumming up outrage. So it will become more difficult to apply enough pressure to force Israel to accept the strategic concessions which I outlined above.
In a roundabout way, these claims boil down to the fact that there is no proposal for Palestinian autonomy which it would benefit Israel to accept - and that while this remains the case, peace cannot last. It may well be a good thing to apply international pressure to force Israel to alleviate the suffering of Palestinians in the short term, but in the long term I'm not sure that Israel will ever make such major strategic concessions as those outlined above. It does seem unjust, in response to this, to propose peace agreements which give Israel even more of what it wants. But any approach to the peace process which emphasises justice will inevitably get bogged down in arguments over historical wrongs stretching back at least to the foundation of Israel, if not before. I'm more interested in figuring out how to realistically achieve the best outcome for the most people - and so, it seems, is the Miryam Institute.

Their proposal, the New State solution, is to form an independent Palestine comprising the Gaza Strip, plus a connected area of the Sinai Peninsula roughly the size of the West Bank. Israel would get control of the West Bank. Egypt would therefore lose a little less than 10% of the Sinai, or in other words a little less than 0.6% of its overall territory; this area houses roughly 30,000 people, or about 0.03% of Egypt's population. Everyone involved would have the choice whether to stay in their current location, or to relocate.

What's required:
  1. Egyptian consent (and whatever remuneration would ensure it)
  2. Money to set up basic housing, infrastructure and services in the New State
With those requirements fulfilled, advocates of the New State solution claim that New Palestine could become autonomous without being an existential threat to Israel (in the same way that Jordan, Lebanon and Egypt are not), which wouldn't be possible under any other solution; and further, that it would have good opportunities to develop economically, with ports and the tourism potential of a beautiful coastline (unlike the West Bank).

The list of requirements might seem rather short. It would of course be ideal to also have the agreement of Israel, the Palestinian Authority, Hamas and the UN. But my favourite thing about this proposal is that, unlike any other I've seen, it has the potential to work without requiring the agreement of any of those parties (let alone all of them!). All you need is one country - say, the US - capable of cutting a deal with Egypt, pumping investment money into New Palestine, and providing interim governance for a couple of years until elections could take place. Given the option, I think many or most Gazans would flee there. Israel wouldn't be stupid enough to forcibly close the Egypt-Gaza border, especially if that meant antagonising the US. Hamas might want to stop people from leaving for the rest of New Palestine - but given the deprivation of the Gaza Strip, any attempt to do so would dry up its support very quickly. Similarly, the PA couldn't do much - and if the New State did well enough, the West Bank would find its population shrinking rapidly, to the point where its being absorbed into Israel seems entirely reasonable.

I'm not sure that very much money is needed, either. The terrible conditions in Gaza are fundamentally driven by oppression; left to themselves (and with the help of remittances from abroad), Gazans and immigrants to New Palestine would likely become better off relatively quickly. Further, US aid to Egypt (mostly military) is substantial, and the Egyptian economy is not doing particularly well; it may well be beneficial for Egypt to agree to the New State solution for little more than goodwill and the assurance of continued assistance.

I have, of course, been ignoring the elephant in the room, which is the fact that the New State proposal doesn't meet the demands of Palestinian negotiators - most notably the Right of Return of Palestinian refugees to Israeli territory, and partial control over Jerusalem - and would almost certainly be rejected outright by the Palestinian Authority. However, this does not seem to me to be a knock-down argument against it. The international community is not outraged primarily because Palestinians do not have the Right of Return, or control over Jerusalem; rather, the current situation is abhorrent because of the depredations which are being imposed on them. If there is a way for Palestinians (especially Gazans) to be liberated from that position, then that would be a good solution - not perfect, but a lot closer than any other. It's also worth noting that even if older generations are committed to the goals above, it seems unlikely that younger Palestinians, less represented by authorities, feel similarly about returning to areas that they've never lived in or even seen.

A second elephant is the reaction of the Israeli government - for example, whether they'd feel secure with a militarised Palestine as their neighbour. However, the fact that this proposal is being pushed by Israeli military officers in particular is reassuring on that front. New Palestine's borders wouldn't be as close to Israeli population centres as the West Bank is, and it also wouldn't have the advantage of high ground. And even if there were an eventual conflict, geography would allow it to be a fairly limited one (similar to the 2006 Israel-Lebanon war), unlike what would happen if Palestine kept its current borders and a war broke out.

Lastly, there's the question of whether a good government could be set up in New Palestine. In theory, a neutral interim government, followed by democratic elections, should allow the Palestinian people to choose a government that represents them. In practice, all major factions have been involved in terrorism against civilians, and it's unlikely that any government could truly represent a "fresh start". Again, however, this would be true in any proposed solution; and I hope that a New Palestinian government busy with running a country would be defined to a lesser extent by its opposition to Israel. When it comes to the Israel-Palestine conflict, hopes have often been misplaced. Perhaps, though, that is all the more reason to embrace a new approach.

Saturday, 18 November 2017

Book review: Happiness by Design

I've been reading a book called Happiness by Design, by Paul Dolan. Most of its material is very standard, but there was at least one thing I hadn't seen before. Dolan thinks that we should consider happiness to be a combination of the feelings of pleasure and purpose. He shows that this is a significant change in our definition of happiness because many of the most pleasurable activities, such as eating, feel the least purposeful - and vice versa. Unfortunately, Dolan doesn't ever make explicit arguments about why some states of mind should be considered 'happiness' and not others. Rather, he seems to be using an implicit definition of happiness as "the thing we value experiencing", or perhaps "the experiences which are intrinsically good for us". I'm going to simplify this and use the phrase "good experiences" instead. The claim that purposeful experiences are good experiences is not unreasonable - in hindsight I am glad to have had experiences of purposefulness, and hope to have more in the future (regardless of whether they lead to other good outcomes or not).

Unfortunately, this evaluation doesn't actually show that the experience of purposefulness is a good one, merely that I evaluate it as good for my past and future selves. As Dolan explains, it's well-known that evaluations of satisfaction don't fully reflect our actual experiences. Our judgements about the past are prone to several notable biases, including the peak-end effect, where the most extreme moment and the final moment of an experience disproportionately influence our later evaluations; duration neglect, where the length of an experience has disproportionately little influence; cognitive dissonance, where our evaluations skew towards confirming that we are the type of person we like to think we are; and priming effects, where making people think about money or relationships exaggerates the impacts of those factors on their reported life satisfaction (although many priming effects have been discredited, it seems like these ones are robust). Furthermore, as bad as we are at figuring out how good we felt during past experiences, we're even worse at figuring out how good we will feel during future experiences. People massively overrate how much happier they will be after getting a new job, or a new house; even events as extreme as winning the lottery or having a limb amputated each have only small effects on experienced happiness, which rebounds back to near a set point fairly soon afterwards.

I'm therefore torn about how much to value life-satisfaction versus valuing quality of experience (the latter position is what Dolan calls "sentimental hedonism", which differs from traditional hedonism in valuing more emotions than just pleasure and lack of pain) in evaluating the overall welfare of a life. On one hand, I'm loath to rely on a metric which is as variable as life-satisfaction evaluations. Even ignoring reporting biases, an evaluation might change from very negative to very positive based on the events in the last few minutes of one's life - for instance, depending on whether you hear that you won the Nobel Prize or not. There are also difficulties in judging cases where your goals change significantly over time. On the other hand, I believe very strongly that there are more things which matter to me than the emotions I feel. For example, I might know that founding a billion-dollar company or curing malaria would require sacrifices in the short term without making me any happier in the long term - yet still devote most of my life to achieving them. The hedonic position would conclude that even these great achievements are actively bad for me, compared with living a simple and happier life. It would also endorse pursuit of the feeling of meaningfulness instead of actually doing anything meaningful. (For a detailed discussion of related issues, check out my essay on different types of utilitarianism). Given the flaws of both metrics, my current theoretical position is somewhere in the middle, leaning a little more towards sentimental hedonism.

From the hedonistic perspective, a good way of judging whether purpose matters is simply to ask people how good they feel while actually doing purposeful activities. Dolan doesn't mention any such research, but it seems to me that Csikszentmihalyi's concept of 'flow' is very relevant here. Flow can be roughly summarised as being "fully immersed in a feeling of energised focus". It's seen particularly in sportspeople (who often describe it as being "in the zone"), musicians and artists, and some religious practitioners. Notably, flow is often experienced as being "intrinsically rewarding". Flow is not synonymous with purposefulness, but I think both terms aim at the same broad idea: a component of wellbeing which depends on intention, and is distinct from pleasure (perhaps we can think of flow as a state of extreme purposefulness). A third phenomenon to consider is the way that we enjoy emotional experiences, for example watching tragedies, despite (or perhaps because of) the negative emotions they evoke. This is not quite purpose or flow, but it does evoke the related idea of meaningfulness. If we take these three facets of wellbeing into account, we might revise some conclusions that have been drawn over the last few decades about which experiences are most rewarding. For example, research suggests that having children makes you less happy, by standard measures of happiness. However, it is plausible that parents experience less pleasure but more purpose and meaningfulness on a daily basis, which might compensate for this.

Happiness by Design does two more useful things (although in a rather confusing and messy way). Firstly it describes, using an economic metaphor, the "production" of happiness and unhappiness from attention. I think this is a useful analogy, especially in conjunction with a number of ways that Dolan claims we can increase happiness via shifting our attention. These include mindfulness, especially of new experiences or different facets of old experiences; not thinking about money; not focusing on our expectations; trying to resolve uncertainty; thinking of others; and avoiding shifting our attention too frequently by multitasking. These all seem like fairly sensible and standard cognitive habits which are very difficult to implement consistently. So secondly, Dolan describes a number of tangible steps which can help cement these cognitive habits. These include: recording how we actually feel during various activities; asking others to evaluate our happiness or even make decisions for us; changing our environments to prime us away from unrewarding activities, while adding cues for good habits; making public commitments (while still being willing to abandon sunk costs); spending more time physically with friends; spending more on experiences not possessions; and listening to music.

Dolan provides some evidence for why some of these work, but could do with a more rigorous approach in this section. Nevertheless, I'm willing to believe that most of the above steps are worth trying; I'd also add altruism to the list, since it has been shown to be a good way to make yourself happier. Overall I'm glad that it has become so widely believed that it's much easier to become happy by changing our attitudes and habits than by changing our external circumstances. I do worry a little that many of us are focusing on making those changes individually, without necessarily creating norms or community structures that will allow others to benefit from them automatically. These do exist in some communities (notably amongst effective altruists and rationalists) but I think a lot more can be done to make sure they spread more widely.

Friday, 17 November 2017

An introduction to deep learning

The last few years have seen a massive surge of interest in deep learning - that is, machine learning using many-layered neural networks. This is not unjustified - these deep neural networks have achieved impressive results on a wide range of problems. However, the core concepts are by no means widely understood; and even those with technical machine learning knowledge may find the variety of different types of neural networks a little bewildering. In this essay I'll start with a primer on the basics of neural networks, before discussing a number of different varieties and some properties of each.

Three points to be aware of before we start:
  1. Broadly speaking, there are three types of machine learning: supervised, unsupervised and reinforcement learning. In supervised learning, you have labelled data. For example, you might have a million pictures which you know contain cats, and a million pictures which you know contain dogs; you can then teach a neural net to distinguish between cats and dogs by making it classify each image and adjusting it based on whether its predictions were correct. This is the most standard variety of deep learning. In unsupervised learning, you have unlabelled data. For example, you might have a million pictures which each contain either cats or dogs, but you don't know which ones are which. You can still create a system which divides the images up into two categories based on how similar they are to each other, and hopefully the two most obvious clusters will be cats and dogs. Reinforcement learning is when a system is able to take actions and then receive feedback about how good those actions were, which affects its future actions. For instance, to teach a robot how to walk using reinforcement learning, we could reward it for moving further while not falling over. (See the sketch after this list for a minimal illustration of the supervised/unsupervised distinction.)
  2. The two major applications of deep learning that I'll be discussing in this essay are computer vision and natural language processing. Tasks grouped under the former include face recognition, image classification, video analysis in self-driving cars, etc; for the latter, speech recognition, machine translation, content summarisation and so on. Of course there are many more applications which don't fall under these headings, such as recommendation systems; there are also many important machine learning algorithms which have nothing to do with deep learning.
  3. Ultimately, as hyped as neural networks are, they are simply a complicated way of implementing a function from inputs to outputs. Depending on exactly how we represent that function, the dimensions of either the inputs or the outputs or both might be very, very large. Several of the following variations on neural networks can be viewed as attempts to reduce the dimensions of the spaces involved.
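To make the supervised/unsupervised distinction from point 1 concrete, here is a minimal sketch in Python. The library (scikit-learn) and the made-up toy "features" are my own illustrative assumptions, not anything prescribed above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Made-up feature vectors standing in for "cat pictures" and "dog pictures".
cats = rng.normal(loc=0.0, size=(100, 5))
dogs = rng.normal(loc=3.0, size=(100, 5))
X = np.vstack([cats, dogs])

# Supervised learning: we know the labels, and train a classifier to predict them.
y = np.array([0] * 100 + [1] * 100)
classifier = LogisticRegression().fit(X, y)

# Unsupervised learning: no labels at all; the algorithm just groups similar
# examples, and we hope the two clusters it finds correspond to cats and dogs.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
```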
Basics of Neural Networks

The simplest neural networks are multilayer perceptrons (MLPs), which contain an input layer, a few hidden layers, and an output layer. Each layer consists of a number of artificial neurons (don't be fooled by the name, though; these "neurons" are very basic and very different to real neurons). Every neuron in a non-input layer is connected to every neuron in the previous layer; each connection has a weight which may be positive or negative (or zero). The value of a neuron is calculated as follows: take the value of each neuron in the previous layer, multiply it by the corresponding weight, and then add them all together. Also add a "bias" term. Now apply some non-linear function to the result to get the overall value of that neuron. If this seems confusing, the image below might help clear it up (within each neuron, the two symbols stand for the two steps of summing all weighted inputs, then applying a non-linear "activation function"). Finally, the output layer will return some value, which is the overall evaluation of the input. In the image, this is just one number; however, in other cases there will be many neurons in the output layer, each corresponding to a different feature that we're interested in. For instance, one output neuron might represent the belief that there is a cat in an input image; another might represent the belief that there is a dog. With large neural networks which have been trained on lots of data, such outputs can be more accurate at these simple image-recognition tasks than humans are.
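Here is a minimal sketch of that calculation in Python. The layer sizes and random weights are arbitrary; this is just the forward pass described above, not a trained network:

```python
import numpy as np

def activation(x):
    # The non-linear "activation function"; here a ReLU (discussed later).
    return np.maximum(0, x)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # input layer: 4 values

W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)   # weights and biases: input -> hidden
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)   # weights and biases: hidden -> output

hidden = activation(W1 @ x + b1)  # weighted sum of inputs, plus bias, then non-linearity
output = W2 @ hidden + b2         # the output layer: here just a single number
```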


Three more points to note:
  1. The non-linearity in each neuron is crucial. Without it, we would end up only doing linear operations - in other words, an arbitrarily deep neural network would be no better than connecting the inputs to the output directly. However, it turns out that we don't need particularly complicated non-linear functions. Later on I'll describe a few of the most common. 
  2. The architecture above involves a LOT of weights. Suppose we want to process a 640 x 480 pixel image (for simplicity, we'll say it's black and white). Then our input layer will contain 307200 values, one for each pixel. Let's say that (as is common) each hidden layer is the same size as the input layer. Then there will be about 94.37 billion weights between the input layer and the first hidden layer (307200^2, plus 307200 bias weights), and the same number between each hidden layer and the next; the sketch after this list spells out the arithmetic. That's absurdly many, especially considering that most images are several times larger than 640 x 480. In fact there are much better neural network architectures for image processing, which I'll describe later.
  3. Roughly speaking, the more weights you use, the more data is required to train the neural network in the first place, but the more powerful it can be after training. This training has always been the most computationally expensive part of using neural networks.
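The arithmetic behind point 2 is easy to check directly: a fully-connected layer needs one weight per pair of neurons, plus one bias per neuron in the next layer.

```python
n = 640 * 480          # 307200 neurons per layer, one per pixel
weights = n * n + n    # one weight per pair of neurons, plus one bias per neuron
print(f"{weights:,}")  # 94,372,147,200 - roughly 94.37 billion, as claimed above
```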
Training Neural Networks

The key idea behind training neural networks is called gradient descent. Given some neural network and a number of inputs, we define a loss function, which is high when the outputs of the neural network on those inputs are very different to what they should be, and low when they are very similar. Gradient descent tells us to change the neural network's weights in whichever direction would most quickly make the loss function smaller: that is, to descend down the loss function's gradient. Then we calculate the new gradient at the point we've just reached, move a small distance in that direction, and repeat until the loss is as low as we can make it. (Since there are often a large number of inputs and it would be inefficient to calculate the exact gradient after every step, most people use stochastic gradient descent instead; this selects one of the inputs randomly at each step and calculates the loss function based only on that. Although this can be very different from the gradient over all inputs, averaged over many steps it's generally close enough to work.)
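As a concrete (if toy) illustration, here is stochastic gradient descent in Python for a simple linear model with squared-error loss. The data, learning rate and model are made up purely to show the update rule:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 inputs, 3 features each
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)    # noisy targets

w = np.zeros(3)                                # the weights we're training
learning_rate = 0.1
for step in range(1000):
    i = rng.integers(len(X))                   # pick one input at random (the "stochastic" part)
    error = X[i] @ w - y[i]
    gradient = error * X[i]                    # gradient of the loss 0.5 * error^2 w.r.t. w
    w -= learning_rate * gradient              # step a small distance down the gradient
```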

The image below illustrates both advantages and disadvantages of gradient descent. The advantage is that both black lines quickly make their way from the red area (representing high loss) to the blue areas (representing low loss). The disadvantage is that even when starting from very similar points, they take very different paths. Ideally we'd like to end up in whichever of those blue valleys is lower without getting "stuck" in the other one, no matter where we start. There are a variety of ways to make this more likely; perhaps the easiest is to do the training process multiple times, starting with different randomly-selected weights, and see which one ends up as the best solution.



This is all very well in principle, and seems easy enough in the two-dimensional case above - but as I mentioned above, there can be millions of weights in a neural network. So even to do stochastic gradient descent we need to find the gradient of a complicated function in a million-dimensional space. Surprisingly, this can actually be done exactly and efficiently using the backpropagation algorithm. This works by going from the output layer backwards, calculating how much each individual weight contributes to the loss function using the chain rule. For example, if Y is the value of the output node, L is the loss, and the neurons in the last hidden layer are A₁, A₂, etc, then we know that dL/dA₁ = dL/dY × dY/dA₁. But now we can calculate the gradient for any neuron B in the penultimate hidden layer, dL/dB, as a function of dL/dA₁, dL/dA₂, and so on. So we know how much we should update each weight in order to descend down the steepest gradient.
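Here is a minimal sketch of that chain-rule bookkeeping for a tiny two-layer network with sigmoid activations and squared-error loss. All sizes and values are arbitrary, chosen only to show the backward pass:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input
t = 1.0                           # target output
W1 = rng.normal(size=(4, 3))      # input -> hidden weights
W2 = rng.normal(size=(1, 4))      # hidden -> output weights

# Forward pass
a = sigmoid(W1 @ x)               # hidden-layer values (the A's)
y = sigmoid(W2 @ a)[0]            # output value Y
loss = 0.5 * (y - t) ** 2         # loss L

# Backward pass: work from the output layer backwards using the chain rule.
dL_dy = y - t                             # dL/dY
dy_dz = y * (1 - y)                       # derivative of the output sigmoid
dL_dW2 = (dL_dy * dy_dz) * a[None, :]     # gradient for the output-layer weights
dL_da = (dL_dy * dy_dz) * W2[0]           # dL/dA_i = dL/dY * dY/dA_i
dL_dW1 = (dL_da * a * (1 - a))[:, None] * x[None, :]  # gradient for the hidden-layer weights
# Gradient descent would now subtract a small multiple of each gradient from its weights.
```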

Non-linearities

I mentioned before that the choice of non-linearity is important. The three most common non-linearities are the tanh function, the sigmoid function (pictured directly below) and ReLUs (rectified linear units), which simply use the function max(0, x).


These days ReLUs are the most popular, because they make backpropagation very simple (since a ReLU's gradient is always either 1 or 0) and can learn faster than the others. There are a few problems with them, but not insurmountable ones. One is that a neuron might end up always in the 0 section of its activation function, so its gradient is also always 0 and its weights never update (the "dying ReLU" problem). This can be solved using "leaky ReLUs", which replace the zero gradient with a small positive gradient (e.g. by implementing max(0.1x, x); see image below). Another problem is that we usually want the final output to be interpretable as a probability, for which ReLUs are not suitable; so the output layer generally uses either a sigmoid function (if there's only one output neuron) or a softmax function (if there are many).
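For concreteness, here are these non-linearities written out in Python (the leaky-ReLU slope of 0.1 matches the example above; softmax is included since it's mentioned for the output layer):

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # squashes any input into (0, 1)

def relu(x):
    return np.maximum(0, x)            # gradient is 1 for positive inputs, 0 otherwise

def leaky_relu(x, slope=0.1):
    return np.maximum(slope * x, x)    # small positive gradient even for negative inputs

def softmax(x):
    e = np.exp(x - np.max(x))          # subtracting the max avoids numerical overflow
    return e / e.sum()                 # outputs are positive and sum to 1, like probabilities
```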


Types of Neural Networks

Convolutional neural networks

These are currently the most important type of neural network for computer vision. I mentioned above that image processing with standard MLPs is very inefficient, because so many different weights are required. CNNs solve this problem by adding "convolutional layers" before the standard MLP hidden layers. The value of any neuron in a convolutional layer is calculated by multiplying a small patch of the previous layer elementwise by a "kernel" matrix and summing the results. For example, in the image below the value in the green square (4) is calculated by multiplying each value in the red square by the corresponding value in the kernel (blue square) and summing all the products. Then the next value in the I*K layer (1) is calculated by moving the red square rightward one space and repeating the calculation. The same kernel is therefore applied at every possible position in the previous layer. The overall effect is to reduce the number of weights per layer drastically; for instance, in the image below only 9 weights are required for the kernel, whereas it would require 7*7*5*5 = 1225 weights to connect the two layers in an MLP.
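Here is a minimal Python sketch of that sliding-kernel calculation (a plain loop version, ignoring the optimisations real libraries use):

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over every position in the image; at each position,
    # multiply the overlapping values elementwise and sum them.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A 7x7 input and a 3x3 kernel give a 5x5 output using only 9 weights,
# versus the 7*7*5*5 = 1225 weights a fully-connected layer would need.
rng = np.random.default_rng(0)
result = convolve2d(rng.normal(size=(7, 7)), rng.normal(size=(3, 3)))
```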


Three things to note here:
  1. CNNs still have fully-connected layers like MLPs, but they come only after the convolutional layers have reduced the number of neurons per layer, so that not too many weights need to be learned. They usually also have "pooling layers", which reduce the number of neurons even further, as well as contributing to the network's ability to recognise similar images with different rotations, translations or proportions. 
  2. The usefulness of CNNs is the result of a tradeoff, as they are less general than MLPs. Kernels only make sense for inputs whose structure is local. Fortunately, images have this sort of structure. We can think of each kernel as picking out some local feature of an image (for instance, an edge) and passing it on to the next layer; then the kernel for the next layer detects some local structure of those features (for instance, the combination of edges to form a square) and passes that forward in turn. 
  3. The diagram above is a simplification because sometimes several kernels are used in a single layer - this can be thought of as detecting multiple different local features at once.
Recurrent Neural Networks

The two types of NNs I've discussed so far have both been varieties of feedforward neural network - meaning they have no cycles in them, and therefore the evaluation of an input doesn't depend on previous inputs. Relatedly, they can only take inputs of a fixed length. RNNs are different; their hidden layers feed back upon themselves. This makes them useful for processing sequences of data, because an RNN fed each element in turn is equivalent to a much larger NN which processes the whole sequence at once (as indicated in the "unfolding" step in the diagram below; each circle is a standard MLP, all with the same weights). If we want to train the RNN to predict a sequence, then the loss function should penalise each o to the extent that it differs from the following x.
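A vanilla RNN step is simple enough to write out directly. Here is a minimal Python sketch in which the same three weight matrices are reused at every position in the sequence (all sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 5, 3

Wxh = rng.normal(size=(hidden_size, input_size))    # input -> hidden weights
Whh = rng.normal(size=(hidden_size, hidden_size))   # hidden -> hidden (the recurrent loop)
Why = rng.normal(size=(output_size, hidden_size))   # hidden -> output weights

xs = rng.normal(size=(10, input_size))              # a sequence of 10 inputs
h = np.zeros(hidden_size)                           # hidden state, carried between steps
outputs = []
for x in xs:
    h = np.tanh(Wxh @ x + Whh @ h)                  # hidden layer feeds back on itself
    outputs.append(Why @ h)                         # the output o produced at this step
```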


Long Short-term Memory Networks

Standard RNNs can process sequences, but it turns out that they are quite "forgetful", because the output from a given element passes through an MLP once for each following element in the sequence, and quickly becomes attenuated. Almost all implementations now replace those MLPs with a more complicated structure which retains information more easily; these modified RNNs are called LSTMs. The diagram below may look complicated, but it's equivalent to the one above, only with the circles replaced by boxes. Inside these boxes, each rectangle represents a neural network layer, its label shows which non-linearity it uses, and each circle represents simple addition or multiplication. The particular architecture below is not unique, and many variations exist, most of which have proven very effective at predicting sequences.
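For the curious, here is a sketch of one common formulation of the LSTM step in Python. The gate structure is standard, but (as noted above) the details vary between implementations, and all sizes here are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    H = h_prev.shape[0]
    z = np.concatenate([h_prev, x])    # previous hidden state and current input together
    gates = W @ z + b                  # one big affine transform, then split four ways
    f = sigmoid(gates[0:H])            # forget gate: what to erase from the cell state
    i = sigmoid(gates[H:2*H])          # input gate: what new information to store
    o = sigmoid(gates[2*H:3*H])        # output gate: what to expose as the hidden state
    g = np.tanh(gates[3*H:4*H])        # candidate new cell values
    c = f * c_prev + i * g             # the cell state changes slowly, so it "remembers"
    h = o * np.tanh(c)                 # the new hidden state
    return h, c

# Example usage with arbitrary sizes:
rng = np.random.default_rng(0)
H, X = 5, 3
W, b = rng.normal(size=(4 * H, H + X)), np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H), W, b)
```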



Some less prominent types of neural networks:

Recursive Neural Networks

Recursive NNs are a generalisation of Recurrent Neural Networks. I noted above that recurrent NNs "unroll" into a linear structure. But what if we wanted them to unroll into something more interesting: in particular, a tree structure? We can do this by creating a neural network which takes two inputs and combines them in some way so that they can be passed onwards, as well as outputting a score. In a similar way to how convolutional NNs are well-suited to vision problems, recursive NNs have been used successfully in natural language processing, since natural languages are inherently recursive and tree-structured. However, they don't seem as popular as the others discussed above, possibly because they require input to be pre-processed to form a tree structure.

Generative Adversarial Networks

GANs can be used to produce novel outputs, particularly images. They consist of two networks: the generative network creates candidate images, and the discriminator network decides whether they are real or fake. Each learns via backpropagation to try to beat the other; this results in the generative network eventually being able to produce photorealistic images.

Residual Neural Networks

These were introduced by Microsoft quite recently. The core idea is that it is often easier for a neural layer to calculate its output as a change from the input (a 'residual'), rather than as an unreferenced value. This is particularly useful in very deep neural networks.
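The idea fits in a couple of lines. Here is a sketch in Python where `layer` stands for any function mapping a vector to a vector of the same size (all names and values are illustrative):

```python
import numpy as np

def residual_block(x, layer):
    # The layer only needs to learn a correction to x, not a whole new representation;
    # if the best thing to do is nothing, it can simply learn to output zeros.
    return x + layer(x)

# Toy usage: a small random layer applied residually.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)) * 0.1
out = residual_block(np.ones(4), lambda v: np.tanh(W @ v))
```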

Deep Belief Networks

These are similar to standard MLPs, except that the connections between layers go both ways. Each adjacent pair of layers can be trained independently using unsupervised learning, before being combined into the DBN for additional supervised learning. (A single pair of layers with these properties is called a Restricted Boltzmann Machine).

Self-Organising Maps

These use unsupervised learning to produce a low-dimensional representation of input data. They use competitive learning instead of error-correction learning (such as gradient descent).

Capsule Networks

These are a very new architecture, based on CNNs and introduced by one of the founders of deep learning, Geoffrey Hinton. Apparently, the pooling layers in CNNs have the undesirable side effect of also accepting some types of deformed images. This means that CNNs are not very good at representing 3-dimensional objects as geometric objects in space. Capsules address this problem using an algorithm called "dynamic routing"; because of this, they require far fewer examples to train (although they still need massive computational power). Capsule networks seem very exciting and I'll hopefully be investigating them further as part of a project for my MPhil, so stay tuned.

Sunday, 5 November 2017

Epistemic modesty

Inspired by Eliezer's new book, and a conversation I had with Ben Pace a few months back, I decided to write up some thoughts on the problem of trying to form opinions while accounting for the fact that other people disagree with you - in other words, epistemic modesty. After finishing, I realised that a post making very similar points had been published by Gregory Lewis on an effective altruism forum a few days beforehand. I decided to upload this anyway since I come at it from a slightly different angle.

The most basic case I'm interested in would be meeting somebody who is in just as good an epistemic position as you are - someone who is exactly as intelligent, rational and well-informed as you, and who shares your basic assumptions (i.e. your 'prior') - but who nevertheless disagrees with you to a significant extent. Let's say they assign 20% confidence to a proposition you believe with 80% confidence. At least one of you is missing something; since your positions are symmetric in all relevant ways, it's just as likely to be you as them. It seems inescapable that, if you don't have enough time to resolve the dispute, and you're really sure that you reason equally well, you should split the difference and both update to 50%. Call this the symmetry argument. Now, the possibility of meeting your epistemic "evil twin" is a bit far-fetched, but we can easily modify the scenario to stipulate that they are in at least as good an epistemic position as you are, so that if you meet them your credence in the proposition you disagree about should drop from 80% to 50% or below.

In reality we don't have isolated cases like this; you meet people who agree or disagree with you all the time, so that the effect of any given individual's opinions on your own is fairly small. To be clear, I'm talking here about the effect of simply learning that they have that opinion, assuming again that you don't get a chance to discuss it with them. Even if this constitutes only a small amount of evidence for that opinion, it's still some information. By this logic, though, you shouldn't just be swayed by the opinions of the people you've met, but also everyone else who has an opinion on this subject. You can think of the people whose opinions you know as samples from that population, each of which shifts your beliefs about the underlying distribution. Most of these samples will be very biased, because of filter effects and social bubbles, but theoretically you can control for that using standard statistics to get your best estimate of how opinions on a given topic vary by expertise, intelligence, rationality and other relevant factors.

We face two further issues. The first is how to account for the distribution of other people's beliefs when forming your own. This is something that an ideal Bayesian reasoner would do by using a prior. It's also a domain where I think humans naturally do well at approximating a prior, since our social intuitions are heavily optimised for figuring out other people's trustworthiness. Additionally, there are plenty of historical examples, although it's difficult to quantify them in a way that we can use to draw conclusions. (It seems like Eliezer's new book will attempt to provide a framework which can be used to address this issue).

The second issue is this: what if you meet someone who is in just as good an epistemic position as you are (as described above), and who agrees with you about the distribution of everyone else's beliefs about some proposition, but who still disagrees with you about the probability of that proposition? According to the methodology I described above, it's enough to just update your estimates of who believes what, and your resulting probability estimates. This will usually only result in a small shift, because they're only one person - whereas the symmetry argument above suggests that you should in fact end up halfway between their opinion and your own. But if we accept the symmetry argument, so that the disagreement of just one other person could rationally change any of your beliefs from arbitrary confidence to 50%, then the effect of the opinions of billions of other people must be so overwhelming that no rational person could ever defy a consensus. What gives?

(I'm eliding over the issue of how confident you are that the other person is in at least as good an epistemic position as you are. While obviously uncertainty about that will have some effect, almost all of us should be very confident that such people exist, and can imagine being convinced that someone you've met is one of them).

The problem, I think, is the clash between the reasonable intuition towards epistemic humility, and any formalisation of it. Suppose you reason according to some system R, which assigns probabilities to various propositions. There's always a chance that R is systematically flawed, or you've made a mistake in implementing it, and so you should update your reasoning system to take that into account. Let's call this new reasoning system R'; it's very similar to R except that it's a bit less confident in most beliefs. But then we can apply exactly the same logic to R' to get R'', and so on indefinitely, each time becoming less and less certain of your beliefs until you end up not believing very much at all!

I can see three potential ways to defuse this issue. Firstly, if such adjustments are limited in scope, so that even arbitrarily many changes from R to R' to R'' only change your beliefs by a small, bounded amount, then you can still be confident in some beliefs. But I don't think that this can be the case, because if you are using R and you meet someone else using R who very confidently disagrees with you, then by the symmetry argument you need to be able to adjust any belief all the way down to 50% (or very near it). This sort of adjustment is the responsibility of R', and so R' must be able to modify R to a significant extent. Further, it's almost certain that there are people out there who are in general in a better epistemic position than you and disagree with you strongly! So your resulting beliefs will still end up slavishly following other people's. A second possibility is if you're certain that you have already taken every factor into account - so that if you meet someone who shares your priors, and who reasons similarly to you, then it's impossible for the two of you to have come to different conclusions. This is ideal Bayesianism. But even ideal Bayesian agents have blind spots - for example, they can't assign any nonzero probability to revisions to the laws of logic or mathematics, or any other possibility that implies that Bayesian reasoning is incorrect. But we ourselves were convinced of such correctness by evidence which could be overturned, for example by someone publishing a proof that Bayesianism is inconsistent. It seems like we'd want an "ideal" agent to be capable of evaluating that proof in a way that an ideal Bayesian can't.

If you think this sounds a bit Gödelian, though, you're right. What we're looking for seems to be very similar to asking a proof-based system to demonstrate its own consistency, which is impossible. I don't know how that result might extend to probabilistic reasoners; it seems likely that asking them to assign coherent probabilities to the prospect of their own failure would be similarly futile. But it might not be, which is the third way out: for your epistemology to be self-referential in a way which allows it to sensibly assign a small probability to the possibility that it is systematically wrong. This seems closely related to the issue of Vingean uncertainty, and also hints at problems with defining agenthood which I plan to explore in upcoming posts.