Tuesday, 24 July 2018

You can't download happiness, but...

...you can download machine learning papers, which is almost the same thing, and I have so many papers on my to-read list. I like being able to write summaries of the most interesting ones, but when I hit 200 open tabs of interesting papers, I realised that the backlog was only going to keep piling up. So I've spent the day skimming as many as I could. There are many I'd like to go back and reread in more detail, but for now the summaries below will have to do.

Deep learning theory

Approximation by superpositions of a sigmoidal function. The original paper showing that sufficiently-wide one-hidden-layer neural networks can approximate any function.
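
The theorem is about existence rather than training, but the flavour of a "superposition of sigmoids" is easy to demonstrate. Here's a toy numpy sketch (mine, not the paper's): random hidden weights, sigmoid units, and least-squares output weights fitting a smooth 1-D function.

```python
# Toy illustration of the flavour of the result (not the proof): a single
# hidden layer of sigmoid units can fit an arbitrary smooth 1-D function.
# Hidden weights are random; output weights are fit by least squares.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)[:, None]           # inputs
y = np.sin(2 * x) + 0.3 * x ** 2               # target function to approximate

n_hidden = 50
W = rng.normal(scale=3.0, size=(1, n_hidden))  # random input-to-hidden weights
b = rng.normal(scale=3.0, size=n_hidden)       # random hidden biases
H = 1.0 / (1.0 + np.exp(-(x @ W + b)))         # sigmoid hidden activations

# Solve for output weights: the "superposition" of sigmoids closest to y.
v, *_ = np.linalg.lstsq(H, y, rcond=None)
y_hat = H @ v

print("max approximation error:", np.abs(y - y_hat).max())
```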

Why does unsupervised pre-training help deep learning? Another classic paper, offering an explanation for why unsupervised pre-training improves performance: apparently it's an "unusual form of regularisation". (How widely is it still used?)

Do deep nets really need to be deep? Ba and Caruana show that shallow neural networks can achieve the same performance as deep neural networks with the same number of parameters if they are trained to mimic the deeper networks - even though they can't match that performance when trained independently. When mimicking, shallow nets need to match the deep net's output probabilities (in fact, its logits), not just its classifications.
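
As a rough sketch of the mimic setup (the paper itself regresses on the teacher's logits; below I match softened outputs with a KL loss instead, and both models are just placeholder sizes):

```python
# Sketch of mimic training, assuming PyTorch and two placeholder models:
# a pre-trained deep `teacher` and a shallow `student`. The student is
# trained on the teacher's soft outputs, not on hard labels.
import torch
import torch.nn.functional as F

teacher = torch.nn.Sequential(  # stand-in for a deep, pre-trained network
    torch.nn.Linear(784, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)
student = torch.nn.Sequential(  # one wide hidden layer, similar parameter count
    torch.nn.Linear(784, 1200), torch.nn.ReLU(),
    torch.nn.Linear(1200, 10),
)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(64, 784)                         # a batch of (fake) inputs
with torch.no_grad():
    soft_targets = F.softmax(teacher(x), dim=1)  # teacher's output probabilities

# KL divergence between the student's and teacher's output distributions.
opt.zero_grad()
loss = F.kl_div(F.log_softmax(student(x), dim=1), soft_targets,
                reduction="batchmean")
loss.backward()
opt.step()
```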

How does batch normalisation help optimisation? Batch normalisation, a widely-used technique in deep learning, normalises each layer's activations so that, within each batch, every feature has zero mean and unit variance (followed by a learned rescaling and shift). The standard explanation of why batch normalisation works is that it reduces internal covariate shift, in which each layer has to keep adapting to a new distribution of inputs. However, the authors find that batch normalisation may not be reducing ICS, and that batch-normalised networks with artificially-increased ICS still show good performance. Instead, they argue that batch normalisation makes the loss landscape significantly smoother, and increases its Lipschitzness.
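
For concreteness, here's a minimal numpy sketch of the transform for a single layer's activations (including the learned rescaling and shift that follows the normalisation):

```python
# Minimal sketch of the batch-normalisation transform for one layer:
# per-feature statistics over the batch, then a learned scale and shift.
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (batch, features). gamma, beta: learned per-feature parameters."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta              # then rescale and shift

x = np.random.randn(32, 8) * 5 + 3
out = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```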

The reversible residual network. With neural networks becoming deeper and wider, memory use (for storing activations to be used in backpropagation) becomes a bottleneck. However, in reversible networks a layer's activations can be deduced from the next layer's. In particular, reversibility is achieved by splitting inputs into two groups, x1 and x2, and calculating the next layer as follows: y1 = x1 + F(x2), y2 = G(y1) + x2. So given y1, y2, and the current weights (which determine F and G), x1 and x2 can be deduced. This adds a computational overhead of around 33-50%, but does not decrease performance very much, and in some cases increases it. An alternative which makes the same tradeoff is checkpointing, in which some of the activations are discarded and recomputed when necessary for backpropagation.
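
Here's a toy numpy sketch of one reversible block, following those equations; F and G are just placeholder residual functions:

```python
# One reversible block: the forward pass computes (y1, y2) from (x1, x2),
# and the inverse recovers them exactly, so activations need not be stored.
import numpy as np

def F(x):  # placeholder residual function
    return np.tanh(1.7 * x + 0.3)

def G(x):  # placeholder residual function
    return np.tanh(0.9 * x - 0.5)

def forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    x2 = y2 - G(y1)   # G(y1) is recomputed during backprop, not stored
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.randn(4), np.random.randn(4)
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
print(np.allclose(x1, r1), np.allclose(x2, r2))  # True True
```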

Deep reinforcement learning

Reinforcement learning squared. OpenAI trains a "fast" reinforcement learner (implemented as an RNN) using "slow" standard reinforcement learning.

Prefrontal cortex as a meta-reinforcement learning system. DeepMind argues that something very like the mechanism in the OpenAI paper above is being implemented by the brain, using dopamine to train the prefrontal cortex.

Universal Value Function Approximators. Schaul et al. extend RL techniques using value functions to environments where there are multiple goals, and show that Universal Value Function Approximators can generalise to new goals.
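
The simplest way to picture this: the goal becomes an extra input to the Q-network, so one set of weights covers many goals. (The paper also explores factorised two-stream architectures; the sketch below, with placeholder sizes, is just the concatenation version.)

```python
# Minimal sketch of a goal-conditioned Q-function: Q(s, g, a) rather than Q(s, a).
import torch

class UVFA(torch.nn.Module):
    def __init__(self, state_dim, goal_dim, n_actions):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(state_dim + goal_dim, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, n_actions),
        )

    def forward(self, state, goal):
        # One network, many goals: the goal is just another input.
        return self.net(torch.cat([state, goal], dim=-1))

q = UVFA(state_dim=8, goal_dim=4, n_actions=5)
print(q(torch.randn(2, 8), torch.randn(2, 4)).shape)  # torch.Size([2, 5])
```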

Prioritised Experience Replay. Schaul et al. improve experience replay by prioritising "surprising" transitions (those with high temporal-difference error). Since pure (greedy) prioritisation skews the samples too much, the sampling probabilities interpolate between pure prioritisation and uniform sampling, and importance-sampling weights correct for the bias that remains.
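
Here's a sketch of the proportional sampling scheme (the constants and fake TD errors are just illustrative):

```python
# Proportional prioritised sampling: probability of replaying a transition is
# proportional to |TD error|^alpha. alpha=0 is uniform, alpha=1 is pure priority.
import numpy as np

rng = np.random.default_rng(0)

td_errors = np.abs(rng.normal(size=1000))  # |TD error| for each stored transition
alpha = 0.6
priorities = (td_errors + 1e-6) ** alpha   # small constant so nothing has zero chance
probs = priorities / priorities.sum()

batch_idx = rng.choice(len(td_errors), size=32, p=probs)

# Importance-sampling weights correct for the skew this sampling introduces.
beta = 0.4
weights = (len(td_errors) * probs[batch_idx]) ** (-beta)
weights /= weights.max()
```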

Hindsight Experience Replay. OpenAI use HER to speed up learning from sparse rewards in environments with multiple goals. Even if agents don't reach their current goal, if they reach another goal then they can interpret their experience as a success towards that other goal.
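
Here's a rough sketch of the relabelling step (my own toy version: the transition format and sparse reward function are placeholders, and transitions are assumed to record both the goal being pursued and the goal actually achieved):

```python
# Hindsight relabelling: alongside each original transition, store copies in
# which the goal is replaced by one the agent actually achieved later in the
# episode ("future" strategy), with the reward recomputed for that goal.
import random

def reward_fn(achieved_goal, goal):
    return 0.0 if achieved_goal == goal else -1.0  # sparse reward

def hindsight_transitions(episode, k=4):
    """episode: list of dicts with keys state, action, goal, achieved_goal, next_state."""
    replay = []
    for t, tr in enumerate(episode):
        replay.append({**tr, "reward": reward_fn(tr["achieved_goal"], tr["goal"])})
        future = episode[t:]
        for _ in range(k):
            new_goal = random.choice(future)["achieved_goal"]
            replay.append({**tr, "goal": new_goal,
                           "reward": reward_fn(tr["achieved_goal"], new_goal)})
    return replay
```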

Asynchronous methods for deep RL. DeepMind claims that asynchronicity is a robust and effective replacement for experience replay as a way of reducing the instability of deep RL (caused by the fact that sequences of observations are non-stationary and updates are highly correlated). Unlike experience replay, it also allows on-policy algorithms. Note that, last I heard, OpenAI disagreed about the value of asynchronicity.

Deep RL with Double Q-learning. DDQN was introduced in response to the fact that deep Q-learning tends to overestimate the values of the actions chosen (essentially because of the winner's curse). DDQN keeps the slower-updating target networks of the original DQN paper, but instead of both selecting and evaluating the next action with the target network, it selects the action using the current value network and evaluates it with the target network. This is a very minimal change which nevertheless seems to make a significant difference.
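
The change is small enough to write out. A sketch, assuming numpy arrays of next-state Q-values from the online and target networks, and ignoring terminal states:

```python
# Bootstrap targets for DQN vs Double DQN. Arrays are (batch,) rewards and
# (batch, actions) Q-values for the next state.
import numpy as np

def dqn_target(rewards, q_target_next, gamma=0.99):
    # Original DQN: select AND evaluate the next action with the target network.
    return rewards + gamma * q_target_next.max(axis=1)

def ddqn_target(rewards, q_online_next, q_target_next, gamma=0.99):
    # Double DQN: select with the online network, evaluate with the target network.
    best_actions = q_online_next.argmax(axis=1)
    return rewards + gamma * q_target_next[np.arange(len(rewards)), best_actions]
```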

Dueling Network Architectures. This deep RL architecture separates learning which states are valuable from learning which actions are valuable, combining the two in order to calculate standard Q values. Instead of the value of a state-action pair being updated only when that action is chosen from that state, in the dueling architecture the value of a state-action pair is effectively updated every time the state is visited. In addition, state value gaps tend to be much larger than action value gaps, so updating them separately is more stable. The advantages are normalised before being combined with the state value: the paper's more stable variant subtracts their mean, while an alternative subtracts their maximum so that the greedy action's advantage sits at zero. This leads to improvements over DDQN in most Atari games.
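
A quick sketch of how the two streams are combined (using the mean-subtraction variant; the max-subtraction alternative pins the greedy action's advantage at zero instead):

```python
# Combining the dueling network's state-value and advantage streams into Q-values.
import numpy as np

def dueling_q(value, advantages):
    """value: (batch, 1) state values; advantages: (batch, actions)."""
    return value + advantages - advantages.mean(axis=1, keepdims=True)

v = np.array([[2.0], [0.5]])
a = np.array([[0.3, -0.1, 0.6],
              [1.0,  0.0, -1.0]])
print(dueling_q(v, a))
```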

Feudal Networks for Hierarchical Reinforcement Learning. Their hierarchical structure apparently allows FuNs to perform particularly well on tasks involving long-term credit assignment or memorisation.

The Reactor: A fast and sample-efficient actor-critic agent for RL. This seems to be DeepMind's latest and most advanced architecture: the Retrace-Actor framework "takes a principled approach to combining the sample-efficiency of off-policy experience replay with the time-efficiency of asynchronous algorithms".

Interpretability and transparency

Towards a rigorous science of interpretable machine learning. Doshi-Velez and Kim define and discuss interpretability.

Challenges for transparency. Weller carries out a high-level survey of transparency, which can be roughly defined as interpretability + sharing of information. He points out some ways in which transparency can be harmful: it can trigger Braess' paradox, for instance, or make us complacent.

Interpretable and pedagogical examples. Milli et al. improve the interpretability of student-teacher interactions by training the student and teacher iteratively, rather than jointly. Previous work in this area usually led to student-teacher pairs learning an arbitrary code.

NLP and grounded language learning

Multi-agent cooperation and the emergence of (natural) language. Lazaridou et al. introduce multi-agent communication games. In particular, a sender is trained (using reinforcement learning) to send a word to a receiver to signal which of two images to select. The sender had a vocabulary of 10-100 words available, but in this setup only needed to use two words in order to coordinate almost perfectly; surprisingly, that binary choice seems to noisily correspond to some high-level concepts (e.g. living vs non-living). The authors then grounded vocabulary choice by also training the sender on an image-recognition task; the results were somewhat interpretable to humans.

Emergence of language with multi-agent games. Havrylov and Titov extend Lazaridou et al. by allowing sequences of symbols and also providing distracting images to the receiver only. Since combinatorial explosion now makes standard reinforcement learning techniques ineffective, they use "straight-through Gumbel-softmax estimators" to allow for end-to-end differentiation despite all messages being discrete. The resulting language seems to be hierarchical: the first words in a phrase specify broad categories, and the later ones narrow these down. The language is weakly grounded by being regularised based on a separately-trained language model (however, this does not require the preservation of words' meanings).
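
For reference, here's a rough PyTorch version of the straight-through trick (my simplification, not the authors' code): the forward pass emits a discrete one-hot symbol, while gradients flow through the continuous Gumbel-softmax relaxation.

```python
# Straight-through Gumbel-softmax: discrete symbols forward, soft gradients backward.
import torch
import torch.nn.functional as F

def st_gumbel_softmax(logits, tau=1.0):
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    y_soft = F.softmax((logits + gumbel) / tau, dim=-1)          # relaxed sample
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)   # one-hot symbol
    # Forward value is y_hard; backward gradient is that of y_soft.
    return y_hard - y_soft.detach() + y_soft

logits = torch.randn(2, 10, requires_grad=True)  # sender's scores over a 10-symbol vocabulary
symbol = st_gumbel_softmax(logits)
symbol.sum().backward()                          # gradients reach the logits
print(symbol, logits.grad is not None)
```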

Linear Algebraic Structure of Word Senses. Arora et al. argue, on both theoretical and practical grounds, that the vector embeddings of polysemous words are linear combinations of the vectors which would have been assigned to each sense of the word, were all occurrences disambiguated. Their first experiment induces artificial polysemous pseudowords; the second uses WordNet definitions of different senses of various words. They also link each word sense with one of about 2000 "atoms" of discourse.

Monday, 16 July 2018

Notes from the heart of Europe

I just came back from a long weekend in Berlin, which I think of as the centre of Europe - not just geographically and historically, but also in its current role as the capital of Europe's most influential country. Although not as popular a destination as London or Paris, it's still a magnet for travellers - it felt like half the people around me on the streets were tourists, and I overheard conversations in English nearly as often as German (along with plenty of French, particularly at the screening of the World Cup final). I also wouldn't be surprised if the average Berliner I talked to were more proficient in English than the average Briton (although my sample was pretty skewed towards intellectuals). Coming from a country where even bilingualism is rare, I am continually impressed by European linguistic proficiency. And not just Europeans - during two recent trips to Morocco, I discovered that virtually all Moroccans are fluent in Arabic and French, plus a Berber dialect for the Berber ethnic majority, plus English for the young and ambitious, and often plus Spanish for those living in the north or working in tourism.

Back on the topic of Berlin, I'd be remiss if I didn't mention its famed nightlife, which is crazy in every sense. Here's how a night out might go:
  • Wait in line at a club for two hours.
  • Be examined by the bouncer, and asked a few probing questions ("Have you been here before?" - the answer should always be yes; "Why our club?" - cue a few lines of gushing praise about today's DJs).
  • Be rejected, as most people are.
  • Repeat at progressively less selective clubs, until accepted.
You might not even get a chance to plead your case: after queuing for an hour, some friends of mine were given a one-second glance-over and told to beat it. There are dozens of well-known reasons for rejection: you're too loud, too drunk, or in a group of more than 3 people; you look at your phone while waiting in the queue; you're wearing clothes with logos, slogans or any bright colours; you don't seem German enough; etc, etc. Of course, almost everyone is trying to optimise for these criteria; some who have tried their luck many times are convinced that most decisions are effectively random. The whole process reminds me of nothing so much as the US university application system, with bouncers as the all-powerful admissions officers. In both cases, the point is to create an atmosphere of exclusivity - and in both cases, that exclusivity is desirable enough to cause massive oversubscription. (Note that, like top US universities, these clubs charge nowhere near the market-clearing price). It's probably rational to apply to good universities, but is a small chance of being accepted to a hip nightclub really worth queueing for hours? Berliners seem to think so - and so did my other friends, who made it in and then kept partying until 9am. Personally, I went to an open-entry club which was still a step up from any I'd been to in NZ or the UK.

On transport: Berlin is notable for having decided to solve public transportation by throwing practically everything at it. The city has buses, trams, underground rail, overground rail, and off-the-ground rail - the elevated S-bahn that circles Berlin. There are open-air cars powered by a dozen frantically-pedalling tourists, with drivers who pump beer for them at red lights (yes, really!). There are bike lanes everywhere, plus a fierce competition between bike-rental startups to see who can leave the most bikes lying around the city. The only thing Berlin's missing is cable cars, or perhaps a hyperloop. It also has an unusual ticketing system: all public transport is walk-on, with no barriers. You're meant to have bought a ticket beforehand; to enforce this, there are infrequent inspections (I wasn't inspected during any of two dozen rides) and massive fines for fare-dodgers. I guess saving on ticket barriers and inspectors is one of the benefits of having a high-trust society, although I don't know how well it works in practice. Weekly and monthly passes are also much cheaper than in London (to be fair, that's true almost everywhere; an annual London travelcard can cost more than the world's median annual income).

A more general observation about Germany, and in fact most of continental Europe, is how common and socially acceptable it is to spend many years in higher education. Plenty of people spend 5+ years getting bachelors degrees, or get several masters degrees. To be fair, that's partly because it's standard to interweave study and work (and not just minimum-wage student jobs, but career-relevant positions). And from an individual perspective, university years are often the best of your life, so it's rational to spend more time enjoying them. Nevertheless, I think that degree inflation is a pretty major inefficiency on a societal level, one which is probably heightened by the fact that all study is paid for by the German government. In my mind, overeducation is closely linked with the "complacency" which Tyler Cowen argues has settled over the West during the last few decades. Expect more discussion of this idea in a forthcoming review of his book, The Complacent Class.

Thursday, 5 July 2018

Thoughts on Esport AIs

Dota 2 and Starcraft 2 are two of the most popular and complex esports; professional gamers train for years to master them. But how hard are they for AIs? DeepMind is working on the latter. OpenAI has already created a system (OpenAI Five) which is able to beat strong amateurs at a simplified version of the former, and they'll be challenging professionals in August. By their own account, they were surprised that their standard reinforcement learning approach worked so well, because of the game's complexity:
  • Counting every fourth frame (as OpenAI Five does), a Dota game lasts on average 20,000 "moves", compared with 40 for chess and 150 for go.
  • Unlike chess and go, Dota is a partial-information game in which players need to infer what their enemies are doing.
  • The range of actions in Dota is very large, with many of them varying continuously. Apparently, around 1000 are valid at each timestep (although I'm not quite sure what OpenAI are referring to as an action, to get that figure).
  • The information which can be observed at any given time is represented by 20,000 numbers; and unlike go or chess, many of these are floating-point numbers which need to be interpreted visually to detect a range of features.

Technical considerations

In Starcraft, a crucial factor is how quickly players can perform actions; top professionals average 150-300 actions per minute. In Dota, this isn't such a big concern, since players only control one hero; and Five averages only about 1/3 as many actions as it theoretically could (150 vs 450 per minute). However, timing and accuracy are pretty crucial, and here it has a major advantage: not only can it select actions arbitrarily precisely, it can also consistently achieve frame-perfect timing.

A second factor is coordination. Dota teams consist of five players, each with slightly different roles; teams communicate with each other mostly via audio. Five actually comprises five separate neural networks, each controlling one player in a team. Although these networks have no explicit communication channels between them, each can instantly see everything that its teammates can see, and act accordingly. Given that they trained together, this should give them an advantage over humans (roughly equivalent to giving pros a splitscreen view of their team, although taking full advantage of that would be a difficult multitasking challenge).

Thirdly, there's the question of APIs - that is, what inputs Five actually receives. It'd be most impressive if it were given only pixels, as humans are. However, that would apparently push the task beyond the limits of feasibility. Instead, "OpenAI Five is given access to the same information as humans, but instantly sees data like positions, healths, and item inventories that humans have to check manually". While this is a slightly confusing way to frame it, it seems to mean that the agent doesn't get access to pixels at all, only to the API. That's a little disappointing, but understandable.

It's difficult to predict how the upcoming match against professionals will go, because unlike the previous matches, this one will be under far fewer restrictions, requiring a version of Five retrained from scratch. And presumably the aspects which had previously been excluded - including warding, Roshan, invisibility and summons/illusions - were chosen because they were expected to be difficult to deal with. (Update: OpenAI recently announced the rules for the upcoming match: while there are still a few restrictions left, the most important ones have been removed. I'm most impressed by the fact that they're expanding the hero pool to 18, which indicates either that they've figured out some way of improving generalisation, or more likely that they're using absolutely enormous amounts of compute for training.) But even if the new version of Five struggles, there are three notable reasons to be impressed by its achievements so far. The first is that AlphaGo and AlphaGo Zero relied on Monte Carlo Tree Search to explore many possible lines of play in great depth. OpenAI Five doesn't - the neural network itself contains all the information required to execute long-term plans, without explicitly modelling how the game will unfold.

The second is that, because Dota players have limited access to information, they need to remember many facts about previous states. This requires the addition of an LSTM on top of the visual processing done by a CNN; while this architecture has been around for a few years, I think the Dota result is its first major success.

The third is that there are many different tactics and strategies which need to be explored to learn to play well; and when they need to be chained together, the number of possibilities increases exponentially. Even in Atari games like Montezuma's Revenge, neural networks struggle to discover strategies to get keys. OpenAI addressed this issue through a combination of shaped rewards, randomisation of starting states, and hardcoded combinations of item and skill builds. This is a bit hacky, but you can't blame them for doing what works.

Having said that, we should also note that the overall result continues a trend where progress requires massively increasing amounts of computational power. Five trained on 180 years of Dota per day for what looks like around 18 days, i.e. 3 millennia overall. When you take into account that this training consisted of running five networks at once, that running Dota itself takes a fair bit of processing, and that they probably trained quite a few different versions of Five, we're looking at ludicrous amounts of compute. I wouldn't be surprised if the amount of money researchers are willing to spend on AI compute has been increasing even faster than processing power itself over the last year or two.

Speculations

My current thinking is that, while this result is impressive, maybe it tells us that Dota just isn't as hard as we thought it was. People keep talking about the complex long-term strategic skills required to win at the professional level. But even if human professionals learn the nuances of each other's playing styles and predict what their opponents predict that they will predict, plausibly this is all unnecessary if you're good enough at the micro game and can recognise and respond well to a few dozen tactical and strategic patterns. And we know that neural networks can be excellent at the micro game, because in 1v1 play (which has much less strategy) OpenAI's bots already beat professionals.

An interesting point from OpenAI's blog post: Five learned - from self-play alone - a lane-switching strategy which it took humans years to come up with, and which professionals only started adopting recently. So there's no denying that Five uses complex strategy. But the distinction I want to make is between plans and planning. The agent is intelligent enough to follow plans. But it's the training process that is the planner. In his latest book, Daniel Dennett makes a very similar argument: that animals are competent without comprehension of what they're doing - because they're not doing the planning, evolution is. A resulting prediction: even if Five beats a team of professionals, if it's made publicly available then someone will find some strange strategy to beat it. (Update: after writing this post, I found out that while the 1v1 Dota bot was undefeated in official matches, when it was made publicly available amateur players found several strategies to reliably defeat it, including unusual item builds and "creep pulling".)

If I'm right, I think that's good news for AI safety in the short and medium term. It implies that for most tasks, even cognitively complex ones, we'll be able to create agents which are much better than humans, but which are still unable to succeed in novel environments. Those agents will be able to handle novelty within environments that they're trained for, but only because their training process was "smart" enough to equip them with powerful and broad strategies and heuristics. These agents would likely still fail on adversarial examples optimised very hard to fool them, but I think humans would too if there were a way of generating such images. (Here's the best adversarial example for humans I've seen so far, which I and many others originally took to be a crowd at a concert).

On the other hand, my analysis also means that when we figure out how to train an agent to plan and strategise, it might quickly become very good at it. But "planning" is such a vague and general skill that the bottleneck for learning it is probably our ability to improve our training methods or task representations. It's also possible that current neural network architectures are just not powerful enough to do arbitrarily-complex planning - for example, because feed-forward networks like CNNs have no backwards connections, and recurrent networks have them only in a very constrained way. Either way, since much of our recent progress has been made using algorithms and architectures that have been around for decades, my guess is that if improvements are necessary, they won't arise for decades more.