Monday, 19 February 2018

The Unreasonable Effectiveness of Deep Learning

This essay consists of summaries, explanations, and discussions of several papers which provide high-level arguments and intuitions about why, conceptually, deep learning works. Particular areas of investigation are "Which classes of functions can deep neural nets approximate well in principle?"; "Why can they quickly learn functions which have very small training loss?"; and "Why do the functions they learn generalise so well?".

Understanding deep learning requires rethinking generalisation - Zhang, Bengio, Hardt, Recht and Vinyals, 2017

Machine learning in general is about identifying functions in high-dimensional spaces based on finitely many samples from them. In doing so, we navigate between two potential errors: learning a function which is too simple to capture most of the variation in our data (underfitting) and learning a function which matches the data points well, but doesn't generalise (overfitting). Underfitting implies higher training error; overfitting implies higher test error. Attempting to avoid underfitting is called optimisation, and attempting to avoid overfitting is called regularisation. We can represent the possible weight settings of a neural network with w weights as points in a w-dimensional space; when training it, we are trying to find the point in that space which most closely corresponds to the true underlying function. A loss function is a heuristic based on our data which tells us approximately how close that correspondence is; one reason that training is difficult is that loss functions are generally not convex, but instead have many local minima, which a training method like gradient descent has difficulty escaping. However, we can get better results with stochastic gradient descent (SGD): the randomness introduced by estimating each gradient from only a small batch of examples gives us more chance of escaping local minima, in a way comparable to simulated annealing. [2] SGD also allows proofs of convergence similar to those for full-batch gradient descent, and each step is much cheaper to compute.
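To make the update rule concrete, here is a minimal numpy sketch of SGD applied to a toy least-squares problem; the function names, sizes and learning rate are illustrative choices of mine, not taken from any of the papers discussed.

import numpy as np

def sgd(grad_fn, w, data, lr=0.01, epochs=10, batch_size=32, seed=0):
    """Stochastic gradient descent: each step uses a gradient estimated
    on a small random batch rather than on the full dataset."""
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            w = w - lr * grad_fn(w, data[idx])   # noisy gradient step
    return w

# Toy example: recover w_true in y = Xw + noise from minibatch gradients
def lsq_grad(w, batch):
    X, y = batch[:, :-1], batch[:, -1]
    return 2 * X.T @ (X @ w - y) / len(batch)

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
w_true = np.arange(5.0)
data = np.column_stack([X, X @ w_true + 0.1 * rng.normal(size=1000)])
print(sgd(lsq_grad, np.zeros(5), data))   # close to [0, 1, 2, 3, 4]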

Another problem is the fact that because our data are finite, there are many models which have very low loss but are very far from the truth. In such extreme cases of overfitting, a learner could effectively "memorise" every piece of input data without their model capturing any of the underlying patterns. An easy way to avoid this is to use a model which doesn't have enough "memory" to store all the points; this is a form of regularisation. A simple 2-dimensional example: if we want to fit a polynomial of order 100 to 1000 data points, the 101 coefficients we can change are not enough to store all 1000 points, and so if our model has low training error it must be because it's capturing some pattern in the input data. However, it turns out that this isn't the case for neural networks: the first result of this paper can be summarised as "deep neural nets easily fit random labels". The authors found that on both standard image inputs with randomised labels, and inputs consisting of random noise with random labels, their neural networks achieved zero training error! Since there's (almost) no pattern in such data, this implies that a neural network can essentially memorise a large number of inputs. (Of course, when trained on random labels the test error will be very high.) In terms of number of parameters alone, this is no surprise, since neural networks are typically overparameterised (they have many more weights than training examples). However, it's surprising that backpropagation can quickly converge to such a detailed representation. In fact, the authors found that there was very little difference between randomised and non-randomised inputs in terms of training time required to reach given levels of training error.
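This memorisation effect is easy to reproduce at a small scale. The sketch below uses scikit-learn's MLPClassifier with sizes I chose for illustration: an overparameterised network fits pure noise with random labels almost perfectly, while accuracy on fresh noise stays at chance.

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))          # inputs are pure noise
y = rng.integers(0, 2, size=500)         # labels are random too

# Far more weights than training examples
net = MLPClassifier(hidden_layer_sizes=(1024,), max_iter=2000)
net.fit(X, y)

print(net.score(X, y))                   # training accuracy close to 1.0
print(net.score(rng.normal(size=(500, 100)),
                rng.integers(0, 2, size=500)))   # test accuracy around 0.5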

Why do neural networks generalise well in practice, then, instead of just memorising their inputs and gaining no predictive power? (See this paper [3] for experimental evidence that memorisation isn't what is happening.) One possibility is our use of regularisation. But the authors find that neural networks can generalise even when no explicit forms of regularisation (such as dropout or weight decay) are used. Their answer: that stochastic gradient descent itself is a form of implicit regularisation which hinders memorisation when there are ways to generalise. Why is that? Apparently, SGD tends to find flat minima rather than sharp minima, especially when the batch sizes used for each step of SGD are small. [4] Flat minima are robust to small deviations in input parameters, which suggests that they will generalise well. We can also think about this from a Bayesian minimum description length perspective: a function represented by a flat minimum is less complex to specify, and therefore should have a higher prior probability. [5] On the other hand, the authors of [6] find that reparameterisation can change sharp minima to flat ones and vice versa. They argue that the ability to generalise shouldn't be affected by reparameterisation - but I think I disagree, since some metrics are more natural than others.

The original paper also presents another slightly confusing result: that regularisation methods such as weight decay and batch normalisation can actually improve performance on training data. This is strange because making the model less expressive shouldn't help it fit the training data better. The explanation may be related to the success of sparse representations in the brain, as discussed below. Other forms of explicit regularisation they consider are data augmentation (e.g. adding inputs which are rotated or perturbed versions of others), dropout, and early stopping; they conclude that data augmentation is a more effective regulariser than the others.

Quirks and curses of dimensionality

Often inputs to deep neural networks are in really high-dimensional spaces - such as images with millions of pixels, or one-hot word vectors of length > 100000. Geometry gets weird in those dimensions, and our intuitions are easily led astray. Here are some features of a 1000-dimensional hypercube with side length 1, for instance:
  • It has 2^1000 corners.
  • The distance between opposite corners is around 31, even though the distance between adjacent corners is 1.
  • The volume of a hypersphere inside the cube which touches all its sides is less than 10^-1000 of the cube's volume.
These are absurd numbers - the first two absurdly large, the third absurdly small. Here's an attempt to portray 2, 3, 4 and 6-dimensional cubes with spheres inscribed inside them, with the correct number of corners. Note that these images are inevitably misleading: in reality, you can get from any corner to any other corner without leaving the hypercube, and of course the corners aren't coplanar. But at least they should give you a better intuition for the issue - and remember, even image d represents a cube with only 6 dimensions.
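Those three figures are easy to verify; here's a short numpy/scipy sketch (the inscribed sphere's volume has to be computed in log space, since it underflows any floating-point format).

import numpy as np
from scipy.special import gammaln

d = 1000
print(2 ** d)          # number of corners
print(np.sqrt(d))      # distance between opposite corners: ~31.6

# log10 of the volume of the inscribed sphere (radius 0.5) in the unit cube
log10_vol = (d / 2) * np.log10(np.pi) - gammaln(d / 2 + 1) / np.log(10) + d * np.log10(0.5)
print(log10_vol)       # about -1187, i.e. far smaller than 10^-1000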


What conclusions can we draw? Firstly, that measuring Euclidean distances between points isn't very useful. We've already seen that hyperspheres with diameters comparable to a hypercube's side length end up occupying a negligible portion of that hypercube's volume, and so almost no points will be "close" to any other. Interestingly, the opposite is also true - almost no points will be nearly as far apart as opposing corners. Simulations suggest that distances between random points in high dimensions cluster very tightly around 41% of the distance between opposing corners; also, that the angle subtended at the cube's centre by two random points clusters very close to 90 degrees, so that dot-product similarity is less useful. [7] In fact, for data points drawn from reasonable distributions, the ratio between the distance to their nearest neighbour and the distance to their furthest neighbour will approach 1 in high dimensions. Algorithms to find nearest neighbours are much less efficient in high dimensions, and tend to require linear time for each query, because indexing is no longer viable. (This difficulty can be alleviated somewhat by measuring proximity using Manhattan distance - i.e. the L1 metric - or fractional metrics with exponents less than 1, which work even better.) [8]
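A quick simulation (with sizes chosen arbitrarily) shows both effects: pairwise distances concentrate around 41% of the diagonal, and the angle subtended at the cube's centre by two random points concentrates near 90 degrees.

import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 2000
points = rng.random((n, d))                 # uniform points in the unit hypercube
a, b = points[: n // 2], points[n // 2:]

dists = np.linalg.norm(a - b, axis=1)
print(dists.mean() / np.sqrt(d), dists.std() / np.sqrt(d))   # ~0.41, with a tiny spread

# Angles at the cube's centre between pairs of random points
a_c, b_c = a - 0.5, b - 0.5
cos = np.sum(a_c * b_c, axis=1) / (np.linalg.norm(a_c, axis=1) * np.linalg.norm(b_c, axis=1))
print(np.degrees(np.arccos(cos)).mean())                     # close to 90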

In general, these sorts of negative effects are often referred to as the "curse of dimensionality". The most obvious way to ameliorate them is to use a dimensionality-reducing technique before processing the rest of the data. Algorithms such as principal component analysis (PCA) pick out the orthonormal directions which explain as much of the variance as possible, and project the data onto the top few of them; more generally, we can apply a kernel before doing PCA (which becomes computationally practical when using the kernel trick). Neural networks can also be used for dimensionality reduction, in the form of autoencoders. In CNNs specifically, pooling layers are a form of gradual dimensionality reduction. However, their benefits are controversial.
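For concreteness, here is what the first two of these techniques look like in scikit-learn, on toy data with arbitrary sizes:

import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000))                 # toy high-dimensional data

X_pca = PCA(n_components=50).fit_transform(X)    # keep the 50 highest-variance directions
X_kpca = KernelPCA(n_components=50, kernel="rbf").fit_transform(X)   # PCA in an RBF feature space
print(X_pca.shape, X_kpca.shape)                 # (500, 50) (500, 50)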

What's wrong with convolutional nets? - Hinton, 2014

Geoffrey Hinton in particular has spoken out against the usefulness of pooling, for several reasons. [9] (Following Hinton, I use "pose" to refer to an object's position, orientation, and scale).
  • It is a bad fit to the psychology of shape perception. Humans' perceptions of objects are based on the rectangular coordinate frames we impose on them; CNNs don't use coordinate frames. 
  • We want equivariance to viewpoint, not invariance. CNNs can recognise objects by throwing away pose data; whereas humans recognise objects from different viewpoints while also knowing their pose.
  • It doesn't use the underlying linear structure. CNNs don't have a built-in way of generalising to different viewpoints, like humans do; they instead need training data from many perspectives. But extrapolating what a changed perspective looks like shouldn't be hard, it's just matrix multiplication of representations of 3D objects! 
  • Pooling is a poor way to do dynamic routing. In different images, different sets of pixels should be grouped together (basically, factorising an image into objects; apparently, humans do this using eye movement?). But max-pooling CNNs only route parts of lower-level representations to higher-level representations based on how active those parts are. Instead, Hinton wants the higher-level representations to give feedback to lower-level representations. For example, if we want to group representations of primitive shapes at one level into representations of objects at the next level, we don't know whether a circle is a human face, or a car wheel, or a ball. So we send it up to all three object representations, and they check how well it fits, and it is assigned based on those assessments. Note that this is similar to how predictive processing works in the human brain. [10]
Hinton's solution is capsule networks, which explicitly store and reason about the poses of objects that they see, as well as using the feedback mechanisms I mentioned above. They've performed well on some tasks so far, but have yet to prove themselves more generally. Either way, I think this example illustrates a good meta-level point about the usefulness of deep learning. Bayesian methods are on a sounder theoretical footing than deep learning, and knowing them gives us insights into how the brain works. But knowing how the brain works doesn't really help us build better Bayesian methods! Deep learning serves as an intermediate step which is informed both by abstract mathematical considerations, and by our knowledge of the one structure - the human brain - that already has many of the capabilities we are aiming for.

Learning Deep Architectures for AI - Bengio, 2009

Another way of thinking about deep neural networks is that each layer represents the input data at a higher level of abstraction. We have good reasons to believe that human brains also feature many layers of abstraction, especially in the visual system. The parallel is particularly clear in CNNs trained for vision tasks, where early layers identify features such as edges, then primitive shapes built out of edges, then complex shapes built out of primitive shapes, in a similar way to human brains. Another example is the hierarchical structure of programs, which often have subfunctions and subsubfunctions. In each case, having more layers allows more useful abstractions to be built up: we can abstractly represent a picture with millions of pixels using a short description such as "man walking in a park". This relies on each layer being nonlinear, since arbitrarily many layers of linear functions are no more powerful than one layer. The key advantage of neural networks is that humans do not have to hand-code which abstraction should be used at which layer; rather, the neural network deduces them automatically given the input and output representations. Once it has made these deductions when training on some input data (e.g. distinguishing cats and dogs), it should theoretically be easier to train it to recognise a new category (such as pigs), since it can "describe" that category in terms of the features it already knows.
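The claim about linear layers is easy to check directly (the shapes here are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(64, 128)), rng.normal(size=(32, 64))
x = rng.normal(size=128)

# Two stacked linear layers are exactly one linear layer with weights W2 @ W1
print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))        # True

# Adding a nonlinearity between them breaks the equivalence
relu = lambda v: np.maximum(v, 0)
print(np.allclose(W2 @ relu(W1 @ x), (W2 @ W1) @ x))    # False (in general)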

The representations we find in the brain have two key features: they are distributed (multiple neurons in a representation can be active at once) and sparse (only 1-4% of neurons are active at any time). The number of patterns a distributed representation can distinguish is exponential in the dimension of the representation (as opposed to non-distributed one-hot encodings, which are only linear). One key aspect here is the idea of local vs non-local generalisation. In general, local models require two steps. First they check how much a data point matches each "region" of the function they have learned; then they combine the values for each region, weighted by how well it matches. Two examples are k-nearest-neighbours, and kernel machines using the Gaussian kernel. Although boundaries between regions may be fuzzy, intuitively local models are dividing up a function "piecewise" and then describing each piece separately. However, this doesn't allow us to learn a function which has many more regions than we have training examples. For example, kernel machines with a Gaussian kernel require a number of examples linear in the number of "bumps" in the target function (one bump is the function starting positive, then becoming negative, then positive again). Instead, we need to learn representations which are less local, such as more abstract and distributed representations.

Learning multiple layers of distributed representations from scratch is a challenge for neural networks, though. One way to improve performance is by pre-training one layer at a time in an unsupervised fashion. We can reward the first layer for producing similar outputs when given similar inputs (using, for instance, a Euclidean metric); once it is trained, we can do the same for the second layer, and so on. It might be expected that by doing this, we will simply teach each layer to implement the identity function. But in fact, because of implicit regularisation, as discussed above, and the fact that implementing the identity function requires very small weights in the first layer, in practice pre-trained layers actually end up with different representations. Without pre-training, the top two layers alone are able to get very low training error even if the lower layers produce bad representations; but when we limit the size of the top two layers, pre-training makes a major difference - suggesting that it is particularly important in lower layers, perhaps because the signal from backpropagation is more diffuse when it reaches them.
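Bengio's paper discusses stacked autoencoders (and restricted Boltzmann machines) as concrete versions of this layer-wise scheme. Here is a minimal PyTorch sketch of the autoencoder variant; the layer sizes, optimiser and training loop are my own illustrative choices rather than the paper's setup.

import torch
import torch.nn as nn

def pretrain_layers(data, layer_sizes, epochs=50, lr=1e-2):
    """Greedy layer-wise pre-training: train each layer as an autoencoder,
    then feed its (fixed) encodings to the next layer."""
    layers, current = [], data
    for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        encoder, decoder = nn.Linear(d_in, d_out), nn.Linear(d_out, d_in)
        opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            recon = decoder(torch.tanh(encoder(current)))
            loss = nn.functional.mse_loss(recon, current)   # reward reproducing the input
            loss.backward()
            opt.step()
        layers.append(encoder)
        current = torch.tanh(encoder(current)).detach()     # input for the next layer
    return layers

pretrained = pretrain_layers(torch.randn(256, 100), [100, 64, 32])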

Various expressibility proofs

Surprisingly, adding more layers doesn't change which functions a neural network can express: any continuous function (on a bounded domain) can be approximated arbitrarily closely by a neural network with only one hidden layer. However, that layer may need to be very, very wide. [12] In general, for a k-layer network to be able to express all the same functions as a (k+1)-layer network, it may require exponentially-wider layers! A simple example is the parity function, which can only be computed in a 2-layer network if it is exponentially wide in the number of inputs (this is intuitively because the function has many "regions" in the sense described above). One useful intuition here is to view a deep and narrow network as "factorising" a shallow but wide network: a term can be computed in a lower layer and then referred to many times in higher layers, instead of being separately computed many times.
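The depth-versus-width trade-off behind the parity example can be seen in a small sketch: computed as a balanced tree of pairwise XORs, parity needs only about n gates arranged in log n layers, each intermediate result being reused rather than recomputed, whereas the two-layer circuits the result refers to need exponentially many units.

def parity_deep(bits):
    """Parity via a balanced tree of pairwise XORs (assumes len(bits) is a power of two):
    depth log2(n), roughly n gates in total."""
    layer = list(bits)
    while len(layer) > 1:
        layer = [a ^ b for a, b in zip(layer[::2], layer[1::2])]
    return layer[0]

print(parity_deep([1, 0, 1, 1, 0, 1, 0, 1]))   # 1 (odd number of ones)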

There are also several techniques we can think of as equivalent to adding another layer. For example, transforming inputs according to a fixed kernel can be thought of as adding one (more powerful) layer at the bottom. Boosting, a technique to aggregate the results of sub-models, adds an additional layer at the top. These can reduce the necessary width by orders of magnitude - for example, boosted ensembles of decision trees can distinguish between exponentially many regions (intuitively, if each decision tree divides the input space in half along a different dimension, then adding another decision tree could double the number of separate regions).

In [1], the authors argue that it's not the space of all functions which we should be interested in, but rather the expressive power of neural networks on a finite sample of size n. They prove that there exists a two-layer neural network with ReLU activations and 2n+d weights that can represent any function f : S->R, where |S| = n and each element of S has d dimensions. This is a much nicer bound than above; however, I'm not sure how relevant it is. In reality, we will always need to process inputs which aren't exactly the ones that we've trained on. It's true that we can often just treat novel inputs similarly to others near them that we've already learned - but for reasons I explained above, to get good coverage in high-dimensional spaces n would need to be astronomically large.

Lastly, a CNN with a last hidden layer which is wider than the number of inputs can a) learn representations for each input which are linearly independent of each other, and b) achieve zero loss on those inputs. Both of these properties hold with probability 1 even if all previous layers have random values. [13] And in fact several CNN setups which have achieved top results do have layers this wide. This helps explain the memorisation results in [1], but if anything makes it harder to explain why our trained models can still generalise well.

However, note that all of these results only show that some combination of weights exists with these properties, not that those weights can be efficiently found.

A representation of programs for learning and reasoning - Looks and Goertzel, 2008

This paper isn't directly about deep learning, but it provides an interesting framework which I hadn't considered before. We often write programs to search over some space of possibilities. But we can think about the process of writing a program itself as searching through the space of all possible character strings for one that will implement a desired function. This paper explores how we might automate that search. Some preliminaries: we distinguish the syntax of a program (the string of characters of which it's composed) from the semantics of a program (the function that it encodes). A "syntactically valid" program is one which compiles, and therefore has a semantic "meaning".

First, the paper discusses some of the key features of programs: that they are well-specified (unambiguous), compact (they allow us to specify functions more tersely than by explicating input-output pairs), combinatorial (programs can rearrange the outputs of other programs) and hierarchical (can be decomposed into subprograms). Some of the properties that make reasoning about them challenging: open-endedness (programs can be arbitrarily large), over-representation (syntactically distinct programs may be semantically identical), chaotic execution (programs that are very syntactically similar may be very semantically different), and high resource-variance (time and space requirements of syntactically similar programs may vary widely).

Now let's say that we want to search for a program which has certain semantic properties. If our search space is over all possible programs, and we consider building them up character by character, we very quickly get combinatorial explosion. Intuitively, this is very inefficient: the vast majority of possible character strings simply don't compile. It's also difficult to find good heuristics to direct such a search - there are many invalid programs which differ from a correct solution by only one character. What we want is a way to represent programs at a higher level than just characters, so that we can specify all possible ways to change a program so that it's still syntactically valid, and then conduct a search through the resulting space. This is similar to lambda calculus or formal logic, in which there are several fixed ways to create an expression out of other expressions, the result of which will always be syntactically valid. In lambda calculus and formal logic, every expression has a "normal form" with the same semantics. The authors propose that each program should similarly be represented by a normal form that preserves its hierarchical structure; from these, we can find "reduced normal forms" which are even more useful, by simplifying based on reduction rules. They claim that a representation with those properties will be more useful than others even if these representations are all capable of expressing the same programs (in the same way that it's sometimes more useful to use one-hot encodings for words, rather than representing them as strings of characters).

This paper presents a proof-of-concept normal form. Primitive types are Booleans and numbers; then there are data types parameterised by primitive types, like lists and tuples. Each has elementary functions; the elementary functions of lists are append, and a constructor. There are also general functions, such as the one which returns the nth element of a tuple. Once we've represented a program in this way, we can compress it using canonical reduction rules, which decrease the size of a program without changing the semantics (a basic example could be "if a variable is declared with one value, then immediately changed, then you can reduce these two lines by just declaring it with the latter value directly"). We can then conduct a search starting from any given program by applying transformations to that program. Transformations come in two types: "neutral" (which preserve semantics) and "non-neutral" (which may not). Examples of non-neutral transformations include adding elements to a list, composing one function with another, and use of the fold function. Neutral transformations include abstraction and inserting an if-statement where the condition is always true. Such neutral transformations aren't useful by themselves, but they may later allow desirable non-neutral transformations (e.g. one which then changes the tautological if-statement condition to something more useful). By only allowing certain well-specified transformations, we narrow down the search space of all possible character strings to the search space of all syntactically valid programs in normal form. The latter space is still massive, but much much smaller than the former. The hope is that programs in normal form also have more obvious correlations between syntax and semantics, so that we can create good search heuristics without having to evaluate the actual "fitness" of every program we create by running it on many inputs, or proving that it works.

Other papers that I'm planning to read

There are a few more papers which seem particularly relevant to these topics, but which I haven't had time to read and understand yet. Once I have, I might update the essay above, or else write another. You can find my "Understanding AI" reading list here; if you have any suggestions for particularly useful papers, please do let me know.

Why does deep cheap learning work?
Stochastic gradient descent as approximate Bayesian inference
Understanding locally competitive networks
Using synthetic data to train neural networks is model-based reasoning
Why and when can deep - but not shallow - networks avoid the curse of dimensionality
Towards an integration of deep learning and neuroscience
DeepMath - deep sequence models for premise selection
Theoretical impediments to machine learning with seven sparks from the causal revolution

References

  1. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals. 2017. Understanding deep learning requires rethinking generalisation. https://arxiv.org/abs/1611.03530 
  2. Leon Bottou. 1991. Stochastic gradient learning in neural networks. http://leon.bottou.org/publications/pdf/nimes-1991.pdf
  3. David Krueger, Nicolas Ballas, Stanislaw Jastrzebski, Devansh Arpit, Maxinder S. Kanwal, Tegan Maharaj, Emmanuel Bengio, Asja Fischer, Aaron Courville. 2017. Deep nets don't learn via memorisation. https://openreview.net/pdf?id=rJv6ZgHYg
  4. Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang. 2017. On large-batch training for deep learning: generalisation gap and sharp minima. https://openreview.net/pdf?id=H1oyRlYgg
  5. Ferenc Huszar. 2017. Everything that works works because it's Bayesian: why deep nets generalise? http://www.inference.vc/everything-that-works-works-because-its-bayesian-2/
  6. Laurent Dinh, Razvan Pascanu, Samy Bengio, Yoshua Bengio. 2017. Sharp minima can generalise for deep nets. https://arxiv.org/pdf/1703.04933.pdf
  7. Martin Thoma. 2016. Average distance of random points in a unit hypercube. https://martin-thoma.com/average-distance-of-points/
  8. Charu C. Aggarwal, Alexander Hinneburg, Daniel A. Keim. 2001. On the surprising behaviour of distance metrics in high dimensional space. https://bib.dbvis.de/uploadedFiles/155.pdf
  9. Geoffrey Hinton. 2014. What's wrong with convolutional nets? http://techtv.mit.edu/collections/bcs/videos/30698-what-s-wrong-with-convolutional-nets
  10. Scott Alexander. 2017. Book review: Surfing Uncertainty. http://slatestarcodex.com/2017/09/05/book-review-surfing-uncertainty/
  11. Yoshua Bengio. 2009. Learning deep architectures for AI. https://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf
  12. Kurt Hornik. 1991. Approximation capabilities of multilayer feedforward networks. https://www.sciencedirect.com/science/article/pii/089360809190009T?via%3Dihub
  13. Quynh Nguyen, Matthias Hein. 2017. The loss surface and expressivity of deep convolutional neural networks. https://arxiv.org/abs/1710.10928
  14. Moshe Looks, Ben Goertzel. 2008. A representation of programs for learning and reasoning. https://pdfs.semanticscholar.org/67cf/4ddc0cf5c12654273c2b0dd9e8121673d940.pdf

Friday, 16 February 2018

In defence of conflict theory

Scott Alexander recently wrote an interesting blog post on the differences between approaches to politics based on conflict theory and mistake theory. Here's a rough summary, in his words:

"Mistake theorists treat politics as science, engineering, or medicine. The State is diseased. We’re all doctors, standing around arguing over the best diagnosis and cure. Some of us have good ideas, others have bad ideas that wouldn’t help, or that would cause too many side effects. Conflict theorists treat politics as war. Different blocs with different interests are forever fighting to determine whether the State exists to enrich the Elites or to help the People... Right now I think conflict theory is probably a less helpful way of viewing the world in general than mistake theory. But obviously both can be true in parts and reality can be way more complicated than either."

This comparison doesn't explain everything, but it definitely captures some important aspects of political activity. However, I disagree with Scott's judgement that conflict theory is less helpful overall. Here's my main argument against emphasising mistake theory over conflict theory: you're only able to be a mistake theorist after the conflict theorists have done most of the hard work. Even if the lens of mistake theory is more useful in dealing with most of the political issues we engage with on a daily basis, that's only the case because those issues are a) within our Overton window, so that they can be discussed, and b) considered important by either some powerful people, or many normal people, so that proposed solutions have a chance of being implemented. Ensuring that a given issue fulfills those criteria requires a conflict-theoretic mindset, because until they are met you will face opponents much more powerful than you. Mistake theorists miss those important long-term shifts.

Let's take a few examples. The main one is democracy itself. Mistake theorists wish for technocrats to have more power, so they can implement better policies. But conflict theorists have spent the last three centuries drastically curtailing the power of monarchies and dictatorships - which were close cousins of technocracy, given that hereditary rulers tended to be far more educated than the population as a whole. Compared with that seismic shift, the difference between modern conflict-theorists and mistake-theorists is a rounding error: supporting what past generations thought of as "mob rule" for the sake of that mob having power over its leaders puts us all way on the conflict theorist side of the spectrum. Of course it'd be very nice to have a voting system which selects more competent politicians, but we should keep in mind that the main benefit of democracy is to protect us from tyranny - and we should appreciate that it's doing a pretty good job.

Second example. Modern social justice movements support a lot of policies whose effects are contentious, like raising minimum wages and opposing free speech. A mistake theorist might be right in saying that what's currently most necessary for them to succeed in their goals is not more political firepower to push those policies through, but rather a better understanding of which policies will lead to the best effects (particularly in places like Scandinavia where the right wing isn't very strong). But there used to be a lot of very simple, obvious ways to improve the lives of disadvantaged minorities, like not enslaving them, or giving them the vote. It took a lot of effort from a lot of conflict theorists (and in America, a civil war) to implement those reforms. Only now that conflict theorists have shifted public opinion, and implemented the most obviously-beneficial policies, is it plausible that mistake theorists are best-placed to push for more improvements.

Third example. Taxation in many Western countries is pretty screwed up; the wealthy can easily find tax loopholes and not pay their intended legal rate. Mistake theorists would say that the fundamental flaw here is a poorly-designed tax system, and fixing that is much more important than raising the nominal top tax rate or stirring up anti-elite sentiment in general; for what it's worth, I think that's probably true. But the very fact that we have a progressive tax system at all is a triumph of conflict theorists who made strong moral arguments about the duties of the wealthy to pay back to society.

Fourth example. The welfare systems in many Western countries are needlessly bureaucratic and inefficient at helping the poor, and throwing more money at them probably wouldn't solve that. Mistake theorists therefore rightly realise that the conflict-theoretic view of poverty misses important factors. But that wasn't nearly as true back when social safety nets and labour regulations just didn't exist, working conditions were atrocious, and debtors were thrown in prison.

Now you could argue that we live in an era where most low-hanging fruit have been plucked, and so mistake theory is the best mindset to have right now. But I think that claim relies too much on the present being unusual. Actually, there are plenty of easy ways to do a great deal of good, but most people don't yet think of them as moral necessities (almost by definition, because otherwise they would have already taken the obvious steps). Here are some issues which conflict theorists haven't yet "won", and which are therefore still most usefully described as a conflict between interests of different groups, rather than something people agree on, but don't know how to solve:
  • Global warming, where the wealthy countries and people who emit massive amounts of emissions are screwing over everyone else, including future generations.
  • Factory farming, where everyone who eats meat is screwing over lots of animals.
  • International borders, which very effectively entrench the advantages of citizens of wealthy countries.

What will it look like when conflict theorists have made enough headway on these issues that they reach the point where mistake theory is more valuable?
  • There will be massive domestic public pressure to decrease emissions. Wealthy countries will be willing to subsidise reductions of emissions by developing countries. We'll just need to figure out how to reduce our emissions most effectively. (The domestic pressure already exists in some countries; not so much the international goodwill.)
  • It'll be illegal to raise animals in inhumane conditions like factory farming. But we won't be sure whether animals in humane conditions have lives worth living, and how cost-effective lab-grown meat can be.
  • Most people will agree that preventing people from accessing opportunities based on accidents of birth is immoral. Many more migrants will be allowed in to Western countries. But we won't know how to best manage the effects of mass migration or cultural clash.

Perhaps you don't agree with the specifics of some examples, but the general theme should be clear: first you need enough public acceptance that you can implement the policies which promise clear benefits, by overruling the people who benefit from the status quo. This step is best described under conflict theory. Once those policies are in place, it becomes more difficult to discern which next action is most beneficial, so you need to rely on expert knowledge; this step is best described under mistake theory.

Note that I don't mean to imply that the policies which promise clear benefits are easy to implement. In fact they may be very difficult, because you need to convince or coerce elites into giving your side more power. Rather, I mean that they're the most obvious gains, which will almost certainly create good outcomes if you can just convince people to support them. Whether or not it's worth fighting that conflict, instead of finding mistakes to solve, will depend on the specific case. A salient example is the choice between funding political campaigns for animal rights vs technical research into lab-grown meat. In general, we should probably prefer to "pull the rope sideways" by avoiding already-politicised issues, which are difficult to influence, but sometimes the obvious gains are so large that it might be worth taking a stand.

I want to finish with a more charitable portrayal of conflict theory. Scott deliberately caricatured both sides, but to an audience of mistake theorists, the result may be a skewed view of what constitutes a reasonable version of conflict theory. In particular, I now think that liberalism and libertarianism are perfectly consistent with conflict theory, but I didn't immediately after reading his essay. Two particularly misleading quotes:
  • "Conflict theorists aren’t mistake theorists who just have a different theory about what the mistake is. They’re not going to respond to your criticism by politely explaining why you’re incorrect."
Unless they want to convince you to join their side. Which is sensible, and which almost all ideological movements do. More generally, conflict theorists think there's a conflict between some groups, but that doesn't imply they need to be belligerent towards you (assuming you're not an actively-oppressive member of the elite; and maybe even if you are). Later on, Scott says that conflict theorists think that "mistake theorists are the enemy" and "the correct response is to crush them". But conflict theorists still have the concept of people making mistakes. The Second World War is perhaps the one example where conflict theory is most justified. During it, Switzerland made a mistake in not fighting Nazi Germany, because it seems very improbable that Hitler would have left them alone after winning. But that doesn't mean that the Allies needed to view Switzerland as their enemy; it'd be a ridiculous waste of resources to even try to crush them instead of attempting to sway them to your position.
  • "When conflict theorists criticize democracy, it’s because it doesn’t give enough power to the average person – special interests can buy elections, or convince representatives to betray campaign promises in exchange for cash. They fantasize about a Revolution in which their side rises up, destroys the power of the other side, and wins once and for all."
This may be a fair description of smart conflict theorists in the 1800s. But what about conflict theorists in 2018 who have learned from history that power corrupts, and that seizing control isn't an automatic final victory? They don't need to have fantasies of revolution in order to care about special interests corrupting representatives; that seems pretty bad regardless. In fact, in the modern context corruption of democracy may be the most important issue for conflict theorists. So I think that a more charitable interpretation is conflict theory as "constant vigilance". There is no system which does not develop cracks and flaws eventually. There are no holders of power who do not become complacent or corrupt eventually. Overthrowing those rulers and systems comes at massive cost to all involved. Sometimes it may be necessary. But we can postpone that necessity, perhaps indefinitely, by plugging up the cracks and sniffing out corruption. People protesting outside government buildings and politicians getting impeached aren't aberrations, but necessary and inevitable feedback mechanisms.

Under this view, mistake theorists who spend their time pushing for policies which improve society overall are well-intentioned but misguided. They may create better outcomes on 90% of issues they pursue, but while they do so, people with power will systematically consolidate their positions - and the question of who controls society overall is so important that it should be our main focus (although in the face of individual issues of enormous scale such as existential risk, this argument is less compelling). That's not to say that we should seize such control ourselves, because that will simply create a new elite - but we need to make sure nobody else does. More sensible mistake theorists, who recognise this imperative, would focus on improving power structures themselves, for example by improving voting systems. But we should consider suspect any small group of people with the power to change how governments work; to be legitimate, they have to represent a large group of people, who need to be convinced to care - probably by a conflict theorist. Perhaps one day someone will design a system with so many checks and balances that the process of avoiding tyranny is practically automatic. But more likely, the struggle to rally people without power to keep the powerful in check will be a Red Queen's race that we simply need to keep running for as long as we want prosperity to last.

Wednesday, 14 February 2018

Topics on my mind: January 2018

My degree has forced me to start learning some linguistics, which turns out to be very interesting. It feels like, in trying to figure out what understanding language really means, we're grappling with the very notion of concepthood, and the nature of intelligence itself. My thesis is based on the question of how to represent words in machine learning models. Vectors seem to work pretty well, and make intuitive sense for nouns and verbs at least. In this model, if you take king, subtract man, and add woman, you get queen. Something can be more or less 'rain' (drizzle, shower, downpour, torrent), or more or less 'run' (jog, lope, sprint), or even more or less 'bird' (ostrich, penguin, vulture, sparrow). Things get a little more complicated when we consider determiners, conjunctions, and prepositions, since you can't really be more or less 'for' or 'or'. And when it comes to putting it all together into representations of sentences, we really have no good solution. Yet.
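For what it's worth, the king/queen arithmetic is easy to try at home. The sketch below uses gensim and assumes the pre-trained "glove-wiki-gigaword-100" vectors from gensim-data are available to download.

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # pre-trained GloVe word vectors
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# the nearest vector to king - man + woman is 'queen'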

I'm becoming more sympathetic to continental philosophy. I found this lecture on the psychology of religion by Jordan Peterson fascinating. While difficult to summarise, Peterson's main claim is roughly that the last two millennia of religious development have selected for narratives which resonate deeply with human psychology, and that if we want to understand the latter we should pay more attention to the former; he is particularly inspired by Nietzsche and Jung. I feel like I should also try to read some Marx and Freud, if only for the intellectual background (also because so many people wanted me to add their work to my list of humanity's greatest intellectual achievements). This continues a slow swing back from my high school years, in which I was very focused on technical subjects and knew nothing about history, art, or languages. Now I know quite a lot about history and a fair bit about art, but still no other languages. :(

I'm also coming to believe that I, and almost everyone else, have underestimated the importance of sociology. I don't think that governments should necessarily be trying to steer culture, but it's definitely worthwhile for policymakers to be aware of what is needed for communities and societies to thrive. Of course, part of the problem is that sociology isn't nearly as systematic as, for instance, economics. But neither is psychology, which researchers like Kahneman and Tversky have nevertheless managed to ingrain into the public consciousness. I'm hoping that work of equivalent importance will arise in sociology at some point - and that it includes not just descriptions of how society functions, but also prescriptions on how to change it. The reason I'm warming to continental philosophy is because it attempts this normative analysis of large-scale culture - but while some of its insights seem important, they're not backed up by enough data for me to fully trust them. By contrast, Putnam's excellent book Bowling Alone, on the decline of social capital in America over the last 70 years, makes claims which are also very broad, but thoroughly rigorous. The amount of data that he had to manually sift through made writing the book a massive task. But as more and more data becomes available from the tech sphere, I'm hoping to see many more people using big data to answer sociological questions with a philosophical mindset. Christian Rudder, co-founder of OKCupid, is going in the right direction with his book Dataclysm, even if the implications of his conclusions aren't always teased out. There are also a number of economists, such as Tyler Cowen and Bryan Caplan, who are doing interesting analysis along these lines, albeit focusing more on economic claims.

I'm worried that we're simply clueless about how our actions will affect the far future. It seems like for every convincing argument I read, I later stumble upon a counterargument of even greater importance. For the last few years I've been very worried about the prospect of human extinction. But I recently watched the latest Black Mirror episode Black Museum, which is about how sufficiently advanced technology can allow you to cause others arbitrarily large amounts of suffering. If humanity survives, it seems extremely likely that we'll reach that level of technology eventually. And if people can use it, then someone will. Perhaps a very powerful, benevolent authority could prevent this, but that situation seems unlikely. How many expected years of torture would it take to outweigh the moral value of humanity surviving? When viscerally confronted even with fictional suffering, it feels like the answer shouldn't be that high. Either way, how could we have any estimate of the probabilities involved that isn't simply a stab in the dark?

Friday, 9 February 2018

Which neural network architectures perform best at sentiment analysis?

This essay was my main project for my module on Machine Learning for Natural Language Processing at Cambridge. It assumes some familiarity with NLP and deep learning.

Over the last few years, deep neural networks have produced state-of-the-art results in many tasks associated with natural language processing. This comes in conjunction with their excellent results in other areas of machine learning, perhaps most notably computer vision. Different types of neural networks have been particularly successful in different areas. For example, CNNs are the tool of choice in image recognition problems; their internal structure has distinct parallels with the human visual system.

Two other neural network architectures have achieved particular success in NLP; these are recurrent neural nets (RNNs) and recursive neural nets (here abbreviated RSNNs). It seems at first glance that the structures of RNNs and RSNNs are better suited to processing language than CNNs; however, empirical results have been mixed. In this essay I analyse these empirical results to see what conclusions can be drawn. I chose to focus on sentiment analysis for a number of reasons. Firstly, it is a task with direct applicability, rather than an intermediate stage in language processing. Secondly, since it is a discrete categorisation task, the results are not too subjective; and there are standard corpora against which performance has been measured. Thirdly (and admittedly a little vaguely), it is neither "too easy" nor "too hard": the sentences used generally have a clear intended sentiment, but to extract that it's necessary to deal with negations, qualifications, convoluted sentence structure, etc.

This essay is organised into three sections. Firstly, I explain the key differences between the three architectures mentioned above, and how those differences would theoretically be expected to influence performance. Secondly, I cite and explain various results which have been achieved in sentiment analysis using these neural networks. Lastly, I discuss how we should interpret these results.

Overview of Neural Network Architectures

Modern neural networks designed for NLP tasks generally use compositional representations. Individual words are represented as dense vectors in an embedded space, rather than using one-hot encodings or n-grams. These word vectors are then combined using some composition function to create internal node representations which lie in the same vector space (Young et al, 2017).

Word vectors can be learned via unsupervised training prior to the main training phase, and are usually distributional, i.e. based on the contexts in which words are found in the relevant corpus (Turney and Pantel, 2010). Performance has been improved by also taking into account morphological features (Luong et al, 2013). This preliminary learning is often followed by adjustments to improve performance on specific tasks. For example, words often appear in the same contexts as their antonyms, but in sentiment analysis it is particularly important to ensure that such opposing pairs are represented by different vectors (Socher et al, 2011a). It is possible to fix learned word vectors at any point, but in recent papers it is more common for word vectors to be adjusted along with the rest of the neural net as the main training occurs. It is also possible to initialise word vectors randomly so that they are learned throughout main training, but this generally harms performance.

Recurrent Neural Networks

RNNs are able to process ordered sequences of arbitrary length, such as words in a sentence. At each stage, the neural network takes as input the given word, and a hidden representation of all words so far. An output of this then becomes an input into the next stage. Theoretically, it should be possible for standard RNNs to process arbitrarily long sequences; however, in practice, they suffer from the 'vanishing gradient problem' which causes the beginnings of long input sequences to be forgotten. To combat this, almost all RNN implementations use LSTM (long short-term memory) units, which help propagate gradients further.
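For concreteness, here is a minimal PyTorch sketch of an LSTM-based sentence classifier; the vocabulary size, dimensions, and the choice to classify from the final hidden state are illustrative assumptions rather than any particular paper's design.

import torch
import torch.nn as nn

class LSTMSentiment(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_classes)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return self.out(h_n[-1])                   # classify from the final hidden state

logits = LSTMSentiment()(torch.randint(0, 10000, (4, 12)))
print(logits.shape)                                # torch.Size([4, 2])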

Another difficulty with standard RNNs is the fact that items later in the sequence cannot affect the classification of items earlier in the sequence. This is particularly problematic in the context of language, where the interpretation of the first few words of a sentence often throws up ambiguities which are resolved by later words. Examples include garden path sentences such as "Man who enjoys garden path sentences friends to listen to his puns", or more simply any sentence beginning with "This can". One solution is to use bi-directional RNNs, which pass both forwards and backwards through sentences. However, in sentiment analysis tasks where only a single classification of the entire sentence is required, this is less necessary.

Convolutional Neural Networks

CNNs only accept fixed-length inputs, which means some modifications are required to allow them to accept sentences. Historically, the standard way to convert variable-length sentences into fixed-length inputs was to use continuous bag-of-words (CBOW) or bag-of-ngrams models (Pang and Lee, 2008). For instance, the CNN input could be the sum of vectors representing each word in the sentence. However, the resulting loss of word order became too great a price to pay, and various alternatives have emerged (which I will explore in the next section).

CNNs are feed-forward, which means there are no links back from later layers to earlier layers. Internally, CNNs generally contain convolutional layers, pooling layers, and fully-connected layers. (Goldberg, 2015) summarises their advantages as follows: "Networks with convolutional and pooling layers are useful for classification tasks in which we expect to find strong local clues regarding class membership, but these clues can appear in different places in the input. For example, in a document classification task, a single key phrase (or an ngram) can help in determining the topic of the document."

In sentiment analysis, there are some cases where these strong local clues exist. For example, words such as "exhilarating" or "abhorrent" would almost always indicate positive and negative sentiment respectively, regardless of where in a sentence they are found. However, most words are able to indicate either positive or negative sentiment depending on their context, and in general we would expect CNN layers to lose valuable contextual information. This effect may be lessened in longer documents, in which words with the same sentiment as the overall document usually end up predominating, so that order effects aren't as important (this also makes longer documents more amenable to bag of words approaches).

Recursive Neural Networks

RSNNs only accept tree-structured inputs; a major reason why they seem promising in NLP tasks is because this structure matches the inherently recursive nature of linguistic syntax. It also means that sentences need to be preprocessed into trees by some parsing algorithm before being input to a RSNN (in the special case where the algorithm always returns an unbranched tree, RSNNs are equivalent to RNNs). This requirement may be disadvantageous, for example on inputs such as tweets which are not easily parsed. However, knowing the structure of a sentence is very useful in many cases. A sentence which is of the form (Phrase1 but Phrase2) usually has the same overall sentiment as Phrase2 - for instance, "The actors were brilliant but even they couldn't save this film." Similarly, negations reverse the sentiment of the phrase which follows them. Both of these inferences rely on knowing the scope to which these words apply - i.e. knowing where in the parse tree they are found.

Many of the adaptations which were designed for RNNs, such as bi-directionality and memory units, can also be used in RSNNs.

Experimental Results in Sentiment Analysis

In this section I describe key details of various architectures which have achieved state of the art results in sentiment classification, as well as a few less successful architectures for comparison. A major resource in sentiment classification is the Stanford Sentiment Treebank, introduced in (Socher et al, 2013); I will use performance on this as the main evaluative criterion.

Recursive Neural Networks

I will first discuss three algorithms published by Richard Socher for using RSNNs to classify sentiments. All three are composition-based: they start with dense embeddings of words, which propagate upwards through the parse tree of a sentence to give each internal node a representation with the same dimensions as the word representations. The basic technique in (Socher et al, 2011a) is simply to calculate the representation of a parent node by concatenating the two vectors representing its child nodes, multiplying that by a weight matrix, then applying a nonlinearity (note that the weight matrix and nonlinearity are the same for all nodes). The downside of this method is that the representations of the child nodes only interact via the nonlinearity, which may be quite a simple function.
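A minimal numpy sketch of that basic composition (the dimension and initialisation are my own illustrative choices): each parent vector is a shared nonlinearity applied to a shared weight matrix times the concatenation of its children.

import numpy as np

d = 50
rng = np.random.default_rng(0)
W, b = rng.normal(scale=0.1, size=(d, 2 * d)), np.zeros(d)   # shared across all nodes

def compose(node):
    """node is either a d-dimensional word vector or a (left, right) pair of subtrees."""
    if isinstance(node, np.ndarray):
        return node
    left, right = node
    return np.tanh(W @ np.concatenate([compose(left), compose(right)]) + b)

# Parse tree for "(not good) movie": every internal node gets a d-dimensional vector
not_v, good_v, movie_v = (rng.normal(size=d) for _ in range(3))
print(compose(((not_v, good_v), movie_v)).shape)   # (50,)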
A more complicated algorithm in (Socher et al, 2012) uses a Matrix-Vector Recursive Neural Network (MV-RNN). This technique represents each node using both a vector and a matrix; instead of concatenating the vectors as above, the representation of a parent node with two children is calculated by multiplying the matrix of each child with the vector of the other (then applying the weight matrix and nonlinearity as usual). However, this results in a very large number of parameters, since the MV-RNN needs to learn a matrix for every word in the vocabulary.

The third architecture, and the one which achieved the best results on the Stanford Sentiment Treebank, is a Recursive Neural Tensor Network (RNTN). As with the first architecture, nodes are simply represented by vectors, and the same function is applied to calculate every parent node - however, this function includes a more complicated tensor product as well as a weight matrix and nonlinearity. This change significantly improved performance; on a difficult subset of the corpus which featured negated negative sentiments, RNTN accuracy was 20 percentage points higher than that of MV-RNNs, and over three times better than a reference implementation of Naive Bayes with bigrams. Overall accuracy was 80.7% for fine-grained sentiment labels across all phrases, and 85.4% for positive/negative classification of full sentences, a 5 percentage point increase on the state of the art at the time (Socher et al, 2013).

I'll briefly mention one more model. (Kokkinos and Potamianos, 2017) use a bi-directional RSNN with gated recurrent units (GRUs) and a structural attention mechanism. This sophisticated setup was state of the art in early 2017, with 89.5% on the Stanford corpus. A brief explanation of the terms is warranted. Bi-directionality means that when calculating node activations, the standard propagation of information from the leaves (representing words) upwards through the tree structure is followed by a propagation of information downwards from the root (representing the whole sentence). (Irsoy and Cardie, 2013) GRUs serve a similar role to LSTMs in helping store information for longer (Chung et al, 2014). The structural attention mechanism aggregates the most informative nodes to form the sentence representation, and is a generalisation of (Luong et al, 2015).

Convolutional Neural Networks

(Kalchbrenner et al, 2014) discuss several adapted CNN architectures; firstly, Time-Delay Neural Networks (TDNNs); next, Max-TDNNs; and last, their own Dynamic Convolutional Neural Network (DCNN). Each of these uses a one-dimensional convolution, which is applied over the time dimension of a sequence of inputs. For example, let the input be a sentence of length s, with each word represented as a vector of length d. A convolution multiplies each k-gram in the sentence by a d x k dimension filter matrix m. This results in a d x (s + k - 1) dimension matrix (the sentence is padded with 0s so that each weight in the filter can reach each word, which is known as wide convolution).

However, the size of this output matrix varies with the input length. To make further processing easier, in the Max-TDNN architecture the convolution is immediately followed by a max-pooling layer: for each of the d rows (each corresponding to one dimension of the space that the words are embedded in), only the highest value is retained. In the DCNN architecture this is refined to "dynamic k-max pooling", which retains the top k values in each row, in their original order (where k depends on the length of the input sentence). This architecture outperformed the RSNNs previously discussed by 1.4 percentage points on the Stanford Sentiment Treebank. It also performed well on the Twitter sentiment dataset described in (Go et al, 2009).
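
The k-max pooling step might be sketched as follows; the schedule for choosing k as a function of sentence length is omitted, so k is simply passed in directly.

```python
import numpy as np

def k_max_pool(feature_map, k):
    """feature_map: (d, length) array; returns (d, k), keeping each row's k largest values in order."""
    d, length = feature_map.shape
    out = np.empty((d, k))
    for i in range(d):
        idx = np.argpartition(feature_map[i], length - k)[-k:]   # indices of the k largest values
        out[i] = feature_map[i][np.sort(idx)]                    # restore original left-to-right order
    return out

pooled = k_max_pool(np.random.default_rng(5).normal(size=(4, 9)), k=3)

# Max-TDNN pooling is the special case k = 1.
```

Whatever the sentence length, the pooled output has a fixed width, which is what makes the subsequent fully-connected layers possible.
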

I will briefly discuss a second (also highly-cited) paper, (Kim, 2014). This uses an architecture similar to the Max-TDNN to push the state of the art on the Stanford Sentiment Treebank up by around another percentage point beyond Kalchbrenner et al. The main improvement in this system is starting from pre-trained word vectors, specifically Google's word2vec embeddings, which had been trained on roughly 100 billion words. By contrast, in (Kalchbrenner et al, 2014) the word vectors were randomly initialised.

While these architectures seem to have successfully circumvented some of the limitations of CNNs, it's worth noting that they still have drawbacks. While evaluating k-grams means that local word order matters, this architecture still can't model direct relationships between words more than k positions apart, nor the absolute position of words in a sentence.

Recurrent Neural Networks

(Wang et al, 2015) introduced the use of LSTMs for Twitter sentiment prediction. They used the word2vec software to train word embeddings from the Stanford Twitter Sentiment vocabulary, and achieved very similar results to Kalchbrenner et al on that corpus.

(Radford et al, 2017) use a multiplicative LSTM (mLSTM) which processes text as UTF-8 encoded bytes and was trained on a corpus of 82 million Amazon reviews. While their main focus was learning a generic text representation, the model achieves 91.8% on the Stanford Sentiment corpus, beating the previous state of the art of 90.2% (Looks et al, 2017); this seems to be the current record. Notably, they found a single unit within the mLSTM which directly corresponded to sentiment: simply observing that unit achieved a test accuracy almost as high as the whole network.
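
For orientation, here is a sketch of one step of a multiplicative LSTM cell in its usual formulation, where the previous hidden state is first combined multiplicatively with the current input and the result replaces it in the ordinary LSTM gates. This is an illustration rather than Radford et al's implementation; all sizes, names and weights here are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_in, n_hid = 256, 64                  # e.g. one-hot bytes in, hidden units out (illustrative)
rng = np.random.default_rng(6)
Wmx = rng.normal(scale=0.1, size=(n_hid, n_in))
Wmh = rng.normal(scale=0.1, size=(n_hid, n_hid))
Wx = {g: rng.normal(scale=0.1, size=(n_hid, n_in)) for g in 'ifoh'}
Wm = {g: rng.normal(scale=0.1, size=(n_hid, n_hid)) for g in 'ifoh'}

def mlstm_step(x, h_prev, c_prev):
    m = (Wmx @ x) * (Wmh @ h_prev)                      # multiplicative interaction of input and state
    i = sigmoid(Wx['i'] @ x + Wm['i'] @ m)              # input gate
    f = sigmoid(Wx['f'] @ x + Wm['f'] @ m)              # forget gate
    o = sigmoid(Wx['o'] @ x + Wm['o'] @ m)              # output gate
    c = f * c_prev + i * np.tanh(Wx['h'] @ x + Wm['h'] @ m)
    h = o * np.tanh(c)
    return h, c

x = np.zeros(n_in); x[65] = 1.0                         # e.g. a one-hot encoded byte
h, c = mlstm_step(x, np.zeros(n_hid), np.zeros(n_hid))
```
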

While RNNs are more sensitive to word order than CNNs, they have a bias towards later input items (Mikolov et al, 2011). For example, the mLSTM in (Radford et al, 2017) showed a notable performance drop when moving from sentence-level to document-level datasets, which the authors hypothesised was because it focused more on the last few sentences.

Other approaches
  • Recursive auto-encoders as explored in (Hermann and Blunsom, 2013) and (Socher et al, 2011b). The former uses an unusual combination of neural networks and formal compositional derivations.
  • Dynamic Memory Networks, as in (Kumar et al, 2015), which achieved state-of-the-art performance on question answering in general, treating sentiment analysis as a particular instance of question answering.
  • Dynamic Computation Graphs of (Looks et al, 2017), which may have been the first architecture to break 90% on the Stanford Sentiment Treebank.

Discussion

The overwhelming impression from the previous section is that of a field in flux, with new techniques and architectures emerging almost on a weekly basis. Looking only at the last year or two, there is a comparative lack of CNNs amongst the highest-performing models. However, it should be noted that later models may perform better without their underlying architectures being any more promising, not least because they may have increased processing power and financial resources behind them (OpenAI spent a month training their model).

It is therefore difficult to tell whether this shift away from CNNs is a long-term trend or merely a brief fad. However, Geoffrey Hinton at least is pessimistic about the future of CNNs; he has written that "The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster."

Perhaps the most relevant previous work is the head-on comparison done by (Yin et al, 2017) between three architectures: CNN, LSTM and GRU. Their conclusion was that relative performance in different tasks "depends on how often the comprehension of global/long-range semantics is required." Sentiment analysis turned out to require more global semantics, therefore placing the CNN at a disadvantage. Final performance was around 86% for GRU, 84% for LSTM and 82% for CNN. (This contrasted with Answer Selection and Question Relation Matching, where the CNN beat the other two).

Word representations

Another complicating factor is the question of how much progress in sentiment analysis has been driven not by improvements in overall architectures, but rather by improved word embeddings. (Le and Mikolov, 2014) achieve results very close to (Socher et al, 2013) simply by using a logistic regression on their "Paragraph Vector" representation. Meanwhile, (Hill et al, 2016) find that an embedding based on dictionary definitions has the overall best performance out of a number of strategies that they tested. The "Skip-thought vectors" introduced by (Kiros et al, 2015) also have very good overall performance.

This proliferation of different algorithms for computing word representations makes it more difficult to compare different architectures directly. However, this is counterbalanced by the fact that the existence of standard sets of pre-trained word vectors such as word2vec and GloVe allows results to be replicated while holding word embeddings constant.

Yin et al sidestep this issue by using no pretraining of word embeddings in any of their experiments. However, it's not clear that this creates a fair comparison either: it would give an advantage to architectures like MV-RNN which can't use pretraining as effectively.

Biological influences

To draw slightly less tentative conclusions, it may be instructive to consider these models in the context of human language processing. While the use of one-dimensional convolutions and pooling layers was a successful workaround to the problem of CNNs requiring fixed-length inputs, it is nevertheless clear that this is very different to the way that humans understand sentences: we do not consider each k-gram separately. Instead, when we hear language, we interpret the words sequentially, in the fashion of an RNN. If the example of human brains is still a useful guide to neural network architectures, then we have a little more reason to favour RNNs over CNNs.

Of course it is debatable whether biology is a good guide. Yet so far it has served fairly well: neural nets in general, and more specifically CNNs, were designed with biological inspiration in mind. Further, the capsule networks recently introduced by (Sabour et al, 2017) were quite explicitly motivated by the failings of CNNs in comparison with vision systems in humans and other animals (Hinton, 2014); if their early successes can be replicated more broadly, that will be another vindication of biologically-inspired design.

In arbitrating between RNNs and RSNNs, however, there are further considerations. Since linguistic syntax is recursively structured, it seems quite plausible that our brains use similarly recursive algorithms at some point when processing it. However, we must take into account the fact that current RSNNs require sentences to be parsed before processing them. State of the art accuracy for parsers is around 94%, which is only a few percentage points higher than that for sentiment analysis; this may make parsing quality a limiting factor in future attempts to use RSNNs for sentiment analysis. Further, the parsing requirement suggests that RSNNs are less biologically plausible than RNNs. Humans are able to understand sentences as we hear them, inferring syntax as we go along; while it's possible to imagine an RSNN system re-evaluating a sentence after each word is added, this is a somewhat ugly solution and surely not how our brains manage it.

Closing thoughts

Moving back to concrete experimental results: while (Kokkinos and Potamianos, 2017) achieve excellent results with their RSNN variant, they have already been beaten by RNNs which don't incorporate any of their most advantageous features (bi-directionality, GRUs and attention). If the current state-of-the-art RNNs added these features, it is quite plausible that they would achieve even better results. RNNs may still face the problem of forgetfulness on long inputs - however, since sentence length is effectively bounded in the double digits for almost all practical purposes, this is not a major concern. While it would be overly ambitious to make any concrete predictions, based on the analysis in this essay I lean towards the conclusion that RNNs using either GRUs or LSTMs will retain their advantage over rival architectures in sentiment analysis for the foreseeable future.


References
  • Chung, J., Gulcehre, C., Cho, K., and Bengio, Y.  2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.
  • A Go, R Bhayani, and L Huang. 2009. Twitter sentiment classification using distant supervision. Processing, pages 1–6.
  • Yoav Goldberg. 2015. A Primer on Neural Network Models for Natural Language Processing.
  • Karl Moritz Hermann and Phil Blunsom. 2013. The Role of Syntax in Vector Space Models of Compositional Semantics.
  • Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning Distributed Representations of Sentences from Unlabelled Data.
  • G Hinton. 2014. What is Wrong with Convolutional Neural Nets? https://www.youtube.com/watch?v=rTawFwUvnLE
  • G Hinton. Online statements. https://www.reddit.com/r/MachineLearning/comments/21mo01/ama_geoffrey_hinton/clyj4jv
  • Ozan Irsoy and Claire Cardie. 2013. Bidirectional recursive neural networks for token-level labeling with structure. CoRR, abs/1312.0493.
  • N. Kalchbrenner, E. Grefenstette, and P. Blunsom, 2014, “A convolutional neural network for modelling sentences,” Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Available: http://goo.gl/EsQCuC
  • Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751, Doha, Qatar. Association for Computational Linguistics.
  • Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors.
  • F Kokkinos and A Potamianos. 2017. Structural Attention Neural Networks for improved sentiment analysis.
  • Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2015. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing.
  • Q Le and T Mikolov. 2014. Distributed Representations of Sentences and Documents.
  • Moshe Looks, Marcello Herreshoff, DeLesley Hutchins, and Peter Norvig. 2017. Deep Learning with Dynamic Computation Graphs.
  • M Luong, R Socher and C Manning. 2013. Better Word Representations with Recursive Neural Networks for Morphology.
  • Tomas Mikolov, Stefan Kombrink, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language model. In ICASSP, pages 5528–5531. IEEE.
  • Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. "Effective approaches to attention-based neural machine translation". CoRR, abs/1508.04025.
  • B. Pang and L. Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.
  • S Sabour, N Frosst, and G Hinton. 2017. Dynamic Routing Between Capsules.
  • Socher, R., Lin, C. C.-Y., Ng, A. Y., and Manning, C. D. (2011a). Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In Getoor, L., and Sceffer, T. (Eds.), Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pp. 129–136. Omnipress.
  • R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. 2011b. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In EMNLP.
  • R. Socher, B. Huval, C.D. Manning, and A.Y. Ng. 2012. Semantic compositionality through recursive matrix vector spaces. In EMNLP.
  • Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
  • A Radford, R Jozefowicz, and I Sutskever. 2017. Learning to Generate Reviews and Discovering Sentiment.
  • P. D. Turney and P. Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188
  • Wang, X., Liu, Y., Sun, C., Wang, B., and Wang, X. (2015). Predicting Polarities of Tweets by Composing Word Embeddings with Long Short-Term Memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1343–1353, Beijing, China. Association for Computational Linguistics.
  • Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schutze. 2017. Comparative Study of CNN and RNN for Natural Language Processing.
  • Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2017. Recent Trends in Deep Learning Based Natural Language Processing.