Deep neural networks have recently conquered a number of major problems in machine perception. Because they learn representations, these methods are especially effective in domains where any given feature in the raw data is uninformative. Consider, for example, an image with dimensions of 300x300 pixels. Any one pixel, by itself, is effectively meaningless. By computing successive layers of representation, however, a neural network can identify hierarchical features, such as edge detectors at a low level and faces at a high level, that enable accurate classification of images into abstract object categories.
Deep neural networks have a number of other remarkable properties. For example, given a large enough sample of examples, a neural network can express a generative model that produces new examples from the same distribution.
However, two major problems remain. First, given a network of some architecture, finding the optimal parameters which minimize the loss at the output is NP-Hard. Further, the optimization problem is widely known to be nonconvex. In practice, one hopes that the optimization surface, while nonconvex, is not completely pathological (see image below).
In such a case, it’s not at all clear how one could train the model. One could settle for following the gradient to the first local minimum, but in a truly pathological case this is unlikely to be useful.
On the other hand, one could hope that given enough parameters, a neural network may present a nonconvex optimization surface that is somehow "mostly convex". The idea is that if the local minima are not too deep (see image below), it might be possible to carry some momentum from the previous steps (by carrying over some fraction of the previous delta into each new delta). With momentum, the update Δ at time t+1 can be expressed as Δ_{t+1} = η g_{t+1} + m Δ_t, where η is a learning rate, g is the gradient calculated on the current example (or mini-batch), and m is the fraction of the previous Δ to carry over into this update.
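The momentum update above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production optimizer; the `grad` function, parameter names, and hyperparameter values are all illustrative assumptions.

```python
import numpy as np

def sgd_momentum(grad, w, eta=0.1, m=0.9, steps=500):
    """Gradient descent with momentum:

    delta_{t+1} = eta * g_{t+1} + m * delta_t
    w_{t+1}     = w_t - delta_{t+1}
    """
    delta = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)                  # gradient on the current example/mini-batch
        delta = eta * g + m * delta  # carry a fraction of the previous delta
        w = w - delta
    return w

# Example: minimize f(w) = ||w||^2, whose gradient is 2w.
w_min = sgd_momentum(lambda w: 2 * w, np.array([5.0, -3.0]))
```

Because each new delta retains a fraction `m` of the old one, the iterates can roll through shallow local dips instead of stopping at the first flat spot.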
In practice, these gradient methods with momentum have been very effective, even in the stochastic gradient setting where the geometric interpretation may be less clear. Unfortunately, finding the right momentum parameters appears to be an opaque science requiring hand-tuning.
A second major limitation of traditional deep nets remains: they assume input and output of fixed size. To be fair, most shallow methods suffer from the same limitation. One might ask, "why is this a problem?" Images are easily enough cropped and resized. However, temporal and sequence data are not nearly as amenable to a straightforward reduction to a vector of fixed size. Our best methods for reducing text to fixed-length vectors (bag-of-words) are sufficient for binary classification but strip the text of nearly all concrete meaning.
To this end, recurrent neural networks (RNNs), particularly the Long Short-Term Memory (LSTM) model, have recently become popular both for their ability to ingest sequences of data, one token at a time, and their ability to output sequences of data. Unlike traditional feed-forward neural networks, recurrent nets have edges between nodes that span time steps. Thus a network can output a list of probabilities of each word occurring next in the sequence. Then, given the chosen word plus the entire state of the network at the previous time step, the network can choose the second word.
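This word-by-word generation loop can be sketched with a toy vanilla recurrent cell (a simplification, not an LSTM; the weights here are random and the sizes are illustrative). The key point is that the hidden state `h` is the edge that spans time steps.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 10, 16
Wxh = rng.normal(0, 0.1, (hidden, vocab))   # input -> hidden
Whh = rng.normal(0, 0.1, (hidden, hidden))  # hidden -> hidden (spans time steps)
Why = rng.normal(0, 0.1, (vocab, hidden))   # hidden -> output logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(start_token, length):
    h = np.zeros(hidden)
    token, out = start_token, [start_token]
    for _ in range(length - 1):
        x = np.eye(vocab)[token]            # one-hot encoding of the chosen word
        h = np.tanh(Wxh @ x + Whh @ h)      # new state from input and previous state
        probs = softmax(Why @ h)            # probabilities of each next word
        token = int(np.argmax(probs))       # greedy choice; sampling also works
        out.append(token)
    return out

seq = generate(start_token=3, length=5)
```

With trained weights, the same loop emits one token per step, each conditioned on all previous tokens through `h`.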
In a recent paper by Google researchers Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, the authors do both simultaneously, learning a model which performs machine translation. The network consists of an encoding LSTM network which ingests the input sentence, and a decoding LSTM network which outputs the corresponding translated sentence.
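Schematically, the encoder reads the source sentence into a hidden state, and that state seeds the decoder, which emits the translation one token at a time. The sketch below uses toy vanilla recurrent cells with random weights; the paper uses deep LSTMs, and all names and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, hidden, EOS = 8, 12, 0

def cell(W, U, x, h):
    """One step of a vanilla recurrent cell."""
    return np.tanh(W @ x + U @ h)

# Separate weights for the encoder and decoder networks.
We, Ue = rng.normal(0, 0.1, (hidden, vocab)), rng.normal(0, 0.1, (hidden, hidden))
Wd, Ud = rng.normal(0, 0.1, (hidden, vocab)), rng.normal(0, 0.1, (hidden, hidden))
Wy = rng.normal(0, 0.1, (vocab, hidden))     # decoder hidden -> output logits

def translate(src, max_len=10):
    h = np.zeros(hidden)
    for tok in src:                          # encode: ingest the input sentence
        h = cell(We, Ue, np.eye(vocab)[tok], h)
    # h now summarizes the whole source sentence in a fixed-length vector
    out, tok = [], EOS
    for _ in range(max_len):                 # decode: emit the translation
        h = cell(Wd, Ud, np.eye(vocab)[tok], h)
        tok = int(np.argmax(Wy @ h))         # greedy decoding
        if tok == EOS:
            break
        out.append(tok)
    return out

hyp = translate([3, 5, 2])
```

Note that the input and output sequences need not have the same length, which is exactly what fixed-size feed-forward networks cannot handle.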
Effectively, the authors have provided a fixed-length vector representation for each sentence. The entire state of the network after the encoding step provides such a representation (in their case, the representation consists of 8000 real numbers).