Encoding Sequences with Feed-Forward Neural Networks
If we would like to encode the data into feature vectors for a feed-forward NN, the most viable strategy is to slide a window of size 10 over the sequence and use the most recent 10 points as the feature vector.
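A minimal sketch of this windowing step (NumPy; the function name sliding_windows is illustrative, not from the lecture):

```python
import numpy as np

def sliding_windows(series, window_size=10):
    """Turn a 1-D time series into (features, target) pairs:
    each feature vector holds the most recent `window_size` points,
    and the target is the point that follows them."""
    X, y = [], []
    for t in range(window_size, len(series)):
        X.append(series[t - window_size:t])
        y.append(series[t])
    return np.array(X), np.array(y)

# Example: 20 observations -> 10 feature vectors of length 10
series = np.arange(20.0)
X, y = sliding_windows(series, window_size=10)
print(X.shape, y.shape)  # (10, 10) (10,)
```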
Why Do We Need RNNs?
RNNs address the issue of how many time steps back we should look when building the feature vector.
A vector representation of a sentence could be used to:
- predict whether the sentence is positive or negative
- translate the sentence to another language
- predict the next word in the sentence
Two steps are necessary:
- mapping a sequence to a vector
- mapping a vector to a prediction
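A toy sketch of these two steps, where the encoder is just an average of word vectors as a stand-in for the RNN introduced below (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(word_vectors):
    """Step 1 (placeholder): map a sequence to one fixed-size vector.
    Here we simply average the word vectors; an RNN replaces this step."""
    return np.mean(word_vectors, axis=0)

def predict(vector, n_classes=2):
    """Step 2: map the fixed-size vector to a prediction
    (e.g. positive vs. negative) with a linear layer followed by softmax."""
    W = rng.normal(size=(n_classes, vector.shape[0]))   # hypothetical weights
    scores = W @ vector
    e = np.exp(scores - scores.max())
    return e / e.sum()

sentence = rng.normal(size=(5, 4))    # a 5-word "sentence" of 4-dim word vectors
print(predict(encode(sentence)))      # probability of each class
```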
Encoding with RNN
A single-layered recurrent neural network updates its state as:

s_t = tanh(W^{s,s} s_{t-1} + W^{s,x} x_t)

Where:
- s_t is the new context or state
- x_t is the new information
- s_{t-1} is the previous context or state
- W^{s,s} (deciding what part of the previous information to keep) and W^{s,x} (taking into account new information) are the parameters of the RNN
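A minimal NumPy sketch of this update; the dimensions, weight names (W_ss, W_sx), and random values are illustrative, not from the lecture:

```python
import numpy as np

def rnn_step(s_prev, x_t, W_ss, W_sx):
    """One step of a single-layered RNN:
    s_t = tanh(W_ss @ s_prev + W_sx @ x_t)
    W_ss decides what part of the previous state to keep,
    W_sx decides how new information enters the state."""
    return np.tanh(W_ss @ s_prev + W_sx @ x_t)

rng = np.random.default_rng(1)
m, d = 3, 4                      # state dim, input dim (illustrative)
W_ss = rng.normal(size=(m, m))
W_sx = rng.normal(size=(m, d))
s = np.zeros(m)                  # initial context
x = rng.normal(size=d)           # new information at time t
s = rnn_step(s, x, W_ss, W_sx)
print(s)                         # the new context/state s_t
```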
Encoding Sentences
Let’s consider the following graphical representation of encoding sentences with an RNN:
- input is received at each layer (per word), not just at the beginning as in a typical feed-forward network
- the number of layers varies and depends on the length of the sentence
- parameters of each layer are shared
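A sketch of unrolling such an RNN over sentences of different lengths with the same shared parameters (all names and values are illustrative):

```python
import numpy as np

def encode_sentence(word_vectors, W_ss, W_sx):
    """Unroll the RNN over a sentence: one step per word, with the same
    (shared) parameters at every step; the final state encodes the sentence."""
    s = np.zeros(W_ss.shape[0])
    for x_t in word_vectors:          # input arrives at every step
        s = np.tanh(W_ss @ s + W_sx @ x_t)
    return s

rng = np.random.default_rng(2)
W_ss = rng.normal(size=(3, 3))
W_sx = rng.normal(size=(3, 4))
short = rng.normal(size=(2, 4))       # 2-word sentence
long_ = rng.normal(size=(7, 4))       # 7-word sentence
print(encode_sentence(short, W_ss, W_sx).shape,
      encode_sentence(long_, W_ss, W_sx).shape)   # same-size encodings
```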
Gating and LSTM
Now, we introduce a gate vector g_t of the same dimension as s_t, which determines “how much information to overwrite in the next state.” A single-layered gated RNN can be written as:

g_t = sigmoid(W^{g,s} s_{t-1} + W^{g,x} x_t)
s_t = (1 - g_t) ⊙ s_{t-1} + g_t ⊙ tanh(W^{s,s} s_{t-1} + W^{s,x} x_t)

- If the i-th element of g_t is 0, the i-th element of s_t and that of s_{t-1} are equal
- In an LSTM, c_t (the memory cell) and h_t (the visible state) together represent the context or state
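A minimal sketch of one gated update step under this formulation (weight names and random values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_rnn_step(s_prev, x_t, W_gs, W_gx, W_ss, W_sx):
    """One step of a single-layered gated RNN:
    g_t decides, element-wise, how much of the state to overwrite."""
    g_t = sigmoid(W_gs @ s_prev + W_gx @ x_t)
    candidate = np.tanh(W_ss @ s_prev + W_sx @ x_t)
    return (1.0 - g_t) * s_prev + g_t * candidate   # g_t[i] = 0 keeps s_prev[i]

rng = np.random.default_rng(3)
m, d = 3, 4
W_gs, W_gx, W_ss, W_sx = [rng.normal(size=shape)
                          for shape in [(m, m), (m, d), (m, m), (m, d)]]
s = np.zeros(m)
x = rng.normal(size=d)
print(gated_rnn_step(s, x, W_gs, W_gx, W_ss, W_sx))
```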
Markov Symbols
To specify a Markov language model we need a start symbol (<beg>), an end symbol (<end>), and a symbol for unknown words (UNK).
Transition Probabilities
The probability of generating the sentence <beg> ML course UNK <end> is the product of the transition probabilities along it; with the transition probabilities given in the lecture, this comes out to 0.007.
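A sketch of how such a sentence probability is computed as a product of bigram transition probabilities. The numeric values below are hypothetical placeholders (the lecture’s actual table is not reproduced here), chosen only so the product matches 0.007:

```python
# Hypothetical bigram transition probabilities; placeholders, not the lecture's table.
transitions = {
    ("<beg>", "ML"): 0.7,
    ("ML", "course"): 0.1,
    ("course", "UNK"): 0.1,
    ("UNK", "<end>"): 1.0,
}

sentence = ["<beg>", "ML", "course", "UNK", "<end>"]

# Probability of the sentence = product of the transition probabilities
prob = 1.0
for prev, nxt in zip(sentence, sentence[1:]):
    prob *= transitions[(prev, nxt)]
print(prob)   # 0.007 (up to floating-point rounding)
```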
Markov Models to Feedforward Neural Nets
Let the probability that word j occurs next be p_j. These probabilities must satisfy p_k ≥ 0 for all k and sum to 1; in order to satisfy those conditions, we take the softmax activation of the outputs.
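A minimal sketch of the softmax computation (NumPy; the example scores are illustrative):

```python
import numpy as np

def softmax(z):
    """Softmax turns arbitrary real-valued outputs into probabilities:
    every entry is non-negative and the entries sum to 1."""
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, -1.0, 0.5])   # illustrative network outputs
p = softmax(scores)
print(p, p.sum())                     # all p_k >= 0, sum == 1.0
```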
When representing a first-order Markov model as a feedforward network, the input is a one-hot encoding of the previous word, so the number of non-zero values in a single input vector is 1.
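For instance, a one-hot encoding over a small vocabulary (the vocabulary below is an illustrative assumption):

```python
import numpy as np

vocab = ["<beg>", "ML", "course", "UNK", "<end>"]   # illustrative vocabulary

def one_hot(word):
    """Encode the previous word as a one-hot vector:
    exactly one entry of the input vector is non-zero."""
    x = np.zeros(len(vocab))
    x[vocab.index(word)] = 1.0
    return x

print(one_hot("ML"))                          # [0. 1. 0. 0. 0.]
print(int(np.count_nonzero(one_hot("ML"))))   # 1
```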
Advantages of the feedforward NN as described in the lecture versus Markov models
- They contain fewer parameters
- We can easily control the complexity of feedforward NN by introducing hidden layers
If you have a word vocabulary of size 10 (including <beg> and <end>) and you are using a trigram language model to predict the next word, a Markov model needs 10^3 = 1000 parameters, while a feedforward neural network with biases and no hidden units needs 2 * 10 * 10 + 10 = 210 parameters (two one-hot input vectors of size 10, an output of size 10, plus 10 biases).
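A quick sketch of this arithmetic (variable names are illustrative):

```python
V = 10        # vocabulary size, including <beg> and <end>
n = 3         # trigram model: condition on the previous n - 1 = 2 words

markov_params = V ** n              # one probability per (w1, w2, w3) triple
ffnn_params = (n - 1) * V * V + V   # weights for 2 one-hot inputs of size V, plus V biases
print(markov_params, ffnn_params)   # 1000 210
```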
RNN Components
The main challenge with an n-gram model is that the history it conditions on is fixed, whereas it should be allowed to vary. An RNN handles this with:
- an input layer that takes in the new information together with the previous state
- a hidden state that summarizes the variable-length history seen so far
RNN Outputs
The output at each step is p_t = softmax(W^o s_t), where W^o extracts the relevant features for a prediction, s_t encodes the data’s relevant features, and softmax transforms the result into a probability distribution.
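A minimal sketch of this output step (the matrix name W_o, dimensions, and random values are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_distribution(s_t, W_o):
    """p_t = softmax(W_o @ s_t): W_o extracts the prediction-relevant features
    from the state s_t, and softmax turns them into probabilities."""
    return softmax(W_o @ s_t)

rng = np.random.default_rng(4)
vocab_size, state_dim = 5, 3
W_o = rng.normal(size=(vocab_size, state_dim))   # illustrative output weights
s_t = rng.normal(size=state_dim)                 # current RNN state
p_t = next_word_distribution(s_t, W_o)
print(p_t, p_t.sum())                            # a distribution over the vocabulary
```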
RNN Decoding
In the first illustration, the foreign word “Olen” is a sample drawn from the distribution the RNN produced.
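A minimal decoding sketch under these assumptions: a plain tanh RNN, a softmax output layer, and a tiny made-up vocabulary and embedding table (none of these specifics come from the lecture):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode(s, W_ss, W_sx, W_o, embeddings, vocab, max_len=10, seed=0):
    """Decoder sketch: at each step the RNN state produces a distribution
    over the vocabulary, the next word is sampled from it, and that word
    is fed back in as the next input."""
    rng = np.random.default_rng(seed)
    words = []
    x = embeddings["<beg>"]
    for _ in range(max_len):
        s = np.tanh(W_ss @ s + W_sx @ x)   # update the state
        p = softmax(W_o @ s)               # distribution over the next word
        w = rng.choice(vocab, p=p)         # "sampled" next word
        if w == "<end>":
            break
        words.append(str(w))
        x = embeddings[str(w)]             # feed the sample back in
    return words

vocab = ["<beg>", "Olen", "opiskelija", "<end>"]  # illustrative vocabulary
rng = np.random.default_rng(5)
m, d = 3, 2
embeddings = {w: rng.normal(size=d) for w in vocab}
W_ss = rng.normal(size=(m, m))
W_sx = rng.normal(size=(m, d))
W_o = rng.normal(size=(len(vocab), m))
print(decode(np.zeros(m), W_ss, W_sx, W_o, embeddings, vocab))
```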