Recurrent Neural Networks (RNNs)

Encoding Sequences with Feed-Forward Neural Networks

If we would like to encode the data into feature vectors for a feed-forward NN, the most viable strategy is to slide a window of size 10 and use the most recent 10 points as the feature vector.
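
To make the windowing concrete, here is a minimal sketch (the toy series and the function name sliding_windows are illustrative assumptions, not from the lecture) that turns a time series into 10-dimensional feature vectors for a feed-forward NN:

```python
import numpy as np

# Slide a window of size 10 over a 1-D series and use the most recent
# 10 points as the features for predicting the next value.
def sliding_windows(series, window=10):
    X, y = [], []
    for t in range(window, len(series)):
        X.append(series[t - window:t])  # the 10 most recent points
        y.append(series[t])             # the value we want to predict
    return np.array(X), np.array(y)

series = np.sin(np.linspace(0, 20, 200))   # toy series for illustration
X, y = sliding_windows(series, window=10)
print(X.shape, y.shape)                    # (190, 10) (190,)
```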

Why Do We Need RNNs?

RNNs address the issue of how many time steps back we should look when constructing the feature vector.

A vector representation of a sentence could be used to:

  • predict whether the sentence is positive or negative
  • translate the sentence to another language
  • predict the next word in the sentence

Two steps are necessary:

  • mapping a sequence to a vector
  • mapping a vector to a prediction

Encoding with RNN

The following equation describes a single-layered recurrent neural network:

s_{t}=\tanh \left(W^{s, s} s_{t-1}+W^{s, x} x_{t}\right)

Where:

  • s_t is the new context or state
  • x_t is the new information
  • s_{t-1} is the previous context or state
  • W^{s, s} (deciding what part of the previous information to keep) and W^{s, x} (taking the new information into account) are the parameters of the RNN (a minimal sketch of one update step is shown below)
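
As a minimal sketch of the update above (dimensions and variable names are assumptions chosen for illustration), one recurrence step in NumPy might look like this:

```python
import numpy as np

# One step of the recurrence s_t = tanh(W_ss s_{t-1} + W_sx x_t).
state_dim, input_dim = 4, 3
rng = np.random.default_rng(0)
W_ss = rng.normal(size=(state_dim, state_dim))  # keeps part of the previous state
W_sx = rng.normal(size=(state_dim, input_dim))  # takes the new information into account

def rnn_step(s_prev, x_t):
    return np.tanh(W_ss @ s_prev + W_sx @ x_t)

s = np.zeros(state_dim)          # initial context
x = rng.normal(size=input_dim)   # one step of new information
s = rnn_step(s, x)
print(s)
```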

Encoding Sentences

Let’s consider the following graphical representation of encoding sentences with an RNN:

  • input is received at each layer (per word), not just at the beginning as in a typical feed-forward network
  • the number of layers varies and depends on the length of the sentence
  • parameters of each layer are shared across all words (see the sketch below)
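
A rough sketch of this idea (the embedding table and dimensions are assumed for illustration): encoding a sentence means applying the same update, with the same shared parameters, once per word, so the number of steps varies with sentence length.

```python
import numpy as np

# Encode a sentence by feeding one word embedding per step into the same
# recurrence; W_ss, W_sx and the embedding table are shared across steps.
rng = np.random.default_rng(1)
state_dim, emb_dim = 4, 3
vocab = {"ML": 0, "course": 1, "is": 2, "fun": 3}
E = rng.normal(size=(len(vocab), emb_dim))          # assumed word embeddings
W_ss = rng.normal(size=(state_dim, state_dim))
W_sx = rng.normal(size=(state_dim, emb_dim))

def encode(sentence):
    s = np.zeros(state_dim)                         # start from an empty context
    for word in sentence.split():
        s = np.tanh(W_ss @ s + W_sx @ E[vocab[word]])
    return s                                        # final state = sentence vector

print(encode("ML course is fun"))                   # length 4 -> 4 "layers"
print(encode("ML is fun"))                          # length 3 -> 3 "layers"
```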

Gating and LSTM

Now, we introduce a gate vector g_t of the same dimension as s_t, which determines “how much information to overwrite in the next state.” In equations, a single-layered gated RNN can be written as:

g_{t}=\operatorname{sigmoid}\left(W^{g, s} s_{t-1}+W^{g, x} x_{t}\right)
s_{t}=\left(1-g_{t}\right) \odot s_{t-1}+g_{t} \odot \tanh \left(W^{s, s} s_{t-1}+W^{s, x} x_{t}\right)

  • If the i-th element of g_t is 0, then the i-th element of s_t equals the i-th element of s_{t-1}
  • In an LSTM, c_t and h_t together represent the context or state (a sketch of one gated update step is shown below)
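
A minimal sketch of one gated update step under assumed dimensions (this is the simple gated RNN written above, not the full LSTM):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# g_t decides, element-wise, how much of the state to overwrite.
rng = np.random.default_rng(2)
state_dim, input_dim = 4, 3
W_gs = rng.normal(size=(state_dim, state_dim))
W_gx = rng.normal(size=(state_dim, input_dim))
W_ss = rng.normal(size=(state_dim, state_dim))
W_sx = rng.normal(size=(state_dim, input_dim))

def gated_step(s_prev, x_t):
    g = sigmoid(W_gs @ s_prev + W_gx @ x_t)            # gate in (0, 1)
    candidate = np.tanh(W_ss @ s_prev + W_sx @ x_t)    # proposed new content
    return (1 - g) * s_prev + g * candidate            # element-wise blend

s = gated_step(np.zeros(state_dim), rng.normal(size=input_dim))
print(s)
```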

Markov Symbols

To specify a Markov language model, we need a start symbol (<beg>), an end symbol (<end>), and a symbol for unknown words (UNK).

Transition Probabilities

The probability of generating the sentence <beg> ML course UNK <end> is the product of the individual transition probabilities; with the values given in the lecture, it comes out to 0.007.
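
To make the arithmetic explicit, here is a sketch with hypothetical transition probabilities (chosen only so that the product comes out to 0.007; the actual table from the lecture may differ):

```python
# Hypothetical bigram transition probabilities (illustrative values only).
P = {
    ("<beg>", "ML"): 0.7,
    ("ML", "course"): 0.1,
    ("course", "UNK"): 0.2,
    ("UNK", "<end>"): 0.5,
}

sentence = ["<beg>", "ML", "course", "UNK", "<end>"]
prob = 1.0
for prev, cur in zip(sentence, sentence[1:]):
    prob *= P[(prev, cur)]      # product of transition probabilities
print(f"{prob:.3f}")            # 0.7 * 0.1 * 0.2 * 0.5 = 0.007
```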

Markov Models to Feedforward Neural Nets

Let the probability that the next word is word k be p_k. These probabilities must satisfy:

\sum_{k \in K} p_{k}=1, \quad p_{k} \geq 0

To satisfy both conditions, we take the softmax activation of the outputs.

When representing a first-order Markov model as a feedforward network, the input is a one-hot encoding of the previous word, so the number of non-zero values in a single input vector is 1.
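
A small sketch of this view (vocabulary size and weights are assumptions): the previous word enters as a one-hot vector, and the softmax over the outputs yields a valid probability distribution.

```python
import numpy as np

# First-order Markov model as a feedforward net: one-hot previous word in,
# softmax distribution over the next word out.
rng = np.random.default_rng(3)
V = 10                                   # vocabulary size
W = rng.normal(size=(V, V))              # weights from previous word to next word
b = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())              # subtract max for numerical stability
    return e / e.sum()

x = np.zeros(V)
x[4] = 1.0                               # one-hot: previous word is word 4
p = softmax(W @ x + b)
print(round(p.sum(), 6), (p >= 0).all())  # 1.0 True
```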

Advantages of the feedforward NN (as described in the lecture) over Markov models

  • They contain fewer parameters
  • We can easily control the complexity of a feedforward NN by introducing hidden layers

If you have a word vocabulary of size 10 (including <beg> and <end>) and you are using a trigram language model to predict the next word, you need 10^3 = 1000 parameters for a Markov model, but only 210 for a feedforward neural network with biases and no hidden units: two one-hot inputs of size 10 each give 20 × 10 = 200 weights, plus 10 biases.
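
A quick check of these counts:

```python
V = 10                                    # vocabulary size, including <beg> and <end>

# Trigram Markov model: one probability per (w1, w2, w3) triple.
markov_params = V ** 3                    # 1000

# Feedforward net: two one-hot inputs (2 * V) fully connected to V outputs,
# plus one bias per output, no hidden units.
ffnn_params = (2 * V) * V + V             # 200 + 10 = 210

print(markov_params, ffnn_params)         # 1000 210
```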

RNN Components

The main challenge with an n-gram model is that the history it conditions on is fixed; an RNN allows the history to be variable. The components that make this possible are:

  • an input layer that takes in new information together with the previous state
  • a hidden state that carries the context forward

RNN Outputs

Let p_{t}=\operatorname{softmax}\left(W^{o} s_{t}\right), where W^{o} extracts the relevant features for a prediction, s_{t} encodes the data’s relevant features, and the softmax transforms the result into a probability distribution.
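
A minimal sketch of this output step, with assumed dimensions:

```python
import numpy as np

# Output layer p_t = softmax(W^o s_t).
rng = np.random.default_rng(4)
state_dim, vocab_size = 4, 10
W_o = rng.normal(size=(vocab_size, state_dim))   # extracts features relevant to prediction

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

s_t = rng.normal(size=state_dim)                 # current state: encoded relevant features
p_t = softmax(W_o @ s_t)                         # distribution over the next word
print(round(p_t.sum(), 6))                       # 1.0
```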

RNN Decoding

In the first image, the foreign word “Olen” is a result sampled from the distribution produced by the RNN, rather than the single most probable word.
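
A sketch of the sampling step (the toy vocabulary and the distribution values are purely illustrative):

```python
import numpy as np

# Sample the next word from the distribution p_t produced by the RNN.
rng = np.random.default_rng(5)
vocab = ["Olen", "sinä", "kurssi", "<end>"]       # toy target-language vocabulary
p_t = np.array([0.6, 0.2, 0.1, 0.1])              # assumed output distribution
next_word = rng.choice(vocab, p=p_t)              # "sampled" rather than the argmax
print(next_word)
```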
