## Encoding Sequences with Feed-Forward Neural Networks

If we would like to encode sequential data into feature vectors for a feed-forward NN, the most viable strategy is to slide a window over the sequence and use the most recent 10 points as the feature vector at each step.
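As a minimal sketch of this windowing (the window size and the toy series are illustrative):

```python
import numpy as np

def sliding_windows(series, window=10):
    """Turn a 1-D series into (feature, target) pairs:
    each feature vector holds the most recent `window` points,
    and the target is the point that follows them."""
    X, y = [], []
    for t in range(window, len(series)):
        X.append(series[t - window:t])
        y.append(series[t])
    return np.array(X), np.array(y)

series = np.sin(np.linspace(0, 10, 200))  # toy sequence
X, y = sliding_windows(series, window=10)
print(X.shape, y.shape)                   # (190, 10) (190,)
```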

## Why Do We Need RNNs?

RNNs address the issue of how many time steps back we should look when building the feature vector: instead of committing to a fixed window, they maintain a state that summarizes the history seen so far.

A vector representation of a sentence could be used to:

- predict whether the sentence is positive or negative
- translate the sentence into another language
- predict the next word in the sentence

Two steps are necessary:

- mapping a sequence to a vector
- mapping a vector to a prediction

## Encoding with RNN

A single-layer recurrent neural network updates its state at each time step according to

s_{t} = tanh(W^{s,s} s_{t-1} + W^{s,x} x_{t})

where:

- s_{t} is the new context or state
- x_{t} is the new information
- s_{t-1} is the previous context or state
- W^{s,s} (deciding what part of the previous information to keep) and W^{s,x} (taking into account the new information) are the parameters of the RNN
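A minimal NumPy sketch of this update (the dimensions and random initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, d_x = 4, 3                                # state and input dimensions (arbitrary)
W_ss = rng.normal(scale=0.1, size=(d_s, d_s))  # keeps part of the old state
W_sx = rng.normal(scale=0.1, size=(d_s, d_x))  # brings in the new information

def rnn_step(s_prev, x_t):
    # s_t = tanh(W^{s,s} s_{t-1} + W^{s,x} x_t)
    return np.tanh(W_ss @ s_prev + W_sx @ x_t)

s = np.zeros(d_s)              # s_0: empty context
x = rng.normal(size=d_x)       # one step of new information
s = rnn_step(s, x)
print(s)
```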

## Encoding Sentences

Encoding a sentence with an RNN can be pictured as a network unrolled over the words of the sentence (sketched in code below):

- input is received at each layer (one word per step), not just at the beginning as in a typical feed-forward network
- the number of layers varies and depends on the length of the sentence
- the parameters of each layer are shared
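Concretely, encoding means applying the same update (with the same parameters) once per word; a sketch, where the embedding table mapping words to vectors is a hypothetical placeholder:

```python
import numpy as np

rng = np.random.default_rng(1)
d_s, d_x = 4, 3
W_ss = rng.normal(scale=0.1, size=(d_s, d_s))
W_sx = rng.normal(scale=0.1, size=(d_s, d_x))
words = ["efforts", "and", "courage", "are", "not", "enough"]
embed = {w: rng.normal(size=d_x) for w in words}  # toy word vectors

def encode(sentence):
    s = np.zeros(d_s)                   # s_0: empty context
    for word in sentence.split():
        # the SAME W_ss, W_sx are reused at every step (shared parameters);
        # the number of steps grows with the sentence length
        s = np.tanh(W_ss @ s + W_sx @ embed[word])
    return s                            # final state = sentence vector

print(encode("efforts and courage are not enough"))
```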

## Gating and LSTM

Now, we introduce a gate vector g_{t} of the same dimension as s_{t}, which determines “how much information to overwrite in the next state.” In equations, a single-layer gated RNN can be written as:

g_{t} = sigmoid(W^{g,s} s_{t-1} + W^{g,x} x_{t})
s_{t} = (1 - g_{t}) ⊙ s_{t-1} + g_{t} ⊙ tanh(W^{s,s} s_{t-1} + W^{s,x} x_{t})

where ⊙ denotes element-wise multiplication. Note that:

- if the i-th element of g_{t} is 0, the i-th element of s_{t} and that of s_{t-1} are equal (that part of the state is carried over unchanged)
- in the LSTM, c_{t} (the memory cell) and h_{t} (the visible hidden state) together represent the context or state
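A sketch of the gated update, under the same illustrative dimensions as before:

```python
import numpy as np

rng = np.random.default_rng(2)
d_s, d_x = 4, 3
W_gs = rng.normal(scale=0.1, size=(d_s, d_s))  # gate weights on the old state
W_gx = rng.normal(scale=0.1, size=(d_s, d_x))  # gate weights on the input
W_ss = rng.normal(scale=0.1, size=(d_s, d_s))
W_sx = rng.normal(scale=0.1, size=(d_s, d_x))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(s_prev, x_t):
    g = sigmoid(W_gs @ s_prev + W_gx @ x_t)          # how much to overwrite
    candidate = np.tanh(W_ss @ s_prev + W_sx @ x_t)  # proposed new content
    # where g is 0 the old state is kept; where g is 1 it is overwritten
    return (1 - g) * s_prev + g * candidate

s = gated_step(np.zeros(d_s), rng.normal(size=d_x))
print(s)
```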

## Markov Symbols

To specify a Markov language model, we need a start symbol <beg>, an end symbol <end>, and a symbol UNK for unknown words.

## Transition Probabilities

Under the transition probabilities given in the lecture’s table, the probability of generating the sentence <beg> ML course UNK <end> is the product of the transitions along it, P(ML | <beg>) · P(course | ML) · P(UNK | course) · P(<end> | UNK) = 0.007.
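A sketch of this computation with made-up transition probabilities (the real values come from the lecture’s table, not reproduced here; these are chosen only so the product matches 0.007):

```python
# Hypothetical bigram (first-order Markov) transition probabilities:
# P(next_word | previous_word).
P = {
    ("<beg>", "ML"):   0.7,
    ("ML", "course"):  0.1,
    ("course", "UNK"): 0.2,
    ("UNK", "<end>"):  0.5,
}

sentence = ["<beg>", "ML", "course", "UNK", "<end>"]
prob = 1.0
for prev, nxt in zip(sentence, sentence[1:]):
    prob *= P[(prev, nxt)]     # multiply transition probabilities along the sentence
print(prob)                    # ≈ 0.007
```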

## Markov Models to Feedforward Neural Nets

Let the probability that word j occurs next be p_{j}. These outputs must form a valid distribution: p_{k} ≥ 0 for every k, and the p_{k} must sum to 1. In order to satisfy both conditions, we take the softmax activation of the outputs:

p_{j} = exp(z_{j}) / Σ_{k} exp(z_{k})

When representing a first-order Markov model as a feedforward network, the number of non-zero values in a single input vector is 1, since the previous word is encoded as a one-hot vector.
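A sketch of a first-order Markov model as a feedforward net: the previous word is one-hot encoded (a single non-zero input), and a softmax over the outputs yields valid next-word probabilities (the vocabulary and random weights are illustrative):

```python
import numpy as np

vocab = ["<beg>", "<end>", "ML", "course", "UNK"]
V = len(vocab)
rng = np.random.default_rng(3)
W = rng.normal(scale=0.1, size=(V, V))  # one column of weights per input word
b = np.zeros(V)                         # biases

def next_word_probs(prev_word):
    x = np.zeros(V)
    x[vocab.index(prev_word)] = 1.0     # one-hot: exactly one non-zero value
    z = W @ x + b
    p = np.exp(z - z.max())             # softmax: p_j >= 0 ...
    return p / p.sum()                  # ... and the p_j sum to 1

p = next_word_probs("ML")
print(p, p.sum())                       # five probabilities summing to 1.0
```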

## Advantages of Feedforward NNs (as Described in the Lecture) versus Markov Models

- They contain fewer parameters
- We can easily control the complexity of a feedforward NN by introducing hidden layers

If you have a word vocabulary of size 10 (including <beg> and <end>) and you are using a trigram language model to predict the next word, you will need 10³ = 1000 parameters for a Markov model, but only 210 for a feedforward neural network with biases and no hidden units: the two previous words are one-hot encoded into 2 × 10 = 20 inputs, each connected to 10 outputs (200 weights), plus 10 biases.
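The counting behind those numbers, as a quick check:

```python
V, n = 10, 3                 # vocabulary size, n-gram order

markov_params = V ** n       # one probability per (context, next word) combination
inputs = (n - 1) * V         # n-1 previous words, each one-hot of size V
ffnn_params = inputs * V + V # weights into V outputs, plus V biases

print(markov_params)         # 1000
print(ffnn_params)           # 20*10 + 10 = 210
```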

## RNN Components

The main shortcoming of an n-gram model is that the history it conditions on is fixed rather than variable. An RNN handles a variable history through:

- an input layer which takes in the new information together with the previous state
- a hidden state that summarizes everything seen so far

## RNN Outputs

When using an RNN to make a prediction, s_{t} encodes the relevant features of the data seen so far, an output layer extracts the features relevant for the prediction, and a softmax transforms the result into a probability distribution:

p_{t} = softmax(W^{o} s_{t})
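A sketch of this output step (W_o is an assumed output weight matrix; the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d_s, V = 4, 5                               # state dimension, vocabulary size
W_o = rng.normal(scale=0.1, size=(V, d_s))  # extracts prediction features from s_t

def output_distribution(s_t):
    z = W_o @ s_t
    p = np.exp(z - z.max())                 # softmax -> probability distribution
    return p / p.sum()

p_t = output_distribution(rng.normal(size=d_s))
print(p_t, p_t.sum())                       # distribution over the vocabulary
```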

## RNN Decoding

In the decoding illustration, the foreign word “Olen” is a “sampled” result from the distribution that the RNN produced at that step; the sampled word is then fed back as the next input.
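A sketch of sampling-based decoding under the same assumed setup as the earlier sketches (the tiny vocabulary, including “Olen”, is illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
vocab = ["<end>", "Olen", "opiskelija", "UNK"]   # toy target-language vocabulary
V, d_s = len(vocab), 4
W_ss = rng.normal(scale=0.1, size=(d_s, d_s))
W_sx = rng.normal(scale=0.1, size=(d_s, V))
W_o = rng.normal(scale=0.1, size=(V, d_s))

def decode(s, max_len=10):
    words, x = [], np.zeros(V)               # start with an empty input
    for _ in range(max_len):
        s = np.tanh(W_ss @ s + W_sx @ x)     # update the state
        z = W_o @ s
        p = np.exp(z - z.max()); p /= p.sum()
        j = rng.choice(V, p=p)               # sample a word from the distribution
        if vocab[j] == "<end>":
            break
        words.append(vocab[j])
        x = np.zeros(V); x[j] = 1.0          # feed the sample back as input
    return " ".join(words)

print(decode(rng.normal(size=d_s)))          # e.g. "Olen opiskelija"
```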