Recurrent Neural Networks

Hello, everyone!

We are back again with another interesting concept in applied deep learning. Last week we covered CNNs and their implementation.

Today we are going to look at a new concept known as the RNN (Recurrent Neural Network). So let’s get started!!

Introduction:

A Recurrent Neural Network (RNN) is a state-of-the-art algorithm for modelling sequential data, and it has made a real breakthrough in the field of NLP. It is considered a powerful algorithm because its internal memory lets it remember previous inputs, which makes it ideally suited to machine learning problems involving sequences and very good at predicting what comes next in a sequence. The connections in an RNN form a directed graph along the sequence, so the same weights are applied at every time step. Forward propagation produces the predictions, and backpropagation (through time) then gradually adjusts the weights to correct wrong predictions. RNNs are not limited to text either; they can be applied to other sequential data such as speech, time series, and even sequences of pixels.


Forward Propagation:

The RNN forward propagation can be represented by the following set of equations, where a⟨t⟩ is the hidden activation, o⟨t⟩ is the output, ŷ⟨t⟩ is the predicted probability vector, and the weight matrices and biases are explained in the walkthrough below:

a⟨t⟩ = tanh(Waa a⟨t−1⟩ + Wax x⟨t⟩ + ba)
o⟨t⟩ = Wya a⟨t⟩ + by
ŷ⟨t⟩ = softmax(o⟨t⟩)
L⟨t⟩ = −log p(y⟨t⟩ | x⟨1⟩, …, x⟨t⟩)
L = Σ_t L⟨t⟩

These equations describe a recurrent network that maps an input sequence to an output sequence of the same length. The total loss for a given sequence of x values paired with a sequence of y values is then just the sum of the losses over all the time steps. We assume that the output o(t) is used as the argument to the softmax function to obtain the vector ŷ of probabilities over the outputs. We also assume that the loss L is the negative log-likelihood of the true target y(t) given the input so far.
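
To make the softmax and negative log-likelihood step concrete, here is a minimal NumPy sketch; the names softmax, o_t, and y_t are illustrative (not from the original post), and the true target is assumed to be given as an index:

    import numpy as np

    def softmax(o):
        # subtract the max before exponentiating for numerical stability
        e = np.exp(o - np.max(o))
        return e / e.sum()

    o_t = np.array([2.0, 1.0, 0.1])   # raw output o⟨t⟩ of the network at time step t
    y_t = 2                           # index of the true target at time step t

    y_hat = softmax(o_t)              # ŷ⟨t⟩: probabilities over the outputs
    loss_t = -np.log(y_hat[y_t])      # negative log-likelihood of the true target
    print(y_hat, loss_t)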

Forward Propagation Walkthrough:



Let’s work through a practical representation of forward propagation, which can be expressed with the following equations:

For the 1st unit: Input: x⟨1⟩, Output: ŷ⟨1⟩

The RNN activation unit a⟨0⟩ is initialized as a vector of zeros. Forward propagation:

a⟨1⟩ = g1(Waa a⟨0⟩ + Wax x⟨1⟩ + ba)

ŷ⟨1⟩ = g2(Wya a⟨1⟩ + by)

The activation function g1() is typically tanh() or ReLU(), and the activation function g2() is typically sigmoid() or softmax().

Generalized representation:

a⟨t⟩ = g1(Waa a⟨t−1⟩ + Wax x⟨t⟩ + ba)

ŷ⟨t⟩ = g2(Wya a⟨t⟩ + by)
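
To make the generalized step concrete, here is a minimal NumPy sketch of a single forward step, assuming g1 = tanh and g2 = softmax; the dimensions and names (n_a, n_x, n_y, rnn_step) are illustrative rather than taken from the post:

    import numpy as np

    def softmax(z):
        # subtract the max before exponentiating for numerical stability
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def rnn_step(a_prev, x_t, Waa, Wax, Wya, ba, by):
        # a⟨t⟩ = g1(Waa a⟨t−1⟩ + Wax x⟨t⟩ + ba), with g1 = tanh
        a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
        # ŷ⟨t⟩ = g2(Wya a⟨t⟩ + by), with g2 = softmax
        y_hat_t = softmax(Wya @ a_t + by)
        return a_t, y_hat_t

    # illustrative sizes: hidden state 100, input 10000, output 5
    n_a, n_x, n_y = 100, 10000, 5
    rng = np.random.default_rng(0)
    Waa = rng.standard_normal((n_a, n_a)) * 0.01
    Wax = rng.standard_normal((n_a, n_x)) * 0.01
    Wya = rng.standard_normal((n_y, n_a)) * 0.01
    ba, by = np.zeros(n_a), np.zeros(n_y)

    a0 = np.zeros(n_a)                 # a⟨0⟩ initialized as a vector of zeros
    x1 = rng.standard_normal(n_x)      # a dummy input x⟨1⟩
    a1, y_hat1 = rnn_step(a0, x1, Waa, Wax, Wya, ba, by)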

Representing the Weight Matrices in a Different Way:

The equation a⟨t⟩ = g1(Waa a⟨t−1⟩ + Wax x⟨t⟩ + ba) can be written in a simpler way by combining Waa and Wax horizontally into a new matrix Wa.

The matrices are appended side by side horizontally, such that Wa = [Waa : Wax]. If a is in R^100 and x is in R^10000, then Waa is a 100x100 matrix, Wax is a 100x10000 matrix, and Wa will be a 100x10100 matrix.

Similarly, a⟨t−1⟩ and x⟨t⟩ can be stacked one below the other (vertically). Since a is in R^100 and x is in R^10000, the combined vector will be in R^10100.

[Waa : Wax] [a⟨t−1⟩ ; x⟨t⟩] = Waa a⟨t−1⟩ + Wax x⟨t⟩

The equations can now be written as:

a⟨t⟩ = g1(Wa[a⟨t−1⟩, x⟨t⟩] + ba)
ŷ⟨t⟩ = g2(Wy a⟨t⟩ + by)
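
As a quick check, the following NumPy sketch (using the illustrative 100 and 10000 dimensions from above) confirms that the stacked form Wa [a⟨t−1⟩ ; x⟨t⟩] gives the same result as Waa a⟨t−1⟩ + Wax x⟨t⟩:

    import numpy as np

    rng = np.random.default_rng(1)
    n_a, n_x = 100, 10000

    Waa = rng.standard_normal((n_a, n_a))
    Wax = rng.standard_normal((n_a, n_x))
    a_prev = rng.standard_normal(n_a)
    x_t = rng.standard_normal(n_x)

    Wa = np.concatenate([Waa, Wax], axis=1)   # [Waa : Wax], shape (100, 10100)
    ax = np.concatenate([a_prev, x_t])        # [a⟨t−1⟩ ; x⟨t⟩], length 10100

    lhs = Wa @ ax                             # combined form
    rhs = Waa @ a_prev + Wax @ x_t            # original two-term form
    print(np.allclose(lhs, rhs))              # prints True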

Backpropagation:

Let’s quickly recap the fundamental equations of our RNN. Note that for this section we switch notation slightly: the hidden state is written s(t), the prediction is written ŷ(t), and the weight matrices are called U (input-to-hidden), W (hidden-to-hidden), and V (hidden-to-output). That is only to stay consistent with some of the literature this section is based on.

s(t) = tanh(U x(t) + W s(t−1))
ŷ(t) = softmax(V s(t))

We also define our loss, or error, to be the cross-entropy loss, given by:

E(t) = −y(t) · log ŷ(t)
E = Σ_t E(t) = −Σ_t y(t) · log ŷ(t)

Here, y(t) is the correct word (as a one-hot vector) at time step t, and ŷ(t) is our prediction. We typically treat the full sequence (sentence) as one training example, so the total error is just the sum of the errors at each time step (word).


Recall that our goal is to compute the gradients of the error with respect to our parameters U, V, and W, and then learn good parameters using Stochastic Gradient Descent. Just like we sum up the errors, we also sum up the gradients at each time step for one training example:

∂E/∂W = Σ_t ∂E(t)/∂W

To calculate these gradients we use the chain rule of differentiation. This is the backpropagation algorithm when applied backwards starting from the error. For the rest of this post we’ll use E(3) as an example, just to have concrete numbers to work with:

∂E(3)/∂V = ∂E(3)/∂ŷ(3) · ∂ŷ(3)/∂z(3) · ∂z(3)/∂V = (ŷ(3) − y(3)) ⊗ s(3)

In the above, z(3) = V s(3), and ⊗ is the outer product of two vectors. Don’t worry if you don’t follow every step; I skipped over a few, and you can try working out these derivatives yourself. The point I’m trying to get across is that ∂E(3)/∂V only depends on the values at the current time step: ŷ(3), y(3), and s(3). If you have these, computing the gradient for V is a simple matrix multiplication.
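
As a small sanity check of this formula, here is a minimal NumPy sketch (all names and sizes are illustrative) that builds ∂E(3)/∂V as the outer product (ŷ(3) − y(3)) ⊗ s(3), assuming a softmax output, cross-entropy loss, and a one-hot target:

    import numpy as np

    n_y, n_s = 8, 100                       # output (vocabulary) size and hidden-state size
    rng = np.random.default_rng(3)

    y_hat3 = rng.random(n_y)
    y_hat3 /= y_hat3.sum()                  # prediction ŷ(3), a probability vector
    y3 = np.zeros(n_y)
    y3[2] = 1.0                             # one-hot true word y(3)
    s3 = rng.standard_normal(n_s)           # hidden state s(3)

    # ∂E(3)/∂V = (ŷ(3) − y(3)) ⊗ s(3): an n_y x n_s matrix, the same shape as V
    dE3_dV = np.outer(y_hat3 - y3, s3)
    print(dE3_dV.shape)                     # (8, 100)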

But the story is different for ∂E(3)/∂W (and for U). To see why, we write out the chain rule, just as above:

∂E(3)/∂W = ∂E(3)/∂ŷ(3) · ∂ŷ(3)/∂s(3) · ∂s(3)/∂W

Now, note that s(3) = tanh(U x(3) + W s(2)) depends on s(2), which in turn depends on W and s(1), and so on. So if we take the derivative with respect to W we can’t simply treat s(2) as a constant! We need to apply the chain rule again, and what we really have is this:

∂E(3)/∂W = Σ_{k=0}^{3} ∂E(3)/∂ŷ(3) · ∂ŷ(3)/∂s(3) · ∂s(3)/∂s(k) · ∂s(k)/∂W

We sum up the contributions of each time step to the gradient. In other words, because W is used in every step up to the output we care about, we need to backpropagate gradients from t=3 through the network all the way to t=0.

Note that this is exactly the same as the standard backpropagation algorithm that we use in deep feedforward neural networks. The key difference is that we sum up the gradients for W at each time step. In a traditional NN, we don’t share parameters across layers, so we don’t need to sum anything. But in my opinion, BPTT (Backpropagation Through Time) is just a fancy name for standard backpropagation on an unrolled RNN. Just like with backpropagation, you can define a delta vector that you pass backwards.

e.g. δ(2) = ∂E(3)/∂z(2) = ∂E(3)/∂s(3) · ∂s(3)/∂s(2) · ∂s(2)/∂z(2), with z(2) = U x(2) + W s(1).
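
Finally, to tie the whole section together, here is a minimal NumPy sketch of backpropagation through time for this RNN. It is only a sketch under the assumptions used above (tanh hidden units, softmax outputs, cross-entropy loss, one-hot inputs and targets); the function and variable names are illustrative, not from the original post. Note how the gradients for U, V, and W are summed over the time steps, and how the delta vector dz is passed backwards through W:

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def bptt(xs, ys, U, V, W):
        # xs, ys: lists of one-hot input vectors x(t) and target vectors y(t)
        # returns the gradients dU, dV, dW summed over all time steps
        T = len(xs)
        n_s = W.shape[0]

        # forward pass: s(t) = tanh(U x(t) + W s(t-1)), ŷ(t) = softmax(V s(t))
        s = np.zeros((T + 1, n_s))            # s[-1] serves as the initial zero state
        y_hat = np.zeros((T, V.shape[0]))
        for t in range(T):
            s[t] = np.tanh(U @ xs[t] + W @ s[t - 1])
            y_hat[t] = softmax(V @ s[t])

        # backward pass: sum the gradient contribution of every time step
        dU, dV, dW = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
        ds_future = np.zeros(n_s)             # gradient flowing back from later time steps
        for t in reversed(range(T)):
            do = y_hat[t] - ys[t]             # ∂E(t)/∂o(t) for softmax + cross-entropy
            dV += np.outer(do, s[t])
            ds = V.T @ do + ds_future         # total gradient reaching s(t)
            dz = ds * (1.0 - s[t] ** 2)       # delta vector ∂E/∂z(t), z(t) = U x(t) + W s(t-1)
            dU += np.outer(dz, xs[t])
            dW += np.outer(dz, s[t - 1])
            ds_future = W.T @ dz              # pass the delta back to time step t-1
        return dU, dV, dW

    # toy usage: vocabulary of 8 words, hidden size 16, sequence length 4 (all illustrative)
    rng = np.random.default_rng(42)
    n_y, n_s, T = 8, 16, 4
    U = rng.standard_normal((n_s, n_y)) * 0.01
    V = rng.standard_normal((n_y, n_s)) * 0.01
    W = rng.standard_normal((n_s, n_s)) * 0.01
    xs = [np.eye(n_y)[rng.integers(n_y)] for _ in range(T)]
    ys = [np.eye(n_y)[rng.integers(n_y)] for _ in range(T)]
    dU, dV, dW = bptt(xs, ys, U, V, W)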

Conclusion:

I hope this blog gives you an idea of the concepts covered above and a clear understanding of how recurrent neural networks are used. The RNN is an efficient and flexible deep learning model that works especially well for natural language processing, and here we have looked at it only briefly.
