Recurrent Neural Network — RNN
RNNs can be adapted to different types of problems by changing the way the cells are arranged in the graph. We will look at some examples of these configurations and how they are used to solve specific problems.
We will also learn about a major limitation of the SimpleRNN cell, and how two variants of the SimpleRNN cell - long short-term memory (LSTM) and gated recurrent unit (GRU) - overcome this limitation. Both LSTM and GRU are drop-in replacements for the SimpleRNN cell, so just replacing the RNN cell with one of these variants can often result in a major performance improvement in your network. While LSTM and GRU are not the only variants, it has been shown empirically that they are the best choices for most sequence problems.
Finally, we will also learn about some tips to improve the performance of our RNNs and when and how to apply them.
In this chapter, we will cover the following topics:
- SimpleRNN cell
- Basic RNN implementation in Keras for generating text
- RNN topologies
- LSTM, GRU, and other RNN variants
Vanishing and exploding gradients
Just like traditional neural networks, training an RNN involves backpropagation. The difference in this case is that, since the parameters are shared by all time steps, the gradient at each output depends not only on the current time step, but also on all previous ones. This process is called backpropagation through time (BPTT).
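The effect of BPTT that causes vanishing and exploding gradients can be sketched numerically: the gradient flowing back T steps is repeatedly multiplied by the recurrent weight matrix, so its norm scales like the largest singular value of that matrix raised to the power T. The matrix size and the 0.5/1.5 scale factors below are illustrative choices, not values from the chapter:

```python
import numpy as np

def backprop_norm(scale, steps=50, size=4, seed=42):
    """Norm of a gradient pushed back `steps` time steps through BPTT."""
    rng = np.random.default_rng(seed)
    # Build a matrix whose singular values all equal `scale`
    # (a scaled random orthogonal matrix)
    q, _ = np.linalg.qr(rng.normal(size=(size, size)))
    W = scale * q
    grad = np.ones(size)
    for _ in range(steps):
        grad = W.T @ grad  # one backward step through time
    return np.linalg.norm(grad)

vanishing = backprop_norm(0.5)   # singular values < 1: gradient shrinks
exploding = backprop_norm(1.5)   # singular values > 1: gradient blows up
print(vanishing, exploding)
```

With singular values below one the gradient becomes vanishingly small after a few dozen steps, so early time steps stop learning; above one it grows without bound, destabilizing training. This is exactly the problem the LSTM and GRU cells in the next sections are designed to mitigate.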
Long short-term memory - LSTM
LSTM with Keras - sentiment analysis
Keras provides an LSTM layer that we will use here to construct and train a many-to-one RNN. Our network takes in a sentence (a sequence of words) and outputs a sentiment value (positive or negative). Our training set is a dataset of about 7,000 short sentences from the UMICH SI650 sentiment classification competition on Kaggle (https://www.kaggle.com/c/si650winter11#description). Each sentence is labeled 1 or 0 for positive or negative sentiment respectively, which our network will learn to predict.
- snippet.python
from keras.layers.core import Activation, Dense, Dropout, SpatialDropout1D
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.models import Sequential
from keras.preprocessing import sequence
from sklearn.model_selection import train_test_split
import collections
import matplotlib.pyplot as plt
import nltk
import numpy as np
import os
Before we start, we want to do a bit of exploratory analysis on the data. Specifically, we need to know how many unique words there are in the corpus and how many words there are in each sentence:
- snippet.python
maxlen = 0
word_freqs = collections.Counter()
num_recs = 0
ftrain = open(os.path.join(DATA_DIR, "umich-sentiment-train.txt"), 'rb')
for line in ftrain:
    # each line is "label<TAB>sentence"; the file is read as bytes
    label, sentence = line.strip().split(b"\t")
    words = nltk.word_tokenize(sentence.decode("ascii", "ignore").lower())
    if len(words) > maxlen:
        maxlen = len(words)
    for word in words:
        word_freqs[word] += 1
    num_recs += 1
ftrain.close()
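The counts gathered above feed the next preprocessing step: building a fixed-size vocabulary and converting each sentence into a padded sequence of word indices, which is the input format the Embedding layer expects. A minimal framework-free sketch of that step (MAX_FEATURES, MAX_SENTENCE_LENGTH, and the toy corpus are illustrative, not values from the chapter):

```python
import collections

# Illustrative vocabulary size and sentence length
MAX_FEATURES = 2000
MAX_SENTENCE_LENGTH = 40

# Toy stand-in for the word_freqs counter built from the training file
word_freqs = collections.Counter(
    "i love this movie i really love it".split())

# Index 0 is reserved for padding, index 1 for out-of-vocabulary words
word2index = {w: i + 2
              for i, (w, _) in enumerate(word_freqs.most_common(MAX_FEATURES))}
word2index["PAD"] = 0
word2index["UNK"] = 1

def to_padded_sequence(words, maxlen=MAX_SENTENCE_LENGTH):
    """Map words to indices and left-pad with zeros to a fixed length,
    mimicking what keras.preprocessing.sequence.pad_sequences does."""
    seq = [word2index.get(w, word2index["UNK"]) for w in words]
    return [0] * (maxlen - len(seq)) + seq[:maxlen]

x = to_padded_sequence("i love it".split())
print(len(x))  # 40
```

In the actual pipeline this is done with `keras.preprocessing.sequence.pad_sequences`; the sketch just makes explicit what the fixed-length integer sequences look like before they reach the network.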
Bidirectional RNNs
At a given time step t, the output of the RNN depends on the outputs at all previous time steps. However, the output may depend on future elements of the sequence as well. This is especially true for applications such as NLP, where the attributes of the word or phrase we are trying to predict may depend on the context given by the entire enclosing sentence, not just the words that came before it. Bidirectional RNNs also help a network architecture place equal emphasis on the beginning and end of the sequence, and increase the data available for training.
Bidirectional RNNs are two RNNs stacked on top of each other, reading the input in opposite directions.
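The idea can be sketched without a framework: run the same kind of recurrent cell over the sequence in both directions and combine the two hidden states at each step, for example by concatenation (in Keras, the Bidirectional wrapper handles this). The cell below is a plain tanh SimpleRNN with illustrative dimensions:

```python
import numpy as np

def simple_rnn(x, Wx, Wh, b):
    """Run a tanh SimpleRNN over x of shape (time, features);
    return the hidden state at every time step."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(x_t @ Wx + h @ Wh + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, features, units = 5, 3, 4
x = rng.normal(size=(T, features))
params = (rng.normal(size=(features, units)),   # input weights
          rng.normal(size=(units, units)),      # recurrent weights
          np.zeros(units))                      # bias

fwd = simple_rnn(x, *params)                 # reads input left to right
bwd = simple_rnn(x[::-1], *params)[::-1]     # reads right to left, realigned
bidir = np.concatenate([fwd, bwd], axis=-1)  # shape (T, 2 * units)
print(bidir.shape)  # (5, 8)
```

Note that each direction would normally have its own weights; the sketch reuses one set only for brevity. The doubled output width is why a bidirectional layer with n units produces 2n features per time step.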
Stateful RNNs
RNNs can be stateful, which means that they can maintain state across batches during training.
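What "maintaining state across batches" means can be sketched with a plain tanh recurrent cell (dimensions are illustrative): the final hidden state of one batch becomes the initial state of the next, so a long sequence can be processed in consecutive chunks without losing context. In Keras you would instead pass stateful=True to the recurrent layer and call reset_states() when a new sequence begins.

```python
import numpy as np

def rnn_chunk(x, h, Wx, Wh, b):
    """Run a tanh RNN cell over one chunk, starting from hidden state h;
    return the final hidden state."""
    for x_t in x:
        h = np.tanh(x_t @ Wx + h @ Wh + b)
    return h

rng = np.random.default_rng(1)
features, units = 3, 4
Wx = rng.normal(size=(features, units))
Wh = rng.normal(size=(units, units))
b = np.zeros(units)

long_seq = rng.normal(size=(20, features))

# Stateful processing: carry the hidden state across four "batches"
# of five time steps each, instead of resetting it to zeros every time.
h = np.zeros(units)
for chunk in np.split(long_seq, 4):
    h = rnn_chunk(chunk, h, Wx, Wh, b)

# Carrying state over chunks is equivalent to one pass over the whole sequence
h_full = rnn_chunk(long_seq, np.zeros(units), Wx, Wh, b)
assert np.allclose(h, h_full)
```

A stateless layer, by contrast, would reset h to zeros at each chunk boundary and so could never learn dependencies longer than one batch.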