I would like to explain the RNN-based language model. Language models have two uses. First, they score the probability of a sentence actually appearing; this score serves as a measure of whether the sentence is grammatically and semantically correct. Such a model is used, for example, in machine translation. Second, a language model can generate new text (which, incidentally, I personally think is the cooler use). Andrej Karpathy's blog also explains and develops character-level RNN language models, in English.
I assume the reader has the basics of neural networks covered. If not, try reading Let's build a neural network in Python without a library, a post that describes and builds a non-recurrent network model.
The strength of RNNs is their ability to use sequential information such as sentences. A traditional neural network does not do this: it assumes that all inputs (and outputs) are independent of each other. However, this assumption is not appropriate in many cases.
For example, if you want to predict the next word in a sentence, you need to know which words came before it, right? The R in RNN stands for Recurrent: the network performs the same computation for each successive element of a sequence, with the output depending on the previous computations. To put it another way, the RNN has a memory that retains information about what has been computed so far.
Theoretically, an RNN can use information from very long sequences. In practice, however, it can only look back a few steps (we'll dig into this further below). Now, let's look at a general RNN in the following diagram.
The diagram above shows the inner workings of the RNN, unrolled (unfolded) over time. By unrolling I simply mean writing out the network, in order, for the complete sequence. For example, if the sentence has 5 words, the unrolled network is a 5-layer neural network, one layer per word.
$x_t$ is the input at time step $t$. For example, $x_1$ is the vector for the second word of the sentence, since the steps are counted from 0.
$s_t$ is the hidden state at time step $t$. This is the memory of the network. $s_t$ is calculated from the previous hidden state and the input at the current step: $s_t = f(U x_t + W s_{t-1})$. The function $f$ is usually a nonlinearity such as tanh or ReLU. $s_{-1}$, which is required to calculate the first hidden state, is usually initialized to all zeros.
$o_t$ is the output at time step $t$. For example, if we want to predict the next word, $o_t$ is a vector of predicted probabilities over the vocabulary: $o_t = \mathrm{softmax}(V s_t)$.
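To make the three definitions above concrete, here is a minimal NumPy sketch of the forward pass over a 5-word sentence. It is only an illustration under some assumptions of mine: tanh is chosen for $f$, each $x_t$ is taken to be a one-hot vector (so multiplying by U just selects one column of U), and the function name rnn_forward, the sizes, the random weights, and the word indices are all made up for the example.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(word_ids, U, V, W):
    """Apply s_t = tanh(U x_t + W s_{t-1}) and o_t = softmax(V s_t) step by step."""
    hidden_dim = W.shape[0]
    s_prev = np.zeros(hidden_dim)  # s_{-1}: initial hidden state, all zeros
    states, outputs = [], []
    for word_id in word_ids:
        # Assumption: x_t is one-hot, so U @ x_t is column `word_id` of U.
        s_t = np.tanh(U[:, word_id] + W @ s_prev)
        o_t = softmax(V @ s_t)  # probabilities for the next word
        states.append(s_t)
        outputs.append(o_t)
        s_prev = s_t  # the memory carried to the next step
    return np.array(states), np.array(outputs)

# Hypothetical sizes and random weights, chosen only for illustration.
vocab_size, hidden_dim = 8, 4
rng = np.random.default_rng(0)
U = rng.uniform(-0.5, 0.5, (hidden_dim, vocab_size))
V = rng.uniform(-0.5, 0.5, (vocab_size, hidden_dim))
W = rng.uniform(-0.5, 0.5, (hidden_dim, hidden_dim))

sentence = [0, 3, 5, 1, 7]  # a 5-word sentence as word indices (made up)
states, outputs = rnn_forward(sentence, U, V, W)
print(states.shape)   # (5, 4): one hidden state per word
print(outputs.shape)  # (5, 8): one probability vector per word
```

Note that the same U, V and W are reused at every step of the loop; the 5 "layers" of the unrolled network differ only in their input word and in the hidden state handed over from the previous step.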