Recurrent Neural Networks
Bùi Quốc Khánh*
Hanoi University
Abstract: The central idea of a Recurrent Neural Network (RNN) is to make use of sequential information. In traditional neural networks, all inputs and outputs are independent of one another; that is, they are not linked together as a sequence. Such models are unsuitable for many problems. RNNs are called recurrent because they perform the same task for every element of a sequence, with each output depending on the previous computations as well. In other words, RNNs can remember previously computed information. Recently, LSTM networks have attracted attention and come into fairly widespread use. The basic LSTM model does not differ from the traditional RNN model, but it uses a different computation in the hidden states, which makes it very effective at capturing dependencies between words that are far apart. Applications of LSTM will be introduced in a subsequent article.
Abstract: One major assumption for Neural Networks (NNs), and in fact many other machine learning models, is the independence among data samples. However, this assumption does not hold for data which is sequential in nature. One mechanism to account for sequential dependency is to concatenate a fixed number of consecutive data samples and treat them as one data point, similar to moving a fixed-size sliding window over the data stream. Recurrent Neural Networks (RNNs) instead process the input sequence one element at a time and maintain a hidden state vector which acts as a memory for past information. They learn to selectively retain relevant information, allowing them to capture dependencies across several time steps and to utilize both current input and past information when making future predictions.
Keywords: Neural Networks, Recurrent Neural Networks, Sequential Data.
I. MOTIVATION FOR RECURRENT NEURAL NETWORKS
Before studying RNNs it would be worthwhile to understand why there is a need
for RNNs and the shortcomings of NNs in modeling sequential data. One major
assumption for NNs and in fact many other machine learning models is the
independence among data samples. However, this assumption does not hold for data
which is sequential in nature. Speech, language, time series, video, etc. all exhibit
dependence between individual elements across time. NNs treat each data sample
individually and thereby lose the benefit that can be derived by exploiting this
sequential information. One mechanism to account for sequential dependency is to
concatenate a fixed number of consecutive data samples together and treat them as one
data point, similar to moving a fixed-size sliding window over the data stream. This
approach was used in the work of [13] for time series prediction using NNs, and in that
of [14] for acoustic modeling. But as mentioned by [13], the success of this approach
depends on finding the optimal window size: a small window size does not capture the
longer dependencies, whereas a larger window size than needed would add unnecessary
noise. More importantly, if there are long-range dependencies in data ranging over
hundreds of time steps, a window-based method would not scale. Another disadvantage
of conventional NNs is that they cannot handle variable length sequences. In many
domains, such as speech modeling and language translation, the input sequences vary in length.
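The fixed-size sliding-window workaround described above can be made concrete with a short sketch; the function name and the window size are illustrative, not taken from the cited works:

```python
import numpy as np

def make_windows(series, window_size):
    """Slide a fixed-size window over a 1-D series, producing
    (input window, next value) pairs for supervised training."""
    X, y = [], []
    for i in range(len(series) - window_size):
        X.append(series[i:i + window_size])   # the window of past values
        y.append(series[i + window_size])     # the value to predict
    return np.array(X), np.array(y)

series = np.arange(10, dtype=float)           # toy time series 0..9
X, y = make_windows(series, window_size=3)
print(X.shape, y.shape)                       # (7, 3) (7,)
print(X[0], y[0])                             # [0. 1. 2.] 3.0
```

Note that any dependency longer than the chosen window is invisible to the model, which is exactly the scaling problem discussed above.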
A hidden Markov model (HMM) [15] can model sequential data without requiring
a fixed size window. HMMs map an observed sequence to a set of hidden states by
defining probability distributions for transition between hidden states, and relationships
between observed values and hidden states. HMMs are based on the Markov property
according to which each state depends only on the immediately preceding state. This
severely limits the ability of HMMs to capture long-range dependencies. Furthermore,
the space complexity of an HMM grows quadratically with the number of states, so
HMMs do not scale well.
RNNs process the input sequence one element at a time and maintain a hidden
state vector which acts as a memory for past information. They learn to selectively
retain relevant information allowing them to capture dependencies across several time
steps. This allows them to utilize both current input and past information while making
future predictions. All this is learned by the model automatically without much
knowledge of the cycles or time dependencies in data. RNNs obviate the need for a
fixed size time window and can also handle variable length sequences. Moreover, the
number of states that can be represented by an NN is exponential in the number of
nodes.
II. RECURRENT NEURAL NETWORKS
Figure 1. A standard RNN. The left-hand side of the figure is a standard RNN. The state vector in the
hidden units is denoted by s. On the right-hand side is the same network unfolded in time to depict how
the state is built over time. Image adapted from [2]
An RNN is a special type of NN suitable for processing sequential data. The main
feature of an RNN is a state vector (in the hidden units) which maintains a memory of
all the previous elements of the sequence. The simplest RNN is shown in Figure 1. As
can be seen, an RNN has a feedback connection which connects the hidden neurons
across time. At time t, the RNN receives as input the current sequence element x_t and
the hidden state from the previous time step, s_{t-1}. Next the hidden state is updated to
s_t and finally the output of the network, h_t, is calculated. In this way the current output
h_t depends on all the previous inputs x_{t'} (for t' < t). U is the weight matrix between the
input and hidden layers, as in a conventional NN. W is the weight matrix for the
recurrent transition from one hidden state to the next. V is the weight matrix for the
hidden-to-output transition.
s_t = σ(U x_t + W s_{t-1} + b_s)

h_t = softmax(V s_t + b_h)

These equations summarize all the computations carried out at each time step.
Here, softmax denotes the softmax function, which is often used as the activation
function for the output layer in a multiclass classification problem. The softmax
function ensures that all the outputs range from 0 to 1 and that their sum is 1.
y_k = e^{a_k} / Σ_{k'=1}^{K} e^{a_{k'}},   for k = 1, …, K

This equation specifies the softmax for a K-class problem.
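A minimal implementation of this softmax can be sketched as follows (the max-subtraction trick is a standard numerical-stability detail, not part of the equation above):

```python
import numpy as np

def softmax(a):
    """Softmax over a vector of K activations a_k.
    Subtracting max(a) avoids overflow without changing the result."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

y = softmax(np.array([2.0, 1.0, 0.1]))
print(y)          # each output lies in (0, 1)
print(y.sum())    # the outputs sum to 1
```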
A standard RNN as shown in Figure 1 is itself a deep NN if one considers how it
behaves during operation. As shown on the right side of the figure, once the network is
unfolded in time, it can be considered a deep network with the number of layers
equivalent to the number of time steps in the input sequence. Since the same weights are
used for each time step, an RNN can process variable length sequences. At each time
step new input is received and due to the way the hidden state s_t is updated, the
information can flow in the RNN for an arbitrary number of time steps, allowing the
RNN to maintain a memory of all the past information.
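The update equations above can be collected into a short forward pass; the dimensions and random initialization are illustrative, and σ is instantiated here as tanh:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 4, 8, 3                      # input, hidden, and output sizes

# Parameters shared across all time steps
U = rng.standard_normal((H, D)) * 0.1  # input -> hidden
W = rng.standard_normal((H, H)) * 0.1  # hidden -> hidden (recurrent)
V = rng.standard_normal((K, H)) * 0.1  # hidden -> output
b_s, b_h = np.zeros(H), np.zeros(K)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def rnn_forward(xs):
    """Process a sequence one element at a time, carrying the state s_t."""
    s = np.zeros(H)                            # s_0: empty memory
    outputs = []
    for x in xs:
        s = np.tanh(U @ x + W @ s + b_s)       # s_t = σ(U x_t + W s_{t-1} + b_s)
        outputs.append(softmax(V @ s + b_h))   # h_t = softmax(V s_t + b_h)
    return outputs

seq = [rng.standard_normal(D) for _ in range(5)]   # length-5 toy sequence
hs = rnn_forward(seq)
print(len(hs), hs[0].shape)            # 5 (3,)
```

Because U, W, and V are reused at every step, the same function handles sequences of any length, which is the weight-sharing property discussed above.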
III. TRAINING RNNS
RNN training is achieved by unfolding the RNN and creating a copy of the model
for each time step. The unfolded RNN, on the right side of Figure 1, can be treated as
a multilayer NN and can be trained in a way similar to back-propagation.
This approach to train RNNs is called back-propagation through time (BPTT)
[16]. Ideally, RNNs can be trained using BPTT to learn long-range dependencies over
arbitrarily long sequences. The training algorithm should be able to learn and tune
weights to put the right information in memory. In practice, training RNNs is difficult
because standard RNNs perform poorly even when the outputs and relevant inputs are
separated by as little as 10 time steps. It is now widely known that standard RNNs
cannot be trained to learn dependencies across long intervals [17] [18]. Training an
RNN with BPTT requires backpropagating the error gradients across several time steps.
If we consider the standard RNN (figure 1), the recurrent edge has the same weight for
each time step. Thus, backpropagating the error involves multiplying the error gradient
with the same value repeatedly. This causes the gradients to either become too large or
decay to zero. These problems are referred to as exploding gradients and vanishing
gradients respectively. In such situations, the model learning does not converge at all or
may take an inordinate amount of time. The exact problem depends on the magnitude of
the recurrent edge weight and the specific activation function used. If the magnitude of
weight is less than 1 and sigmoid activation is used, vanishing gradients is more likely,
whereas if the magnitude is greater than 1 and ReLU activation is used, exploding
gradients is more likely [19].
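The effect of this repeated multiplication can be demonstrated directly; the weight values below are illustrative scalars standing in for the recurrent-edge gradient factor:

```python
# Backpropagating through T time steps multiplies the error gradient
# by the recurrent weight factor T times in succession.
T = 50
for w in (0.9, 1.1):
    grad = 1.0
    for _ in range(T):
        grad *= w
    print(w, grad)   # 0.9 -> ~0.005 (vanishing), 1.1 -> ~117 (exploding)
```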
Several approaches have been proposed to deal with the problem of learning long-
term dependencies in training RNNs. These include modifications to the training
procedure as well as new RNN architectures. In the study of [19], it was proposed to
scale down the gradient if the norm of the gradient crosses a predefined threshold. This
strategy known as gradient clipping has proven to be effective in mitigating the
exploding gradients problem. The Long Short-Term Memory (LSTM) architecture was
introduced by [17] to counter the vanishing gradients problem. LSTM networks have
proven to be very useful in learning long-term dependencies as compared to standard
RNNs and have become the most popular variant of RNN.
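The gradient-clipping strategy described above can be sketched as follows; the threshold value of 5.0 is illustrative, not taken from [19]:

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Scale the gradient down if its norm exceeds the threshold,
    leaving its direction unchanged."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])                  # norm 50, above the threshold
print(clip_gradient(g))                     # [3. 4.]  (rescaled to norm 5)
print(clip_gradient(np.array([1.0, 2.0])))  # unchanged: already below threshold
```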
IV. LONG SHORT-TERM MEMORY ARCHITECTURE
LSTMs can learn dependencies ranging over arbitrarily long time intervals. They
overcome the vanishing gradients problem by replacing an ordinary neuron with a more
complex structure called the LSTM unit or block. An LSTM unit is made up of
simpler nodes connected in a specific way. The architecture of an LSTM unit with a
forget gate is described below [20]:
1) Input: The LSTM unit takes the current input vector, denoted x_t, and the
output from the previous time step (through the recurrent edges), denoted h_{t-1}. The
weighted inputs are summed and passed through a tanh activation, resulting in z_t.
2) Input gate: The input gate reads x_t and h_{t-1}, computes the weighted sum, and
applies a sigmoid activation. The result, i_t, is multiplied with z_t to provide the input
flowing into the memory cell.
3) Forget gate: The forget gate is the mechanism through which an LSTM learns
to reset the memory contents when they become old and are no longer relevant. This
may happen, for example, when the network starts processing a new sequence. The forget
gate reads x_t and h_{t-1} and applies a sigmoid activation to the weighted inputs. The result,
f_t, is multiplied by the cell state at the previous time step, s_{t-1}, which allows the
memory contents that are no longer needed to be forgotten.
4) Memory cell: This comprises the constant error carousel (CEC), a recurrent
edge with unit weight. The current cell state s_t is computed by forgetting irrelevant
information (if any) from the previous time step and accepting relevant information
(if any) from the current input.
5) Output gate: The output gate takes the weighted sum of x_t and h_{t-1} and applies a
sigmoid activation to control what information flows out of the LSTM unit.
6) Output: The output of the LSTM unit, h_t, is computed by passing the cell state
s_t through a tanh activation and multiplying the result by the output gate activation, o_t.
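Steps 1-6 can be collected into a single forward step; the dimensions, random initialization, and the choice of concatenating x_t with h_{t-1} into one input vector per gate are illustrative implementation details:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8                            # input and hidden sizes

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One weight matrix per gate, each acting on the concatenation [x_t, h_{t-1}];
# biases are zero-initialized and never mutated here.
Wz, Wi, Wf, Wo = (rng.standard_normal((H, D + H)) * 0.1 for _ in range(4))
bz = bi = bf = bo = np.zeros(H)

def lstm_step(x, h_prev, s_prev):
    v = np.concatenate([x, h_prev])    # shared input to all gates
    z = np.tanh(Wz @ v + bz)           # 1) block input
    i = sigmoid(Wi @ v + bi)           # 2) input gate
    f = sigmoid(Wf @ v + bf)           # 3) forget gate
    o = sigmoid(Wo @ v + bo)           # 5) output gate
    s = f * s_prev + i * z             # 4) memory cell (CEC update)
    h = o * np.tanh(s)                 # 6) unit output
    return h, s

h, s = np.zeros(H), np.zeros(H)
for x in (rng.standard_normal(D) for _ in range(5)):
    h, s = lstm_step(x, h, s)
print(h.shape, s.shape)                # (8,) (8,)
```

The additive cell update f * s_prev + i * z, rather than a repeated matrix multiplication, is what lets the gradient flow across many time steps without vanishing.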
V. CONCLUSION AND FUTURE WORK
This work has presented an effective approach to applying neural networks to
problems involving sequential data. The LSTM architecture has proven effective at
modeling sequential data in tasks such as handwriting recognition, handwriting
generation, music generation, and even language translation. The potential of LSTMs
lies in the fact that they achieve almost human-level quality in sequence generation.
This topic is of interest for further research and implementation.
REFERENCES
[1] R. J. Frank, N. Davey and S. P. Hunt, "Time Series Prediction and Neural
Networks," Journal of Intelligent & Robotic Systems, vol. 31, no. 1, pp. 99-103, 2001.
[2] A.-r. Mohamed, G. E. Dahl and G. Hinton, "Acoustic Modeling using Deep Belief
Networks," IEEE Transactions on Audio, Speech, and Language Processing, 2012.
[3] L. R. Rabiner and B. Juang, "An Introduction to Hidden Markov Models," IEEE
ASSP Magazine, vol. 3, no. 1, pp. 4-16, 1986.
[4] Y. LeCun, Y. Bengio and G. Hinton, "Deep Learning," Nature, vol. 521, no. 7553,
pp. 436-444, 2015.
[5] P. J. Werbos, "Backpropagation Through Time: What It Does and How to Do
It," in Proceedings of the IEEE, 1990.
[6] S. Hochreiter, Y. Bengio, P. Frasconi and J. Schmidhuber, "Gradient Flow in
Recurrent Nets: The Difficulty of Learning Long-Term Dependencies," 2001.
[7] Y. Bengio, P. Simard and P. Frasconi, "Learning Long-Term Dependencies with
Gradient Descent is Difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2,
pp. 157-166, 1994.
[8] R. Pascanu, T. Mikolov and Y. Bengio, "On the Difficulty of Training Recurrent
Neural Networks," ICML, vol. 28, no. 3, pp. 1310-1318, 2013.