
RECURRENT NEURAL NETWORKS
Bùi Quốc Khánh*
Hanoi University
Abstract: The main idea of a Recurrent Neural Network (RNN) is to make use of sequences of information. In traditional neural networks, all inputs and all outputs are independent of one another; that is, they are not linked together as a sequence. However, such models are unsuitable for many problems. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on all the previous computations. In other words, an RNN can remember previously computed information. Recently, the LSTM network has attracted attention and become quite widely used. The LSTM model is basically no different from the traditional RNN model, but it uses a different computation at the hidden states. As a result, dependencies between words that are far apart can be captured very effectively. Applications of LSTM will be presented in a subsequent paper.
Keywords: Neural Networks, Recurrent Neural Networks, Sequential Data.
Abstract: One major assumption for Neural Networks (NNs) and in fact many other  
machine learning models is the independence among data samples. However, this assumption  
does not hold for data which is sequential in nature. One mechanism to account for sequential  
dependency is to concatenate a fixed number of consecutive data samples together and treat  
them as one data point, similar to moving a fixed-size sliding window over the data stream. Recurrent Neural Networks (RNNs), by contrast, process the input sequence one element at a time and maintain a hidden state vector which acts as a memory for past information. They learn to selectively retain relevant information, allowing them to capture dependencies across several time steps and to utilize both the current input and past information when making predictions.
Keywords: Neural Networks, Recurrent Neural Networks, Sequential Data.  
RECURRENT NEURAL NETWORK  
I. MOTIVATION FOR RECURRENT NEURAL NETWORKS  
Before studying RNNs it would be worthwhile to understand why there is a need  
for RNNs and the shortcoming of NNs in modeling sequential data. One major  
assumption for NNs and in fact many other machine learning models is the  
independence among data samples. However, this assumption does not hold for data  
which is sequential in nature. Speech, language, time series, video, etc. all exhibit  
dependence between individual elements across time. NNs treat each data sample  
individually and thereby lose the benefit that can be derived by exploiting this  
sequential information. One mechanism to account for sequential dependency is to  
concatenate a fixed number of consecutive data samples together and treat them as one  
data point, similar to moving a fixed-size sliding window over the data stream. This
approach was used in the work of [13] for time series prediction using NNs, and in that  
of [14] for acoustic modeling. But as mentioned by [13], the success of this approach  
depends on finding the optimal window size: a small window size does not capture the  
longer dependencies, whereas a larger window size than needed would add unnecessary  
noise. More importantly, if there are long-range dependencies in data ranging over  
hundreds of time steps, a window-based method would not scale. Another disadvantage  
of conventional NNs is that they cannot handle variable-length sequences. For many domains, such as speech modeling and language translation, the input sequences vary in length.
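To make the windowing mechanism concrete, the following sketch (an illustrative assumption of this article, not taken from [13] or [14]) builds fixed-size windows from a univariate series so that a conventional NN could be trained on them:

```python
import numpy as np

def sliding_windows(series, window_size):
    """Turn a 1-D series into (window, next_value) training pairs
    for a conventional feed-forward NN."""
    X, y = [], []
    for t in range(len(series) - window_size):
        X.append(series[t:t + window_size])   # fixed-size history
        y.append(series[t + window_size])     # value to predict
    return np.array(X), np.array(y)

series = np.sin(np.linspace(0, 10, 200))      # toy time series
X, y = sliding_windows(series, window_size=5)
print(X.shape, y.shape)                       # (195, 5) (195,)
```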
A hidden Markov model (HMM) [15] can model sequential data without requiring  
a fixed size window. HMMs map an observed sequence to a set of hidden states by  
defining probability distributions for transition between hidden states, and relationships  
between observed values and hidden states. HMMs are based on the Markov property  
according to which each state depends only on the immediately preceding state. This  
severely limits the ability of HMMs to capture long-range dependencies. Furthermore,  
the space complexity of HMMs grows quadratically with the number of states and does  
not scale well.  
RNNs process the input sequence one element at a time and maintain a hidden  
state vector which acts as a memory for past information. They learn to selectively  
retain relevant information allowing them to capture dependencies across several time  
steps. This allows them to utilize both current input and past information while making  
future predictions. All this is learned by the model automatically without much  
knowledge of the cycles or time dependencies in data. RNNs obviate the need for a  
fixed size time window and can also handle variable length sequences. Moreover, the  
number of states that can be represented by an NN is exponential in the number of  
nodes.  
II. RECURRENT NEURAL NETWORKS  
Figure 1. A standard RNN. The left-hand side of the figure is a standard RNN. The state vector in the  
hidden units is denoted by s. On the right-hand side is the same network unfolded in time to depict how  
the state is built over time. Image adapted from [2]  
An RNN is a special type of NN suitable for processing sequential data. The main  
feature of an RNN is a state vector (in the hidden units) which maintains a memory of  
all the previous elements of the sequence. The simplest RNN is shown in Figure 1. As  
can be seen, an RNN has a feedback connection which connects the hidden neurons  
across time. At time $t$, the RNN receives as input the current sequence element $x_t$ and the hidden state from the previous time step $s_{t-1}$. Next the hidden state $s_t$ is updated, and finally the output of the network $y_t$ is calculated. In this way the current output $y_t$ depends on all the previous inputs $x_{t'}$ (for $t' \le t$). $U$ is the weight matrix between the input and hidden layers, as in a conventional NN. $W$ is the weight matrix for the recurrent transition from one hidden state to the next. $V$ is the weight matrix for the hidden-to-output transition.
$$s_t = \sigma(U x_t + W s_{t-1} + b_s)$$
$$y_t = \mathrm{softmax}(V s_t + b_y)$$
These equations summarize all the computations carried out at each time step.
Here, softmax denotes the softmax function, which is often used as the activation function for the output layer in a multiclass classification problem. The softmax function ensures that all the outputs lie between 0 and 1 and sum to 1.
$$y_k = \frac{e^{a_k}}{\sum_{k'=1}^{K} e^{a_{k'}}} \quad \text{for } k = 1, \dots, K$$
This equation specifies the softmax for a $K$-class problem.
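A direct NumPy transcription of this softmax follows; subtracting the maximum activation is a standard numerical-stability trick assumed here, not part of the equation itself:

```python
import numpy as np

def softmax(a):
    """Softmax over a vector of K activations a_k."""
    e = np.exp(a - np.max(a))   # subtract the max for numerical stability
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])
y = softmax(a)
print(y, y.sum())               # components in (0, 1), summing to 1
```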
A standard RNN as shown in Figure 1 is itself a deep NN if one considers how it  
behaves during operation. As shown on the right side of the figure, once the network is  
unfolded in time, it can be considered a deep network with the number of layers  
equivalent to the number of time steps in the input sequence. Since the same weights are  
used for each time step, an RNN can process variable length sequences. At each time  
step new input is received and due to the way the hidden state is updated, the  
information can flow in the RNN for an arbitrary number of time steps, allowing the  
RNN to maintain a memory of all the past information.  
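To make the recurrence concrete, the following NumPy sketch implements the forward pass defined by the two equations above; the toy dimensions, the random weights, and the choice of tanh for $\sigma$ are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def rnn_forward(xs, U, W, V, b_s, b_y):
    """Run a standard RNN over a sequence xs of input vectors,
    returning the outputs y_t and hidden states s_t."""
    s = np.zeros(W.shape[0])                 # s_0: initial hidden state
    states, outputs = [], []
    for x in xs:                             # one sequence element per time step
        s = np.tanh(U @ x + W @ s + b_s)     # s_t = sigma(U x_t + W s_{t-1} + b_s); sigma taken as tanh here
        y = softmax(V @ s + b_y)             # y_t = softmax(V s_t + b_y)
        states.append(s)
        outputs.append(y)
    return outputs, states

# Toy dimensions (illustrative only): 4-dim inputs, 8 hidden units, 3 classes
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(3, 8))
b_s, b_y = np.zeros(8), np.zeros(3)
xs = rng.normal(size=(6, 4))                 # a length-6 input sequence
ys, ss = rnn_forward(xs, U, W, V, b_s, b_y)
print(len(ys), ys[-1].sum())                 # 6 outputs, each summing to 1
```

Note that the same $U$, $W$ and $V$ are reused at every step of the loop, which is what allows the network to handle sequences of any length.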
III. TRAINING RNNS  
RNN training is achieved by unfolding the RNN and creating a copy of the model  
for each time step. The unfolded RNN, on the right side of Figure 1, can be treated as a multilayer NN and can be trained in a way similar to standard back-propagation.
This approach to train RNNs is called back-propagation through time (BPTT)  
[16]. Ideally, RNNs can be trained using BPTT to learn long-range dependencies over  
arbitrarily long sequences. The training algorithm should be able to learn and tune  
weights to put the right information in memory. In practice, training RNNs is difficult  
because standard RNNs perform poorly even when the outputs and the relevant inputs are separated by as few as 10 time steps. It is now widely known that standard RNNs
cannot be trained to learn dependencies across long intervals [17] [18]. Training an  
RNN with BPTT requires backpropagating the error gradients across several time steps.  
If we consider the standard RNN (figure 1), the recurrent edge has the same weight for  
each time step. Thus, backpropagating the error involves multiplying the error gradient  
with the same value repeatedly. This causes the gradients to either become too large or  
decay to zero. These problems are referred to as exploding gradients and vanishing  
gradients respectively. In such situations, the model learning does not converge at all or  
may take an inordinate amount of time. The exact problem depends on the magnitude of  
the recurrent edge weight and the specific activation function used. If the magnitude of the weight is less than 1 and a sigmoid activation is used, vanishing gradients are more likely, whereas if the magnitude is greater than 1 and a ReLU activation is used, exploding gradients are more likely [19].
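A toy calculation (purely illustrative, not from [19]) shows why: backpropagating through $T$ time steps multiplies the error gradient by the recurrent weight and the activation derivative $T$ times.

```python
def gradient_factor(w, T, activation="sigmoid"):
    """Approximate factor by which an error gradient is scaled after being
    backpropagated through T time steps of a single recurrent weight w
    (inputs and biases ignored)."""
    # The derivative of the logistic sigmoid is at most 0.25;
    # the derivative of ReLU is 1 for positive pre-activations.
    deriv = 0.25 if activation == "sigmoid" else 1.0
    return (abs(w) * deriv) ** T

print(gradient_factor(0.9, 50, "sigmoid"))  # ~4e-33: vanishing gradient
print(gradient_factor(1.5, 50, "relu"))     # ~6e+08: exploding gradient
```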
Several approaches have been proposed to deal with the problem of learning long-  
term dependencies in training RNNs. These include modifications to the training  
procedure as well as new RNN architectures. In the study of [19], it was proposed to  
scale down the gradient if the norm of the gradient crosses a predefined threshold. This  
strategy known as gradient clipping has proven to be effective in mitigating the  
exploding gradients problem. The Long Short-Term Memory (LSTM) architecture was  
introduced by [17] to counter the vanishing gradients problem. LSTM networks have  
proven to be very useful in learning long-term dependencies as compared to standard  
RNNs and have become the most popular variant of RNN.  
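Gradient clipping by norm, as described in the paragraph above, can be sketched as follows; the threshold value is an arbitrary illustrative choice.

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Scale the gradient down if its L2 norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)   # rescale onto the threshold sphere
    return grad

g = np.array([30.0, 40.0])                 # norm 50, well above the threshold
clipped = clip_gradient(g)
print(clipped, np.linalg.norm(clipped))    # [3. 4.], norm becomes 5.0
```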
IV. LONG SHORT-TERM MEMORY ARCHITECTURE  
LSTMs can learn dependencies ranging over arbitrarily long time intervals. They overcome the vanishing gradients problem by replacing an ordinary neuron with a more complex architecture called the LSTM unit or block. An LSTM unit is made up of simpler nodes connected in a specific way. An LSTM unit with a forget gate [20] comprises the following components (a minimal computational sketch is given after the list):
1) Input: The LSTM unit takes the current input vector, denoted by $x_t$, and the output from the previous time step (through the recurrent edges), denoted by $h_{t-1}$. The weighted inputs are summed and passed through a tanh activation, resulting in the block input $g_t$.
2) Input gate: The input gate reads $x_t$ and $h_{t-1}$, computes the weighted sum, and applies a sigmoid activation. The result, $i_t$, is multiplied with $g_t$ to provide the input flowing into the memory cell.
3) Forget gate: The forget gate is the mechanism through which an LSTM learns to reset the memory contents when they become old and are no longer relevant. This may happen, for example, when the network starts processing a new sequence. The forget gate reads $x_t$ and $h_{t-1}$ and applies a sigmoid activation to the weighted inputs. The result, $f_t$, is multiplied by the cell state at the previous time step, $c_{t-1}$, which allows forgetting the memory contents that are no longer needed.
4) Memory cell: This comprises the constant error carousel (CEC), a recurrent edge with unit weight. The current cell state $c_t$ is computed by forgetting irrelevant information (if any) from the previous time step and accepting relevant information (if any) from the current input.
5) Output gate: The output gate takes the weighted sum of $x_t$ and $h_{t-1}$ and applies a sigmoid activation to control what information flows out of the LSTM unit; the result is $o_t$.
6) Output: The output of the LSTM unit, $h_t$, is computed by passing the cell state $c_t$ through a tanh and multiplying it with the output gate activation $o_t$.
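The following NumPy sketch puts the six components together for a single time step; the weight shapes, the concatenation of $x_t$ with $h_{t-1}$, and the symbol names are illustrative assumptions following common LSTM notation, not the exact formulation of [20].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_g, W_i, W_f, W_o, b_g, b_i, b_f, b_o):
    """One step of an LSTM unit with a forget gate."""
    z = np.concatenate([x_t, h_prev])      # current input and previous output
    g = np.tanh(W_g @ z + b_g)             # 1) block input g_t
    i = sigmoid(W_i @ z + b_i)             # 2) input gate i_t
    f = sigmoid(W_f @ z + b_f)             # 3) forget gate f_t
    o = sigmoid(W_o @ z + b_o)             # 5) output gate o_t
    c_t = f * c_prev + i * g               # 4) memory cell: forget old, accept new
    h_t = o * np.tanh(c_t)                 # 6) output of the LSTM unit
    return h_t, c_t

# Toy dimensions (illustrative only): 4-dim input, 8 hidden units
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
Ws = [rng.normal(size=(n_hid, n_in + n_hid)) for _ in range(4)]
bs = [np.zeros(n_hid) for _ in range(4)]
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, *Ws, *bs)
print(h.shape, c.shape)                    # (8,) (8,)
```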
V. CONCLUSION AND FUTURE WORK  
This work has presented an effective approach to applying neural networks to problems with sequential data. The LSTM architecture has proven effective on sequential tasks such as handwriting recognition, handwriting generation, music generation and even language translation. The promise of LSTMs lies in the fact that they achieve nearly human-level quality in sequence generation. This topic is of interest for further research and implementation.
REFERENCES  
[1] R. J. Frank, N. Davey and S. P. Hunt, "Time Series Prediction and Neural Networks," Journal of Intelligent & Robotic Systems, vol. 31, no. 1, pp. 99-103, 2001.
[2] A.-r. Mohamed, G. E. Dahl and G. Hinton, "Acoustic Modeling using Deep Belief Networks," IEEE Transactions on Audio, Speech, and Language Processing, 2012.
[3] L. Rabiner and B. Juang, "An Introduction to Hidden Markov Models," IEEE ASSP Magazine, vol. 3, no. 1, pp. 4-16, 1986.
[4] Y. LeCun, Y. Bengio and G. Hinton, "Deep Learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[5] P. J. Werbos, "Backpropagation Through Time: What It Does and How to Do It," in Proceedings of the IEEE, 1990.
[6] S. Hochreiter, Y. Bengio, P. Frasconi and J. Schmidhuber, "Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies," 2001. [Online].
[7] Y. Bengio, P. Simard and P. Frasconi, "Learning Long-Term Dependencies with Gradient Descent is Difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157-166, 1994.
[8] R. Pascanu, T. Mikolov and Y. Bengio, "On the Difficulty of Training Recurrent Neural Networks," ICML, vol. 28, no. 3, pp. 1310-1318, 2013.