Recurrent Neural Networks
Bùi Quốc Khánh*
Hanoi University
Abstract: The central idea of a Recurrent Neural Network (RNN) is to make use of sequential information. In traditional neural networks, all inputs and outputs are independent of one another; that is, they are not linked together as a sequence. Such models are unsuitable for many problems. RNNs are called recurrent because they perform the same task for every element of a sequence, with each output depending on the previous computations as well. In other words, RNNs can remember previously computed information. Recently, LSTM networks have attracted attention and come into fairly widespread use. The basic LSTM model does not differ from the traditional RNN model, but it uses a different computation in the hidden states, which makes it very effective at capturing dependencies between words that are far apart. Applications of LSTM will be introduced in a subsequent article.
Abstract: One major assumption for Neural Networks (NNs), and in fact many other machine learning models, is the independence among data samples. However, this assumption does not hold for data which is sequential in nature. One mechanism to account for sequential dependency is to concatenate a fixed number of consecutive data samples and treat them as one data point, similar to moving a fixed-size sliding window over the data stream. Recurrent Neural Networks (RNNs) instead process the input sequence one element at a time and maintain a hidden state vector which acts as a memory for past information. They learn to selectively retain relevant information, allowing them to capture dependencies across several time steps and to utilize both current input and past information when making future predictions.
Keywords: Neural Networks, Recurrent Neural Networks, Sequential Data.
I. MOTIVATION FOR RECURRENT NEURAL NETWORKS
Before studying RNNs it would be worthwhile to understand why there is a need
for RNNs and the shortcomings of NNs in modeling sequential data. One major
assumption for NNs and in fact many other machine learning models is the
independence among data samples. However, this assumption does not hold for data
which is sequential in nature. Speech, language, time series, video, etc. all exhibit
dependence between individual elements across time. NNs treat each data sample
individually and thereby lose the benefit that can be derived by exploiting this
sequential information. One mechanism to account for sequential dependency is to
concatenate a fixed number of consecutive data samples together and treat them as one
data point, similar to moving a fixed-size sliding window over the data stream. This
approach was used in the work of [13] for time series prediction using NNs, and in that
of [14] for acoustic modeling. But as mentioned by [13], the success of this approach
depends on finding the optimal window size: a small window size does not capture the
longer dependencies, whereas a larger window size than needed would add unnecessary
noise. More importantly, if there are long-range dependencies in data ranging over
hundreds of time steps, a window-based method would not scale. Another disadvantage
of conventional NNs is that they cannot handle variable length sequences. In many
domains, such as speech modeling and language translation, the input sequences vary in length.
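The fixed-size sliding-window workaround described above can be made concrete with a short sketch; the function name and the window size are illustrative, not taken from the cited works:

```python
import numpy as np

def make_windows(series, window_size):
    """Slide a fixed-size window over a 1-D series, producing
    (input window, next value) pairs for supervised training."""
    X, y = [], []
    for i in range(len(series) - window_size):
        X.append(series[i:i + window_size])   # the window of past values
        y.append(series[i + window_size])     # the value to predict
    return np.array(X), np.array(y)

series = np.arange(10, dtype=float)           # toy time series 0..9
X, y = make_windows(series, window_size=3)
print(X.shape, y.shape)                       # (7, 3) (7,)
print(X[0], y[0])                             # [0. 1. 2.] 3.0
```

Note that any dependency longer than the chosen window is invisible to the model, which is exactly the scaling problem discussed above.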
A hidden Markov model (HMM) [15] can model sequential data without requiring
a fixed size window. HMMs map an observed sequence to a set of hidden states by
defining probability distributions for transition between hidden states, and relationships
between observed values and hidden states. HMMs are based on the Markov property
according to which each state depends only on the immediately preceding state. This
severely limits the ability of HMMs to capture long-range dependencies. Furthermore,
the space complexity of an HMM grows quadratically with the number of states, so
HMMs do not scale well.
RNNs process the input sequence one element at a time and maintain a hidden
state vector which acts as a memory for past information. They learn to selectively
retain relevant information allowing them to capture dependencies across several time
steps. This allows them to utilize both current input and past information while making
future predictions. All this is learned by the model automatically without much
knowledge of the cycles or time dependencies in data. RNNs obviate the need for a
fixed size time window and can also handle variable length sequences. Moreover, the
number of states that can be represented by an NN is exponential in the number of
nodes.
II. RECURRENT NEURAL NETWORKS
Figure 1. A standard RNN. The left-hand side of the figure is a standard RNN. The state vector in the
hidden units is denoted by s. On the right-hand side is the same network unfolded in time to depict how
the state is built over time. Image adapted from [2]
An RNN is a special type of NN suitable for processing sequential data. The main
feature of an RNN is a state vector (in the hidden units) which maintains a memory of
all the previous elements of the sequence. The simplest RNN is shown in Figure 1. As
can be seen, an RNN has a feedback connection which connects the hidden neurons
across time. At time t, the RNN receives as input the current sequence element x_t and
the hidden state from the previous time step, s_{t-1}. Next the hidden state is updated to
s_t and finally the output of the network, h_t, is calculated. In this way the current output
h_t depends on all the previous inputs x_{t'} (for t' < t). U is the weight matrix between the
input and hidden layers, as in a conventional NN. W is the weight matrix for the
recurrent transition from one hidden state to the next. V is the weight matrix for the
hidden-to-output transition.
s_t = σ(U x_t + W s_{t-1} + b_s)

h_t = softmax(V s_t + b_h)

These equations summarize all the computations carried out at each time step.
Here, softmax denotes the softmax function, which is often used as the activation
function for the output layer in a multiclass classification problem. The softmax
function ensures that all the outputs range from 0 to 1 and that their sum is 1.
y_k = e^{a_k} / Σ_{k'=1}^{K} e^{a_{k'}},   for k = 1, …, K

This equation specifies the softmax for a K-class problem.
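A minimal implementation of this softmax can be sketched as follows (the max-subtraction trick is a standard numerical-stability detail, not part of the equation above):

```python
import numpy as np

def softmax(a):
    """Softmax over a vector of K activations a_k.
    Subtracting max(a) avoids overflow without changing the result."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

y = softmax(np.array([2.0, 1.0, 0.1]))
print(y)          # each output lies in (0, 1)
print(y.sum())    # the outputs sum to 1
```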
A standard RNN as shown in Figure 1 is itself a deep NN if one considers how it
behaves during operation. As shown on the right side of the figure, once the network is
unfolded in time, it can be considered a deep network with the number of layers
equivalent to the number of time steps in the input sequence. Since the same weights are
used for each time step, an RNN can process variable length sequences. At each time
step new input is received and due to the way the hidden state s_t is updated, the
information can flow in the RNN for an arbitrary number of time steps, allowing the
RNN to maintain a memory of all the past information.
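The update equations above can be collected into a short forward pass; the dimensions and random initialization are illustrative, and σ is instantiated here as tanh:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 4, 8, 3                      # input, hidden, and output sizes

# Parameters shared across all time steps
U = rng.standard_normal((H, D)) * 0.1  # input -> hidden
W = rng.standard_normal((H, H)) * 0.1  # hidden -> hidden (recurrent)
V = rng.standard_normal((K, H)) * 0.1  # hidden -> output
b_s, b_h = np.zeros(H), np.zeros(K)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def rnn_forward(xs):
    """Process a sequence one element at a time, carrying the state s_t."""
    s = np.zeros(H)                            # s_0: empty memory
    outputs = []
    for x in xs:
        s = np.tanh(U @ x + W @ s + b_s)       # s_t = σ(U x_t + W s_{t-1} + b_s)
        outputs.append(softmax(V @ s + b_h))   # h_t = softmax(V s_t + b_h)
    return outputs

seq = [rng.standard_normal(D) for _ in range(5)]   # length-5 toy sequence
hs = rnn_forward(seq)
print(len(hs), hs[0].shape)            # 5 (3,)
```

Because U, W, and V are reused at every step, the same function handles sequences of any length, which is the weight-sharing property discussed above.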
III. TRAINING RNNS
RNN training is achieved by unfolding the RNN and creating a copy of the model
for each time step. The unfolded RNN, on the right side of Figure 1, can be treated as
a multilayer NN and can be trained in a way similar to back-propagation.
This approach to train RNNs is called back-propagation through time (BPTT)
[16]. Ideally, RNNs can be trained using BPTT to learn long-range dependencies over
arbitrarily long sequences. The training algorithm should be able to learn and tune
weights to put the right information in memory. In practice, training RNNs is difficult
because standard RNNs perform poorly even when the outputs and relevant inputs are
separated by as little as 10 time steps. It is now widely known that standard RNNs
cannot be trained to learn dependencies across long intervals [17] [18]. Training an
RNN with BPTT requires backpropagating the error gradients across several time steps.
If we consider the standard RNN (figure 1), the recurrent edge has the same weight for
each time step. Thus, backpropagating the error involves multiplying the error gradient
with the same value repeatedly. This causes the gradients to either become too large or
decay to zero. These problems are referred to as exploding gradients and vanishing
gradients respectively. In such situations, the model learning does not converge at all or
may take an inordinate amount of time. The exact problem depends on the magnitude of
the recurrent edge weight and the specific activation function used. If the magnitude of
weight is less than 1 and sigmoid activation is used, vanishing gradients is more likely,
whereas if the magnitude is greater than 1 and ReLU activation is used, exploding
gradients is more likely [19].
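The effect of this repeated multiplication can be demonstrated directly; the weight values below are illustrative scalars standing in for the recurrent-edge gradient factor:

```python
# Backpropagating through T time steps multiplies the error gradient
# by the recurrent weight factor T times in succession.
T = 50
for w in (0.9, 1.1):
    grad = 1.0
    for _ in range(T):
        grad *= w
    print(w, grad)   # 0.9 -> ~0.005 (vanishing), 1.1 -> ~117 (exploding)
```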
Several approaches have been proposed to deal with the problem of learning long-
term dependencies in training RNNs. These include modifications to the training
procedure as well as new RNN architectures. In the study of [19], it was proposed to
scale down the gradient if the norm of the gradient crosses a predefined threshold. This
strategy known as gradient clipping has proven to be effective in mitigating the
exploding gradients problem. The Long Short-Term Memory (LSTM) architecture was
introduced by [17] to counter the vanishing gradients problem. LSTM networks have
proven to be very useful in learning long-term dependencies as compared to standard
RNNs and have become the most popular variant of RNN.
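The gradient-clipping strategy described above can be sketched as follows; the threshold value of 5.0 is illustrative, not taken from [19]:

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Scale the gradient down if its norm exceeds the threshold,
    leaving its direction unchanged."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])                  # norm 50, above the threshold
print(clip_gradient(g))                     # [3. 4.]  (rescaled to norm 5)
print(clip_gradient(np.array([1.0, 2.0])))  # unchanged: already below threshold
```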
IV. LONG SHORT-TERM MEMORY ARCHITECTURE
LSTMs can learn dependencies ranging over arbitrarily long time intervals. They
overcome the vanishing gradients problem by replacing an ordinary neuron with a more
complex structure called the LSTM unit or block. An LSTM unit is made up of
simpler nodes connected in a specific way. The architecture of an LSTM unit with a
forget gate is described below [20]:
1) Input: The LSTM unit takes the current input vector, denoted x_t, and the
output from the previous time step (through the recurrent edges), denoted h_{t-1}. The
weighted inputs are summed and passed through a tanh activation, resulting in z_t.
2) Input gate: The input gate reads x_t and h_{t-1}, computes the weighted sum, and
applies a sigmoid activation. The result, i_t, is multiplied with z_t to provide the input
flowing into the memory cell.
3) Forget gate: The forget gate is the mechanism through which an LSTM learns
to reset the memory contents when they become old and are no longer relevant. This
may happen, for example, when the network starts processing a new sequence. The forget
gate reads x_t and h_{t-1} and applies a sigmoid activation to the weighted inputs. The result,
f_t, is multiplied by the cell state at the previous time step, s_{t-1}, which allows the
memory contents that are no longer needed to be forgotten.
4) Memory cell: This comprises the constant error carousel (CEC), a recurrent
edge with unit weight. The current cell state s_t is computed by forgetting irrelevant
information (if any) from the previous time step and accepting relevant information
(if any) from the current input.
5) Output gate: The output gate takes the weighted sum of x_t and h_{t-1} and applies a
sigmoid activation to control what information flows out of the LSTM unit.
6) Output: The output of the LSTM unit, h_t, is computed by passing the cell state
s_t through a tanh activation and multiplying the result by the output gate activation, o_t.
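Steps 1-6 can be collected into a single forward step; the dimensions, random initialization, and the choice of concatenating x_t with h_{t-1} into one input vector per gate are illustrative implementation details:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8                            # input and hidden sizes

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One weight matrix per gate, each acting on the concatenation [x_t, h_{t-1}];
# biases are zero-initialized and never mutated here.
Wz, Wi, Wf, Wo = (rng.standard_normal((H, D + H)) * 0.1 for _ in range(4))
bz = bi = bf = bo = np.zeros(H)

def lstm_step(x, h_prev, s_prev):
    v = np.concatenate([x, h_prev])    # shared input to all gates
    z = np.tanh(Wz @ v + bz)           # 1) block input
    i = sigmoid(Wi @ v + bi)           # 2) input gate
    f = sigmoid(Wf @ v + bf)           # 3) forget gate
    o = sigmoid(Wo @ v + bo)           # 5) output gate
    s = f * s_prev + i * z             # 4) memory cell (CEC update)
    h = o * np.tanh(s)                 # 6) unit output
    return h, s

h, s = np.zeros(H), np.zeros(H)
for x in (rng.standard_normal(D) for _ in range(5)):
    h, s = lstm_step(x, h, s)
print(h.shape, s.shape)                # (8,) (8,)
```

The additive cell update f * s_prev + i * z, rather than a repeated matrix multiplication, is what lets the gradient flow across many time steps without vanishing.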
V. CONCLUSION AND FUTURE WORK
This work has presented an effective approach to applying neural networks to
problems involving sequential data. The LSTM architecture has proven effective at
modeling sequential data in tasks such as handwriting recognition, handwriting
generation, music generation, and even language translation. The potential of LSTMs
lies in the fact that they achieve almost human-level quality in sequence generation.
This topic is of interest for further research and implementation.
REFERENCES
[1] R. J. Frank, N. Davey and S. P. Hunt, "Time Series Prediction and Neural
Networks," Journal of Intelligent & Robotic Systems, vol. 31, no. 1, pp. 99-103, 2001.
[2] A.-r. Mohamed, G. E. Dahl and G. Hinton, "Acoustic Modeling using Deep Belief
Networks," IEEE Transactions on Audio, Speech, and Language Processing, 2012.
[3] L. R. Rabiner and B. Juang, "An Introduction to Hidden Markov Models," IEEE
ASSP Magazine, vol. 3, no. 1, pp. 4-16, 1986.
[4] Y. LeCun, Y. Bengio and G. Hinton, "Deep Learning," Nature, vol. 521, no. 7553,
pp. 436-444, 2015.
[5] P. J. Werbos, "Backpropagation Through Time: What It Does and How to Do
It," in Proceedings of the IEEE, 1990.
[6] S. Hochreiter, Y. Bengio, P. Frasconi and J. Schmidhuber, "Gradient Flow in
Recurrent Nets: The Difficulty of Learning Long-Term Dependencies," 2001.
[7] Y. Bengio, P. Simard and P. Frasconi, "Learning Long-Term Dependencies with
Gradient Descent is Difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2,
pp. 157-166, 1994.
[8] R. Pascanu, T. Mikolov and Y. Bengio, "On the Difficulty of Training Recurrent
Neural Networks," ICML, vol. 28, no. 3, pp. 1310-1318, 2013.