Recurrent neural networks (RNNs) are an important class of models for learning sequential behavior. However, training RNNs to learn long-term dependencies is a tremendously difficult task, and this difficulty is widely attributed to the vanishing and exploding gradient (VEG) problem. Since the problem was first characterized 30 years ago, the belief that RNNs learn long-term dependencies poorly whenever VEG occurs during optimization has become a central tenet of the RNN literature and has been steadily cited as motivation for a wide variety of research advancements. In this work, we revisit and interrogate this belief using a large factorial experiment in which more than 40,000 RNNs were trained, and we provide evidence contradicting it. Motivated by these findings, we re-examine the original discussion that analyzed latching behavior in RNNs by way of hyperbolic attractors, and ultimately demonstrate that these dynamics do not fully capture the learned characteristics of RNNs. Our findings suggest that these models are fully capable of learning dynamics that do not correspond to hyperbolic attractors, and that the choice of hyperparameters, in particular the learning rate, has a substantial impact on the likelihood that an RNN learns long-term dependencies.
Keywords: Recurrent neural network; Vanishing and exploding gradient problem.
Copyright © 2024. Published by Elsevier Ltd.