Through the course of our experimentation, we sought to determine whether a reinforcement learning control approach could outperform a classical control approach when applied to fast-slow dynamic systems.  Specifically, emphasis was placed on which parameters resulted in high-quality performance from the reinforcement learning controller, including whether particular neural network architectures or degrees of fast-slow dynamic decoupling led to better performance, both in training time and in the final model evaluation.

Starting with the results from the cart-pole, it is evident from Figures 9-11 that the best performing controller is the breadth-emphasized DQN.  Following that, the square DQN occasionally learned a valid control law, but the depth-emphasized DQN completely failed to do so within a similar time frame.

Examining the error on the slow dynamics in the left graphs of Figures 9-11, it appears that the greater the breadth of the neural network, the more capable it was of learning a control law for the fast and slow dynamics.  As the breadth of the network diminished, so did its ability to learn various control laws.  When there were only 3 nodes per layer, as in the depth-emphasized DQN, the controller learned a very similarly performing control law consistently across neural network sizes and desired cart velocities.  The square DQN learned more varied control laws, but failed to consistently control the system in the desired fashion.  The breadth-emphasized DQN, however, frequently learned a well-performing control algorithm.  For neural network sizes above 90 neurons, it even outperformed the classical controller.
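To make the three architecture styles concrete, the sketch below shows one way such breadth-emphasized, square, and depth-emphasized Q-networks could be defined.  This is a minimal illustration assuming a PyTorch implementation and a cart-pole-sized input of 4 states with 2 discrete actions; the specific layer counts and widths are hypothetical and are not the exact configurations evaluated in Figures 9-11.

```python
import torch.nn as nn

def make_q_network(state_dim, n_actions, hidden_layers):
    """Build a fully connected Q-network with the given hidden-layer widths."""
    sizes = [state_dim] + list(hidden_layers) + [n_actions]
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:  # no activation on the output (Q-value) layer
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

# Hypothetical layouts with comparable total neuron counts:
breadth_dqn = make_q_network(4, 2, hidden_layers=[30, 30])                    # few wide layers
square_dqn  = make_q_network(4, 2, hidden_layers=[10, 10, 10, 10, 10, 10])    # balanced width and depth
depth_dqn   = make_q_network(4, 2, hidden_layers=[3] * 20)                    # many 3-neuron layers
```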

The middle graphs in Figures 9-11 show the rise time for each DQN size against the fast-slow dynamic decoupling.  The breadth-emphasized DQN consistently outperformed the classical controller in this regard while maintaining a reasonable error in the slow dynamic.  Further, as the number of neurons per layer increased, the rise time decreased.  Although the rise times for the square and depth-emphasized DQNs were extremely small, they were disregarded because the slow dynamic errors were extremely high, indicating that no substantial control law was learned.

The graphs comparing steady state error for the slow dynamic with DQN size did not appear to indicate any consistent relationship between DQN size and controller performance.  They did, however, support the notion that limiting the number of nodes per layer limited the complexity of the control algorithm regardless of the depth of the network.
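For reference, the rise time and slow-dynamic steady state error discussed above could be extracted from a recorded velocity trace roughly as sketched below.  The 90% rise-time threshold and the final-20%-of-trajectory averaging window are assumptions for illustration, not necessarily the definitions used to produce Figures 9-11.

```python
import numpy as np

def rise_time(t, y, y_target, fraction=0.9):
    """Time at which the response first reaches `fraction` of the target value."""
    reached = np.abs(np.asarray(y)) >= fraction * abs(y_target)
    return t[np.argmax(reached)] if reached.any() else np.inf

def steady_state_error(y, y_target, window=0.2):
    """Mean absolute tracking error over the final `window` fraction of the trajectory."""
    start = int(len(y) * (1.0 - window))
    return float(np.mean(np.abs(y_target - np.asarray(y)[start:])))

# Example on a hypothetical cart-velocity response approaching a desired 1.0 m/s
t = np.linspace(0.0, 10.0, 1001)
v = 1.0 - np.exp(-t)  # stand-in for a recorded slow-dynamic trace
print(rise_time(t, v, 1.0), steady_state_error(v, 1.0))
```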

In Figure 12, it appeared that as the architecture lost nodes in each layer, its ability to efficiently learn a well-performing control law significantly decreased.  Notice how the time to learn an acceptable control law increased from approximately 30 to 200 iterations when moving from the breadth-emphasized DQN to the square DQN.  The depth-emphasized DQN did not see any visible increase in rewards achieved, and this was reflected in the very high errors in its performance graphs in Figure 11.
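The number of iterations needed to reach an acceptable control law could be pulled from the training reward histories in several ways; the sketch below uses a moving-average threshold, where both the window length and the reward threshold are illustrative assumptions rather than the exact criterion behind Figure 12.

```python
import numpy as np

def iterations_to_acceptable(rewards, threshold, window=10):
    """First training iteration at which the moving-average reward reaches `threshold`,
    or None if it never does."""
    rewards = np.asarray(rewards, dtype=float)
    for i in range(window, len(rewards) + 1):
        if rewards[i - window:i].mean() >= threshold:
            return i
    return None

# Example with a hypothetical reward history and acceptance threshold
history = np.concatenate([np.linspace(-200, 150, 40), np.full(60, 180.0)])
print(iterations_to_acceptable(history, threshold=100.0))
```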

Next, evaluating the acrobot control performance in Figure 14 led to inconclusive results regarding connections between slow and fast dynamic decoupling and neural network size.  However, these graphs did show that breadth-emphasized DQNs were occasionally able to learn successful control algorithms while the square and depth-emphasized DQNs failed to do so.  Further, this performance analysis correlated with the results comparing training time and neural network architecture.  As seen in Figure 15, the breadth-emphasized DQN appeared to immediately pick up on key factors in the control law, allowing it to achieve high rewards very early on and to have occasional success in its evaluation in Figure 14.  The square and depth-emphasized networks, however, did not capitalize on this and did not learn any acceptable control law.

Perhaps the most definitive results come from comparing training times for the DQNs with different neural network architectures.  As seen in Figures 12 and 15, both the cart-pole and acrobot problems presented clear evidence favoring a neural network architecture that emphasizes the number of neurons per layer over the number of network layers, in both training time and performance.

As seen in both systems, for upwards of 30 nodes per layer, the DQN approach would effectively learn a control law within a reasonable number of iterations.  However, as soon as the nodes per layer dropped to 18, the DQN’s performance was significantly hindered.  Whereas the DQN with 30 nodes per layer could often learn a reasonably effective control law within 30 iterations, the DQN with 18 nodes per layer was unable to learn one within 400 iterations.  Further, we found in the cart-pole analysis that as the number of nodes per layer dropped, the possible complexity of the control law was significantly diminished.  When only 3 nodes per layer were used (as seen in the depth-emphasized DQNs in Figure 11), a similar control law was learned regardless of the slow and fast dynamics.

Somewhat more difficult to quantify, however, is how the DQN performance changes as the fast and slow dynamics are decoupled.  While results from the acrobot experimentation indicated no interesting relationship between these factors, the cart-pole results in Figure 9 indicated that as the slow and fast dynamics were decoupled, the general performance of every controller decreased, although increasing the number of nodes per layer in the DQN architecture did improve controller performance.  Further, the breadth-emphasized DQNs outperformed the classical controllers in total slow dynamic error when the number of nodes per layer was greater than 30, and in rise time for every network layout explored.

In future work, it would be helpful to develop a working dynamics-based approach to control the acrobot system, as it would provide a more substantial baseline against which to compare the DQN controllers.  Further, it would be interesting to explore which factors in the DQN architecture for the acrobot system would lead to more consistent performance.  In our testing, we explored various reward functions, including exponential and linear reward functions, and exploration/exploitation rates.  Other parameters to experiment with could include node activation functions, increased network breadth, and learning rates.
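To illustrate the kinds of reward shaping and exploration schedules referred to above, the snippet below sketches a linear reward, an exponential reward on the velocity-tracking error, and a simple epsilon-greedy decay schedule.  The scale constants and decay rate are hypothetical values chosen for illustration, not the settings used in our experiments.

```python
import numpy as np

def linear_reward(error, scale=1.0):
    """Reward that decreases linearly with the magnitude of the tracking error."""
    return -scale * abs(error)

def exponential_reward(error, scale=1.0):
    """Reward that decays exponentially with the magnitude of the tracking error."""
    return float(np.exp(-scale * abs(error)))

def epsilon(episode, eps_start=1.0, eps_end=0.05, decay=0.99):
    """Exponentially decaying exploration rate for an epsilon-greedy policy."""
    return eps_end + (eps_start - eps_end) * decay ** episode

# Example: rewards for a 0.5 m/s tracking error, and exploration early vs. late in training
print(linear_reward(0.5), exponential_reward(0.5))
print(epsilon(0), epsilon(200))
```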