As discussed previously, the linear control techniques used for the cart-pole system could not be applied here due to the acrobot's nonlinearity and the wide range of system states it traverses.  Further, as optimal control was recommended as an alternative approach by Minh Vu, we sought to use Python's GEKKO optimization suite.  With this tool, we generated a series of input commands crafted to swing up the acrobot.  Specifically, the optimization problem minimized an objective function subject to the dynamics of the system derived by Tedrake [11].  The solution determined by the optimizer is shown in Figure 1 below.

Figure 1. Optimal Control Output

According to the optimizer's output, this controller would swing the second link to nearly vertical.  However, differences between how torque is applied in GEKKO and in the OpenAI Gym environment resulted in poor simulation performance, and this control technique was unsuccessful in the OpenAI Gym environment.
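For reference, the sketch below shows how a swing-up problem of this kind can be posed in GEKKO.  It is a hypothetical, simplified setup: the dynamics are placeholder pendulum-like equations rather than the full manipulator dynamics from [11], the torque bounds are illustrative, and the objective simply penalizes distance from vertical at the end of the horizon.

    # Minimal GEKKO trajectory-optimization sketch (hypothetical setup; the
    # actual objective and the full dynamics from [11] are not reproduced here).
    from gekko import GEKKO
    import numpy as np

    m = GEKKO(remote=False)
    m.time = np.linspace(0, 5, 101)        # 5 s horizon, 101 collocation points

    # States: joint angles and velocities (theta1 measured from hanging down)
    theta1, theta2 = m.Var(value=0), m.Var(value=0)
    omega1, omega2 = m.Var(value=0), m.Var(value=0)

    # Control: torque at the elbow, the only actuated joint (bounds illustrative)
    tau = m.MV(value=0, lb=-1, ub=1)
    tau.STATUS = 1

    # Placeholder dynamics standing in for the full manipulator equations
    m.Equation(theta1.dt() == omega1)
    m.Equation(theta2.dt() == omega2)
    m.Equation(omega1.dt() == -9.8 * m.sin(theta1) - 0.1 * omega1)
    m.Equation(omega2.dt() == tau - 9.8 * m.sin(theta1 + theta2) - 0.1 * omega2)

    # Terminal cost: drive the links toward vertical at the end of the horizon
    final = np.zeros(101); final[-1] = 1
    f = m.Param(value=final)
    m.Minimize(f * ((theta1 - np.pi)**2 + theta2**2))

    m.options.IMODE = 6                    # dynamic optimization (control mode)
    m.solve(disp=False)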

The DQN controller was very similar to the DQN designed for the cart-pole problem.  It incorporated an exploration rate that decayed from 1.0 to 0.1, a learning rate of 0.001, the standard ReLU activation function, a mean-squared-error (MSE) loss, and a discount factor (gamma) of 0.95.  Further, we explored a similar arrangement of feedforward neural network architectures, seeking to determine whether a particular architecture could lead to better results for various decouplings of fast and slow dynamics.
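As a point of reference, a minimal Keras sketch of these settings is given below.  The layer widths are placeholders (see the architecture discussion that follows), the epsilon decay factor is a hypothetical value since the exact schedule is not stated here, and the 6-dimensional state and 3 actions follow Gym's Acrobot-v1.

    # Sketch of the DQN hyperparameters described above (assumed details noted).
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import Adam

    GAMMA = 0.95             # discount factor
    EPSILON = 1.0            # initial exploration rate
    EPSILON_MIN = 0.1        # exploration floor after decay
    EPSILON_DECAY = 0.995    # hypothetical per-episode decay factor
    LEARNING_RATE = 0.001

    def build_q_network(state_dim=6, n_actions=3, hidden=(30, 30, 30)):
        """Feedforward Q-network with ReLU activations and MSE loss."""
        model = Sequential()
        model.add(Dense(hidden[0], input_dim=state_dim, activation='relu'))
        for width in hidden[1:]:
            model.add(Dense(width, activation='relu'))
        model.add(Dense(n_actions, activation='linear'))  # one Q-value per action
        model.compile(loss='mse', optimizer=Adam(learning_rate=LEARNING_RATE))
        return model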

In our testing, we used 5 different neural network sizes, ranging from 90 to 330 total neurons.  These neurons were arranged into several architectural formats emphasizing the number of nodes per layer (network breadth), network symmetry (similar depth and breadth), or the number of layers (network depth).

In the breadth-emphasized architecture, a total of 3 hidden layers were used, with the number of neurons per layer ranging from 30 to 110.  For the square network shape, the number of layers and the number of nodes per layer both ranged from 9 to 18.  In the depth-emphasized architecture, each layer contained 3 neurons, with the number of layers ranging from 30 to 110.
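The three layouts can be summarized as functions of the total neuron budget.  The helper below is an illustrative reconstruction; the exact splits used in our experiments, particularly for the square shape, may differ slightly.

    import math

    def layout(total_neurons, shape):
        """Hidden-layer widths for a given neuron budget and layout style."""
        if shape == 'breadth':   # 3 hidden layers, as wide as the budget allows
            return (total_neurons // 3,) * 3
        if shape == 'square':    # depth ~= breadth ~= sqrt(budget)
            side = round(math.sqrt(total_neurons))
            return (side,) * side
        if shape == 'depth':     # 3 neurons per layer, as deep as the budget allows
            return (3,) * (total_neurons // 3)
        raise ValueError(f"unknown shape: {shape}")

    # For a 90-neuron budget:
    #   layout(90, 'breadth') -> (30, 30, 30)
    #   layout(90, 'square')  -> (9,) * 9      (81 neurons; nearest square)
    #   layout(90, 'depth')   -> (3,) * 30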

The separation of fast and slow dynamics in this system was achieved by varying the relative masses of the two links.  While the mass of the second link was held constant at 1 kg, the mass of the first link was varied from 1 to 3 kg.  We found experimentally that increasing the first link's mass beyond 3 kg made it physically very difficult for the system to reach its desired state.
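In Gym, this mass sweep can be performed by overriding the acrobot's link-mass attributes.  The sketch below uses the attribute names of gym.envs.classic_control.acrobot.AcrobotEnv; the specific sweep values are illustrative.

    import gym

    env = gym.make('Acrobot-v1')
    acrobot = env.unwrapped               # underlying AcrobotEnv
    acrobot.LINK_MASS_2 = 1.0             # second link held at 1 kg

    for m1 in (1.0, 1.5, 2.0, 2.5, 3.0):  # hypothetical sweep of first-link mass
        acrobot.LINK_MASS_1 = m1
        # ... train and evaluate a DQN at this mass ratio ...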

Figure 2 conveys the success of each controller architecture, where success is defined as the first link reaching 90% of its desired position; this equates to reaching within 36 degrees of directly upright.  The full suite of results can be viewed in the supplemental information section.

Figure 2. Probability of success for varying DQN structures, where success was defined as the first link reaching within 90% of its desired state.  Left: Probability of success for the breadth DQN.  Middle: Probability of success for the square DQN.  Right: Probability of success for the depth DQN.
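This success criterion can be evaluated directly from the environment's observation.  The check below assumes Gym's Acrobot-v1 observation layout, (cos theta1, sin theta1, cos theta2, sin theta2, omega1, omega2), where theta1 = 0 points straight down and upright corresponds to theta1 = pi.

    import numpy as np

    def first_link_success(obs, tol_deg=36.0):
        """True if the first link is within tol_deg of straight up (theta1 = pi)."""
        theta1 = np.arctan2(obs[1], obs[0])        # recover theta1 from (cos, sin)
        err = abs((theta1 % (2 * np.pi)) - np.pi)  # angular distance from upright
        return np.degrees(err) <= tol_deg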

During training, the acrobot exhibited trends in learning rate and training time similar to those observed for the cart-pole.  This is exemplified in Figure 3, where each graph plots run reward versus training iteration; the graphs represent the reward growth for the breadth, square, and depth DQNs, respectively.

Figure 3. Reward growth for a DQN of size 90 with the mass of the first link at 1 kg.  Left: Reward growth for the breadth DQN.  Middle: Reward growth for the square DQN.  Right: Reward growth for the depth DQN.