The standard layout of the problem involves applying a unit force to the cart in either the left (-1) or right (+1) direction in an attempt to keep the pole balanced vertically.  This simulation relies on dynamics that can be derived from Newtonian physics or Lagrangian methods. In either case, the dynamics can be fully described by the following two equations, where I is the moment of inertia of the pole, m is the mass of the pole, l is half the length of the pole, g is the gravitational acceleration, and M is the mass of the cart.  These were found in Pati’s article [3] on the cart-pole system and derived by our team for verification:
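For reference, one common frictionless form of these two equations is sketched below. It assumes that θ is the pole angle measured from the upright vertical, x is the cart position, and F is the horizontal force applied to the cart. These symbols and sign conventions are assumptions and may differ from those used in [3].

```latex
\begin{aligned}
(M + m)\,\ddot{x} + m l\,\ddot{\theta}\cos\theta - m l\,\dot{\theta}^{2}\sin\theta &= F \\
(I + m l^{2})\,\ddot{\theta} + m l\,\ddot{x}\cos\theta &= m g l \sin\theta
\end{aligned}
```

Both equations follow from the Lagrangian of a cart with a rigid pole whose center of mass sits at a distance l from the pivot.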

In this testbed, there was enough documentation on the physical parameters of the system to fill in the values in the above equations for the cart-pole system.  Implementing state feedback required linearizing the system via its Jacobian, which resulted in the following model:
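As a hedged illustration of what such a linearization typically looks like, the sketch below linearizes the frictionless cart-pole about the upright equilibrium. The state ordering (cart position x, cart velocity, pole angle θ from upright, pole angular velocity), the input F (force on the cart), and the shorthand p are assumptions made here, so the matrices may be arranged differently from the model used in the experiments.

```latex
\frac{d}{dt}
\begin{bmatrix} x \\ \dot{x} \\ \theta \\ \dot{\theta} \end{bmatrix}
=
\begin{bmatrix}
0 & 1 & 0 & 0 \\
0 & 0 & -\dfrac{m^{2} g l^{2}}{p} & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & \dfrac{m g l (M + m)}{p} & 0
\end{bmatrix}
\begin{bmatrix} x \\ \dot{x} \\ \theta \\ \dot{\theta} \end{bmatrix}
+
\begin{bmatrix} 0 \\ \dfrac{I + m l^{2}}{p} \\ 0 \\ -\dfrac{m l}{p} \end{bmatrix} F,
\qquad p = I(M + m) + M m l^{2}
```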


Using Ackermann’s method for pole placement at [-5, -5.1, -5.2, -5.3], the state-feedback gain matrix was set to K = [-0.8137, 0.620, -151.04, -4.868]. The gains for the PID controller were then tuned experimentally; kp = 2, ki = 1, kd = 0 led to the best results among the values tried.
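The pole-placement step can be sketched in a few lines of NumPy. The example below applies Ackermann's formula to a frictionless cart-pole linearized about the upright position, with state ordering [x, x_dot, theta, theta_dot]. The physical parameters (M, m, l, I) and the helper names are hypothetical placeholders rather than the testbed's values, so the resulting K is not expected to match the gains reported above.

```python
import numpy as np

def linearized_cart_pole(M, m, l, I, g=9.81):
    """Frictionless cart-pole linearized about the upright equilibrium.

    Assumed state: [x, x_dot, theta, theta_dot]; input: force on the cart.
    """
    p = I * (M + m) + M * m * l**2
    A = np.array([
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, -(m * l) ** 2 * g / p, 0.0],
        [0.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, m * g * l * (M + m) / p, 0.0],
    ])
    B = np.array([[0.0], [(I + m * l**2) / p], [0.0], [-m * l / p]])
    return A, B

def ackermann_gain(A, B, poles):
    """Single-input state-feedback gain K such that eig(A - B K) = poles."""
    n = A.shape[0]
    # Controllability matrix [B, AB, A^2 B, ..., A^(n-1) B].
    ctrb = np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(n)])
    # Desired characteristic polynomial evaluated at A.
    coeffs = np.poly(poles)  # [1, a_(n-1), ..., a_0]
    phi = sum(c * np.linalg.matrix_power(A, n - i) for i, c in enumerate(coeffs))
    e_last = np.zeros((1, n))
    e_last[0, -1] = 1.0
    return e_last @ np.linalg.solve(ctrb, phi)

# Hypothetical parameters: 1 kg cart, 0.1 kg uniform pole of half-length 0.5 m.
M, m, l = 1.0, 0.1, 0.5
I = m * l**2 / 3.0  # uniform rod of length 2l about its center
A, B = linearized_cart_pole(M, m, l, I)
K = ackermann_gain(A, B, [-5.0, -5.1, -5.2, -5.3])
print("K =", K)
```

The control law is then u = -K x. Ackermann's formula is convenient for a fourth-order single-input model like this one, although it becomes numerically fragile for larger systems.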

The DQN controller was designed using an exploration rate decaying from 1.0 to 0.1, a learning rate of 0.001, the standard ReLU activation function, mean squared error (MSE) as the error metric, and a discount factor (gamma) of 0.95.  Some of these parameters and code were taken from working code repositories for the cart-pole problem and used as a skeleton upon which we could build for our purposes [9].
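To make the role of these hyperparameters concrete, the sketch below shows a decaying exploration rate and the standard Q-learning target that a DQN regresses toward under an MSE error. The bounds 1.0 and 0.1 and the discount 0.95 come from the text; the multiplicative decay factor of 0.995 per step and the function names are assumptions for illustration only.

```python
import numpy as np

EPS_START, EPS_MIN = 1.0, 0.1  # exploration-rate range from the text
GAMMA = 0.95                   # discount factor from the text
EPS_DECAY = 0.995              # assumed per-step decay factor (not given in the text)

def decayed_epsilon(step):
    """Exploration rate after `step` decay steps, floored at EPS_MIN."""
    return max(EPS_MIN, EPS_START * EPS_DECAY**step)

def td_targets(rewards, next_q_values, dones):
    """Q-learning targets r + gamma * max_a Q(s', a), with no bootstrap at terminal states."""
    rewards = np.asarray(rewards, dtype=float)
    next_q_values = np.asarray(next_q_values, dtype=float)
    dones = np.asarray(dones, dtype=bool)
    return rewards + GAMMA * next_q_values.max(axis=1) * (~dones)

def mse(predicted, target):
    """Mean squared error between predicted Q-values and their targets."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((predicted - target) ** 2))

# Two transitions with two actions (left, right); the second one is terminal.
targets = td_targets([1.0, 1.0], [[0.5, 2.0], [3.0, 1.0]], [False, True])
print(targets, decayed_epsilon(100))
```

In a full agent, the network's prediction for the action actually taken would be trained toward these targets with the stated learning rate of 0.001.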

We explored a variety of feedforward neural network layouts to determine whether a particular architecture would lead to better results for various degrees of decoupling between the fast and slow dynamics.   Specifically, we used 5 different network sizes ranging from 90 to 330 neurons. For each size, three layouts were explored. The first emphasized the number of nodes per layer, which we called network breadth.  For this layout, a total of 3 hidden layers were used, with the number of neurons per layer ranging from 30 to 110. The second layout, which we called square, used as many neurons per layer as there were layers for each network size. As such, these networks ranged from 9 to 18 layers, with 9 to 18 nodes per layer.  The last emphasized the total number of layers, which we referred to as network depth. This was accomplished by fixing the number of nodes per layer at 3 and varying the number of hidden layers from 30 to 110.
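The three layouts can be summarized as a mapping from a total neuron budget to a list of hidden-layer widths. The sketch below is one way to generate them. The endpoints 90 and 330 and the size 150 appear in the text, but the evenly spaced list of five sizes, the rounding rule for the square layout, and the function name are assumptions.

```python
import math

# Total hidden-neuron budgets. The text gives 5 sizes from 90 to 330;
# the evenly spaced intermediate values are an assumption.
NETWORK_SIZES = [90, 150, 210, 270, 330]

def hidden_layer_sizes(total_neurons, layout):
    """Return the width of each hidden layer for the given layout."""
    if layout == "breadth":
        # 3 hidden layers sharing the budget equally (30 to 110 nodes each).
        return [total_neurons // 3] * 3
    if layout == "square":
        # As many layers as nodes per layer; the side is rounded (assumed rule).
        side = round(math.sqrt(total_neurons))
        return [side] * side
    if layout == "depth":
        # 3 nodes per layer, so 30 to 110 hidden layers.
        return [3] * (total_neurons // 3)
    raise ValueError(f"unknown layout: {layout!r}")

for size in NETWORK_SIZES:
    for layout in ("breadth", "square", "depth"):
        layers = hidden_layer_sizes(size, layout)
        print(size, layout, len(layers), "layers x", layers[0], "nodes")
```

Under this rounding rule the square layout only approximates the budget (for example, 9 x 9 = 81 and 18 x 18 = 324 neurons), which is consistent with the stated range of 9 to 18 layers.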

To explore how the controllers handled the coupling and decoupling of the fast and slow dynamics, we trained and tested each DQN and classical controller on systems modified to scale the coupling of the fast and slow dynamics.  For the cart-pole, this was done by seeking to bring the cart up to increasingly high velocities: the higher the target velocity, the greater the separation between the fast and slow dynamics.  In these experiments, we set the desired velocity to values ranging from 1 to 4 units per second. To produce the graphs in Figures 1-3, several controllers were trained for each setting of the dynamics, each using a different neural network architecture and size.

Figure 1. Breadth-emphasized DQN performance.  Left: Total error against increasing fast-slow dynamic decoupling.  Middle: Steady State Error Time.  Right: Steady State Error.

Figure 2. Square-emphasized DQN performance.  Left: Total error against increasing fast-slow dynamic decoupling.  Middle: Rise Time.  Right: Steady State Error.

Figure 3. Depth-emphasized DQN performance.  Left: Total error against increasing fast-slow dynamic decoupling.  Middle: Rise Time.  Right: Steady State Error.
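The quantities plotted in these figures can be computed from a recorded cart-velocity trace. The sketch below shows one plausible set of definitions: total error as the time integral of the absolute velocity error, rise time as the 10% to 90% rise toward the target, and steady-state error as the mean absolute error over the final 10% of the run. These definitions, the function name, and the synthetic first-order response used as input are all assumptions, since the exact metrics are not spelled out here.

```python
import numpy as np

def response_metrics(t, v, v_target, tail_fraction=0.1):
    """Return (total_error, rise_time, steady_state_error) for a velocity trace.

    Assumes a positive target velocity and a trace that starts near zero.
    """
    t = np.asarray(t, dtype=float)
    v = np.asarray(v, dtype=float)
    abs_err = np.abs(v_target - v)
    # Total error: trapezoidal integral of |error| over time.
    total_error = float(np.sum(0.5 * (abs_err[1:] + abs_err[:-1]) * np.diff(t)))
    # Rise time: first crossing of 90% minus first crossing of 10% of the target.
    idx10 = np.nonzero(v >= 0.1 * v_target)[0]
    idx90 = np.nonzero(v >= 0.9 * v_target)[0]
    if idx10.size and idx90.size:
        rise_time = float(t[idx90[0]] - t[idx10[0]])
    else:
        rise_time = float("nan")  # target band never reached
    # Steady-state error: mean |error| over the last part of the run.
    tail = max(1, int(len(v) * tail_fraction))
    steady_state_error = float(np.mean(abs_err[-tail:]))
    return total_error, rise_time, steady_state_error

# Synthetic example: first-order approach to a desired velocity of 2 units/s.
t = np.linspace(0.0, 10.0, 1001)
v = 2.0 * (1.0 - np.exp(-t))
print(response_metrics(t, v, v_target=2.0))
```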

During training, a substantial difference in training time was noticed between these neural network layouts.  This is illustrated in Figure 4, where each graph plots run reward versus training iteration. The graphs show the reward growth for the breadth DQN, square DQN, and depth DQN, from left to right.

Figure 4. Reward growth for DQNs of size 150 with the desired slow dynamic at 2.  Left: Reward growth for breadth DQN.  Middle: Reward growth for square DQN.  Right: Reward growth for depth DQN.

These simulations took anywhere from 45 minutes to several hours to run, depending on the layout of the neural network and how many iterations were needed.  The breadth-emphasized DQN trained the fastest, with each controller learning within an hour, while the depth-emphasized DQN trained the slowest, with each controller needing several hours to train.