This work focused on examining when an experience-based (reinforcement learning) controller would outperform a dynamics-based controller. This question was of particular interest to Dr. Silvia Zhang’s research group at Washington University, as the information gained in this investigation could help optimize energy consumption in power applications.

There are several distinctive differences between dynamics-based and reinforcement learning controllers. Classic dynamics-driven control demands highly accurate knowledge of the system parameters, which is often difficult and/or costly to attain when peak performance is required: either the manufacturer specifications must be highly accurate, or extensive system identification must take place before the controller can be designed. Classic control techniques also require a precise model of the system, which becomes more and more difficult to determine as systems grow in size and complexity. In addition, as the system changes over time, complex techniques such as adaptive control must be used to account for these changes. Experience-based learning controllers, on the other hand, do not require any a priori knowledge of the system model or parameters, and they can be designed to account for changes in the system with only minor adjustments. The most prominent costs for RL controllers are the time and computation required to train the controller and the memory needed to determine the input for a given system state.

Dynamics-driven control techniques for fast-slow systems can take a variety of forms, ranging from simple linear PID controllers to more complex nonlinear control designs, including sliding mode and differential mode controllers. In this work, we sought to implement dynamics-based controllers for each of the systems to provide a control performance baseline.

For the cartpole system, we used full state feedback with pole placement to keep the system’s pole vertically balanced, and we took advantage of the system’s dynamics to control the velocity of the cart using a PID controller. Figure 1 depicts the general schematic layout for full state feedback pole placement control design. In this diagram, K represents the control gain matrix that places the poles of the system at the desired locations, the A matrix indicates how the current state of the system affects the next state, and the B matrix represents how the input affects each state.

Figure 1. Full state feedback pole placement control design schematic. Individual gains are applied to the states of the system to shift the poles of the system to the desired locations.

Ackermann’s formula can be used to determine the gains in the K matrix that achieve the desired pole placement. This formula is

K = \begin{bmatrix} 0 & 0 & \cdots & 0 & 1 \end{bmatrix} \mathcal{C}^{-1} \phi_d(A),

where \mathcal{C} = \begin{bmatrix} B & AB & \cdots & A^{n-1}B \end{bmatrix} is the controllability matrix, \phi_d(A) = A^n + \alpha_{n-1}A^{n-1} + \cdots + \alpha_1 A + \alpha_0 I is the desired characteristic polynomial evaluated at A, and the \alpha_i represent the coefficients in the desired characteristic equation [10].
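As an illustration, the following sketch computes the pole placement gain directly from Ackermann’s formula using NumPy; the matrices A and B and the desired pole locations in the usage comment are placeholders rather than the values used for the cartpole.

```python
import numpy as np

def ackermann_gain(A, B, desired_poles):
    """Sketch of Ackermann's formula: K = [0 ... 0 1] C^{-1} phi_d(A)."""
    n = A.shape[0]
    # Controllability matrix C = [B, AB, ..., A^(n-1) B]
    C = np.hstack([np.linalg.matrix_power(A, i) @ B for i in range(n)])
    # Coefficients of the desired (monic) characteristic polynomial, highest order first.
    coeffs = np.poly(desired_poles)
    # Evaluate phi_d(A) with Horner's method.
    phi = np.zeros((n, n))
    for c in coeffs:
        phi = phi @ A + c * np.eye(n)
    # Last row of C^{-1}, then multiply by phi_d(A).
    e_last = np.zeros(n); e_last[-1] = 1.0
    return e_last @ np.linalg.inv(C) @ phi

# e.g. (hypothetical 2-state system):
# A = np.array([[0.0, 1.0], [2.0, -0.5]]); B = np.array([[0.0], [1.0]])
# K = ackermann_gain(A, B, desired_poles=[-2.0, -3.0])
```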

It is important to note that this kind of control design technique is intended for linear systems. Since the cartpole system is nonlinear, Jacobian linearization was used to approximate the system near its unstable equilibrium [9]. The general format for Jacobian linearization of nonlinear systems is presented below. For a nonlinear system \dot{x} = f(x, u) with equilibrium (x^*, u^*), the linearized dynamics are

\dot{\delta x} \approx A\,\delta x + B\,\delta u, \qquad A = \left.\frac{\partial f}{\partial x}\right|_{(x^*,\,u^*)}, \quad B = \left.\frac{\partial f}{\partial u}\right|_{(x^*,\,u^*)},

where \delta x = x - x^* and \delta u = u - u^*.
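When the analytic Jacobians are tedious to derive by hand, they can also be approximated numerically; the sketch below uses central finite differences around a user-supplied dynamics function f(x, u), which stands in for the cartpole equations of motion.

```python
import numpy as np

def linearize(f, x_eq, u_eq, eps=1e-6):
    """Finite-difference approximation of A = df/dx and B = df/du at (x_eq, u_eq)."""
    n, m = x_eq.size, u_eq.size
    A, B = np.zeros((n, n)), np.zeros((n, m))
    for i in range(n):
        dx = np.zeros(n); dx[i] = eps
        A[:, i] = (f(x_eq + dx, u_eq) - f(x_eq - dx, u_eq)) / (2 * eps)
    for j in range(m):
        du = np.zeros(m); du[j] = eps
        B[:, j] = (f(x_eq, u_eq + du) - f(x_eq, u_eq - du)) / (2 * eps)
    return A, B
```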

The dynamics required to stabilize the system’s pole can be decoupled from the position of the cart, so a command signal can be used to control the position of the cart. This was exploited to control the velocity of the cart by wrapping the feedback control system within a PID controller.

PID controllers have been one of the most prevalent control designs since their first development and analysis in the early 1900s [2]. The general form of a parallel PID controller is shown in Figure 2 below.

Figure 2. PID Controller Block Diagram [3]. Gains are applied to the error signal, the integral of the error signal, and the derivative of the error signal to create the input to the system.

 

As seen above, PID controllers operate by creating a control signal from the proportional, integral, and derivative terms of the error signal. The proportional component typically has the largest effect on the response of the system: when the error is high, the controller outputs a proportionally high control signal, and as the error decreases, so does the strength of the control signal. However, with only a proportional term, internal friction and resistances in the system make it impossible to ever reach zero error. This is where the integral block becomes crucial, as it allows the error to accumulate over time and thus provides a strong enough control signal to eliminate this steady-state error. Lastly, the derivative block damps the response of the controller by reacting to the rate of change of the error, which reduces overshoot.
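For reference, a minimal discrete-time sketch of a parallel PID controller is given below; the gains and sample time are placeholders that would have to be tuned for the cart velocity loop.

```python
class PID:
    """Minimal parallel PID controller: u = kp*e + ki*integral(e) + kd*de/dt."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt                  # accumulate error for the I term
        derivative = (error - self.prev_error) / self.dt  # backward-difference D term
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# e.g., velocity_pid = PID(kp=1.0, ki=0.1, kd=0.05, dt=0.02)  # hypothetical tuning
```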

Combining these two techniques (pole placement and PID control) resulted in a fully functioning dynamics-based control system that allowed us to control the desired velocity of the cart while keeping the pole vertically balanced. The schematic for this approach is shown in Figure 3 below.

Figure 3. Dynamics-based control approach utilizing feedback control to stabilize the fast dynamic and a PID controller to control the slow dynamic.
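To make the architecture in Figure 3 concrete, the sketch below shows one possible way of combining the two loops, with full state feedback balancing the pole and a PID controller on cart velocity adding a command signal on top; the state layout and gain values are illustrative assumptions, not the exact implementation used in this work.

```python
import numpy as np

# Hypothetical state layout: x = [cart position, cart velocity, pole angle, pole angular rate]
K = np.array([-1.0, -1.5, 18.0, 2.5])   # placeholder full state feedback gains

def control_step(x, desired_velocity, velocity_pid):
    """One step of a combined controller in the spirit of Figure 3 (illustrative only)."""
    # Inner (fast) loop: full state feedback keeps the pole balanced.
    u_balance = -K @ x
    # Outer (slow) loop: a PID controller (e.g., the sketch above) tracks cart velocity
    # and adds a command signal on top of the balancing input.
    u_command = velocity_pid.update(desired_velocity, x[1])
    return u_balance + u_command
```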

While these techniques worked for controlling the cartpole system in the desired fashion, they fell short in controlling the acrobot system. Since this system must operate over a large range of states, linearization techniques similar to those used for the cartpole fail to capture its dynamics. And since an adequate linearized model did not exist for this system, pole stabilization techniques and PID controllers were deemed ineffective; another approach had to be taken.

At the recommendation of Ph.D. candidate Minh Vu, an optimal control approach was taken to solve this problem in a dynamics-based fashion. For these problems, the objective function took into account the distance of the current system state from the desired state, while the constraints expressed the system dynamics.
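A generic finite-horizon formulation of this kind of problem, stated here only to illustrate the structure described above (the exact cost terms, horizon, and constraints used in this work may differ), is

\min_{u_0, \ldots, u_{N-1}} \; \sum_{k=0}^{N} \lVert x_k - x_d \rVert^2 \quad \text{subject to} \quad x_{k+1} = f(x_k, u_k), \quad u_k \in \mathcal{U},

where x_d is the desired state, f represents the acrobot dynamics, and \mathcal{U} is the set of admissible torques.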

With both of these dynamics-based controllers planned as a benchmark for control performance, we moved on to the construction of the RL controller. While there are a variety of reinforcement learning techniques that can be applied to control problems, two of the most popular fall under Q learning and policy gradient methods [4]. In either case, a control algorithm is trained by rewarding the system for achieving certain states. Q learning in particular focuses on optimizing the reward (or cost) of the current and expected future states for each state-action pair. The current and expected future rewards are bundled into a single value called the Q value, and the function used to provide this value is known as the Q function. Q learning thus seeks to fill in the table shown in Figure 4 below, using dynamic programming to drive the table entries toward convergence.

Figure 4. Q learning state-action pair table initialized to all zeros. During training, recursion is used to repeatedly update each value in the table with a value accounting for current and future rewards until convergence is achieved.

During learning, the table entries are recursively updated with new Q values for each state-action pair until the values converge, using the following formula [11]:

Q(s_t, a_t) \leftarrow (1 - \alpha)\,Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a} Q(s_{t+1}, a)\right]

The update takes a weighted average of the previous Q value and the current reward plus the discounted Q value of the best future action, using α to indicate the relative importance of each term. The discount factor, γ, indicates how much importance is applied to future Q values.
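As a minimal illustration of this tabular update, assuming the states and actions have been discretized into integer indices (the table sizes, learning rate, and discount factor below are placeholders):

```python
import numpy as np

# Hypothetical sizes for a discretized state and action space.
n_states, n_actions = 500, 3
Q = np.zeros((n_states, n_actions))   # the Q table of Figure 4, initialized to zero
alpha, gamma = 0.1, 0.99              # learning rate and discount factor

def q_update(s, a, reward, s_next):
    """One recursive Q-learning update for the state-action pair (s, a)."""
    target = reward + gamma * np.max(Q[s_next])       # current reward + discounted future value
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target  # weighted average with the old Q value
```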

As the state space and action space of a system grow, the Q table becomes exceedingly large and computationally difficult to handle and store in memory. In this case, a neural network can be used to estimate the Q function; this technique is called deep Q learning, or DQN. Figure 5 below depicts a neural network used to approximate the Q table.

Figure 5. DQN: a neural network approximation of the Q-function.

Two of the primary factors influencing a DQN’s performance are the reward function used to build the Q values and the neural network architecture. The latter includes the number of layers, the number of nodes per layer, the total number of neurons, and the node activation function.
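As an example of the architectural choices being varied, a small fully connected Q network might look like the following PyTorch sketch; the layer widths and ReLU activation are placeholder choices, not the exact architectures tested in this work.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Small fully connected network mapping a state vector to one Q value per action."""

    def __init__(self, state_dim, n_actions, hidden=(64, 64)):
        super().__init__()
        layers, in_dim = [], state_dim
        for width in hidden:                                 # layers and nodes per layer
            layers += [nn.Linear(in_dim, width), nn.ReLU()]  # activation function choice
            in_dim = width
        layers.append(nn.Linear(in_dim, n_actions))          # one output per discrete action
        self.net = nn.Sequential(*layers)

    def forward(self, state):
        return self.net(state)

# e.g., q_net = QNetwork(state_dim=4, n_actions=2) for a cartpole state
# [x, x_dot, theta, theta_dot] with two discrete force commands.
```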

In this work, deep Q learning was used to learn control algorithms for the cartpole system, using the state variables to reward desired system behavior. After some tests were run to explore various reward functions, it was determined that a triangular reward function was an optimal strategy for rewarding the slow dynamic of the system, while a rectangular function was optimal for the fast dynamic. In the figure below, the x-axis represents a particular state variable, xd and θd represent the desired values of the variables, and the y-axis represents the numerical reward given for the variable entering that state.

Figure 6. Reward functions for the cartpole system. (Left) Reward for the slow dynamic. (Right) Reward for the fast dynamic.
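A minimal sketch of these two reward shapes is given below; the widths and peak values are illustrative placeholders rather than the exact parameters used during training.

```python
def triangular_reward(x, x_desired, width=1.0, peak=1.0):
    """Reward decays linearly to zero as x moves away from x_desired (slow dynamic)."""
    return max(0.0, peak * (1.0 - abs(x - x_desired) / width))

def rectangular_reward(theta, theta_desired, half_width=0.1, value=1.0):
    """Constant reward inside a band around theta_desired, zero outside (fast dynamic)."""
    return value if abs(theta - theta_desired) <= half_width else 0.0
```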

Similarly, a reward function had to be developed for the acrobot system. Since the objective for this system was to invert the first link, a reward function could be generated that examined how close to vertical the first link was. Using cos(θ1), rewards could be provided based on how closely the value approached -1. The form of the reward function used is shown below in Figure 7.

Figure 7. Reward Function for Acrobot System. When the first link is vertical (inverted), cos(θ1) = -1. When the first link is hanging directly down, cos(θ1) = 1.
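A simple reward consistent with this description, assuming θ1 is measured from the downward hanging position, is sketched below; the exact scaling used in this work may differ.

```python
import math

def acrobot_reward(theta1):
    """Reward grows as cos(theta1) approaches -1, i.e., as the first link nears inverted."""
    return -math.cos(theta1)  # +1 when inverted (cos = -1), -1 when hanging down (cos = +1)
```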

It was hypothesized by Dr. Silvia that there may be some relationship between the layout of the neural network and the ability of the DQN to handle both the fast and the slow dynamics. Through a series of tests, we sought to determine whether there was a relationship between the decoupling of the fast and slow dynamics of the system, the size and shape of the neural network, and the network’s ability to successfully learn a control law that would outperform the dynamics-based approaches.