# Optimize Markov Policy

We have a learning algorithm that constructs a Markov model Q that describes how the environment reacts to the actions of the learning system. Now we need a complementary Markov process P that generates actions that, under the assumption that Q is correct, optimize long-term rewards for the combined process of Q and P. A more formal and more detailed description of the problem follows.

The learning system and its environment alternately produce symbols on a bi-directional communication channel. The learning system produces actions y_t ∈ Y. The environment produces responses x_t ∈ X. For our purposes it is sufficient to take Y = X = {0,1}. It would be quite straightforward to generalize to larger event sets. We define the set of events E = Y × X. The states of the model that our learning algorithm constructs are of the form (y_{t−k}, x_{t−k}, …, y_{t−1}, x_{t−1}) ∈ E^k = S.

Let N = |S| = |E|^k.

Enumerate S = { s_1, …, s_N }.

We define the set of distributions on S as D(S) = { x ∈ R^N | ∀i [x_i ≥ 0] ∧ ∑_i x_i = 1 }.

We also define, for all i with 1 ≤ i ≤ N, the unit distribution e_i ∈ D(S) such that (e_i)_i = 1.

A model Q of the environment is defined by two matrices A and B. Both matrices are square, have dimension N, have nonnegative elements, and have columns that sum to 1.

The i-th column of A or, equivalently, the product A e_i yields a distribution on S that describes the probabilities of the new state of the environment after taking action "0" in state s_i.

Likewise, the i-th column of B or, equivalently, the product B e_i yields a distribution on S that describes the probabilities of the new state of the environment after taking action "1" in state s_i.

A "policy" is a diagonal matrix P such that P_{i,i} ∈ [0,1] for all i. The value P_{i,i} is the probability that the learning system will choose action "0" in state s_i given policy P.

If x_t ∈ D(S) describes a probability distribution on the states at time t, then we can find the probability distribution on the states at time t+1 by

x_{t+1} = C x_t

where

C = A P + B (I − P) = (A − B) P + B
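The construction of C can be sketched in a few lines of numpy. This is a minimal illustration on a toy N; the helper name `column_stochastic` is just for this sketch, and in the real setting A and B would be sparse matrices.

```python
import numpy as np

# Toy dimensions; in the real setting N = |E|^k and A, B are sparse.
N = 4
rng = np.random.default_rng(0)

def column_stochastic(m):
    """Normalize a nonnegative matrix so each column sums to 1."""
    return m / m.sum(axis=0, keepdims=True)

A = column_stochastic(rng.random((N, N)))  # transitions after action "0"
B = column_stochastic(rng.random((N, N)))  # transitions after action "1"

p = rng.random(N)       # p_i = probability of action "0" in state s_i
P = np.diag(p)

# C = A P + B (I - P) = (A - B) P + B
C = (A - B) @ P + B

# C is again column stochastic, so a distribution stays a distribution:
x = np.full(N, 1.0 / N)   # uniform distribution on the states
x_next = C @ x
```

Each column of C is a convex combination of the corresponding columns of A and B, which is why C inherits the column-stochastic property.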

Now assume we have a vector r ∈ R^N that lists the gains resulting from arriving at each state. The expected gain at time t equals the inner product

x_t ⋅ r

As far as I know, x_t will converge to an eigenvector of C that corresponds to the eigenvalue 1. I assume that, for a sufficiently well-behaved C, there is exactly one such vector. As A and B are fixed and we want to vary P, we can treat this vector as an N-dimensional function of p = (P_{1,1}, …, P_{N,N}), say m(p).
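The convergence of x_t is exactly what power iteration exploits, so m(p) can be approximated without any eigendecomposition. A minimal sketch (the function name `stationary` is mine, and the uniqueness caveat above applies: this assumes eigenvalue 1 is simple and dominant, i.e. C is irreducible and aperiodic):

```python
import numpy as np

def stationary(C, iters=10_000, tol=1e-12):
    """Approximate m with C m = m (the eigenvalue-1 eigenvector of a
    column-stochastic C) by power iteration from the uniform distribution."""
    m = np.full(C.shape[0], 1.0 / C.shape[0])
    for _ in range(iters):
        m_next = C @ m
        if np.linalg.norm(m_next - m, 1) < tol:
            return m_next
        m = m_next
    return m

# Small column-stochastic example:
C = np.array([[0.9, 0.2],
              [0.1, 0.8]])
m = stationary(C)   # → [2/3, 1/3]
```

Because C only ever enters through matrix-vector products, this works directly on a sparse representation of A and B, which matters for the large N discussed below.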

For our purposes we need an algorithm that produces the p that maximizes m(p) ⋅ r.
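This is not the "algebraic" method asked for below, but as a baseline the maximization can be sketched as projected gradient ascent with finite-difference gradients; the names `gain` and `optimize_policy`, and all the step-size constants, are assumptions of this sketch:

```python
import numpy as np

def stationary(C, iters=500):
    """Power iteration toward the eigenvalue-1 eigenvector of C."""
    x = np.full(C.shape[0], 1.0 / C.shape[0])
    for _ in range(iters):
        x = C @ x
    return x

def gain(p, A, B, r):
    """The objective m(p) . r for the policy diagonal p."""
    C = (A - B) @ np.diag(p) + B
    return stationary(C) @ r

def optimize_policy(A, B, r, steps=100, lr=0.5, h=1e-6):
    """Finite-difference ascent on m(p) . r, clipping p back into [0,1]^N."""
    N = A.shape[0]
    p = np.full(N, 0.5)
    eye = np.eye(N)
    for _ in range(steps):
        g = np.array([(gain(p + h * eye[i], A, B, r)
                       - gain(p - h * eye[i], A, B, r)) / (2 * h)
                      for i in range(N)])
        p = np.clip(p + lr * g, 0.0, 1.0)
    return p

# Example: action "0" always leads to state s_1, action "1" to s_2,
# and only s_1 is rewarded, so the optimum picks action "0" in s_1.
A = np.array([[1.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0, 0.0], [1.0, 1.0]])
r = np.array([1.0, 0.0])
p_opt = optimize_policy(A, B, r)
```

Note that components of p belonging to states the optimal policy makes unreachable may never be driven to an extreme, since they stop affecting the objective. Each gradient step costs 2N stationary-distribution computations, which is exactly the expense the text complains about; it is a brute-force baseline, not a recommendation.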

Note that A, B, and P are pretty sparse, but N can be very large (if we consider a history of ten events, then N = 4^10 > 10^6). Maybe it is possible to reduce the cardinality of S, possibly at the expense of some of the simplicity of A and B.

The best solution I can think of, with the help of Numerical Recipes in C, is to bring the matrix C into Hessenberg form and determine its eigenvectors with the QR algorithm. This can then be used inside Powell's general multidimensional optimization algorithm. However, this is a very cumbersome and computationally intensive procedure. I would like a more "algebraic" and less "numerical" method.

## How to determine the result vector

Of course we do not want to hard-code the values of r. That would totally contradict the spirit of . I propose the following method.

We take for k the depth of the model of observed behavior. Let s_i ∈ S. We compute how often s_i would have to occur to change the model of expected behavior, say n_i times, as well as by how many bits it would change, say c_i bits. If the model of expected behavior would decrease in length, we take the absolute value of the change. We set r_i = c_i/n_i. The vector r is re-evaluated each time the model of expected behavior changes, or when the model has not changed for some sensible number of events.
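Given the per-state statistics, the computation of r itself is a one-liner. The concrete values of n_i and c_i below are purely hypothetical, since how they are extracted depends on the (unspecified) model of expected behavior:

```python
import numpy as np

# Hypothetical per-state statistics for four states s_1..s_4.
n = np.array([3.0, 7.0, 2.0, 5.0])    # occurrences of s_i needed to change the model
c = np.array([1.5, 0.8, 2.0, 0.4])    # |change in model length| in bits

# r_i = c_i / n_i: expected model change, in bits, per visit to s_i.
r = c / n
```

Under this reading, r rewards visiting states whose observations would teach the model the most per occurrence.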