# Optimize Markov Policy

We have a learning algorithm that constructs a Markov model Q that describes how the environment reacts to the actions of the learning system. Now we need a complementary Markov process P that generates actions that, under the assumption that Q is correct, optimize long-term rewards for the combined process of Q and P. A more formal and more detailed description of the problem follows.

The learning system and its environment alternately produce symbols on a bi-directional communication channel. The learning system produces actions y_t ∈ Y. The environment produces responses x_t ∈ X. For our purposes it is sufficient to take Y = X = {0,1}. It would be quite straightforward to generalize to larger event sets. We define the set of events E = Y × X. The states of the model that our learning algorithm constructs are of the form (y_{t−k}, x_{t−k}, …, y_{t−1}, x_{t−1}) ∈ E^k = S.

Let N = |S| = |E|^k.

Enumerate S = { s_1, …, s_N }.

We define the set of distributions on S as D(S) = { x ∈ R^N | ∀i [x_i ≥ 0] ∧ ∑_i x_i = 1 }.

We also define, for all i with 1 ≤ i ≤ N, the unit distribution e_i ∈ D(S) such that (e_i)_i = 1.

A model Q of the environment is defined by two matrices A and B. Both matrices are square, have dimension N, have nonnegative elements, and have columns that sum to 1.

The i-th column of A or, equivalently, the product A e_i yields a distribution on S that describes the probabilities of the new state of the environment after taking action "0" in state s_i.

Likewise, the i-th column of B or, equivalently, the product B e_i yields a distribution on S that describes the probabilities of the new state of the environment after taking action "1" in state s_i.

A "policy" is a diagonal matrix P such that P_{i,i} ∈ [0,1] for all i. The value P_{i,i} is the probability that the learning system will choose action "0" in state s_i given policy P.

If x_t ∈ D(S) describes a probability distribution on the states at time t, then we can find the probability distribution on the states at time t+1 by

x_{t+1} = C x_t

where

C = A P + B (I − P) = (A − B) P + B
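The construction of C can be sketched in a few lines of numpy. This is a minimal illustration on a toy N; the helper name `column_stochastic` is just for this sketch, and in the real setting A and B would be sparse matrices.

```python
import numpy as np

# Toy dimensions; in the real setting N = |E|^k and A, B are sparse.
N = 4
rng = np.random.default_rng(0)

def column_stochastic(m):
    """Normalize a nonnegative matrix so each column sums to 1."""
    return m / m.sum(axis=0, keepdims=True)

A = column_stochastic(rng.random((N, N)))  # transitions after action "0"
B = column_stochastic(rng.random((N, N)))  # transitions after action "1"

p = rng.random(N)       # p_i = probability of action "0" in state s_i
P = np.diag(p)

# C = A P + B (I - P) = (A - B) P + B
C = (A - B) @ P + B

# C is again column stochastic, so a distribution stays a distribution:
x = np.full(N, 1.0 / N)   # uniform distribution on the states
x_next = C @ x
```

Each column of C is a convex combination of the corresponding columns of A and B, which is why C inherits the column-stochastic property.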

Now assume we have a vector r ∈ R^N that lists the gains resulting from arriving at each state. The expected gain at time t equals the inner product

x_t ⋅ r

As far as I know, x_t will converge to an eigenvector of C that corresponds to the eigenvalue 1. I assume that, for a sufficiently well-behaved C, there is exactly one such vector. As A and B are fixed and we want to vary P, we can treat this vector as an N-dimensional function of p = (P_{1,1}, …, P_{N,N}), say m(p).
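The convergence of x_t is exactly what power iteration exploits, so m(p) can be approximated without any eigendecomposition. A minimal sketch (the function name `stationary` is mine, and the uniqueness caveat above applies: this assumes eigenvalue 1 is simple and dominant, i.e. C is irreducible and aperiodic):

```python
import numpy as np

def stationary(C, iters=10_000, tol=1e-12):
    """Approximate m with C m = m (the eigenvalue-1 eigenvector of a
    column-stochastic C) by power iteration from the uniform distribution."""
    m = np.full(C.shape[0], 1.0 / C.shape[0])
    for _ in range(iters):
        m_next = C @ m
        if np.linalg.norm(m_next - m, 1) < tol:
            return m_next
        m = m_next
    return m

# Small column-stochastic example:
C = np.array([[0.9, 0.2],
              [0.1, 0.8]])
m = stationary(C)   # → [2/3, 1/3]
```

Because C only ever enters through matrix-vector products, this works directly on a sparse representation of A and B, which matters for the large N discussed below.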

For our purposes we need an algorithm that produces the p that maximizes m(p) ⋅ r.
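This is not the "algebraic" method asked for below, but as a baseline the maximization can be sketched as projected gradient ascent with finite-difference gradients; the names `gain` and `optimize_policy`, and all the step-size constants, are assumptions of this sketch:

```python
import numpy as np

def stationary(C, iters=500):
    """Power iteration toward the eigenvalue-1 eigenvector of C."""
    x = np.full(C.shape[0], 1.0 / C.shape[0])
    for _ in range(iters):
        x = C @ x
    return x

def gain(p, A, B, r):
    """The objective m(p) . r for the policy diagonal p."""
    C = (A - B) @ np.diag(p) + B
    return stationary(C) @ r

def optimize_policy(A, B, r, steps=100, lr=0.5, h=1e-6):
    """Finite-difference ascent on m(p) . r, clipping p back into [0,1]^N."""
    N = A.shape[0]
    p = np.full(N, 0.5)
    eye = np.eye(N)
    for _ in range(steps):
        g = np.array([(gain(p + h * eye[i], A, B, r)
                       - gain(p - h * eye[i], A, B, r)) / (2 * h)
                      for i in range(N)])
        p = np.clip(p + lr * g, 0.0, 1.0)
    return p

# Example: action "0" always leads to state s_1, action "1" to s_2,
# and only s_1 is rewarded, so the optimum picks action "0" in s_1.
A = np.array([[1.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0, 0.0], [1.0, 1.0]])
r = np.array([1.0, 0.0])
p_opt = optimize_policy(A, B, r)
```

Note that components of p belonging to states the optimal policy makes unreachable may never be driven to an extreme, since they stop affecting the objective. Each gradient step costs 2N stationary-distribution computations, which is exactly the expense the text complains about; it is a brute-force baseline, not a recommendation.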

Note that A, B, and P are pretty sparse, but N can be very large (if we consider a history of ten events, then N = 4^10 > 10^6). Maybe it is possible to reduce the cardinality of S, possibly at the expense of some of the simplicity of A and B.

The best solution I can think of, with the help of Numerical Recipes in C, is to bring the matrix C into Hessenberg form and determine its eigenvectors with the QR algorithm. This can then be used inside Powell's general multidimensional optimization algorithm. However, this is a very cumbersome and computationally intensive procedure. I would like a more "algebraic" and less "numerical" method.

## How to determine the result vector

Of course we do not want to hard-code the values of r. That would totally contradict the spirit of . I propose the following method.

We take for k the depth of the model of observed behavior. Let s_i ∈ S. We compute how often s_i would have to occur to change the model of expected behavior, say n_i times, as well as by how many bits it would change, say c_i bits. If the model of expected behavior would decrease in length, we take the absolute value of the change. We set r_i = c_i/n_i. The vector r is re-evaluated each time the model of expected behavior changes, or when the model has not changed for some sensible number of events.
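Given the per-state statistics, the computation of r itself is a one-liner. The concrete values of n_i and c_i below are purely hypothetical, since how they are extracted depends on the (unspecified) model of expected behavior:

```python
import numpy as np

# Hypothetical per-state statistics for four states s_1..s_4.
n = np.array([3.0, 7.0, 2.0, 5.0])    # occurrences of s_i needed to change the model
c = np.array([1.5, 0.8, 2.0, 0.4])    # |change in model length| in bits

# r_i = c_i / n_i: expected model change, in bits, per visit to s_i.
r = c / n
```

Under this reading, r rewards visiting states whose observations would teach the model the most per occurrence.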