# Optimize Markov Policy

We have a learning algorithm that constructs a Markov model *Q* that describes how the environment reacts to
the actions of the learning system. Now we need a complementary Markov process *P* that generates actions
that, under the assumption that *Q* is correct, optimize long-term rewards for the combined process of
*Q* and *P*. A more formal and detailed description of the problem follows.

The learning system and its environment alternately produce symbols on a bi-directional communication channel.
The learning system produces actions *y*_{t} ∈ *Y*. The environment produces
responses *x*_{t} ∈ *X*. For our purposes it is sufficient to take
*Y* = *X* = {0,1}; it would be straightforward to generalize to larger event sets.
We define the set of events *E* = *Y* × *X*. The states of the model that our
learning algorithm constructs are of the form
*y*_{t−k} *x*_{t−k} …
*y*_{t−1} *x*_{t−1} ∈
*E*^{k} = *S*.

Let *N* = |*S*| = |*E*|^{k}.

Enumerate *S* = { *s*_{1} … *s*_{N} }.

We define the set of distributions on *S* as *D*(*S*) = { *x* ∈
**R**^{N} | ∀ *i*[*x*_{i} ≥ 0] ∧
∑ *x*_{i} = 1 }.

We also define, for all *i* with 1 ≤ *i* ≤ *N*, the unit distribution
*e*_{i} ∈ *D*(*S*)
such that (*e*_{i})_{i} = 1.

A model *Q* of the environment is defined by two matrices *A* and *B*. Both matrices
are square, have dimension *N*, have nonnegative elements, and have columns that sum to 1.

The *i*-th column of *A* or, equivalently, the product *A* *e*_{i}
yields a distribution on *S* that describes the probabilities of the new state of the environment after
taking action "0" in state *s*_{i}.

Likewise, the *i*-th column of *B* or, equivalently, the product
*B* *e*_{i} yields a distribution on *S* that describes the probabilities
of the new state of the environment after taking action "1" in state *s*_{i}.

A "policy" is a diagonal matrix *P* such that *P*_{i,i} ∈ [0,1]
for all *i*. The value *P*_{i,i} is the probability that
the learning system will choose action "0" in state *s*_{i} given policy *P*.

If *x*_{t} ∈ *D*(*S*) describes a probability distribution on the states
at time *t*, then we can find the probability distribution on the states at time *t*+1 by

*x*_{t+1} = *C* *x*_{t}

where

*C* = *A* *P* + *B* (*I* − *P*) = (*A* − *B*) *P* + *B*
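
To make the construction concrete, here is a minimal sketch in Python with NumPy (my choice of tooling here, purely for illustration) that builds *C* from *A*, *B*, and the policy vector *p*, and checks that *C* is again column-stochastic:

```python
import numpy as np

def combined_transition(A, B, p):
    """Return C = (A - B) @ diag(p) + B for column-stochastic A and B.

    p[i] is the probability of choosing action "0" in state s_i.
    """
    return (A - B) @ np.diag(p) + B

# Tiny N = 2 example; columns of A and B each sum to 1.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
B = np.array([[0.5, 0.7],
              [0.5, 0.3]])
p = np.array([0.25, 0.75])
C = combined_transition(A, B, p)
assert np.allclose(C.sum(axis=0), 1.0)  # column sums stay 1
```

Each column of *C* sums to 1 because column *i* mixes column *i* of *A* and column *i* of *B* with weights *P*_{i,i} and 1 − *P*_{i,i}.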

Now assume we have a vector *r* ∈ **R**^{N} that lists the gains resulting
from arriving at each state. The expected gain at time *t* equals the inner product

*x*_{t} ⋅ *r*

As far as I know, *x*_{t} will converge to an eigenvector of *C* that corresponds
to the eigenvalue 1. I assume that, for a sufficiently well-behaved *C*, there is exactly one such
vector. As *A* and *B* are fixed and we want to vary *P*, we can treat this vector as an
*N*-dimensional function of *p*=(*P*_{1,1}…*P*_{N,N}),
say *m*(*p*).

For our purposes we need an algorithm that produces the *p* that maximizes
*m*(*p*) ⋅ *r*.
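
A sketch of how the objective could be evaluated, assuming *C* is well-behaved in the sense above (again Python/NumPy, for illustration only):

```python
import numpy as np

def stationary(C):
    """Stationary distribution of a column-stochastic C: the eigenvector
    for the eigenvalue 1, rescaled so its entries sum to 1."""
    w, V = np.linalg.eig(C)
    i = np.argmin(np.abs(w - 1.0))   # eigenvalue closest to 1
    m = np.real(V[:, i])
    return m / m.sum()

def expected_gain(A, B, p, r):
    """The objective m(p) . r for the policy vector p."""
    C = (A - B) @ np.diag(p) + B
    return float(stationary(C) @ r)
```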

Note that *A*, *B*, and *P* are pretty sparse, but *N* can be very large (if we consider
a history of ten events, then *N* = 4^{10} > 10^{6}). Maybe it is possible to reduce the
cardinality of *S*, possibly at the expense of some of the simplicity of *A* and *B*.
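
If *A* and *B* are stored as sparse matrices, the stationary vector can be found without ever forming a dense *N* × *N* matrix. A sketch using SciPy's sparse Arnoldi eigensolver (an assumption about tooling, not part of the design above):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def stationary_sparse(A, B, p):
    """Stationary distribution for sparse, column-stochastic A and B."""
    C = ((A - B) @ sp.diags(p) + B).tocsc()
    # Arnoldi iteration for the eigenvalue of largest magnitude,
    # which is 1 for a stochastic matrix.
    w, V = spla.eigs(C, k=1, which="LM")
    m = np.real(V[:, 0])
    return m / m.sum()
```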

The best solution I can think of with the help of *Numerical Recipes in C* is to bring the matrix *C*
into Hessenberg form and determine its eigenvectors with the *QR* algorithm. This computation can then be
embedded in Powell's general multidimensional optimization algorithm. However, this is a cumbersome and
computationally intensive procedure. I would like a more "algebraic" and less "numerical" method.
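
For completeness, a sketch of that cumbersome numerical route, with SciPy's Powell implementation standing in for the *Numerical Recipes* version; the clipping of *p* to [0,1] and the `expected_gain` helper from the sketch above are my additions:

```python
import numpy as np
from scipy.optimize import minimize

def optimize_policy(A, B, r, p0=None):
    """Maximize m(p) . r over p in [0, 1]^N with Powell's method."""
    N = A.shape[0]
    if p0 is None:
        p0 = np.full(N, 0.5)

    def neg_gain(p):
        p = np.clip(p, 0.0, 1.0)            # keep probabilities valid
        return -expected_gain(A, B, p, r)   # helper from the sketch above

    res = minimize(neg_gain, p0, method="Powell",
                   bounds=[(0.0, 1.0)] * N)
    return np.clip(res.x, 0.0, 1.0)
```

Every evaluation of `neg_gain` solves a full eigenproblem, which is exactly why this approach scales poorly in *N*.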

## How to determine the result vector

Of course we do not want to hard-code the values of *r*. That would totally contradict the spirit of
LEIA. I propose the following method.

For *k* we take the depth of the model of *observed* behavior.
Let *s*_{i} ∈ *S*. We compute how often *s*_{i} would have
to occur to change the model of *expected* behavior, say *n*_{i} times, as well as
by how many bits it would change, say *c*_{i} bits. If the model of expected behavior
would decrease in length, we take the absolute value of the change.
We set *r*_{i} = *c*_{i}/*n*_{i}.
The vector *r* is re-evaluated each time the model of expected behavior changes, or when the model has
not changed for some sensible number of events.
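
A sketch of this rule; the callbacks `occurrences_needed` and `bits_changed` are hypothetical stand-ins for the LEIA model update, and only the ratio *r*_{i} = *c*_{i}/*n*_{i} comes from the rule above:

```python
def result_vector(states, occurrences_needed, bits_changed):
    """r[i] = c_i / n_i per the rule above.

    occurrences_needed(s) and bits_changed(s) are hypothetical callbacks:
    how often s must occur to change the model of expected behavior, and
    by how many bits that model would change (negative if it shrinks,
    hence the absolute value).
    """
    return [abs(bits_changed(s)) / occurrences_needed(s) for s in states]
```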

Visit the LEIA project page on GitHub for project details.