Project definition

The project

The goal of the project is to design and implement a computer program that autonomously learns expectations in the form of a probabilistic model of its environment. The program has to learn autonomously in the sense that it does not depend on task-related evaluations of its performance. The program has to show 'curiosity' rather than 'obedience'. Learning has to be reasonably efficient in the sense that the program does not get too much slower over time. An early prototype shows that such a program is possible. The project is focused on learning itself rather than on the goal of the learning system. The danger of investigating, for example, the effect of medication on the health of patients is that we ourselves learn about the domain. If we preprogram that knowledge into a system, that system doesn't learn at all. Such a system makes more or less accurate predictions based on a priori knowledge. The real challenge of the project is to understand fundamental aspects of learning, like uncertainty, freedom of action, co-operation, etc.

Autonomous learning

The central hypothesis of the project is that it is possible for a computer program to learn from its environment without the requirement that the environment guides the learning process. From now on we will use the term 'learning system' instead of 'learning computer program' when other (natural) learning systems are implied as well. We want to study strategies that enable a learning system to learn in an arbitrary environment, in the sense that it uses actions to find and verify regularities in that environment. These regularities can be used to update a probabilistic model of the environment that improves over time in a statistically quantifiable way. In short: we want a learning system that shows exploratory behavior rather than a learning system that relies on teaching behavior from its environment and preprogrammed a priori knowledge. We aim for a system that learns to form reasonable expectations of an arbitrary environment, rather than a system that approximates the true process behind a severely limited environment. It is widely believed that "a computer produces nothing more than what you put into it." A program that interacts with its environment guided by nothing more than curiosity will prove that statement wrong.

Interactive learning

The first basic assumption for the project is that autonomous learning systems must be interactive. Theories like statistical inference and off-line learning are restricted to learning systems that can only predict. Those learning systems cannot influence their environment in any way. Of course there are situations where predictions are sufficient, but there are indications that interaction can improve the learning speed of a learning system dramatically. The project intends to find out whether, and how, this could work.

Model growth

The hypothesis that autonomous learning is possible grew from the combination of two existing ideas: the minimum description length principle (MDL) and interactive learning.

The minimum description length principle.

A second basic assumption for the project is that knowledge about the environment can be measured independently of that environment. The rationale for that assumption is hidden in the theory behind the minimum description length principle (MDL).

MDL is a statistical inference technique that selects a probabilistic model for a data set. The data set might have been generated by a process that is more complex than can be described with the set of models that we use. Suppose we want to make a probabilistic model of natural language. We could use probabilistic models that predict the next word in a text based only on the previous word. We know that we use more context than that when producing prose. Nevertheless, MDL can guide a learning algorithm to choose an optimal model (in a certain sense) from the available models.

The MDL principle is based on the fact that a statistical model can be used as a data compressor. The better a model fits the data, the more efficiently that model can be used to compress that data. MDL works by comparing two quantities: the complexity of the candidate model and the efficiency of that model for compressing the data set. In order to express these two quantities in such a way that they can be compared, the data set is re-encoded as a two-part code. The first part of the code encodes the candidate model. The second part of the code encodes the data, using the candidate model as a compression scheme. An MDL model for the data is defined as a model that achieves the smallest total code length possible for any model in our class of probabilistic models.
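To make the two-part code concrete, here is a minimal sketch (in Python, and not the project's implementation) that applies the idea to the word-prediction example above: it fits Markov models of increasing order to a toy sentence, prices each model with a rough parameter cost, adds the cost of encoding the data with that model, and selects the order with the smallest total. The (k/2) * log2(n) price for the model part and the add-one smoothing are simplifying assumptions of this sketch.

    import math
    from collections import Counter, defaultdict

    def two_part_code_length(tokens, order):
        # Fit an order-`order` Markov model: context -> counts of the next word.
        context_counts = defaultdict(Counter)
        for i in range(order, len(tokens)):
            context_counts[tuple(tokens[i - order:i])][tokens[i]] += 1

        vocab = set(tokens)
        n = max(len(tokens) - order, 2)

        # Second part of the code: the data, compressed with the fitted model
        # (add-one smoothing keeps every probability non-zero).
        data_bits = 0.0
        for i in range(order, len(tokens)):
            counts = context_counts[tuple(tokens[i - order:i])]
            p = (counts[tokens[i]] + 1) / (sum(counts.values()) + len(vocab))
            data_bits += -math.log2(p)

        # First part of the code: the model itself, priced here at roughly
        # (k/2) * log2(n) bits for k free parameters -- an assumption of this sketch.
        k = len(context_counts) * (len(vocab) - 1)
        model_bits = 0.5 * k * math.log2(n)
        return model_bits + data_bits

    def select_mdl_order(tokens, max_order=2):
        # The MDL choice: the order whose total two-part code length is smallest.
        return min(range(max_order + 1),
                   key=lambda order: two_part_code_length(tokens, order))

    text = "the cat sat on the mat and the cat sat on the hat".split()
    for order in range(3):
        print("order", order, "->", round(two_part_code_length(text, order), 1), "bits")
    print("MDL-selected order:", select_mdl_order(text))

On a toy text this small the parameter cost of the richer models tends to outweigh the compression they buy; with more data the balance shifts toward higher orders. That trade-off is exactly what MDL adjudicates.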

Using the minimum description length principle to measure knowledge.

The code length of the MDL model of a data set is a good measure for the amount of knowledge that the learning system has obtained from the data set. Particularly noteworthy is the fact that this measure is not defined as the proximity to the 'true' process that determines the behavior of the environment. That is just as well, because the process that determines the behavior of the environment may be unknown, uncomputable, or simply too complex for the type of models that we use. Even better: a learning system can compute the amount of knowledge in its own model. MDL defines a measure of statistical knowledge that can be used by the learning system itself.

Using the minimum description length principle interactively.

The third assumption for the project is that autonomous learning systems should optimize model growth. This assumption follows from the first two. In an interactive setting, the data set describes the interaction between the learning system and its environment up to a certain time. The resulting model can be used, together with a candidate strategy, to predict a possible future for the interaction. These predictions can in turn be used to select the 'best' strategy.

In itself this technique can be applied to every learning task that is studied in the area of Computational Learning Theory. (There is a theorem that states that every model class can be transformed into a class of probabilistic models and that MDL models perform well under a large variety of loss/credit functions.) In interactive learning, however, the measure of knowledge that is implied by MDL turns out to be very useful. The rate of growth of the complexity of the model can be used by an interactive learning system as the criterion to optimize when choosing actions. Of course the environment has a large part in determining the interaction, but the criterion to optimize is independent of the environment. In that sense a learning system that optimizes model growth can be truly autonomous.
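As an illustration of this criterion, the sketch below (again a toy under stated assumptions, not the project's implementation) maintains a simple count-based model of an interaction with two possible actions. At each step it estimates, using the model's own predictions in place of the real environment, how much each action is expected to grow the model's two-part description length, and takes the action with the largest expected growth. Measuring growth on the total two-part code length is one possible reading of 'model growth' and is used here purely for concreteness.

    import copy
    import math
    import random
    from collections import Counter

    class CountModel:
        # A per-action frequency model over binary observations.  Its description
        # length is a crude two-part code: a parameter cost for the actions seen
        # so far, plus the bits needed to encode the interaction history with the
        # model's own sequential predictions.

        def __init__(self):
            self.counts = {}    # action -> Counter of observations
            self.history = []   # list of (action, observation) pairs

        def update(self, action, observation):
            self.counts.setdefault(action, Counter())[observation] += 1
            self.history.append((action, observation))

        def predictive(self, action):
            # Add-one smoothed distribution over the binary observations 0 and 1.
            counts = self.counts.get(action, Counter())
            total = sum(counts.values()) + 2
            return {obs: (counts[obs] + 1) / total for obs in (0, 1)}

        def description_length(self):
            # Model part: roughly (k/2) * log2(n) bits, with one free parameter
            # per action seen (observations are binary) -- a sketch assumption.
            n = max(len(self.history), 2)
            k = len(self.counts)
            model_bits = 0.5 * k * math.log2(n)
            # Data part: encode the history sequentially with the model itself.
            data_bits, replay = 0.0, CountModel()
            for action, obs in self.history:
                data_bits -= math.log2(replay.predictive(action)[obs])
                replay.update(action, obs)
            return model_bits + data_bits

    def expected_growth(model, action):
        # Expected one-step growth of the description length if `action` is taken,
        # averaged over the model's own prediction of what would be observed.
        before = model.description_length()
        growth = 0.0
        for obs, p in model.predictive(action).items():
            hypothetical = copy.deepcopy(model)
            hypothetical.update(action, obs)
            growth += p * (hypothetical.description_length() - before)
        return growth

    def curious_action(model, actions):
        # The autonomous criterion: the action with the largest expected growth.
        return max(actions, key=lambda a: expected_growth(model, a))

    def environment(action):
        # Toy environment: action 'a' is perfectly predictable, action 'b' is not.
        return 0 if action == 'a' else random.randint(0, 1)

    model = CountModel()
    for _ in range(20):
        action = curious_action(model, ['a', 'b'])
        model.update(action, environment(action))
    print(Counter(action for action, _ in model.history))

In this toy run the perfectly predictable action soon adds very little to the description length, so the system tends to spend most of its actions on the less predictable one. The criterion itself never consults the environment directly, only the model built from the interaction.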

Computational complexity

The final basic assumption for the project is that a computer program that implements the theory should be practical. This implies some severe limitations on our expectations of 'optimal' statistical models and 'optimal' strategies for choosing actions. Practicality of algorithms is enforced by placing limits on two kinds of complexity associated with computations. Time complexity measures the number of atomic steps that must be performed to complete the algorithm. Space complexity measures the amount of information that must be kept around during the computation. Sometimes there is a trade-off between these two forms of computational complexity: an algorithm can run faster when it keeps more data around, or it can be economical with space but slow.
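A textbook illustration of that trade-off, unrelated to the project's own algorithms: a naive recursive computation of Fibonacci numbers is slow but keeps almost nothing in memory, while memoizing intermediate results makes it fast at the price of storing them all.

    from functools import lru_cache

    def fib_slow(n):
        # Recomputes the same subproblems over and over: exponentially many
        # steps, but almost nothing is kept beyond the call stack.
        return n if n < 2 else fib_slow(n - 1) + fib_slow(n - 2)

    @lru_cache(maxsize=None)
    def fib_fast(n):
        # Remembers every subproblem it has solved: a linear number of steps,
        # at the price of keeping all intermediate results in memory.
        return n if n < 2 else fib_fast(n - 1) + fib_fast(n - 2)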

The largest class of probabilistic models that we could theoretically use is the class of semi-computable semi-measures. The computation of members of this class can be very expensive in both space and time. However, the most important reason why we cannot use this class of models is that we cannot use MDL on it: the length of the two-part code is not a computable function of the code for the model and the description of the data seen so far. Even for more structured model classes, like the class of all probabilistic context-free grammars, the time required to compute the length of the two-part code is still exponential in the length of the data. Therefore we want to search for efficient algorithms to compute the description length of two-part codes for model classes that are as rich as possible.

Once we have an MDL model (or a close approximation) we need to find a strategy or policy for choosing actions that will lead as quickly as possible to the largest possible model growth. We expect that techniques from Operations Research will enable us to find candidate policies efficiently.

Visit the project page on GitHub for project details.