English summary of my article on learning theory

Here is a link to the original Dutch version.

Contents:

Introduction
Goal
Approaches
Machine Learning

Appendix

Turing machines

(1) Introduction

Learning is like flying. It is does not presuppose an intrinsic quality of the animal that can do it. It is rather a clever way to interact with its environment. Therefore it is possible to study what makes this interaction work. Then it might be possible to duplicate the conditions and construct a mechanism that achieves the same goal. In the case of flying, the goal is to go from one place to another fast, without touching the ground and unhindered by anything on the ground. In the case of learning, the goal would be more knowledge about the environment.

The description above is an example of a learning goal. Especially the phrase "it is possible to study what makes this interaction work". In the case of flying, we observe that all flying animals have wings, but not all wings have feathers (insects and bats are counterexamples). This kind of research results in descriptions of regularities. Then we can try to construct devices that fit the regularities that we found and test them. If they do not work, we use the differences to guide the search for relevant regularies that we might have missed.

I propose the following definition of learning:

Learning is the search for regularities in the interaction with the environment

We use some abstact terminology. With the subject we mean the person, animal or hypothetical device that learns. With the environment we mean the real or simulated environment that can be accessed by the subject. With the interface we mean the sensors and actors with which the subject interacts with the environment.

(2) Goal

The goal of this article is to show the usefulness of the defininition of learning that was given in the introduction. In particular the possiblities it offers to contruct a formal theory of learning. A formal theory would greatly benefit the scattered research, because regularities found in one field, e.g., trying to teach language to apes, could be applied in another field, e.g., the design of computer human interfaces.

(3) Approaches

We will approach the definition from various angles. First we will discuss the appropriateness of the given definition of learning in psychology. Then the merits of subjectivist probability theory over logic are discussed as a formal framework for the description of regularities. After that we will investigate the constraints on learning by the requirement that the process must be effective. And finally we will show that the concept of randomness allows us to unify both subjectivist probability theory and effectiveness. On top of that, randomness can be turned into a measure for the amount of knowledge that the subject has at any given moment. This measure is a relative one, because it depends on the subject, but not as arbitrary as to render it useless.

(3.1) Psychology

Learning is a form of behavior exhibited by humans and other animals and is therefore studied by psychologists. I propose to study learning from the perspective of the learning subject. This in contrast to the behavioristic approach that studies behavior from the perspective of the environment.

Let us assume that any form of interaction between the subject and its environment can be called behavior. Even a stone on a riverbed exhibits behavior. It exchanges heat with the surrounding water, it releases small quantities of silicates, and it slowly assumes a round shape. All this happens on the surface of the stone, which can be called its interface. Of course this kind of behavior is studied by physics, not psychology, but it is behavior nevertheless. There is a continuum of forms of behavior between the stone in the river and humans building a dam. The distinctions in this continuum are vague so we will not try to limit it up front. Within this very broad range of behavior we study only a small aspect of that behavior. We only look at the contribution of behavior to the knowledge of the subject about its environment.

Knowledge is one of the big words of philosophy. It is the subject matter of epistemology. From the introductory philosophy course that I followed I got the impression that there are three really important questions in philosophy: what (ontology), why (epistemology), and why not (ethics). Epistemology is a problematic notion because it is susceptible to the liars paradox, i.e., can you know that you cannot know something. Likewise a theory of learning could have impact on the study of learning theory. This is not necessarily a fatal flaw in the principles of the field. The calculus of natural numbers has the same problem. We have to realize, however, that any consistent theory will be either incomplete or useless because it presupposes supernatural abilities.

A definition of learning in psychological terms would be:

Learning is the adaptation of behavior on the basis of experience to fulfill internal needs

I think that internal needs can always be stated in terms of required knowledge. Therefore we will concentrate on knowledge as the primary goal of a learning subject. With that in mind and the reasonable assumption that the subject has to adapt its behavior every once in a while to find new knowledge, this definition and the definition in the introduction can be seen to be equivalent.

(3.2) Probability theory

With probability theory, especially subjectivistic probability theory, we can describe the amount of belief or credence that the learning subject assigns to events. Events are things, e.g., perceptions, stimuli, or patterns of interaction, that could happen on the interface. Mutually exclusive events could both be credible. The subject only has to commit to actions not to predicates about events.

This solves a few problems with the classic artificial intelligence approach that uses logic to describe reality:

Subjects can assign positive credence to mutually exclusive events. This enables "as if" reasoning.
Subjects never have to commit to a universal predicate, although they might act on it. This circumvents the paradox of induction described by Hume.
Subjects can decrease the credence of events, whereas adding the negation of a previously asserted formula to a logic system would be disastrous.
Subjects, including scientists or even science itself, can have different opinions about reality, but if they share an interface they can achieve consensus about interactions on that interface.

(3.3) Machine behavior

Birds fly without resorting to magic. A learning subject should show the same restraint. All deductions that a learning subject makes about regularities of its interactions with its environment should be effective. Effective behavior is formalized in the nineteenthirties by Alan Turing. The thesis that his formalization covers all effective behavior has become famous as the Church-Turing-Post thesis. This leads to a possible definition of the vague term regularity that we used earlier.

A regularity is a pattern of interactions that can be effectively described

An additional benefit of Turings formalism is that every phenomenon that is effective is the result of a finite program. The size of the shortest program that describes a phenomenon can be taken as a measure of the complexity of that phenomenon. Therefore sum of the lengths of the programs that describe the regularities in the 'mental model' of the learning subject is a measure of the complexity of that model. Under certain conditions this would be a good measure of the amount of knowledge in the subject.

(3.4) Randomness

How random is a description? A description is a finite sequence of symbols from a predetermined alphabet. The alphabet can be very small. Computers use a minimal set: zero and one. A description can describe many things. It depends on how you decode it. It could be a literal transcription of interactions that happened on the interface, or a program that can be executed by a particular type of machine. The description of the interactions on an interface are typically pretty regular, i.e., they are not very random. How do we measure this? Let us try to describe the description with another description. Let us take an arbitrary universal Turing machine as a decoding mechanism. One sure way to describe the original description is to write a progam that contains the description. This program is slightly longer than the description itself. If the desciption is regular it can be compressed so the shortest program that produces the given description on our universal Turing machine is shorter than the description itself. It can be proven that the probability that a Turing machine will produce our description when given random input is close to 2^-n, where n is the length of the shortest program for which the Turing machine wil produce the description.

For a learning algorithm we use a bit more complex form of modeling to encode the literal transcription of the interactions on the interface. The model will be a finite program for a universal machine that accepts one argument. The program gets a description of a finite sequence of interactions and has to produce a rational number that encodes the probability that the next symbol will be a '1' (we assume a binary alphabet). So for a particular description we can compute a probability for each symbol on the basis of the previous symbols: if the symbol is '1' we take the number computed by the model, if the number is '0' we take 1 minus that number. The probability p of the entire sequence is the product of these probabilities. Using standard coding techniques we can encode the description given the probabilities supplied by the model in approximately -log₂ p bits. This would be the length of the description given the model. The minimum description length principle (MDL) tells us that the best model for a description is the model that minimizes the sum of the length of the model and the length of the description given the model.

The amount of knowledge in a subject that uses MDL to choose a model is the length of the model

(4) Machine Learning

Based on the approaches discussed in the previous secion I will sketch the outline of a possible formal theory of learning behavior.

The subsection about randomness emphasized the modeling of descriptions of interactions. Only a very small part of humanity devotes itself to the task of describing interactions. And even that part publishes their descriptions. An important part of learning systems is the part that proposes, evaluates, and chooses actions. The hard part when evaluating the actions of a learning subject is to choose a proper criterion for evaluation. In current artificial intelligence research the criterion is often an external one, e.g., how often is the classification of a data item by an expert system correct. This presupposes an objective reality. Even worse: it presupposes that we already know the correct answer. The criterion of a human learning subject is often internal, e.g., am I confident that I know how this works and that I can get the desired results by choosing a line of action based on my current knowledge? Now we have a proper measure of the amount of knowledge in the 'mental model' of a learning subject we can define the value of the actions of a learning subject as follows:

The value of the actions of a learning subject is the contribution of those actions to the expected increase in the amount of knowledge

In this setup a learning system consists of two parts:

The inference part tries to minimize the model, constrained by the actual interactions.
The action part tries to maximize the model that the inference part tries to minimize by stimulating the environment to show its complexity.

(A) Turing machines

This appendix describes the standard concept of Turing machines as can be found elsewhere.

Visit the LEIA project page on GitHub for project details.