AIT Lecture Notes


Introduction

Algorithmic Information Theory (AIT) merges information theory with computability theory. It defines the complexity $K(x)$ of an object $x$ as the length of the shortest program that outputs $x$ on a universal prefix Turing machine $U$:

$$ K(x) = \min \{|p| : U(p) = x\} $$
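Since $K$ is uncomputable, in practice it can only be bounded from above: any compressed description of $x$, together with a fixed decompressor, yields an upper bound up to an additive constant. The following minimal Python sketch is my own illustration of this idea, using `zlib` purely as a stand-in for a universal machine:

```python
import os
import zlib

def upper_bound_K_bits(x: bytes) -> int:
    """A crude upper bound (up to an additive constant) on K(x), in bits:
    the length of a zlib-compressed description of x. K itself is
    uncomputable, so this only illustrates the 'shortest description' idea."""
    return 8 * len(zlib.compress(x, 9))

# A highly regular string has a very short description...
print(upper_bound_K_bits(b"01" * 5000))       # far below the 80,000 raw bits
# ...while (pseudo-)random bytes barely compress at all.
print(upper_bound_K_bits(os.urandom(10_000))) # close to 80,000 bits
```

The gap between these two bounds is the intuition behind the incompressibility characterization of randomness developed in Chapter 4.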

Seeking an answer to the problem of induction, Solomonoff formalized Occam's Razor via the Universal Distribution $M(x)$. Here $U$ is understood as a universal monotone Turing machine, and the sum ranges over programs whose (possibly infinite) output stream begins with $x$:

$$ M(x) = \sum_{p : U(p) = x*} 2^{-|p|} $$
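As a toy computation of my own (not from the original notes): if exactly two programs, of lengths 3 and 5 bits, make $U$ output a stream beginning with $x$, then

$$ M(x) = 2^{-3} + 2^{-5} = \frac{1}{8} + \frac{1}{32} = \frac{5}{32} $$

The shortest program dominates the sum, which is precisely how $M$ encodes the preference for simple explanations.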

He proved that compression leads to optimal prediction. This major result, known as the Solomonoff Completeness Theorem, bounds the $\mu$-expected cumulative squared prediction error by a constant proportional to $K(\mu)$:

$$ \sum_{t=1}^{\infty} \sum_{x_{\lt t} \in \mathcal{B}^{t-1}} \mu(x_{\lt t}) \left( M(0|x_{\lt t}) - \mu(0|x_{\lt t}) \right)^2 < \frac{1}{2} \ln 2 \cdot K(\mu) < \infty $$
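To make the bound concrete (a back-of-the-envelope example of mine): if the true environment $\mu$ can be described by a program of $K(\mu) = 100$ bits, the total $\mu$-expected squared prediction error over the entire infinite future is at most $\frac{1}{2} \ln 2 \cdot 100 \approx 34.7$. Since this total is finite, the individual errors must vanish, so $M(0|x_{\lt t}) \to \mu(0|x_{\lt t})$ with $\mu$-probability one.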

Marcus Hutter extended this framework to decision-making with the AIXI agent. He provided the first mathematical definition of a universally intelligent agent:

$$ a_k := \arg\max_{a_k} \sum_{o_k r_k} \dots \max_{a_m} \sum_{o_m r_m} [r_k + \dots + r_m] \sum_{q : U(q, a_{1:m}) = o_{1:m} r_{1:m}} 2^{-|q|} $$
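Structurally, this is an expectimax over a Bayesian mixture of environments, with the universal prior $2^{-|q|}$ as weights. The Python sketch below is my own toy illustration of that structure only: the uncomputable universal mixture is replaced by two hand-made deterministic environments with a uniform prior, and the horizon is fixed at a few steps.

```python
# Toy finite-horizon expectimax over a Bayesian mixture of environments,
# mirroring only the *structure* of the AIXI equation above. The universal
# mixture over all programs q is uncomputable; here it is replaced by two
# hypothetical deterministic environments with a uniform prior.

ACTIONS = [0, 1]
PERCEPTS = [(0, 0.0), (0, 1.0), (1, 0.0), (1, 1.0)]  # (observation, reward)

def env_generous(history, action, percept):
    """Toy environment 1: always emits observation 1 with reward 1."""
    return 1.0 if percept == (1, 1.0) else 0.0

def env_echo(history, action, percept):
    """Toy environment 2: echoes the action; only action 1 is rewarded."""
    return 1.0 if percept == (action, 1.0 if action == 1 else 0.0) else 0.0

def action_value(history, weights, action, horizon):
    """Mixture-expected return of `action` followed by optimal behaviour.
    `weights[nu]` = prior weight of nu times the probability nu assigns
    to the percepts already in `history` (given the actions taken)."""
    z = sum(weights.values())            # mixture probability of the history
    total = 0.0
    for (o, r) in PERCEPTS:
        new_w = {nu: w * nu(history, action, (o, r)) for nu, w in weights.items()}
        xi = sum(new_w.values()) / z     # mixture probability of this percept
        if xi > 0.0:
            total += xi * (r + value(history + [(action, o, r)], new_w, horizon - 1))
    return total

def value(history, weights, horizon):
    """Expectimax value: best achievable mixture-expected future reward."""
    if horizon == 0:
        return 0.0
    return max(action_value(history, weights, a, horizon) for a in ACTIONS)

prior = {env_generous: 0.5, env_echo: 0.5}
best = max(ACTIONS, key=lambda a: action_value([], prior, a, 3))
print(best)  # 1: action 1 earns a reward in both environments
```

AIXI replaces the two toy environments by all environments computed by some program $q$, weighted by $2^{-|q|}$.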

These lecture notes trace my learning journey. I have therefore reformulated many of the proofs and theorems I encountered, and where definitions were missing I introduced or adapted them to keep the whole coherent. Although the AIXI agent is not explicitly covered in these notes, the themes explored here lay the necessary conceptual groundwork for its study; chapters 6, 7, and 8 of Hutter's recent book offer an excellent introduction to it.


Lecture Notes

Chapter 0: Prerequisites & Information Theory

Chapter 1: Algorithmic "Turing" Computability

Chapter 2: Algorithmic "Kolmogorov" Complexity

Chapter 3: Algorithmic "Solomonoff" Probability

Chapter 4: Algorithmic "Martin-Löf" Randomness

Chapter 5: Algorithmic "Hutter" Intelligence


Context

Before diving into the subject, I want to clarify the context of this work. These lecture notes do not claim to be an academic reference; above all, they reflect my learning journey. As this is the path of someone discovering the field and making it their own, it is very likely that some errors or awkward passages have slipped through. The goal is simply to share my process over the past months, in the hope that it might help others.

Since AI has become such a central subject, I wanted to understand its underlying principles. As a novice, I first studied the classical foundations of statistical learning (PAC learnability, VC dimension, etc.), but this approach seemed insufficient for answering certain fundamental questions about the nature of intelligence.

It was while studying the proof of the No Free Lunch (NFL) theorem, which assumes a uniform distribution over the set of all possible problems, that a question arose. The NFL is sometimes presented as a theorem that sets the limits of what can be learned; I find it should rather be read as demonstrating the necessity of choosing one's priors carefully. Priors like the uniform distribution lead to conclusions inconsistent with everyday experience: no one assigns equal probability to gravity continuing to follow Newton's laws and to it reversing abruptly in the next second, yet this is precisely the assumption the theorem makes. Googling my doubts, I stumbled upon course materials by Marcus Hutter that pointed out exactly this problem, calling it the 'myth of the uniform distribution'.

Digging deeper, I came across an interview between Marcus Hutter and Lex Fridman that provided the missing pieces: Hutter explains how Solomonoff's universal distribution formalizes Occam's Razor, making it the most rational prior. I also found Jürgen Schmidhuber's interview with Lex Fridman very interesting, as he explains beautifully how compression can even account for things like beauty, science, or creativity. These exchanges bridged the gap between compression and intelligence, pushing me to study AIT to better grasp fundamental questions like "What is intelligence?" or "What does it really mean to understand?". The reference guide available on Hutter's website, as well as his paper "Algorithmic Information Theory: a brief non-technical guide to the field" [9], then allowed me to gather valuable sources.

Initially, I only intended to clean up a few definitions so as not to forget them. But to grasp the logic of certain results, I needed to rewrite them in my own way, little by little, until the notes grew into this version. This material is not exhaustive, however, and remains open to further improvement.


Sources

Chapter 0 (Prerequisites):
A synthesis of [1, 2, 8, 13]. The information theory part relies on [5] and [8].

Chapter 1 (Computability):
These are the classic definitions and proofs of computability. I chose to spell out the definitions of prefix encodings at greater length, as I personally found the more condensed presentation in [1] difficult to follow at my own pace. I may at times have over-formalized: many proofs concern trivial results, but I insisted on writing them out in full for the sake of rigor.

I introduced the definition of a "compact enumeration" to resolve an ambiguity: one might think of an enumeration as a surjection, whereas here it should be understood as a capability to simulate. In hindsight, I am not sure this distinction was a good idea.

Regarding Monotone Turing Machines, several definitions exist. I chose the one from [1] and [8] because it suited Chapter 3 best; I kept it unchanged, merely stating my own formal version of it. Not having found, in the literature I consulted, a proof I could follow of the existence of a universal monotone Turing machine, I attempted to write one myself (Section 1.9.4), simply to convince myself the result was true. I reread the proof several times, but it is quite possible that errors or awkward steps have slipped through.

Chapter 2 (Kolmogorov):
The definitions of plain and prefix Kolmogorov complexity are the standard ones and rely on the universal Turing machines defined in Chapter 1. The results for plain complexity come mostly from [1] and [6].

For non-subadditivity (under concatenation), I adapted and formalized the arguments from a video by Jan Reimann [15].

For the rest of the chapter, I insisted on proving everything rather than taking any result on faith. The Kraft-Chaitin theorem draws on three sources: [4], [6] (Shen), and [15]. The proofs of Levin's coding theorem and of the symmetry of information come from the IIT Kanpur course [13] and from Downey & Hirschfeldt [3]. I adapted these proofs to the formalism laid out in Chapter 1, which allowed me to cover the entire chapter without leaving any result unproven.

Chapter 3 (Solomonoff Probability):
I based almost this entire chapter on the structure of paper [9], adapting the proofs to the formalism I used in Chapter 1. In section 1.1.2, I tried to formalize the algorithm proposed by J. Tyszkiewicz in [1], but I’m worried I just made the whole thing more complicated than it needs to be.

Based on my understanding, there are different ways to approach the definition of monotone machines (and the notations that go with them):

I chose to follow the definition from [9] and formalize it. In hindsight, I think I probably should have used the construction from [8], but that would have required proving that a universal Turing machine exists for such a specific definition. I know my current choice might not be the best for defining Monotone Kolmogorov Complexity ($Km$), but it seemed sufficient for Solomonoff Probability, which is the focus here. I also included some theorems and proofs on Bayesian Mixtures. I don't actually use them in these notes, but some of these results might be interesting for those intending to study [8].

Technical Remark on Theorem 4.5.2:
Li & Vitányi's Theorem 4.5.2 (p. 306) states that the construction of the sets $\Phi_k$ is straightforward. Although the authors treat this step as obvious, I found it quite subtle to formalize. In attempting to do so, I introduced a specific type of function in Section 1.10.5, though I could not rigorously prove that this construction can approximate any lower semi-computable semi-measure. My approach is probably more complex than necessary, and a more elegant path may have escaped me. I would be very grateful if any reader could point me toward a reference or an argument clarifying the simpler construction I might have overlooked.

Chapter 4 (Randomness):
The entire structure is inspired by Jan Reimann's course on Scholarpedia [15]. Section 4.1 is a formal reformulation of the results already present in Shen's paper [6]. Similarly, the characterization by $K$ (Schnorr's Theorem showing the equivalence between incompressibility and Martin-Löf randomness) comes entirely from [6], whose arguments I reformulated and formalized.

The characterization by $C$ comes from two sources, notably the paper "A simple proof of Miller–Yu theorem" [19]. However, as this paper is aimed at experts, some passages are condensed, and I struggled to follow it in full. I therefore contacted Wolfgang Merkle, who kindly provided me with another proof; I thank him for his help.

I worked through the section on martingales (the characterization of randomness by unpredictability) with Downey & Hirschfeldt [3]. The proofs concerning Chaitin's constant $\Omega$ come from [1].

Bibliography


Further Material