8 Drift as an Average: A Closer Look at the Conditioning of Drift
Consider a deterministic process which starts at $100$ and goes down by $1$ in each iteration; trivially, the time until the process reaches $0$ is $100$. If, in each iteration, the process goes down by either $1$ or $1/2$, then a worst-case view gives an upper bound of $200$ on the time until the process reaches $0$.
Drift theory allows for an average-case view. In fact, the main strength of drift theory is that even the possibility of moving away from the target is incorporated, as long as, in expectation, there is a bias towards the target. For example, if a process goes down by $10$ with probability $11/20$ and up by $10$ with probability $9/20$, then the expected progress per iteration is also $1$, while in the worst case the process moves away from the target. The additive drift theorem tells us that, also in this case, we expect to arrive at $0$ after $100$ steps. Thus, instead of a worst-case bound, averaging over the different outcomes leads to a useful bound.
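To make this concrete, the following minimal Python sketch (our own illustration; the function name and parameters are made up) estimates the hitting time of the $\pm 10$ process by simulation. The empirical mean should come out close to the bound of $100$ given by the additive drift theorem.

```python
import random

def hitting_time(start=100, step=10, p_down=11/20):
    """Iterations until the process first reaches 0, moving down by
    `step` with probability `p_down` and up by `step` otherwise."""
    x, t = start, 0
    while x > 0:
        x += -step if random.random() < p_down else step
        t += 1
    return t

runs = 10_000
print(sum(hitting_time() for _ in range(runs)) / runs)  # approximately 100
```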
In this section we investigate three questions.
- How do drift theorems make it possible to exploit that drift is an average over a range of possibilities?
- Why do drift theorems in the literature condition on various different things?
- How do we account for insufficient drift after reaching the target?
8.1 Drift as an Average
In
Theorem 2.1 [Additive Drift, Upper Bound] we have seen the standard (additive) drift condition to be
- (D)
- there is a $\delta > 0$ such that, for all $t \lt T$, it holds that $\Ew{X_t - X_{t+1} \mid X_0,…,X_t} \geq \delta$.
In this section we want to take a closer look at the conditioning on $X_0,…,X_t$. First we show that
not conditioning on anything leads to counterexamples.
Example 8.1 [Global Averaging].
Suppose we first flip a fair coin in secret. Then, if the coin shows tails, for all $i \in \natnum$ we let $X_i = 1$. If, on the other hand, the coin shows heads, we let $X_0 = 1$ and, for all $i \in \natnum$, we draw independently uniformly at random bits $B_i \in \{0,1\}$ and set $X_{i+1} = X_i - B_i$. For any $t$ with $t \lt T$ we now have $\Ew{X_t - X_{t+1}} \geq 1/4$ (we make progress of exactly $1$ if the coin shows heads and the bit is $1$, and otherwise of $0$). Thus, the conclusion of the additive drift theorem states an upper bound of $4$ on the expected time to hit $0$, while, in fact, $\Ew{T}$ is infinite, since, with probability $1/2$, the process stays at $1$ forever.
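The failure is easy to observe empirically. The following Python sketch (again our own illustration, with made-up names) estimates the unconditional one-step drift, which is close to $1/4$, while at the same time roughly half of all runs never reach $0$ within any fixed budget of steps, so the empirical hitting time grows without bound as the budget grows.

```python
import random

def one_run(budget=1_000):
    """One run of Example 8.1: a secret fair coin decides between the
    constant process (tails) and the downward random walk (heads).
    Returns the hitting time of 0, or None if 0 is not reached in time."""
    heads = random.random() < 1/2
    x = 1
    for t in range(budget):
        if x <= 0:
            return t
        x -= random.randint(0, 1) if heads else 0
    return None

runs = 10_000
results = [one_run() for _ in range(runs)]
print(sum(r is None for r in results) / runs)  # ~0.5: these runs never hit 0

# unconditional one-step drift: E[X_0 - X_1] = P(heads) * E[B_0] = 1/4
drift = sum((random.randint(0, 1) if random.random() < 1/2 else 0)
            for _ in range(100_000)) / 100_000
print(drift)  # ~0.25
```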
The previous example shows that the drift condition can hold when averaged over all possible situations while the conclusion of the additive drift theorem fails. What we can do is average over all possible situations that share the same history $X_0, \ldots, X_t$, as stated by the additive drift theorem. The following example illustrates this.
Example 8.2 [Local Averaging].
Let us play a game where your goal is to draw a total of $10$ red balls. In each iteration, I secretly fill a bag with balls of different colors, and you draw a ball uniformly at random from that bag. Suppose I fill the bag either with $10$ balls, $9$ of which are blue and one red, or with $100$ balls, $99$ of which are blue and one red, choosing each option with equal probability $1/2$. Then, on average, in any iteration your probability of picking a red ball is $\frac{1}{2} \cdot \frac{1}{10} + \frac{1}{2} \cdot \frac{1}{100} = 11/200$. Thus, you arrive at a total of $10$ drawn red balls after an expected number of $10 \cdot \frac{200}{11} = 2000/11 \approx 182$ iterations.
Note that in this example there are two different possible situations in each iteration, each with a different drift. One way to address this is to bound the drift from below by the smaller of the two drift values; but the drift theorem allows for averaging the drift over the different situations, since we are given the probabilities of the two situations.
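A small Monte Carlo sketch (our own illustration, with made-up function names) confirms the averaged analysis: the empirical mean number of iterations comes out close to $2000/11 \approx 182$.

```python
import random

def iterations_until_ten_red():
    """Play the bag game: each iteration the bag holds one red ball among
    either 10 or 100 balls in total, chosen by a fair coin; count the
    iterations until 10 red balls have been drawn."""
    red, t = 0, 0
    while red < 10:
        n = 10 if random.random() < 1/2 else 100
        if random.random() < 1 / n:  # probability of drawing the red ball
            red += 1
        t += 1
    return t

runs = 2_000
print(sum(iterations_until_ten_red() for _ in range(runs)) / runs)  # ~181.8
```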
We cannot average over global decisions that we can learn about from the history, see Example 8.1 [Global Averaging]; this depends on our choice of what constitutes the history in this context. If we cannot learn about the global decisions from the history, we can use the principle of deferred decisions to model the random decision as a decision made in the given iteration, as illustrated by the following example.
Example 8.3 [Deferred Decisions].
Let us again play a game where your goal is to draw a total of $10$ red balls. This time, before the game starts, I fill an infinite sequence of bags, making for each bag exactly the same random decision as in Example 8.2 [Local Averaging]. In each iteration, you get the next bag from this sequence. Since the contents of one bag are independent of all other bags, we cannot learn anything about future bags from the history. Thus, using the principle of deferred decisions, we may compute as if the bag were only packed in the current iteration, after all previous (random) decisions have been made. The analysis then proceeds exactly as in Example 8.2 [Local Averaging], and you arrive at a total of $10$ drawn red balls after an expected number of $2000/11$ iterations.
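The deferred-decision view is directly reflected in code: whether each bag is packed just in time (as in the previous sketch) or the whole sequence of bags is committed before the game starts makes no difference to the statistics. A sketch of the pre-committed variant (again our own illustration; the cutoff `max_bags` replaces the infinite sequence and is chosen large enough to be irrelevant):

```python
import random

def iterations_until_ten_red_precommitted(max_bags=2_000):
    """All bag sizes are fixed before the game starts; by independence,
    the statistics match the just-in-time version."""
    bags = [10 if random.random() < 1/2 else 100 for _ in range(max_bags)]
    red = 0
    for t, n in enumerate(bags, start=1):
        if random.random() < 1 / n:
            red += 1
        if red == 10:
            return t
    raise RuntimeError("ran out of pre-packed bags")

runs = 1_000
print(sum(iterations_until_ten_red_precommitted() for _ in range(runs)) / runs)  # ~181.8
```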
8.2 The Conditioning of Drift
Let us take a closer look at the drift condition. For this section, we define four variants with different conditioning of the drift as follows.
Definition 8.4 [Variants on the Conditioning of Drift].
Let $(X_t)_{t \in \natnum}$ be an
integrable random process over $\realnum$ and let $(F_t)_{t \in \natnum}$ be a filtration such that $(X_t)_{t \in \natnum}$ is adapted to $(F_t)_{t \in \natnum}$, let $f$ be a measurable function and $t \in \natnum$.
- (D-filtration)
- $\Ew{X_t - X_{t+1} \mid F_t} \geq f(X_t)$ with probability $1$.
- (D-history)
- $\Ew{X_t - X_{t+1} \mid X_0,…, X_t} \geq f(X_t)$ with probability $1$.
- (D-events)
- For all $s_0,…, s_t$ with $\Pr{X_0=s_0, …, X_t = s_t} \gt 0$,
$\Ew{X_t - X_{t+1} \mid X_0=s_0,…, X_t=s_t} \geq f(s_t)$.
- (D-Markov)
- For all $s_t$ with $\Pr{X_t = s_t} \gt 0$, $\Ew{X_t - X_{t+1} \mid X_t = s_t} \geq f(s_t)$.
Some researchers prefer to state all drift theorems in terms of filtrations, see, for example, [
Wit23How majority-vote crossover and estimation-of-distribution algorithms cope with fitness valleys. Carsten Witt. Theor. Comput. Sci. 2023.]; some prefer conditioning on the history [
KK19First-hitting times under drift. Timo Kötzing, Martin S. Krejca. Theor. Comput. Sci. 2019.].
In Section 2.1.2 of [
Len20Drift Analysis. Johannes Lengler. In: Theory of Evolutionary Computation 2020.], Lengler discusses the differences between
(D-filtration) and
(D-Markov), phrasing drift theorems in terms of
(D-Markov). Many applications of drift theorems involve states of algorithms which typically behave as
Markov chains. This ubiquity of
Markov chains sometimes leads to drift theorems being stated for
Markov chains only. However, the states of algorithms need to be mapped to real numbers in order to apply drift theorems (see
Section 3 [The Art of Potential Functions] for a discussion on potential functions). If this mapping is not 1-to-1, then, in general, the resulting mapped process is not Markovian any more, so drift theorems applicable for
Markov chains on $\realnum$ are no longer applicable. As we will see, (D-history) is implied by (D-filtration) (see Theorem 8.6 [Conditioning on Filtration vs. History vs. Events]) and is, for discrete Markov chains, equivalent to (D-Markov) (see Proposition 8.8 [Equivalence of Conditionals for Discrete Search Spaces] and Proposition 8.9 [Equivalence for Markov Chains]), which is why in this work all drift theorems are stated analogously to (D-history).
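To illustrate why a non-injective potential can destroy the Markov property, consider the following toy construction (entirely our own, not from the literature): a deterministic Markov chain on three states whose mapped process $Y_t = f(X_t)$ is not Markovian, because after observing $Y_t = 1$, the distribution of $Y_{t+1}$ depends on $Y_{t-1}$.

```python
import random
from collections import Counter

# Markov chain X on {0, 1, 2}: the deterministic cycle 0 -> 1 -> 2 -> 0,
# started uniformly at random. The potential f collapses states 1 and 2.
f = {0: 0, 1: 1, 2: 1}

counts = Counter()
for _ in range(10_000):
    x = random.randrange(3)
    ys = []
    for _ in range(3):                 # observe Y_0, Y_1, Y_2
        ys.append(f[x])
        x = (x + 1) % 3
    if ys[1] == 1:                     # condition on Y_1 = 1
        counts[(ys[0], ys[2])] += 1

# Only (Y_0, Y_2) = (0, 1) and (1, 0) occur: given Y_1 = 1, the next value
# of Y is determined by the previous one, so Y is not a Markov chain.
print(counts)
```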
In the following we want to discuss how the four given conditions differ and in what sense they are equivalent. First we recall that conditioning on the history $X_0, …, X_t$ is defined as conditioning on the $\sigma$-algebra $\sigma(X_0,…, X_t)$, leading to the canonical filtration. In this sense, the condition (D-history) implies that there is a filtration such that (D-filtration) holds.
Proposition 8.5 [Canonical Filtration as Filtration].
Let $(X_t)_{t \in \natnum}$ be an
integrable random process over $\realnum$, $f$ a measurable function and $t \in \natnum$. Suppose
(D-history). Then there is a filtration $(F_t)_{t \in \natnum}$ such that $(X_t)_{t \in \natnum}$ is adapted to $(F_t)_{t \in \natnum}$ and such that
(D-filtration) holds.
Proof.
Using $F_t = \sigma(X_0,…,X_t)$, (D-history) and (D-filtration) are identical.
The question now arises whether anything can be gained from using filtrations other than the canonical one. Sometimes it can be easier to assume a different filtration, which gives more information for the analysis to work with; more outcomes of random variables can be fixed, allowing the analysis to proceed with these concrete outcomes (see, for example, the proof of Theorem 4.1 [Unbiased Random Walk on the Line]). But how should the drift theorem be stated? Using the following theorem, we see that if the drift theorem is stated conditional only on the history, any other filtration (to which the process is adapted) can also be used, since (D-filtration) implies (D-history).
Theorem 8.6 [Conditioning on Filtration vs. History vs. Events].
Let $(X_t)_{t \in \natnum}$ be an
integrable random process over $\realnum$ and $f$ a measurable function. Suppose further $(F_t)_{t \in \natnum}$ is a filtration such that $(X_t)_{t \in \natnum}$ is adapted to $(F_t)_{t \in \natnum}$. We then have, for all $t \in \natnum$, the following implications:
(D-filtration) $\Rightarrow$
(D-history) $\Rightarrow$
(D-events) $\Rightarrow$
(D-Markov)
where the last implication holds for
discrete $(X_t)_{t \in \natnum}$.
Proof.
Suppose first, for “
(D-filtration) $\Rightarrow$
(D-history)”, that $\Ew{X_t - X_{t+1} \mid F_t} \geq f(X_t)$ with probability $1$.
Since $(X_t)_{t \in \natnum}$ is adapted to $(F_t)_{t \in \natnum}$, we have that, for all $t$,
$$
\sigma(X_0,…, X_t) \subseteq F_t.
$$
Using Lemma 9.15 [Tower Property for Sub-$\sigma$-Algebra] in the second equality and the $\sigma(X_0,…, X_t)$-measurability of $f(X_t)$ in the last, we have, with probability $1$,
\begin{align*}
\Ew{X_t - X_{t+1} \mid X_0,…, X_t} & = \Ew{X_t - X_{t+1} \mid \sigma(X_0,…, X_t)}\\
& = \Ew{\Ew{X_t - X_{t+1} \mid F_t} \mid \sigma(X_0,…, X_t)}\\
& \geq \Ew{f(X_t) \mid \sigma(X_0,…, X_t)}\\
& = f(X_t).
\end{align*}
Suppose now, for “(D-history) $\Rightarrow$ (D-events)”, that $\Ew{X_t - X_{t+1} \mid X_0,…, X_t} \geq f(X_t)$ with probability $1$. Let $s_0,…, s_t$ be given, let $A$ be the event that $X_0=s_0, …, X_t = s_t$, and suppose $\Pr{A} \gt 0$. Since $A \in \sigma(X_0,…, X_t)$, we get (using Lemma 9.13 [Conditional and Indicator] in the first step and Definition 9.14 [Filtration Conditional] in the second)
\begin{align*}
\Ew{X_t - X_{t+1} \mid A}\Pr{A} &= \Ew{(X_t-X_{t+1}) \ind{A}}\\
&= \Ew{\Ew{X_t-X_{t+1} \mid X_0,…, X_t} \ind{A}}\\
&\geq \Ew{f(X_t)\ind{A}}\\
&= f(s_t)\Pr{A}.
\end{align*}
Dividing by $\Pr{A} \gt 0$ yields (D-events).
Regarding “(D-events) $\Rightarrow$ (D-Markov)”, we note that, for discrete $(X_t)_{t \in \natnum}$ and any $s_t$ with $\Pr{X_t = s_t} \gt 0$, the law of total expectation gives
$$\Ew{X_t - X_{t+1} \mid X_t=s_t} = \sum_{(s_0,…,s_{t-1})} \Pr{X_0=s_0,…,X_{t-1}=s_{t-1} \mid X_t=s_t} \cdot \Ew{X_t - X_{t+1} \mid X_0=s_0,…, X_t=s_t},$$
where the sum ranges over all $s_0,…,s_{t-1}$ with $\Pr{X_0=s_0, …, X_t = s_t} \gt 0$. By (D-events), each conditional expectation in the sum is at least $f(s_t)$, and the conditional probabilities sum to $1$; hence $\Ew{X_t - X_{t+1} \mid X_t=s_t} \geq f(s_t)$. This gives the desired implication.
Note that (D-events) and (D-Markov) implicitly consider the process to be discrete. As the following example shows, a drift theorem based on (D-events) without the requirement of a discrete process would be wrong in this generality.
Example 8.7 [Conditioning on Events of Continuous Processes].
Let $X_0$ be a real number drawn uniformly at random from $[1,2]$ and let, for all $t \in \natnum$, $X_{t+1} = X_t$. Then we have
- (D-events)
- for all $s_0,…, s_t$ with $\Pr{X_0=s_0, …, X_t = s_t} \gt 0$, $\Ew{X_t - X_{t+1} \mid X_0=s_0,…, X_t=s_t} \geq 1$.
This follows since, for all $t \in \natnum$ and all $s_0,…, s_t$, we have $\Pr{X_0=s_0, …, X_t = s_t} = 0$ (the distribution of $X_0$ is continuous, so every single outcome has probability $0$). Thus, (D-events) is vacuously true. Furthermore, for all $t \in \natnum$ we have $X_t \geq 1$, so there is no $t$ such that $X_t \leq 0$; that is, $T = \infty$, although a drift theorem based on (D-events) would promise an expected hitting time of at most $\Ew{X_0} \leq 2$.
The example shows that the problem arises when considering continuous random variables. The next proposition shows that, for
discrete processes,
(D-events) implies
(D-history), making these two conditions equivalent. The proof is due to Marcus Pappik (private communication).
Proposition 8.8 [Equivalence of Conditionals for Discrete Search Spaces].
Let $(X_t)_{t \in \natnum}$ be an
integrable and
discrete random process over $\realnum$, $f$ a measurable function and $t \in \natnum$.
Then
(D-events) implies
(D-history).
Proof.
Fix $t \geq 0$, and let $R := \mathrm{range}(X_0) \times … \times \mathrm{range}(X_t)$.
For a tuple $s = (s_i)_{0 \le i \le t} \in R$, let $A(s) := \{\forall 0 \leq i \leq t: X_i = s_i\}$.
Let $S := \set{s \in R}{\Pr{A(s)} \gt 0}$.
For all $s = (s_i)_{0 \le i \le t} \in S$ and $\omega \in A(s)$,
Lemma 9.17 and
(D-events) yield
$$
\Ew{X_t - X_{t+1} \mid X_0, …, X_t}(\omega) = \Ew{X_t - X_{t+1} \mid A(s)} \geq f(s_t) = f(X_t(\omega)).
$$
Since, further, every $X_i$ has countable range, $R$, and thus also $S \subseteq R$, is countable.
Hence, we have that $\bigcup_{s \in S} A(s)$ is measurable and
$$
\Pr{\bigcup_{s \in S} A(s)} \geq 1 - \sum_{s \in R \setminus S} \Pr{A(s)} = 1 .
$$
Therefore, we have that $\Ew{X_t - X_{t+1} \mid X_0, …, X_t} \geq f(X_t)$ holds with probability $1$.
Just as
(D-events) implicitly assumes a
discrete space,
(D-Markov) assumes the process to be Markovian. The following theorem shows that, for
discrete Markov chains,
(D-Markov) implies
(D-events), making also these two conditions equivalent in this case.
Proposition 8.9 [Equivalence for Markov Chains].
Let $(X_t)_{t \in \natnum}$ be an
integrable and
discrete Markov chain over $\realnum$, $f$ a measurable function and $t \in \natnum$.
Then
(D-Markov) implies
(D-events).
Proof.
This follows directly from the Markov property: for all $s_0,…,s_t$ with $\Pr{X_0=s_0,…, X_t=s_t} \gt 0$, we have $\Ew{X_{t+1} \mid X_t=s_t} = \Ew{X_{t+1} \mid X_0=s_0,…, X_t=s_t}$ and thus also $\Ew{X_t - X_{t+1} \mid X_t=s_t} = \Ew{X_t - X_{t+1} \mid X_0=s_0,…, X_t=s_t}$.
We have seen that, for discrete spaces and for Markov chains, one can give specialized formulations of drift theorems. However, drift theorems conditioning on the history are strictly more general. Furthermore, conditioning on the history is easy to state and to understand for users of the theorem.
Conditioning on a filtration results in drift theorems that are equally general as those conditioning on the history, since the canonical filtration is one possible filtration, and in fact the smallest one to which the process is adapted.
8.3 Reaching the Target
Let us consider again the drift condition
- (D)
- there is a $\delta > 0$ such that, for all $t \lt T$, it holds that $\Ew{X_t - X_{t+1} \mid X_0,…,X_t} \geq \delta$.
We want to take a closer look at the requirement “for all $t \lt T$”. Since $T$ is a random variable, this does not properly define a range for $t$. Still, such a restriction is desirable, since it allows for processes which naturally do not go down anymore after reaching the target. One clean way to write the condition is
- (D)
- there is a $\delta > 0$ such that, for all $t \in \natnum$, $\Ew{X_t - X_{t+1} \mid X_0,…,X_t} \ind{t \lt T} \geq \delta \ind{t \lt T}$.
This notation resorts to indicator random variables and, while quantifying $t$ over all of $\natnum$, effectively requires the inequality to hold only in the case $t \lt T$. This inspires the following convention.
Convention 8.10 [Drift While not at Target].
We state inequalities that only need to hold for points in time when a random process did not reach its target yet.
Formally, let $T$ be a random variable over $\natnum \cup \{\infty\}$, let $X$ and $Y$ be random variables over $\realnum$.
Further, let $\sim$ denote a relation symbol, such as $=$, $\leq$, or $\geq$.
We define the phrase “for all $t \lt T$, it holds that $X \sim Y$” to be equivalent to “for all $t \in \natnum$, it holds that $X \cdot \ind{t \lt T} \sim Y \cdot \ind{t \lt T}$”.
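To see the convention in action, note that taking expectations on both sides of the indicator formulation yields $\Ew{(X_t - X_{t+1}) \cdot \ind{t \lt T}} \geq \delta \cdot \Pr{t \lt T}$. The following Python sketch (our own illustration) checks this consequence for the $\pm 10$ process from the introduction and a fixed $t$; there the drift is exactly $\delta = 1$ in every step before $T$, so both sides agree.

```python
import random

def trajectory(start=100, step=10, p_down=11/20, length=200):
    """The ±10 process, frozen once it reaches its target 0."""
    xs, x = [start], start
    for _ in range(length):
        if x > 0:
            x += -step if random.random() < p_down else step
        xs.append(x)
    return xs

t, runs = 50, 20_000
lhs = rhs = 0.0
for _ in range(runs):
    xs = trajectory()
    not_done = xs[t] > 0                    # the event {t < T}
    lhs += (xs[t] - xs[t + 1]) * not_done   # (X_t - X_{t+1}) * 1{t < T}
    rhs += 1.0 * not_done                   # delta * 1{t < T} with delta = 1
print(lhs / runs, rhs / runs)               # both estimate Pr[t < T]
```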
Note that, alternatively, we can condition on $t \lt T$, the way chosen in Theorem 5.1 [Additive Drift, Upper Bound, Time Condition] and Theorem 5.2 [Additive Drift, Lower Bound, Time Condition]. This makes the dependence on $t \lt T$ very explicit (which was the reason for stating the two named theorems in this way). However, one now has to quantify over “all $t \in \natnum$ such that $\Pr{t \lt T} \gt 0$”, which is again somewhat cumbersome.