3 The Art of Potential Functions

Drift theorems apply to random processes on $\realnum$. For the analysis of randomized algorithms, this typically means that one has to map the state of the algorithm to a real number, so that the resulting process is a process on $\realnum$. Such a mapping is called a potential function, and we have already seen several in Section 2 [A Gentle Introduction to Classic Drift Theorems]. Especially in the proof of Theorem 2.10 [Winning Streaks] we saw that sometimes unintuitive potential functions can lead to very strong results. In fact, one could say that the art of applying drift theorems lies in choosing the right potential function. Later in this work, for example in the proof of Theorem 4.1 [Unbiased Random Walk on the Line], we will see yet more intricate potential functions. In this section, we discuss a few cardinal examples.

3.1 A Simple Heuristic for Choosing Potential Functions

There is a very important rule of thumb for designing potential functions: better search points should have better potential. The following example showcases this.
Example 3.1 [Better Search Points with Better Potential]. Let $(X_t)_{t \in \natnum}$ be a Markov chain with $X_0 = 2$ and, for all $t$, if $X_t = 2$ then, with probability $1-1/100$, $X_{t+1} = 0$ and otherwise $X_{t+1} = 1$; if $X_t=1$ then $X_{t+1}=0$ with probability $1/100$ and $X_{t+1}=1$ otherwise. Additive drift provides an upper bound of an expected $200$ iterations to reach $0$ (since the lowest drift of $1/100$ is encountered in state $1$ and we start in state $2$).
This process is misleading in that state $1$ seems closer to the target $0$ than state $2$, when actually it is not. Consider the potential function $f(0)=0$, $f(1) = 100$ and $f(2) = 2$. Now both states $1$ and $2$ have a drift of exactly $1$ towards the target $0$, and we start in a state with potential $2$, so we get an expected time of $2$ to reach $0$ from the additive drift theorem.
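Both the hitting times and the drift under $f$ can be checked exactly; the following sketch recomputes them for this three-state chain using exact rational arithmetic.

```python
from fractions import Fraction

# Exact analysis of the chain from Example 3.1.
p = Fraction(1, 100)

# Expected hitting times of the target 0, from the recurrences
#   E[T(1)] = 1 + (1 - p) * E[T(1)]   and   E[T(2)] = 1 + p * E[T(1)].
T1 = 1 / p       # 100
T2 = 1 + p * T1  # 2

# The potential from the example: f(0) = 0, f(1) = 100, f(2) = 2.
f = {0: Fraction(0), 1: Fraction(100), 2: Fraction(2)}

# Drift in state 2 (to 0 with prob 1-p, to 1 with prob p) and in
# state 1 (to 0 with prob p, else stay): both are exactly 1.
drift2 = (1 - p) * (f[2] - f[0]) + p * (f[2] - f[1])
drift1 = p * (f[1] - f[0])
assert T1 == 100 and T2 == 2 and drift1 == 1 and drift2 == 1
```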
From this example we see that what seems “natural” (because some process on the reals presents itself) might not be the best for drift. In fact, as we will see later in this section, a potential can turn drift away from the optimum into drift towards the optimum.
The example can be generalized to arbitrary processes: on time-homogeneous Markov chains, the best potential for getting a tight bound with the additive drift theorem is the potential which assigns each state the expected time until finding the target when starting in that state. The next theorem, from [HY04: A study of drift analysis for estimating computation time of evolutionary algorithms. Jun He, Xin Yao. Nat. Comput. 2004.], makes this formal.
Theorem 3.2 [Expected Time as Potential]. Let $\CalX$ be some state space and let $(X_t)_{t \in \natnum}$ be a time-homogeneous Markov chain on $\CalX$ and let $O \subseteq \CalX$ be a set of targets. For any $x \in \CalX$, let $T(x)$ be the random variable describing the number of steps until reaching an element in $O$ (for the first time) when starting in $x$, and suppose that all such $T(x)$ have finite expectation. We define $$ g\colon \CalX \rightarrow \realnum, x \mapsto \Ew{T(x)}. $$ Then, for all $t$ with $X_t \not\in O$, $$ \Ew{g(X_{t}) - g(X_{t+1}) \mid t \lt T(X_0)} = 1. $$
Proof. Since $(X_t)_{t \in \natnum}$ is a time-homogeneous Markov chain, let an operator $\theta$ be given such that, for all $t \in \natnum$, $X_{t+1} = \theta(X_t)$. For all $i \in \natnum$, we use $\theta^i$ to denote the $i$-fold self-composition of $\theta$. In particular, for all $t \in \natnum$, we have $$ T(X_t) = \min\set{i \in \natnum}{\theta^i(X_t) \in O} = \min\set{i \in \natnum}{X_{t+i} \in O}. $$ For all $t \in \natnum$, conditional on $t \lt T(X_0)$ (which implies $X_t \not\in O$), we thus have $$ T(X_{t+1}) = \min\set{i \in \natnum}{X_{t+i+1} \in O} = \min\set{i \in \natnum}{X_{t+i} \in O} - 1 = T(X_{t}) - 1. $$ In particular, \begin{align*} \Ew{g(X_{t}) - g(X_{t+1}) \mid t \lt T(X_0)} & = \Ew{\Ew{T(X_t)} - \Ew{T(X_{t+1})} \mid t \lt T(X_0) }\\ & = \Ew{\Ew{T(X_t)} - \Ew{T(X_{t})-1} \mid t \lt T(X_0)}\\ & = 1. \end{align*} This shows the claim.
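To see the theorem in action, the following sketch computes $g(x) = \Ew{T(x)}$ for a small example chain by solving the linear system of hitting-time equations, and verifies that the resulting potential has drift exactly $1$ in every non-target state. The concrete chain is our own toy example, not taken from the text.

```python
from fractions import Fraction

def expected_hitting_times(P, targets):
    """Solve g(x) = E[T(x)] from the linear system
    g(x) = 1 + sum_y P[x][y] * g(y) for x outside the targets
    (and g(x) = 0 on the targets) by Gauss-Jordan elimination."""
    free = [x for x in sorted(P) if x not in targets]
    n = len(free)
    # Build (I - Q) g = 1, where Q is P restricted to non-target states.
    A = [[Fraction(int(x == y)) - P[x].get(y, Fraction(0)) for y in free]
         for x in free]
    b = [Fraction(1)] * n
    for col in range(n):
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col] / A[col][col]
                A[r] = [u - factor * v for u, v in zip(A[r], A[col])]
                b[r] -= factor * b[col]
    g = {x: Fraction(0) for x in targets}
    for i, x in enumerate(free):
        g[x] = b[i] / A[i][i]
    return g

# A small chain on {0, 1, 2, 3} with target 0: from state s > 0 step
# down with prob 1/2, stay with prob 1/4, step up with prob 1/4
# (state 3 stays with prob 1/2 instead of stepping up).
h, q = Fraction(1, 2), Fraction(1, 4)
P = {0: {0: Fraction(1)},
     1: {0: h, 1: q, 2: q},
     2: {1: h, 2: q, 3: q},
     3: {2: h, 3: h}}
g = expected_hitting_times(P, {0})

# Using g as potential, the drift is exactly 1 in every non-target state.
for x in (1, 2, 3):
    assert sum(pr * (g[x] - g[y]) for y, pr in P[x].items()) == 1
```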
The theorem is interesting for understanding what a good potential should be; in order to apply a drift theorem it is, however, completely useless: We could now use upper and lower additive drift theorems and the proven drift of $1$ to derive an expected time of $\Ew{g(X_0)}$ to find an element of $O$ when starting in $X_0$. But $g(X_0)$ is defined to be the expected time to find an element from $O$ when starting in $X_0$, so we arrived where we started.
Note that, in general, after mapping a Markov chain with a potential function, the resulting process is not necessarily a Markov chain anymore. This is not a problem at all, since a well-formulated drift theorem does not need the requirement of the process being a Markov chain, see Section 8 [Drift as an Average: A Closer Look on the Conditioning of Drift] for a discussion.
When considering optimization algorithms, it is sometimes easy to show that the distance to the optimum, or directly the fitness, decreases in expectation in each step; distance or fitness then makes for a good potential function. Sometimes, however, this expected change is too weak, too hard to analyze or even negative. In this case, more inventive potential functions are sought, which is the concern of the remainder of this section.
Before we dive into developing concrete potential functions, we discuss normalizing potential functions. This will later make one decision very easy for us.
Theorem 3.3 [Normalizing Additive Drift]. Let $\CalX$ be some state space and let $(X_t)_{t \in \natnum}$ be a random process on $\CalX$. Let $c \in \realnum_{\gt 0}$ and let $g\colon \CalX \rightarrow \realnum$ be any potential function such that $$\Ew{g(X_{t}) - g(X_{t+1}) \mid g(X_0), …, g(X_t)} \geq c.$$ Then there is a potential function $\overline{g}\colon \CalX \rightarrow \realnum$ such that $$ \Ew{\overline{g}(X_{t}) - \overline{g}(X_{t+1}) \mid \overline{g}(X_0), …, \overline{g}(X_t)} \geq 1, $$ and $\Ew{\overline{g}(X_{0})} = \Ew{g(X_{0})}/c$.
Proof. We choose $\overline{g} = g/c$.
The theorem shows that, whenever there is any potential function at all amenable to analysis by additive drift, there is one with a drift of $1$. Note that normalization has no impact on the resulting time bound, since the starting value is scaled correspondingly (and now equals the time bound derived by the additive drift theorem).

3.2 Potential Functions for Two-Part Drift

In our first example, we want to “glue together” two drift regimes.
Example 3.4 [Gluing Together Fitness Functions]. Let $(X_t)_{t \in \natnum}$ be a discrete integrable process on $[0,n]$ with $X_0 = n$ and let $k \in [0..n]$. Let $T$ be the first time $t$ such that $X_t = 0$. Suppose that we have $$ \Ew{X_{t} - X_{t+1} \mid X_t \geq k} \geq 2 $$ and $$ \Ew{X_{t} - X_{t+1} \mid 0 \lt X_t \lt k} \geq 1. $$ In other words: while the potential is high, we get a drift of $2$, for small values only a drift of $1$. Further assume that, if $X_t \lt k$, then also $X_{t+1}\lt k$. We want to show that $$ \Ew{T} \leq \frac{n+k}{2}. $$ In order to make use of the stronger drift for large values of the process, we choose the potential function $$ g\colon \realnum \rightarrow \realnum, x \mapsto \begin{cases} x, &\mbox{if }x\lt k;\\ (x+k)/2, &\mbox{otherwise.} \end{cases} $$ We have $g(X_0) = (n+k)/2$, so it is sufficient to show a drift of at least $1$ to get our desired bound. This holds trivially for $X_t \lt k$, heavily relying on the fact that this implies $X_{t+1} \lt k$. For the following reasoning, note that, for all $x \in \realnum$, $g(x) \leq (x+k)/2$. For $X_t \geq k$, we see \begin{align*} \Ew{g(X_{t}) - g(X_{t+1}) \mid X_t \geq k} & \geq \Ew{(X_{t}+k)/2 - (X_{t+1}+k)/2 \mid X_t \geq k}\\ & = \Ew{X_{t} - X_{t+1} \mid X_t \geq k} / 2\\ & \geq 1. \end{align*} This shows a drift of $1$ in potential and thus gives the desired bound using Theorem 2.1 [Additive Drift, Upper Bound].
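As a sanity check, the bound of Example 3.4 can be verified exactly on a concrete instance. The step distributions below (down by $3$ or $1$ with probability $1/2$ each above $k$, hence drift $2$; down by $1$ below $k$) are our own illustrative choice satisfying the assumptions, with $k \geq 3$ so that the process never leaves $[0,n]$.

```python
from fractions import Fraction

def expected_time(n, k):
    """Exact E[T] by dynamic programming for a concrete instance:
    from x >= k, step down by 3 or by 1 (prob 1/2 each, so drift 2);
    from 0 < x < k, always step down by 1 (drift 1).
    Requires k >= 3 so that the process stays in [0, n]."""
    assert 3 <= k <= n
    h = Fraction(1, 2)
    f = {0: Fraction(0)}
    for x in range(1, n + 1):
        if x < k:
            f[x] = 1 + f[x - 1]
        else:
            f[x] = 1 + h * f[x - 3] + h * f[x - 1]
    return f[n]

# The bound E[T] <= (n + k) / 2 from Example 3.4 holds for this instance.
for n in range(10, 60, 7):
    for k in range(3, n + 1, 5):
        assert expected_time(n, k) <= Fraction(n + k, 2)
```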
Note that the potential function $g$ used in the proof above is concave, so in the main derivation in the proof we could have reasoned with Jensen's Inequality. Thus, this approach generalizes to other concave potential functions.

3.3 $(1+1)$ EA on Linear Functions

One of the most famous examples of an analysis with potential functions and drift theory is the analysis of the $(1+1)$ EA (see Section 9.1 [Algorithms]) on linear functions. In fact, the paper introducing multiplicative drift analysis [DJW12: Multiplicative Drift Analysis. Benjamin Doerr, Daniel Johannsen, Carola Winzen. Algorithmica 2012.] used this drift theorem with a suitable potential function to show an upper bound of $(1+o(1))\; 1.39e \; n \ln(n)$ on arbitrary linear functions. This bound was later improved to $(1\pm o(1))\; e \; n \ln(n)$, including a matching lower bound, in [Wit13: Tight Bounds on the Optimization Time of a Randomized Search Heuristic on Linear Functions. Carsten Witt. Comb. Probab. Comput. 2013.], with a more intricate potential function that crucially depended on the concrete linear function.
Here we give a simple proof from [DJW12: Multiplicative Drift Analysis. Benjamin Doerr, Daniel Johannsen, Carola Winzen. Algorithmica 2012.], showcasing the use of potential functions, which achieves a bound of $(1+o(1))\; 4e \; n \ln(n)$. We further restrict the linear functions to have no duplicate weights, so that we do not have to treat this edge case.
Theorem 3.5 [$(1+1)$ EA on Linear Functions, no duplicated weights]. Let $f$ be any linear function without duplicate weights. The expected time until the $(1+1)$ EA on $f$ samples the optimum for the first time is $(1+o(1))\; 4e \; n \ln(n)$.
Proof. Let $w_1, …, w_n$ be the weights of $f$. We note that the $(1+1)$ EA is unbiased, that is, “reordering” the bit positions and swapping the “meaning” of $0$ and $1$ leads to analogous behavior of the algorithm [LW12: Black-Box Search by Unbiased Variation. Per Kristian Lehre, Carsten Witt. Algorithmica 2012.]. Thus, we can assume, without loss of generality, that the weights are ordered decreasingly and are positive, that is, $$ w_1 \gt w_2 \gt … \gt w_n \gt 0. $$ Now we define our potential function as follows. Let $g\colon \{0,1\}^n \rightarrow \realnum$ be such that, for all $x \in \{0,1\}^n$, $$ g(x) = \sum_{i=1}^n (2-i/n)(1-x_i). $$ This potential essentially awards a potential weight between $1$ and $2$ to any incorrectly set bit, where a higher potential weight for a bit position corresponds to a higher weight of this bit position in the objective function. We also call $2-i/n$ the potential of bit $i$. For each $t \geq 0$, let $X_t$ be the current best bit string found by the $(1+1)$ EA after $t$ iterations. We want to show that there is multiplicative drift in $(g(X_t))_{t \in \natnum}$. To that end, fix $t \in \natnum$. Let $I = \set{i \leq n}{(X_t)_i=0}$ be the set of positions where the current best bit string has a $0$. We distinguish the following events, which partition the entire event space at iteration $t$: let $D$ be the event that no $0$ bit flips; for each $i \in I$, let $A_i$ be the event that bit $i$ flips and no other $0$ bit flips; and let $C$ be the event that at least two $0$ bits flip. Let $\Delta = g(X_t) - g(X_{t+1})$ be the drift. We can now get the following breakup of the drift by the law of total expectation. \begin{align*} \Ew{\Delta \mid g(X_t)} = \;\; &\Ew{\Delta \mid g(X_t), C}\Pr{C \mid g(X_t)}\\ &+ \Ew{\Delta \mid g(X_t), D}\Pr{D \mid g(X_t)}\\ &+ \sum_{i \in I} \Ew{\Delta \mid g(X_t), A_i}\Pr{A_i \mid g(X_t)}. \end{align*} If no $0$ bit flips (event $D$), then any flip of a $1$ bit will result in worse offspring, which will be discarded; thus, $\Ew{\Delta \mid g(X_t), D} = 0$.
Now consider event $C$. At least two $0$ bits flip, leading to an increase in potential of at least $2$. There are at most $n$ many $1$ bits, each flipping with a probability of $1/n$, and the potential associated with each such bit is strictly less than $2$. Thus we lose (in expectation) strictly less than a potential of $2$ from flipping $1$ bits, but we gain a potential of at least $2$ from flipping $0$ bits. Furthermore, the offspring might be accepted or not, and if it is not accepted, then $\Delta = 0$. Note that the more $1$ bits are flipped, the less likely the offspring is to be accepted, so there is a negative correlation between the number of $1$ bits flipped and the probability of acceptance. This shows $\Ew{\Delta \mid g(X_t), C} \geq 0$.
Now we consider the events $A_i$. Note that, for all $i \in I$, $\Pr{A_i} \geq (1-1/n)^{n-1}/n \geq 1/(en)$, since the event that bit $i$ flips and no other bit does is a subevent of $A_i$: not flipping $n-1$ bits has a probability of $(1-1/n)^{n-1}$, and flipping a specific bit has a probability of $1/n$. Crucially, it is impossible that any bit at a position $\lt i$ flips and the offspring is accepted, since such a bit has a (strictly!) higher weight than bit $i$ (and bit $i$ is the only $0$ bit that flips).
Overall, we have the following. \begin{eqnarray*} \Ew{\Delta \mid g(X_t)} & \geq & \sum_{i\in I} \Ew{\Delta \mid g(X_t), A_i}\Pr{A_i \mid g(X_t)}\\ & \geq & \sum_{i\in I} \frac{1}{en} \Ew{\Delta \mid g(X_t), A_i}\\ & \geq & \frac{1}{en} \sum_{i\in I} \left[ \left(2- \frac{i}{n} \right) - \sum_{j = i+1}^n\frac{1}{n} \left(2 - \frac{j}{n} \right)\right]\\ & = & \frac{1}{en} \sum_{i\in I} \left[ \left(2- \frac{i}{n} \right) - \frac{1}{n} \left(2(n-i) - \frac{\sum_{j = i+1}^n j}{n} \right) \right]\\ & = & \frac{1}{en} \sum_{i\in I} \left[ 2- \frac{i}{n} - \frac{2(n-i)}{n} + \frac{\sum_{j = i+1}^n j}{n^2} \right]\\ & = & \frac{1}{en} \sum_{i\in I} \left[ \frac{- i + 2i}{n} + \frac{n(n+1) - (i+1)i}{2n^2} \right]\\ & \geq & \frac{1}{en} \sum_{i\in I} \left[ \frac{i}{n} + \frac{n(n+1) - (i+1)n}{2n^2} \right]\\ & = & \frac{1}{en} \sum_{i\in I} \left[\frac{i + n/2-i/2}{n} \right]\\ & = & \frac{1}{en} \sum_{i\in I} \left[\frac{1}{2} + \frac{i}{2n} \right]\\ & \geq & \frac{1}{en} \sum_{i\in I} \frac{1}{2}\\ & = & \frac{|I|}{2en}. \end{eqnarray*} Since we have $|I| \geq g(X_t)/2$, we get a multiplicative drift with drift constant $\delta = 1/(4en)$. Using $g(X_0) \leq 2n$, we can apply Theorem 2.5 [Multiplicative Drift] to get $$ \Ew{T} \leq (1+o(1)) \; 4e \; n \ln(n). $$
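For illustration, here is a minimal sketch of the $(1+1)$ EA on a linear function together with the potential $g$ from the proof; the concrete weights (a BinVal-style function) and the seeding are our own choices, not part of the proof.

```python
import random

def one_plus_one_ea(weights, seed=1):
    """Minimal (1+1) EA maximizing f(x) = sum_i weights[i] * x[i] with
    standard bit mutation (rate 1/n); returns iterations until 1^n."""
    rng = random.Random(seed)
    n = len(weights)
    x = [rng.randrange(2) for _ in range(n)]
    f = lambda y: sum(w * b for w, b in zip(weights, y))
    t = 0
    while sum(x) < n:
        # Flip each bit independently with probability 1/n.
        y = [b ^ (rng.random() < 1 / n) for b in x]
        if f(y) >= f(x):  # accept the offspring if it is not worse
            x = y
        t += 1
    return t

def potential(x):
    """The potential g from the proof: an incorrect bit at (1-indexed)
    position i contributes 2 - i/n; weights must be sorted decreasingly."""
    n = len(x)
    return sum((2 - (i + 1) / n) * (1 - b) for i, b in enumerate(x))

n = 50
weights = [2 ** (n - i) for i in range(n)]  # decreasing, BinVal-style
t = one_plus_one_ea(weights)   # terminates after O(n log n) steps typically
assert t >= 0 and potential([1] * n) == 0
```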

3.4 Designing a Potential Function via Step-Wise Differences

In this section we give a method for finding a suitable potential function by defining the potential differences of “neighboring” states. Note that this method was used to find the proof given in Theorem 2.10 [Winning Streaks] and is also the basis of the work in Section 5.6 [Finite State Spaces], with details in [KK18: First-Hitting Times for Finite State Spaces. Timo Kötzing, Martin S. Krejca. PPSN (2) 2018.]. Furthermore, a version of this method for overcoming negative drift was given in [GK14: Robustness of populations in stochastic environments. Christian Gießen, Timo Kötzing. GECCO 2014.; GK16: Robustness of Populations in Stochastic Environments. Christian Gießen, Timo Kötzing. Algorithmica 2016.]. Early work using such an approach can be found in [DJW00: Dynamic Parameter Control in Simple Evolutionary Algorithms. Stefan Droste, Thomas Jansen, Ingo Wegener. FOGA 2000.].
We consider the optimization of a test function which looks like ${\mathrm O{\scriptsize NE}M{\scriptsize AX}}$ for most of the search space, but which has a plateau of constant fitness around the optimum. This fitness function is defined as follows, for a given parameter $k \in \natnum$ (for $x \in \{0,1\}^n$ we use $|x|_1$ to denote the number of $1$s in $x$). $$ {\mathrm P{\scriptsize LATEAU}}_k\colon \{0,1\}^n \rightarrow \natnum, x \mapsto \begin{cases} |x|_1, &\mbox{if }|x|_1 \leq n-k \mbox{ or }|x|_1 = n;\\ n-k, &\mbox{otherwise.} \end{cases} $$ This function is maximized by the bit string $1^n$. All bit strings with a distance between $1$ and $k$ to the optimum have identical fitness, so there is no guiding signal towards the optimum on that so-called plateau.
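For concreteness, the fitness function can be transcribed directly:

```python
def plateau(x, k):
    """The Plateau_k fitness of a bit string x: the number of 1s in the
    OneMax-like region and at the optimum, and constant n - k in between."""
    n, ones = len(x), sum(x)
    return ones if ones <= n - k or ones == n else n - k

assert plateau([1, 1, 1, 1, 0, 0], 2) == 4  # OneMax-like region (n = 6)
assert plateau([1, 1, 1, 1, 1, 0], 2) == 4  # on the plateau
assert plateau([1, 1, 1, 1, 1, 1], 2) == 6  # optimum
```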
We want to study the random search heuristics $(1+1)$ EA and RLS on the plateau function (see Section 9.1 [Algorithms]). For the $(1+1)$ EA, we can analyze the performance as follows. Within $\bigO{n \log n}$ iterations the algorithm will have found a search point on the plateau, that is, at distance at most $k$ to the optimum (this follows analogously to the analysis of $(1+1)$ EA on ${\mathrm O{\scriptsize NE}M{\scriptsize AX}}$). From now on at most $k$ bits will be incorrect, and correcting exactly those $k$ bits and no others has a probability of $$ \left(\frac{1}{n}\right)^k\left(1-\frac{1}{n}\right)^{n-k} \geq \frac{1}{en^k}. $$ Thus, for $k \geq 2$, the total expected optimization time will be $\bigO{n^k}$. This reasoning disregards the analysis of the random walk performed by the algorithm on the plateau.
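The probability bound used above can be checked numerically for a range of values of $n$ and $k$ (using the standard estimate $(1-1/n)^{n-k} \geq (1-1/n)^{n-1} \geq 1/e$ for $k \geq 1$):

```python
import math

# Probability of flipping exactly the k incorrect bits and no others:
# (1/n)^k * (1 - 1/n)^(n-k) >= 1 / (e * n^k).
for n in range(2, 200):
    for k in range(1, min(n, 6) + 1):
        p = (1 / n) ** k * (1 - 1 / n) ** (n - k)
        assert p >= 1 / (math.e * n ** k)
```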
The search heuristic Random Local Search (RLS) exchanges the mutation operator of the $(1+1)$ EA for an operator which flips exactly one bit. The analysis of the $(1+1)$ EA above explicitly makes use of large steps which are not performed by RLS, so a different analysis is required. We now have to understand the random walk on the plateau as an essential part to finding the optimization time, and analyzing it with drift theory provides a nice example of the power of potential functions. In fact, it is somewhat surprising that, also for this fitness function, an analysis with drift theory can find a good bound on the expected optimization time: Using the fitness function as the potential function, the drift for “inner” points on the plateau (where all neighbors are also points on the plateau) is $0$.
We want to develop a potential function that is $0$ at the optimum. We aim for a drift of at least $1$ for RLS optimizing ${\mathrm P{\scriptsize LATEAU}}$, which is no real restriction, given that Theorem 3.3 [Normalizing Additive Drift] shows that we can always normalize the drift. For reasons of symmetry, all bit strings at the same distance to the optimum should have the same potential, so we ask what the potential of a bit string with exactly $d$ many $0$s should be (note that $d$ is the Hamming distance to the optimum). Let us simplify and consider first the case of $d=1$.
On the plateau, any change is accepted by RLS. Thus, if there is only one incorrect bit, RLS will correct it with probability $1/n$ and otherwise, with probability $1-1/n$, make a different (correct) bit incorrect. If we denote the potential difference between $d=0$ and $d=1$ by $a(0)$, and the potential difference between $d=1$ and $d=2$ by $a(1)$, then the expected gain in potential is given by $$ \frac{1}{n} \cdot a(0) - \left(1 - \frac{1}{n}\right) \cdot a(1) = \frac{a(0) - (n-1)a(1)}{n}. $$ We want this quantity to be at least $1$, so, for fixed $a(0)$, we get $a(1) \leq (a(0)-n) / (n-1)$. This is a rather complex term, but note that for $a(0) \geq 2n$, we can choose $a(1) = a(0)/(2n)$, a much simpler term.
Turning to the general case of arbitrary $d$ with $1 \leq d \lt k$, we get a drift of $$ \frac{d}{n} \cdot a(d-1) - \left(1 - \frac{d}{n}\right) \cdot a(d) = \frac{d\cdot a(d-1) - (n-d)a(d)}{n}. $$ Thus, if again $a(d-1) \geq 2n$, we can work with $a(d) = a(d-1) / (2n)$. Inductively, we now have $a(d) = a(0) / (2n)^d$.
Note that we need this to hold for $d$ starting at $d=0$ up until $d=k-1$, since on the plateau we cannot go outward from being exactly $k$ away. We can now choose $a(0)$ to suit all requirements. Concretely, for all $d \in [0..k-1]$ we need $$ \frac{a(0)}{(2n)^d} = a(d) \geq 2n, $$ so $a(0) \geq (2n)^{d+1}$. This restriction is strongest for $d = k-1$, leading to $a(0) \geq (2n)^k$.
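With $a(0) = (2n)^k$, the resulting gaps $a(d) = (2n)^{k-d}$ indeed give a drift of at least $1$ on the plateau; the following sketch verifies this exactly for a few parameter choices.

```python
from fractions import Fraction

def gap(d, n, k):
    """Gap sizes a(d) = a(0) / (2n)^d with a(0) = (2n)^k."""
    return Fraction(2 * n) ** (k - d)

# In a plateau state with d zeros (1 <= d < k), the drift is
# (d * a(d-1) - (n-d) * a(d)) / n; with these gaps it is at least 1.
for n in (10, 25, 50):
    for k in (2, 3, 5):
        for d in range(1, k):
            drift = (d * gap(d - 1, n, k) - (n - d) * gap(d, n, k)) / n
            assert drift >= 1
```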
Having established the sizes of the gaps between the different states of the algorithm, we now define the potential function (denoting the number of $0$s in a bit string $x$ as $|x|_0$) as $$ g\colon \{0,1\}^n \rightarrow \realnum_{\geq 0}, x \mapsto \sum_{i=0}^{|x|_0-1} a(i). $$ This way of defining a potential function as a sum of “gaps” has the advantage that the difference in potential of similar search points is easy to compute.
We now get to the final proof, where we will also need to worry about the “easy” part of the search space (again “gluing together” the drift regimes as in Section 3.2 [Potential Functions for Two-Part Drift]). Note that we simplify the terms $a(i)$ by changing the base from $2n$ to $n$; while the $2n$ suggested itself during proof discovery, the final computations can do without.
Theorem 3.6 [RLS on Plateau, upper bound]. Let $k \geq 2$. The expected time for RLS to optimize ${\mathrm P{\scriptsize LATEAU}}$ is $\bigO{n^{k}}$.
Proof. Let $(X_t)_{t \in \natnum}$ be the current search point of RLS after $t$ iterations. For all $d \in \natnum$ we define $a(d) = n^{k-d}$ and $g_0 = \sum_{i=0}^{k-1} a(i)$. Now we define a potential function $g$ as $$ g\colon \{0,1\}^n \rightarrow \realnum_{\geq 0}, x \mapsto \begin{cases} \sum_{i=0}^{|x|_0-1} a(i), &\mbox{if }|x|_0 \leq k;\\ g_0 + (|x|_0 - k)n, &\mbox{if }|x|_0 \gt k. \end{cases} $$ Intuitively, we artificially distort the gaps where the plateau does not provide a fitness signal and use uniform gaps of size $n$ in the easy part. Note that this is suboptimal for the easy part, but the impact on the overall bound of the theorem will only be in lower order terms, since the time to cross the plateau dominates. Let now $t$ be given and let $x = X_t$ and $x' = X_{t+1}$. We are interested in bounding $\Ew{g(x)-g(x')}$. We will use the law of total expectation and make a case distinction on $|x|_0$. First, let $d$ with $1 \leq d \lt k$ be given and consider an iteration of RLS on a bit string with exactly $d$ many $0$s (where it flips exactly $1$ bit). RLS either gains a potential of $a(d-1)$, with probability $d/n$, or loses a potential of $a(d)$, otherwise. We have \begin{align*} \Ew{g(x) - g(x') \mid |x|_0 = d} & = \frac{d}{n} a(d-1) - \frac{n-d}{n}a(d)\\ & = \frac{d}{n} \cdot n^{k-d+1} - \frac{n-d}{n} \cdot n^{k-d}\\ & = (dn-n+d) \cdot n^{k-d-1}. \end{align*} For $d=1$ this equals $n^{k-2} \geq 1$ (using $k \geq 2$); for $d \gt 1$ this is at least $(d-1)n \cdot n^{k-d-1} \geq n^{k-d}$. Since $d \lt k$, this value is at least $1$.
We now consider the case of $d = k$. Note that, in this case, we cannot lose potential, as the selection of RLS discards any strictly worse search point. Thus, we have \begin{align*} \Ew{g(x) - g(x') \mid |x|_0 = k} & = \frac{k}{n} a(k-1)\\ & = \frac{k}{n} \cdot n^{k-k+1}\\ & = k \geq 1. \end{align*} Finally, we consider the case of $d \gt k$. Also in this case we cannot lose potential; we have \begin{align*} \Ew{g(x) - g(x') \mid |x|_0 = d} & = \frac{d}{n} \cdot n\\ & = d \geq 1. \end{align*} Thus, Theorem 2.1 [Additive Drift, Upper Bound] gives an upper bound on the expected optimization time of the maximal potential value of $g_0 + n(n - k) \leq 2n^{k} + n^2 = \bigO{n^{k}}$, using $g_0 = \sum_{i=1}^{k} n^{i} \leq 2n^{k}$ and $k \geq 2$.
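The three drift cases from the proof can be recomputed exactly; the following sketch checks that the drift is at least $1$ in every non-optimal state for a few concrete $n$ and $k$.

```python
from fractions import Fraction

def drift(d, n, k):
    """Exact drift of the potential from the proof of Theorem 3.6 for RLS
    on Plateau_k, in a state with d zeros (gap sizes a(i) = n^(k-i))."""
    a = lambda i: Fraction(n) ** (k - i)
    if d < k:    # gain a(d-1) w.p. d/n, lose a(d) w.p. (n-d)/n
        return Fraction(d, n) * a(d - 1) - Fraction(n - d, n) * a(d)
    if d == k:   # strictly worse offspring are rejected: no loss possible
        return Fraction(k, n) * a(k - 1)
    return Fraction(d, n) * n  # easy region: gain a gap of n w.p. d/n

# The drift is at least 1 in every non-optimal state.
for n in (10, 30):
    for k in (2, 4):
        for d in range(1, n + 1):
            assert drift(d, n, k) >= 1
```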
For $k$ constant, we can use drift theory with a similar potential function to find a matching lower bound. Note that this is a particular strength of the additive drift theorem: it comes with a matching lower bound without additional requirements on the random process (such as concentration in each step).
Theorem 3.7 [RLS on ${\mathrm P{\scriptsize LATEAU}}$, lower bound]. Let $k \geq 2$ be constant. The expected time for RLS to optimize ${\mathrm P{\scriptsize LATEAU}}$ is $\Omega(n^{k})$.
Proof. Let $(X_t)_{t \in \natnum}$ be the current search point of RLS after $t$ iterations. For all $d \in \natnum$ we define $a(d) = ((n-k)/k)^{k-d}$ and $g_0 = \sum_{i=0}^{k-1} a(i)$. Now we define a potential function $g$ as $$ g\colon \{0,1\}^n \rightarrow \realnum_{\geq 0}, x \mapsto \begin{cases} \sum_{i=0}^{|x|_0-1} a(i), &\mbox{if }|x|_0 \leq k;\\ g_0, &\mbox{if }|x|_0 \gt k. \end{cases} $$ Intuitively, we ignore the run time outside of the plateau, since it does not contribute to the asymptotic bound. Let now $t$ be given and let $x = X_t$ and $x' = X_{t+1}$. We are interested in bounding $\Ew{g(x)-g(x')}$; this time we want an upper bound. We will again use the law of total expectation and make a case distinction on $|x|_0$. First, let $d$ with $1 \leq d \lt k$ be given and let the current bit string $x$ have $|x|_0 = d$. Since RLS will flip exactly $1$ bit, we either gain a potential of $a(d-1)$ (with probability $d/n$) or lose a potential of $a(d)$, otherwise. From $d \leq k$ we see $d(n-k) \leq k(n-d)$ and thus $$ \frac{d}{n} \; \frac{n-k}{k} \leq \frac{n-d}{n}. $$ Now we can derive \begin{align*} \Ew{g(x) - g(x') \mid |x|_0 = d} & = \frac{d}{n} a(d-1) - \frac{n-d}{n}a(d)\\ & = \frac{d}{n} \cdot ((n-k)/k)^{k-d+1} - \frac{n-d}{n} \cdot ((n-k)/k)^{k-d}\\ & \leq \frac{n-d}{n} \cdot ((n-k)/k)^{k-d} - \frac{n-d}{n} \cdot ((n-k)/k)^{k-d}\\ & = 0. \end{align*} An upper bound of $0$ might seem surprising, but in this area of the search space the distortion by the potential function is large enough to arrive at negative drift. We now consider the case of $d = k$. In this case, we cannot lose potential. We have \begin{align*} \Ew{g(x) - g(x') \mid |x|_0 = k} & = \frac{k}{n} a(k-1)\\ & = \frac{k}{n} \cdot ((n-k)/k)^{k-k+1}\\ & = (n-k)/n \leq 1. \end{align*} Finally, we consider the case of $d \gt k$. Since in this case all neighboring search points have the same potential, we again have a drift of $0$: \begin{align*} \Ew{g(x) - g(x') \mid |x|_0 = d} & = 0. \end{align*} The initial potential is $g_0$ with a probability of at least some constant $c$. Thus, Theorem 2.3 [Additive Drift, Lower Bound] gives a lower bound on the expected optimization time of $c g_0 \geq c((n-k)/k)^k = \Omega(n^{k})$ for $k$ constant.
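Symmetrically to the upper bound, the distorted potential from this proof can be checked to have drift at most $1$ everywhere; again an exact recomputation for a few concrete $n$ and $k$.

```python
from fractions import Fraction

def drift_ub(d, n, k):
    """Exact drift of the lower-bound potential from Theorem 3.7 for RLS
    on Plateau_k (gap sizes a(i) = ((n-k)/k)^(k-i)), state with d zeros."""
    a = lambda i: Fraction(n - k, k) ** (k - i)
    if d < k:
        return Fraction(d, n) * a(d - 1) - Fraction(n - d, n) * a(d)
    if d == k:
        return Fraction(k, n) * a(k - 1)
    return Fraction(0)  # outside the plateau the potential is constant g_0

# The drift never exceeds 1, as required for the lower bound.
for n in (10, 30):
    for k in (2, 4):
        for d in range(1, n + 1):
            assert drift_ub(d, n, k) <= 1
```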
As can be seen from the proof, the lower bound extends to super-constant $k$ as $\Omega(((n-k)/k)^{k})$. Note that both this lower bound and the upper bound of $\bigO{n^k}$ are no longer tight for super-constant $k$. In particular, the extreme case of $k=n$ is known as the ${\mathrm N{\scriptsize EEDLE}}$ function (see [GKS99: Rigorous Hitting Times for Binary Mutations. Josselin Garnier, Leila Kallel, Marc Schoenauer. Evol. Comput. 1999.] for the first analysis on ${\mathrm N{\scriptsize EEDLE}}$).

3.5 Further Potential Functions

The literature knows many more example applications of potential functions in order to allow for the application of drift theory. For example, in [FKN+23: Analysis of (1+1) EA on LeadingOnes with Constraints. Tobias Friedrich, Timo Kötzing, Aneta Neumann, Frank Neumann, Aishwarya Radhakrishnan. GECCO 2023.] a clever potential function is used to incorporate a state of the algorithm into the general progress of the algorithm towards the goal. In Section 4.2 of [DKLL17: Bounding bloat in genetic programming. Benjamin Doerr, Timo Kötzing, J. A. Gregor Lagodzinski, Johannes Lengler. GECCO 2017.], the potential function essentially has two parts to allow for a unified drift argument, rather than arguing over two phases.
Further interesting potential functions can be found in [DDK18: Static and Self-Adjusting Mutation Strengths for Multi-valued Decision Variables. Benjamin Doerr, Carola Doerr, Timo Kötzing. Algorithmica 2018.]. One function incorporates the speed (a self-adjusting parameter) of the algorithm and the distance to the optimum into a single potential. Another combines the distances in different dimensions in a suitably scaled way to arrive at a useful potential function.

3.6 Conclusion

While building a potential function is more of an art than a science, there are heuristics which can help.

Creative Commons License Agreement
This work is licensed as Creative Commons Attribution - ShareAlike 4.0 International.
All mathematical statements in this script are freely usable, of course. As a collection and with comments, they are licensed as stated above.
