A BASIC MODEL OF MUTATIONS

. We study a basic model for mutations. We derive exact formulae for the mean time needed to discover the master sequence, the mean returning time to the initial state, or to any Hamming class. These last two formulae are the same than the formulae obtained by Mark Kac for the Ehrenfest model.


Introduction
According to the Darwinian paradigm, the evolution of living creatures is driven by two main forces: mutations and selection.Mutations create new forms of behaviour or new characters, some less fit to their environment, some more, whereas selection favors the reproduction of fitter individuals.On the one hand, mutations may discover very fit characters but without selection, they would be quickly erased by further mutations.On the other hand, selection alone would result in uniform populations, lacking in genetic diversity.The success of an evolutionary process rests on a subtle interaction between mutations and selection.
Let us consider for instance a population of HIV viruses or Drosophila melanogaster.The genetic material of one individual, also called its genotype, is encoded into its DNA, which is a long chain of nucleobases A,T,G or C. To simplify the analysis, we suppose here that there are only two types of nucleobases instead of four, and we denote them by 0 or 1. Selection enters the game through fitness.The fitness describes the adaptation of the individuals to the environnement.The fitness of an individual can be thought as a function of its genotype.For instance, a possible choice for the fitness function is the expected number of offspring of the individual.We consider the situation where all the genotypes are equally fitted, except one, say 0 • • • 0, which has a superior fitness.There exist several mathematical models of evolution combining mutations and selection.The simplest one is perhaps the Moran model, whose dynamics is the following.At each time step, one individual dies, while one individual gives birth to a child (in particular, the population size stays constant).All individuals are equally likely to die, but the fitter individuals having genotype 0 • • • 0 reproduce more often.Mutations occur during reproduction: the genotype of the child is not an exact copy of the one of its parent.
A central question is then to determine the proportion of individuals with genotype 0 • • • 0 in the population after a long time.To answer this question, one should understand how long it takes for a population to escape from the set of the selectively neutral genotypes.We perform this crucial step here.We consider one single individual, we follow his lineage and we compute the time needed for the genotype 0 • • • 0 to be discovered.A model for mutations.We follow the genotypes of the individuals along a lineage.We fix the probability of mutation p ∈ (0, 1).We start from a string X 0 in {0, 1} N , which represents the genotype of the first individual.We denote by X n the genotype of the n-th individual in the lineage.At time n, for each bit of X n , we toss a coin of parameter p to decide whether a mutation occurs on the bit, in which case it is transformed into the complementary digit.All the coins are taken independent, so the probability of having no mutations at all at time n is (1 − p) n .The assumption of independance simplifies considerably the mathematical analysis, and it is also biologically plausible.Indeed the mutations arise because of transcription errors during the replication process, and for long genomes, they are not correlated.It seems however that the mutation probability p varies along the genome, so the next modelling step would be to incorporate this spatial dependance into the model.We end up with a random walk (X n ) n∈N on the hypercube { 0, 1 } N , for which the transition probabilities only depend on the number of differences between two states.Let τ 0 be the hitting time of the master sequence 0 • • • 0, i.e., The goal is then to compute the expected value of τ 0 .Lumping.The smart way to analyze this random walk is to lump together the states of the hypercube into Hamming classes.The Hamming class number i consists of the points which have i digits equal to 1 and N − i equal to 0. So we define a new process (Y n ) n∈N by setting Y n = number of digits of X n equal to 1 .
We obtain a Markov chain with state space { 0, . . ., N }, and we shall provide explicit formulas for its mean passage times.Discovering and recovering time.For 0 ≤ j ≤ N , we define the hitting time of the Hamming class j by The three theorems below give exact formulas for the expected value of τ j , when starting from 0, N , or j.These formulas are surprisingly simple and come out from tricky computations.The discovering time of the master sequence is bounded from above by the traversal time, which is the time needed to reach the class 0 starting from the class N .This corresponds to the situation where we start with a string containing only ones, and we wait until we see a string containing only zeroes.
Theorem 1.1 The mean traversal time is given by The recovering time of the master sequence corresponds to the returning time to the class 0 when starting away from a genotype different from 0 • • • 0. This time is bounded from below by the mean return time to 0 starting from 0, which we compute next.
Theorem 1.2 The mean returning time to the class 0 is given by Of course, the formula of theorem 1.2 is also a straightforward consequence of the classical result expressing the invariant probability measure of a Markov chain in terms of mean recurrence times.However, it does not seem that the other formulas presented in theorems 1.1 or 1.3 are easy consequences of more general results.We compute next a beautiful formula for the returning time to the class j when starting away from a genotype of the same class.
From theorems 1.1 and 1.2, it is easy to infer an estimate on the mean returning time of 0 which is uniform with respect to the starting point.Indeed, a standard coupling argument yields that, for any starting string Taking into account theorems 1.1 and 1.2, we conclude that These inequalities show that the discovering and the recovering times of the master sequence are of order 2 N .It turns out that these results are akin to those related to an old classical model, the Ehrenfest model, that we describe briefly.The Ehrenfest model.Let us consider N balls and two boxes.Initially, all the balls are in the first box.At each time step, one ball is selected at random and is moved from its current box to the other box.The central question is then the following: On average, how long will it take to return to the initial state?In 1947, Mark Kac gave a simple answer to this question in a celebrated paper [?].He considered the evolution of the number of balls in the first box.This process is a Markov chain on { 0, • • • , N }, which is quite different from our process (Y n ) n∈N .For instance, its increments are either −1, 0 or +1.Mark Kac showed that, starting from 0, the average time for the Ehrenfest process to return to its initial state is equal to 2 N , which is the analog of theorem 1.2.He showed also that, when starting from the state j, the average time until return to j is equal to 2 N / N j .Theorem 1.3 gives the analogous result for our model of mutations.
A glimpse of potential theory.We attack here the general case, i.e., we look for a formula for the mean returning time to the class j, when the process starts from the class i.We start from the formula obtained in theorem 1.1, for the specific case where i = N and j = 0. We expand in a geometric series the denominator and we get We exchange the order of the summations and, using the formulas (5.1) and (5.2), we obtain From theorem 1.2, we have also that E τ 0 Y 0 = 0 = 2 N .The above display formula is actually a particular case of a more general identity valid for a large class of Markov chains, that we state in the next theorem.
Theorem 1.4 For any i, j in { 0, . . ., N }, we have In the case of our specific model, we have further a formula for each term in the sum (see formula (4.1)).This way, we get an exact formula for E τ j Y 0 = i , which we present in the next theorem.
Theorem 1.5 For 1 ≤ j ≤ N , the mean returning time to the class j starting from class i is given by Strategy of the proof.We will employ the same method that Mark Kac used for the Ehrenfest model.We shall try to compute E τ j Y 0 = i for 0 ≤ i, j ≤ N .These computations are tricky.We will in fact compute the generating functions of the event { Y n = j } and of the random variable τ j , and we will relate them through a functional equation that we derive in the next section.The mean passage times are equal to the left derivative at 1 of the generating function of τ j .We compute these derivatives by performing a local expansion of the functions around 1. Finally, in the last section, we prove a general formula for E τ j Y 0 = i for 0 ≤ i, j ≤ N , which is based on the potential theory for Markov chains.

A classical probabilistic identity
Throughout the computations, we shall denote by P i the probability conditioned on the event that Y 0 = i.Let us fix 0 ≤ i, j ≤ N .For n ≥ 1, we compute P i (Y n = j) by decomposing the event { Y n = j } according to the hitting time τ j , as follows: We perform a conditioning in the probability in the sum and we get The constitutive property of a Markov chain is that the past influences the future only through the present state.This is rigorously formalized in the Markov property, which yields that Since in addition the Markov chain is time homogeneous, we have that Plugging the last two equalities in the sum, we conclude that To take advantage of this identity, we introduce the generating functions of the event { Y n = j } and of the random variable τ j , i.e., we consider the series Their radius of convergence is at least one.In equation (2.1), we isolate the term corresponding to k = n, and we multiply by z n to get Looking at the sum, we recognize the Cauchy product of the two series.Summing this identity, we conclude that 2) The strategy is now the following.We try to compute the functions F ij .We use then the above identity to obtain the functions G ij .Finally, the mean passage times can be computed from G ij by taking its left derivative at 1:

Single nucleotide dynamics
We suppose here that N = 1, i.e., we focus on the dynamics of a single nucleotide.In this case, the process (X n ) n≥0 is the Markov chain with state space {0, 1} and transition matrix The eigenvalues of M are 1 and 1 − 2p.We compute, for n ≥ 1, Here is a simple illuminating way to realize the dynamics and to understand the expression of the n-th power M n .Let (ε n ) n≥1 be an i.i.d.sequence of Bernoulli random variables with parameter p.At each time step, we use the variable ε n to decide whether X n mutates or not.More precisely, we set Now, the event X n = X 0 occurs if and only if the total number of mutations which happened until time n is even, i.e., Here is a little trick to compute the probability that S n is even.We compute in two different ways the expected value of (−1) Sn .Indeed, we have Obviously, we have therefore we obtain that This way we recover the expression of the diagonal coefficients of M n .Let us define From the above computations, we conclude the following.Conditionally on X 0 = 1, X n is a Bernoulli random variable with parameter p n , i.e., Similarly, conditionally on X 0 = 0, X n is a Bernoulli random variable with parameter 1 − p n .

Multiple nucleotides dynamic
We consider now the case where the number N of nucleotides is larger than one.In our model, the mutations occur independently at each site.
An important consequence of this structural assumption is that the components of X n , (X n (i), 1 ≤ i ≤ N ), are themselves Markov chains like the one studied in the previous section, and these Markov chains are moreover independent.This remark, combined with the results of the previous section, allows to derive explicitly the distribution of Y n .Indeed, suppose that we start from Y 0 = i.This means that i digits in X 0 are equal to 1 and N − i to 0. At time n, in X n , the i digits which were initially equal to 1 are distributed according to a Bernoulli law of parameter p n , the others are distributed according to a Bernoulli law of parameter 1 − p n .The evolution of the nucleotides being independent, these Bernoulli variables are independent, so their sum is distributed as the sum of two independent Binomial random variables: This yields for instance the following formula: This formula is quite complicated.Yet it becomes particularly simple in the cases where i or j is equal to 0 or N .Indeed, we have, for 0 ≤ i ≤ N , and for 0 ≤ j ≤ N , For once, surprisingly enough, these two cases are also the most relevant for genetic applications, so we treat them first.We will indeed compare these extreme cases to the general chain and deduce an estimation on the discovering and returning time.
This section is devoted to the completion of the proof of theorems 1.1 and 1.2.We shall implement the strategy explained at the end of section 2.
Our first goal is to compute the generating function From formulas (4.3) and (3.1), we have We use the binomial expansion to develop the N -th power in order to compute the generating function F N 0 as a sum of geometric series: (5.1) Notice that P N Y 0 = 0 = 0.For convenience, we start the sum defining F N 0 at n = 0 and we obtain a finite number of geometric series: Our next goal is to compute the generating function From formulas (4.2) and (3.1), we have, after binomial expansion: This time, we have P 0 Y 0 = 0 = 1.Adding this term to F 00 , we get again nice geometric series: For 0 ≤ k ≤ N , we introduce the auxiliary functions and we rewrite the expressions of F N 0 and 1 + F 00 as We have computed F N 0 and 1 + F 00 .From the probabilistic identity (2.2), we obtain .
Remember that our ultimate goal is to compute the left derivative of G N 0 at 1.The functions φ k are regular around 1, except the first one, φ 0 , indeed, To get G ′ N 0 (1), we perform a local expansion of G N 0 around 1, as follows: This expansion readily yields the value of the left derivative of G N 0 at 1: .
Replacing φ k (1) by its value, we obtain the formula stated in theorem 1.1.
We proceed similarly to prove theorem 1.2.In fact, we have to compute G ′ 00 (1), and the probabilistic identity (2.2) yields .
We have already computed 1 + F 00 (z) in formula (5.3).We use this expression and we expand around z = 1: This expansion shows that G ′ 00 (1) = 2 N .
6 Proof of theorem 1.3 We shall finally prove the analog of Kac theorem on the mean returning time to the class j, when the process starts from the class j.We write the formula (4.1) with i = j, we reindex the sum by setting ℓ = j − k and we perform the two binomial expansions: For n = 0, we have P j (Y 0 = j) = 1, therefore, after a geometric summation, we get We expand this function around z = 1 and we get thanks to the combinatorial identity stated in the next lemma.
Lemma 6.1 For 0 ≤ j ≤ N , we have Proof.Let us fix j in { 0, . . ., N }, and let us consider a set E having cardinality N .We fix also a subset A of E having j elements.We classify the subsets of E having cardinality j according to the cardinality of their intersection with A and we readily obtain the formula of the lemma.
From the probabilistic identity (2.2), we have thus G jj (z) admits the following expansion around z = 1: From this expansion, we infer that and this concludes the proof of theorem 1.3.
7 Proof of theorem 1.4 It has been observed a long time ago that the equations defining the invariant measure, or the returning time to a set for a random walk, or more generally a Markov chain, are formally equivalent to the equations arising in potential theory, if one interprets the transition probabilities as conductances (see the very nice book [2]).In fact, the formula presented in theorem 1.4 takes its roots in potential theory [4].Let us denote by P the transition matrix of the process (Y n ) n≥0 , defined by ∀i, j ∈ { 0, . . ., N } ∀n ≥ 0 The arguments presented below are in fact valid for a general class of Markov chains with finite state space.For instance, it suffices that P , or one of its powers, has all its entries positive.In this situation, the classical ergodic theorem for Markov chains ensures the existence and uniqueness of an invariant probability measure and the following convergence holds: From now on, we fix j in { 0, . . ., N } and we try to compute E τ j Y 0 = i .The idea is to study the behavior of the Markov chain until the time τ j .
To do so, we introduce the companion matrix G defined by ∀i, k ∈ { 0, . . ., N } G(i, k) = E i τj −1 n=0 The matrix G is called the potential matrix associated with the restriction of P to { 0, . . ., N } \ {j}.The quantity G(i, k) represents the average number of visits of the state k before reaching the state j when starting from i.We introduce also the matrix H given by ∀i, k ∈ { 0, . . ., N } The matrix H describes the distribution of the exit point from the set { 0, . . ., N } \ { j }.In our case, it is necessarily the Dirac mass on j, yet in the general case, the matrix H is more involved!The three matrices P, G, H are linked through a simple identity.Proof.
The matrix G encodes the behaviour of the process until it hits j.Multiplying on the right G by the transition matrix P amounts to perform one further step of the process.This step might either stay inside { 0, . . ., N } \ { j }, in which case we recover the matrix G − I, or it might land in j, and this is where the matrix H enters the game.Let us make this argument rigorous.We have to check that ∀i, k ∈ { 0, . . ., N } GP (i, k) = H(i, k) + G(i, k) − I(i, k) .
If k = j, the formula becomes This ends the proof, since H(i, k) = 0 in this case.

Lemma 7. 2
Denoting by I the identity matrix, we haveGP = H + G − I .