On the curved exponential family in the Stochastic Approximation Expectation Maximization Algorithm

The Expectation-Maximization (EM) algorithm is a widely used method for maximum likelihood estimation in models involving latent variables. When the Expectation step cannot be computed easily, one can use stochastic versions of the classical EM such as the Stochastic Approximation EM (SAEM). This algorithm, however, has the disadvantage of requiring the joint likelihood to belong to the curved exponential family. This hypothesis is a bottleneck in many practical situations where it is not verified. To overcome this problem, Kuhn and Lavielle (2005) introduce a rewriting of the model which "exponentializes" it. It consists in considering the parameter as an additional latent variable following a Normal distribution centered on the newly defined parameters and with fixed variance. The likelihood of this new exponentialized model now belongs to the curved exponential family and stochastic EM algorithms apply. Although often used, there is no guarantee that the estimated mean will be close to the maximum likelihood estimate of the initial model. In this paper, we quantify the error made in this estimation when considering the exponentialized model instead of the initial one. More precisely, we show that this error tends to 0 as the variance of the new Gaussian distribution goes to 0, and we compute an upper bound on it. By verifying those results on an example, we see that a compromise must be made in the choice of the variance between the speed of convergence and the tolerated error. Finally, we propose a new algorithm, with a variance decreasing along the iterations, allowing a better estimation of the initial parameter in a reasonable computation time.


Introduction
With the increase of data, parametric statistical models have become a crucial tool for data analysis and understanding. To be able to describe complex natural phenomena (epidemiology, ecology, finance, disease evolution, etc.), models have an increasing complexity. Some of them are based on observed features or data which are assumed to be generated from a latent random effect. A usual example is the family of mixed effects models, which have been used in pharmacokinetics, pharmacodynamics, shape analysis, etc. In such a context, one aims at optimizing the model parameter to maximize the likelihood of the observed dataset. This likelihood is also called the incomplete likelihood, as the latent variables are unknown. Formally, this writes as follows: let y ∈ R^n be the observation and θ ∈ Θ the model parameter. We call g the incomplete likelihood:

g(y, θ) = ∫ f(y, z, θ) dz .
In that case, z is the latent or missing variable and f is the joint likelihood of the observations and latent variables, depending on a parameter θ ∈ Θ. The Expectation Maximization (EM) algorithm provides a numerical process to answer this problem by computing iteratively a sequence of estimates (θ_k)_{k∈N} which, under several conditions (see Dempster et al. (1977); Wu (1983)), converges towards the maximum likelihood estimate. It proceeds in two steps at each iteration k. First, in the Expectation step (E), the function

Q(θ|θ_{k−1}) = ∫ log(f(y, z, θ)) p(y, z, θ_{k−1}) dz

is computed, where p is the conditional distribution of z given the observations: p(y, z, θ) = f(y, z, θ)/g(y, θ). θ_k is then updated in the Maximization step (M) as the argument of the maximum of the function Q(·|θ_{k−1}).
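As a minimal illustration (not taken from the paper), the incomplete likelihood can be approximated by numerical integration over the latent variable. The toy model below, with z ∼ N(0, 1) latent and y | z ∼ N(z + θ, 1), is a hypothetical example chosen because its marginal is known in closed form (y ∼ N(θ, 2)), so the numerical marginalization can be checked:

```python
import numpy as np

def norm_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var)."""
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def incomplete_likelihood(y, theta, n_grid=20001):
    """g(y, theta) = integral of f(y, z, theta) dz, by a Riemann sum over z."""
    z = np.linspace(-12.0, 12.0, n_grid)
    joint = norm_pdf(y, z + theta, 1.0) * norm_pdf(z, 0.0, 1.0)  # f(y, z, theta)
    return joint.sum() * (z[1] - z[0])

y_obs, theta = 1.3, 0.5
g_num = incomplete_likelihood(y_obs, theta)
g_exact = norm_pdf(y_obs, theta, 2.0)  # closed-form marginal: y ~ N(theta, 2)
```

In realistic latent variable models no such closed form exists, which is precisely why EM-type algorithms are needed.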
In many cases, the (E) step is in fact intractable, as we have no closed form for Q. Different algorithms, both deterministic and stochastic, have been introduced in the literature to overcome this problem. The Monte-Carlo EM (Wei and Tanner (1990)) replaces the (E) step by a Monte Carlo approximation of Q using a large amount of simulated missing data z. Another possibility, more computationally efficient, is to use a Stochastic Approximation (SA) of the function Q. This SAEM algorithm has been introduced in Delyon et al. (1999), and the authors proved the convergence towards a local maximum of the incomplete likelihood with probability 1 under several hypotheses. It has later been generalized in Kuhn and Lavielle (2004) to the case where we are not able to easily sample z. This new algorithm, called the SAEM Markov Chain Monte Carlo (SAEM-MCMC), replaces the sampling of z by one step of a Markov chain targeting the conditional distribution p. Those two algorithms have then been applied in many different contexts: deformable models (Allassonnière et al., 2010; Bône et al., 2018; Debavelaere et al., 2020; Schiratti et al., 2015), Independent Component Analysis (Allassonniere et al., 2012) and many medical problems (see Benzekry et al. (2014); Guedj and Perelson (2011); Lavielle and Mentré (2007); Sissoko et al. (2016) among many others).
Among the hypotheses ensuring the convergence of most of these algorithms, and in particular our focus, the SAEM algorithm, one of the most restrictive is the necessity for the joint likelihood to belong to the curved exponential family. This writes:

f(y, z, θ) = exp(−Ψ(θ) + ⟨S(z), Φ(θ)⟩) ,

where S is called a sufficient statistic of the model and Φ and Ψ are two functions on Θ. Similarly, the different extensions of the SAEM algorithm, and some of the EM algorithm, carry the same assumption (Kuhn et al., 2020; Lartigue et al., 2020; Panhard and Samson, 2009; Samson et al., 2006).
However, this hypothesis can in fact be a bottleneck in many situations. For example, it is not verified for heteroscedastic models (Dubois et al., 2011; Kuhn and Lavielle, 2005), nor for more complex models (Bône et al., 2018; Debavelaere et al., 2020; Lindsten, 2013; Meza et al., 2012; Schiratti et al., 2015; Wang, 2007). Most authors then choose to compute the maximization step using a gradient descent. However, in that case, there is no theoretical guarantee of convergence. Moreover, the computational complexity increases: one needs to compute the gradient descent steps and the stochastic approximation of the complete likelihood, while this function may not have a simple form. To solve this problem, Kuhn and Lavielle (2005) propose to transform the initial model to make it curved exponential.
Their solution consists in considering the parameter θ as a realization of a Gaussian vector of mean θ̄ and fixed variance σ². θ then becomes an additional latent variable and the new parameter to estimate is θ̄. We call this new model the exponentialized model. It now belongs to the curved exponential family. However, as the likelihood of this exponentialized model is different, the function to maximize has also been modified. In particular, there is no guarantee that the new parameter to estimate is close to the initial one. Nevertheless, this trick has been successfully used in different situations (Ajmal et al. (2019); Bône et al. (2018); Debavelaere et al. (2020); Schiratti et al. (2015) among others). In this paper, we will study the maximum likelihood of this new exponentialized model and measure its distance to one of the maxima of the initial likelihood. More precisely, we will show that this distance goes to 0 as the variance σ² of the exponentialized model tends to 0. We will also provide an upper bound on this error when σ is small enough. Finally, we will verify those results on an example. This example will show us that a compromise must be made in the choice of σ. Indeed, if σ is too big, a substantial error is made in the estimation. However, for σ too small, despite the theoretical guarantees, the numerical convergence is difficult to obtain. To overcome this problem, we will present a new algorithm allowing a better estimation of the initial parameter θ in a reasonable computation time.

Presentation of the Stochastic Approximation Expectation Maximization (SAEM) Algorithm
In this section, we recall the Stochastic Approximation Expectation Maximization (SAEM) algorithm, first presented in Delyon et al. (1999), and the hypotheses ensuring its convergence. In the following, we suppose that the observation y belongs to R^n, the latent variable z to R^l, and that the parameter space Θ is an open subset of R^p with n, l, p ∈ N*.

Expectation Maximization (EM) Algorithm
The original EM algorithm proposes to maximize a function defined via:

g(θ) = ∫_{R^l} f(y, z, θ) µ(dz) ,

with f the joint likelihood of the model and µ a σ-finite measure on R^l.
This situation is of interest to estimate the parameters of a statistical model using maximum likelihood estimates where the model depends on unobserved latent variables.
The Expectation-Maximization consists of iterations which guarantee an increase in g(θ_k) at each step. Starting from θ_0, the algorithm iterates:

- Expectation. Compute Q(θ|θ_{k−1}) = ∫ log(f(y, z, θ)) p(y, z, θ_{k−1}) dz ;
- Maximization. Set θ_k ∈ argmax_θ Q(θ|θ_{k−1}) ;

where p is the conditional distribution of z given the observations: p(y, z, θ) = f(y, z, θ)/g(y, θ).
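For concreteness, here is a minimal EM iteration on a hypothetical example outside the paper: a two-component Gaussian mixture with unit variances and equal weights, where the latent variable z is the component label. The E step reduces to computing responsibilities, and the M step maximizes Q in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic data from the mixture 0.5 N(-2, 1) + 0.5 N(2, 1)
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(2.0, 1.0, 200)])

def em_step(y, mu):
    # (E) responsibilities r_i = p(z_i = 1 | y_i, mu); Q is fully determined by them
    d0 = np.exp(-(y - mu[0]) ** 2 / 2.0)
    d1 = np.exp(-(y - mu[1]) ** 2 / 2.0)
    r = d1 / (d0 + d1)
    # (M) argmax of Q: responsibility-weighted means of the observations
    return np.array([np.sum((1.0 - r) * y) / np.sum(1.0 - r),
                     np.sum(r * y) / np.sum(r)])

mu = np.array([-0.5, 0.5])   # theta_0
for _ in range(100):         # each iteration increases the incomplete likelihood
    mu = em_step(y, mu)
```

Here the M step is available in closed form; the stochastic variants below address the cases where the E step itself cannot be computed.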

SAEM Algorithm
Because the expectation with respect to the conditional distribution p(y, z, θ) is often intractable in practice, a different approach suggests replacing the E-step by a stochastic approximation of Q, starting from θ_0 and Q_0 = 0. This gives us the following algorithm:

- Simulation. Generate z_k, a realization of the hidden variable under the conditional density p(y, z, θ_{k−1}).
- Stochastic approximation. Set Q_k(θ) = Q_{k−1}(θ) + γ_k (log f(y, z_k, θ) − Q_{k−1}(θ)), (2) where (γ_k)_{k∈N} is a decreasing sequence of step sizes.
- Maximization. Set θ_k ∈ argmax_θ Q_k(θ).
Convergence of this procedure is shown under the following hypotheses:

(M1) The parameter space Θ is an open subset of R^p, and f can be written:

f(y, z, θ) = exp(−Ψ(θ) + ⟨S(z), Φ(θ)⟩) ,

where S(·) is a Borel function taking its values in S, an open subset of R^{n_s}. In that case, we say that f belongs to the curved exponential family. Moreover, the convex hull of S(R^l) is included in S and, for all θ ∈ Θ, ∫ |S(z)| p(y, z, θ) dz < ∞.

(M2) The functions Ψ and Φ are twice continuously differentiable on Θ.
(SAEM2) θ̂ : S → Θ and the observed-data log-likelihood l : Θ → R are n_s times differentiable.
(SAEM3) For all positive Borel functions φ, E[φ(z_{k+1}) | F_k] = ∫ φ(z) p(y, z, θ_k) dz, where z_k is the missing value simulated at step k under the conditional density p(y, z, θ_{k−1}) and F_k is the family of σ-algebras generated by the random variables S_0, z_1, ..., z_k.
With the hypothesis (M1) specifying the form of the complete likelihood and (M5) giving us the existence of a maximizer θ̂, the algorithm can take a simpler form. Indeed, using the fact that Q is fully defined by a sufficient statistic S, we remark, by linearity, that the stochastic approximation (2) only needs to be applied to this sufficient statistic. Similarly, the maximization step can be rewritten using only the sufficient statistic and θ̂. This gives the following algorithm:

- Simulation. Generate z_k, a realization of the hidden variable under the conditional density p(y, z, θ_{k−1}).
- Stochastic approximation. Set S_k = S_{k−1} + γ_k (S(z_k) − S_{k−1}).
- Maximization. Set θ_k = θ̂(S_k).
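As a sketch of this sufficient-statistic form (on a model chosen for illustration, not taken from the paper), consider z_i ∼ N(θ, 1) latent and y_i | z_i ∼ N(z_i, 1) observed. The complete likelihood is curved exponential with sufficient statistic S(z) = Σ_i z_i and θ̂(S) = S/n, the conditional distribution is z_i | y_i, θ ∼ N((y_i + θ)/2, 1/2), and the marginal MLE is θ̂ = ȳ, so the SAEM iterates can be checked against it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
z_true = rng.normal(3.0, 1.0, n)
y = z_true + rng.normal(0.0, 1.0, n)  # observations; the marginal MLE is y.mean()

theta, S = 0.0, 0.0
for k in range(1, 2001):
    # Simulation: z | y, theta ~ N((y + theta) / 2, 1/2)  (conjugate computation)
    z = rng.normal((y + theta) / 2.0, np.sqrt(0.5))
    # Stochastic approximation on the sufficient statistic S(z) = sum(z)
    gamma = 1.0 if k < 100 else 1.0 / (k - 99)
    S = S + gamma * (z.sum() - S)
    # Maximization: closed-form theta_hat(S) = S / n
    theta = S / n
```

The step sizes γ_k are kept equal to 1 during a short burn-in and then decrease as 1/k, a standard choice satisfying the usual Robbins-Monro conditions.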
We finally assume the following hypothesis:

Remark 1
The assumption (A) can be relaxed by projecting the sequence (S_k)_{k∈N} on increasing compacts. See Andrieu et al. (2005) for more details.
Under the hypotheses (M1)-(M5), (SAEM1)-(SAEM4) and (A), it was shown in Andrieu et al. (2005) that the distance between the sequence generated by the SAEM and the set of stationary points of the observed likelihood g converges almost surely towards 0.
However, in numerous cases, even quite simple ones (Dubois et al., 2011; Kuhn and Lavielle, 2005), the joint likelihood f does not verify the hypothesis (M1), as it does not belong to the curved exponential family. In the next section, we will present a trick allowing us to approximate the maximum likelihood when (M1) is not verified. In the following, to simplify the notations, we no longer write the variable y in the different expressions.

Exponentialization process
We now denote by (θ, ψ) the parameters of g, where θ ∈ Θ and ψ ∈ Ψ = R^m, and we tackle the case where the model cannot be written under the curved exponential form (4) because of the parameter ψ. In that case, the log-likelihood can only be written as:

log f(y, z, θ, ψ) = −Ψ(θ) + ⟨S(z), Φ(θ)⟩ + h(z, ψ) ,

where the term h(z, ψ), coupling the latent variable and ψ, cannot be put under the form of a scalar product between a sufficient statistic and a function of ψ, and f does not belong to the curved exponential family.
Here, the parameters θ are separable from the latent variables z and do not require further transformation. The other parameters ψ are at the source of the computational problem, and the exponentialization process will only be applied to them. It must be noticed that, in some cases, θ can be empty.
The trick proposed in Kuhn and Lavielle (2005) is to consider ψ as a Gaussian random variable ψ ∼ ⊗N(ψ̄, σ²), where the notation ⊗N(·, ·) denotes a multivariate Gaussian distribution with diagonal covariance matrix. Hence, in this augmented model, ψ is no longer a parameter but becomes an additional latent variable, while a new parameter ψ̄ appears.
The variance σ² is chosen by the user and should be reasonably small in order to minimally perturb the original model. In practice, this variance should at the same time be chosen reasonably large in order to speed up the parameter estimation (see experiments in Section 3).
The complete log-likelihood of this exponentialized model then writes:

log f_σ(y, z, ψ, θ, ψ̄) = log f(y, z, θ, ψ) − (m/2) log(2πσ²) − ‖ψ − ψ̄‖² / (2σ²) .   (5)

It is easy to check that this complete log-likelihood now belongs to the curved exponential family with sufficient statistics (S(z), ψ). Concerning the parameter θ, the maximization is done as usual: θ_{k+1} = θ̂(S_k), with S_k the stochastic approximation of the (S(z_i))_{i≤k}. The update of the parameter ψ̄ can for its part be written as:

ψ̄_{k+1} = ψ̄_k + γ_{k+1} (ψ_{k+1} − ψ̄_k) ,

where ψ_{k+1} is the value of the latent variable ψ sampled at iteration k+1. If we suppose that this augmented model satisfies the hypotheses (M1)-(M5), (SAEM1)-(SAEM4) and (A), we know, using the theorem proved in Andrieu et al. (2005), that the algorithm will converge towards a critical point of its incomplete likelihood. However, although this process is used in several applications (Lavielle, 2014), there is in fact no guarantee that the algorithm will converge towards a critical point of the incomplete log-likelihood of the initial model.
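To make the update concrete, here is a hypothetical one-dimensional illustration (not the paper's model): y_i ∼ N(exp(ψ), 1), whose joint likelihood is not curved exponential in ψ but whose MLE is ψ* = log(ȳ). The exponentialization trick makes ψ ∼ N(ψ̄, σ²) an additional latent variable, sampled below by one Metropolis-Hastings step per iteration (SAEM-MCMC style), while ψ̄ is updated by stochastic approximation:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
y = rng.normal(np.exp(1.0), 1.0, n)  # y_i ~ N(exp(psi), 1), true psi = 1
psi_mle = np.log(y.mean())           # exact MLE of the initial model

def log_joint(psi, psi_bar, sigma):
    # complete log-likelihood of the exponentialized model (psi is now latent),
    # up to constants that cancel in the Metropolis-Hastings ratio
    return (-0.5 * np.sum((y - np.exp(psi)) ** 2)
            - (psi - psi_bar) ** 2 / (2.0 * sigma ** 2))

sigma, psi, psi_bar, S = 0.3, 0.0, 0.0, 0.0
for k in range(1, 3001):
    # Simulation: one Metropolis-Hastings step targeting p(psi | y, psi_bar)
    prop = psi + 0.05 * rng.standard_normal()
    if np.log(rng.uniform()) < log_joint(prop, psi_bar, sigma) - log_joint(psi, psi_bar, sigma):
        psi = prop
    # Stochastic approximation on the sufficient statistic psi, then update psi_bar
    gamma = 1.0 if k < 500 else 1.0 / (k - 499)
    S = S + gamma * (psi - S)
    psi_bar = S
```

In this example the likelihood is sharply peaked and nearly symmetric, so ψ̄ lands close to ψ*; the next sections quantify the bias that appears when the likelihood is asymmetric around its maximum.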
In the following section, we show that, in general, the parameter returned by the SAEM on the exponentialized model is indeed not a maximum likelihood of the initial model. However, when σ goes to 0, it converges towards a critical point of the incomplete log-likelihood of the initial model. We also give an upper bound on the error made by this process for small σ.
It is interesting to notice that, even though this proof is carried out in the context of the SAEM algorithm, the same results can be obtained for the MCMC-SAEM (Kuhn and Lavielle, 2004) as well as for the Approximate SAEM (Allassonnière and Chevallier, 2019).

Distance between the limit point and the nearest critical point
In this section, we first present an equation satisfied by the limit of the sequence of estimated parameters of the SAEM algorithm for the exponentialized model. Using this equation, we then give an upper bound on the distance between this limit point and the nearest critical point of the incomplete likelihood of the non-exponentialized model. This upper bound will in particular show us that this distance tends to 0 when σ goes to 0.

Equation verified by the limit
We now state a theorem giving an equation satisfied by the limit parameter estimated by the SAEM algorithm applied to the exponentialized model. It is important to remark that, if the set of critical points of l is finite, then the SAEM algorithm converges almost surely towards one of them (and not only towards a point at zero distance from the set). Hence, we can study the parameters ψ̂_σ returned by the SAEM on the exponentialized model and look at their behaviour when σ goes to 0.
Theorem 1 Assume that the exponentialized model with variance σ verifies the hypotheses (M1)-(M5), (SAEM1)-(SAEM4), (A) and that Ψ = R^m. Assume also that, for all σ > 0, L_σ is finite, where l_σ refers to the observed log-likelihood of the exponentialized model of variance σ. Then, the sequence returned by the SAEM algorithm converges almost surely towards (θ_∞, ψ̂_σ), solution of the following set of equations:

θ_∞ ∈ argmax_θ g(θ, ψ) ,

∫_{R^m} v g(θ_∞, ψ̂_σ + v) exp(−‖v‖²/(2σ²)) dv = 0 .   (7)

Remark 2 Here, we suppose Ψ = R^m to be able to define a Gaussian distribution on Ψ. The following proofs would be adaptable as long as one can define such a Gaussian distribution, which is necessary for applying the exponentialization trick.
Proof The update of (S_k, ψ̄_k) can easily be seen as a Robbins-Monro update, where z_k, v_k are sampled following the conditional law f_σ(y, z, ψ̄_k + v | θ_k, ψ̄_k).
Proposition 1 Suppose that a point ψ ∈ R^m verifies, for all v ∈ R^m:

g(θ_∞, ψ + v) = g(θ_∞, ψ − v) .   (8)

Then, ψ is a solution of Eq. (7) and can be the parameter returned on the exponentialized model.
Remark 3 It is not necessarily the only possible returned parameter. Several values of ψ could be solutions of Eq. (7).
In particular, a solution of (8) is a critical point of g. However, a critical point of g is not always a solution of such an equation and will not always be a solution of Eq. (7). It is in fact easy to find cases where the maximum is not a solution of Eq. (7), and hence where the exponentialized model introduces a bias in the estimation of the maximum likelihood. In the next subsection, we introduce a function g presenting such a behaviour and we explain the heuristics behind Theorem 1.

Heuristics
We want to compare the solutions of equation (7) to a maximum of the function g. Because of the form of f supposed in equation (4), we see that we can maximize g in θ and ψ independently of each other. In particular, we still immediately have θ_∞ ∈ argmax_θ g(θ, ψ) (independent of ψ, as can be seen in equation (4)).
To explain equation (7), we introduce the function g plotted in Figure 1. This function has a maximum at ψ = √2 but is not symmetric around it. We will look at the integral

∫ v g(ψ̄ + v) exp(−v²/(2σ²)) dv

for different values of ψ̄ and σ. ψ̄ is a solution of equation (7) if and only if this integral is null. It is interesting to remark that one can consider this integral as an expectation if normalized.
First, we look at the case ψ̄ = √2, the argmax of g, and σ = 1 in Figure 2a. In that case, because g is not symmetric around its maximum, v ↦ g(√2 + v) exp(−v²/(2σ²)) is not symmetric either. In particular, it means that √2 is not a solution of equation (7), as the integral is strictly positive.
We then reduce the value of σ by taking σ = 0.1 in Figure 2b. The function is still not symmetric around 0. Even if the value of the integral is smaller, √2 is still not a solution of equation (7).
We now interest ourselves in the case where ψ̄ is not the argmax of g, by taking ψ̄ = 1 and σ = 1 in Figure 3. This time, g is strictly increasing from 1 to √2. Hence, even after multiplying by the exponential, v ↦ g(1 + v) exp(−v²/(2σ²)) is still increasing at 0. As g decreases slower than it increases, 1 cannot be a solution of equation (7): the integral is indeed strictly positive. The same behaviour would be observed for any point before √2.
We now look at a value bigger than the maximum: ψ̄ = 4 and σ = 5 in Figure 4a. This time, as g decreases at 4, v ↦ g(4 + v) exp(−v²/(2σ²)) still decreases at 0. But g decreases slower than it increases. Hence, it is possible to compensate this difference of variation by taking some ψ̄ > √2 as a solution of equation (7).
Let us now take a smaller value of σ, as in Figure 4b. This time, the integral is negative. Indeed, v ↦ g(4 + v) exp(−v²/(2σ²)) still decreases at 0. However, due to the multiplication by the exponential, the difference of variation before and after the maximum is now much smaller. In particular, it is this time too small to compensate the decrease at 0, and the integral is negative. To have a solution of equation (7) for this value of σ, we would need to choose a smaller value of ψ̄. This suggests that, as σ goes to 0, the solution of equation (7) gets closer to √2.
From these examples, we can deduce two things. First, the argmax of g is not always a solution of equation (7), even for small values of σ, because there is a difference in the speed of variation before and after this maximum. However, it is possible to compensate this difference of variation by choosing a value different from the argmax, and thus to find a solution of (7) different from the argmax. Moreover, when σ goes to 0, the difference of variation obtained by multiplying by the exponential gets smaller and smaller. It means that a parameter closer and closer to the argmax will be a solution of equation (7). We illustrate this behaviour by plotting the exact value of the solution of equation (7) as a function of σ in Figure 5. In the following, we write ψ_M the critical point of g(θ, ψ) minimizing the distance to ψ̂_σ. Using the heuristics presented above, we will, in the next section, state the theorem giving us an upper bound on the distance to the nearest critical point of g. We will then prove it in the case Ψ = R. A more general proof in R^m for m ≥ 2 is given in the supplementary material.
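These heuristics can be checked numerically. Since the exact function g of the figures is not reproduced here, the sketch below uses a hypothetical asymmetric unimodal function with argmax at √2 (steeper before its maximum than after it, an assumption made only for illustration), evaluates the integral of equation (7) on a grid, and solves for its root ψ̄ by bisection:

```python
import numpy as np

sqrt2 = np.sqrt(2.0)

def g(psi):
    # hypothetical asymmetric unimodal function, argmax at sqrt(2):
    # steeper on the left of the maximum than on the right
    u = np.asarray(psi) - sqrt2
    return np.where(u < 0.0, np.exp(-u ** 2), np.exp(-u ** 2 / 4.0))

def h(psi_bar, sigma):
    # numerical version of the integral in equation (7)
    v = np.linspace(-8.0 * sigma, 8.0 * sigma, 40001)
    w = v * g(psi_bar + v) * np.exp(-v ** 2 / (2.0 * sigma ** 2))
    return w.sum() * (v[1] - v[0])

def solve(sigma, lo=sqrt2, hi=sqrt2 + 3.0):
    # bisection: h is positive at the argmax and negative far to its right
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid, sigma) > 0.0 else (lo, mid)
    return 0.5 * (lo + hi)
```

For this g, the integral at ψ̄ = √2 is strictly positive, so the argmax is not a solution of (7); the root lies to its right and moves back towards √2 as σ decreases, as the heuristics predict.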

Upper bound on the distance between ψ̂_σ and the nearest critical point of g
Theorem 2
1. Assume that the exponentialized model verifies the hypotheses (M1)-(M5), (SAEM1)-(SAEM4) and (A). Assume also that, for all σ > 0, L_σ is finite, that L := {ψ ∈ R^m | ∂_ψ g(θ_∞, ψ) = 0} is compact, and that there exists a compact K such that, ∀σ > 0, ψ̂_σ ∈ K. Then, d(ψ̂_σ, L) → 0 as σ → 0.
2. Assume also that L is finite and that, for all ψ_M ∈ L, there exists an integer l_M such that g is l_M times continuously differentiable around ψ_M with a non-zero l_M-th differential of g(θ_∞, ·) at ψ_M. We write l = max_{ψ_M ∈ L} l_M. Then, there exists c > 0 such that, for σ small enough,

d(ψ̂_σ, L) ≤ c σ^{l/(l−1)} .   (9)

3. Assume that ψ ↦ g(θ_∞, ψ) and ψ ↦ ψ_k g(θ_∞, ψ) are integrable for all k between 1 and m. Then, we have the following approximation of ψ̂_σ when σ goes to infinity:

ψ̂_σ → ∫_{R^m} ψ g(θ_∞, ψ) dψ / ∫_{R^m} g(θ_∞, ψ) dψ .

Remark 4 When m = 1, l_M is the smallest integer such that the l_M-th derivative of g(θ_∞, ·) at ψ_M is not 0. The inequality (9) indicates that the convergence will be slower when the function to maximize behaves like a flat curve around the maximum, which was expected.
If such an l_M does not exist, it means that g is constant in at least one direction around ψ_M.
Remark 5 We need to take a maximum over all ψ_M in L since, from one σ to another, ψ̂_σ can approach a different maximum ψ_M ∈ L. This is also why the upper bound depends on this maximum l: it is constrained by the critical point for which the convergence is the slowest.
Remark 6 For m = 1, the exact value of the constant in (9) can be computed.
Remark 7 The results presented here remain true in the case of the SAEM-MCMC algorithm. Indeed, the equation (7) verified by the limit parameter of the SAEM is still verified in the case of the SAEM-MCMC, and the same reasoning can then be applied.
In the following, we present the proof in the case m = 1. The proof in the multi-dimensional case follows the same ideas as in dimension one but is more technical. It is presented in the supplementary material.
Proof We present the proof in the case m = 1. As the maximum does not depend on θ_∞, and to simplify notations, we omit the variable θ in g and write g(ψ) = g(θ_∞, ψ). We suppose that ψ̂_σ is never a critical point of g for σ small enough; otherwise, we directly have the result. The equation (7) writes:

∫ v g(ψ̂_σ + v) exp(−v²/(2σ²)) dv = 0 .

First step: d(ψ̂_σ, L) → 0 as σ → 0.
By contradiction, even if it means extracting a subsequence, we can suppose that there exists c > 0 such that ∀σ > 0, d(ψ̂_σ, L) > 3c. Because there is no critical point between ψ̂_σ − c and ψ̂_σ + c, g is either increasing or decreasing on [ψ̂_σ − c, ψ̂_σ + c]. We first suppose it is increasing. In particular, K_0 := K \ {y | d(y, L) < c} is compact and thus c_0 := inf{g'(y) | y ∈ K_0, g'(y) ≥ 0} > 0. According to equation (7), the integral on [−c, c] must have the same absolute value as the integral on [−c, c]^c. However, when σ goes to zero, the first one converges towards 0 much more slowly than the second one. Indeed, the integral on [−c, c]^c is bounded in absolute value by 2 (sup_K g) σ² exp(−c²/(2σ²)). On the other hand, writing the integral on [−c, c] as ∫_0^c v (g(ψ̂_σ + v) − g(ψ̂_σ − v)) exp(−v²/(2σ²)) dv and using the mean value theorem, it is bounded from below by 2 c_0 ∫_0^c v² exp(−v²/(2σ²)) dv, a quantity of order c_0 σ³ (as can be seen with an integration by parts involving the error function erf). It is easy to find the same type of inequality if g is decreasing on [ψ̂_σ − c, ψ̂_σ + c]: in that case, by integrating only on {v ≥ c} and considering this time c_1 := sup{g'(y) | y ∈ K_0, g'(y) ≤ 0} < 0, we find the analogous bound with C := 2 max(c_0, −c_1) > 0. By taking σ to 0 and using the fact that erf(x) → 1 as x → ∞, we find C ≤ 0, which is a contradiction. Hence, we have proved that d(ψ̂_σ, L) → 0 as σ → 0. The next step is to find an upper bound on d(ψ̂_σ, L).

Second step: Search of the upper bound
In the following, we suppose that the critical point towards which ψ̂_σ converges is a maximum. In practice, this will always be the case, as any other critical point would be numerically unstable. Theoretically, a set of conditions (LOC1)-(LOC3) is given in Delyon et al. (1999) ensuring the convergence towards a local maximum.
We write ψ_M the closest critical point to ψ̂_σ and α_σ = |ψ̂_σ − ψ_M|. We also write l_M the smallest integer such that g^(l_M)(ψ_M) ≠ 0. Moreover, as explained above, we assume that ψ_M is a maximum. It must be remarked that ψ_M depends on σ. However, as L is finite, we will be able to consider maxima over L at the end of the proof. Since we assume that ψ_M is a maximum, l_M is even and g^(l_M)(ψ_M) < 0 for σ small enough. As before, we split the integral (7) into two parts: {v | |v| < α_σ} and {v | |v| > α_σ}. The idea behind the computations is that α_σ cannot be too big without making the absolute value of the integral on {v | |v| < α_σ} strictly greater than the one on {v | |v| > α_σ}.
In the case where g is decreasing on [ψ̂_σ − α_σ, ψ̂_σ + α_σ], we have α_σ = ψ̂_σ − ψ_M, and it is easy to show that the analogous bounds hold. Hence, we can use the same inequalities as before to find again the upper bound (9).

Third step: Approximation when σ goes to infinity
We use again the equation (7): for all σ > 0,

∫ v g(ψ̂_σ + v) exp(−v²/(2σ²)) dv = 0 .

Using the change of variable ψ = ψ̂_σ + v, we find:

∫ (ψ − ψ̂_σ) g(ψ) exp(−(ψ − ψ̂_σ)²/(2σ²)) dψ = 0 .

But ψ̂_σ is supposed to stay in a compact, so, for all ψ, exp(−(ψ − ψ̂_σ)²/(2σ²)) → 1 as σ → ∞. Using the integrability of g and ψ ↦ ψ g(ψ), it is easy to conclude with the dominated convergence theorem.
3 Simulation of a counterexample

In this section, we demonstrate on a concrete situation that the maximum likelihood of g is indeed not reached by the SAEM algorithm applied to the exponentialized model.

Application of the SAEM algorithm to the exponentialized model
We choose to study a heteroscedastic model where the variance depends on the observation. This model has been used in Kuhn and Lavielle (2005) in order to analyze the growth of orange trees. The parameters to estimate are the age β_1 at half asymptotic trunk circumference and the growth scale β_2 of n orange trees, according to the measurements of their circumference y_{i,j} at m different ages x_j.
We suppose that our observations y_{i,j} verify, for i between 1 and n and j between 1 and k_i:

y_{i,j} = φ_i / (1 + exp(−(x_j − β_1)/β_2)) + ε_{i,j} ,

where the ε_{i,j} are independent noises of distribution N(0, σ_ε²), with variance σ_ε² supposed to be known. φ_i is treated as a random effect and is supposed to follow a Gaussian distribution of mean µ, to be estimated, and known variance τ². Such a model cannot be written in an exponential form because of the parameters β_1 and β_2, and we hence consider an exponentialized model where β_1 and β_2 are treated as random effects, with β_1 ∼ N(β̄_1, σ²) and β_2 ∼ N(β̄_2, σ²). The complete likelihood of the exponentialized model can then be written as a product of Gaussian densities, where θ̄ = (µ, β̄_1, β̄_2) are the exponentialized model parameters to estimate.
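As an illustration, a synthetic dataset of this kind can be generated along the following lines. The logistic growth form matches the description above, but all numerical values (µ, τ, β_1, β_2, σ_ε and the measurement ages) are assumptions made for the sketch, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 100, 10                      # 100 trees observed at 10 ages
mu, tau = 200.0, 20.0               # mean and sd of the random effect phi_i (assumed)
beta1, beta2 = 700.0, 350.0         # age at half asymptote, growth scale (assumed)
sigma_eps = 5.0                     # known residual noise sd (assumed)
x = np.linspace(100.0, 1600.0, m)   # measurement ages (assumed)

phi = rng.normal(mu, tau, n)                     # phi_i ~ N(mu, tau^2)
pred = phi[:, None] / (1.0 + np.exp(-(x[None, :] - beta1) / beta2))
y = pred + rng.normal(0.0, sigma_eps, (n, m))    # observations y_ij, shape (n, m)
```

Each row of y is one tree's growth curve; the dataset of a thousand observations used in the experiments below has the same structure (100 subjects at 10 ages).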
Remark 8 It would be easy to suppose τ and σ_ε² unknown and to estimate them using the SAEM algorithm. Those parameters would leave the joint distribution curved exponential, and it would not be necessary to further exponentialize the model. To simplify, we assume them known here.
It is then easy to show that this likelihood belongs to the curved exponential family, with sufficient statistics S_1(φ), S_2(β_1) and S_3(β_2). The maximum likelihood estimator can then be expressed as a function of S_1(φ), S_2(β_1) and S_3(β_2). Because we cannot easily sample (φ, β_1, β_2) from the conditional distribution, we do not directly use the SAEM algorithm but the SAEM-MCMC algorithm: we replace the sampling step by one iteration of a Metropolis-Hastings algorithm targeting the posterior distribution. Under hypotheses presented in Kuhn and Lavielle (2004), this process converges towards the same limit as the SAEM algorithm. In particular, it has been proved in Kuhn and Lavielle (2005) that those conditions are indeed verified here, and thus that the algorithm converges. Moreover, as the limit is the same as the one given by the SAEM, our Theorem 2 still applies.
We then create a synthetic dataset of a thousand observations following this model (100 subjects observed at 10 different ages). Knowing the exact value of µ, we plot the incomplete likelihood of the non-exponentialized model g_NE as a function of (β_1, β_2) in Figure 6a. We also plot its behaviour around the maximum and along the axes β_1 and β_2 in Figures 6b and 6c. As we can see, the function is not symmetric around the maximum. Hence, there should be a bias when estimating the maximum likelihood using the exponentialized model. More precisely, we can see that the error in β_2 should be larger than the one in β_1, as the function is less symmetric along the β_2 axis than along the β_1 axis.
To verify this heuristic, we launch the SAEM-MCMC algorithm a hundred times for different values of σ. We then compare the results given by the SAEM-MCMC algorithm to the exact value of the maximum likelihood of the initial model. Because we know the exact parameters from which the dataset has been simulated, we are also able to compute numerically the solution of the equation (7) as a function of σ. The results are presented in Figure 7.
For σ ≥ 1, the results of the simulation follow our theory, with the estimated parameters close to the solution of the equation (7). Moreover, as expected, the error is bigger in the estimation of β_2 than in the estimation of β_1 (see axis scales).
However, for a small σ, the algorithm does not converge. Indeed, in that case, the variance of the conditional distribution is really small, as it is proportional to exp(−(β − β̄)²/(2σ²)). In particular, it means that the algorithm will be extremely slow to converge and, in practice, will stay near the initial value (β_1 = 6, β_2 = 34 here).

Proposition of a new algorithm
To prevent this phenomenon, we now propose a new process that allows a better estimation of the real maximum of the non-exponentialized likelihood. We still use the exponentialization trick, but with a σ that is adapted along the iterations. The goal is to allow the estimate to escape from its initial value while converging towards a point closer to the true maximum.
We propose to first run the algorithm with σ = 1 for a certain number m of iterations and then reduce the value of σ by multiplying it by 0.9. We iterate this process every m iterations, until the difference between successive estimated parameters is sufficiently small. We then let the algorithm converge with this small value of σ. This may be seen as launching the algorithm several times with an initialization closer and closer to the true maximum likelihood. While the algorithm will not converge towards the real maximum likelihood estimate, as σ is still positive during the last iterations, the error should be smaller than before, since σ has been significantly reduced.
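The schedule above can be sketched as follows. The inner routine saem_block stands for m SAEM-MCMC iterations and is supplied by the user; the toy stand-in used to exercise the schedule simply returns a limit with a bias proportional to σ (an assumption made only for the test, mimicking the σ-dependent bias of Theorem 2):

```python
import numpy as np

def saem_decreasing_sigma(saem_block, psi0, sigma0=1.0, factor=0.9,
                          m=50, tol=1e-4, sigma_min=1e-3):
    """Run blocks of m iterations, shrinking sigma by `factor` after each
    block, until successive estimates barely move; then finish at small sigma."""
    sigma, psi = sigma0, np.asarray(psi0, dtype=float)
    while sigma > sigma_min:
        new_psi = saem_block(psi, sigma, m)       # m SAEM-MCMC iterations
        if np.max(np.abs(new_psi - psi)) < tol:   # estimates have stabilized
            break
        psi, sigma = new_psi, sigma * factor
    return saem_block(psi, sigma, 10 * m)         # final run at the small sigma

def toy_block(psi, sigma, m):
    # hypothetical stand-in: returns the (sigma-biased) limit of the inner runs
    return np.array([2.0 + 0.5 * sigma])

est = saem_decreasing_sigma(toy_block, [0.0])
```

Each block can reuse the final sufficient statistics and chain state of the previous one, which is what makes this comparable to restarting the algorithm with a better and better initialization.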
To test this new algorithm, we launch this process a hundred times. We present the means and variances of the estimated parameters in Table 1 and as green crosses in Figure 7. Even if we do not reach the maximum likelihood of the initial model, the error for β_2 is now smaller: 1.04%, while it was at least 2.6% without reducing the variance throughout the algorithm. As for β_1, the error is of the same order as before.
Remark 9 In Kuhn and Lavielle (2005), the authors use this model and algorithm on a real dataset for different values of σ. They conclude that the estimation of (β_1, β_2) does not seem to depend on the choice of σ. In fact, for the particular values of this real dataset, the likelihood is practically symmetric around its maximum. Hence, the error made in that case is indeed small for any σ.

Conclusion
In this paper, we have proved that the exponentialization process does not converge in general towards the maximum likelihood of the initial model when using the SAEM or SAEM-MCMC algorithm. While the error converges towards 0 when σ goes to 0, it is numerically impossible to take σ too small, as the algorithm is then never able to converge in practice. To overcome this problem, we propose a new numerical scheme consisting in launching the algorithm several times while making the variance of the exponentialized model decrease. Thanks to our theoretical results, we show that this new process converges towards a better estimation of the maximum likelihood of the initial model, as verified by the numerical simulations. Hence, we are able to approach the exact maximum likelihood even in the case where our model does not belong to the curved exponential family.
(a) Plot of the incomplete likelihood of the initial model as a function of (β_1, β_2). (b) Plot of the incomplete likelihood of the initial model as a function of β_1, for β_2 fixed at the argmax of the likelihood. (c) Plot of the incomplete likelihood of the initial model as a function of β_2, for β_1 fixed at the argmax of the likelihood.

Fig. 6: Plot of the incomplete likelihood of the initial model as a function of (β 1 , β 2 ) along different sections for µ = 5.

Fig. 7: The red line represents the theoretical value towards which the algorithm is supposed to converge. The red points are the means of the parameters estimated over 100 runs, with their standard deviations represented by the red zone. In dotted blue is the maximum likelihood of the initial model. In magenta, the theoretical limit towards which the parameter converges when σ goes to infinity. Finally, the green cross represents the value returned when varying the variance of the exponentialized model throughout the algorithm.

Table 1: Mean and variance of the parameters estimated while reducing the variance throughout the algorithm. To be compared with the maximum likelihood of the non-exponentialized model, reached for β_1 = 6.3 and β_2 = 32.28.

References

Wainwright MJ, Yu B, et al. (2017) Statistical guarantees for the EM algorithm: from population to sample-based analysis. The Annals of Statistics 45(1):77-120
Benzekry S, Lamont C, Beheshti A, Tracz A, Ebos JM, Hlatky L, Hahnfeldt P (2014) Classical mathematical models for description and prediction of experimental tumor growth. PLoS Comput Biol 10(8):e1003800
Bône A, Colliot O, Durrleman S (2018) Learning distributions of shape trajectories from longitudinal datasets: a hierarchical model on a manifold of diffeomorphisms. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 9271-9280
Chrétien S, Hero AO (2008) On EM algorithms and their proximal generalizations. ESAIM: Probability and Statistics 12:308-326