Lp and almost sure rates of convergence of averaged stochastic gradient algorithms: locally strongly convex objective

A usual problem in statistics consists in estimating the minimizer of a convex function. When we have to deal with large samples taking values in high dimensional spaces, stochastic gradient algorithms and their averaged versions are efficient candidates. Indeed, (1) they do not require much computational effort, (2) they do not need to store all the data, which is crucial when dealing with big data, and (3) they allow the estimates to be updated simply, which is important when data arrive sequentially. The aim of this work is to give asymptotic and non-asymptotic rates of convergence of stochastic gradient estimates, as well as of their averaged versions, when the function we would like to minimize is only locally strongly convex.


Introduction
With the development of automatic sensors, it is more and more important to think about methods able to deal with large samples of observations taking values in high dimensional spaces such as functional spaces. We focus here on a usual stochastic optimization problem which consists in estimating
m := arg min_{h ∈ H} G(h),  with  G(h) := E[g(X, h)],
where H is a Hilbert space, X is a random variable taking values in a space 𝒳, and g : 𝒳 × H → R. One usual method, given a sample X_1, ..., X_n, is to consider the empirical problem generated by this sample, i.e. to consider the M-estimates m_n minimizing the empirical counterpart of G (see the books of Huber and Ronchetti (2009) and Maronna et al. (2006)), and to approximate m_n using deterministic optimization methods (see Boyd and Vandenberghe (2004) for instance). Nevertheless, one of the most important problems with such methods is that they become computationally expensive when we deal with large samples taking values in high dimensional spaces. In order to overcome this, the stochastic gradient algorithms introduced by Robbins and Monro (1951) are efficient candidates. Indeed, they do not require much computational effort, do not require storing all the data, and can be updated simply, which is of real interest when the data arrive sequentially.
The literature on this topic is very large (see the books of Duflo (1997) and Kushner and Yin (2003) among others), as is the literature on the method to improve the convergence which consists in averaging the Robbins-Monro estimates, introduced by Ruppert (1988) and whose first convergence results were given by Polyak and Juditsky (1992). Many asymptotic results exist when the data lie in finite dimensional spaces (see Duflo (1997), Pelletier (1998), or Pelletier (2000) for instance), but the proofs cannot be directly adapted to infinite dimensional spaces. Moreover, an asymptotic result such as a Central Limit Theorem does not give any clue as to how far the distribution of the estimate is from its asymptotic law for a fixed sample size n. Hence, non-asymptotic properties are always desirable for statisticians who deal with real data (see the nice arguments of Rudelson (2014) for example). As a consequence, these last few years, statisticians have focused more and more on non-asymptotic rates of convergence. For example, Moulines and Bach (2011) and Bach (2014) give some general conditions to get the rate of convergence in quadratic mean of averaged stochastic gradient algorithms, while Ghadimi and Lan (2012), for instance, focus on non-asymptotic rates for strongly convex stochastic composite optimization.
The aim of this work is to draw inspiration from the proof techniques introduced by Cardot et al. (2017) and improved by Godichon-Baggioni (2016) and Cardot and Godichon-Baggioni (2015), in order to give convergence results for stochastic gradient algorithms and their averaged versions when the function we would like to minimize is only locally strongly convex. First, we establish almost sure rates of convergence of the estimates in general Hilbert spaces. Furthermore, as mentioned above, asymptotic results are often not sufficient, and L^p rates of convergence of the algorithms are therefore also given.
The paper is organized as follows. Section 2 introduces the framework, the assumptions, the algorithms and some convexity properties of the function we would like to minimize. Two examples of application are given in Section 3: we first focus on the estimation of geometric quantiles, which are a generalization of the real quantiles introduced by Chaudhuri (1996).
They are robust indicators which can be useful in statistical depth and outlier detection (see Serfling (2006), Chen et al. (2009) or Hallin and Paindaveine (2006)). Second, we focus on the estimation of generalized p-means (Polya et al. (1952), Borwein and Borwein (1987)), used in several domains such as computer vision (Turaga et al. (2008)) or medical imaging (Goh et al. (2009)). Third, since stochastic gradient algorithms can be applied to several regression problems (Bach (2014), Cohen et al. (2016)), we focus on a robust logistic regression. In Section 4, the almost sure and L^p rates of convergence of the estimates are given. Our theoretical results are illustrated by numerical experiments in Section 5. Finally, the proofs are postponed to Section 6 and to the Appendix.
(A5η) There are positive constants η and L_η such that for all h ∈ H, the corresponding moment bound holds.
Note that, for the sake of simplicity, the different constants are often denoted in the same way. We now make some comments on the assumptions. First, note that no convexity assumption on the functional g is required.
Assumptions (A2) and (A3) give some properties of the spectrum of the Hessian and ensure that the functional G is locally strongly convex. Note that assumption (A3) can be restated as λ_min(Γ_m) > 0, where λ_min(.) denotes the smallest eigenvalue (or the lim inf of the eigenvalues in infinite dimensional spaces) of a linear operator, provided the functional h → λ_min(Γ_h) is continuous on a neighborhood of m.
Moreover, assumption (A4) allows us to bound the remainder term in the Taylor expansion of the gradient, which is natural since the functional G is twice continuously differentiable. Assumption (A5) enables us to bound the gradient under conditions on the functional f. More precisely, (A5a) and (A5η) are sufficient to get the strong consistency and the almost sure rates of convergence, while we need to assume (A5b) to obtain the L^p rates of convergence. This still represents a significant relaxation of the usual conditions needed to get non-asymptotic results. For example, a main difference with Bach (2014) and Godichon-Baggioni (2016) is that, instead of having a bounded gradient, we split the bound into two parts: one which admits q-th moments, and one which depends on the estimation error.
Moreover, note that it is possible to replace assumption (A5) by:
(A5a') There is a positive constant L_1 such that the stated bound holds for all h ∈ H.
(A5η') There are positive constants η and L_{1+η} such that the stated bound holds for all h ∈ H.
(A5b') For all integer q, there is a positive constant L_q such that the stated bound holds for all h ∈ H.
Remark 2.1. These assumptions are analogous to the usual ones in finite dimension (Pelletier (1998), Pelletier (2000)), but in our case the proofs remain true in infinite dimension.
Remark 2.2. Note that the Hessian of the functional G is not assumed to be compact. Hence, if H = R^d, its smallest eigenvalue λ_min(Γ_m) does not necessarily converge to 0 when the dimension d tends to infinity.

The algorithms
Let X_1, ..., X_n, ... be independent random variables with the same law as X. The stochastic gradient algorithm is defined recursively for all n ≥ 1 by
Z_{n+1} = Z_n − γ_n U_{n+1},
where Z_1 is chosen bounded and U_{n+1} := ∇_h g(X_{n+1}, Z_n). The step sequence (γ_n) is a decreasing sequence of positive real numbers which satisfies the following usual assumptions (see Duflo (1997)):
∑_{n≥1} γ_n = +∞   and   ∑_{n≥1} γ_n² < +∞.
The term U_{n+1} can be considered as a random perturbation of the gradient Φ at Z_n. Indeed, let (F_n) be the sequence of σ-algebras defined for all n ≥ 1 by F_n := σ(X_1, ..., X_n) = σ(Z_1, ..., Z_n); then E[U_{n+1} | F_n] = ∇G(Z_n) =: Φ(Z_n).
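To fix ideas, a minimal Python sketch of this recursion is given below; the function grad_g (standing for ∇_h g), the observation stream and the default constants are placeholders of ours, and the step sequence γ_n = c_γ n^{−α} anticipates the choice made in Section 4.

```python
import numpy as np

def robbins_monro(grad_g, observations, z1, c_gamma=1.0, alpha=0.66):
    """Stochastic gradient (Robbins-Monro) recursion Z_{n+1} = Z_n - gamma_n * grad.

    grad_g(x, z) should return the gradient of h -> g(x, h) at z for one observation x.
    """
    z = np.asarray(z1, dtype=float)
    for n, x in enumerate(observations, start=1):
        gamma_n = c_gamma * n ** (-alpha)          # step sequence gamma_n = c_gamma n^{-alpha}
        z = z - gamma_n * grad_g(x, z)             # gradient step
    return z
```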
In order to improve the convergence, we now introduce the averaged algorithm (Ruppert (1988), Polyak and Juditsky (1992)), defined recursively for all n ≥ 1 by
\bar{Z}_{n+1} = \bar{Z}_n + (Z_{n+1} − \bar{Z}_n)/(n + 1),
with \bar{Z}_1 = Z_1. This can also be written as follows:
\bar{Z}_n = (1/n) ∑_{k=1}^{n} Z_k.
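As a small numerical check (with hypothetical names), the recursive averaging step indeed reproduces the arithmetic mean of the iterates, which is the equivalent form written above.

```python
import numpy as np

def running_average(iterates):
    """Recursive update Zbar_{n+1} = Zbar_n + (Z_{n+1} - Zbar_n)/(n+1) over a list of arrays."""
    z_bar = iterates[0].copy()                     # Zbar_1 = Z_1
    for n, z in enumerate(iterates[1:], start=1):
        z_bar = z_bar + (z - z_bar) / (n + 1)      # recursive averaging step
    return z_bar

# For any list Z of iterates Z_1, ..., Z_n (numpy arrays):
# np.allclose(running_average(Z), np.mean(Z, axis=0))  # -> True
```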

Some convexity properties
We now give some convexity properties of the functional G. The proofs are given in the Appendix. First, since ∇G(m) = 0 and since G is twice continuously differentiable, the gradient admits an integral form around m for all h ∈ H (see the display below). The first proposition gives the local strong convexity of the functional G.
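For the reader's convenience, the identity alluded to is presumably the standard integral form of the gradient; as a sketch, writing Γ_u for the Hessian of G at u, the twice continuous differentiability of G and ∇G(m) = 0 give
\[
\nabla G(h) \;=\; \nabla G(h) - \nabla G(m) \;=\; \int_0^1 \Gamma_{m + t(h-m)}\,(h-m)\,\mathrm{d}t , \qquad h \in H .
\]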
The following corollary ensures that m is the unique solution of the problem defined by (1).
Corollary 2.1. Assume (A1) to (A3) and (A5a) hold. Then m is the unique solution of the equation ∇G(h) = 0 and, in particular, m is the unique minimizer of the functional G.
Finally, the last proposition gives a uniform bound on the remainder term in the Taylor expansion of the gradient.

Applications in general separable Hilbert spaces
In this section, let us consider a separable Hilbert space H and let X be a random variable taking values in H.

Estimating geometric quantiles:
The geometric quantile m_v of X corresponding to a direction v, where v ∈ H and ‖v‖ < 1, is defined as the minimizer of the functional G_v introduced below. Note that if v = 0, the geometric quantile m_0 corresponds to the geometric median (Haldane (1948), Kemperman (1987)). Let G_v be the function we would like to minimize; it admits a minimizer m_v, which is also a solution of the equation
E[(X − h)/‖X − h‖] + v = 0.
Then, assumption (A1) is fulfilled and the stochastic gradient algorithm and its averaged version are defined recursively for all n ≥ 1 by
m^v_{n+1} = m^v_n + γ_n ( (X_{n+1} − m^v_n)/‖X_{n+1} − m^v_n‖ + v ),
\bar{m}^v_{n+1} = \bar{m}^v_n + (m^v_{n+1} − \bar{m}^v_n)/(n + 1),
with m^v_1 = \bar{m}^v_1 chosen bounded (choosing a positive constant M, one can take for example m^v_1 := X_1 1_{‖X_1‖ ≤ M}); a numerical sketch is given after the assumptions below. In order to ensure the uniqueness of the geometric quantiles and the convergence of these estimates, we consider from now on that the following assumptions are fulfilled:
(B1) The random variable X is not concentrated on a straight line: for all h ∈ H, there is h' ∈ H such that ⟨h, h'⟩ = 0 and Var(⟨X, h'⟩) > 0.
(B2) The random variable X is not concentrated around single points: for every positive constant A, there is a positive constant C_A such that the corresponding bound holds for all h ∈ B(m_v, A).
Note that assumption (B2) is not restrictive when we deal with a high dimensional space.
For example, if H = R^d with d ≥ 3, as discussed in Chaudhuri (1992) and Cardot et al. (2013), this condition is satisfied as soon as X admits a density which is bounded on every compact subset of R^d. Moreover, this assumption ensures the existence of the Hessian of G_v, which is defined for all h ∈ H. Finally, for all integer p ≥ 1 and for all h ∈ H, the norm of the gradient is bounded by 1 + ‖v‖ ≤ 2 almost surely, and assumptions (A5a) and (A5b) are thus verified.
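As an illustration, a minimal Python sketch of the averaged stochastic gradient estimate of a geometric quantile could look as follows; it assumes the recursion written above, and the function name and default parameters are ours.

```python
import numpy as np

def averaged_geometric_quantile(X, v, c_gamma=1.0, alpha=0.66):
    """Averaged stochastic gradient estimate of the geometric quantile in direction v.

    A sketch assuming the recursion
        m_{n+1} = m_n + gamma_n * ((X_{n+1} - m_n) / ||X_{n+1} - m_n|| + v),
    averaged recursively as in Section 2; X is an (n, d) array and ||v|| < 1.
    """
    m = X[0].copy()                                # bounded initialization m_1 = X_1
    m_bar = m.copy()
    for n in range(1, X.shape[0]):
        gamma_n = c_gamma * n ** (-alpha)          # step sequence gamma_n = c_gamma n^{-alpha}
        diff = X[n] - m
        norm = np.linalg.norm(diff)
        if norm > 0:                               # the event X_{n+1} = m_n has probability zero
            m = m + gamma_n * (diff / norm + v)
        m_bar = m_bar + (m - m_bar) / (n + 1)      # recursive averaging
    return m, m_bar
```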
Estimating p-means: Let p ∈ (1, 2); the p-mean of X is then defined as the minimizer of the functional G_p below. Note that the cases p = 1 and p = 2 correspond respectively to the geometric median and the usual mean. Let G_p be the function we would like to minimize, defined for all h ∈ H by G_p(h) = (1/p) E[‖X − h‖^p]. This function is convex and lim_{‖h‖→∞} G_p(h) = +∞, so that G_p admits a minimizer m^{(p)}, which is also a solution of the equation
E[‖X − h‖^{p−2}(X − h)] = 0.
Then, assumption (A1) is fulfilled and the stochastic gradient algorithm and its averaged version are defined recursively for all n ≥ 1 by
m^{(p)}_{n+1} = m^{(p)}_n + γ_n ‖X_{n+1} − m^{(p)}_n‖^{p−2} (X_{n+1} − m^{(p)}_n),
\bar{m}^{(p)}_{n+1} = \bar{m}^{(p)}_n + (m^{(p)}_{n+1} − \bar{m}^{(p)}_n)/(n + 1),
with m^{(p)}_1 = \bar{m}^{(p)}_1 chosen bounded (a sketch is given after the assumptions below). In order to ensure some differentiability properties and the convergence of the estimates, let us now introduce some assumptions: (B1a') The random variable X admits a moment of order 2p − 2.
(B1b') For all positive integer q, the random variable X admits a moment of order q.

(B2')
The random variable X is not concentrated around single points: for every positive constant A, there is a positive constant C_A such that the corresponding bound holds for all h ∈ B(m^{(p)}, A). Assumption (B1a') ensures that the gradient of G_p is well defined and that the corresponding moment condition on the gradient is satisfied. Remark that this example cannot be treated with the theoretical tools of Godichon-Baggioni (2016) and Bach (2014): in these previous papers, uniform bounds on the gradient are needed, while in this example the gradient is bounded by a term with finite moments and a term depending on the estimation error. Finally, assumption (B2') ensures that the function we would like to minimize is twice continuously differentiable.
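As announced, here is a minimal Python sketch of the p-mean recursion; it assumes the stochastic gradient step m_{n+1} = m_n + γ_n ‖X_{n+1} − m_n‖^{p−2}(X_{n+1} − m_n) derived from G_p, with illustrative function and parameter names.

```python
import numpy as np

def averaged_p_mean(X, p=1.5, c_gamma=1.0, alpha=0.66):
    """Averaged stochastic gradient estimate of the p-mean, 1 < p < 2.

    A sketch assuming the recursion
        m_{n+1} = m_n + gamma_n * ||X_{n+1} - m_n||^{p-2} * (X_{n+1} - m_n),
    i.e. a stochastic gradient step for G_p(h) = (1/p) E ||X - h||^p.
    """
    m = X[0].copy()                                # initialization m_1 = X_1
    m_bar = m.copy()
    for n in range(1, X.shape[0]):
        gamma_n = c_gamma * n ** (-alpha)
        diff = X[n] - m
        norm = np.linalg.norm(diff)
        if norm > 0:
            m = m + gamma_n * norm ** (p - 2) * diff
        m_bar = m_bar + (m - m_bar) / (n + 1)      # recursive averaging
    return m, m_bar
```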

An application in a finite dimensional space: a robust logistic regression
Let d ≥ 1 and H = R^d. Let (X, Y) be a couple of random variables taking values in H × {−1, 1}. The aim is to minimize the functional G_r defined for all h ∈ R^d (see Bach (2014)). In order to ensure the existence and uniqueness of the solution, we consider from now on that the following assumptions are fulfilled:
(B1") There exists m_r such that ∇G_r(m_r) = 0.
(B2") The Hessian of the functional G r at m r is positive.
(B3b") For all integer p, the random variable X admits a p-th moment.
Assumption (B1") ensures the existence of a solution while (B2") gives its uniqueness. Assumption (B3a") ensures that the functional G_r is twice Fréchet-differentiable, with gradient and Hessian defined for all h ∈ R^d. Note that assumption (B2") is verified, for example, as soon as there are positive constants for which the corresponding lower bound holds. Then, the solution m_r can be estimated recursively with the algorithms of Section 2, with m^r_1 = \bar{m}^r_1 bounded. Under assumptions (B1") to (B3a"), hypotheses (A1) to (A5a) are satisfied, while under the additional assumption (B3b"), hypothesis (A5b) is satisfied. Remark that this example is already treated in Bach (2014), but only for a bounded gradient, i.e. under the existence of a positive constant R such that ‖X‖ ≤ R almost surely, i.e. only in the case where X is bounded.
Remark 3.1. Remark that these results remain true for several other regression settings. For example, one can consider the plain logistic regression, for which one can consider estimates of the same recursive form; a sketch is given below.
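For the plain logistic regression of Remark 3.1, a minimal sketch could be the following; it assumes the loss g(x, y, h) = log(1 + exp(−y⟨x, h⟩)), whose gradient in h is −y x /(1 + exp(y⟨x, h⟩)). It does not reproduce the robust functional G_r, which is not restated here, and the function name and parameters are ours.

```python
import numpy as np

def averaged_sgd_logistic(X, Y, c_gamma=1.0, alpha=0.66):
    """Averaged stochastic gradient for the logistic loss log(1 + exp(-y <x, h>)).

    X is an (n, d) design matrix and Y an array of labels in {-1, +1}.
    """
    h = np.zeros(X.shape[1])                           # bounded initialization
    h_bar = h.copy()
    for n in range(X.shape[0]):
        gamma_n = c_gamma * (n + 1) ** (-alpha)
        margin = Y[n] * X[n].dot(h)
        grad = -Y[n] * X[n] / (1.0 + np.exp(margin))   # gradient of the loss at h
        h = h - gamma_n * grad                         # gradient step
        h_bar = h_bar + (h - h_bar) / (n + 2)          # recursive averaging
    return h, h_bar
```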

Rates of convergence
In this section, we consider a step sequence (γ_n)_{n≥1} of the form γ_n := c_γ n^{−α} with c_γ > 0 and α ∈ (1/2, 1). Note that taking α = 1 would be possible with a good choice of the constant c_γ (taking c_γ > 1/λ_min for instance). Nevertheless, the averaging step enables us to get the optimal rate of convergence with a smaller asymptotic variance than the stochastic gradient algorithm with the fast-decreasing step sequence γ_n = c_γ n^{−1} (see Polyak and Juditsky (1992), Pelletier (1998) and Pelletier (2000) for more details).

Almost sure rates of convergence
In this section, we focus on the almost sure rates of convergence of the algorithms defined in (3) and (4). First, the following theorem gives the consistency of the algorithms.
The following theorem gives the almost sure rates of convergence of the stochastic gradient algorithm as well as of its averaged version under the additional assumption (A4).
Note that similar results are given in Pelletier (1998), but only in finite dimension. More precisely, the proofs given there cannot be directly extended to the case where H is an infinite dimensional space. For example, these methods rely on the fact that the Hessian of the functional G admits finite dimensional eigenspaces, which is not necessarily true for general Hilbert spaces. Another problem is that norms are not equivalent in infinite dimensional spaces, and consequently the Hilbert-Schmidt (or Frobenius) norm of a linear operator is not necessarily finite even if its spectral norm is. For example, under assumption (A3), if H is an infinite dimensional space, ‖Γ_m‖_{H−S} = +∞, where ‖.‖_{H−S} denotes the Hilbert-Schmidt norm.

L p rates of convergence
In this section, we focus on the L^p rates of convergence of the algorithms. The proofs are postponed to Section 6. The idea is to give non-asymptotic results without focusing only on the rate of convergence in quadratic mean. Indeed, recent works (see Cardot and Godichon-Baggioni (2015) and Godichon-Baggioni (2016) for instance) confirm that having L^p rates of convergence can be very useful to establish rates of convergence of more complex estimates.
Theorem 4.3. Assume (A1) to (A5b) hold. Then, for all integer p ≥ 1, there is a positive constant K_p such that for all n ≥ 1,
E[‖Z_n − m‖^{2p}] ≤ K_p / n^{pα}.
This result remains true if assumptions (A3) and/or (A5b) are replaced by (A3') and/or (A5b').
Finally, the last theorem gives the L p rates of convergence of the averaged estimates.
Theorem 4.4. Assume (A1) to (A5b) hold. Then, for all integer p ≥ 1, there is a positive constant K_p such that for all n ≥ 1,
E[‖\bar{Z}_n − m‖^{2p}] ≤ K_p / n^{p}.
This result remains true if assumptions (A3) and/or (A5b) are replaced by (A3') and/or (A5b').
As done in Cardot et al. (2017) and Godichon-Baggioni (2016), one can check that, under the assumptions, these rates of convergence are the optimal ones for Robbins-Monro algorithms and their averaged versions, i.e. one can prove that there are positive constants c, c' such that the corresponding lower bounds hold for all n ≥ 1.
Remark 4.1. One can obtain the same L^p and almost sure rates of convergence for the stochastic gradient algorithm if assumption (A4) is replaced by
(A4') There are positive constants ε > 0, C_m and β ∈ (1, 2] such that for all h ∈ B(m, ε), ‖∇G(h) − Γ_m(h − m)‖ ≤ C_m ‖h − m‖^β.
Moreover, one can get the same L^p and almost sure rates of convergence for the averaged algorithm by replacing (A4) with (A4') and taking a step sequence of the form γ_n := c_γ n^{−α} with α ∈ (β^{−1}, 1).

Remark 4.2.
Let p be a positive integer; it is possible to get the L^{2p} rates of convergence of the Robbins-Monro algorithm by only supposing that there are a positive integer q > 2p + 2 and a positive constant L_q such that for all h ∈ H, E[f(X, h)^{2q}] ≤ L_q, and by taking a step sequence of the form γ_n := c_γ n^{−α} with α ∈ (1/2, q/(p + 2 + q)).

Simulation study
In this section, we consider a Gaussian random vector X ∼ N(0, I_100) taking values in R^100, and we aim to estimate the p-mean m^{(p)} of X with p = 1.5. Note that in this case, m^{(p)} = 0_{R^100}. We consider q samples X_{1,1}, ..., X_{1,n}, ..., X_{q,1}, ..., X_{q,n} of size n. In order to compare the different estimates for a fixed sample size n, we consider the empirical quadratic mean error of the estimates, i.e. given an estimate \hat{m} of m and the associated estimates \hat{m}_{1,n}, ..., \hat{m}_{q,n}, we consider (1/q) ∑_{i=1}^{q} ‖\hat{m}_{i,n} − m‖². In order to initialize the algorithms, we take the first observation, i.e. m^{(p)}_{i,1} = X_{i,1}. In Figure 1, we consider a step sequence γ_n = c_γ n^{−α} with c_γ = 2 and α = 0.66. One can check that the averaged algorithm converges faster than the gradient algorithm and becomes better after a small number of observations (about 50). This rather poor behavior on the first steps can be explained by the rather poor initialization of the gradient algorithm, which therefore needs some time before getting close to the target. In Figure 2, we study the impact of the choice of α on the behavior of the estimates. For small sample sizes, the averaged version seems to converge faster when α decreases, before the different choices lead to analogous behaviors at n = 1000. This can be explained by the fact that the smaller α is, the more the gradient estimates "move", and the more likely they are to get close to the target quickly.
Finally, in Table 1, we study the impact of the choices of α and c_γ on the estimates for a moderate sample size n = 10^4. As expected, one can see that the averaged estimates are globally better than the gradient ones and are more stable with respect to the choice of the step sequence. The critical choices of step sequence for the averaged algorithm occur when c_γ is small and α is close to 1 simultaneously. This is not surprising because, here again, the gradient steps then need too much data before getting close to the target, since the first steps γ_n = c_γ n^{−α} are then very small.
Table 1: Quadratic mean errors (×10^{−2}) of the gradient estimates (left) and of the averaged estimates (right) for a sample size n = 10000 and different values of α and c_γ.
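A possible Python sketch of the simulation protocol is given below, reusing the p-mean recursion of Section 3; the number of replications, sample size and step-size constants are purely illustrative.

```python
import numpy as np

def simulate_p_mean(q=50, n=1000, d=100, p=1.5, c_gamma=2.0, alpha=0.66, seed=0):
    """Empirical quadratic mean errors of the gradient and averaged estimates
    of the p-mean of X ~ N(0, I_d), whose true value is 0."""
    rng = np.random.default_rng(seed)
    err_sgd = np.zeros(q)
    err_avg = np.zeros(q)
    for i in range(q):
        X = rng.standard_normal((n, d))            # one sample of size n
        m = X[0].copy()                            # initialization m_{i,1} = X_{i,1}
        m_bar = m.copy()
        for k in range(1, n):
            gamma = c_gamma * k ** (-alpha)
            diff = X[k] - m
            norm = np.linalg.norm(diff)
            if norm > 0:
                m = m + gamma * norm ** (p - 2) * diff
            m_bar = m_bar + (m - m_bar) / (k + 1)
        err_sgd[i] = np.sum(m ** 2)                # ||m_{i,n} - 0||^2
        err_avg[i] = np.sum(m_bar ** 2)
    return err_sgd.mean(), err_avg.mean()

# Example: errors of the gradient and averaged estimates for n = 1000.
# print(simulate_p_mean())
```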

Some decompositions of the algorithms
In order to simplify the proofs, we introduce some usual decompositions of the algorithms. First, let us recall that the Robbins-Monro algorithm is defined by (7), with U_{n+1} := ∇_h g(X_{n+1}, Z_n). Then, letting ξ_{n+1} := Φ(Z_n) − U_{n+1}, equality (7) can be written as (8). Note that (ξ_n) is a martingale differences sequence adapted to the filtration (F_n). Furthermore, linearizing the gradient, equation (8) can be written as (9), where δ_n := Φ(Z_n) − Γ_m(Z_n − m) is the remainder term in the Taylor expansion of the gradient. Note that, thanks to Proposition 2.2, there is a positive constant C_m such that for all n ≥ 1, ‖δ_n‖ ≤ C_m ‖Z_n − m‖². Finally, by induction, we obtain the usual decomposition (10), with β_{n−1} defined as the product of the operators I_H − γ_k Γ_m for k ≤ n − 1. In the same way, in order to get the rates of convergence of the averaged algorithm, we need to exhibit a new decomposition. To this aim, equality (9) can be rewritten and, as in Pelletier (2000), summing these equalities, applying Abel's transform and dividing by n yields decomposition (11).
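For readability, the chain of displays (7)-(10) referred to above can be summarized as follows; this is a sketch written in the usual Pelletier-type notation, assuming β_0 := I_H and β_n := ∏_{k=1}^{n} (I_H − γ_k Γ_m):
\[
Z_{n+1} - m \;=\; Z_n - m - \gamma_n U_{n+1}
            \;=\; Z_n - m - \gamma_n \Phi(Z_n) + \gamma_n \xi_{n+1}
            \;=\; \left( I_H - \gamma_n \Gamma_m \right)(Z_n - m) + \gamma_n \xi_{n+1} - \gamma_n \delta_n ,
\]
and, by induction (assuming the operators I_H − γ_k Γ_m are invertible for the indices involved),
\[
Z_n - m \;=\; \beta_{n-1}(Z_1 - m) + \beta_{n-1} M_n - \beta_{n-1} R_n ,
\qquad
M_n := \sum_{k=1}^{n-1} \gamma_k \beta_k^{-1} \xi_{k+1},
\qquad
R_n := \sum_{k=1}^{n-1} \gamma_k \beta_k^{-1} \delta_k .
\]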
Moreover, since ⟨Φ(Z_n), Z_n − m⟩ ≥ 0, one can check by induction that there is a positive constant M such that the corresponding bound holds for all n ≥ 1. Thus, one can conclude the proof in the same way as in the proof of Theorem 3.1 in Cardot et al. (2013) for instance. Finally, one can apply Toeplitz's lemma (see Duflo (1997), Lemma 2.2.13) to get the strong consistency of the averaged algorithm.
In order to get the almost sure rates of convergence of the Robbins-Monro algorithm, we now introduce a technical lemma which gives the rate of convergence of the martingale term β n−1 M n in decomposition (10).
Proof of Theorem 4.2. Rate of convergence of the Robbins-Monro algorithm. Applying decomposition (10), as in Pelletier (1998), we split Z_n − m into terms Δ_{i,n} which we bound separately. Since ‖δ_n‖ ≤ C_m ‖Z_n − m‖², the corresponding bound holds for n large enough. We now introduce the sequence of events E_n := {‖Z_n − m‖ ≤ λ_min/(2C_m)}; since Z_n converges almost surely to m, 1_{E_n} converges almost surely to 1. Then, thanks to decomposition (10), denoting n_0 := max{inf{n : λ_min γ_n < 2}, inf{n : λ_max γ_n ≤ 1}}, one obtains a recursive bound for all n ≥ n_0. Furthermore, thanks to Lemma 6.1, there is a positive finite random variable M_∞ such that for all n, ‖β_{n−1} M_n + β_{n−1}(Z_1 − m)‖ ≤ M_∞ √(ln n) n^{−α/2} almost surely, and an induction then gives the announced rate. One can easily check, with usual computations, that Δ_{0,n} converges exponentially fast to 0, while the remaining term satisfies the announced bound. Finally, one can rewrite Δ_{2,n} using the events E_n and, since 1_{E_n^C} converges almost surely to 0, one can easily check that Δ_{2,n} also converges exponentially fast to 0, which concludes the proof.
Since ‖Z_{n+1} − m‖ converges almost surely to 0, applying the Robbins-Siegmund theorem (see Theorem E.1), ‖M_n‖² converges almost surely to a finite random variable, which concludes the proof.

Proof of Theorem 4.3
In order to prove Theorem 4.3 with the help of a strong induction on p, we have to introduce some technical lemmas (the proofs are given in the Appendix). Note that these lemmas remain true if assumptions (A3) and/or (A5b) are replaced by (A3') and/or (A5b'), but the proofs are only given under the first set of assumptions.
The first lemma gives a bound on the 2p-th moment of the estimation error when inequality (6) is satisfied for all integers from 0 to p − 1.
Lemma 6.2. Assume (A1) to (A5b) hold. Let p be a positive integer, and suppose that for all k ≤ p − 1, there is a positive constant K_k such that for all n ≥ 1, E[‖Z_n − m‖^{2k}] ≤ K_k n^{−kα}. Then, there are positive constants c_0, C_1, C_2 and a rank n_α such that the corresponding bound holds for all n ≥ n_α.
The second lemma gives an upper bound on the (2p + 2)-th moment when inequality (6) is satisfied for all integers from 0 to p − 1.
Lemma 6.3. Assume (A1) to (A3) and (A5b) hold. Let p be a positive integer, and suppose that for all k ≤ p − 1, there is a positive constant K_k such that for all n ≥ 1, E[‖Z_n − m‖^{2k}] ≤ K_k n^{−kα}. Then, there are positive constants C_1, C_2 and a rank n_α such that the corresponding bound holds for all n ≥ n_α.
Finally, the last lemma enables us to bound the probability that the Robbins-Monro algorithm moves far away from m, which is crucial in order to prove Lemma 6.3.
Lemma 6.4. Assume (A1) to (A3) and (A5b) hold. Then, for all integer p ≥ 1, there is a positive constant M_p such that the corresponding bound holds for all n ≥ 1.
Proof of Theorem 4.3. As in Godichon-Baggioni (2016), we prove with the help of a strong induction that for all integer p ≥ 1 and for all β ∈ (α, ((p+2)/p)α − 1/p), there are positive constants K_p, C_{β,p} such that for all n ≥ 1, E[‖Z_n − m‖^{2p}] ≤ K_p n^{−pα}, together with an analogous bound involving C_{β,p} for the (2p + 2)-th moment. Applying Lemma 6.4, Lemma 6.2 and Lemma 6.3, as soon as the initialization is verified, the proof is strictly analogous to the proof of Theorem 4.1 in Godichon-Baggioni (2016).
Thus, we only prove that for p = 1 and for all β ∈ (α, 3α − 1), there are positive constants K_1, C_{β,1} such that for all n ≥ 1, E[‖Z_n − m‖²] ≤ K_1 n^{−α} and E[‖Z_n − m‖⁴] ≤ C_{β,1} n^{−β}. We now split the end of the proof into two steps.
Step 1: Calibration of the constants. In order to simplify the demonstration, we introduce some notation. Let K_1, C_{β,1} be positive constants such that K_1 ≥ 2^{1+α} C_1 c_0^{−1} c_γ^{−1} (c_0 and C_1 are defined in Lemma 6.2) and 2K_1 ≥ C_{β,1} ≥ K_1 ≥ 1. By definition of β, there is a rank n_β ≥ n_α (n_α is defined in Lemma 6.2 and in Lemma 6.3) such that the corresponding inequalities hold for all n ≥ n_β, with C_2 defined in Lemma 6.2 and C_1, C_2 defined in Lemma 6.3. The rank n_β exists because β > α; moreover, since β < 3α − 1, we have β < 2.
Step 2: The induction on n. Let us take K_1 ≥ max_{1≤k≤n_β} k^α E[‖Z_k − m‖²] and C_{β,1} ≥ max_{1≤k≤n_β} k^β E[‖Z_k − m‖⁴]. We now prove by induction that for all n ≥ n_β, E[‖Z_n − m‖²] ≤ K_1 n^{−α} and E[‖Z_n − m‖⁴] ≤ C_{β,1} n^{−β}. Applying Lemma 6.2 and the induction hypothesis, since 2K_1 ≥ C_{β,1} ≥ K_1 ≥ 1, and after factorizing, one obtains the first bound at rank n + 1. In the same way, one can check by induction and by applying Lemma 6.3 that the second bound also holds at rank n + 1, by definition of n_β. This concludes the induction on n, and one can conclude the induction on p and the proof in a similar way as in Godichon-Baggioni (2016).

Proof of Theorem 4.4
Proof of Theorem 4.4. Let λ_min be the smallest eigenvalue of Γ_m. With the help of decomposition (11), for all integer p ≥ 1, and as in Godichon-Baggioni (2016), applying Theorem 4.3 and Lemma 4.1 in Godichon-Baggioni (2016), one can check that there are positive constants R_{1,p}, R_{2,p}, R_{3,p}, R_{4,p} such that the corresponding bounds hold for all n ≥ 1. We now prove with the help of a strong induction that for all integer p ≥ 1, there is a positive constant C_p such that the remaining martingale term satisfies the announced bound.
Step 1: Initialization of the induction. Since (ξ_n) is a martingale differences sequence adapted to the filtration (F_n), applying Theorem 4.3, there is a positive constant C_1 such that the bound holds for all n ≥ 1.
Step 2: The induction. Let p ≥ 2 and suppose from now on that for all p' ≤ p − 1, there is a positive constant C_{p'} such that the bound holds for all n ≥ 1. Then, letting M_n := ∑_{k=1}^{n} ξ_{k+1}, with the help of the previous equality and the Cauchy-Schwarz inequality, we obtain an upper bound whose right-hand side contains three terms, whose expectations we now bound. The first one is bounded directly; the second one is bounded using the fact that M_n is F_{n+1}-measurable; the third one is treated in the same way, and there are positive constants C_0, C_1 such that the corresponding bounds hold and, in particular, the third term is of the right order. Thus, thanks to inequalities (15) to (17), there are positive constants B_0, B_1 such that the announced bound holds for all n ≥ 1, which concludes the induction and the proof.

A Proofs of Propositions 2.1 and 2.2
Proof of Proposition 2.1. If h ∈ B(m, ε), under assumptions (A2) and (A3) and by dominated convergence, the first lower bound holds. In the same way, if ‖h − m‖ > ε, since G is convex, under assumptions (A2) and (A3) and by dominated convergence, the second lower bound holds. Thus, letting A be a positive constant and h ∈ B(m, A), the announced inequality follows with c_A := min{λ_min/2, λ_min ε/(2A)}. We now give an upper bound of this term. First, thanks to assumption (A2), for every positive constant A and all h ∈ B(m, A), the Hessian is bounded. Moreover, applying the Cauchy-Schwarz inequality and thanks to assumption (A5a), the upper bound follows for all h, which concludes the proof.
Proof of Proposition 2.2. Let us recall that there are positive constants ε and C such that the stated bound holds for all h ∈ B(m, ε). Let h ∈ H be such that ‖h − m‖ ≥ ε. Then, thanks to assumptions (A2) and (A3), the bound also holds in this case, which concludes the proof.

B Proof of Lemma 6.4
We propose here a sketch of the proof; for analogous and more detailed computations, one can see the proof of Lemma 5.3 in Cardot et al. (2017).
Proof of Lemma 6.4. We prove Lemma 6.4 with the help of a strong induction on p. The case p = 1 is already done in the proof of Theorem 3.1. We suppose from now on that p ≥ 2 and that for all k ≤ p − 1, there is a positive constant M_k such that the corresponding bound holds for all n ≥ 1. Let V_n := Z_n − m − γ_n Φ(Z_n); then, with the help of decomposition (8), the required estimates follow. Conclusion: applying inequalities (B-23) to (B-25) and the induction hypothesis, there are positive constants B_1, B_2 such that the announced bound holds, which concludes the induction and the proof.

C Proof of Lemma 6.3
We propose here a sketch of the proof; for analogous and more detailed computations, one can see the proof of Lemma 4.2 in Godichon-Baggioni (2016).
Proof of Lemma 6.3. Let p ≥ 1; we suppose from now on that for all integer k < p, there is a positive constant K_k such that for all n ≥ 1, E[‖Z_n − m‖^{2k}] ≤ K_k n^{−kα}. As in the previous proof, let us recall the decomposition of Z_{n+1} − m. We now bound the expectation of each term on the right-hand side of the previous inequality.
Remark C.1. Note that in order to get the rate of convergence in quadratic mean of the Robbins-Monro algorithm, i.e. in the case where p = 1, we just have to suppose that there are a positive integer q ≥ 3α/(1 − α) and a positive constant L_q such that for all h ∈ H, E[f(X, h)^{2q}] ≤ L_q.

D Proof of Lemma 6.2
We propose here a sketch of the proof; for analogous and more detailed computations, one can see the proof of Lemma 4.1 in Godichon-Baggioni (2016).
Proof of Lemma 6.2. Let p ≥ 1; we suppose from now on that for all integer k < p, there is a positive constant K_k such that for all n ≥ 1,
E[‖Z_n − m‖^{2k}] ≤ K_k n^{−kα}.    (D-32)
Using decomposition (9) and the Cauchy-Schwarz inequality, there are a positive constant c and a rank n_α such that for all n ≥ n_α,
‖Z_{n+1} − m‖² ≤ (1 − cγ_n)‖Z_n − m‖² + γ_n² ‖U_{n+1}‖² + 2γ_n ⟨Z_n − m, ξ_{n+1}⟩ + 2γ_n ‖Z_n − m‖ ‖δ_n‖.
If p = 1, thanks to Proposition 2.2, we have
2γ_n ‖δ_n‖ ‖Z_n − m‖ ≤ (c/2) γ_n ‖Z_n − m‖² + (2C_m²/c) γ_n ‖Z_n − m‖⁴,
and, since (ξ_n) is a martingale differences sequence adapted to the filtration (F_n), applying inequality (B-19), for all n ≥ n_α,
E[‖Z_{n+1} − m‖²] ≤ (1 − (c/2)γ_n + 2C²γ_n²) E[‖Z_n − m‖²] + 2γ_n² L_1 + (2C_m²/c) γ_n E[‖Z_n − m‖⁴].
One can then conclude the proof for p = 1 by taking a rank n_α and a positive constant c_0 such that for all n ≥ n_α, 1 − (c/2)γ_n + 2C²γ_n² ≤ 1 − c_0 γ_n.