RANDOM FORESTS FOR TIME-DEPENDENT PROCESSES

Random forests were introduced by Breiman in 2001. We study theoretical aspects of both Breiman's original random forests and a simplified version, the centred random forests. Under the independent and identically distributed hypothesis, Scornet, Biau and Vert proved the consistency of Breiman's random forest, while Biau studied the simplified version and obtained a rate of convergence in the sparse case. However, the i.i.d. hypothesis is generally not satisfied, for example when dealing with time series. We extend the previous results to the case where observations are weakly dependent, more precisely when the sequences are stationary β-mixing.

Figure 1: A partitioning of [0, 1]² and the associated binary tree. c_1, c_2, c_3 are the constants associated to each cell.

At each step, a split is selected among all the admissible splits based on all the variables. The cell is cut in two on the selected split and the previous step is reiterated on the new cells. A tree is then a piecewise constant decomposition of the input space. We can associate to the input space partitioning a binary tree where each node corresponds to a test matching how the input space was cut. An illustration is given in Figure 1 of a partitioning of the two-dimensional space and its associated binary tree. The principle of bagging (short for bootstrap aggregating) is to create M randomly generated training sets by randomly sampling α_n observations with or without replacement from the set D_n, and to construct a predictor on each set. Once the predictors are constructed, the bagging prediction for a new observation x is an aggregation, generally the empirical mean, of the predictions given by the M predictors at the point x. This procedure aims to improve the stability and accuracy of the base predictor. In the context of random forests, the predictors are regression trees.
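As an illustration of the bagging principle just described, here is a minimal Python sketch. The names (`bagging_predict`, `base_fit`, `alpha`) and the toy data are ours, not from the paper: it resamples α_n observations, fits a base predictor on each resample, and averages the M predictions.

```python
import random
from statistics import mean

def bagging_predict(train, x, base_fit, M=25, alpha=None, replace=True, seed=0):
    """Bagging sketch: fit M base predictors on resampled training sets
    of size alpha, then average their predictions at the query point x."""
    rng = random.Random(seed)
    alpha = alpha or len(train)
    preds = []
    for _ in range(M):
        if replace:
            sample = [train[rng.randrange(len(train))] for _ in range(alpha)]
        else:
            sample = rng.sample(train, alpha)
        predictor = base_fit(sample)  # base_fit returns a function x -> prediction
        preds.append(predictor(x))
    return mean(preds)                # empirical mean aggregation

# A trivial base predictor: the mean response of the resampled set.
mean_fit = lambda s: (lambda x: mean(y for _, y in s))
data = [(i / 10, 2.0) for i in range(10)]
print(bagging_predict(data, 0.5, mean_fit, M=5, alpha=8))
```

In a random forest the base predictor would be a regression tree rather than this constant-mean toy.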
We study two variants of random forests, the random forest-random input and the centred forest. By construction of the bagging, each predictor is computed in the same way; in order to explain the different procedures we therefore only need to make explicit the construction of one predictor. Let us begin with the variant which remains to this day the most commonly used, referred to as Breiman's original random forest: the random forest-random input (RF-RI). For a given generated training set of α_n points, a tree is computed using the CART [8] criterion: at each node of the tree the best split is selected by minimising the intra-node variance. This criterion is detailed in Section 2.1. A subtlety of the RF-RI is to restrict, at each node, the minimisation of the criterion to a random subset of m_try variables rather than the full set of p variables, thereby increasing the diversity of the predictors by adding randomness to the construction. This is recursively repeated until a stopping criterion is met, typically when the number of nodes reaches a given number or when the number of observations in each node falls below a given threshold.
The RF-RI has received increasing attention in recent years regarding theoretical analysis; we can cite for example the works described in [18,21,22,25]. Since the notation is only set later on, for ease of readability we defer their discussion to Section 2.1, with the exception of the result in [22], on which the present work relies and which requires no additional notation. Assuming that the observations (X_i, Y_i)_{1≤i≤n} are independent and identically distributed as (X, Y), they establish the consistency of the pruned version (that is, where the depth of the trees is controlled by a parameter) of the RF-RI, i.e. that E[f_n(X) − f(X)]² → 0 as n → +∞, for trees whose points are selected without replacement and when the regression function is an additive model. Under an additional assumption, yet hard to verify in general, they also established consistency of the unpruned version (that is, where the depth of a tree is not controlled), which is almost the algorithm commonly used in practice.
The second variant of random forests we study belongs to the family of so-called purely random forests. The RF-RI is based on the CART criterion, which is heavily data-dependent (the criterion depends on both the positions of the X_i and the values of the Y_i to choose the best split), while purely random forests are based on criteria which are independent of the data. The variant we consider, called the centred forest, was introduced in [7]. The first difference with the RF-RI is that there is no re-sampling step, meaning that the set used to compute the trees is D_n. A tree is then recursively constructed as follows. At each node, a coordinate is chosen uniformly, or according to some probability distribution independent of the data, and the split is performed in the middle of the cell along the selected coordinate. These variants have been preferred for statistical analysis since they are easier to define and provide non-asymptotic risk bounds giving insight into the choice of the parameters of the forest, while still capturing some attractive features of the original random forest such as variance reduction by randomisation and adaptive variable selection. Under the hypothesis that (X_i, Y_i)_{1≤i≤n} are i.i.d., [2] established that if the splits concentrate on the relevant variables, then the procedure adapts to sparsity, giving a rate of convergence which depends on the number of strong features. We refer to [3] for a complete theoretical survey on random forests.
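The centred splitting rule described above (choose a coordinate, cut the cell in its middle) can be sketched as follows. This is an illustrative toy implementation with hypothetical names (`centred_tree_partition`, `probs`), not the paper's Algorithm 2.

```python
import random

def centred_tree_partition(cell, depth, rng, probs=None):
    """Recursively split a hyperrectangular cell: pick a coordinate
    (uniformly, or with data-independent probabilities `probs`) and cut
    the cell in the middle along it. Returns the list of leaf cells."""
    if depth == 0:
        return [cell]
    p = len(cell)
    j = rng.choices(range(p), weights=probs)[0] if probs else rng.randrange(p)
    lo, hi = cell[j]
    mid = (lo + hi) / 2
    left = cell[:j] + [(lo, mid)] + cell[j + 1:]
    right = cell[:j] + [(mid, hi)] + cell[j + 1:]
    return (centred_tree_partition(left, depth - 1, rng, probs)
            + centred_tree_partition(right, depth - 1, rng, probs))

rng = random.Random(1)
leaves = centred_tree_partition([(0.0, 1.0), (0.0, 1.0)], depth=3, rng=rng)
print(len(leaves))  # 2**3 = 8 leaves
```

Note that the rule never looks at the responses Y_i, which is exactly what makes the forest "purely random".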
The aforementioned theoretical results are established under the condition that the observations are independent and identically distributed. However, in applications it is very common to have dependent data rather than independent data, such as time series, and random forests have proven to perform well on this kind of observations. We may cite [11,12,14,15] as examples of successful applications of random forests to time series. In this regard, many algorithms have been studied in the case of weakly dependent observations, in particular when dealing with β-mixing sequences. The β-mixing coefficients provide a measure of how the dependence between observations decreases as the distance between them increases. The mixing rates are usually difficult to estimate in practice; however, β-mixing sequences are theoretically well studied and the coefficients can be bounded for various classes of random processes such as Gaussian or Markov processes. We refer to [10] and [20] for more details about dependent processes. The general problem of one-step-ahead prediction of time series was considered in [17] when the time series satisfies stationarity and β-mixing conditions, establishing consistency and rates of convergence, by minimising the structural risk, for a certain class of functions whose complexity and memory are determined by the data. Consistency and a rate of convergence are also established for the boosting algorithm in [16] when the observations are stationary β-mixing. Their rate of convergence has an additional term, which we also find in our analysis, and which can be viewed as a penalty when considering β-mixing sequences instead of independent observations: O(n^{1−a(r_β+1)}) with a ∈ [0, 1), where r_β measures the dependence of the mixing sequence and is made precise later on.
The paper is organised as follows: we first formalise the models studied and then set the statistical framework together with the notion of β−mixing sequences. We then state our contribution, including the extension of the aforementioned results to the case where observations are weakly dependent, namely the consistency of the RF-RI when trees are not fully grown and the rate of convergence of centred random forests. The proofs are postponed to the appendices for ease of readability.

Models
In this section, we formalise the previously mentioned models, namely the RF-RI and the centred random forest.
Recall that a random forest (either RF-RI or a simpler model) is a collection of M random trees, computed in the same way, where each tree is constructed from a recursive partitioning of the input space X to which a binary tree can be associated matching how the input space was cut. We denote by f_n(x; Θ_j, D_n) the value predicted at the point x by the jth random tree, where (Θ_1, . . . , Θ_M) are independent and identically distributed as Θ and independent of D_n. The random variable Θ is defined later on depending on the variant. The jth random tree is defined as follows:

f_n(x; Θ_j, D_n) = ( Σ_{i ∈ D_n(Θ_j)} Y_i 1_{X_i ∈ A_n(x, Θ_j, D_n)} / N_n(x, Θ_j) ) 1_{E_n^c(x, Θ_j)},

where D_n(Θ_j) is the data set, which can depend on the random variable Θ_j, for example if resampling or sub-sampling is used to construct the jth tree. The cell containing the point x is denoted A_n(x, Θ_j, D_n), N_n(x, Θ_j) is the number of observations falling into that cell, and E_n(x, Θ_j) is the event {N_n(x, Θ_j) = 0}. This means that each random tree outputs, for a new point x, the average value over all Y_i for which the corresponding X_i falls into the cell A_n(x, Θ_j, D_n) of the random partition.
In the regression case, we aggregate the predictions by taking the average, to get the random forest estimator

f_{M,n}(x; Θ_1, . . . , Θ_M; D_n) = (1/M) Σ_{j=1}^{M} f_n(x; Θ_j, D_n).   (2.1)

Since M can be chosen as large as desired in practice, we study the properties of the infinite random forest estimate, obtained as the limit of equation (2.1) when the number of trees M grows to infinity. The law of large numbers then justifies using f_n(x, D_n) = E_Θ[f_n(x, Θ, D_n)] instead of f_{M,n}(x; Θ_1, . . . , Θ_M; D_n), where E_Θ denotes expectation with respect to Θ conditionally on D_n. In the following, to ease legibility, we omit the dependency on D_n and simply denote f_n(x) := f_n(x, D_n).
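To illustrate how the finite forest (2.1) approximates the infinite one by a Monte Carlo average over Θ, here is a toy sketch. The one-split "tree" and all names (`tree_predict`, `forest_predict`) are ours, chosen only to make the averaging over Θ concrete.

```python
import random
from statistics import mean

def tree_predict(x, theta, data):
    """Toy randomized 'tree': one centred split of [0,1]^p along the
    coordinate chosen by theta, then average the Y_i in x's half-cell."""
    j = theta
    in_cell = [y for (xi, y) in data if (xi[j] < 0.5) == (x[j] < 0.5)]
    return mean(in_cell) if in_cell else 0.0

def forest_predict(x, data, M, rng):
    # Finite forest: empirical mean over M i.i.d. draws of Theta.
    thetas = [rng.randrange(len(x)) for _ in range(M)]
    return mean(tree_predict(x, t, data) for t in thetas)

rng = random.Random(0)
data = [((rng.random(), rng.random()), 1.0) for _ in range(50)]
# With constant responses, every tree (hence the forest) predicts 1.0,
# whatever the realisations of Theta.
print(forest_predict((0.2, 0.7), data, M=100, rng=rng))
```

As M grows, the empirical mean over the Θ_j converges to the conditional expectation E_Θ, which is the infinite forest estimate.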

Random forest -random input
We begin by recalling the variant of random forest which is the most commonly used in practice, the random forest-random input. We denote:
- α_n ∈ {1, . . . , n} the number of sampled data points in each tree;
- m_try ∈ {1, . . . , p} the preselected number of variables for splitting;
- τ_n ∈ {1, . . . , α_n} the number of leaves in each tree.
Here we consider the stopping criterion where the number of leaves must not exceed the given parameter τ_n. The random forest is then computed as detailed in Algorithm 1. We shall make a remark regarding the selection of the nodes. They are not chosen uniformly among all the childless nodes; otherwise, there could exist tree branches far more developed than others merely because of randomness. Usually, all nodes of a given level are split (if permitted), then the algorithm considers the nodes of the next level, and so on. This remark also holds for the centred forest in Section 2.2.
Algorithm 1: Random forest - random input
input: Training set ((X_1, Y_1), . . . , (X_n, Y_n))
parameters: number of trees M, number of observations per tree α_n, size m_try of the set of preselected variables for splitting, number of leaves τ_n
for j ← 1 to M do
  Construct the jth tree:
  - Draw uniformly α_n ≤ n points without replacement.
  - while n_nodes < τ_n do
    • Choose a childless node A containing more than one observation.
    • Choose the best split in the cell A maximising the CART criterion, defined in equation (2.2), over the m_try preselected variables.
    • Cut the cell A according to the best split. Let A_R and A_L be the cells obtained.
    • n_nodes = n_nodes + 1.

The CART criterion is defined as follows. Let C_A be the set of all possible cuts in the cell A. For any (j, z) ∈ C_A, the CART-split criterion takes the form

L_n(j, z) = (1/N_n(A)) Σ_{i=1}^{n} (Y_i − Ȳ_A)² 1_{X_i ∈ A} − (1/N_n(A)) Σ_{i=1}^{n} (Y_i − Ȳ_{A_L} 1_{X_i^{(j)} < z} − Ȳ_{A_R} 1_{X_i^{(j)} ≥ z})² 1_{X_i ∈ A},   (2.2)

where A_L = {x ∈ A : x^{(j)} < z}, A_R = {x ∈ A : x^{(j)} ≥ z}, and Ȳ_A (resp. Ȳ_{A_L}, Ȳ_{A_R}) is the average of the Y_i such that X_i belongs to A (resp. A_L, A_R).

Let us suppose that the observations (X_i, Y_i)_{1≤i≤n} are independent and identically distributed as (X, Y), where the noise ε is a centred Gaussian with finite variance σ² > 0, independent of X. A link between the error of the finite and the infinite forest is established in [21], showing that the error of the finite forest can be made arbitrarily close to that of the infinite one provided that the number of trees is large enough. Another consequence of this result is that as soon as infinite random forests are consistent, then finite random forests are consistent provided that log n / M → 0 as n → ∞. Asymptotic normality of random forests based on subsampling was proven in [18] when the subsample size α_n grows slower than √n, i.e. α_n/√n → 0 as n → ∞, and the number of trees M varies with n, i.e. n/M → C as n → ∞ for some constant C > 0. However, this does not necessarily imply that random forests are asymptotically unbiased. This gap was filled in [25], which also established that the infinitesimal jackknife consistently estimates the forest variance under the less restrictive condition that the subsample size grows such that α_n (log n)^p / n → 0 as n → ∞.
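The CART-split search restricted to a random subset of m_try coordinates can be sketched as follows. This is a toy illustration (the name `best_cart_split` and the brute-force search are ours): it maximises the empirical decrease in intra-node variance over candidate cuts.

```python
import random
from statistics import pvariance

def best_cart_split(points, m_try, rng):
    """CART-style split search: over a random subset of m_try coordinates,
    pick the cut (j, z) maximising the decrease in intra-node variance."""
    p = len(points[0][0])
    ys = [y for _, y in points]
    parent = pvariance(ys) if len(ys) > 1 else 0.0
    best = None
    for j in rng.sample(range(p), m_try):      # random coordinate subset
        for x, _ in points:                    # candidate thresholds
            z = x[j]
            left = [y for xi, y in points if xi[j] < z]
            right = [y for xi, y in points if xi[j] >= z]
            if not left or not right:
                continue
            n = len(ys)
            child = (len(left) * pvariance(left)
                     + len(right) * pvariance(right)) / n
            gain = parent - child              # empirical variance decrease
            if best is None or gain > best[0]:
                best = (gain, j, z)
    return best  # (variance decrease, coordinate, threshold)

rng = random.Random(0)
pts = [((x / 10, 0.0), float(x >= 5)) for x in range(10)]
gain, j, z = best_cart_split(pts, m_try=2, rng=rng)
print(j, z)  # splits on coordinate 0 at z = 0.5, separating the two groups
```

With m_try < p, different trees of the forest explore different coordinates at each node, which is the source of extra randomness in the RF-RI.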

Centred forest
We now recall the construction of the centred random forest introduced in [7], detailed in Algorithm 2.
We note that τ_n ≥ 2 is a fixed deterministic parameter which may depend on n but not on D_n, and that each tree has exactly 2^⌈log₂ τ_n⌉ ≈ τ_n terminal nodes. Note also that there is no re-sampling step in the centred random forest algorithm, so that D_n(Θ) = ((X_1, Y_1), . . . , (X_n, Y_n)).

Statistical framework
The first assumption throughout this paper is that the random sequence (W t ) t∈Z is stationary. More precisely, we assume that (W t ) t∈Z is a strongly stationary process as defined in Definition 3.1.
In order to prove the consistency of the RF-RI we also need to assume that (W t ) t∈Z is an ergodic process as defined in Definition 3.2.
Let (C_n)_n be a positive sequence and define the truncation operator T_{C_n} by

T_{C_n} u = u if |u| ≤ C_n,   T_{C_n} u = C_n sign(u) if |u| > C_n,

and, given a class of functions G_n = G(D_n) with g : X → Y, denote by T_{C_n} G_n = {T_{C_n} g : g ∈ G_n} the corresponding class of truncated functions. The consistency proof in [22] relies on the general consistency theorem found in [13]. In order to extend the consistency result to the dependent case, we use the extension of the general consistency theorem to the stationary ergodic setting, as stated in Proposition 3.3. We postpone the proof to Appendix B for ease of readability.
Proposition 3.3. Let (W_t)_{t∈Z} be a stationary ergodic process and D_n a data set. Let G_n = G(D_n) be a class of functions g : X → Y, (C_n)_n a positive sequence, f the regression function in equation (1.1) and f_n an estimator which minimises the empirical L² risk on G_n. If conditions (3.1a)-(3.1c) hold, then E[T_{C_n} f_n(X) − f(X)]² → 0 as n → +∞.

We now recall the notion of weak dependence, more precisely the β-mixing case, in which we establish the results.
Definition 3.4 (β-mixing process). Let σ_l = σ(W_1^l) and σ_{l+m} = σ(W_{l+m}^∞) be the sigma-algebras of events generated by the random variables W_1^l = (W_1, . . . , W_l) and W_{l+m}^∞ = (W_{l+m}, W_{l+m+1}, . . .). The β-mixing coefficient is given by

β_m = sup_{l ≥ 1} E[ sup_{A ∈ σ_{l+m}} |P(A | σ_l) − P(A)| ],

where the expectation is taken with respect to σ_l.
A stochastic process is said to be absolutely regular, or β-mixing, if β_m → 0 as m → +∞. The most common β-mixing rates are known as algebraic and exponential mixing, defined as follows:
- algebraic mixing: β_m ≤ β_0 m^{−r_β} for some β_0 ≥ 0 and r_β > 0;
- exponential mixing: β_m ≤ β_0 exp(−β_1 m^{k_β}) for some β_0, β_1 ≥ 0 and k_β > 0.
The exponential mixing hypothesis is stronger than algebraic mixing. The values r_β and k_β are called the mixing exponents, and the i.i.d. process can be recovered by taking either the limit r_β → +∞ for algebraic mixing or k_β → +∞ for exponential mixing.
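A quick numerical comparison of the two mixing regimes illustrates why exponential mixing is the stronger hypothesis; the bound parameters β_0, β_1, r_β, k_β below are set to arbitrary illustrative values.

```python
import math

def beta_algebraic(m, beta0=1.0, r=2.0):
    # Algebraic (polynomial) mixing bound: beta_m <= beta0 * m**(-r)
    return beta0 * m ** (-r)

def beta_exponential(m, beta0=1.0, beta1=1.0, k=1.0):
    # Exponential mixing bound: beta_m <= beta0 * exp(-beta1 * m**k)
    return beta0 * math.exp(-beta1 * m ** k)

# For large lags the exponential bound falls far below any polynomial
# bound, so exponential mixing implies algebraic mixing (of any exponent)
# for m large enough.
for m in (1, 10, 100):
    print(m, beta_algebraic(m), beta_exponential(m))
```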
The β-mixing property is appealing in the theoretical setting since many statistical properties are preserved under this condition and β-mixing sequences are easy to manipulate. One way to handle them is a lemma established in [26], recalled in Lemma A.1. Using this lemma, the dependent process is approximated by independent blocks of observations, at the price of an additional term depending linearly on the β-mixing coefficients.

Result for the RF-RI
We have recalled the studied models and the notion of weak dependence. We need the following hypotheses to establish the consistency of the RF-RI when the observations are weakly dependent:
- H1a: the data set D_n = ((X_1, Y_1), . . . , (X_n, Y_n)) is composed of stationary ergodic β-mixing observations;
- H2a: the errors ε_i := Y_i − f(X_i) form an independent centred Gaussian noise with finite variance σ² > 0;
- H3a: the regression function is an additive model and each component f_j is continuous.
We can now state the result of consistency of random forests when the observations are weakly dependent under the regime τ n < α n (i.e. the trees are not fully grown).
Theorem 4.1. Assume the stationary ergodic β-mixing data hypothesis H1a, the independent errors hypothesis H2a and the additive model hypothesis H3a. If there exists a sequence a_n verifying 1 ≤ a_n ≤ n such that τ_n (log α_n)⁹ a_n / α_n → 0 and (α_n / a_n) β_{a_n} → 0 as n → +∞, then the RF-RI is consistent, i.e. E[f_n(X) − f(X)]² → 0 as n → +∞.

Let us first verify that we recover the result in the independent case. If the observations (X_i, Y_i)_{1≤i≤n} are independent, β_m = 0 for all m ≥ 0. We then get exactly the same hypotheses and result as in [22] by setting a_n equal to 1.
The hypotheses H2a and H3a are the same as in [22]. Note however that, in the context of β-mixing processes, the independent errors hypothesis H2a is not necessarily true, but it is assumed in some theoretical models such as the autoregressive model. We refer to [4] for a complete survey of processes verifying the β-mixing condition. An interesting perspective would be to extend the result to the case where the errors are not assumed to be i.i.d.
The condition τ_n (log α_n)⁹ a_n / α_n → 0 as n tends to infinity is also highly similar to the last hypothesis of their theorem, and we recover it by setting a_n equal to 1. The remaining condition simply says that the dependence between the data must not be too long-ranged in order to have consistency of the forest. Let us see how the dependence influences the number of leaves parameter τ_n. Suppose, in the following analysis, that r_β (or k_β in the exponential mixing case) is known. Consider first the algebraic mixing case and suppose that a_n = α_n^a with 1/(1+r_β) < a < 1. The condition (α_n/a_n) β_{a_n} → 0 is then verified and, in order to obtain consistency, the greatest admissible value of τ_n must verify

τ_n (log α_n)⁹ / α_n^{1−a} → 0.

In the exponential mixing case, suppose that a_n = (c log α_n)^{1/k_β} with c large enough. The condition on β is again verified, and the condition on τ_n can be rewritten, implying that τ_n cannot grow faster than α_n / (log α_n)^{9 + 1/k_β}, up to a constant. This analysis leads to the following conclusion. The nature of the mixing hypothesis appears in the choice of the parameter τ_n, influenced by r_β (or k_β): the stronger the dependence between the observations, meaning that r_β (or k_β) is small, the shallower the trees need to be, compared to trees constructed from i.i.d. observations, in order to guarantee convergence.
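The qualitative conclusion that stronger dependence forces shallower trees can be made concrete with a small computation. The helper below is ours and only tracks the exponent of α_n allowed for τ_n (ignoring logarithmic factors) when a is taken just above its minimum 1/(1+r_β); the `margin` slack is a hypothetical choice.

```python
def max_leaves_exponent(r_beta, margin=0.05):
    """Under algebraic mixing, blocks of size a_n = alpha_n**a with
    a > 1/(1 + r_beta) are needed; the leaf count tau_n can then grow
    at most like alpha_n**(1 - a), up to logarithmic factors. We take
    `a` just above its minimum, with a small illustrative margin."""
    a = 1.0 / (1.0 + r_beta) + margin
    return 1.0 - a

# Stronger dependence (smaller r_beta) => smaller exponent => shallower trees.
for r in (1.0, 2.0, 10.0, 100.0):
    print(r, round(max_leaves_exponent(r), 3))
```

As r_β grows, the exponent approaches 1 (minus the margin), i.e. the nearly linear growth of τ_n in α_n allowed in the i.i.d. case.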

Results on centred random forest
We now analyse the convergence rates of the centred random forest model when the observations are stationary β-mixing. The space [0, 1]^p is equipped with the standard Euclidean metric. We analyse the centred random forest in a sparse framework; this arises from the fact that in many applications the true dimension is much smaller than p. We assume that the regression function only depends on a nonempty subset S of the p features, and we use the letter S to denote the cardinality of S. Based on this assumption, we write f* : [0, 1]^S → R for the section of f corresponding to S, so that f(x) = f*(x^S), where x^S denotes the coordinates of x indexed by S. We also need the following hypotheses to establish the results:
- H1b: the data set D_n = ((X_1, Y_1), . . . , (X_n, Y_n)) is composed of stationary β-mixing (X_i, Y_i) ∈ [0, 1]^p × R;
- H2b: the errors ε_i := Y_i − f(X_i) are independent with finite variance σ² > 0.

Convergence rates
We first decompose E[f_n(X) − f(X)]² into a variance term and a bias term, measuring respectively the estimation error and the approximation error. We assume throughout that the coordinate-sampling probabilities are such that p_{n,j} = (1/S)(1 + ν_{n,j}) for j ∈ S and p_{n,j} = ν_{n,j} otherwise, where each ν_{n,j} tends to 0 as n tends to infinity.
The first result concerns the variance term and the second the bias term.
Proposition 5.1. Assume the stationary β-mixing data hypothesis H1b, the independent errors hypothesis H2b, and that X is uniformly distributed on [0, 1]^p. Assume moreover that the coordinate-sampling probabilities are such that p_{n,j} = (1/S)(1 + ν_{n,j}) for j ∈ S. If there exists a sequence a_n verifying 1 ≤ a_n ≤ n, then the variance term is bounded by

C σ² (1 + ν_n) τ_n a_n² / (n (log τ_n)^{S/(2p)}) + σ² β_{a_n} n / a_n,

where C = (576/π)(π log 2 / 16)^{S/(2p)} and 1 + ν_n = Π_{j∈S}(1 + ν_{n,j}). As noted in [2], the bound can be adapted to the case where p_lower < p_{n,j} < p_upper for some constants p_lower, p_upper ∈ (0, 1).

Proposition 5.2. Assume the stationary β-mixing data hypothesis H1b, that X is uniformly distributed on [0, 1]^p and that f* is L-Lipschitz on [0, 1]^S. Assume moreover that the coordinate-sampling probabilities are such that p_{n,j} = (1/S)(1 + ν_{n,j}) for j ∈ S. If there exists a sequence a_n verifying 1 ≤ a_n ≤ n, then the bias term is bounded by

2SL² τ_n^{−0.75(1+γ_n)/(S log 2)} + sup_x f²(x) ( exp(−n/(4τ_n)) + (n/a_n) β_{a_n} ),

where γ_n = min_j ν_{n,j}.
The bias in the weakly dependent case only depends on the true dimension and not p which confirms the intuition and the result in the independent case as noted in [2]. However, we should keep in mind, whether in the dependent or independent setting, that the result relies on the assumption that the splits concentrate on the relevant variables.
Using the inequality z exp(−nz) ≤ 1/(en) for z ∈ (0, 1] and combining both previous convergence rates, we get the following result.

Theorem 5.3. Assume the stationary β-mixing data hypothesis H1b, the independent errors hypothesis H2b, that X is uniformly distributed on [0, 1]^p and that f* is L-Lipschitz on [0, 1]^S. Assume moreover that the coordinate-sampling probabilities are such that p_{n,j} = (1/S)(1 + ν_{n,j}) for j ∈ S. If there exists a sequence a_n verifying 1 ≤ a_n ≤ n, then

E[f_n(X) − f(X)]² ≤ 2SL² τ_n^{−0.75(1+γ_n)/(S log 2)} + C_{1,n} τ_n a_n² / n + C_2 β_{a_n} n / a_n,

with C_{1,n} = 4e^{−1} sup_x f²(x) + C σ² (1 + ν_n)(log τ_n)^{−S/(2p)} and C_2 = σ² + sup_x f²(x).

The remark made previously on the independent errors hypothesis H2a obviously holds for H2b as well. The hypothesis X ∼ U([0, 1]^p) is only a convenience and can easily be extended to the case where X admits a Lebesgue density which is lower and upper bounded.
We also recover the convergence rate in the independent setting given in [2], up to a constant factor. Suppose that we are in the independent case, hence β_m = 0 for all m ≥ 1. Setting a_n equal to 1 and plugging into Propositions 5.1 and 5.2, we get back exactly the same upper bound for the variance as in [2]. However, regarding the bias term, we get a term in exp(−n/(4τ_n)) instead of exp(−n/(2τ_n)), which is due to a necessary pre-processing step needed in order to work with β-mixing sequences.
Under the algebraic (and thus also the exponential) mixing hypothesis, the term depending on β converges to 0 when n tends to infinity. The last term shows the price we must pay when dealing with β-mixing sequences instead of independent observations. More precisely, under algebraic mixing the penalty is of the form O(n^{1−a(r_β+1)}) with a ∈ [0, 1), which is the same penalty as in the convergence rate of boosting established in [16]. The following corollary gives, under the algebraic and exponential mixing conditions, the choices of τ_n with the associated upper bound on the rate of convergence.

1. Under algebraic mixing, taking a_n ∝ n^a with a as above implies that the parameter τ_n is of the form

τ_n ∝ n^{(1+r_β) S log 2 / (2.25 + 2S log 2 + r_β(0.75 + S log 2))}

and achieves the convergence rate

E[f_n(X) − f(X)]² = O( n^{−(0.75 r_β − 0.75 − S log 2) / (2.25 + 2S log 2 + r_β(0.75 + S log 2))} ).

2. Under exponential mixing, taking a_n ∝ (log n)^{1/k_β}, the parameter τ_n can be chosen as in the i.i.d. case up to a logarithmic factor.

The form of the convergence rate under the algebraic mixing condition implies that, in order to have consistency, the couple (r_β, S) must satisfy the inequality 0.75 + S log 2 < 0.75 r_β. It also implies that this result only treats the case where r_β ≥ 1.41. We note that we recover the same optimal parameter and convergence rate as in [2] by letting r_β go to infinity. Under the exponential mixing condition, the chosen τ_n is, up to a logarithmic factor in the denominator, the optimal parameter of the i.i.d. case and gives the same convergence rate up to a logarithmic term depending on the inverse of k_β.
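The claim that the i.i.d. choice of τ_n is recovered as r_β → ∞ can be checked numerically from the exponent of τ_n appearing in the algebraic mixing case; the function names below are ours.

```python
import math

def tau_exponent_algebraic(S, r_beta):
    """Exponent e such that tau_n ∝ n**e under algebraic mixing."""
    num = (1 + r_beta) * S * math.log(2)
    den = 2.25 + 2 * S * math.log(2) + r_beta * (0.75 + S * math.log(2))
    return num / den

def tau_exponent_iid(S):
    # Optimal exponent in the i.i.d. case of [2]: S log 2 / (0.75 + S log 2)
    return S * math.log(2) / (0.75 + S * math.log(2))

# Letting r_beta grow recovers the i.i.d. parameter choice from below:
# stronger dependence (small r_beta) means fewer leaves.
S = 3
for r in (2, 10, 100, 10_000):
    print(r, round(tau_exponent_algebraic(S, r), 4), round(tau_exponent_iid(S), 4))
```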
The previous analysis leads to the following conclusion. The choice of the parameter τ_n is determined by the nature of the mixing hypothesis: the stronger the dependence between the observations, meaning that r_β (or k_β) is small, the shallower the trees need to be, compared to trees constructed from i.i.d. observations, in order to guarantee convergence.

Conclusion
The results for either the random forest-random input or the centred forest lead to the same conclusion: the longer the dependence between the observations, the shallower the trees need to be compared to trees constructed from independent and identically distributed observations.
These results may also lead to new variants of random forests. The proofs of the results are based on a decomposition of the random process into blocks which are close to being independent. An analogy can be drawn between this decomposition and the so-called block bootstrap commonly used in time series estimation: instead of considering the observations one by one, the algorithm is fed with blocks of observations, which leads to better estimates. It could be interesting to modify the random forest algorithm in the same way to obtain a random forest adapted to time series.

Appendix A. Proofs
The proofs are based on the construction and lemma given in [26], also recalled below, but we note that a similar coupling lemma is proved in [1].
We divide the sequence (W_i)_{1≤i≤n} into 2μ_n blocks, each of size a_n. We assume that n = 2μ_n a_n, so that there are no remaining terms. We then define, for 1 ≤ j ≤ μ_n,

H_j = {i : 2(j − 1)a_n + 1 ≤ i ≤ (2j − 1)a_n},   T_j = {i : (2j − 1)a_n + 1 ≤ i ≤ 2j a_n},

and we denote W^{(j)} = {W_i, i ∈ H_j}. We then denote the sequence of H-blocks W_{a_n} = (W^{(j)})_{1≤j≤μ_n}. We construct a sequence of independently distributed blocks Ξ_{a_n} = (Ξ^{(j)})_{1≤j≤μ_n}, where Ξ^{(j)} = {ξ_i, i ∈ H_j} and such that, for all j ∈ {1, . . . , μ_n}, Ξ^{(j)} has the same distribution as W^{(j)}. We construct a sequence of T-blocks in the same way. An illustration of this construction is given in the accompanying figure.

Proof of Theorem 4.1. The proof consists in applying Proposition 3.3. The computation of the approximation error is the same as in [22], since it does not require the independence of (X_i, Y_i)_{1≤i≤n} but only stationarity and the independence of the errors (ε_i)_{1≤i≤n}. This verifies equation (3.1b).
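The interleaved H- and T-blocks can be sketched as follows; this is a toy helper (ours) assuming, as above, that n = 2μ_n a_n with no remainder.

```python
def make_blocks(n, a_n):
    """Divide indices 1..n into 2*mu_n alternating blocks H_j, T_j of
    size a_n each (assuming n = 2 * mu_n * a_n, i.e. no remainder)."""
    assert n % (2 * a_n) == 0, "assumes no remaining terms"
    mu_n = n // (2 * a_n)
    H = [list(range(2 * (j - 1) * a_n + 1, (2 * j - 1) * a_n + 1))
         for j in range(1, mu_n + 1)]
    T = [list(range((2 * j - 1) * a_n + 1, 2 * j * a_n + 1))
         for j in range(1, mu_n + 1)]
    return H, T

H, T = make_blocks(n=12, a_n=2)
print(H)  # [[1, 2], [5, 6], [9, 10]]
print(T)  # [[3, 4], [7, 8], [11, 12]]
```

The H-blocks (and similarly the T-blocks) are separated by gaps of length a_n, which is what allows them to be replaced by independent copies at a cost controlled by β_{a_n}.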
The partition obtained with the random variable Θ and the data set D_n is denoted by P_n(D_n, Θ). We let Π_n(Θ) be the family of all achievable partitions with random parameter Θ. We let M(Π_n(Θ)) = max{Card(P) : P ∈ Π_n(Θ)} be the maximal number of terminal nodes among all partitions in Π_n(Θ). Given a set z_1^n = {z_1, . . . , z_n} ⊂ [0, 1]^p, Γ_n(z_1^n, Π_n(Θ)) denotes the number of distinct partitions of z_1^n induced by elements of Π_n(Θ), that is, the number of different partitions {z_1^n ∩ A, A ∈ P} of z_1^n for P ∈ Π_n(Θ).
Let G n (Θ) be the set of all functions g : [0, 1] p → R piecewise constant on each cell of the partition P n (Θ).
We define, as in [22], C_n = ||f||_∞ + σ√2 (log α_n)², hence equation (3.1a) is verified. Regarding the estimation error, the computation is very similar to that in [22], but we need to use a result established in [17] to introduce the β-mixing coefficients. This will prove equation (3.1c).
Theorem A.2. Let (W_t)_{t∈Z} be a β-mixing stationary stochastic process with |Y_i| ≤ A_n, and let G_n be a class of functions g : R^p → R. Then, for any d ≥ 2, the deviation probability of the empirical L² risk from its expectation is bounded by a covering-number term for G_n plus a term 2μ_n β_{a_n} accounting for the dependence. Using Theorem A.2, we get a bound of the form, up to absolute constants,

E[ N(δ, G_n(Θ), l_{1,n}) ] exp(−μ_n δ² / (128 (2C_n)⁴)) + 2μ_n β_{a_n},

where α_n = 2μ_{α_n} a_{α_n}. For simplicity's sake, we denote μ_n = μ_{α_n} and a_n = a_{α_n}.
Hence, with the β-mixing condition, the right-hand side tends to 0 as n tends to infinity. Thus, according to Proposition 3.3, the truncated estimate is consistent. It only remains to check that the non-truncated random forest estimate is consistent; this step is identical to [22].

A.2 Proofs for centred forests
Proof of the variance rate, Proposition 5.1. We follow the proof given in [2]. Since the training sample is not independent, we cannot reproduce exactly the same lines and results, but the main ideas, combined with Lemma A.1, are the same.
Remember that the random forest estimator is written as a weighted average of the responses. Thus, omitting the dependence on D_n, the random forest estimator can be written

f_n(x) = Σ_{i=1}^{n} W_{n,i}(x) Y_i,   with W_{n,i}(x) = E_Θ[ 1_{X_i ∈ A_n(x, Θ)} / N_n(x, Θ) ].

We can now begin the computation, decomposing the variance term into the two parts of equation (A.1). The second term of equation (A.1) is equal to zero since the errors (ε_i)_{1≤i≤n} are independent by hypothesis H2b.
We next analyse the first term, which we can upper-bound by σ² Σ_{i=1}^{n} E[W²_{n,i}(X, Θ)] using the hypothesis on the variance of the errors H2b.
The next step is to analyse the expectation of W_{n,i}. Since the data are not independent, we cannot proceed exactly as in [2]: we need to rewrite the sum over n, decompose it into blocks, and then use Lemma A.1. We can then use an argument similar to [2], namely, by introducing another random variable, reveal a binomial random variable in the denominator. Let us first decompose the previous sum over the blocks B = H or T. We easily observe that the weights are bounded by 1, by definition of W_{n,i}.
Let us begin with the first part of the right-hand side, over the H-blocks. We introduce Θ', independent of Θ but with the same distribution. For a fixed j, by independence of the blocks we can remove the conditioning on the ξ variables, and we apply the Cauchy-Schwarz inequality. We then use the following fact (cf. [13]):

E[ 1 / (1 + Bin(N, p))² ] ≤ 3 / ((N + 1)(N + 2) p²).

Since the blocks are independent, picking one component ĩ in each of the blocks (H_j)_{1≤j≤μ_n} and (T_j)_{1≤j≤μ_n}, we get

Σ_{j'=1}^{2μ_n − 1} 1_{ξ_{j'} ∈ A_n(X, Θ)} ∼ Bin(2μ_n − 1, P(X ∈ A_n(X, Θ) | X, Θ)).
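The binomial moment bound just quoted can be checked by simulation; a quick Monte Carlo sketch (names ours):

```python
import random

def mc_inverse_binomial_sq(N, p, trials=20000, seed=0):
    """Monte Carlo estimate of E[1 / (1 + Bin(N, p))**2]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        b = sum(rng.random() < p for _ in range(N))  # one Bin(N, p) draw
        total += 1.0 / (1 + b) ** 2
    return total / trials

N, p = 20, 0.3
estimate = mc_inverse_binomial_sq(N, p)
bound = 3.0 / ((N + 1) * (N + 2) * p ** 2)
print(estimate, bound)  # the estimate stays below the bound
```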
Since we suppose that the law of X is uniform on [0, 1]^p, and by construction of the tree, we can compute the probability P(X ∈ A_n(X, Θ) | X, Θ). The same is done for the conditional expectation with respect to (X, Θ'). The last inequality uses the fact that, even though dependent, the variables have the same distribution. The rest is the same as in [2]. After the computations over H, we get a contribution bounded by

C σ² (1 + ν_n) τ_n a_n / (μ_n (log τ_n)^{S/(2p)}),

with C = (144/π)(π log 2 / 16)^{S/(2p)} and 1 + ν_n = Π_{j∈S}(1 + ν_{n,j}). We do the same over T.
Combining both analyses, we obtain a variance term bounded by

C σ² (1 + ν_n) τ_n a_n / (μ_n (log τ_n)^{S/(2p)}) + 2σ² β_{a_n} μ_n,

with C now equal to (288/π)(π log 2 / 16)^{S/(2p)}.
By construction of the blocks, μ_n = n/(2a_n); plugging this into the previous expression, we obtain a variance term bounded by

C σ² (1 + ν_n) τ_n a_n² / (n (log τ_n)^{S/(2p)}) + σ² β_{a_n} n / a_n,   with C = (576/π)(π log 2 / 16)^{S/(2p)},

which proves Proposition 5.1.

Proof of the bias term, Proposition 5.2. The start of the proof is the same as in [2], since it does not use the hypothesis of independence between the observations; we get the first inequality using the hypothesis that f* is L-Lipschitz. To go further in the analysis, we have to use Lemma A.1 to get independent variables. We proceed similarly to the first proof: we decompose over the H-blocks, apply Lemma A.1, and do the same over T. We need a similar operation to compute the probability P(E_n^c(X, Θ)). Recall that E_n(X, Θ) is the event {Σ_{i=1}^{n} 1_{X_i ∈ A_n(X, Θ)} = 0}. Using Lemma A.1, we get a bound of the form

sup_x f²(x) P(∀ 1 ≤ j ≤ μ_n, ∀ i ∈ H_j, ξ_i ∉ A_n(X, Θ)) + μ_n β_{a_n} (2SL² + sup_x f²(x)).

We first analyse the term P(∀ 1 ≤ j ≤ μ_n, ∀ i ∈ H_j, ξ_i ∉ A_n(X, Θ)):

P(∀ 1 ≤ j ≤ μ_n, ∀ i ∈ H_j, ξ_i ∉ A_n(X, Θ)) ≤ P(∀ 1 ≤ j ≤ μ_n, ξ_ĩ ∉ A_n(X, Θ)),

where ĩ is an arbitrary index chosen in each block H_j. Since the blocks are independent, the events in the probability are independent; furthermore, they have the same distribution. Thus

P(∀ 1 ≤ j ≤ μ_n, ∀ i ∈ H_j, ξ_i ∉ A_n(X, Θ)) ≤ P^{μ_n}(ξ_1 ∉ A_n(X, Θ)).

Proof of Proposition 3.3. To prove this result, we follow the same lines as in [13]. Instead of using the law of large numbers for i.i.d. variables, we use the law of large numbers for stationary ergodic processes. We write the error as a sum of terms and show that each of them tends to zero. The last step uses equations (3.1a)-(3.1c) and the strong law for stationary ergodic processes. We get the result by letting L → ∞.