Empirical measures: regularity is a counter-curse to dimensionality

We propose a "decomposition method" to prove non-asymptotic bounds for the convergence of empirical measures in various dual norms. The main point is to show that if one measures convergence in duality with sufficiently regular observables, the convergence is much faster than for, say, merely Lipschitz observables. Actually, assuming $s$ derivatives with $s>d/2$ ($d$ the dimension) ensures an optimal rate of convergence of $1/\sqrt{n}$ ($n$ the number of samples). The method is flexible enough to apply to Markov chains satisfying a geometric contraction hypothesis, assuming neither stationarity nor reversibility, with the same convergence rate up to a power-of-logarithm factor. Our results are stated as controls of the expected distance between the empirical measure and its limit, but we explain briefly how the classical bounded difference method can be used to deduce concentration estimates.


Empirical measures and quadrature
Consider a discrete-time stochastic process $(X_k)_{k\ge 0}$ taking its values in some phase space $\Omega$, assumed to be a Polish space endowed with its Borel $\sigma$-algebra. We are concerned with the random atomic measure
$$\hat\mu_n := \frac1n \sum_{k=1}^n \delta_{X_k},$$
called the empirical measure of the process, and its convergence. We shall either assume that the $(X_k)_{k\ge 0}$ are independent and identically distributed of some law $\mu$, or assume some weak long-range dependence and convergence of the law of $X_k$ to $\mu$ as $k\to\infty$.
To quantify the convergence, we are interested in distances on the set $\mathcal P(\Omega)$ of probability measures defined by duality. Given a class $\mathcal F$ of functions $f:\Omega\to\mathbb R$ (sometimes called "test functions" or "observables"), one defines for $\mu_0,\mu_1\in\mathcal P(\Omega)$:
$$\|\mu_0-\mu_1\|_{\mathcal F} := \sup_{f\in\mathcal F} \bigl|\mu_0(f)-\mu_1(f)\bigr|$$
(note that we write indifferently $\mu_0(f)$ or $\int f \,\mathrm d\mu_0$). One particularly important case is obtained by taking $\mathcal F = \mathrm{Lip}_1(\Omega)$, the set of 1-Lipschitz functions. The corresponding metric is the 1-Wasserstein metric $W_1 = \|\cdot\|_{\mathrm{Lip}_1}$, which by virtue of Kantorovich duality can be written equivalently as
$$W_1(\mu_0,\mu_1) := \inf \mathbb E\bigl[\|X-Y\|\bigr],$$
where $\|\cdot\|$ here is the Euclidean norm and the infimum is over all pairs $(X,Y)$ of random variables with the given measures as individual laws. It has long been known [AKT84] that, when the $(X_k)_{k\ge 0}$ are independent and uniformly distributed on $[0,1]^d$, we have
$$\mathbb E\bigl[W_1(\hat\mu_n,\lambda)\bigr] \asymp \begin{cases} n^{-1/2} & \text{if } d=1,\\ \sqrt{\log n}\; n^{-1/2} & \text{if } d=2,\\ n^{-1/d} & \text{if } d\ge 3,\end{cases} \qquad (1)$$
where $\asymp$ expresses upper and lower bounds up to multiplicative constants and $\lambda$ denotes the Lebesgue measure. This problem and its generalizations have been studied in several works, e.g. [Tal92, Tal94, BLG14, DSS13, FG15, AST16, WB17].

The bounds (1) are interesting theoretically, but are rather negative for the practical application to quadrature. Computation of integrals is in many cases impractical using deterministic methods, and one often has to resort to Monte Carlo methods, i.e. approximating the unknown $\mu(f)$ by $\hat\mu_n(f)$. When one has to compute the integrals of a large number of functions $(f_i)_{1\le i\le N}$ with respect to a fixed measure $\mu$, one would rather draw the random quadrature points $X_1,\dots,X_n$ once and for all, and use them for all functions $f_i$; while the usual Monte Carlo bound ensures each individual estimate $\hat\mu_n(f_i)$ has a small probability of being far from $\mu(f_i)$, if $N$ is large compared to $n$ these bounds will not ensure that all estimates are good with high probability. On the contrary, convergence in $W_1$ (or in duality with some other class $\mathcal F$) ensures good estimates simultaneously for all $f_i$, as long as they belong to the given class, independently of $N$. This makes such convergence potentially useful; but the rate given above, $n^{-1/d}$, is hopelessly slow in high dimension, which is precisely the setting where Monte Carlo methods are most needed. We shall prove that if the functions of interest are regular, then this "curse of dimensionality" can be overcome. We shall be interested in duality with $\mathcal C^s_1$, the set of functions with $\mathcal C^s$ norm at most 1 (precise definitions are given below; when $s=1$ this is the set of 1-Lipschitz functions); but other spaces could be considered, e.g. Sobolev or Besov spaces.
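The point-reuse strategy described above can be sketched numerically. This is a minimal illustration (my example, not from the paper): one set of i.i.d. uniform points on $[0,1]^d$ serves every observable in a small family, and all estimates are simultaneously accurate at the Monte Carlo scale $1/\sqrt n$. The observables $f_w(x)=\cos(2\pi w\cdot x)$ are illustrative choices; each has integral 0 over $[0,1]^d$ for a nonzero integer frequency vector $w$.

```python
import numpy as np

# One sample set, drawn once, reused for every observable.
rng = np.random.default_rng(0)
d, n = 3, 100_000
pts = rng.random((n, d))  # the quadrature points X_1, ..., X_n

freqs = [(1, 0, 0), (1, 2, 0), (2, 1, 3), (3, 3, 1)]
errors = []
for w in freqs:
    estimate = np.cos(2 * np.pi * pts @ np.array(w, dtype=float)).mean()
    errors.append(abs(estimate))  # true integral is 0, so this is the error

# All estimates are simultaneously accurate at the Monte Carlo scale 1/sqrt(n).
print(max(errors))
```

With many more observables than samples, the interesting guarantee is exactly the uniform one discussed in the text: a bound on the dual norm controls every estimate at once.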
Another issue is that in many cases, drawing independent samples $(X_k)_{k\ge 0}$ of law $\mu$ is not feasible, and one is led to instead rely on a Markov chain having $\mu$ as its stationary measure; this is the Markov Chain Monte Carlo (MCMC) method. While the empirical measures of Markov chains have been considered by Fournier and Guillin [FG15], these authors need quite strong assumptions: a spectral gap in the $L^2$ space (or similarly large spaces), and a "warm start" hypothesis ($X_0$ should have a law absolutely continuous with respect to $\mu$). In good cases, one can achieve this by a burn-in period (start with an arbitrary $X_0$, and consider $(X_{k_0+k})_{k\ge 0}$ for some large $k_0$); but in some cases, each $X_k$ has a law singular with respect to $\mu$ (for example the natural random walk generated by an Iterated Function System). We shall consider Markov chains satisfying a certain geometric contraction property, but again the method can certainly be adapted to other assumptions.
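The singular-marginals phenomenon can be seen on a toy Iterated Function System (an illustrative example of mine, not taken from the paper): the IFS $\{x\mapsto x/3,\ x\mapsto x/3+2/3\}$ on $[0,1]$ generates a $W_1$-exponentially contracting chain ($\theta=1/3$) whose stationary law is the Cantor measure. Started from an atom, every marginal is atomic, hence singular with respect to the stationary measure, so no warm start is available; yet the empirical measure still converges.

```python
import numpy as np

# Random walk generated by the IFS {x -> x/3, x -> x/3 + 2/3} on [0,1].
# Each map contracts distances by 1/3, so the chain is exponentially
# contracting in W_1 with theta = 1/3; its stationary law is the Cantor
# measure, whose mean is 1/2. Starting from the atom X_0 = 0, every X_k
# has an atomic law, singular with respect to the stationary measure.
rng = np.random.default_rng(1)
T = 200_000
x = 0.0
total = 0.0
for _ in range(T):
    if rng.random() < 0.5:
        x = x / 3.0
    else:
        x = x / 3.0 + 2.0 / 3.0
    total += x
empirical_mean = total / T
print(empirical_mean)  # close to the stationary mean 1/2
```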

Markov chains
Our main result handles Markov chains with arbitrary starting distribution and with a spectral gap in the Lipschitz space (e.g. positively curved chains in the sense of Ollivier [Oll09]).
Theorem A. Assume that $(X_k)_{k\ge 0}$ is a Markov chain defined on a bounded domain $\Omega$ of $\mathbb R^d$, whose iterated transition kernel $(m^k_x)_{x\in\Omega,\,k\in\mathbb N}$, defined by letting $m^k_x$ be the law of $X_k$ conditionally on $X_0=x$, is exponentially contracting in the Wasserstein metric $W_1$, i.e. there are constants $C\ge 1$ and $\theta\in(0,1)$ such that
$$W_1(m^k_x, m^k_y) \le C\theta^k \|x-y\|.$$
Denote by $\mu$ the (unique) stationary measure of the transition kernel.
Then for some constant $C' = C'(\Omega, s, C, \theta)$ and all large enough $T$, letting $n = T(1-\theta)$, we have a bound on $\mathbb E\|\hat\mu_T-\mu\|_{\mathcal C^s_1}$ of the same form as in the i.i.d. case, up to powers of $\log n$. Let us stress several strengths of this result:
• for $s$ large enough, we almost obtain the optimal convergence rate $\asymp 1/\sqrt n$: the bounds are only a power-of-logarithm factor away from the optimal bounds for i.i.d. random variables,
• we assume neither reversibility, stationarity, nor a warm start hypothesis (the distribution of $X_0$ can be arbitrary),
• the rate of convergence does not depend on the specific features of the Markov chain, only on $C$ and $\theta$.
Note that for fixed $\theta$, $n$ has the same order as $T$; but when $\theta$ is close to 1, $1/(1-\theta)$ is the typical time scale for the decay of correlations. One thus cannot expect fewer than $n/(1-\theta)$ Markov samples to achieve the bound obtained for $n$ independent samples. Examples of Markov chains which are exponentially contracting in $W_1$ (equivalently, which have a spectral gap in the space of Lipschitz observables) are numerous; this is a slightly more general condition than "positive curvature" in the sense of Ollivier [Oll09], see e.g. [JO10] and [Klo17b] for concrete examples, or, in the context of dynamical systems, [KLS15] and [Klo17a].
Under the assumptions of Theorem A, it is well known that for each fixed $f\in\mathcal F$ the estimate $\mathbb E\bigl|\hat\mu_T(f)-\mu(f)\bigr|\to 0$ holds, here with $\mathcal F=\mathrm{Lip}_1$ (or any smaller class), with a Gaussian rate. The problem of convergence in duality with the class $\mathcal F$ is thus to invert the supremum and the probability (or expectation), i.e. to bound from above
$$\mathbb E\Bigl[\sup_{f\in\mathcal F} \bigl|\hat\mu_T(f)-\mu(f)\bigr|\Bigr].$$
We shall disregard the potential issue of non-measurability: as we shall only deal with classes $\mathcal F$ having a countable subset which is dense in the uniform norm, we can always replace the supremum with a supremum over a countable set of functions. The idea of the proof of Theorem A is to take an arbitrary $f\in\mathcal C^s_1(\Omega)$ and decompose it using Fourier series. The regularity hypothesis gives us control both on the uniform approximation by a truncated Fourier series, and on the Fourier coefficients. Combining these controls, we bound from above $|\hat\mu_T(f)-\mu(f)|$ by a quantity that does not depend on $f$ at all, but depends on the Fourier basis elements $(e_k)_{k\in\mathbb Z^d}$ up to some index size. Taking a supremum and an expectation, this leaves us with the simple task of optimizing where to truncate the Fourier series.
This decomposition method can in principle be used under various assumptions on the process $(X_k)_{k\ge 0}$, the point being to identify a decomposition suited to the assumption; in particular, one can easily adapt the method to study geometrically ergodic Markov chains. I chose to present Theorem A in part because its hypothesis is relevant to several Markov chains I am interested in, and in part because it presents specific difficulties: a blunt computation leads to non-optimal powers of $n$. To obtain good rates, we translate the contraction hypothesis so as to frame part of the argument in the space $\mathrm{Hol}_\alpha$ of Hölder functions, where the Fourier basis elements have smaller norm; and instead of bounding the Fourier coefficients of a Lipschitz function directly, we use Parseval's formula and the embedding $\mathcal C^s\hookrightarrow H^s$, which turns out to give a better estimate. Another functional decomposition, or another path in the computations, might improve the power in the logarithmic factor.
We restrict to the compact case, but the method can in principle be adapted, or a truncation argument be used, to deal with non-compactly supported measures.
In order to introduce the decomposition method and show its flexibility, we shall state two simpler results below.

Explicit bounds in the i.i.d. case, for the Wasserstein metric
The decomposition method enables one to get a very explicit version of (1) with a few computations but very little sophistication.

Theorem B.
If $\mu$ is any probability measure on $[0,1]^d$ and $(X_k)_{k\ge 0}$ are i.i.d. random variables with law $\mu$, then for all $n\in\mathbb N$ we have
$$\mathbb E\bigl[W_1(\hat\mu_n,\mu)\bigr] \le \begin{cases} \dfrac{1}{2(\sqrt2-1)}\, n^{-1/2} & \text{if } d=1,\\ C_2\,(\log n)\, n^{-1/2} & \text{if } d=2,\\ K_d\, n^{-1/d} & \text{if } d\ge 3,\end{cases}$$
where $C_2$ is explicit, $K_3\le 6.3$, and $K_d\le 3\sqrt d$ for all $d\ge 4$. The order of magnitude of these bounds is sharp in many regimes:
• in dimension 1, the order of magnitude $1/\sqrt n$ is optimal; however the constant $1/(2(\sqrt2-1))$ is not asymptotically optimal when $\mu$ is the Lebesgue measure,
• when $d=2$ and $\mu$ is the Lebesgue measure, as previously mentioned the correct order is $\sqrt{\log n / n}$, but to the best of my knowledge it is an open question whether this better order holds for arbitrary measures (a positive answer is strongly expected). See Section 2.4 for an example showing that in a more general setting the order $\log n/\sqrt n$ cannot be improved,
• when $d\ge 3$, both orders of magnitude $n^{-1/d}$ as $n\to\infty$ and $\sqrt d$ as $d\to\infty$ are sharp up to multiplicative constants (see Remark 2.2). The asymptotic constant 2 is certainly quite a bit larger than the asymptotic constant $\lim_{n\to\infty} n^{1/d}\,\mathbb E[W_1(\hat\mu_n,\lambda)]$, which has been computed for the related, but slightly different, matching problem by Talagrand [Tal92]; but our bound holds for all $n$ and all $d$ (and also all $\mu$). An even more general bound has been given by Boissard and Le Gouic [BLG14], but their constant is larger by a factor of approximately 10.
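The one-dimensional rate $1/\sqrt n$ is easy to check numerically. The following sketch (a numerical illustration of mine, not part of the proof) uses the identity, valid in $d=1$, that $W_1$ between the empirical measure of a uniform sample and Lebesgue measure equals $\int_0^1 |F_n(t)-t|\,\mathrm dt$, approximated on a fine grid.

```python
import numpy as np

rng = np.random.default_rng(2)

def w1_to_uniform(sample, grid=20_000):
    # W_1(empirical, Lebesgue) = integral of |F_n(t) - t| in dimension 1.
    t = (np.arange(grid) + 0.5) / grid
    f_n = np.searchsorted(np.sort(sample), t) / len(sample)  # empirical CDF
    return float(np.mean(np.abs(f_n - t)))

ns = [1_000, 4_000, 16_000]
means = [np.mean([w1_to_uniform(rng.random(n)) for _ in range(20)]) for n in ns]
ratios = [means[i] / means[i + 1] for i in range(2)]
print(means, ratios)  # quadrupling n roughly halves E[W_1]: the 1/sqrt(n) rate
```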
Let us stress that the main purpose of this result is to expose our method in an elementary setting: indeed many similar bounds are already available in this case. For example, more general non-asymptotic results have been obtained by Fournier and Guillin [FG15], building on previous work by Dereich, Scheutzow and Schottstedt [DSS13]. They are more general in that they consider the $p$-Wasserstein metric for any $p>0$ (while we will only be able to consider $p\le 1$), and apply to non-compactly supported measures $\mu$ under moment assumptions. However their constants, though non-asymptotic, have not been made explicit, and their behavior as the dimension grows has not been studied.

Regular observables and independent samples
In the i.i.d. case, we can improve on Theorem A by removing most of the logarithmic factors.
Theorem C. If $\mu$ is any probability measure on $[0,1]^d$ and $(X_k)_{k\ge 0}$ are i.i.d. random variables with law $\mu$, then for all $s\ge 1$, for some constant $C=C(s,d)>0$ (not depending upon $\mu$), and all integers $n\ge 2$ we have
$$\mathbb E\|\hat\mu_n-\mu\|_{\mathcal C^s_1} \le C \begin{cases} n^{-1/2} & \text{if } s>d/2,\\ n^{-1/2}\log n & \text{if } s=d/2,\\ n^{-s/d} & \text{if } s<d/2.\end{cases}$$
It is possible to prove this result with previous, more classical methods. Indeed, combining the "entropy bound" for the class $\mathcal C^s_1$ [VdVW96, Thm 2.7.1] and the "chaining method" (see e.g. [vH96, Ex 5.11, p. 138]) leads to Theorem C; I am indebted to Jonathan Weed for pointing this out to me. The proof by the decomposition method we provide here is very simple, but non-elementary, as it relies on a wavelet decomposition. It is well known that every function in $\mathcal C^s_1$ can be written as a linear combination of a few elements of a wavelet basis, with small coefficients, up to a small error. Controlling $|\hat\mu_n(f)-\mu(f)|$ for all $f\in\mathcal C^s_1$ simultaneously then reduces to controlling this quantity for the few needed elements of the wavelet basis.

Concentration inequalities
Up to now, we have restricted to estimates in expectation, while in many practical situations one needs concentration estimates. This is in fact not a restriction, as we shall explain briefly in Section 5: the classical bounded difference method enables one to get concentration near the expectation. In particular, we get the following.

Corollary D.
Under the assumptions of Theorem A, for some $c$ depending on $C$, $\theta$ and $\mathrm{diam}\,\Omega$, for all large enough $T$ and all $t\ge t_0 = t_0(\Omega, s, C, \theta)$ we have:
$$\mathbb P\Bigl(\|\hat\mu_T-\mu\|_{\mathcal C^s_1} \ge \mathbb E\|\hat\mu_T-\mu\|_{\mathcal C^s_1} + t\Bigr) \le \exp\bigl(-cTt^2\bigr).$$
(The last inequality is not optimal, as we relaxed the poly-logarithmic factor for simplicity.) For example, when $s\ge d/2$ we deduce that $\|\hat\mu_T-\mu\|_{\mathcal C^s_1}$ is, with high probability, of order at most $T^{-1/2}$ up to logarithmic factors.

Structure of the paper. Sections 2, 3 and 4 are independent and contain the proofs of the main theorems (B, C and A respectively: we start with the most elementary proof, follow with the simplest one, and end with the most sophisticated). Section 5, dealing with concentration estimates, is mostly independent of the previous ones, which are only used to deduce Corollary D.
We shall write $a\lesssim b$ for $a\le Cb$, the dependency of the constant $C$ being left implicit unless it feels necessary; the constants denoted by $C$ will be allowed to change from line to line.

Wasserstein convergence and dyadic decomposition
The goal of this section is to prove (a refinement of) Theorem B. We consider a sequence $(X_k)_{1\le k\le n}$ of independent, identically distributed random points whose common law shall be denoted by $\mu$; we assume that $\mu$ is supported on the cube $[0,1]^d$ and consider the convergence of the empirical measure
$$\hat\mu_n := \frac1n\sum_{k=1}^n \delta_{X_k}$$
in the metrics $W_\alpha = \|\cdot\|_{\mathrm{Hol}^\alpha_1}$, where $\mathrm{Hol}^\alpha_1$ is the set of functions $f:[0,1]^d\to\mathbb R$ with $\alpha$-Hölder constant at most 1. While we are mostly interested in the Euclidean norm $\|\cdot\|$, our method is sharper in the case of the supremum norm $\|\cdot\|_\infty$, with respect to which the analogues of the aforementioned objects are denoted by $W_{\alpha,\infty}$ and $\mathrm{Hol}^{\alpha,\infty}_1$. We will work with $\|\cdot\|_\infty$, and then deduce directly the corresponding result for the Euclidean norm by using that $\|\cdot\|_\infty \le \|\cdot\| \le \sqrt d\,\|\cdot\|_\infty$. Our most precise result is the following.
We deduce several more compact formulas below, including Theorem B. Observe that for fixed $d$ and large $n$, the complicated front constant converges to 2.

Remark 2.2. It is not difficult to see that for $\mu$ the Lebesgue measure and an optimal, deterministic approximation $\tilde\mu$ with $n=m^d$ Dirac masses, one can compute the approximation error explicitly, so that in high dimension, for the $\ell^\infty$ norm and in the worst case $\alpha=1$, our estimate is off by a factor of approximately 4 compared to a best approximation.
With the Euclidean norm, an easy lower bound in the case of the Lebesgue measure is obtained by observing that a mass of at most $n\lambda(B(0,r))$ lies at distance $r$ or less from one of the $n$ points (be they random or not). This leads, for any measure $\tilde\mu$ supported on $n$ points, to a lower bound on $W_1(\tilde\mu,\lambda)$ of order $\sqrt d\, n^{-1/d}$, and again our order of magnitude $K_d\asymp\sqrt d$ is the correct one.
The results of [Tal92] show that, at least for the bipartite matching problem, these seemingly crude lower bounds are in fact attained asymptotically, taking renormalized limits as $n\to\infty$ and then $d\to\infty$. This indicates that our constants are not optimal, and it would be interesting to have a non-asymptotic bound with optimal asymptotic behavior.

Decomposition of Hölder functions
The method to prove Theorem 2.1 consists in a multiscale decomposition of the functions $f\in\mathrm{Hol}^{\alpha,\infty}_1$. In spirit, it seems quite close to arguments of [BLG14], [DSS13] and [FG15]; our interest is mostly in setting this multiscale analysis in a functional decomposition framework.

Wasserstein distance estimation
With the notation of Lemma 2.3, for any $f\in\mathrm{Hol}^\alpha_1$ we can bound $|\hat\mu_n(f)-\mu(f)|$ by a sum over dyadic cells in which the right-hand side does not depend on $f$ in any way. We can thus take a supremum and an expectation to obtain a bound on $\mathbb E\|\hat\mu_n-\mu\|_{\mathrm{Hol}^\alpha_1}$.

Remark 2.4. This is the core of the decomposition method. Observe that we have used no hypothesis on the $(X_k)$ yet; the method can be applied to any stochastic process for which one can control $\mathbb E[|\hat\mu_n(g)-\mu(g)|]$ for the elements $g$ of the decomposition.
Setting $p_Q = \mu(Q)$ for a dyadic cube $Q$, the random variable $n\hat\mu_n(Q)$ is binomial with parameters $n$ and $p_Q$. A standard estimate of the mean absolute deviation yields
$$\mathbb E\bigl[|\hat\mu_n(Q)-\mu(Q)|\bigr] \le \sqrt{\frac{p_Q(1-p_Q)}{n}}.$$
By concavity of the square-root function, summing over the cells of a given depth yields a bound in terms of their number, leaving us with the simple task of optimizing the choice of the maximal depth $J$.
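The mean-absolute-deviation step can be verified empirically; a quick sketch (my check, with the bound $\mathbb E|B/n-p|\le\sqrt{p(1-p)/n}$ following from Cauchy-Schwarz applied to the centered variable $B/n-p$):

```python
import numpy as np

# Empirical check of the mean absolute deviation bound used above:
# if B ~ Binomial(n, p) then E|B/n - p| <= sqrt(p(1-p)/n),
# since E|Y| <= sqrt(Var Y) for the centered Y = B/n - p.
rng = np.random.default_rng(3)
n = 10_000
results = []
for p in [0.01, 0.1, 0.5]:
    deviations = np.abs(rng.binomial(n, p, size=50_000) / n - p)
    mad = float(deviations.mean())
    bound = float(np.sqrt(p * (1 - p) / n))
    results.append((mad, bound))
    print(p, mad, bound)
```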

Optimization of the depth parameter
We shall distinguish three cases: $d<2\alpha$, $d=2\alpha$ and $d>2\alpha$. The first case is only possible for $d=1$, but we phrase it this way because for some measures $\mu$ the dimension $d$ of the ambient space can be replaced by the "dimension" of the measure itself; see Section 2.4 for an example.

Small dimension
If $d<2\alpha$, then the sum in (11) is bounded independently of $J$ and we can let $J\to\infty$ to obtain a bound of order $n^{-1/2}$; in particular, for $d=1$, $\alpha=1$, this gives the first case of Theorem B.

Remark 2.5. For $2\alpha-d$ close to 0, the constant in (12) goes to infinity; in this regime, for moderate $n$, letting $J\to\infty$ is sub-optimal and one should optimize $J$ in (11) as we shall do in the next cases.

Critical dimension
If $d=2\alpha$ (or in fact $d\le 2\alpha$) we can rewrite (11) with each level contributing comparably. To optimize $J$, we formally differentiate the right-hand side with respect to $J$, equate to zero and solve for $J$. Recalling that $J$ is an integer, and keeping only the leading term (as $n\to\infty$) to simplify, this leads us to choose $2^{J\alpha}$ of the order of $\sqrt n$. We deduce the claimed bound, immediately implying the bound of Theorem B for $d=2$ and $\alpha=1$ (where a $\sqrt2$ comes from the comparison between the supremum and Euclidean norms).

Large dimension
If $d>2\alpha$, equation (11) is dominated by its deepest levels. Following the same optimization process as in the critical-dimension case, we choose $J$ so as to balance the approximation term against the sampling term. We have notably $K'_4=3$. Relaxing our bound for $d\ge 4$ to a simpler expression, it is more easily seen to be decreasing in $d$ (and it still takes the value 3 at $d=4$). We also see that we can take $K'_d\to 2$ as $d\to\infty$. The last part of Theorem B follows with $K_d = \sqrt d\, K'_d$, and a numerical computation shows $K_3\le 6.3$.

The four-corners Cantor measure
We conclude this section with an example showing that the critical-case order $\log n/\sqrt n$ is sharp if one generalizes its scope.
The four-corner Cantor set $K$ is the compact subset of the plane defined as the attractor of the Iterated Function System $(T_1,T_2,T_3,T_4)$, where the $T_i$ are homotheties of ratio 1/4 centered at $(0,0)$, $(0,1)$, $(1,1)$ and $(1,0)$ (see Figure 1). It has a natural measure $\mu_K$, which can be defined as the fixed point of the map $\nu\mapsto\frac14\sum_i (T_i)_*\nu$ (this map is contracting in the complete metric $W_1$, so it has a unique fixed point). The measure $\mu_K$ can also be described as follows: in the 4-adic decomposition of the square, at depth $j>0$ there are $16^j$ squares, among which $4^j$ intersect $K$ in their interior; $\mu_K$ gives each of these squares a mass $1/4^j$.
Figure 1: Second stage of the construction of the four-corners Cantor set (contained in the filled black area).
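Samples of $\mu_K$ are easy to produce by iterating the IFS; the sketch below (an illustration of mine, not used in the proof) checks that each depth-one corner square carries mass close to 1/4, with fluctuations of order $1/\sqrt n$, exactly the quantity tracked in the lower-bound argument of Proposition 2.6.

```python
import numpy as np

# Sample mu_K: iterate the four ratio-1/4 corner homotheties, choosing one
# uniformly at each step; the contraction forgets the arbitrary starting
# points exponentially fast (error 4^{-depth} after `depth` iterations).
rng = np.random.default_rng(4)
corners = np.array([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)])
n, depth = 50_000, 30
pts = rng.random((n, 2))
for _ in range(depth):
    c = corners[rng.integers(0, 4, size=n)]
    pts = c + (pts - c) / 4.0  # homothety of ratio 1/4 centered at c

# Mass of each depth-one corner square (side 1/4): close to 1/4 each.
props = []
for cx, cy in corners:
    in_square = (np.abs(pts[:, 0] - cx) <= 0.25) & (np.abs(pts[:, 1] - cy) <= 0.25)
    props.append(float(in_square.mean()))
print(props)
```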
The set $K$ has Hausdorff dimension 1 (and positive, finite 1-dimensional Hausdorff measure), and one should expect $\mu_K$ to have dimension 1 in any reasonable sense of the term. It is thus interesting to have a look at $W_\alpha(\hat\mu_n,\mu_K)$ in the critical case $\alpha=1/2$.

Proposition 2.6. If $(X_k)_{k\ge 0}$ are i.i.d. of law $\mu_K$, then $\mathbb E\bigl[W_{1/2}(\hat\mu_n,\mu_K)\bigr]\asymp \log n/\sqrt n$.
Proof. The proof of the upper bound follows the proof of Theorem 2.1, using a 4-adic decomposition and discarding all squares that do not intersect $K$ in their interior. This replaces $d$ by 1, as there are $4^j$ relevant squares of side $4^{-j}$ (indeed the only place where $d$ is used is in (10), only through the number of dyadic squares to be considered), so that with $\alpha=1/2$ we end up in the critical case.
To prove the lower bound, we first record the proportions $p_1,p_2,p_3,p_4$ of the random points $X_k$ lying in each of the four relevant depth-one squares (of side length 1/4). For large $n$, each $p_i$ is close to 1/4 with typical fluctuations of order $1/\sqrt n$. The discrepancy of mass in each of these squares, compared to the mass 1/4 given to each of them by $\mu_K$, induces a cost of order at least $1/\sqrt{2n}$, since the distance between depth-one squares is at least 1/2 and $\alpha=1/2$. The same reasoning applies at depth two inside each depth-one square, but with $n_i\simeq n/4$ points, so fluctuations are of order $1/\sqrt{n/4} = 2/\sqrt n$, inducing a total cost again of order $1/\sqrt{2n}$ (distances are now $1/4\times 1/2$, and a square root is taken since $\alpha=1/2$). The fact that the number of points is $n_i$ rather than precisely $n/4$ is not an issue, an uneven distribution improving the bound.
At each depth $j$ up to $\log_4 n$, there is a typical induced cost of order $1/\sqrt n$ from the uneven distribution of points among the 4 subsquares of each depth-$j$ square, yielding the desired bound of order $\log n/\sqrt n$.

Wavelet decomposition
Let us give a short account of the results about wavelets we will use (see e.g. Meyer's book [Mey92] for proofs and references).
It will be convenient to use wavelets of compact support with arbitrary regularity $\mathcal C^r$, whose construction is due to Daubechies [Dau88]. The construction yields compactly supported functions $\varphi, \psi^\varepsilon : \mathbb R^d\to\mathbb R$, where $\varepsilon$ takes any of $2^d-1$ values ($\varepsilon\in E := \{0,1\}^d\setminus\{(0,0,\dots,0)\}$); of their particular properties, only those we will use will be described.
One defines from these "father and mother" wavelets a larger family of wavelets by
$$\varphi_k(x) := \varphi(x-k), \qquad \psi^\varepsilon_{j,k}(x) := 2^{\frac{jd}2}\,\psi^\varepsilon(2^j x - k);$$
one important property of the construction is that the union of $(\varphi_k)_{k\in\mathbb Z^d}$ and $(\psi^\varepsilon_{j,k})_{j,k,\varepsilon}$ forms an orthonormal basis of $L^2(\mathbb R^d)$. For $f\in L^2(\mathbb R^d)$ we can thus write
$$f = \sum_{k\in\mathbb Z^d} a(k)\,\varphi_k + \sum_{j\ge 0}\sum_{\lambda\in\Lambda_j} b(\lambda)\,\psi_\lambda,$$
where $\Lambda_j = \{j\}\times\mathbb Z^d\times E$ and $\langle\cdot,\cdot\rangle$ denotes the $L^2$ scalar product (with respect to the Lebesgue measure).
One striking property is that many functional spaces can be characterized in terms of the wavelet coefficients $a(k) = \langle f,\varphi_k\rangle$ and $b(\lambda) = \langle f,\psi_\lambda\rangle$. We shall only use upper bounds on the $a(k)$ and $b(\lambda)$ in a specific case.
The Hölder space $\mathcal C^s$ is defined as the space of $k$ times continuously differentiable functions with $\alpha$-Hölder partial derivatives of order $k$, with $k$ a non-negative integer, $\alpha\in(0,1]$ and $k+\alpha=s$ (e.g. $\mathcal C^1$ is the space of Lipschitz functions, $\mathcal C^{3/2}$ the space of once continuously differentiable functions with 1/2-Hölder first-order partial derivatives, $\mathcal C^5$ the space of four-times continuously differentiable functions with Lipschitz fourth-order partial derivatives, etc.). Note that "1-Hölder", meaning "Lipschitz", could be slightly enlarged to "Zygmund" (and should be, if one is interested in two-sided bounds), but we need not enter this subtlety here.
The space $\mathcal C^s$ is endowed with the norm
$$\|f\|_{\mathcal C^s} := \max_{0\le |m|\le k} \|\partial^m f\|_\star,$$
where the decomposition $s=k+\alpha$ is defined as above and $\|\partial^m f\|_\star$ is the uniform norm if $|m|<k$ and the $\alpha$-Hölder constant if $|m|=k$. We denote by $\mathcal C^s_1$ the set of functions with $\mathcal C^s$ norm at most 1.
If the regularity of the wavelets is larger than the regularity of the considered Hölder space ($r>s$), then
$$|a(k)| \le C_{s,d}\,\|f\|_{\mathcal C^s}, \qquad |b(j,k,\varepsilon)| \le C_{s,d}\,\|f\|_{\mathcal C^s}\, 2^{-j\left(s+\frac d2\right)},$$
where the constant $C_{s,d}$ depends implicitly on the choice of father and mother wavelets $\varphi$ and $\psi^\varepsilon$; but we can fix for each $s$ such a choice with suitable regularity, e.g. $r=\lfloor s\rfloor+1$, and the constants then truly depend only on $s$ and $d$. The $\mathcal C^s$ norm in the $b(\lambda)$ coefficient bound could be relaxed to the "regularity part" of the norm, but we do not use this.
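This coefficient decay is easy to observe numerically. The sketch below (my illustration) uses the Haar wavelet in $d=1$ as a stand-in for the Daubechies wavelets of the text, and a 1-Lipschitz function ($s=1$): the coefficients at depth $j$ should stay below $2^{-j(s+d/2)}=2^{-3j/2}$.

```python
import numpy as np

# Haar wavelet coefficients of the 1-Lipschitz function f(x) = |x - 0.4|:
# the theory predicts |<f, psi_{j,k}>| <= C 2^{-j(s + d/2)} with s = d = 1.
N = 2 ** 14
x = (np.arange(N) + 0.5) / N
f = np.abs(x - 0.4)  # Lipschitz constant 1, kink at 0.4

tops = []
for j in range(2, 9):
    cells = f.reshape(2 ** j, N // 2 ** j)
    half = cells.shape[1] // 2
    # <f, psi_{j,k}> = 2^{j/2} (integral over first half - over second half)
    coefs = 2 ** (j / 2) * (cells[:, :half].sum(axis=1) - cells[:, half:].sum(axis=1)) / N
    tops.append(float(np.abs(coefs).max()))
    print(j, tops[-1], 2.0 ** (-1.5 * j))  # observed max vs 2^{-3j/2}
```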
Note that the explicit computation of these constants would in particular require a very fine analysis of the chosen wavelet construction, and I do not know whether such a task has been carried out.

Decomposition of regular functions
Let us now use the wavelet decomposition to prove good convergence properties of the empirical measure against smooth enough test functions; the strategy is similar to the one used in Section 2. We assume here that $(X_k)_{k\ge 0}$ is a sequence of i.i.d. random variables whose law $\mu$ is supported on a bounded set $\Omega\subset\mathbb R^d$ (e.g. $\Omega=[0,1]^d$); note that $\mathcal C^s_1 = \mathcal C^s_1(\mathbb R^d)$ makes no reference to $\Omega$. We consider a fixed family of wavelets of regularity $r>s$ as in Section 3.1; all constants $C$ below implicitly depend on $s$, $d$ and $\Omega$ (the latter only through its diameter).
Since the wavelets have compact support, there exists some constant $D$ such that for each $j$:
• for each point $x\in\Omega$, there are at most $D$ different $\lambda\in\Lambda_j$ corresponding to a $\psi_\lambda$ that does not vanish at $x$; the set of those $\lambda$ is denoted by $\Lambda_j(x)\subset\Lambda_j$,
• the union $\Lambda_j(\Omega) := \bigcup_{x\in\Omega}\Lambda_j(x)$ has at most $D\,2^{jd}$ elements.
We denote by $N$ the set of parameters $k\in\mathbb Z^d$ corresponding to a $\varphi_k$ whose support intersects $\Omega$ (observe that $N$ is finite).
We fix a function $f\in\mathcal C^s_1$ and decompose it in our wavelet basis. Cutting the second term of the decomposition at some depth $J$, and using the bound on the $b(\lambda)$ coefficients together with the formula (16) for $\psi_\lambda$, we bound $|\hat\mu_n(f)-\mu(f)|$ from above by a quantity whose right-hand side does not depend on $f$. Taking a supremum and an expectation, we obtain (17), and to conclude we simply need to estimate the last two terms there.

Convergence for basis elements
Lemma 3.1. We have
$$\sum_{k\in N}\mathbb E\bigl[|\hat\mu_n(\varphi_k)-\mu(\varphi_k)|\bigr] \lesssim \frac1{\sqrt n} \qquad\text{and}\qquad \sum_{\lambda\in\Lambda_j(\Omega)}\mathbb E\bigl[|\hat\mu_n(\psi_\lambda)-\mu(\psi_\lambda)|\bigr] \lesssim \frac{2^{jd}}{\sqrt n}.$$
For each $k\in N$, the random variable $\hat\mu_n(\varphi_k)$ is the average of $n$ independent, identically distributed, bounded random variables of expectation $\mu(\varphi_k)$, so that $\mathbb E|\hat\mu_n(\varphi_k)-\mu(\varphi_k)|\lesssim 1/\sqrt n$. Since $N$ is finite, the first claim is proved. To prove the second claim, we cannot argue in exactly the same way because $\psi_\lambda$ depends on $j$. To ease notation we introduce $\tilde\psi_\lambda := 2^{-\frac{jd}2}\psi_\lambda$ and $Y_\lambda := \hat\mu_n(\tilde\psi_\lambda)-\mu(\tilde\psi_\lambda)$, and recall that $\tilde\psi_\lambda$ is bounded independently of $j$. Also, a bounded number of different $\tilde\psi_\lambda$ ($\lambda\in\Lambda_j$) are non-zero at any point $x\in\Omega$; we denote by $p_\lambda$ the mass given by $\mu$ to the support of $\psi_\lambda$ and observe that $Y_\lambda$ is the average of $n$ i.i.d. centered random variables of variance less than $p_\lambda+\mu(\tilde\psi_\lambda)^2$. We thus have $\mathbb E|Y_\lambda|\le\sqrt{\mathrm{Var}(Y_\lambda)}\lesssim\sqrt{p_\lambda/n}$; summing over the at most $D2^{jd}$ elements of $\Lambda_j(\Omega)$, the Cauchy-Schwarz inequality and $\sum_\lambda p_\lambda\le D$ yield $\sum_\lambda\mathbb E|Y_\lambda|\lesssim 2^{jd/2}/\sqrt n$, and multiplying by $2^{jd/2}$ to return from $\tilde\psi_\lambda$ to $\psi_\lambda$ gives the second claim.

Remark 3.2. Lemma 3.1 is the only place where we use that the $(X_k)_{k\in\mathbb N}$ are i.i.d. The method can therefore be applied to any stochastic process satisfying the conclusion of Lemma 3.1.

Conclusion of the proof
Plugging Lemma 3.1 into (17), we get the same trichotomy as before. If $s>d/2$, then we can let $J\to\infty$ to obtain the rate $n^{-1/2}$; and if $s<d/2$ we can choose $J$ such that $2^J\simeq n^{1/d}$, yielding the rate $n^{-s/d}$ (with an additional $\log n$ factor when $s=d/2$). This proves Theorem C.

Markov chains
In this section we assume $(X_k)_{k\ge 0}$ is a Markov chain on a bounded domain; since we will use Fourier series, it will make things simpler to embed this domain into a torus, so we assume $\Omega\subset\mathbb T^d=\mathbb R^d/\mathbb Z^d$ (we do not lose generality in doing so, as scaling $\Omega$ down makes it possible to make the embedding isometric). We still denote by $\|x-y\|$ the distance between two points induced by the Euclidean norm.
Our main assumption is that the iterated transition kernel of $(X_k)_{k\ge 0}$, defined by letting $m^k_x$ be the law of $X_k$ conditionally on $X_0=x$, is exponentially contracting: there are constants $C\ge 1$ and $\theta\in(0,1)$ such that
$$W_1(m^k_x, m^k_y) \le C\theta^k\|x-y\|. \qquad (18)$$
Let us denote by $\mathsf L$ the averaging operator, i.e.
$$\mathsf Lf(x) = \int f(y)\,\mathrm dm_x(y),$$
and by $\mathsf L^*$ its dual acting on probability measures, i.e. $\mathsf L^*\nu$ is the law of $X_{k+1}$ conditionally on $X_k$ having law $\nu$. The linearity of $\nu\mapsto\mathsf L^*\nu$ and the convexity of $W_1$ enable one to rewrite (18) as
$$W_1(\mathsf L^{*k}\nu_0, \mathsf L^{*k}\nu_1) \le C\theta^k\, W_1(\nu_0,\nu_1),$$
so that there is a unique stationary measure $\mu$, and the law of $X_k$ converges exponentially fast (in $W_1$) to $\mu$, whatever the law of $X_0$ is. We shall prove Theorem A, which we restate for convenience.
Theorem 4.1. For some constant $C'=C'(\Omega,s,C,\theta)$ and all large enough $T$, letting $n=T(1-\theta)$, we have the bound stated in Theorem A. Following the decomposition method, we shall find a suitable decomposition basis for any $f\in\mathcal C^s_1$, seeking a compromise between the precision of a truncated decomposition and the number of basis elements. Here using wavelets seems inefficient, as we do not have a precise enough analogue of Lemma 3.1, which uses independence to take advantage of the localization property of wavelets; without this, the number and size of the wavelet terms are overwhelming. We shall use Fourier series instead, as they are more easily controlled under our assumptions. For simplicity we consider complex-valued functions here, and denote the Fourier basis by $e_k(x) := e^{2i\pi k\cdot x}$, where $k\in\mathbb Z^d$ and the dot $\cdot$ denotes the canonical inner product.
The key is thus to control $|\hat\mu_T(e_k)-\mu(e_k)|$; our hypothesis may seem perfectly suited to this since $e_k$ is Lipschitz, but its Lipschitz constant grows too rapidly with $k$ for a direct approach to be efficient. We shall combine the following two observations (the first of which is fairly trivial, the second of which is folklore).

Lemma 4.2. For all $\alpha\in(0,1)$, we have the following control of $e_k$'s $\alpha$-Hölder constant:
$$\mathrm{Hol}_\alpha(e_k) \le 2^{1-\alpha}\,(2\pi\|k\|)^\alpha.$$

Lemma 4.3. For all $\alpha\in(0,1]$, denoting by $W_\alpha$ the $\alpha$-Wasserstein metric (i.e. the 1-Wasserstein metric associated with the modified distance $\|\cdot\|^\alpha$), we have
$$W_\alpha(m^k_x, m^k_y) \le C^\alpha\theta^{\alpha k}\,\|x-y\|^\alpha. \qquad (21)$$
As a consequence, for every $\alpha$-Hölder function $f:\Omega\to\mathbb C$ and all $\ell, k\in\mathbb N$, both $|\mathbb E[f(X_k)]-\mu(f)|$ and the correlations between $f(X_\ell)$ and $f(X_k)$ decay geometrically in the time gap, at rate $\theta^\alpha$, where the implied constants depend only on $\Omega$ and the constant $C$ in (18).
Proof. By linearity, we only have to check (21) when $\nu_0=\delta_x$ and $\nu_1=\delta_y$ for some $x,y\in\Omega$, and this case follows from the contraction hypothesis by concavity of $t\mapsto t^\alpha$ (Jensen's inequality applied to an optimal coupling). To prove convergence toward the average and the decay of correlations, we first use the contraction and the fact that $\mu$ is the stationary measure to get
$$\bigl|\mathsf L^k f(x)-\mu(f)\bigr| \le \mathrm{Hol}_\alpha(f)\, C^\alpha\theta^{\alpha k}\,(\mathrm{diam}\,\Omega)^\alpha.$$
Assume further $k\ge\ell$ and write $k=\ell+m$. Combining all the previous observations, we get the claimed correlation bound, and the conclusion follows.
We deduce the following from these two Lemmas.
Corollary 4.4. For all $\alpha\in(0,1]$, all $k$, and all $T\ge 1/(1-\theta^\alpha)$ it holds
$$\mathbb E\bigl[|\hat\mu_T(e_k)-\mu(e_k)|\bigr] \lesssim \frac{\|k\|^\alpha}{\sqrt{T(1-\theta^\alpha)}}.$$
Proof. We expand the second moment of $\hat\mu_T(e_k)-\mu(e_k)$ and apply the correlation bounds of Lemma 4.3.

Fix some threshold $K\ge 3$ and some exponent $\alpha\in(0,1]$, to be determined explicitly later on. Let $f:\mathbb T^d\to\mathbb R$ be in $\mathcal C^s_1$. From the multidimensional version of Jackson's theorem [Sch69], we know that there is a trigonometric polynomial $P_K(f)$, which is a linear combination of the $e_k$ for $|k|_\infty\le K$, such that
$$\|f-P_K(f)\|_\infty \lesssim K^{-s}.$$
We have no clear control on the coefficients of this optimal trigonometric polynomial, which need not be the Fourier coefficients of $f$. But it is also known that the Fourier series of $f$ is within a factor $\simeq (\log K)^d$ of the best approximation in uniform norm (see [Mas80] for an optimal constant), so that denoting by $S_K(f) := \sum_{|k|_\infty\le K}\hat f(k)\,e_k$ the $K$-truncation of the Fourier series of $f$, we get
$$\|f-S_K(f)\|_\infty \lesssim K^{-s}(\log K)^d.$$
We can assume $\hat f(0)=0$ by translating $f$, and what precedes yields the bound (23), whose right-hand side does not depend on $f$ in any way (note that $\|\cdot\|_{H^s}$ is the Sobolev norm, controlled by the $\mathcal C^s$ norm).
Remark 4.5. At line (22), one could be tempted to bound $|\hat f(k)|$ directly instead of using the Cauchy-Schwarz inequality, in order to make better use of our assumption on $f$. This would be effective if $|\hat\mu_T(e_k)-\mu(e_k)|$ were of the order of $1/T$, but it is actually of the order of $1/\sqrt T$, ultimately leading to a weaker bound than the one we aim for.
Taking a supremum and an expectation in (23) and using concavity, we obtain a bound involving only the $e_k$ for $|k|_\infty\le K$. Choose now $\alpha=1/\log K$, so that $\ell^{2\alpha}\lesssim 1$ for all $\ell\in\{1,\dots,K\}$. For $s<d/2$, we get a bound balancing a truncation term against a sampling term. Trying to balance the contributions of the two terms, we first see that taking $K\simeq n^{1/d}$ would optimize the power of $n$ in the final expression; refining to $K=(\log n)^\beta\, n^{1/d}$, developing and ignoring lower-order terms shows which power $\beta$ optimizes the final power of $\log n$, and we set $K$ accordingly. Any large enough $T$ (the bound depending on both $\theta$ and $s$) satisfies the requirement $T\ge 1/(1-\theta^\alpha)$, since the right-hand side is of the order of $\log K$. The claimed bound follows. Finally, for $s>d/2$ a similar (simpler) optimization gives the almost-optimal rate, ending the proof of Theorem A.
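The truncation step underlying this optimization can be illustrated numerically. A minimal sketch (my example, with $d=1$, $s=1$, outside the Markov setting): for the Lipschitz periodic function $f(x)=|x-1/2|$ on the circle, the sup-norm error of the $K$-truncated Fourier series decays like $K^{-s}$.

```python
import numpy as np

# Sup-norm error of the K-truncation of the Fourier series of the
# 1-Lipschitz periodic function f(x) = |x - 1/2| on the circle.
N = 2 ** 12
x = np.arange(N) / N
f = np.abs(x - 0.5)
fhat = np.fft.fft(f) / N
freqs = np.fft.fftfreq(N, d=1.0 / N)  # integer frequencies -N/2 .. N/2 - 1

errs = []
for K in [8, 16, 32, 64]:
    s_K = np.real(np.fft.ifft(np.where(np.abs(freqs) <= K, fhat, 0.0)) * N)
    errs.append(float(np.max(np.abs(f - s_K))))
print(errs)  # roughly halves each time K doubles: the K^{-1} decay
```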

Concentration near the expectation
Let us detail how classical bounded martingale difference methods can be used to prove that the empirical measure concentrates very strongly around its expectation. When the $(X_k)_{k\ge 0}$ are independent and identically distributed, this has long been known (see [Tal92], and also [WB17] for more general Wasserstein metrics $W_p$, $p\ge 1$). In the case of Markov chains, such arguments have been developed notably in [CR09] and, in a dynamical context, [CG12]. Our approach is very similar and thus cannot claim novelty, but we write it down to show how to handle functional spaces more general than just Lipschitz and Hölder.
The fundamental result to be used is the Azuma-Hoeffding inequality, which we recall.
Theorem (Azuma-Hoeffding inequality). Let $X$ be a random variable, let $(\mathcal F_k)_{0\le k\le n}$ be a filtration, and for each $k\in\{1,\dots,n\}$ set
$$\Delta_k := \mathbb E[X\mid\mathcal F_k] - \mathbb E[X\mid\mathcal F_{k-1}].$$
Assume that for all $k$ and some numbers $a_k\in\mathbb R$, $c_k>0$ we have $\Delta_k\in[a_k, a_k+c_k]$ almost surely. Then for all $t>0$,
$$\mathbb P\bigl(X\ge\mathbb E[X]+t\bigr) \le \exp\Bigl(-\frac{2t^2}{\sum_k c_k^2}\Bigr).$$
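The bounded-difference mechanism can be checked on a toy statistic (my example, not from the paper): for $Z=\max_g |\hat\mu_n(g)-\mu(g)|$ over a few observables $g$ bounded by 1, replacing one sample moves $Z$ by at most $2/n$, so the theorem above (in its McDiarmid form) gives $\mathbb P(Z\ge\mathbb E[Z]+t)\le\exp(-nt^2/2)$.

```python
import numpy as np

# Z = max over a few 1-bounded observables of |empirical mean - expectation|,
# for n i.i.d. uniform samples on [0, 2*pi]. Changing one sample changes Z
# by at most 2/n, so McDiarmid gives P(Z >= E[Z] + t) <= exp(-n t^2 / 2).
rng = np.random.default_rng(5)
n, reps = 2_000, 2_000
observables = [
    (np.sin, 0.0),                                 # E[sin(U)] = 0
    (np.cos, 0.0),                                 # E[cos(U)] = 0
    (lambda u: np.abs(np.sin(3 * u)), 2 / np.pi),  # E[|sin(3U)|] = 2/pi
]
vals = np.empty(reps)
for r in range(reps):
    s = rng.uniform(0.0, 2 * np.pi, size=n)
    vals[r] = max(abs(g(s).mean() - mean) for g, mean in observables)

t = 3.0 / np.sqrt(n)
tail = float(np.mean(vals >= vals.mean() + t))
print(vals.mean(), tail, np.exp(-n * t * t / 2))  # observed tail vs bound
```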

The independent case
In the case of i.i.d. random variables, the Azuma-Hoeffding inequality famously yields Gaussian concentration of $\|\hat\mu_n-\mu\|_{\mathcal F}$ around its expectation. Similarly, with Theorem B we can obtain entirely explicit, non-asymptotic concentration bounds.

Markov chains
To tackle Markov chains we will need some hypothesis to replace independence; we choose a framework that covers the case of W 1 , but also more general dual metrics ‖•‖ F .
Assume that $\Omega$ is endowed with a metric $\rho$ of finite diameter ($\rho$ is assumed to be lower semi-continuous, but not necessarily to induce the given topology on $\Omega$). We still denote by $\mathrm{Lip}_1(\Omega)$ the set of functions $\Omega\to\mathbb R$ which are 1-Lipschitz with respect to $\rho$.
Let $(X_k)_{k\ge 0}$ be a Markov chain on $\Omega$ which is exponentially contracting (see the beginning of Section 4) with constant $C$ and rate $\theta$, in the metric $\rho$ instead of the Euclidean norm; this can be rewritten in a coupling formulation as follows: for all $x,x'\in\Omega$ and all $\ell\in\mathbb N$, there are random variables $(X'_k)_{k\ge\ell}$ with the same law as $(X_k)_{k\ge\ell}$ conditioned on $X_\ell=x'$, coupled with the chain conditioned on $X_\ell=x$, such that $\mathbb E[\rho(X_k,X'_k)]\le C\theta^{k-\ell}\rho(x,x')$. Given a function $\Phi:\Omega^n\to\mathbb R$, denote by $\Lambda_k(\Phi)$ its Lipschitz constant with respect to the $k$-th coordinate; we say that $\Phi$ is separately Lipschitz if $\Lambda_k(\Phi)<\infty$ for all $k$ (when $\rho=\mathbb 1_{x\ne y}$ is the discrete metric, the coordinate-wise Lipschitz constants become the coordinate-wise oscillations).