Bayesian Learning with Wasserstein Barycenters

We introduce and study a novel model-selection strategy for Bayesian learning, based on optimal transport, along with its associated predictive posterior law: the Wasserstein population barycenter of the posterior law over models. We first show how this estimator, termed Bayesian Wasserstein barycenter (BWB), arises naturally in a general, parameter-free Bayesian model-selection framework, when the considered Bayesian risk is the Wasserstein distance. Examples are given, illustrating how the BWB extends some classic parametric and non-parametric selection strategies. Furthermore, we also provide explicit conditions granting the existence and statistical consistency of the BWB, and discuss some of its general and specific properties, providing insights into its advantages compared to usual choices, such as the model average estimator. Finally, we illustrate how this estimator can be computed using the stochastic gradient descent (SGD) algorithm in Wasserstein space introduced in a companion paper arXiv:2201.04232v2 [math.OC], and provide a numerical example for experimental validation of the proposed method.


Introduction
Given a set M of probability distributions on some data space X, learning a model m ∈ M from data points D = (x_1, . . . , x_n) ∈ X^n consists in choosing, under a given criterion, an element of M that best explains D as data sampled from it. The Bayesian paradigm provides a probabilistic approach to deal with model uncertainty in terms of a prior distribution on models M, and also furnishes strategies to address the problem of model selection, based on the posterior distribution on M given D. This type of estimator, usually called a predictive posterior law, includes classical Bayesian estimators such as the maximum a posteriori estimator (MAP), the posterior mean, the Bayesian model average estimator (BMA) and generalizations thereof. Predictive posterior estimators typically result from selection criteria consisting in the optimization of some loss function averaged with respect to the posterior law over models, i.e. a Bayesian risk function. We refer the reader to [23,37] and references therein for mathematical background on Bayesian statistics and their use in the machine learning community.
In Section 2, we will formulate the general problem of Bayesian model selection directly on the space of probability measures (or models) on the data space, and show how this abstract framework covers both classic finitely-parametrized settings and parameter-free model spaces, allowing us to retrieve classical selection criteria as particular cases. An eye-opening observation that will follow from adopting this viewpoint is that many classical predictive posteriors can be seen as instances of Fréchet means [22], or barycenters in the space of probability measures, with respect to specific metrics or divergences between them, that play the role of abstract loss functions defined on the model space.
Building upon this general framework, the main goals of this work are to introduce a novel Bayesian model-selection criterion, by proposing a loss function on models coming from the theory of optimal transport, and to study some of the distinctive features of the predictive posterior law that results from it. More precisely, let us consider observations D = (x_1, . . . , x_n) in a metric space (X, d) and a set of candidate models M that generated these observations. Equipping the set M with a prior distribution Π, and denoting the corresponding posterior distribution over M by Π_n, we will define the Bayesian Wasserstein barycenter estimator (BWB) as a minimizer $\hat m^n_p \in \mathcal{M}$ of the "risk function"
$$\mathcal{M} \ni \bar m \;\longmapsto\; \int_{\mathcal{P}(\mathcal{X})} W_p(m, \bar m)^p \,\Pi_n(dm), \qquad (1.1)$$
where P(X) is the set of probability measures on X, p ≥ 1 and W_p is the celebrated p-Wasserstein distance on probability measures on X associated with d, see [45,46]. Wasserstein barycenters were initially studied in [1] and, since then, the concept has been extensively explored from both theoretical and practical perspectives. We refer the reader to the overview [38] for statistical applications, to the works [12,15,16,41] for applications in machine learning, and to [3,4,10,33] for a presentation of recent developments and further references. As a cautionary tale, we mention that the problem of computing Wasserstein barycenters is known to be NP-hard, see [2]. To cope with this, fast methods aiming at learning and generating approximate Wasserstein barycenters on the basis of neural network techniques have also been proposed, see e.g. [31,32].
In Section 3 we will recall Wasserstein distances and revisit Wasserstein barycenters together with their basic properties such as existence, uniqueness and absolute continuity. In Section 4, we will rigorously introduce the BWB estimator $\hat m^n_p$, which corresponds to the so-called population Wasserstein barycenter [33] of the posterior distribution on models Π_n, and we will state some of its main properties. Specifically, we will show that the BWB has less variance than the BMA and we will study its statistical consistency. In particular, we will address the question of "posterior consistency", or asymptotic concentration of the posteriors Π_n around the Dirac mass on a model m_0 as n → ∞, whenever the data consists of i.i.d. observations following the law m_0, alongside the question of convergence of the BWB to the "true" distribution m_0 of the data in that setting. We refer the reader to [17,23], and references therein, for detailed accounts on posterior consistency, a highly desirable feature of a Bayesian estimation procedure, both from a semi-frequentist perspective as well as from the "merging of opinions" point of view on Bayesian statistics (cf. [23], Chap. 6). After reviewing central notions and tools from the framework of posterior consistency, namely the celebrated Schwartz theorem [44], Theorem 6.17 of [23] and the notion of Kullback-Leibler support of Π, we will provide in that section equivalent and (verifiable) sufficient conditions both for posterior consistency in the Wasserstein topology and for the a.s. convergence of the BWB to the true model. Lastly, we will show how the BWB estimator can be calculated using a novel stochastic algorithm, introduced in the companion paper [7], to compute population Wasserstein barycenters in a general setting. This algorithm, presented in Section 5, can be seen as an abstract stochastic gradient descent method in the Wasserstein space and is advantageous compared to the gradient or fixed-point algorithms developed in [3,4,48], whose application is restricted to barycenters of model spaces M comprised of finitely many elements only. Moreover, our algorithm has theoretical guarantees of convergence under suitable conditions, it can be easily implemented for some families of regular models for which optimal transport maps are explicit or easily computed (we recall one such family in Sect. 6.1), and its convergence rate can be studied and established in some cases, see [13] for the Gaussian setting. A comprehensive numerical experiment illustrating this method, and its natural "batch" variants, will be presented in Section 6 and compared to (more conventional) empirical barycenter estimators.

Notation:
We denote by P(X) the set of (Borel) probability measures on X, endowed with the weak topology, and by P_ac(X) the (measurable) subset of probability measures that are absolutely continuous with respect to a common reference σ-finite measure λ on X. As a convention, we shall use the same notation for an element m(dx) ∈ P_ac(X) and its density m(x) with respect to λ. We denote by supp(ν) the support of a measure ν and by |supp(ν)| its cardinality. Given Γ ∈ P(P(X)) and a measurable subset M ⊆ P(X), we say that M is a model space for Γ if Γ(M) = 1. Last, given a measurable map T : Y → Z and a measure ν on Y, we denote by ν^T the image measure (or push-forward), that is, the measure on Z given by ν^T(·) = ν(T^{-1}(·)).

Bayesian learning in model space
We start by setting a general framework for Bayesian learning which covers both finitely-parametrized settings (including hierarchical models) and parameter-free models. We consider a probability measure Π ∈ P(P(X )) understood as a prior distribution on the model space M ⊆ P ac (X ). In particular we have Π(M) = Π(P ac (X )) = 1.
The posterior distribution Π(dm|x_1, . . . , x_n) given the data D is also an element of P(P(X)), which we denote Π_n for simplicity and which, by virtue of the Bayes rule, is given in this setting by
$$\Pi_n(dm) \;=\; \Lambda_n(m)\,\Pi(dm), \qquad \Lambda_n(m) \;:=\; \frac{\prod_{i=1}^n m(x_i)}{\int_{\mathcal{M}} \prod_{i=1}^n \bar m(x_i)\,\Pi(d\bar m)}. \qquad (2.2)$$
Notice that (x_1, . . . , x_n) ↦ Π_n = Π(·|x_1, . . . , x_n) defines, λ^{⊗n}-a.e., a measurable function from X^n to P(P(X)). The density Λ_n(m) of Π_n(dm) with respect to the prior Π(dm) is called the likelihood function. The fact that Π_n ≪ Π implies that a model space M for Π is a model space for Π_n too.
We call loss function a non-negative functional on models L : M × M → R, interpreting L(m_0, m̄) as the cost of selecting model m̄ ∈ M when the true model is m_0 ∈ M. Given a loss function and the posterior distribution over models Π_n, the Bayes risk (or expected loss) R_L(m̄|D) and the corresponding Bayes estimator m̂_L (or predictive posterior law) are respectively defined as follows:
$$R_L(\bar m\,|\,D) \;:=\; \int_{\mathcal{M}} L(m, \bar m)\,\Pi_n(dm), \qquad (2.3)$$
$$\hat m_L \;\in\; \operatorname*{argmin}_{\bar m \in \mathcal{M}} R_L(\bar m\,|\,D). \qquad (2.4)$$
See [8] for further background on Bayes risk and statistical decision theory. A key consequence of defining both L and Π_n directly on the model space M (rather than on a parameter space) is that learning according to equations (2.3)-(2.4) does not depend on the chosen parametrization or on the geometry of the parameter space. Moreover, this point of view will allow us to define loss functions in terms of various metrics/divergences directly on the space P(X), and therefore to enhance the classical Bayesian estimation framework through the use of optimal transportation distances on that space. Before further developing these ideas, we discuss how this general framework includes model spaces which are finitely parametrized, and recall some standard choices in that setting. We will also discuss the advantages of formulating the problem of model selection directly on the model space, even when this space can be finitely parametrized.

Parametric setting
We say that M is finitely parametrized if there is an integer k, a measurable set Θ ⊆ R k termed parameter space, and a measurable function T : Θ → P ac (X ), called parametrization, such that M = T (Θ). In other words, m = T (θ) is the model corresponding to parameter θ, which is classically denoted p(·|θ) or p θ (·). In general, T is a one-to-one function (the model space is otherwise said to be over-parametrized).
The parametric case is embedded into the considered model-space setting as follows. Given a prior density p(θ) on Θ, the push-forward of p through T defines a prior Π = p^T over the model space M in the sense discussed at the beginning of this section. If T is one-to-one, a loss function ℓ : Θ × Θ → R_+ induces a loss function L on the model space M = T(Θ) such that L(T(θ_0), T(θ̄)) = ℓ(θ_0, θ̄). More generally, any loss function L on M × M induces a loss functional on Θ × Θ defined as ℓ(θ_0, θ̄) := L(T(θ_0), T(θ̄)). Moreover, when the data x_1, . . . , x_n under the model parametrized by θ consists of an i.i.d. sample from p(·|θ) = T(θ), one can verify that Π_n given in equation (2.2) corresponds precisely to the push-forward through T of p(θ|x_1, . . . , x_n), and that the Bayes risk associated with ℓ is given by
$$R_\ell(\bar\theta\,|\,D) \;=\; \int_{\mathcal{M}} L(m, \bar m)\,\Pi_n(dm) \;=\; R_L(\bar m\,|\,D), \qquad (2.5)$$
with Π_n(dm) = Λ_n(m)Π(dm) associated with the prior on model space Π = p^T, and m̄ = T(θ̄).
Example 2.1. Consider the parametric Bayesian model with parameter space Θ and sample space X both equal to R^d, and Gaussian parametrized models N(θ, Σ), with Σ ∈ R^{d×d} a fixed covariance matrix and θ a random mean with prior N(μ_0, Σ_0). The mean μ_0 ∈ R^d and covariance matrix Σ_0 are fixed hyperparameters. This is classically denoted:
$$x_i \,|\, \theta \;\overset{\text{i.i.d.}}{\sim}\; N(\theta, \Sigma), \qquad \theta \;\sim\; N(\mu_0, \Sigma_0).$$
From now on, N(y; ν, K) stands for the density of the Gaussian law N(ν, K) evaluated at the value y. Following the introduced notation, the parametrization T : Θ → P(X) is thus given by T(θ) = N(θ, Σ), the model space is M = {N(θ, Σ) : θ ∈ R^d}, and a measure m sampled from the prior Π(dm) = p^T(dm) is a Gaussian distribution on X = R^d with fixed covariance matrix Σ and random mean θ distributed according to N(μ_0, Σ_0). In this case, if both Σ and Σ_0 are nonsingular, the posterior of θ given D = (x_1, . . . , x_n) is
$$p(\theta\,|\,D) \;=\; N\!\Big(\theta;\ (\Sigma_0^{-1} + n\Sigma^{-1})^{-1}\big(\Sigma_0^{-1}\mu_0 + n\Sigma^{-1}\bar x_n\big),\ (\Sigma_0^{-1} + n\Sigma^{-1})^{-1}\Big), \qquad (2.6)$$
with x̄_n the sample mean of the observations D. A model m sampled from the posterior Π_n(dm) given D is thus obtained by sampling θ distributed according to p(θ|D) in equation (2.6) and then setting m = N(θ, Σ). The posterior mean estimator of the parameter θ is therefore given by
$$\hat\theta_2 \;:=\; (\Sigma_0^{-1} + n\Sigma^{-1})^{-1}\big(\Sigma_0^{-1}\mu_0 + n\Sigma^{-1}\bar x_n\big), \qquad (2.7)$$
and the corresponding predictive posterior model is the Gaussian law N(θ̂_2, Σ). Observe that, in this case, one could equivalently obtain θ̂_2 as the mean of the predictive posterior law m̂_L associated with the loss function L on models defined by L(m, m̄) = ‖∫ x m(dx) − ∫ x m̄(dx)‖². This illustrates the equivalence of (some) pairs of losses in the model and parameter spaces that result in the same Bayes estimators. Additionally, observe that Bayesian inference on the mean and the covariance matrix could be similarly formulated in terms of a loss function on models too, taking e.g. L'(m, m̄) = L(m, m̄) + ‖Σ_m − Σ_m̄‖_Fro, with L(m, m̄) as before, Σ_m the covariance matrix of a r.v. with law m and ‖·‖_Fro the Frobenius norm on matrices.
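To make the update above concrete, here is a minimal numerical sketch of the conjugate computation (2.6)-(2.7) in Python; the dimension, hyperparameters and data-generating mean are illustrative choices, not values taken from the paper.

import numpy as np

# Minimal sketch of the conjugate Gaussian update of Example 2.1
# (illustrative hyperparameters; any symmetric positive definite
#  Sigma, Sigma0 and any mu0 would do).
rng = np.random.default_rng(0)
d = 2
Sigma  = np.eye(d)            # fixed model covariance
mu0    = np.zeros(d)          # prior mean
Sigma0 = 4.0 * np.eye(d)      # prior covariance

theta_true = np.array([1.0, -2.0])
n = 50
X = rng.multivariate_normal(theta_true, Sigma, size=n)   # data D

# Posterior N(theta_hat, S_n) with S_n = (Sigma0^{-1} + n Sigma^{-1})^{-1}
# and theta_hat = S_n (Sigma0^{-1} mu0 + n Sigma^{-1} xbar_n), cf. (2.6)-(2.7).
xbar = X.mean(axis=0)
S_n = np.linalg.inv(np.linalg.inv(Sigma0) + n * np.linalg.inv(Sigma))
theta_hat = S_n @ (np.linalg.inv(Sigma0) @ mu0 + n * np.linalg.inv(Sigma) @ xbar)
print(theta_hat, S_n)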
The general model-space framework applies equally to models whose parameter and data spaces are of a different nature. Example 2.2. Assume Θ = R_+ and X = N, with p(x|λ) = e^{-λ} λ^x / x!, x ∈ X, and p(λ) ∝ e^{-βλ} λ^{α−1}, λ ∈ Θ. In this case, T : Θ → P(X) is given by T(λ) = Pois(λ), the Poisson distribution of parameter λ, and a measure m sampled from the prior Π(dm) = p^T(dm) is a Poisson law with random parameter λ distributed according to the Gamma(α, β) law. The latter is a conjugate prior for the Poisson distribution, and m sampled from Π_n(dm), the posterior on models given D = (x_1, . . . , x_n), is again a Poisson law Pois(λ), with random parameter λ distributed according to Gamma(n x̄_n + α, n + β). Remark 2.3. Hierarchical models are also catered for in the proposed setting. For instance, if in Example 2.1 the hyperparameter μ_0 is random with known density π_0 on R^d, it can be integrated out and the prior p on parameters becomes an infinite Gaussian mixture:
$$p(\theta) \;=\; \int_{\mathbb{R}^d} N(\theta;\ \mu_0,\ \Sigma_0)\,\pi_0(\mu_0)\,d\mu_0.$$
The parametrization mapping T is in this case the same as before, and the corresponding prior Π(dm) = p^T(dm) and posteriors Π_n(dm) on models m ∈ P(X) follow the same rationale as above.
As a cautionary note, the following example illustrates how defining estimators directly in terms of parameters might result in non-intrinsic criteria for model selection. On the other hand, starting from the prior p̃, the posterior density of θ is proportional to $\theta^{2S_N+1}(1-\theta^2)^{N-S_N}$, and now the associated MAP estimator for θ is $\tilde\theta_N := \frac{2S_N+1}{2N+1}$. Hence T(θ̂_N) ≠ T̃(θ̃_N) in general (although their discrepancy vanishes in the limit as N → ∞). To summarize, although the same prior and posteriors at the level of models can arise by considering different parametrizations, the latter may easily define very different estimated models even if we agree on the estimation method (here the MAP).

Non parametric setting: posterior average estimators
The general "learning in model space" approach relies on loss functions that compare directly distributions (instead of their parameters), and thus allows us to define selection criteria based on intrinsic features of the models. It also allows for a wider choice of model-selection criteria, which can account for geometric or information-theoretic properties of the models. The next result illustrates the fact that many examples of Bayesian estimators or predictive posterior, including the classical model average estimator, correspond to finding an instance of Fréchet mean or barycenter [22,48] under a suitable metric/divergence on probability measures. See Appendix A for the proof. Furthermore, the Bayes estimators corresponding to the cases ii) and iv) are given by the square model average and the exponential model average, respectively: where Z sqr and Z exp denote the corresponding normalizing constants.
Example 2.6. If the posterior distribution was approximately equally concentrated on the models m 0 = N (µ 0 , 1) and m 1 = N (µ 1 , 1) with µ 0 = µ 1 , that is, two (unimodal) Gaussian distributions with unit variance, then the standard model average is a bimodal non-Gaussian distribution with variance strictly larger than 1.
Example 2.7. In the parametric Bayesian model discussed in Example 2.1, the Bayesian model average estimator is the convolution of distributions on R^d:
$$\hat m_{\mathrm{BMA}}(x) \;=\; \int_{\mathbb{R}^d} N\!\big(x;\ \theta,\ \Sigma\big)\,p(\theta\,|\,D)\,d\theta \;=\; N\!\big(x;\ \hat\theta_2,\ \Sigma + (\Sigma_0^{-1} + n\Sigma^{-1})^{-1}\big), \qquad (2.10)$$
that is, a Gaussian density with mean equal to the posterior mean estimator θ̂_2 (see equation (2.7)) and a covariance matrix that is strictly larger (in the usual order on nonnegative definite symmetric matrices) than that of the predictive posterior law N(θ̂_2, Σ) associated with it.
The Bayesian estimators considered in Proposition 2.5, equations (2.8)-(2.9), share the following characteristic: their value at each point x ∈ X is computed in terms of some posterior average of the values of certain functions evaluated at x. This is due to the fact that the corresponding distances/divergences on probability distributions are "vertical" [43]: computing the distance between distributions m and m̄ involves the integration of vertical displacements between the graphs of their densities across their domain. An undesirable fact about vertical averages is that they are not well suited to incorporating geometric properties of the model space (as illustrated by Example 2.6). More generally, model averages might yield solutions that are hardly interpretable in terms of the prior and parameters, or even intractable. This motivates us to explore the use of "horizontal" distances between probability distributions, thus extending the concept of Bayes estimator by making use of the geometric features of the model space. The next section presents the main ideas we will rely on to build such Bayes estimators.

Wasserstein distances and Barycenters: a quick review
We shall now introduce objects analogous to those in Proposition 2.5 but suited to Wasserstein distances on the space of probability measures. The following framework is adopted in the sequel: Assumption 3.1. The metric space (X , d) is a separable locally-compact geodesic space endowed with a σ-finite Borel measure λ and p ≥ 1.
By geodesic we mean that the space (X, d) is complete and that any pair of points admits a mid-point with respect to d. Next, we briefly recall some basic elements of optimal transport theory and Wasserstein distances, referring to [45,46] for general background.

Optimal transport and the Wasserstein distance
Given two measures μ, υ over X we denote by Cpl(μ, υ) the set of transport plans or couplings with marginals μ and υ, i.e., γ ∈ Cpl(μ, υ) if and only if γ ∈ P(X × X), γ(dx, X) = μ(dx) and γ(X, dy) = υ(dy). Given a real number p ≥ 1 we define the p-Wasserstein space W_p(X) by
$$W_p(\mathcal{X}) \;:=\; \Big\{\mu \in \mathcal{P}(\mathcal{X}) : \int_{\mathcal{X}} d(x, x_0)^p\,\mu(dx) < \infty \ \text{for some (and then all) } x_0 \in \mathcal{X}\Big\}.$$
The p-Wasserstein distance between measures μ and υ is given by
$$W_p(\mu, \upsilon) \;:=\; \Big(\inf_{\gamma \in \mathrm{Cpl}(\mu,\upsilon)} \int_{\mathcal{X}\times\mathcal{X}} d(x, y)^p\,\gamma(dx, dy)\Big)^{1/p}. \qquad (3.1)$$
An optimizer of the right-hand side of equation (3.1) always exists and is called an optimal transport plan. The distance W_p turns W_p(X) into a complete metric space. If in equation (3.1) we take p = 2, X the Euclidean space, and μ absolutely continuous, then Brenier's theorem ([45], Thm. 2.12(ii)) establishes the uniqueness of a minimizer, and guarantees that it is supported on the graph of the subdifferential of a convex function; the corresponding gradient is thus called an optimal transport map. Explicit formulae for such optimal transport maps do exist in some cases, e.g., for generic one-dimensional distributions and for multivariate Gaussians when p = 2 (see [14]). Contrary to the distances/divergences considered in Proposition 2.5, Wasserstein distances are horizontal [43], in the sense that they involve integrating horizontal displacements between the graphs of probability densities.
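As an illustration of the closed-form cases just mentioned, the following Python sketch evaluates W_p in two situations where no optimization is needed: one-dimensional empirical measures of equal size (via sorted samples, i.e. quantiles) and Gaussian laws with p = 2 (via the well-known Bures-type formula). The inputs are arbitrary toy values.

import numpy as np
from scipy.linalg import sqrtm

# (i) One-dimensional empirical measures: W_p reduces to comparing quantiles;
#     for two samples of equal size this amounts to sorting.
def w2_empirical_1d(x, y):
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

# (ii) Gaussians on R^q with p = 2:
#     W_2^2 = |m1 - m2|^2 + Tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2}).
def w2_gaussian(m1, S1, m2, S2):
    rS1 = sqrtm(S1)
    cross = sqrtm(rS1 @ S2 @ rS1)
    return np.sqrt(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * cross).real)

rng = np.random.default_rng(1)
print(w2_empirical_1d(rng.normal(0, 1, 500), rng.normal(2, 1, 500)))
print(w2_gaussian(np.zeros(2), np.eye(2), np.ones(2), 2 * np.eye(2)))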

Wasserstein barycenter
Let us now recall the Wasserstein barycenter, introduced in [1] and further studied in [28,33,40], among others. Our definition slightly extends the ones in those works in that the optimization problem is posed in a possibly strict subset of the usual one.
Given a measurable set M ⊆ P(X), any measure m̂_p ∈ M which attains the infimum
$$\inf_{\bar m \in \mathcal{M}} \int_{\mathcal{P}(\mathcal{X})} W_p(m, \bar m)^p\,\Gamma(dm)$$
with finite value is called a p-Wasserstein barycenter of Γ over M. Notice that in principle we are not assuming M to be a model space for Γ, but this will often be the case. When the support of Γ is infinite and M = W_p(X), this object is termed the p-Wasserstein population barycenter of Γ, as introduced in [33]; see also [10].
This should be compared to Example 2.6. Figure 1 illustrates the corresponding vertical and horizontal interpolations between two Gaussian densities with different means and the same variance.
Let us introduce additional notation for the sequel and then review some basic properties of Wasserstein barycenters. Considering W_p(X) with the complete metric W_p as a base Polish metric space, we define W_p(W_p(X)) in the natural way: Γ ∈ P(W_p(X)) is an element of W_p(W_p(X)) if it is concentrated on a set of measures with finite moments of order p and, moreover, for some (and then all) m̄ ∈ W_p(X) it satisfies
$$\int_{\mathcal{P}(\mathcal{X})} W_p(m, \bar m)^p\,\Gamma(dm) < \infty.$$
We endow W_p(W_p(X)) with the corresponding p-Wasserstein distance, which we also denote W_p for simplicity. Also, if Γ is concentrated on measures with finite moments of order p which have densities with respect to λ, then we write Γ ∈ P(W_{p,ac}(X)), and we use the notation Γ ∈ W_p(W_{p,ac}(X)) if, as before, ∫_{P(X)} W_p(m, m̄)^p Γ(dm) < ∞ for some m̄.
Remark 3.6. If Γ ∈ P(P(X)) has a p-Wasserstein barycenter m̂_p over M (with finite optimal value), then Γ ∈ W_p(W_p(X)).
We next state an existence result first established in Theorem 2 of [33] for the case M = W_p(X). See Appendix B for a simpler, more direct, proof.
Theorem 3.7. Suppose Assumption 3.1 holds, Γ ∈ P(P(X)), and M ⊆ W_p(X) is a weakly closed set. There exists a p-Wasserstein barycenter of Γ over M if and only if Γ ∈ W_p(W_p(X)).
Regarding uniqueness, the following general result was proven in Proposition 6 of [33] for the case X = R^q with d the Euclidean distance and p = 2 (observe that, in that situation, the previous result applies): Lemma 3.8. Let Γ ∈ W_2(W_2(R^q)) and suppose there exists a measurable set A ⊆ P(R^q) of absolutely continuous measures such that Γ(A) > 0. Then, Γ admits a unique 2-Wasserstein population barycenter over W_2(R^q). Remark 3.9. Observe that the model space M = W_{p,ac}(X) is not weakly closed. Nevertheless, the existence and uniqueness of a population barycenter over that set can still be guaranteed when p = 2, X = R^q, d is the Euclidean distance, λ is the Lebesgue measure, and
$$\Gamma\big(\{m \in W_2(\mathbb{R}^q) : m \ll \lambda\}\big) \;=\; 1. \qquad (3.2)$$
This was proven in Theorem 6.2 of [28] for compact finite-dimensional manifolds with lower-bounded Ricci curvature (equipped with the volume measure), but one can read off the (non-compact but flat) Euclidean case X = R^q from the proof therein, in order to establish the absolute continuity of a barycenter over W_2(R^q) in the setting of Lemma 3.8. If |supp(Γ)| < ∞, then equation (3.2) can be relaxed to the condition Γ({m : m ≪ λ}) > 0, as shown in [1] or Theorem 5.1 of [28].
The following statement, corresponding to Lemma 3.1 of [7], provides a useful description of barycenters which generalizes a result proven in [3] when |supp(Γ)| < ∞.
Lemma 3.10. Suppose Γ ∈ W_2(W_{2,ac}(R^q)) and let μ ∈ W_{2,ac}(R^q). Then there is a measurable function (m, x) ↦ T^m_μ(x) which is, λ(dx)Γ(dm)-a.s., equal to the unique optimal transport map from μ to m at x. Furthermore, letting μ̄ be a barycenter of Γ, we have
$$\bar\mu \;=\; \Big(\int T^m_{\bar\mu}\,\Gamma(dm)\Big)(\bar\mu).$$

Bayesian Wasserstein barycenter and statistical properties
Building on the Wasserstein distance as a loss function on models, we arrive at the following central object of the article:
Definition 4.1. Let us consider a prior Π ∈ P(P(X)) with model space M ⊆ W_{p,ac}(X) and data D = (x_1, . . . , x_n), which determines Π_n as in equation (2.2). We define the p-Wasserstein Bayes risk of m̄ ∈ W_p(X) and a Bayes Wasserstein barycenter (BWB) estimator m̂^n_p over M respectively as follows:
$$R_p(\bar m\,|\,D) \;:=\; \int_{\mathcal{M}} W_p(m, \bar m)^p\,\Pi_n(dm), \qquad (4.1)$$
$$\hat m^n_p \;\in\; \operatorname*{argmin}_{\bar m \in \mathcal{M}} R_p(\bar m\,|\,D), \qquad (4.2)$$
if the corresponding minimum is finite.
Example 4.2. In the setting of Example 2.1, all models share the covariance matrix Σ, and the 2-Wasserstein distance between N(θ, Σ) and N(θ̄, Σ) equals ‖θ − θ̄‖, so that the risk (4.1) with p = 2 reduces to ∫ ‖θ − θ̄‖² p(θ|D) dθ. This implies that, for any tuple of data points D = (x_1, ..., x_n), the BWB corresponds to the Gaussian distribution N(θ̂_2, Σ), whose mean is the posterior mean estimator of the mean parameter. Moreover, in this particular case the barycentric cost, or optimal 2-Wasserstein Bayes risk, equals the trace of the covariance of a random vector with law given in equation (2.6), i.e., Tr((Σ_0^{-1} + nΣ^{-1})^{-1}). Notice that the covariance Σ of the BWB is strictly smaller than that of the corresponding BMA estimator in equation (2.10) (in the usual order on symmetric positive semidefinite matrices). This is, in fact, a general property of the BWB, as claimed in Proposition 4.9 below.
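The following short Python sketch illustrates, on the Gaussian model of Examples 2.1 and 4.2, the comparison between the BWB N(θ̂_2, Σ) and the BMA of equation (2.10); the numerical values of Σ, Σ_0, n and of the posterior mean are arbitrary placeholders.

import numpy as np

# Sketch comparing the BWB and the BMA in the Gaussian setting of Example 2.1
# (illustrative numbers; see also Proposition 4.9 below).
d = 2
Sigma, Sigma0 = np.eye(d), 4.0 * np.eye(d)
n = 50
S_n = np.linalg.inv(np.linalg.inv(Sigma0) + n * np.linalg.inv(Sigma))  # posterior cov. of theta
theta_hat = np.array([0.9, -1.8])                                      # posterior mean (any value)

# BWB:  N(theta_hat, Sigma);   BMA:  N(theta_hat, Sigma + S_n), cf. (2.10).
cov_bwb = Sigma
cov_bma = Sigma + S_n
barycentric_cost = np.trace(S_n)     # optimal 2-Wasserstein Bayes risk in this model

# The BMA covariance dominates the BWB covariance in the p.s.d. order:
print(np.all(np.linalg.eigvalsh(cov_bma - cov_bwb) >= 0), barycentric_cost)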
Example 4.3. Similarly, in the parametric setting of Example 2.2, the problem of finding BWB estimators can be written in terms of the parametric loss induced by the corresponding p-Wasserstein distance on Poisson distributions, computed in Example 3.3. We thus deduce, for p = 1, that the BWB estimator m̂^n_1 corresponds, in that setting, to the law Pois(λ_1), with λ_1 the median of the posterior distribution Gamma(n x̄_n + α, n + β), for which an explicit expression is not available.
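Numerically, the BWB of Example 4.3 is nevertheless straightforward to evaluate, since the Gamma median is available through standard quantile routines; a possible sketch (with made-up data and hyperparameters) is:

import numpy as np
from scipy.stats import gamma

# Sketch of the 1-Wasserstein BWB of Example 4.3: Pois(lambda_1) with lambda_1
# the median of the Gamma(n*xbar + alpha, n + beta) posterior
# (illustrative data and hyperparameters).
alpha, beta = 2.0, 1.0
rng = np.random.default_rng(2)
x = rng.poisson(3.5, size=100)
n, xbar = len(x), x.mean()

# scipy parametrizes Gamma by shape and scale; rate n + beta means scale = 1/(n + beta)
lambda_1 = gamma.ppf(0.5, a=n * xbar + alpha, scale=1.0 / (n + beta))
print(lambda_1)   # rate of the BWB Poisson law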
In the one-dimensional case X ⊆ R, the coupling given by the monotone rearrangement, i.e. by composing the quantile function of one measure with the cumulative distribution function of the other, is known to be optimal for the p-Wasserstein distance, for any p ≥ 1 (see [45], Rem. 2.19(iv)). The BWB m̂^n = m̂^n_p of the posterior Π_n is in this case also independent of p and is characterized via its quantile function, as follows:
$$F_{\hat m^n}^{-1}(\cdot) \;=\; \int_{\mathcal{M}} F_m^{-1}(\cdot)\,\Pi_n(dm).$$
Interestingly enough, the model average m̃_n := ∫ m Π_n(dm) of Π_n is in turn characterized by its averaged cumulative distribution function: F_{m̃_n}(·) = ∫_M F_m(·) Π_n(dm). See Section 5 in the companion paper [7] for details and further discussion on one-dimensional Wasserstein barycenters, in particular on geometric properties they inherit from the elements m of the support of the prior/posterior. Remark 4.6. We observe that Π ∈ W_p(W_p(X)) implies, for each n ≥ 1, that Π_n ∈ W_p(W_p(X)) a.s.
Indeed, for fixed m̄ ∈ W_p(X), this can be verified by an explicit computation (we thank an anonymous referee for pointing this out). However, Π ∈ W_p(W_p(X)) is in general not enough for Π_n ∈ W_p(W_p(X)) to hold a.s. for i.i.d. data points (x_1, . . . , x_n) sampled from a fixed law m_0, which would be the natural setting in which to formulate the question of Bayesian consistency (see the next subsection). Corollary 4.19 below ensures this fact for suitable given laws m_0, in the framework of Bayesian consistency. In Appendix C we further provide a sufficient condition on the prior Π (termed integrability after updates) ensuring that Π_n ∈ W_p(W_p(X)) for every possible tuple of data points (x_1, . . . , x_n).
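Returning to the one-dimensional case discussed before Remark 4.6, the following sketch contrasts the quantile-averaging characterization of the barycenter with the c.d.f.-averaging characterization of the model average, replacing Π_n by a finite collection of sampled Gaussian models (a stand-in posterior, for illustration only).

import numpy as np
from scipy.stats import norm

# Quantile-averaging (barycenter) vs. c.d.f.-averaging (model average) in 1d.
rng = np.random.default_rng(3)
k = 200
means = rng.normal(0.0, 1.0, size=k)        # stand-in posterior draws of a location parameter
u = np.linspace(0.001, 0.999, 999)          # quantile levels

# Barycenter: F^{-1}_bwb(u) = average over models of F^{-1}_m(u)
q_bwb = np.mean([norm.ppf(u, loc=mu, scale=1.0) for mu in means], axis=0)

# Model average: F_bma(x) = average over models of F_m(x), on a grid
x = np.linspace(-6, 6, 1201)
F_bma = np.mean([norm.cdf(x, loc=mu, scale=1.0) for mu in means], axis=0)

print(q_bwb[499])                 # median of the barycenter (close to the mean of `means`)
print(np.interp(0.5, F_bma, x))   # median of the model average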
The following statement gathers the discussion in Section 3.2 for the case Γ = Π_n, as well as the main point of Remark 4.6: Theorem 4.7. Suppose Assumption 3.1 holds and the model space M is weakly closed. Let Π ∈ P(P(X)) be a prior with model space M ⊆ W_{p,ac}(X) and Π_n be the corresponding posterior given the data D = (x_1, . . . , x_n). The following are equivalent: (i) Π_n ∈ W_p(W_p(X)); (ii) a p-Wasserstein barycenter of Π_n over M exists. Moreover, if Π ∈ W_p(W_p(X)), a p-Wasserstein barycenter estimator m̂^n_p over M exists a.s. for every n ≥ 1.
We thus make in all the sequel the following assumption: Assumption 4.8. Π ∈ W p (W p,ac (X )) and there exists a weakly closed model space M ⊆ W p,ac (X ) for Π.
Since Π n Π a.s., Assumption 4.8 together with Remark 4.6 imply that Π n ∈ W p (W p,ac (X )) a.s. for all n.
We will next study some basic statistical properties of the BWB estimator.

Variance reduction with respect to BMA
In this subsection, we assume that X = R^q, λ is the Lebesgue measure, d is the Euclidean distance and p = 2. Let m̂ := m̂^n_2 be the unique population barycenter of Π_n in that case, and denote by (m, x) ↦ T_m(x) a measurable function equal, λ(dx)Π_n(dm)-a.e., to the unique optimal transport map from m̂ to m ∈ W_2(X). As a consequence of Lemma 3.10 we have the fixed-point property m̂ = (∫ T_m Π_n(dm))(m̂). Thus, for all convex functions ϕ, non-negative or with at most quadratic growth, we have
$$\int \varphi(x)\,\hat m(dx) \;=\; \int \varphi\Big(\int T_m(x)\,\Pi_n(dm)\Big)\hat m(dx) \;\le\; \int\!\!\int \varphi(T_m(x))\,\hat m(dx)\,\Pi_n(dm) \;=\; \int \varphi(y)\,\hat m_{\mathrm{BMA}}(dy),$$
where m̂_BMA(dy) := ∫_M m(dy) Π_n(dm) is the Bayesian model average in equation (2.8). We have used Jensen's inequality and Fubini's theorem. This means that the BWB estimator is less spread out, or smaller in the sense of the convex order of probability measures, than the BMA. As a consequence, we have: Proposition 4.9. Consider p = 2, and let m̂_BMA and m̂^n_2 respectively denote the BMA and the BWB estimators associated with Π_n. Then, we have
$$\int x\,\hat m^n_2(dx) \;=\; \int x\,\hat m_{\mathrm{BMA}}(dx) \quad\text{and}\quad \int \|x\|^2\,\hat m^n_2(dx) \;\le\; \int \|x\|^2\,\hat m_{\mathrm{BMA}}(dx).$$
In other words, the BWB estimator has less variance than the BMA. Furthermore, with x̄ denoting the (common) mean w.r.t. the BMA or BWB, the corresponding covariances satisfy:
$$\int (x - \bar x)(x - \bar x)^t\,\hat m^n_2(dx) \;\preceq\; \int (x - \bar x)(x - \bar x)^t\,\hat m_{\mathrm{BMA}}(dx),$$
in the usual order for symmetric positive semidefinite matrices. The inequalities are strict unless Π_n is a Dirac mass.
Proof. Given the previous discussion, the equality of means is obtained by taking ϕ(x) and −ϕ(x) equal to each coordinate of x, and the inequality of variances by taking ϕ(x) = ‖x‖². The inequality for the covariances follows by taking ϕ(x) = (y^t(x − x̄))² for arbitrary y ∈ R^q.
For the last claim, we just need to make explicit the corresponding equality case of Jensen's inequality as used in the preceding discussion. Since ‖x‖² is strictly convex, the equality of second moments implies, for m̂^n_2(dx)-a.e. x, that T_m(x) is Π_n(dm)-a.s. constant. This entails that the map T := T_m does not depend on m and that, for Π_n(dm)-a.e. m, it holds that m = T(m̂^n_2). The equality case for the covariances is reduced to the previous one by considering their traces, using also the equality of means.

Convergence to the true model and Bayesian consistency
A natural question in Bayesian statistics is whether a given predictive posterior estimator is consistent (see [17,23,44]). In short, this means the convergence of the predictive posterior law, in some specified sense, towards the true model m_0, as we observe more and more i.i.d. data sampled from m_0. We are specifically interested in the question of whether the BWB estimator converges to the model m_0 and, more precisely, in conditions which guarantee that
$$W_p(\hat m^n_p, m_0) \;\longrightarrow\; 0 \quad \text{as } n \to \infty, \quad m_0^{(\infty)}\text{-a.s.} \qquad (4.4)$$
Definition 4.10. Given a topology on (a subset of) P(X) containing m_0, we say that the prior Π is consistent at m_0 with respect to that topology if, m_0^{(∞)}-a.s., Π_n(U^c) → 0 as n → ∞ for every neighborhood U of m_0 in that topology.
Remark 4.11. We notice that in the literature of Bayesian consistency, see e.g. [23,44], Π satisfying the above property (in either topology) would be called "strongly consistent" at m_0, in allusion to the m_0^{(∞)}-a.s. convergences, whereas the term "weakly consistent" would be used when those convergences hold in probability. Since we will only be dealing with the almost sure notion, in order to avoid possible confusion with topological concepts, the adverb "strongly" is omitted throughout when referring to consistency of the prior, while the adverb "weakly" only refers to the weak topology on probability measures.
The celebrated Schwartz theorem [44] provides sufficient conditions for consistency w.r.t. a given topology; see also Chapter 6 of [23] for a modern treatment. A key ingredient is the notion of Kullback-Leibler support: Definition 4.12. The Kullback-Leibler support of Π is the set KL(Π) of all m_0 ∈ P_ac(X) such that Π({m : D_KL(m_0‖m) < ε}) > 0 for every ε > 0. Remark 4.13. The statistical model is interpreted as being "correct" or well specified if the data distribution m_0 is an element of supp(Π), the support of Π w.r.t. the weak topology, see [9,23,26,29,30]. The condition m_0 ∈ KL(Π) is stronger. Indeed, by the Csiszár-Pinsker inequality and the fact that the dual bounded-Lipschitz distance (metrizing the weak topology in P(X)) is majorized by the total variation distance, one can check that KL(Π) ⊆ supp(Π). In particular, one has KL(Π) ⊆ M for any weakly closed model space M for Π. (The reader may consult the mentioned works for the misspecified framework too.) We recall (see Theorem 6.17 and Example 6.20 in [23]) the following result, which concerns the weak topology: Theorem 4.15. Assume (only) that (X, d) is Polish and endowed with a σ-finite Borel measure λ, that Π(P_ac(X)) = 1, and that m_0 ∈ KL(Π). Then, Π is consistent at m_0 in the weak topology.
We now state a general result relating consistency of Π at m_0 in the p-Wasserstein topology and the convergence (4.4) (or consistency of the predictive posterior m̂^n_p) with other asymptotic properties of the posterior laws in the Wasserstein setting. Recall that the notation W_p throughout stands for the Wasserstein distance both in W_p(X) and in W_p(W_p(X)).
Theorem 4.16. Suppose Assumptions 3.1 and 4.8 hold, that m_0 ∈ M, and that
$$\Pi_n \in W_p(W_p(\mathcal{X})), \quad m_0^{\otimes n}\text{-a.s. for all } n \ge 1. \qquad (4.5)$$
The following are equivalent:
a) W_p(Π_n, δ_{m_0}) → 0, m_0^{(∞)}-a.s., as n → ∞.
b) m_0^{(∞)}-a.s. as n → ∞, we have W_p(m̂^n_p, m_0) → 0 and the barycentric cost (or optimal p-Wasserstein Bayes risk) ∫_M W_p(m, m̂^n_p)^p Π_n(dm) goes to 0.
c) Π is consistent at m_0 in the p-Wasserstein topology and the p-moments of the BMA estimator (2.8) converge m_0^{(∞)}-a.s. to those of m_0 as n → ∞, i.e. for some (and then all) x_0 ∈ X we have the convergence (4.6).
d) Π is consistent at m_0 in the weak topology and, for some (and then all) x_0 ∈ X, the convergence (4.7) holds.
Proof. By minimality of a barycenter over M,
$$\int_{\mathcal{M}} W_p(m, \hat m^n_p)^p\,\Pi_n(dm) \;\le\; \int_{\mathcal{M}} W_p(m, m_0)^p\,\Pi_n(dm) \;=\; W_p(\Pi_n, \delta_{m_0})^p.$$
Thus, for some c > 0 depending only on p,
$$W_p(\hat m^n_p, m_0)^p \;\le\; c\Big(\int_{\mathcal{M}} W_p(m, \hat m^n_p)^p\,\Pi_n(dm) + \int_{\mathcal{M}} W_p(m, m_0)^p\,\Pi_n(dm)\Big) \;\le\; 2c\,W_p(\Pi_n, \delta_{m_0})^p, \qquad (4.8)$$
proving that a) ⇒ b). The converse b) ⇒ a) follows from
$$W_p(\Pi_n, \delta_{m_0})^p \;=\; \int_{\mathcal{M}} W_p(m, m_0)^p\,\Pi_n(dm) \;\le\; c\Big(\int_{\mathcal{M}} W_p(m, \hat m^n_p)^p\,\Pi_n(dm) + W_p(\hat m^n_p, m_0)^p\Big).$$
Let us now show that a) ⇒ c). The convergence W_p(Π_n, δ_{m_0}) → 0 implies (by the Portmanteau theorem) that lim sup_{n→∞} Π_n(F) ≤ δ_{m_0}(F) for all closed sets F of W_p(X). Taking F = U^c, with U a neighborhood of m_0, yields the consistency of Π at m_0 in the p-Wasserstein topology. Moreover, it implies that, for each m̄ ∈ W_p(X), the moments ∫_M W_p(m, m̄)^p Π_n(dm) converge to W_p(m_0, m̄)^p, m_0^{(∞)}-a.s.; in particular, (4.6) holds. Conversely, if c) holds, then Π_n → δ_{m_0} weakly, m_0^{(∞)}-a.s., as probability measures on the metric space W_p(X). As in the previous (converse) implication, we obtain from (4.6) the convergence of some moments of order p of Π_n to the corresponding moment of δ_{m_0}, m_0^{(∞)}-a.s. This, plus the weak convergence just established, implies that W_p(Π_n, δ_{m_0}) → 0, m_0^{(∞)}-a.s.
We have thus established that a), b) and c) are equivalent. Notice now that the function Φ defined in (4.10) has polynomial growth of order at most p on W_p(X). Therefore, if a) or equivalently c) holds, we obtain m_0^{(∞)}-a.s. a convergence which is tantamount to (4.7). Moreover, if c) holds, consistency of Π at m_0 in the weak topology is obvious. This shows that c) ⇒ d).
Last, if d) holds, we deduce with Markov's inequality that, for each rational ε > 0, the convergence (4.11) holds as n → ∞. This, together with the consistency of Π at m_0 w.r.t. the weak topology, is equivalent to the corresponding consistency w.r.t. the p-Wasserstein topology. Moreover, the convergence (4.7) implies the convergence (4.6). This and the previous show that d) implies c), concluding the proof.
Remark 4.17. The convergence (4.11) for each ε > 0 is exactly what one must add to consistency at m_0 of Π in the weak topology to obtain such consistency in the p-Wasserstein topology. However, the latter is only equivalent to the m_0^{(∞)}-a.s. weak convergence of Π_n to δ_{m_0} as measures on the metric space W_p(X) as n → ∞, which is not enough to grant the m_0^{(∞)}-a.s. convergence in W_p(W_p(X)) of Π_n to δ_{m_0}. Similarly, consistency at m_0 of Π in the p-Wasserstein topology cannot in general be obtained by adding only the convergence of moments (4.6) to the consistency at m_0 of Π in the weak topology. Of course, if the space X is bounded, the weak and p-Wasserstein topologies on it coincide, and consistency of Π at m_0 in the weak topology implies in that case the convergence (4.7) (since the function Φ in the previous proof is in that case continuous and bounded). Hence, in that case, all the equivalent properties in Theorem 4.16 are satisfied. The same is true if Π(dm)-a.e. m is supported on a fixed bounded (weakly) closed set Y ⊆ X (just replace X by Y).
An immediate consequence of the proof of Theorem 4.16 (cf. the estimate (4.8)) is the following bound, which can be used to obtain quantitative estimates for the average rate of convergence of the BWB whenever the posterior and the Wasserstein distance between models are explicit enough:
$$W_p(\hat m^n_p, m_0)^p \;\le\; 2c \int_{\mathcal{M}} W_p(m, m_0)^p\,\Pi_n(dm).$$
The following result gathers sufficient conditions for consistency at m_0 of Π in the p-Wasserstein topology and convergence of the BWB. Proof. In view of Theorem 4.15, to prove the first claim we just need to prove that (4.5) holds. To that end, let us first check that, whenever m_0 ∈ KL(Π), we have
$$m_0^{\otimes n}(dx_1, \ldots, dx_n) \;\ll\; \int_{\mathcal{M}} m^{\otimes n}(dx_1, \ldots, dx_n)\,\Pi(dm).$$
If A ⊆ X^n measurable is such that ∫_M ∫_A m^{⊗n}(dx_1, . . . , dx_n) Π(dm) = 0, then for Π(dm)-a.e. m and every x_i ∈ X, i = 2, .., n, one has m({x ∈ X : (x, x_2, ..., x_n) ∈ A}) = 0, from which we get m_0({x ∈ X : (x, x_2, ..., x_n) ∈ A}) = 0 by Remark 4.14, hence m_0^{⊗n}(A) = 0. We conclude noting that the set A := {(x_1, . . . , x_n) ∈ X^n : Π_n ∉ W_p(W_p(X))} has null ∫_M m^{⊗n}(dx_1, . . . , dx_n) Π(dm)-measure, by Remark 4.6.
Given the previous and part d) of Theorem 4.16, the convergence (4.7) immediately implies that Π is consistent at m_0 in the p-Wasserstein topology and that W_p(m̂^n_p, m_0) → 0, m_0^{(∞)}-a.s. It remains to show that uniform boundedness of the q-moments of the BMA estimator, for some q > p, implies, under the given assumptions, that said convergence (4.7) holds. Consider to that end the function Φ defined by (4.10) in the proof of Theorem 4.16. The assumption on the q-moments of the BMA estimators implies that sup_n ∫_M Φ(m)^{q/p} Π_n(dm) < ∞, m_0^{(∞)}-a.s., as follows from applying Jensen's inequality to the convex function s ↦ |s|^{q/p} on R and to the integral w.r.t. m(dx) defining Φ(m). Since ε > 0 is arbitrary, it just remains to ensure that Π_n(m : Φ(m) ≥ ε) → 0, m_0^{(∞)}-a.s., as n → ∞ (that is, the convergence (4.11) holds). The elementary relations t = t ∧ R + (t − R)_+ and (t − R)_+ ≤ t 1_{t>R}, valid for all t, R ≥ 0, yield a bound of Π_n(m : Φ(m) ≥ ε) by the sum of two terms: the first goes m_0^{(∞)}-a.s. to 0 as n → ∞, since x ↦ R ∧ d(x, x_0)^p is a bounded continuous function and Π is consistent at m_0 in the weak topology; the second can, for each ε > 0, be made arbitrarily small by taking R large enough, since m_0 ∈ W_p(X) and because the uniform boundedness of the q-moments of the BMA estimators implies that their p-moments are uniformly integrable. We conclude that Π_n(m : Φ(m) ≥ ε) → 0, m_0^{(∞)}-a.s.
Example 4.20. In the setting of Example 2.1, the above applies with m_0 = N(θ_0, Σ) for any θ_0 ∈ R^d. Moreover, the prior Π is easily seen to be in W_2(W_2(R^d)), hence Lemma 4.19 ensures that Π_n ∈ W_2(W_2(R^d)), m_0^{⊗n}-a.s. for all n ≥ 1. This fact alternatively follows from the existence of a 2-Wasserstein barycenter of Π_n for any data points D, verified in this case. To verify consistency of Π in the 2-Wasserstein topology at any m_0, as well as convergence of the BWB, let us compute W_2²(Π_n, δ_{m_0}) and apply Theorem 4.16 directly. Denoting by θ a r.v. with law N(θ̂_2, (Σ_0^{-1} + nΣ^{-1})^{-1}), we find
$$W_2^2(\Pi_n, \delta_{m_0}) \;=\; \mathbb{E}\,\|\theta - \theta_0\|^2 \;=\; \|\hat\theta_2 - \theta_0\|^2 + \mathrm{Tr}\big((\Sigma_0^{-1} + n\Sigma^{-1})^{-1}\big),$$
which m_0^{(∞)}-a.s. goes to 0 as n → ∞ by the Law of Large Numbers, whenever the data (x_n)_{n≥1} consists of an i.i.d. sample from the true model m_0. Notice that the identity W_2²(Π_n, δ_{m_0}) = W_2²(m̂^n_2, m_0) + ∫_M W_2²(m, m̂^n_2) Π_n(dm), found in this case, can be seen as a bias-variance type decomposition for the posterior law Π_n on W_2(X). In this case, one can readily see that E(W_2²(m̂^n_2, m_0)) ≤ C/n.
Similarly, in the setting of Examples 2.2 and 4.3, one can bound W_1(Π_n, δ_{m_0}) in terms of the moments of λ, a r.v. with law Gamma(n x̄_n + α, n + β). An elementary computation using the Gamma distribution's mean and variance shows, in this case, that this bound goes to zero m_0^{(∞)}-a.s. as n → ∞, whenever the data (x_n)_{n≥1} is an i.i.d. sample from the true model m_0. We deduce with Corollary 4.19 that E(W_1(m_0, m̂^n_1)) ≤ C/√n.
As noticed earlier, consistency of Π at m_0 w.r.t. the p-Wasserstein topology is not enough to grant that W_p(Π_n, δ_{m_0}) → 0, m_0^{(∞)}-a.s.; an additional control on the tails of Π, such as the boundedness of its support in W_p(X), is needed. A typical example is the finitely parametrized case with compact parameter space Θ and continuous parametrization mapping T : Θ → W_p(X). More generally, diam(Π) < ∞ amounts to Π being supported on a set of models with centered p-moments bounded by a constant. In particular, X and the support of every m ∈ supp(Π) may be unbounded, and still supp(Π) be bounded. We have:
Lemma 4.22. Suppose that supp(Π) is bounded in W_p(X) and that Π is consistent at m_0 ∈ supp(Π) in the p-Wasserstein topology. Then W_p(Π_n, δ_{m_0}) → 0, m_0^{(∞)}-a.s.
Proof. Let ε > 0 and B = {m : W_p(m, m_0) < ε}; then, splitting ∫_M W_p(m, m_0)^p Π_n(dm) over B and B^c, and since ε > 0 is arbitrary, we only need to check that the integral over B^c goes m_0^{(∞)}-a.s. to 0 as n → ∞. Since supp(Π_n) ⊆ supp(Π), the integrand is bounded there, and the conclusion follows from the consistency of Π at m_0 in the p-Wasserstein topology.
Consider now the decomposition (4.12), where B_w is a fixed small weak neighborhood of m_0 (e.g. one can use the bounded Lipschitz distance to build B_w). As in the proof of Lemma 4.22, the second term in the r.h.s. of (4.12) goes m_0^{(∞)}-a.s. to zero as n → ∞ in this case too. However, there is no reason why the term ∫_{B_w} W_p(m, m_0)^p Π_n(dm) should be small. The reason is that, for the functional m ↦ Ψ(m) := W_p(m, m_0)^p, even if bounded on M := supp(Π), there is no reason why Ψ|_{B_w ∩ M} should be small (no matter how small B_w may be). Indeed, the statement "Ψ|_{B_w ∩ M} is small if B_w is small" would mean mathematically that the weak and the p-Wasserstein topologies coincide on M locally around m_0, but this is not true in general, even if M is bounded. This should not be confused with the fact that, if M equipped with the p-Wasserstein topology and metric is bounded, then on W_p(M) weak and p-Wasserstein convergence coincide. Consistency at m_0 of Π in the weak topology, on the other hand, is equivalent to Π_n → δ_{m_0} weakly, m_0^{(∞)}-a.s.
Theorem 4.24. Suppose Assumptions 3.1 and 4.8 hold, that m_0 ∈ KL(Π), and that the p-exponential moments of the models in supp(Π) are uniformly controlled, in the sense that for some λ_0 > 0 and x_0 ∈ X, sup_{m ∈ supp(Π)} ∫_X e^{λ_0(1 + d(x, x_0)^p)} m(dx) < ∞. Then, Π is consistent at m_0 in the p-Wasserstein topology. Moreover, we have W_p(Π_n, δ_{m_0}) → 0, m_0^{(∞)}-a.s., and the BWB estimator is consistent in the sense that W_p(m̂^n_p, m_0) → 0, m_0^{(∞)}-a.s. Before proving Theorem 4.24, some remarks on its assumptions and proof are in order.
Remark 4.25. The uniform control assumed on p-exponential moments implies, by Jensen's inequality, that sup_{m ∈ supp(Π)} ∫_X d(x, x_0)^p m(dx) < ∞. By the triangle inequality, this in turn implies that supp(Π) is bounded in W_p(X).
Remark 4.26. The general picture of Bayesian consistency (including the misspecified case) parallels in several aspects Sanov's large deviations theorem (see e.g. the Bayesian Sanov Theorem ( [26], Thm. 2.1) and references therein), and it similarly relies on exponential controls of (posterior) integrals. The proof of Theorem 4.24 follows the argument of Example 6.20 in [23], where Theorem 4.15 above is obtained by combining Schwartz' theorem ( [23], Thm. 6.17) with Hoeffding's concentration inequality, to get uniform exponential controls of the posterior mass of complements of weak neighborhoods of m 0 , which are defined in terms of bounded random variables. A p-exponential moment control is what is needed to derive such concentration inequalities for unbounded random variables (e.g. moments), defining neighborhoods in the Wasserstein topology. Notice that finite p-exponential moments are also required for Sanov's theorem to hold in the p-Wasserstein topology [47]. The uniform bound on exponential moments appears however too strong an assumption (it does not hold e.g. in the setting of Example 4.20). The question of relaxing that condition is left for future work.
Remark 4.27. In the misspecified framework dealt with in [9,23,26,29,30], and paralleling results applicable to the weak topology in those works, we expectm n p → argmin m∈M D KL (m 0 ||m) w.r.t. W p to hold, under suitable assumptions.
Proof of Theorem 4.24. First we show that, if U is any W_p(X)-neighbourhood of m_0, then m_0^{(∞)}-a.s. we have lim inf_n Π_n(U) ≥ 1. According to Schwartz' theorem in the extended form ([23], Thm. 6.17), under the assumption that m_0 ∈ KL(Π), it is enough to find for each such U a sequence of measurable functions or "tests" ϕ_n : X^n → [0, 1] such that
1. ϕ_n(x_1, . . . , x_n) → 0, m_0^{(∞)}-a.s., and
2. lim sup_n (1/n) log ∫_{U^c} ∫_{X^n} (1 − ϕ_n) dm^{⊗n} Π(dm) < 0.
First we will construct tests {ϕ_n}_n that satisfy Point 1 and Point 2 above over an appropriate sub-base of neighbourhoods U, and finally extend those properties to general neighborhoods.
Recall that μ_k → μ in W_p(X) iff, for all continuous functions ψ on X with |ψ(x)| ≤ K(1 + d^p(x, x_0)) for some K ∈ R_+, it holds that ∫_X ψ(x) μ_k(dx) → ∫_X ψ(x) μ(dx); see [46]. Given such a ψ and ε > 0, we define the open sets
$$U_{\psi,\varepsilon} \;:=\; \Big\{m : \int_{\mathcal{X}} \psi(x)\,m(dx) < \int_{\mathcal{X}} \psi(x)\,m_0(dx) + \varepsilon\Big\},$$
which form a sub-base for the p-Wasserstein neighborhood system at the distribution m_0. We can assume that K = 1, by otherwise considering U_{ψ/K, ε/K} instead. Given a neighborhood U := U_{ψ,ε} of m_0 as above, we define the test functions
$$\varphi_n(x_1, \ldots, x_n) \;:=\; \mathbf{1}\Big\{\tfrac1n\textstyle\sum_{i=1}^n \psi(x_i) \;\ge\; \int_{\mathcal{X}} \psi(x)\,m_0(dx) + \varepsilon/2\Big\}.$$
By the law of large numbers, m_0^{(∞)}-a.s., ϕ_n(x_1, . . . , x_n) → 0, so Point 1 is verified. Point 2 is trivial if r := Π(U^c) = 0, so we assume from now on that r > 0. Thanks to the exponential moments control assumed on supp(Π), for Π(dm)-a.e. m the random variable Z = 1 + d^p(X, x_0), with X ∼ m, has a moment-generating function L_m(t) which is finite for all t ∈ [0, λ_0]. We thus obtain bounds on the moments E_m[Z^k] for all k ∈ N. Therefore, we may apply Bernstein's inequality in the form of Corollary 2.10 in [36] to the random variables {−ψ(x_i)}_i under the measure m^{(∞)} on X^N, obtaining for any α < 0 an exponential bound with v := 2nL_m(t)t^{-2}, c := t^{-1}, and 0 < t ≤ λ_0. Going back to the tests ϕ_n and using the definition of U^c, we deduce an exponential bound for ∫ (1 − ϕ_n) dm^{⊗n}, uniform over m ∈ U^c. Thanks to the uniform control of exponential moments on the support, we conclude as desired that Point 2 holds for U = U_{ψ,ε}. Now, a general neighborhood U contains a finite intersection of, say, N ∈ N elements of the sub-base, i.e. U ⊇ ⋂_{j=1}^N U_{ψ_j, ε_j}.
Therefore we can conclude, as in the sub-base case, that Point 2 is verified. All in all, we have established that Π is p-Wasserstein consistent at m_0. Thanks to Lemma 4.22 and the boundedness of supp(Π) (see Rem. 4.25), the last two claims are immediate.

BWB calculation through descent algorithms in Wasserstein space
In this section we review some of the available methods to compute Wasserstein barycenters, and then explain how they can be used to calculate the BWB estimator. We will first survey the method proposed by Álvarez-Esteban, del Barrio, Cuesta-Albertos and Matrán in Theorem 3.6 of [3] and by Zemel and Panaretos in Theorem 3 and Corollary 2 of [48], applicable in the setting where the probability distribution Γ on the model space has finite support. This method is interpreted in [48] as a gradient descent in the Wasserstein space. We will then discuss the main aspects of the stochastic gradient descent in Wasserstein space (SGDW), introduced in our companion paper [7], which also builds upon the gradient descent idea. The latter method moreover allows the computation of population barycenters for a distribution Γ on the model space, using a stream of random probability distributions sampled from it. In particular, we will recall the conditions established in [7] which ensure its convergence. The reader is referred to [16,18,34,41] for alternative approaches to computing Wasserstein barycenters.
The two discussed methods can be easily implemented when explicit analytical expressions for optimal transport maps between distributions in the model space are available (see [7] or Section 6 for some examples). Moreover, in that case these methods can easily be coupled with sampling procedures (MCMC or others) for the posterior distribution Π n in order to compute the BWB. We will see that the computation of the BWB through the SGDW has several advantageous features. In particular, when expressions for optimal transport maps are known, it can be done at nearly the same cost as the posterior sampling.
From now on we specialize Assumption 3.1, and make the following set of assumptions: Assumption 5.1. p = 2, X = R^q, d is the Euclidean metric, and λ is the Lebesgue measure. Furthermore, Γ ∈ W_2(W_{2,ac}(R^q)) and there is a model space M ⊆ W_{2,ac}(R^q) for Γ which is weakly closed.
The following concept will be central in the "gradient-type" algorithms we consider: Definition 5.2. We say that μ ∈ W_{2,ac}(R^q) is a Karcher mean of Γ ∈ W_2(W_{2,ac}(R^q)) if
$$\int T^m_{\mu}\,\Gamma(dm) \;=\; I, \quad \mu\text{-almost everywhere},$$
where T^m_μ denotes the optimal transport map from μ to m and I the identity map. Remark 5.3. It is known that any 2-Wasserstein barycenter is a Karcher mean (cf. [48]). However, the class of Karcher means is in general strictly larger, see [3]. For conditions ensuring uniqueness of Karcher means, see [38,48] for the case when the support of Γ is finite and [10] for the case of an infinite support. In one dimension, the uniqueness of Karcher means holds without further assumptions. See [7] for a deeper discussion. Given μ_0 ∈ W_{2,ac}(R^q), consider the iteration
$$\mu_{n+1} \;:=\; G(\mu_n), \qquad \text{where} \quad G(\mu) \;:=\; \Big(\int T^m_{\mu}\,\Gamma(dm)\Big)(\mu). \qquad (5.2)$$
The next result, proven in Theorem 3.6 of [3] and independently in Theorem 3 and Corollary 2 of [48], establishes the convergence of the above sequence to a fixed point of G, which is nothing other than a Karcher mean of Γ: Proposition 5.4. The sequence {μ_n}_{n≥0} in equation (5.2) is tight, and every weakly convergent subsequence of {μ_n}_{n≥0} converges in W_2 to an absolutely continuous measure in W_2(R^q) which is a Karcher mean of Γ. If some m_i has a bounded density, and if there exists a unique Karcher mean m̄, then m̄ is the Wasserstein barycenter of Γ and W_2(μ_n, m̄) → 0.
Panaretos and Zemel ([48], Thm. 1) discovered that the sequence (5.2) can indeed be interpreted as a gradient descent (GD) scheme with respect to the Riemannian-like structure of the Wasserstein space W_2(R^q). In fact, the functional on W_2(R^q) given by
$$F(m) \;:=\; \frac12 \int W_2^2(m, \bar m)\,\Gamma(d\bar m)$$
has a Fréchet derivative at each point m ∈ W_{2,ac}(R^q), given by
$$F'(m) \;=\; -\int \big(T^{\bar m}_{m} - I\big)\,\Gamma(d\bar m),$$
where I is the identity map in R^q. This means that, for each such m, one can consider the gradient-descent update
$$m \;\longmapsto\; \big(I - \gamma F'(m)\big)(m) \;=\; \Big((1-\gamma)\,I + \gamma \int T^{\bar m}_{m}\,\Gamma(d\bar m)\Big)(m), \quad \gamma > 0,$$
and it coincides with the sequence in equation (5.2) if γ = 1. These ideas serve as inspiration for the stochastic gradient descent iteration in the next part.
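For concreteness, here is a sketch of the iteration (5.2) (equivalently, the GD step with γ = 1) in the special case where Γ is supported on finitely many centered Gaussians, so that each optimal map is an explicit symmetric matrix; the covariances and weights below are arbitrary toy values.

import numpy as np
from scipy.linalg import sqrtm, inv

# Sketch of the fixed-point / gradient-descent iteration (gamma = 1) when Gamma is
# supported on finitely many centered Gaussians: the optimal map from the current
# iterate N(0, S) to N(0, C) is the symmetric matrix S^{-1/2}(S^{1/2} C S^{1/2})^{1/2} S^{-1/2}.
def gaussian_map_matrix(S_from, S_to):
    r = sqrtm(S_from)
    return (inv(r) @ sqrtm(r @ S_to @ r) @ inv(r)).real

def gd_barycenter_gaussians(covs, weights, n_iter=50):
    S = np.eye(covs[0].shape[0])                    # initial iterate mu_0 = N(0, I)
    for _ in range(n_iter):
        M = sum(w * gaussian_map_matrix(S, C) for w, C in zip(weights, covs))
        S = M @ S @ M                               # push-forward of N(0, S) by the averaged map
    return S

covs = [np.diag([1.0, 4.0]), np.diag([4.0, 1.0])]
print(gd_barycenter_gaussians(covs, [0.5, 0.5]))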

Stochastic gradient descent for population barycenters
We recall next the stochastic gradient descent sequence introduced in the companion paper [7], to which the reader is referred for details. In addition to the assumptions of the previous part, we will also make use of an extra one introduced in [7]: Assumption 5.5. Γ has a W_2-compact model space K_Γ ⊆ W_{2,ac}(R^q). Moreover this set is geodesically convex: for every μ, ν ∈ K_Γ and t ∈ [0, 1], ((1 − t)I + tT^ν_μ)(μ) ∈ K_Γ, with I the identity operator. In particular, under these assumptions, for each ν ∈ W_2(R^q) and Γ(dm)-a.e. m, there is a unique optimal transport map T^ν_m from m to ν and, by Proposition 6 of [33], the 2-Wasserstein population barycenter is unique. We notice that, although strong at first sight, Assumption 5.5 can be guaranteed in suitable parametric situations (e.g., Gaussian, or even the location-scatter setting recalled in Sect. 6.1), or under moment and density constraints on the measures in K_Γ (e.g., under uniform bounds on their moments of order 2 + ε and their Boltzmann entropy, which are geodesically convex functionals, see [5]). Definition 5.6. Let μ_0 ∈ K_Γ, m_k iid∼ Γ, and γ_k > 0 for k ≥ 0. We define the stochastic gradient descent in Wasserstein space (SGDW) sequence by
$$\mu_{k+1} \;:=\; \big((1-\gamma_k)\,I + \gamma_k\,T^{m_k}_{\mu_k}\big)(\mu_k), \quad k \ge 0. \qquad (5.4)$$
The sequence is a.s. well-defined, as one can show by induction that μ_k ∈ W_{2,ac}(R^q) a.s. thanks to Assumption 5.5. The rationale for Definition 5.6 is similar to that of Section 5.1, though now we wish to emphasize the population case: denoting by F the functional minimized by a 2-Wasserstein barycenter, as above, we (formally at least) expect F'(μ) = −∫(T^m_μ − I)Γ(dm). Hence, (I − T^m_μ) with m ∼ Γ is an unbiased estimator of F'(μ). This immediately suggests the stochastic descent sequence (5.4) introduced in [7], drawing inspiration from the classic SGD ideas [42].
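A minimal sketch of the SGDW iteration (5.4) in a Gaussian setting, where the iterate remains Gaussian and the update acts on its covariance through the explicit affine optimal maps; the sampling mechanism for Γ below is a toy choice made up for illustration.

import numpy as np
from scipy.linalg import sqrtm, inv

# Sketch of the SGDW iteration when Gamma is supported on centered Gaussians,
# so that each optimal map is affine and the iterate stays Gaussian.
rng = np.random.default_rng(4)

def sample_model_cov(d=2):
    return np.diag(rng.uniform(0.5, 2.0, size=d))   # random diagonal covariance (illustration)

def sgdw_gaussian(d=2, n_steps=500):
    S = np.eye(d)                      # mu_0 = N(0, I)
    for k in range(n_steps):
        gamma_k = 1.0 / (k + 1)        # Robbins-Monro type step sizes
        C = sample_model_cov(d)        # m_k ~ Gamma
        r = sqrtm(S)
        A = (inv(r) @ sqrtm(r @ C @ r) @ inv(r)).real   # optimal map matrix from mu_k to m_k
        M = (1 - gamma_k) * np.eye(d) + gamma_k * A     # (1 - gamma) I + gamma T
        S = M @ S @ M                  # covariance of the pushed-forward Gaussian iterate
    return S

print(sgdw_gaussian())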
Clearly, μ is a Karcher mean for Γ iff ‖F'(μ)‖_{L²(μ)} = 0. Just like the GD sequence, the SGDW sequence is typically expected to converge to stationary points, or Karcher means in the present setting, rather than to minimizers. The next theorem, which is the main result of [7], provides sufficient conditions for the SGDW sequence to converge a.s. to a Wasserstein barycenter. The following assumption on the steps γ_k, standard in the framework of SGD methods, is needed: ∑_{k≥0} γ_k = ∞ and ∑_{k≥0} γ_k² < ∞.

Batch stochastic gradient descent on Wasserstein space
We briefly recall how the variance of the SGDW sequence can be reduced by using batches. Let μ_0 ∈ K_Γ, m^i_k iid∼ Γ, and γ_k > 0, for k ≥ 0 and i = 1, . . . , S_k. The batch stochastic gradient descent (BSGD) sequence is given by
$$\mu_{k+1} \;:=\; \Big((1-\gamma_k)\,I + \frac{\gamma_k}{S_k}\sum_{i=1}^{S_k} T^{m^i_k}_{\mu_k}\Big)(\mu_k), \quad k \ge 0. \qquad (5.8)$$
The following two results, extracted from [7], justify the above definition: the first states that this sequence still converges, while the second states that batches help reduce noise, the variance of the update decreasing linearly in the batch size.
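The batch update only changes the direction used at each step, replacing the single sampled map by the average of the S_k maps of the batch; a sketch of one such step in the Gaussian setting of the previous snippet:

import numpy as np
from scipy.linalg import sqrtm, inv

# Sketch of one batch step of (5.8) for centered Gaussian models: the stochastic
# direction is replaced by the average of the sampled optimal maps.
def batch_step(S, model_covs, gamma):
    r = sqrtm(S)
    maps = [(inv(r) @ sqrtm(r @ C @ r) @ inv(r)).real for C in model_covs]
    M = (1 - gamma) * np.eye(S.shape[0]) + gamma * np.mean(maps, axis=0)
    return M @ S @ M   # covariance of the updated (Gaussian) iterate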

Computation of the BWB
It is immediate to deduce a simple methodology based on the SGDW algorithm, to compute the BWB estimator for general (finitely or infinitely supported) posterior laws Π n ∈ W 2 (W 2,ac (R q )).
We make the practical assumption that we are capable of generating independent models m_i from the posteriors Π_n (in the parametric setting, this can be done through efficient Markov Chain Monte Carlo (MCMC) techniques [6,11,25] or transport sampling procedures [20,27,35,39]). On the theoretical side, we assume Assumptions 5.1 and 5.5 are satisfied by Γ = Π, the prior law on models, implying that the posterior Π_n a.s. satisfies those conditions for all n too.
The proposed method can be sketched as follows: 1. Given a prior on models Π and data x 1 , . . . , x n , sample µ (n) 0 ∼ Π n and set k = 0.
2. Sample m_k ∼ Π_n, independently of the previous samples. 3. Compute the optimal transport map T^{m_k}_{μ^{(n)}_k} from μ^{(n)}_k to m_k, and set μ^{(n)}_{k+1} := ((1 − γ_k)I + γ_k T^{m_k}_{μ^{(n)}_k})(μ^{(n)}_k). 4. Set k ← k + 1 and go back to step 2.
The algorithm can be run until k ∈ N is large enough that the squared 2-Wasserstein distance between μ^{(n)}_{k+1} and μ^{(n)}_k is repeatedly smaller than some given positive threshold. Moreover, under the assumptions of Theorem 4.24, Π is consistent in the W_2-topology at the law m_0 of the data x_1, . . . , x_n, and then, for some (random) large enough n, k ∈ N and given ε > 0, one has W_2(μ^{(n)}_k, m_0) ≤ ε. The batch version of SGDW can be implemented in a similar way to compute the BWB estimator.
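An end-to-end sketch of this methodology in the conjugate Gaussian model of Example 2.1, where exact posterior sampling stands in for MCMC and all optimal maps are explicit; every numerical value is illustrative.

import numpy as np
from scipy.linalg import sqrtm, inv

# SGDW computation of the BWB in the conjugate Gaussian model: the iterate
# mu_k^(n) = N(b_k, S_k) is updated through explicit optimal maps.
rng = np.random.default_rng(5)
d, n = 2, 100
Sigma, Sigma0, mu0 = np.eye(d), 4.0 * np.eye(d), np.zeros(d)
X = rng.multivariate_normal([1.0, -2.0], Sigma, size=n)           # data from m_0

S_post = np.linalg.inv(np.linalg.inv(Sigma0) + n * np.linalg.inv(Sigma))
theta_hat = S_post @ (np.linalg.inv(Sigma0) @ mu0 + n * np.linalg.inv(Sigma) @ X.mean(axis=0))

b, S = np.zeros(d), 3.0 * np.eye(d)                                # initial iterate
for k in range(2000):
    gamma = 1.0 / (k + 1)
    theta_k = rng.multivariate_normal(theta_hat, S_post)           # step 2: m_k ~ Pi_n
    r = sqrtm(S)
    A = (inv(r) @ sqrtm(r @ Sigma @ r) @ inv(r)).real              # step 3: optimal map matrix
    b = (1 - gamma) * b + gamma * theta_k                          # mean of the updated iterate
    M = (1 - gamma) * np.eye(d) + gamma * A
    S = M @ S @ M                                                  # covariance of the updated iterate
print(b, theta_hat)   # b approaches theta_hat and S approaches Sigma: the BWB is N(theta_hat, Sigma)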
Remark 5.11. An alternative, natural approach would be to sample, for given n, a fixed number k of realizations m_i iid∼ Π_n, i = 1, . . . , k, and compute, using the GDW algorithm, the Wasserstein barycenter m̂^{(n,k)}_2 of the empirical measure Π^{(k)}_n := (1/k) ∑_{i=1}^k δ_{m_i}, i.e. a 2-Wasserstein empirical barycenter of Π_n (see [10]). Indeed, by Varadarajan's theorem, conditionally on Π_n, a.s. Π^{(k)}_n converges weakly as k → ∞ to Π_n, and in the W_2 metric as soon as Π_n ∈ W_2(W_{2,ac}(R^q)). Since under our assumptions Π_n a.s. has a unique 2-Wasserstein barycenter m̂^n_2 and, by Theorem 3 of [33], m̂^{(n,k)}_2 converges with respect to W_2 a.s. as k → ∞ to it, we also get through this approach that W_2(m̂^{(n,k)}_2, m̂^n_2) + W_2(m̂^n_2, m_0) ≤ ε for some (random) large enough n, k ∈ N. The clear disadvantage of this method is computational: besides generating k samples of Π_n, we need to additionally run a possibly large number of GDW steps to approximate m̂^{(n,k)}_2, and we need to evaluate at each step k new transport maps (instead of one, for the SGDW). Moreover, if additional samples from Π_n become available, we need to run the whole scheme again to take advantage of this new information. On the contrary, the online nature of the SGDW method allows one to refine the already computed estimator by only performing the corresponding new steps of the algorithm.

Numerical experiments
Before presenting the experimental validation of the proposed methods, we give a brief presentation of the location-scatter family of distributions. Our experiments consider this family because the optimal transport maps between two laws in the location-scatter family can be described explicitly. This property facilitates the numerical computation/approximation of barycenters, as the various iterative algorithms described so far take a more amenable form. See [7] for further examples.

Location-Scatter family
We follow the setting of [4]: given a fixed distribution m̄ ∈ W_{2,ac}(R^q), referred to as the generator, the associated location-scatter family is given by
$$\mathcal{F}(\bar m) \;:=\; \big\{\mathcal{L}(A\bar x + b) \,:\, A \in \mathcal{M}^{q\times q}_{+},\ b \in \mathbb{R}^q\big\}, \qquad \bar x \sim \bar m,$$
where M^{q×q}_+ is the set of symmetric positive definite matrices of size q × q. Without loss of generality we can assume that m̄ has zero mean and identity covariance. Note that F(m̄) is the multivariate normal family if m̄ is the standard multivariate normal distribution.
The optimal map between two members m_1 = L(A_1 x̄ + b_1) and m_2 = L(A_2 x̄ + b_2), with respective means b_1, b_2 and covariance matrices Σ_1 = A_1², Σ_2 = A_2², is the affine map
$$T(x) \;=\; b_2 + \Sigma_1^{-1/2}\big(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}\big)^{1/2}\Sigma_1^{-1/2}\,(x - b_1).$$
This family of optimal maps contains the identity and is closed under convex combination.
If Γ is supported on F(m̄), then its 2-Wasserstein barycenter m̂ belongs to F(m̄); in fact, calling b̂ its mean and Σ̂ its covariance matrix, m̂ = L(Σ̂^{1/2} x̄ + b̂). Since the optimal map from the current iterate to any member of the family is an affine map of the above form, the SGDW sequence started at μ_0 = L(A_0 x̄ + b_0) ∈ F(m̄) stays within the family: after one step with sampled model m = L(A_m x̄ + b_m), the new iterate has mean b_1 = (1 − γ)b_0 + γ b_m and covariance Σ_1 = M Σ_0 M, with M := (1 − γ)I + γ Σ_0^{-1/2}(Σ_0^{1/2}Σ_mΣ_0^{1/2})^{1/2}Σ_0^{-1/2}. The batch stochastic gradient descent iteration is characterized analogously, with the mean updated to the γ-weighted average of b_0 and the batch average of the sampled means, and the matrix M replaced by (1 − γ)I + (γ/S) ∑_{i=1}^S Σ_0^{-1/2}(Σ_0^{1/2}Σ_{m_i}Σ_0^{1/2})^{1/2}Σ_0^{-1/2}.
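A sketch of the explicit optimal map between two members of a location-scatter family, checked empirically on samples; the generator (a product of uniforms) and the scatter matrices (diagonal, hence commuting, so that the map transports m_1 exactly onto m_2) are toy choices.

import numpy as np
from scipy.linalg import sqrtm, inv

# Optimal map between two location-scatter members, checked on samples.
rng = np.random.default_rng(6)

def ls_optimal_map(b1, S1, b2, S2):
    r = sqrtm(S1)
    A = (inv(r) @ sqrtm(r @ S2 @ r) @ inv(r)).real
    return lambda x: b2 + (x - b1) @ A.T      # T(x) = b2 + A (x - b1), A symmetric p.d.

d = 2
xbar = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(100_000, d))   # generator: mean 0, identity cov.
b1, A1 = np.array([0.0, 0.0]), np.diag([1.0, 2.0])
b2, A2 = np.array([3.0, -1.0]), np.diag([2.0, 0.5])
x1 = b1 + xbar @ A1.T                          # sample of m1 = L(A1 xbar + b1)
T = ls_optimal_map(b1, A1 @ A1.T, b2, A2 @ A2.T)
y = T(x1)                                      # distributed (approximately) as m2
print(y.mean(axis=0), np.cov(y.T))             # close to b2 and A2 A2^T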

Experiment
We considered a model within a location-scatter family (LS), with generator m̄ on R^15 with independent coordinates, as follows: the true model m_0 has covariance matrix with entries Σ(i, j) = K(i, j) for i, j = 1, . . . , 15, where the covariance (kernel) function is K(i, j) = ε δ_ij + σ cos(ω(i − j)). Given the parameters ε, σ and ω, the so-constructed covariance matrix will be denoted Σ_{ε,σ,ω}. We chose the parameters ε = 0.01, σ = 1 and ω = 5.652 ≈ 1.8π for m_0. Thus, under the true model m_0, the coordinates can be negatively/positively correlated and there is also a coordinate-independent noise component due to the Kronecker delta δ_ij. Figure 3 shows the covariance matrix and three coordinates of the true model m_0.
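For reproducibility of the setup described above, a sketch of the construction of Σ_{ε,σ,ω}; the kernel is the one given in the text, while the positive-definiteness check is ours.

import numpy as np

# Covariance used for the true model: Sigma_{eps,sigma,omega}(i, j) = eps * delta_ij + sigma * cos(omega * (i - j)).
def cov_matrix(eps, sigma, omega, q=15):
    idx = np.arange(q)
    return eps * np.eye(q) + sigma * np.cos(omega * (idx[:, None] - idx[None, :]))

Sigma_true = cov_matrix(eps=0.01, sigma=1.0, omega=5.652)
print(np.linalg.eigvalsh(Sigma_true).min())   # numerically check positive definiteness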
The model prior Π is the push-forward induced by a prior over the mean vector b and the parameters of the covariance Σ_{ε,σ,ω}, chosen independently according to suitable distributions (where Exp(·|λ) denotes an exponential density with rate λ). Given n samples from the true model m_0 (also referred to as observations or data points), k samples are produced from the posterior measure Π_n using Markov chain Monte Carlo (MCMC). The numerical analysis presented in what follows focuses on the behavior of the BWB as a function of both the number of samples k and the number of data points n.

Numerical consistency of the empirical posterior
We first validate the empirical measure Π^{(k)}_n as a consistent sample version of the true posterior under the W_2 distance, that is, we check that W_2(Π^{(k)}_n, δ_{m_0}) → W_2(Π_n, δ_{m_0}) for large k. We estimate W_2(Π^{(k)}_n, δ_{m_0}), using 10 simulations, for different values of observations n and samples k. Figure 4 shows these estimates for different values of k (on the x-axis) and of n (color coded). Notice how the estimates become more concentrated for larger k, and that the Wasserstein distance between the empirical measure Π^{(k)}_n and the true model m_0 decreases for larger n. Additionally, Table 1 shows that the standard deviation of the 10 estimates of W_2(Π^{(k)}_n, δ_{m_0}) decreases as either n or k increases.
We then computed the empirical barycenters m̂^{(n,k)}_2 of the sampled posteriors Π^{(k)}_n, as suggested in Remark 5.11. Thus, we used the iterative GDW procedure in equation (5.2), namely the (deterministic) gradient descent method, and repeated this calculation 10 times. As a stopping criterion for gradient descent, we considered the relative variation of the W_2 cost; the computation was terminated when this cost fell below 10^{-4}. Figure 5 shows the W_2 distances between the so-computed barycenters and the true model, while Table 2 shows the average distance for each pair (n, k). Notice that, in general, both the average and the standard deviation of the barycenter distances decrease as either n or k increases, yet for large values (e.g., n = 2000, 5000) numerical issues appear.

Distance between the empirical barycenter and the Bayesian model average
We then compared the empirical Wasserstein barycenters m̂^{(n,k)}_2 to the standard Bayesian model averages, denoted here m̃^{(n,k)}, in terms of their distance to the true model m_0, for n = 1000 observations. To that end, we estimated the W_2 distances via empirical approximations with 1000 samples for each model, based on [21]. We simulated this procedure 10 times for k ∈ {10, 20, 50, 100, 200, 500, 1000}. Figure 6 shows the sample average and variance of the W_2 distances of the Wasserstein barycenters and the Bayesian model averages. The empirical barycenter is seen to be closer to the true model than the model average, regardless of the number of MCMC samples k. (Figure 6: W_2 distances to m_0 of the Wasserstein barycenters and of the model averages, denoted ma, in blue, for n = 1000 and different numbers of samples k; 10 simulations for each k.)
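For reference, one simple way to approximate W_2 between two laws from equal-size samples is to solve the optimal assignment between the two point clouds; this is only a generic sketch and not necessarily the estimator of [21] used in the experiment.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Empirical W_2 between two equal-size samples via the assignment problem.
def w2_from_samples(x, y):
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)   # squared Euclidean costs
    rows, cols = linear_sum_assignment(cost)
    return np.sqrt(cost[rows, cols].mean())

rng = np.random.default_rng(7)
x = rng.multivariate_normal(np.zeros(2), np.eye(2), size=300)
y = rng.multivariate_normal(np.ones(2), 2 * np.eye(2), size=300)
print(w2_from_samples(x, y))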

Computation of the Wasserstein barycenter with batch-SGDW
Lastly, we compared the empirical barycenters m̂^{(n,k)}_2 to the barycenter obtained by the batch SGDW method in equation (5.8) with different batch sizes. Here, we shall denote the latter by m̂^{(n,t,s)}_2, with n the number of observations, t the number of steps of the algorithm, and s the batch size. Figure 7 shows the evolution of the squared W_2 distance between the stochastic gradient descent sequences and the true model m_0, for n ∈ {10, 20, 50, 100, 200, 500, 1000} observations and batches of sizes s ∈ {1, 15}, with step-size γ_t = 1/t for t = 1, . . . , 200. Notice from Fig. 7 that the larger the batch, the more concentrated the trajectories of m̂^{(n,t,s)}_2 become. Additionally, Table 3 summarizes the means of the squared W_2 distance to the true model m_0, using the sequences after t = 100, against the empirical estimator using all the simulations with k ≥ 100. Finally, Table 4 shows the standard deviation of the squared W_2 distance to the true model m_0, which can be seen to decrease as the batch size grows. Critically, we observe that for batch sizes s ≥ 5 the stochastic estimation was better than its empirical counterpart, i.e., it had smaller variance with similar (or even smaller) bias. This is noteworthy given that computing our Wasserstein barycenter estimator via the batch stochastic gradient descent method is computationally less demanding than computing it via the empirical method.
Table 3. Means of W_2² of the stochastic gradient estimations (using the sequences with t ≥ 100) and of the empirical estimator (using the simulations with k ≥ 100), across different combinations of observations n and batch sizes s.
Table 4. Standard deviations of W_2² of the stochastic gradient estimations (using the sequences with t ≥ 100) and of the empirical estimator (using the simulations with k ≥ 100), across different combinations of observations n and batch sizes s.

Conclusion
Based on this illustrative numerical example, we can conclude that: (i) the empirical posterior constructed using MCMC is consistent under the W_2 distance, and therefore it can be relied upon to compute Wasserstein barycenters; (ii) the empirical Wasserstein barycenter estimator tends to converge faster (and with lower variance) to the true model than the empirical Bayesian model average; (iii) computing the population Wasserstein barycenter estimator via batch stochastic gradient descent is a promising alternative to computing the empirical barycenter (i.e., to applying the deterministic gradient descent method to a finitely-sampled posterior).
over the set of densities. But the optimal value cannot be better than the one obtained by minimizing pointwise, for each x, the term inside the curly brackets over m̄(x), which yields the candidate minimizer. As this pointwise minimizer is already a probability density, we conclude.
For n large enough we have
$$W_p\Big(\nu_n,\ \int_{W_p(\mathcal X)} m\,\Gamma(dm)\Big)^p \;\le\; \int_{W_p(\mathcal X)} W_p(\nu_n, m)^p\,\Gamma(dm) \;\le\; B_V + 1 \;=:\; K,$$
by convexity of optimal transport costs. From this we derive that (for every x) sup_n ∫_X d(x, y)^p ν_n(dy) < ∞.
By Markov's inequality this shows, for each ε > 0, that there is ℓ large enough such that sup_n ν_n({y ∈ X : d(x, y) > ℓ}) ≤ ε. As explained in [33], the assumptions on X imply that {y ∈ X : d(x, y) ≤ ℓ} is compact (Hopf-Rinow theorem), and so we deduce the tightness of {ν_n}. By Prokhorov's theorem, up to selection of a subsequence, there exists ν ∈ M which is its weak limit. By Fatou's lemma:
$$B_V \;=\; \lim_n \int W_p(\nu_n, m)^p\,\Gamma(dm) \;\ge\; \int \liminf_n W_p(\nu_n, m)^p\,\Gamma(dm) \;\ge\; \int W_p(\nu, m)^p\,\Gamma(dm),$$
hence ν is a p-Wasserstein barycenter. For the converse implication, see Remark 3.6.
we likewise conclude that Π n is p-integrable after updates.