Optimal compromise between incompatible conditional probability distributions, with application to Objective Bayesian Kriging

Models are often defined through conditional rather than joint distributions, but it can be difficult to check whether the conditional distributions are compatible, i.e. whether there exists a joint probability distribution which generates them. When they are compatible, a Gibbs sampler can be used to sample from this joint distribution. When they are not, the Gibbs sampling algorithm may still be applied, resulting in a "pseudo-Gibbs sampler". We show its stationary probability distribution to be the optimal compromise between the conditional distributions, in the sense that it minimizes a mean squared misfit between them and its own conditional distributions. This allows us to perform Objective Bayesian analysis of correlation parameters in Kriging models by using univariate conditional Jeffreys-rule posterior distributions instead of the widely used multivariate Jeffreys-rule posterior. This strategy makes the full-Bayesian procedure tractable. Numerical examples show it has near-optimal frequentist performance in terms of prediction interval coverage.

A model can be specified either through its joint distribution or through its conditional distributions. The first approach leads to models with well-understood properties because a closed-form expression is available. The second one allows for more flexibility in modeling but makes theoretical analysis more difficult.
The main problem with the second approach is that conditional distributions may not be compatible. In this context, it means that there exists no joint distribution from which the conditional distributions can all be derived. Other definitions of compatibility exist in the literature. For example, in the context of a model with a given prior distribution, Dawid and Lauritzen [11] examine the problem of eliciting a compatible prior distribution for a submodel. In the domain of Bayesian Networks, a probability distribution can be compatible or not with a given Directed Acyclic Graph (DAG) [30]. Moreover, an abstraction (simplification) of a DAG can be compatible or not [34]. A concept of compatibility of two prior distributions also exists in the field of Bayesian model selection. It is based on the Kullback-Leibler divergence of the corresponding marginal (predictive) distributions [9]. In this paper however, the notion of compatibility concerns families of conditional distributions. A family of conditional distributions is called compatible if there exists a joint distribution that agrees with them all [16]. Uniqueness of this joint distribution is desirable but not included in the requirements of compatibility.
For finite state spaces, null recurrent Markov chains are impossible, but attempting to define a joint probability distribution through its conditional distributions may still lead to incompatibility. Accordingly, the problem of efficiently determining whether a given system of conditionals is compatible has received considerable attention over the years. Kuo et al. [22], after listing previous attempts, provide probably the best solution to date. Their idea relies on the Structural Ratio Matrix which contains ratios between conditional distributions.
However, even if a system contains incompatible conditional probability distributions, it does not follow that it is useless. Since Heckerman et al. [15], practitioners have been using systems of conditional probability distributions without reference to compatibility. Indeed, provided the Markov chain is positive recurrent, it is always possible to fire up Gibbs samplers to deal with a system of conditional distributions. Some authors use the colorful acronym PIGS for "Potentially Incompatible Gibbs Sampler" to describe such a procedure. When the conditionals are definitely known to be incompatible, the most widely used term seems to be "Pseudo-Gibbs Sampler" (PGS).
Behind the practice of PIGS is the intuition that the Gibbs sampler should converge to the joint distribution that best represents the system of conditionals. Kuo and Wang [21] provide a detailed analysis and geometrical interpretation of the behavior of Pseudo-Gibbs Samplers for discrete conditional distributions. In particular, they show how the scanning order determines its stationary distribution. In Section 2 of the present paper, we provide some theoretical foundation for the intuition that the stationary distribution of a PGS with random scanning order is, in case of uniqueness, the best "compromise" between incompatible conditionals. Section 3 provides further discussion of this theory by considering alterations to the main definitions and showing them to lead to undesirable results.
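The pseudo-Gibbs procedure is easy to sketch numerically. The following Python snippet runs a random-scan pseudo-Gibbs sampler on {0, 1} × {0, 1} and estimates its stationary distribution by empirical frequencies; the two conditional tables are hypothetical illustrative numbers (not taken from the paper), chosen to be incompatible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, deliberately incompatible conditional tables on {0,1} x {0,1}:
# p1[y, x] = P(X1 = x | X2 = y),  p2[x, y] = P(X2 = y | X1 = x).
p1 = np.array([[0.9, 0.1],
               [0.3, 0.7]])
p2 = np.array([[0.2, 0.8],
               [0.6, 0.4]])

def pseudo_gibbs(n_steps=100_000):
    """Random-scan pseudo-Gibbs sampler: at each step, pick a coordinate
    uniformly at random and refresh it from its conditional table."""
    x1 = x2 = 0
    counts = np.zeros((2, 2))
    for _ in range(n_steps):
        if rng.random() < 0.5:                 # refresh X1 given X2
            x1 = int(rng.random() < p1[x2, 1])
        else:                                  # refresh X2 given X1
            x2 = int(rng.random() < p2[x1, 1])
        counts[x1, x2] += 1
    return counts / n_steps

freq = pseudo_gibbs()
print(freq)   # empirical stationary distribution of the pseudo-Gibbs sampler
```

Even though no joint distribution agrees with both tables, the chain is positive recurrent and the empirical frequencies stabilize; the theory below identifies this limit as the optimal compromise.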
In Section 4, we use the theory of optimal compromise to derive Objective Bayesian inference on correlation parameters of Kriging models. Kriging models are widely used in spatial statistics, but they are more complex than standard models. Their parameters are numerous, and their interactions complex, which makes eliciting a joint Objective prior difficult. This makes an approach resting on conditional priors attractive despite the risk of incompatibility of the associated conditional posteriors. It is to deal with such situations that the theory of optimal compromise was developed.
The Objective Bayesian paradigm as explained by Berger [4] consists in eliciting for every model a "default", reasonable prior distribution that could be used when no explicit prior information is available. In particular, the Berger-Bernardo reference prior [8], hereafter simply named "reference prior", can be algorithmically computed with minimal user intervention.
For models with a single scalar parameter, the reference prior rewards parameter values that are easily discriminated by the likelihood function. Its definition is related to the Kullback-Leibler divergence between posterior and prior [8]. For usual continuous models (essentially models where the Fisher information matrix is equal to the negative expectation of the second derivative of the log-likelihood; see Clarke and Barron [10] for an exhaustive list of conditions), it coincides with the Jeffreys-rule prior.
For models with multiple parameters, the reference prior algorithm requires the user to specify an ordering on the parameters; the reference prior on each parameter is then computed iteratively, conditionally on all subsequent parameters. The only user input is therefore this ordering, and common-sense arguments often make one ordering more sensible than others. Yang and Berger [33] list the reference priors obtained with different parameter orderings for a large number of statistical models. Of course, one could also group several parameters and treat them as a single multi-dimensional parameter, but doing so tends to produce less satisfactory inference [5]. In particular, for usual continuous models, Berger et al. [7] state: "We actually know of no multivariable example in which we would recommend the Jeffreys-rule prior. In higher dimensions, the prior always seems to be either 'too diffuse' [...] or 'too concentrated'".
Berger et al. [6] were the first to derive a reference prior for the parameters of a Gaussian Process regression model. This model contained only one correlation parameter, however. When several correlation parameters are involved, there is no reasonable way to order them. Even if one were arbitrarily picked, computation of the prior would be analytically intractable. Several authors [25,27,19,28,13] have therefore resolved to treat all correlation parameters as a single multidimensional parameter. It is in order to avoid having to do this that we make use of PIGS.
The idea is simple: for every correlation parameter, it is possible to analytically derive the reference prior for this parameter conditionally to all others. Each of the corresponding posterior distributions can be seen as a conditional probability distribution on one correlation parameter when all others are known. These conditional distributions then serve as input to a PIGS. Theorem 4.2 is the main result with respect to the application. First, under reasonable assumptions, the PIGS admits one single stationary probability distribution. Second, the Markov kernel defined by the PIGS is uniformly ergodic. Since this Markov kernel is defined over an uncountable state space, the latter fact is significant. The stationary distribution, which we call the Gibbs reference posterior distribution, can be used to improve prediction of the value taken by the Gaussian process at unobserved points. Sections 5 and 6 illustrate the inferential and predictive performance of the stationary distribution respectively.

Definitions and notations
In this section we introduce the concepts necessary to define the optimal compromise between potentially incompatible conditional distributions. For the sake of readability, all proofs are provided in Appendix A.
Let r be a positive integer and let (Ω_1, A_1), ..., (Ω_r, A_r) be measurable spaces. Define Ω = Ω_1 × ... × Ω_r and A = A_1 ⊗ ... ⊗ A_r, and for every i ∈ [[1, r]] let π_i be a Markov kernel (×_{j≠i} Ω_j) × A_i → [0, 1]. Intuitively (we formalize this below), every π_i should be assembled with a distribution m_{≠i} on ⊗_{j≠i} A_j to create a "joint" distribution, that is, a probability distribution on A. We refer to every m_{≠i} (i ∈ [[1, r]]) as an (r − 1)-dimensional distribution. If the m_{≠i} can be chosen in such a way as to make all joint distributions equal, then the Markov kernels in the sequence (π_i)_{i∈[[1,r]]} are called compatible. And if no choice of (m_{≠i})_{i∈[[1,r]]} can make all joint distributions equal, we have to look for a "compromise" between the Markov kernels.
Consider the following example [3] with r = 2, where Ω_1 = Ω_2 = (0, +∞) are endowed with their Borel σ-algebras. Let π_1 and π_2 be Markov kernels such that for all x, y > 0, π_1(·|y) and π_2(·|x) are absolutely continuous with respect to the Lebesgue measure on (0, +∞). Let λ be this measure and let the densities of π_1 and π_2 respectively be

p_1(x|y) = (y + 1) exp(−(y + 1)x)   and   p_2(y|x) = (x + 1) exp(−(x + 1)y).

Let us define m_{≠1} and m_{≠2} as the probability distributions with the following densities with respect to the Lebesgue measure on (0, +∞), c = ∫_0^∞ exp(−t)/(1 + t) dt being the normalizing constant:

m_{≠1}(y) = exp(−y) / (c (1 + y)),   m_{≠2}(x) = exp(−x) / (c (1 + x)).

Then the joint probability distributions π_1 m_{≠1} and π_2 m_{≠2} are equal. Now denoting by λ̄ the Lebesgue measure on (0, +∞) × (0, +∞), the density of π_1 m_{≠1} = π_2 m_{≠2} with respect to λ̄ is

(x, y) ↦ exp(−(xy + x + y)) / c.

Remark (Producing incompatibility is easy). Take r = 2 and let Ω_1 = Ω_2 be a Borel subset of R. Assume that for all x, y ∈ Ω_1, π_1(·|y) and π_2(·|x) are absolutely continuous with respect to λ, which denotes here the Lebesgue measure on Ω_1. Let p_1(·|y) and p_2(·|x) be their respective density functions and further assume that for λ-almost all real numbers x and y, p_1(x|y) > 0 and p_2(y|x) > 0. A necessary condition [3] for the compatibility of π_1 and π_2 can be derived from Bayes' rule: there must exist two mappings u and v defined on Ω_1 such that for λ-almost all real numbers x and y,

p_1(x|y) / p_2(y|x) = u(x) v(y).   (2.6)

In the previous example, Ω_1 = Ω_2 = (0, +∞) and this necessary condition is fulfilled: for all x, y > 0,

p_1(x|y) / p_2(y|x) = [(y + 1)/(x + 1)] exp(y − x) = [exp(−x)/(x + 1)] · [(y + 1) exp(y)].

Taking instead p_1(x|y) = (y + 2) exp(−(y + 2)x) and p_2(y|x) = exp(−y) makes π_1 and π_2 fail the necessary condition for compatibility.

Definition 2.2. Let φ be a probability distribution on A. For every i ∈ [[1, r]], denote by φ_{−i} the probability distribution on ⊗_{j≠i} A_j defined as follows. For every set S_{−i} that can be decomposed as S_{−i} = ×_{j≠i} S_j with S_j ∈ A_j for every j ≠ i,

φ_{−i}(S_{−i}) = φ(S_1 × ... × S_{i−1} × Ω_i × S_{i+1} × ... × S_r).

Remark. The above definition is valid because any probability distribution on ⊗_{j≠i} A_j is characterized by its values on such product sets. For every i ∈ [[1, r]] and every probability distribution m_{≠i} on ⊗_{j≠i} A_j, denote by π_i m_{≠i} the distribution on A defined as follows.
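The failing pair above can be checked mechanically. If the ratio p_1(x|y)/p_2(y|x) factors as u(x)v(y), then its logarithm must have vanishing "mixed differences" g(x, y) + g(x′, y′) − g(x, y′) − g(x′, y). The short numerical check below (a sketch, not part of the paper) confirms that the pair p_1(x|y) = (y + 2) exp(−(y + 2)x), p_2(y|x) = exp(−y) fails this condition.

```python
import math

# Necessary condition for compatibility of two everywhere-positive densities:
# p1(x|y) / p2(y|x) = u(x) * v(y).  Equivalently, the log-ratio g(x, y) must
# satisfy g(x, y) + g(x', y') - g(x, y') - g(x', y) = 0 for all x, x', y, y'.

def log_ratio(x, y):
    # Failing pair from the text: p1(x|y) = (y+2) exp(-(y+2) x),  p2(y|x) = exp(-y)
    log_p1 = math.log(y + 2.0) - (y + 2.0) * x
    log_p2 = -y
    return log_p1 - log_p2

def mixed_difference(x, xp, y, yp):
    return log_ratio(x, y) + log_ratio(xp, yp) - log_ratio(x, yp) - log_ratio(xp, y)

d = mixed_difference(1.0, 2.0, 1.0, 2.0)
print(d)   # ≈ -1: nonzero, so the ratio does not factor and the pair is incompatible
```

The nonzero mixed difference comes entirely from the cross term −xy in the log-ratio, which is exactly what the factorization u(x)v(y) forbids.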
For every i ∈ [[1, r]], for every set S_{<i} ∈ ⊗_{j<i} A_j, every set S_{>i} ∈ ⊗_{k>i} A_k and every set S_i ∈ A_i,

(π_i m_{≠i})(S_{<i} × S_i × S_{>i}) = ∫_{S_{<i} × S_{>i}} π_i(S_i | ω_{−i}) m_{≠i}(dω_{−i}).   (2.9)

Naturally, for i = 1 (resp. i = r), remove S_{<i} (resp. S_{>i}) from the formula above. In the following, the same convention applies whenever i = 1 or i = r.

Notice that for every i ∈ [[1, r]] and every probability distribution m_{≠i} on ⊗_{j≠i} A_j,

(π_i m_{≠i})_{−i} = m_{≠i}.   (2.10)

If there exists a sequence of (r − 1)-dimensional distributions (m_{≠i})_{i∈[[1,r]]} such that all distributions π_i m_{≠i} are equal, then the Markov kernels (π_i)_{i∈[[1,r]]} are compatible. If no such sequence (m_{≠i})_{i∈[[1,r]]} exists, then we wish to find a sequence (m_{≠i})_{i∈[[1,r]]} that makes the π_i m_{≠i} share some "common ground". The following definition expresses this constraint formally.

Definition 2.3. A sequence of (r − 1)-dimensional probability distributions (m_{≠i})_{i∈[[1,r]]} (each m_{≠i} being a probability distribution on ⊗_{j≠i} A_j) is said to be compatible with the sequence of Markov kernels (π_i)_{i∈[[1,r]]} if for every i ∈ [[1, r]],

m_{≠i} = (1/r) Σ_{j=1}^{r} (π_j m_{≠j})_{−i}.

So the "common ground" we require for the sequence of distributions (π_i m_{≠i})_{i∈[[1,r]]} is that their (r − 1)-marginal distributions should be the same on average. Other constraints would have been possible, and we discuss some of them in Section 3.1 below.

Definition 2.4. A probability distribution P on A is called a compromise between the Markov kernels (π_i)_{i∈[[1,r]]} if these two conditions are verified: (1) for every i ∈ [[1, r]], π_i P_{−i} is absolutely continuous with respect to P; (2) the sequence of its (r − 1)-marginal distributions (P_{−i})_{i∈[[1,r]]} is compatible with (π_i)_{i∈[[1,r]]} in the sense of Definition 2.3.
In the definition of a compromise, the first condition exists to give meaning to the definition of an optimal compromise below. It is reasonable on its own though: a compromise should not deem events impossible if they are considered possible by the Markov kernels.
Returning to the previous example, both π_1 m_{≠1} and π_2 m_{≠2} are compromises, and any convex combination of these two distributions is one as well. The following definition introduces a cost function for compromises in order to determine which compromises are optimal.
Definition 2.5. Let λ be a positive measure on A. Let P be a compromise between the sequence of Markov kernels (π_i)_{i∈[[1,r]]} that is absolutely continuous with respect to λ. P is called an optimal compromise with respect to λ between the sequence of Markov kernels (π_i)_{i∈[[1,r]]} if it minimizes the functional E_λ over all compromises between (π_i)_{i∈[[1,r]]} that are absolutely continuous with respect to λ. E_λ is defined by:

E_λ(P) = Σ_{i=1}^{r} ∫_Ω ( d(π_i P_{−i})/dλ − dP/dλ )² dλ.   (2.13)

In the previous example, the optimal compromise with respect to the Lebesgue measure on R × R can be shown (cf. Section 2.2) to be (1/2) π_1 m_{≠1} + (1/2) π_2 m_{≠2}.
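On a finite state space with λ the counting measure, E_λ reduces to a sum of squared differences between probability tables, which makes it easy to evaluate. The Python sketch below (the 2 × 2 conditional tables are hypothetical illustrative numbers, and the counting-measure form of E_λ is assumed) checks that E vanishes at the true joint when the conditionals are derived from it, and stays positive for an incompatible pair.

```python
import numpy as np

def E(P, p1, p2):
    """Counting-measure misfit E(P) = sum_i sum_omega ((pi_i P_{-i})(omega) - P(omega))^2
    on a 2 x 2 space.  P[x, y] is the candidate joint,
    p1[y, x] = P(X1=x | X2=y), p2[x, y] = P(X2=y | X1=x)."""
    m1, m2 = P.sum(axis=1), P.sum(axis=0)
    P1 = p1.T * m2[None, :]    # pi_1 P_{-1}: refresh X1, keep P's X2-marginal
    P2 = p2 * m1[:, None]      # pi_2 P_{-2}: refresh X2, keep P's X1-marginal
    return float(((P1 - P) ** 2).sum() + ((P2 - P) ** 2).sum())

# Sanity check: conditionals derived from a joint q are compatible, and E(q) = 0.
q = np.array([[0.1, 0.2], [0.3, 0.4]])
p1c = (q / q.sum(axis=0)[None, :]).T    # p1c[y, x] = q[x, y] / m2[y]
p2c = q / q.sum(axis=1)[:, None]        # p2c[x, y] = q[x, y] / m1[x]
print(E(q, p1c, p2c))                   # ~0

# Incompatible tables (hypothetical numbers): E stays positive.
p1 = np.array([[0.9, 0.1], [0.3, 0.7]])
p2 = np.array([[0.2, 0.8], [0.6, 0.4]])
uniform = np.full((2, 2), 0.25)
print(E(uniform, p1, p2))
```

This mirrors Proposition 2.6's setting: when the kernels are compatible, the true joint drives the misfit to zero; when they are not, E_λ stays bounded away from zero over all candidate joints.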
Proposition 2.6. The set of all compromises between (π_i)_{i∈[[1,r]]} is convex, as is the subset of all compromises absolutely continuous with respect to λ.
If the Markov kernels (π_i)_{i∈[[1,r]]} are compatible and there exists a joint distribution π on A that agrees with them all, then for every positive measure λ on A such that π is absolutely continuous with respect to λ, E_λ(π) = 0 and π is an optimal compromise with respect to λ.
Even though Definition 2.5 makes it seem like the notion of optimal compromise is tied to a reference measure λ, it turns out that in many situations there exists a compromise that is optimal with respect to all possible reference measures (cf. Theorem 2.10 below). In the previous example, it is given by (1/2) π_1 m_{≠1} + (1/2) π_2 m_{≠2}.

Deriving the optimal compromise
The concepts of compromise and optimal compromise defined in Section 2.1 are intimately linked to Gibbs sampling, or (because the Markov kernels are incompatible) pseudo-Gibbs sampling (PGS). Let us therefore recall the Gibbs sampling algorithm.
At each step of the algorithm, an index i ∈ [[1, r]] is selected and the i-th coordinate of the current state is redrawn from π_i(·|ω_{−i}), all other coordinates being left unchanged. The method used to sample the index i is called the scanning order. A systematic scan moves through each index in turn following a deterministic pattern, while an equiprobable random scan selects at each step an index within [[1, r]] with probability 1/r.
If the Markov kernels (π_i)_{i∈[[1,r]]} are compatible, then the scanning order influences only the rate of convergence of the algorithm. He et al. [14] provide theoretical results on the subject, and Mitliagkas and Mackey [24] propose a measure of the quality of any scan order for Gibbs sampling on finite state spaces.
If the Markov kernels are not compatible, then every scan order may produce a different target distribution. Kuo and Wang [21] thoroughly study the links between all possible systematic scan orders for Markov kernels on finite state spaces. The equiprobable random scan order has yet another target distribution.
The notion of Gibbs compromise is central to this theory of compromises between incompatible Markov kernels.
Definition 2.7. A probability distribution P_G on A is called a Gibbs compromise between the sequence of Markov kernels (π_i)_{i∈[[1,r]]} if it satisfies:

P_G = (1/r) Σ_{i=1}^{r} π_i (P_G)_{−i}.   (2.14)

If it exists, a Gibbs compromise is a stationary distribution for the Gibbs sampler with equiprobable random scan order. This fact makes it practical from a sampling standpoint. In the following, we show how it relates to the concepts of compromise and optimal compromise in the sense of Definitions 2.4 and 2.5.
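On a finite state space, the Gibbs compromise can be computed exactly as the stationary distribution of the equiprobable random-scan kernel, and the fixed-point property P_G = (1/r) Σ_i π_i (P_G)_{−i} can then be verified numerically. A Python sketch with two hypothetical incompatible 2 × 2 conditional tables (illustrative numbers, not from the paper):

```python
import numpy as np

# Conditional tables: p1[y, x] = P(X1=x | X2=y), p2[x, y] = P(X2=y | X1=x).
p1 = np.array([[0.9, 0.1], [0.3, 0.7]])
p2 = np.array([[0.2, 0.8], [0.6, 0.4]])

# 4 x 4 transition matrix of the random-scan kernel over states (x, y),
# flattened as s = 2*x + y; each move refreshes one coordinate with prob 1/2.
K = np.zeros((4, 4))
for x in range(2):
    for y in range(2):
        s = 2 * x + y
        for xp in range(2):                 # move refreshing X1
            K[s, 2 * xp + y] += 0.5 * p1[y, xp]
        for yp in range(2):                 # move refreshing X2
            K[s, 2 * x + yp] += 0.5 * p2[x, yp]

# Stationary distribution: left eigenvector of K for eigenvalue 1.
w, v = np.linalg.eig(K.T)
pg = np.real(v[:, np.argmax(np.real(w))])
PG = (pg / pg.sum()).reshape(2, 2)          # PG[x, y]

# Verify the Gibbs-compromise fixed-point property with r = 2.
m1, m2 = PG.sum(axis=1), PG.sum(axis=0)
mix = 0.5 * (p1.T * m2[None, :]) + 0.5 * (p2 * m1[:, None])
print(np.abs(mix - PG).max())               # ~0
```

Because all conditional probabilities are positive, the chain is irreducible and aperiodic, so this stationary distribution is unique, matching the uniqueness assumption of the results below.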
Proposition 2.8. A Gibbs compromise between the sequence of Markov kernels (π_i)_{i∈[[1,r]]} is also a compromise between this sequence of Markov kernels in the sense of Definition 2.4.
The denomination "Gibbs compromise" is justified because it is a stationary distribution for the Gibbs sampler with random equiprobable scanning order.
The proposition below shows that all compromises are tied to Gibbs compromises.

Proposition 2.9. If a sequence of (r − 1)-dimensional probability distributions (m_{≠i})_{i∈[[1,r]]} (each m_{≠i} being a probability distribution on ⊗_{j≠i} A_j) is compatible with the sequence of Markov kernels (π_i)_{i∈[[1,r]]}, then it is the sequence of (r − 1)-dimensional distributions of a Gibbs compromise between the Markov kernels (π_i)_{i∈[[1,r]]}.
Remark (Equivalence relation and convexity). Let us say that two compromises are equivalent if they share the same sequence of (r − 1)-marginal distributions. This is obviously an equivalence relation, and every equivalence class can be represented by a single Gibbs compromise. Moreover, each equivalence class is a convex subset of the set of all compromises, and for any positive measure λ on A, its intersection with the set of all compromises absolutely continuous with respect to λ is also convex. Finally, the functional E_λ is convex over this intersection.

A theoretical justification of Pseudo-Gibbs sampling
At this stage, all necessary tools are available to derive the two most important results of this theory of compromise between incompatible Markov kernels.
Theorem 2.10. If there exists a unique Gibbs compromise P_G between the sequence of Markov kernels (π_i)_{i∈[[1,r]]}, then the following statements hold: (1) P_G is absolutely continuous with respect to any compromise between the sequence of Markov kernels (π_i)_{i∈[[1,r]]}; (2) for any positive measure λ on A such that P_G is absolutely continuous with respect to λ, P_G is the unique optimal compromise with respect to λ. Because of these two properties, we call P_G the optimal compromise.

Remark (Equivalence class and convexity, continued). The arguments used in the proof of Theorem 2.10 (cf. Appendix A) can also be used to deal with the case where there exist several different Gibbs compromises. First, one can show that each Gibbs compromise is absolutely continuous with respect to any equivalent compromise. Now let C be an equivalence class and let π_C be the unique Gibbs compromise in this class. Let λ be a positive measure on A such that π_C is absolutely continuous with respect to λ. Then π_C (uniquely) minimizes E_λ over the intersection of C with the set of all compromises absolutely continuous with respect to λ. This implies that for any positive measure λ on A, any optimal compromise with respect to λ is a Gibbs compromise. Finally, note that the set of all Gibbs compromises is convex (however, there is no reason E_λ should be convex over this set!).
Theorem 2.10 has important practical implications. It opens the possibility of using Gibbs sampling to find the optimal compromise between incompatible Markov kernels and justifies using PIGS.
The next result shows that, under the conditions of Theorem 2.10, the optimal compromise remains the same for a fairly large class of reparametrizations. This result is key to the application of PIGS in an Objective Bayesian framework, where some degree of invariance by reparametrization of priors and posteriors is usually expected.
For every i ∈ [[1, r]], let (Ω̃_i, Ã_i) be a measurable space and let f_i be a bijective measurable mapping Ω_i → Ω̃_i whose inverse f_i^{−1} is also measurable. Define f = (f_1, ..., f_r) : ×_{i∈[[1,r]]} Ω_i → ×_{i∈[[1,r]]} Ω̃_i and, for every i ∈ [[1, r]], f_{−i} = (f_j)_{j≠i}. Also let π̃_i be the Markov kernel (×_{j≠i} Ω̃_j) × Ã_i → [0, 1] such that for every ω_{−i} ∈ ×_{j≠i} Ω_j and every S̃_i ∈ Ã_i, π̃_i(S̃_i | f_{−i}(ω_{−i})) = π_i(f_i^{−1}(S̃_i) | ω_{−i}).

Proposition 2.11. Assume there exists a unique Gibbs compromise P_G between the sequence of Markov kernels (π_i)_{i∈[[1,r]]}. Then the push-forward measure of P_G by f, P̃_G := P_G ∗ f, is the unique Gibbs compromise between the sequence of Markov kernels (π̃_i)_{i∈[[1,r]]}.

Testing the definitions of compatibility
While the previous section presented the theory of compromise, this section aims to test its foundations, namely Definitions 2.3 and 2.4. Subsection 3.1 shows that strengthening their requirements is not possible, since doing so would in many cases threaten the very existence of a compromise. Subsection 3.2, on the other hand, shows that these requirements cannot be weakened without cost.

Stronger definitions of compromises are not possible
While the definition of the optimal compromise is straightforward, as it involves minimizing some measure of distance between the "targeted" conditionals and the conditionals of the compromise, the definition of a compromise may seem arbitrary. To motivate this definition, let us focus on the two-dimensional case.
Suppose that r = 2 and that π 1 and π 2 are incompatible. This means there exists no joint distribution π which agrees with both Markov kernels. This being the case, it seems sensible to weaken the definition of compatibility by applying it to the "marginals" instead of the "joint" distribution. The following definition makes this idea precise.
Definition 3.1. A pair of probability distributions m_{≠1} (resp. m_{≠2}) on A_2 (resp. A_1) is compatible with the pair of Markov kernels π_1 and π_2 if the distributions π_1 m_{≠1} and π_2 m_{≠2} share the same marginal distributions:

(π_1 m_{≠1})_{−1} = (π_2 m_{≠2})_{−1}   and   (π_1 m_{≠1})_{−2} = (π_2 m_{≠2})_{−2}.

While this definition may at first glance seem more restrictive than Definition 2.3, both definitions are in fact equivalent when applied to a pair of Markov kernels, because (r − 1)-dimensional distributions are simply 1-dimensional distributions in this case. Indeed, following directly from Definition 2.3, we have this result, which holds for any r:

Proposition 3.2. If a sequence of (r − 1)-dimensional probability distributions (m_{≠i})_{i∈[[1,r]]} (each m_{≠i} being a probability distribution on ⊗_{j≠i} A_j) is compatible (in the sense of Definition 2.3) with the sequence of Markov kernels (π_i)_{i∈[[1,r]]}, then all joint distributions in the sequence (π_i m_{≠i})_{i∈[[1,r]]} share the same marginals, that is, for every k ∈ [[1, r]], the marginal distribution of the k-th coordinate is the same under every π_i m_{≠i}.

Now let us consider the three-dimensional case (r = 3). Because the aim of this section is merely to motivate the definitions of compromises and optimal compromises, there is no need for the discussion to be fully general. Let us therefore restrict the discussion to an important particular case. Assume that Ω_1, Ω_2 and Ω_3 are finite sets and that A_1, A_2 and A_3 are respectively the sets of all their subsets. This has an important consequence: any mapping from a subset of Ω to a subset of Ω is measurable.
Also assume that the Markov kernels π_1, π_2 and π_3 are positive mappings. For π_1, this means that for every ω_{−1} ∈ Ω_2 × Ω_3 and every ω_1 ∈ Ω_1, π_1({ω_1}|ω_{−1}) > 0, and similarly for π_2 and π_3. If we consider ω_3 known, then the situation is reduced to the two-dimensional case. Because π_1 and π_2 are positive mappings and both Ω_1 and Ω_2 are finite sets, Markov chain theory ensures there exists a unique Gibbs compromise P(·|ω_3), and thanks to Theorem 2.10, P(·|ω_3) is the optimal compromise. We may similarly derive Markov kernels Q and R by considering ω_1 (resp. ω_2) known. Once again, because π_1, π_2 and π_3 are positive mappings, it follows from Markov chain theory that P, Q and R are also positive mappings. We now show it using only elementary arguments, because these arguments will be useful again later.
Assume P is not a positive mapping: there exist ω_3 ∈ Ω_3 and (ω_1, ω_2) ∈ Ω_1 × Ω_2 such that P(ω_1, ω_2|ω_3) = 0. Because π_1 and π_2 are positive, the fixed-point property of the Gibbs compromise then forces the marginal probabilities of ω_1 and of ω_2 under P(·|ω_3) to vanish, and propagating this argument over all states would make P(·|ω_3) the null measure, contradicting the fact that P(·|ω_3) is supposed to be a probability distribution. So P is a positive mapping.
Ideally, we would wish to define the optimal compromise between π_1, π_2 and π_3 as the joint distribution φ whose conditional distributions given each single coordinate coincide with P, Q and R. Unfortunately, the existence of such an "optimal compromise" φ implies that π_1, π_2 and π_3 are compatible. First, one can show that for every (ω_1, ω_2, ω_3) ∈ Ω_1 × Ω_2 × Ω_3, φ(ω_1, ω_2, ω_3) > 0; the arguments are similar to those used above to show that P is a positive mapping. Rewriting the defining equations of φ in terms of P, Q and R and combining them then shows that π_1 φ_{−1} = π_2 φ_{−2}, and a similar proof shows that π_3 φ_{−3} = π_1 φ_{−1}. This means that if an "optimal compromise" φ exists, then π_1, π_2 and π_3 are compatible and no compromise was needed. Similarly to what was done in the two-dimensional case, we avoid this difficulty by weakening the compatibility requirements: we no longer require P(·|ω_3) to be the optimal compromise between π_1 and π_2 for every ω_3 ∈ Ω_3, but only on average over ω_3 ∈ Ω_3, and likewise for Q and R. To sum this part of the discussion up, the necessity of weakening our "ideal" requirements for "optimal compatibility" made us downgrade from a requirement about a "joint" 3-dimensional distribution φ to requirements about its (3 − 1)-marginal distributions φ_{−1}, φ_{−2} and φ_{−3}. Definition 2.3 is just another formulation of these requirements, which are taken to define a compatible sequence of (r − 1)-dimensional distributions. Proposition 2.9 shows that in cases where there exists a unique Gibbs compromise, even with this weakened set of requirements for compatibility, there exists only one compatible sequence of (r − 1)-dimensional distributions, so we may not strengthen it if there is to be any solution. Indeed, in cases where no Gibbs compromise exists, no compatible set of (r − 1)-dimensional distributions exists either!

Weaker definitions of compromises are inconvenient
As was shown in the previous subsection, the requirements given by Definition 2.3 for the compatibility of a sequence of (r−1)-dimensional distributions with a given sequence of Markov kernels cannot be strengthened. The following shows they cannot be weakened either.
3.2.1. Why we need some notion of compatibility: example in 2 dimensions.
Why bother with the compatibility of (r − 1)-dimensional distributions instead of simply minimizing the functional E_λ of Definition 2.5 over all distributions absolutely continuous with respect to λ? The following 2-dimensional example shows that doing so yields unsatisfactory results.
Observe that according to π_1, X_2 = 0 implies X_1 = 1, but that according to π_2, X_1 = 1 implies that X_2 = 1 ≠ 0. This discrepancy is a major source of incompatibility between the two Markov kernels. So, as π_E makes both X_1 = 1 and X_2 = 0 less likely than π_C does, it "ignores the inconsistent parts" of π_1 and π_2 to some extent. Therefore, if the marginals are not set in advance (say, by imposing compatibility with the conditionals in the sense of Definition 2.3), one may "cheat" by having the marginals disadvantage inconvenient values of the parameters.

3.2.2.
Why the compatibility requirements can hardly be weakened: example in 3 dimensions.
In the two-dimensional case, because of Proposition 3.2, Definitions 3.1 and 2.3 give the same meaning to the concept of compatibility of marginals, so Definition 2.3 may be thought of as a generalization of Definition 3.1 to cases with more than two dimensions. However, another generalization of the latter definition is possible. To avoid confusion, this other generalization will be called weak compatibility. In the following, the sequence of Markov kernels (π_i)_{i∈[[1,r]]} is defined as in Section 2.1.
Definition 3.3. A sequence of (r − 1)-dimensional probability distributions (m_{≠i})_{i∈[[1,r]]} (each m_{≠i} being a probability distribution on ⊗_{j≠i} A_j) is weakly compatible with the sequence of Markov kernels (π_i)_{i∈[[1,r]]} if all joint distributions in the sequence (π_i m_{≠i})_{i∈[[1,r]]} share the same 1-dimensional marginal distributions.

Proposition 3.2 means that compatibility in the sense of Definition 2.3 implies weak compatibility in the sense of Definition 3.3, hence the denomination "weak".
Using the concept of weak compatibility of a sequence of (r − 1)-marginal distributions, we define weak compromises and the optimal weak compromise as analogues of compromises and optimal compromises respectively.

Definition 3.4. A probability distribution P on A is called a weak compromise between the Markov kernels (π_i)_{i∈[[1,r]]} if these two conditions are verified: (1) for every i ∈ [[1, r]], π_i P_{−i} is absolutely continuous with respect to P; (2) the sequence of its (r − 1)-marginal distributions (P_{−i})_{i∈[[1,r]]} is weakly compatible with (π_i)_{i∈[[1,r]]}.

Definition 3.5. Let λ be a positive measure on A. Let P be a weak compromise between the sequence of Markov kernels (π_i)_{i∈[[1,r]]} that is absolutely continuous with respect to λ. P is called an optimal weak compromise with respect to λ between the sequence of Markov kernels (π_i)_{i∈[[1,r]]} if it minimizes the functional E_λ, defined by Equation (2.13), over all weak compromises between (π_i)_{i∈[[1,r]]} that are absolutely continuous with respect to λ.
Because, for any positive measure λ on A, the set of all weak compromises absolutely continuous with respect to λ includes the set of all compromises absolutely continuous with respect to λ, an optimal weak compromise with respect to λ makes the functional E_λ no greater than an optimal compromise with respect to λ does. However, as the following example with r = 3 shows, optimal weak compromises may have undesirable behavior. Notice that, provided ω_3 is known, π_1 and π_2 are compatible Markov kernels, and the unique probability distribution on A_1 ⊗ A_2 that fits both π_1 and π_2 (conditionally on ω_3) makes X_1 and X_2 independent. Thus, any joint distribution fitting the Markov kernel P would make X_1, X_2 and X_3 mutually independent. Unfortunately, no such joint distribution can fit π_3, but we may expect compromises between π_1, π_2 and π_3 to retain the independence of X_1 and X_2.
Let λ be the counting measure on A.
Denote by π_C the (unique) optimal compromise between π_1, π_2 and π_3 with respect to λ. As expected, π_C retains the independence of X_1 and X_2. Because π_C is the unique Gibbs compromise between π_1, π_2 and π_3, Proposition 2.9 implies that any other compromise between π_1, π_2 and π_3 also retains this property.
Let us now consider an optimal weak compromise π_W between π_1, π_2 and π_3 with respect to λ. Numerical computation gives the approximation E_λ(π_W) ≈ 0.019, and under π_W the independence of X_1 and X_2 is lost, even though (π_W)_{−3} and (π_C)_{−3} share the same marginal distributions. Therefore, weak compatibility is not an adequate notion of compatibility.

Optimal compromise between Objective Posterior conditional distributions in Gaussian Process regression
Kriging is a surrogate model used to emulate a real-valued function on a spatial domain D when said function can only be evaluated on a finite subset of D called "design set". The "Kriging prediction" is the mean function of the process taken conditionally to all known values of the emulated function, i.e. the values at the points in the design set. The main advantage of the framework is its natural way of representing uncertainty about the value of the function at unobserved points [31]. Prediction does not consist of a single value but of a complete Normal distribution. "Kriging" is the name given to the framework in the geostatistical literature [18], but is also frequently used in the context of computer experiments and machine learning under the label "Gaussian Process regression" [26]. In this work, we focus on Simple Kriging, where the Gaussian Process is assumed to be stationary with known mean, as opposed to Universal Kriging, which incorporates an unknown mean function.
The probability distribution of a stationary Gaussian Process is characterized by a variance parameter and a correlation function (also known as a "correlation kernel") which itself depends on parameters. One must therefore deal with uncertainty about model parameters.
The problem is "notoriously difficult", as highlighted by Kennedy and O'Hagan [20], because the likelihood function may often be quite flat [23]. In a Bayesian framework, this uncertainty is represented by a prior distribution on the parameters.

Issues raised by objective prior elicitation for Gaussian processes
Let Y(x), x ∈ D be a real-valued random field on a bounded subset D of R^r. We assume Y is Gaussian with zero mean (or known mean) and with covariance of the form Cov(Y(x), Y(x′)) = σ² K_θ(x − x′). σ² thus denotes the variance of the Gaussian Process and θ ∈ (0, +∞)^r, hereafter named the "vector of correlation lengths", is the vector of scaling parameters used by the chosen class of correlation kernels K_θ.
Consider a set of n ∈ N points (x^(i))_{i∈[[1,n]]} belonging to the domain D. This set is called the design set and Y is observed at all points of this set. Let Y be the Gaussian vector (Y(x^(i)))_{i∈[[1,n]]} and let Σ_θ be its correlation matrix: the distribution of Y is therefore N(0_n, σ² Σ_θ).
Let y be the vector of observations. When applied to a matrix, | · | refers to its determinant.
With these notations, the likelihood of the parameters σ² and θ is

L(σ², θ | y) = (2πσ²)^{-n/2} |Σ_θ|^{-1/2} exp( − y⊤ Σ_θ^{-1} y / (2σ²) ).    (4.1)

The reference prior with parameter ordering σ² ≺ θ is given by Berger et al. [6]; its conditional density π(σ²|θ) ∝ 1/σ² has infinite mass: it is an improper prior.
Let us integrate σ² out of the likelihood (4.1) against π(σ²|θ) ∝ 1/σ²:

L¹(θ | y) = ∫₀^∞ L(σ², θ | y) dσ²/σ² ∝ |Σ_θ|^{-1/2} ( y⊤ Σ_θ^{-1} y )^{-n/2}.

Ren et al. [27] provide the reference prior π(θ), where θ is regarded as a single multidimensional parameter. It is proportional to the square root of the determinant of the r × r matrix with (i, j)-th entry tr(W_i W_j) − (1/n) tr(W_i) tr(W_j), where W_i := Σ_θ^{-1} ∂Σ_θ/∂θ_i. However, this method has the disadvantage of requiring the use of a multidimensional Jeffreys-rule prior distribution, which may show the sort of undesirable behavior mentioned in the introduction.
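As a sanity check, the integrated likelihood above is easy to evaluate numerically. The sketch below is our own illustration, not the authors' code; it assumes a Matérn 5/2 anisotropic geometric kernel, and the names `matern52`, `corr_matrix` and `log_integrated_likelihood` are hypothetical.

```python
import numpy as np

def matern52(d):
    # One-dimensional Matérn correlation with smoothness 5/2,
    # unit correlation length, evaluated at distance d >= 0.
    s = np.sqrt(5.0) * d
    return (1.0 + s + s ** 2 / 3.0) * np.exp(-s)

def corr_matrix(X, theta):
    # Anisotropic geometric Matérn 5/2 correlation matrix: Euclidean
    # distances are computed after rescaling each coordinate by theta.
    Z = X / np.asarray(theta)
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    return matern52(d)

def log_integrated_likelihood(theta, X, y):
    # log L^1(theta | y) up to an additive constant, obtained by
    # integrating sigma^2 out of the Gaussian likelihood against the
    # improper prior pi(sigma^2) = 1/sigma^2:
    #   L^1(theta | y) ∝ |Sigma_theta|^{-1/2} (y' Sigma_theta^{-1} y)^{-n/2}
    n = len(y)
    S = corr_matrix(X, theta)
    L = np.linalg.cholesky(S + 1e-10 * np.eye(n))  # jitter for stability
    alpha = np.linalg.solve(L, y)                  # L^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))      # log |Sigma_theta|
    return -0.5 * logdet - 0.5 * n * np.log(alpha @ alpha)
```

Note that rescaling y only shifts the log integrated likelihood by a constant, reflecting the fact that σ² has been integrated out.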
Alternatively, we could draw inspiration from the one-dimensional case in the following way. Suppose that we know every entry of θ except one, θ_i. Then, according to Equation (4.4), the prior density on θ_i knowing all entries θ_j (j ≠ i) would be

π_i(θ_i | θ_j ∀j ≠ i) ∝ sqrt( tr(W_i²) − (1/n) tr(W_i)² ),  where W_i := Σ_θ^{-1} ∂Σ_θ/∂θ_i.

These density functions define Markov kernels. Indeed, they are continuous with respect to the θ_j (j ≠ i) and are probability densities with respect to the Lebesgue measure. They are unfortunately likely to violate the necessary condition for compatibility given by Equation (2.6).
Let us now consider the corresponding posterior conditional densities π_i(θ_i | y, θ_j ∀j ≠ i) (i ∈ [[1, r]]). Just like their prior counterparts, they are likely to violate the necessary condition for compatibility. However, each of them represents our opinion about one parameter if all others were known. This is a setting where the results of Section 2 can be applied in order to find the optimal compromise between the Markov kernels R^{r−1} × B(R) → [0, 1] they define. This optimal compromise will then be taken as the posterior distribution of the vector θ. In the following, we describe settings in which there exists a single Gibbs compromise between these Markov kernels. Theorem 2.10 then asserts it is the optimal compromise. We call this compromise the Gibbs reference posterior distribution because of its link to the reference posterior distribution in settings with a one-dimensional parameter θ.
However, even though we call it a "posterior" distribution, it is unclear whether there exists a prior distribution from which it could be derived using Bayes' rule. Denote by π_G(θ|y) the probability density with respect to the Lebesgue measure of the Gibbs reference posterior distribution. Bayes' rule requires that in case a (proper or improper) prior density π_G(θ) exists, there should also exist a function L(y) such that, for almost every θ ∈ R^r in the sense of the Lebesgue measure,

π_G(θ | y) L(y) = π_G(θ) L(y | θ),    (4.6)

where L(y | θ) denotes the likelihood with σ² integrated out. As we have no explicit expression of π_G(θ|y), we have no way to check whether Equation (4.6) holds or not.
In this section, we establish that, whenever Matérn anisotropic geometric or tensorized kernels with known smoothness parameter ν are used, under certain conditions to be detailed later, there exists a unique Gibbs compromise between the reference posterior conditionals, which thanks to Theorem 2.10 is the optimal compromise. Henceforth, it will be called the "Gibbs reference posterior distribution", even though this "posterior" has not been derived from a prior distribution using Bayes' rule.
All proofs for this section can be found in Appendix B.

Definitions
In this work, we use the following convention for the Fourier transform: the Fourier transform ĝ of a smooth function g : R^r → R verifies g(x) = ∫_{R^r} ĝ(ω) e^{i⟨ω|x⟩} dω and ĝ(ω) = (2π)^{-r} ∫_{R^r} g(x) e^{-i⟨ω|x⟩} dx. Let us set up a few notations. (a) K_ν is the modified Bessel function of the second kind with parameter ν; (b) K_{r,ν} is the r-dimensional Matérn isotropic covariance kernel with variance 1, correlation length 1 and smoothness ν ∈ (0, +∞), and K̂_{r,ν} is its Fourier transform, which is proportional to ω ↦ (1 + ‖ω‖²)^{-(ν+r/2)}.
Similarly, we define the Matérn tensorized covariance kernel with variance parameter σ², correlation lengths θ (resp. inverse correlation lengths µ) and smoothness ν as the function x ↦ σ² K^tens_{r,ν}(x/θ) (resp. x ↦ σ² K^tens_{r,ν}(x ⊙ µ)), where the division x/θ and the product x ⊙ µ are componentwise. Thanks to Proposition 2.11, we may choose any parametrization we wish for the Matérn correlation kernels. We have found that the parametrization involving inverse correlation lengths makes proofs easier.
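To make the two kernel families concrete, here is a minimal sketch of both, parametrized by inverse correlation lengths µ and specialized to smoothness 5/2. This is our own illustration; `k_geometric` and `k_tensorized` are hypothetical names.

```python
import numpy as np

def matern52_1d(t):
    # One-dimensional Matérn correlation, smoothness 5/2, unit
    # correlation length, as a function of the (signed) lag t.
    s = np.sqrt(5.0) * np.abs(t)
    return (1.0 + s + s * s / 3.0) * np.exp(-s)

def k_geometric(x, mu):
    # Anisotropic geometric kernel: the 1-d Matérn kernel applied to the
    # Euclidean norm of the componentwise-rescaled lag x * mu (mu = 1/theta).
    return matern52_1d(np.linalg.norm(np.asarray(x) * np.asarray(mu)))

def k_tensorized(x, mu):
    # Tensorized kernel: product of 1-d Matérn kernels, one per dimension.
    return float(np.prod(matern52_1d(np.asarray(x) * np.asarray(mu))))
```

Both kernels equal 1 at lag 0 and decrease as any inverse correlation length µ_i grows, consistently with Lemma B.6 below.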
Several key passages in the proofs (to be found in Appendix B) involve a technical assumption on the design set: its points must be coordinate-distinct, i.e. for every dimension i ∈ [[1, r]], no two design points share the same i-th coordinate. Most randomly sampled design sets almost surely have coordinate-distinct points, for instance those produced by Latin Hypercube Sampling. Cartesian product design sets, however, do not.
Remark. The reference posterior conditionals are invariant by reparametrization, so the Markov kernel P_y does not depend on whether the chosen parametrization uses correlation lengths θ or inverse correlation lengths µ. Due to Proposition 2.11, the Gibbs compromise does not either. The parametrization using inverse correlation lengths µ is more convenient for proving this theorem, however.
Notice that in such a Kriging model, the vector of observations y almost surely belongs to R n \ H, so this assumption is of no practical consequence. Theorem 4.2 therefore asserts that the Gibbs compromise between the incompatible conditionals π i (µ i |y, µ −i ) exists, is unique, and can be sampled from using Potentially Incompatible Gibbs Sampling (PIGS). In the following, it is called "Gibbs reference posterior distribution".
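The PIGS procedure itself is simple to sketch. The following minimal implementation is our own (hypothetical names); it applies one Metropolis step with a normal proposal to each conditional, as is done later for the Kriging conditionals, and uses the random scanning order under which the optimality result holds.

```python
import numpy as np

def pigs(log_cond, r, n_iter, x0, step=0.4, rng=None):
    # Potentially Incompatible Gibbs Sampling (PIGS) with random scanning
    # order. log_cond(i, xi, x) returns the log of the i-th conditional
    # density pi_i(xi | x_{-i}) up to an additive constant; each
    # full-conditional update is one Metropolis step (normal proposal).
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    out = np.empty((n_iter, r))
    for t in range(n_iter):
        for i in rng.permutation(r):          # random scanning order
            prop = x[i] + step * rng.standard_normal()
            logratio = log_cond(i, prop, x) - log_cond(i, x[i], x)
            if np.log(rng.random()) < logratio:
                x[i] = prop
        out[t] = x
    return out
```

When the conditionals are compatible, this reduces to an ordinary (Metropolis-within-)Gibbs sampler targeting the joint distribution; when they are not, the chain still converges to the Gibbs compromise discussed above.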

Using the Gibbs reference posterior distribution
Let x_0 be a point in the domain D that does not belong to the design set. Denote by Σ_{θ,0,·} the correlation matrix between Y(x_0) and Y, and by Σ_{θ,·,0} its transpose, the correlation matrix between Y and Y(x_0). Conditionally to Y = y and assuming θ is known, the random variable Z_0 defined below follows the Student t-distribution with n degrees of freedom:

Z_0 = ( Y(x_0) − Σ_{θ,0,·} Σ_θ^{-1} y ) / sqrt( (y⊤ Σ_θ^{-1} y / n)(1 − Σ_{θ,0,·} Σ_θ^{-1} Σ_{θ,·,0}) ).
Remark. If n exceeds 30, it is usually accepted that the Student t-distribution with n degrees of freedom can be approximated by the standard Normal distribution. As this threshold should be exceeded in practical cases, we would recommend performing all computations as though the Student t-distribution were Normal.
This Proposition implicitly gives the distribution of y_0 := Y(x_0) conditionally to Y = y and θ. For later reference, denote it by L_1(y_0|y, θ). In practice, when θ is unknown, the distribution of y_0 = Y(x_0) conditionally to Y = y can be obtained once θ has been sampled from the Gibbs reference posterior distribution: P(y_0|y) := ∫ L_1(y_0|y, θ) π_G(θ|y) dθ.
Its cdf can be approximated by averaging the cdfs of the Student t-distributions (or their Normal approximations) corresponding to every point in the sample.
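This averaging can be sketched as follows. The code below is our own illustration using the Normal approximation suggested above; `means[k]` and `scales[k]` stand for the Kriging mean and standard deviation at x_0 computed for the k-th posterior draw of θ (hypothetical names).

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(z):
    # Standard normal cdf via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def mixture_cdf(z, means, scales):
    # cdf of the posterior predictive P(y0 | y), approximated by averaging
    # the (Normal-approximated) cdfs of L1(y0 | y, theta) over a posterior
    # sample theta^(1), ..., theta^(m).
    zz = (z - np.asarray(means)) / np.asarray(scales)
    return float(np.mean([normal_cdf(v) for v in zz]))

def mixture_quantile(p, means, scales, lo=-1e3, hi=1e3, tol=1e-8):
    # Invert the mixture cdf by bisection, e.g. to get the 2.5% and 97.5%
    # bounds of a 95% prediction interval.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, means, scales) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Bisection is adequate here because the mixture cdf is monotone; with a single posterior draw the quantiles reduce to those of the usual plug-in predictive distribution.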

Comparisons between the MLE and MAP estimators
To illustrate the inferential performance of the Gibbs reference posterior distribution, let us introduce the Maximum A Posteriori (MAP) estimator. It takes the value θ̂_MAP of θ at which the density with respect to the Lebesgue measure of the Gibbs reference posterior distribution is largest. We contrast it with the Maximum Likelihood Estimator (MLE) θ̂_MLE, which does the same with the likelihood function.
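The paper does not detail how θ̂_MAP is computed given that π_G(θ|y) has no explicit expression; one simple possibility, sketched here under that assumption with a hypothetical helper name, is to evaluate a kernel density estimate over the PIGS sample and keep the sample point of highest estimated density.

```python
import numpy as np

def map_from_sample(sample, bandwidth=0.2):
    # Approximate the MAP from a posterior sample (e.g. the PIGS output):
    # evaluate a Gaussian kernel density estimate at each sample point and
    # return the point where the estimated density is largest.
    S = np.asarray(sample, dtype=float)            # shape (m, r)
    d2 = np.sum((S[:, None, :] - S[None, :, :]) ** 2, axis=-1)
    dens = np.exp(-0.5 * d2 / bandwidth ** 2).sum(axis=1)
    return S[np.argmax(dens)]
```

The bandwidth is a tuning parameter; any density estimator (or a local optimization started from the best sample point) could be substituted.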

Methodology
In this section, we compare the MLE and MAP estimators for accuracy and robustness.
Our test cases are 3-dimensional Gaussian Processes with Matérn anisotropic geometric correlation kernels with smoothness 5/2. Their mean is the null function, which only leaves us with the matter of estimating their correlation length for each dimension.
We use uniform designs: our observation points are randomly generated according to the uniform distribution on a cube with side length 1.
In order to measure the performance of our estimators, we define a suitable distance between two vectors of correlation lengths. Then the error of an estimator is defined as its distance to the "true" vector of correlation lengths.
Let g be the function such that for any t in (−1, 1), g(t) = argtanh(t) and g(−1) = g(1) = 0. We use the convention that, for any matrix M with elements in [0, 1], g(M ) is the matrix resulting from applying g to every element of M .
This distance involves applying the Fisher transformation [17] (that is, the inverse hyperbolic tangent function) to every (non-unitary) correlation coefficient in both associated correlation matrices. This is a variance-stabilizing transformation. Let U and V be random variables following the normal distribution with mean 0 and variance 1, and let ρ denote the correlation coefficient between U and V (−1 < ρ < 1). If ρ̂ is the empirical correlation coefficient computed from a sample of N independent draws of (U, V), then argtanh(ρ̂) approximately follows the normal distribution with mean argtanh(ρ) and variance 1/(N − 3). So the variance of argtanh(ρ̂) does not depend on ρ, whereas the variance of ρ̂ does and goes to zero for |ρ| → 1. Involving the Fisher transformation in the definition of the distance between two vectors of correlation lengths is therefore a way to assert that vectors of correlation lengths can be far apart even if they both lead to highly correlated observations. This ensures that errors made when estimating near-1 correlation coefficients are taken into account no less than errors made when estimating near-0 correlation coefficients.
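Assuming the distance of Definition 5.1 (not reproduced here) compares the correlation matrices induced by two vectors of correlation lengths, it could be sketched as follows; `fisher_distance` is a hypothetical name of our own.

```python
import numpy as np

def fisher_distance(S1, S2):
    # Distance between two correlation matrices: apply the Fisher
    # transformation argtanh elementwise, with the convention g(+-1) = 0
    # (so the unit diagonal is ignored), then take the Frobenius norm of
    # the difference.
    def g(M):
        M = np.asarray(M, dtype=float)
        out = np.arctanh(np.clip(M, -1 + 1e-15, 1 - 1e-15))
        out[np.abs(M) >= 1.0] = 0.0   # convention g(+-1) = 0
        return out
    return float(np.linalg.norm(g(S1) - g(S2)))
```

Because argtanh diverges near 1, two matrices of near-unit correlations can be far apart under this distance even though their raw entries are close, which is exactly the behavior motivated above.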
Let us choose a "true" vector of correlation lengths (and also a variance parameter, but this parameter has no effect on either the MLE or the MAP). Then we need to: (1) Sample n points randomly according to the uniform distribution on the unit cube (in the following, n = 30).

Results
This subsection provides results obtained on 3-dimensional Gaussian Processes with null mean function and Matérn anisotropic geometric correlation kernels with smoothness 5/2. The results are grouped by "true" vector of correlation lengths. In each case, we give in Table 1 the empirical Root Mean Square Errors (RMSEs) of both the MLE and MAP estimators, computed over varying realizations of the Gaussian Process and uniform design sets on the unit cube.

Most of the "true" vectors of correlation lengths featured in Table 1 were selected to showcase the behavior of both estimators in strongly anisotropic cases, but one (0.5 - 0.5 - 0.5) also showcases their behavior when the true kernel is actually isotropic. The final one (0.8 - 1 - 0.9) illustrates performance on a strongly correlated Gaussian Process: this case is fundamentally different from all others, because the Matérn anisotropic geometric family of correlation kernels is designed in such a way that the lowest correlation length has the greatest influence. Informally speaking, it is enough for one correlation length to be near zero to make the whole process very uncorrelated, even if all other correlation lengths are very high.

In all studied cases, the MAP estimator was more robust than the MLE: its RMSE was between 9 and 15% lower, as shown in Table 1.

To get a better sense of the distribution of the error when the design set and the realization of the Gaussian Process vary, we give the corresponding distributions in Figure 1.

Table 1. RMSE (where the error is measured in terms of the distance in Definition 5.1) of the MLE and MAP estimators for several "true" vectors of correlation lengths. The last column displays in percent the decrease of the RMSE of the MAP estimator with respect to the MLE.

Comparison of the MLE and MAP plug-in distributions and the full posterior predictive distribution

Methodology
We use the same test cases as before. In this section, our goal is to assess the accuracy of prediction intervals associated with both estimators and with the full posterior distribution. We consider 95% intervals: the lower bound is the 2.5% quantile and the upper bound the 97.5% quantile of the plug-in distributions P̂_MLE(y_0|y) = L_1(y_0|y, θ̂_MLE), P̂_MAP(y_0|y) = L_1(y_0|y, θ̂_MAP) and of P(y_0|y) = ∫ L_1(y_0|y, θ) π_G(θ|y) dθ. For the sake of comprehensiveness, we also consider predictive intervals associated with the "true" predictive distribution L(y_0 | y, σ², θ), which is the predictive distribution we would use if we knew the correct values of the parameters σ² and θ.
Let us choose a "true" vector of correlation lengths θ (and also a variance parameter σ 2 , but this parameter has no effect on predictive accuracy). Then we do the following: (1) Sample n observation points randomly according to the uniform distribution on the unit cube (in the following, n = 30). (2) Generate the observations of the Gaussian Process at the sampled points according to the selected "true" variance and correlation lengths.
(3) Sample the vector of correlation lengths according to the Gibbs reference posterior distribution π_G(θ|y) through PIGS. (4) Compute the MLE and the MAP of the vector of correlation lengths. (5) Sample n_0 test points randomly according to the uniform distribution on the unit cube (in the following, n_0 = 100). (6) At each point, determine the 95% prediction intervals derived from L(y_0 | y, σ², θ) (σ² and θ being the "true" parameters), P̂_MLE(y_0 | y), P̂_MAP(y_0 | y) and P(y_0 | y). (7) Generate the values of the Gaussian Process at the test points, conditionally to the observations. (8) For each type of prediction interval, count the test points at which the interval covers the generated value. Divide the counts by n_0: this yields four coverages corresponding to each type of predictive intervals. Also compute the mean length of every type of prediction interval. (9) Repeat steps 1 to 8 m − 1 times (in the following, m = 500).
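The coverage and mean-length computations in this procedure reduce to simple counting once the interval bounds and the generated values are available. A minimal sketch (our own helper, hypothetical name):

```python
import numpy as np

def coverage_and_length(y_true, lower, upper):
    # Fraction of test points whose generated value falls inside the 95%
    # prediction interval, and the mean interval length over the test set.
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    covered = (lower <= y_true) & (y_true <= upper)
    return float(covered.mean()), float((upper - lower).mean())
```

This helper would be called once per type of interval (true, MLE, MAP, FPD) and per repetition, then the results averaged over the m repetitions.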

Results
There is no reason for individual coverages of 95% predictive intervals given by the predictive distribution to be equal to 95%. Recall that any coverage is given for a unique realization of the Gaussian Process, and that the values of this process at different points are correlated. If the predictive interval at some point fails to cover the true value at this point, it is likely that predictive intervals at neighboring points will also fail to cover the true values at those points, even though the nominal value is 95% everywhere. Conversely, if it actually covers the true value, then prediction intervals at neighboring points are more than 95% likely to cover their true values.
In short, prediction intervals give information that is only valid if understood to refer to what can be guessed on the sole basis of the observations made at the design points, which is why coverages for individual realizations of the Gaussian Process are not necessarily 95% even if the predictive distribution is perfectly accurate (i.e. based on the true values of σ 2 and θ).
However, if the predictive distribution is perfectly accurate, then the average of the coverages is the nominal value: 95%. It is thus interesting to compute the average of the coverages for all distributions, whether they are plug-in distributions based on the MLE or MAP estimator, or the predictive distribution based on the full posterior distribution (hereafter denoted FPD). In the methodology described above, the average was taken over the realizations of the Gaussian Process with the chosen true parameters and over all design sets with n design points. The results below are obtained in this way.
The results given in Table 2 show that using the full posterior distribution (FPD) to derive the predictive distribution is the best possible choice from a frequentist point of view, as the nominal value is nearly matched by the average coverage. Predictive intervals derived from the MAP estimator do not perform as well, and predictive intervals derived from the MLE perform even worse.

Table 2. Average with respect to randomly sampled design sets and realizations of the Gaussian Process (with variance parameter 1 and smoothness parameter 5/2) of the coverage of 95% Prediction Intervals across the sample space. "True" stands for the prediction based on the knowledge of the true variance parameter and the true vector of correlation lengths.
Let us now focus on the average (with respect to the uniform design sets and realizations of the Gaussian Process) of the mean (over the test set for a given realization of the Gaussian Process and a given uniform design set) length of prediction intervals. The results are given in Table 3, where the figures between parentheses give the increase or decrease (in percent) of the average mean length when compared to the average mean length of prediction intervals obtained using the true values of the parameters.

Table 3. Average with respect to randomly sampled design sets and realizations of the Gaussian Process (with variance parameter 1 and smoothness parameter 5/2) of the mean length of 95% Prediction Intervals across the sample space. The numbers in parentheses represent in percent the increase when using the MLE/MAP/FPD instead of the "true" vector of correlation lengths and variance parameter.
Predictive intervals derived from the full posterior distribution (FPD) are on average the largest, but not much larger than predictive intervals derived using the true parameters. In the tests we conducted, they seemed on average to be larger by about one fifth at worst. Predictive intervals derived from the MLE and MAP estimators are on average shorter than those derived from the true parameters. This can be interpreted as an under-estimation of the uncertainty of the prediction when fixing the vector of correlation lengths to the most likely value given the observations, and this can explain the low observed coverage in Table 2.
In Figure 2, we give violin plots of coverage and mean length of Prediction Intervals in the two most extreme cases: correlation lengths 0.4 - 0.8 - 0.2 (very low correlation) and 0.8 - 1.0 - 0.9 (very high correlation). The results are similar and illustrate the fact that the FPD gives larger intervals in order to reach the desired coverage value.

A higher-dimensional case
In this subsection, we emulate the 10-dimensional Ackley function using Simple Kriging. The goal is to emulate the Ackley function on the unit hypercube [0, 1]^10 using design sets with 100 observation points. Although the impact of the design set type is not the focus of this study, we present the results with a design randomly chosen according to the Uniform distribution on the domain [0, 1]^10, a design obtained through Latin Hypercube Sampling (LHS), and a design obtained through LHS and subsequently optimized to maximize the minimum distance between two points. The Simple Kriging model uses the null function as mean function and the Matérn anisotropic geometric covariance kernel family with smoothness parameter 5/2. The Gibbs reference posterior distribution is accessed through a sample of 1000 points. The conditional densities are sampled using the Metropolis algorithm with a normal instrumental density with standard deviation 0.4 and a 100-step burn-in period.
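The exact expression of the Ackley function used by the authors is not reproduced above; its standard form, with the usual parameter values a = 20, b = 0.2, c = 2π (an assumption on our part), reads in code as:

```python
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2.0 * np.pi):
    # d-dimensional Ackley test function in its standard form; its global
    # minimum is 0, attained at the origin.
    x = np.asarray(x, dtype=float)
    d = x.size
    return (-a * np.exp(-b * np.sqrt(np.sum(x * x) / d))
            - np.exp(np.sum(np.cos(c * x)) / d) + a + np.e)
```

With d = 10 and inputs restricted to [0, 1]^10, this is the function the Simple Kriging model described above would be fitted to.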
To evaluate the performance of prediction intervals, we follow steps 3, 4, 5, 6 and 8 of the method presented in this section (step 7 is skipped as the "values of the Gaussian process" are naturally the values of the Ackley function) with n 0 = 1000. The results are presented in Tables 4 and 5.
As shown in Table 4, prediction intervals derived using the Full Posterior Distribution perform better than those derived from the MAP, which themselves perform better than those derived from the MLE. This order of performance is the same regardless of the type of design set, although the optimized design set leads to much worse average performance for prediction intervals than the unoptimized designs. The latter fact is not surprising, since space-filling designs ensure that no two points can be very close to each other, which makes it harder to determine the correlation lengths.

Table 4. Coverage of 95% prediction intervals when emulating the Ackley function on the unit hypercube using a Gaussian Process with null mean function and a Matérn anisotropic geometric covariance kernel with smoothness 5/2, unknown variance parameter and unknown vector of correlation lengths. The design sets contain 100 points.

Table 5. Mean length of 95% prediction intervals when emulating the Ackley function on the unit hypercube using a Gaussian Process with null mean function and a Matérn anisotropic geometric covariance kernel with smoothness 5/2, unknown variance parameter and unknown vector of correlation lengths. The design sets contain 100 points.
As expected, prediction intervals derived from the Full Posterior Distribution are on average longer than those derived from the MAP and a fortiori the MLE. Notice that prediction intervals are on average shorter with the optimized design set, which explains the poorer performances in terms of coverage.

Conclusion and Perspectives
We provided a theoretical foundation for the claim that the stationary distribution of the Markov chain underlying PIGS with random scanning order is the optimal compromise between the potentially incompatible conditional distributions.
This theory is mainly derived from intuitive conceptions of what a compromise should be. In places where such conceptions were inconclusive, we relied on concrete examples to precisely determine what was acceptable and what was not in a compromise and used it to complete the definition. One strength of this theory is that it can be applied to continuous as well as discrete probability distributions, whereas previous studies focused on the discrete, or even finite, case.
A question that remains open outside the finite-state case is how compatibility of conditional distributions is to be checked in practice. Although compromises are useful, not needing them is better.
Further investigation is needed to fully understand the properties of the optimal compromise. Nevertheless, its invariance by reparametrization and its respect of pairwise independence show that it preserves important features of the conditional distributions.
The theory of optimal compromise suggests a framework for deriving a new objective posterior distribution based on the conditionals yielded by the reference prior theory on Simple Kriging parameters. Applying this framework to Matérn anisotropic kernels, we showed prediction to have good frequentist properties.
Future work should investigate whether this posterior distribution formally corresponds to some joint prior distribution. And if it does, how the joint prior distribution could be accessed, and how it relates to the conditional prior distributions.
Regarding the specific Kriging application presented in this paper, the next step is to extend the framework to Universal Kriging, where instead of being known, the mean function is only assumed to be a linear combination of known functions f 1 , ..., f p . The linear coefficients β 1 , ..., β p are then considered parameters of the model. This extension is of practical relevance, because the mean function can rarely be considered known. It can probably be done in the same way Berger et al. [6] extended the reference prior from the Simple Kriging to the Universal Kriging framework: they used the flat improper prior as joint prior on β 1 , ..., β p conditional to σ 2 and θ and used it to integrate β 1 , ..., β p out of the likelihood function, and then proceeded to derive the reference prior on σ 2 and θ with respect to the integrated likelihood.
A further extension would involve deriving an objective prior on the smoothness parameter ν. In this endeavor, one should take into account the relation between the correlation length θ and the smoothness ν. Unfortunately, asymptotic theory is not of much help in this regard: Anderes [2] shows that, provided the spatial domain D is of dimension at least 5, all parameters of the Matérn anisotropic geometric kernel are microergodic (Zhang [35] shows this to be untrue for spatial domains of dimension 1, 2 or 3, but the non-microergodic parameters are σ² and θ, not ν). This means that the Gaussian measures on D corresponding to Gaussian Processes with two different smoothness parameters are orthogonal, which suggests that there exists a consistent estimator (possibly the MLE). Stein [32] (Section 6.6) considers the Fisher information on θ and ν, and gives examples (with a one-dimensional sample space D) showing that the Fisher information on these parameters depends heavily on the design set. Fisher information relative to the smoothness parameter ν increases when design points are chosen to be close to one another (relative to the "true" correlation length θ), whereas Fisher information relative to the correlation length θ is maximized for design points that are farther apart. This, according to him, is consistent with the fact that θ has greater influence on the low-frequency behavior of the Matérn kernel while ν has greater influence on its high-frequency behavior. This also suggests to us that the smoothness parameter ν, like the variance parameter σ², can only be meaningfully estimated if the vector of correlation lengths θ is known. Otherwise, the estimator could hardly tell which design points are close to each other, which intuitively seems a prerequisite to evaluating the smoothness of the process. If we wish to apply the reference prior algorithm to the case where ν is unknown, we should thus probably derive the reference prior on ν conditional to θ.
Let A be a measurable set such that (1 − t)P (0) + tP ( So Condition (2) is also verified.
Proof of Proposition 2.8. Let P_G be a Gibbs compromise between the sequence of Markov kernels (π_i)_{i∈[[1,r]]}. Then, for any measurable set A such that P_G(A) = 0, Its second condition is also fulfilled because for every integer i ∈ [[1, r]], Proof of Proposition 2.9. Define the probability distribution P on A as follows: Then for every i ∈ [[1, r]] the i-th (r − 1)-marginal distribution P_{−i} of P is given by where the last equality is due to (m_{≠i})_{i∈[[1,r]]} being compatible with (π_i)_{i∈[[1,r]]}. Plugging this into Equation (A.4), we obtain that P is a Gibbs compromise. Equation (A.5) then yields the result.
Proof of Theorem 2.10. If P_G is the unique Gibbs compromise between the sequence of Markov kernels (π_i)_{i∈[[1,r]]}, then Proposition 2.9 asserts that ((P_G)_{−i})_{i∈[[1,r]]} is the only compatible sequence of (r − 1)-dimensional distributions. So any compromise has the same sequence of (r − 1)-marginal distributions. Let P be such a compromise.
For every i ∈ [[1, r]], π_i (P_G)_{−i} = π_i P_{−i} is absolutely continuous with respect to P. So the average P_G is also absolutely continuous with respect to P.
Let λ be a positive measure on A such that P is absolutely continuous with respect to λ. Because P_G and P share the same sequence of (r − 1)-marginal distributions, for λ-almost any ω ∈ Ω, Moreover, Equation (A.4) implies that for λ-almost any ω ∈ Ω, dP_G/dλ(ω) is the arithmetic average between the corresponding terms, so it minimizes the mean squared error. Together with Equation (A.6), this implies that for λ-almost any ω ∈ Ω, Consequently, E_λ(P_G) ≤ E_λ(P). Moreover, if P ≠ P_G, then there exists S ∈ A such that λ(S) > 0 and for every ω ∈ S, So for every ω ∈ S, Equation (A.7) is a strict inequality and thus E_λ(P_G) < E_λ(P). P_G is therefore the unique optimal compromise with respect to λ.
Proof of Proposition 2.11. For every i ∈ [[1, r]], for every S̃_i ∈ Ã_i, Now, for every i ∈ [[1, r]], for every T̃_i ∈ Ã_i, Then, returning to Equation (A.9), Finally, we obtain that P̃_G is a Gibbs compromise between the sequence of Markov kernels (π̃_i)_{i∈[[1,r]]}. We now prove its uniqueness. For every i ∈ [[1, r]], f_i is bijective and f_i^{-1} is measurable. So for any Gibbs compromise Q̃_G between the sequence of Markov kernels (π̃_i)_{i∈[[1,r]]}, Q̃_G ∗ f^{-1} is a Gibbs compromise between the sequence of Markov kernels (π_i)_{i∈[[1,r]]}. Given P_G is the unique Gibbs compromise between the Markov kernels in this sequence, Q̃_G ∗ f^{-1} = P_G, so Q̃_G = P̃_G.

B. Proofs of Section 4
Unless there is mention of the contrary, the following conventions hold. When applied to a vector, ‖·‖ denotes the Euclidean norm; when applied to a matrix, it denotes the Frobenius norm. The choice of norm does not matter much because in finite-dimensional vector spaces, all norms are equivalent.

B.1. Differentiating the Matérn correlation kernel
Lemma B.1. The partial derivative with respect to µ i of the Matérn tensorized kernel of variance σ 2 , smoothness ν and inverse correlation length vector µ is: This can be rewritten as: Proof. The first assertion is a simple matter of differentiating Equation (4.9). In the following calculation, the fourth line is given by formula 9.6.28 (page 376) in Abramowitz and Stegun [1].
Definition B.5. An anisotropic geometric or tensorized correlation kernel is said to be "well-behaved" if its one-dimensional version is, for any set of parameters, a positive decreasing function on [0, +∞) that vanishes in the neighborhood of +∞.
Lemma B.6. Provided a coordinate-distinct design set is used, a well-behaved anisotropic geometric or tensorized correlation kernel parametrized by µ has the following properties: (a) for any fixed µ_{−i} ∈ [0, +∞)^{r−1}, it is a decreasing function of µ_i; (b) as ‖µ‖ → ∞, ‖Σ_µ − I_n‖ → 0.
Lemma B.7. For any well-behaved correlation kernel, as ‖µ‖ → ∞, Proof. This result is due to the fact that all diagonal coefficients of ∂Σ_µ/∂µ_i are null and Σ_µ goes to the identity matrix as ‖µ‖ → ∞.

Let us now define
Lemma B.8. For any well-behaved correlation kernel, as µ → ∞, Proof. Because Σ µ goes to the identity matrix, this is a direct consequence of Lemma B.7.
Corollary B.9. For any well-behaved correlation kernel, there exist S > 0 and 0 < a < b such that, whenever ‖µ‖ ≥ S, In the following, Σ_{µ_{−i}} is the correlation matrix that would be obtained if µ_i were replaced by 0. Moreover, if M is a matrix, M^(kl) is its element in the k-th row and l-th column.
Lemma B.10. If a well-behaved correlation kernel is used, there exist real constants S > 0 and c > 0 such that, for all µ_i ∈ (0, +∞) and whenever ‖µ_{−i}‖ ≥ S, Proof. If a well-behaved correlation kernel is used, then for any ε > 0, Corollary B.9 implies that The last inequality holds because the Frobenius norm of any matrix is smaller than or equal to the sum of the absolute values of its elements and the correlation kernel is a decreasing function of µ_i. Now, for all k ≠ l, when µ_i → +∞, Σ_µ^(kl) → 0 and when µ_i = 0, Σ_µ^(kl) = Σ_{µ_{−i}}^(kl). From this, we deduce that This Lemma has the following immediate consequence: Proposition B.11. If a well-behaved tensorized kernel is used, there exist S > 0 and, for every i ∈ [[1, r]], a function M_i : (0, +∞) → (0, +∞) such that for all Proof. If a tensorized correlation kernel is used, for every pair of integers (k, l) ∈ [[1, n]]² such that k ≠ l, Proof. From (P1), we gather that for all a > 0 and t. Now, because the correlation kernel is well-behaved, K is a decreasing function. As √(t² + a²) ≤ t + a, K(√(t² + a²)) ≥ K(t + a). Plugging this into the previous inequality, we get |K. If t ≥ max(S_1, S_2(a)), we can then use (P2) to obtain (B.16). Independently from this, we have the following algebraic fact: Because we use a well-behaved anisotropic geometric kernel, defining the function M_i^(kl), we can write: with a_kl := |x_i^(k) − x_i^(l)| µ_i and t_kl := ‖(x^(k) − x^(l))_{−i} ⊙ µ_{−i}‖ (and thus, naturally, t_kl² + a_kl² = ‖(x^(k) − x^(l)) ⊙ µ‖²), and provided ‖µ_{−i}‖ is sufficiently large to make all t_kl meet the conditions necessary to apply (P1) and (P2) (which depend, in the case of (P2), on the a_kl), Equation (B.16) yields the existence of some number m. Proof. (P1) is given by Lemma B.2, after noticing that, denoting by K_ν the Matérn one-dimensional kernel of smoothness ν > 1, provided t is sufficiently large, K_ν(t) ≥ K_{ν−1}(t). This inequality ensues from the fact that ∀ν ≥ 0, as t → +∞. Moreover, this last equivalence relation also implies (P2).
Proposition B.15. For Matérn anisotropic geometric correlation kernels with smoothness ν > 1 and for Matérn tensorized correlation kernels with smoothness ν > 0, for any y ∈ R^n \ {0}^n and any δ > 0:

Proof. Set δ > 0 and y ∈ R^n \ {0}^n. There exist m_δ > 0 and M_δ > 0 such that, for all µ ∈ (0, +∞)^r:

This part of the proof relies on combining a spectral study of the Matérn kernels with a study of the matrices appearing in the series expansion of the correlation matrix Σ_µ as µ → 0, for three types of Matérn kernels: isotropic, tensorized and anisotropic geometric.
Lemma B.17. For every design set with coordinate-distinct points x^{(1)}, ..., x^{(n)}, there exists a constant c_x > 0 such that, for all µ ∈ (R_+)^r:

Proof. For every design set x^{(1)}, ..., x^{(n)}, the set of all design sets that can be written ((µ/‖µ‖_∞) ⊙ x^{(1)}, ..., (µ/‖µ‖_∞) ⊙ x^{(n)}) (µ ∈ (R_+)^r) is compact. If the design set x^{(1)}, ..., x^{(n)} has coordinate-distinct points, then every design set in this compact set has no overlapping points. Thus the conclusion follows from Lemma B.16.
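The compactness argument can be sketched as follows (a sketch only: the function c(·) below stands for the positive quantity bounded in Lemma B.16, whose exact form is elided in this extract):

```latex
% The normalized scale vectors u = \mu / \|\mu\|_\infty range over the compact set
%   U = \{ u \in [0,1]^r : \|u\|_\infty = 1 \},
% and u \mapsto c\bigl(u \odot x^{(1)}, \dots, u \odot x^{(n)}\bigr) is continuous
% and positive on U (no two scaled points overlap), hence it attains a positive
% minimum:
\[
  c_x \;:=\; \min_{u \in U}\;
    c\!\left(u \odot x^{(1)}, \dots, u \odot x^{(n)}\right) \;>\; 0 .
\]
```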
Proposition B.18. With Matérn anisotropic geometric or tensorized kernels, for every design set with coordinate-distinct points x^{(1)}, ..., x^{(n)}, as µ → 0:

Proof. For Matérn anisotropic geometric kernels, we need only apply Lemma B.17. In the case of tensorized Matérn kernels, results analogous to Lemma B.16 and Lemma B.17 may be used.
Abramowitz and Stegun [1] give the following results on the modified Bessel function of the second kind (usually denoted K_ν, and which we write 𝒦_ν in order to avoid confusion with the Matérn correlation kernel). If I_ν is the modified Bessel function of the first kind and ψ is the function defined in (6.3.2) by ψ:

This gives us the series expansion of K_{1,ν}(z) (ν ∈ [0, +∞) \ N) when z → 0:

In the remainder of this subsection, we consider a fixed design set with n coordinate-distinct points x^{(k)} (k ∈ [[1, n]]) in R^r. Moreover, all Matérn kernels we consider are assumed to have non-integer smoothness parameter ν.
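For reference, the standard Abramowitz–Stegun identities underlying this expansion are, for non-integer ν (with 𝒦_ν the modified Bessel function of the second kind, as above):

```latex
\[
  \mathcal{K}_\nu(z) \;=\; \frac{\pi}{2}\,
    \frac{I_{-\nu}(z) - I_{\nu}(z)}{\sin(\nu\pi)},
  \qquad
  I_\nu(z) \;=\; \sum_{k=0}^{\infty}
    \frac{(z/2)^{2k+\nu}}{k!\,\Gamma(\nu+k+1)},
\]
% so that, as z \to 0^+,
\[
  \mathcal{K}_\nu(z) \;\sim\; \tfrac{1}{2}\,\Gamma(\nu)\,(2/z)^{\nu},
  \qquad \nu > 0 .
\]
```

Substituting the series for I_{±ν} into the first identity shows that z^ν 𝒦_ν(z) is a sum of even powers z^{2k} plus terms in z^{2ν+2k}, which is what produces the matrices D_k and D_ν introduced below.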
Let us now define, for every nonnegative integer k < ν, the matrix D_k whose (i, j) element is:

Let us also define the matrix D_ν whose (i, j) element is:

If the correlation kernel is Matérn isotropic and ν is not an integer, Σ_µ has the following series expansion when µ → 0+:

In this expansion, µ^{−2ν} R_µ → 0.
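A plausible explicit form of this expansion, consistent with the small-argument behaviour of z^ν 𝒦_ν(z), is the following sketch (the constants c_k and c_ν, which depend on ν, and the displayed entries of D_k and D_ν are assumptions here — the paper's own displays are elided in this extract):

```latex
\[
  \Sigma_\mu \;=\; \sum_{0 \le k < \nu} c_k\,\mu^{2k}\, D_k
    \;+\; c_\nu\,\mu^{2\nu}\, D_\nu \;+\; R_\mu ,
  \qquad \mu^{-2\nu} R_\mu \longrightarrow 0 ,
\]
% with, plausibly,
\[
  D_k^{(ij)} \;=\; \bigl\|x^{(i)} - x^{(j)}\bigr\|^{2k},
  \qquad
  D_\nu^{(ij)} \;=\; \bigl\|x^{(i)} - x^{(j)}\bigr\|^{2\nu} .
\]
```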
For any integer i ∈ [[1, r]] and any nonnegative integer k < ν, define the matrix D_k^i whose (m, p) element is:

and also the matrix D_ν^i whose (m, p) element is:

For every i ∈ [[1, r]], if the points in the design set differed only through their i-th coordinate, the series expansion of the correlation matrix (using a Matérn anisotropic geometric or tensorized kernel) when µ → 0 (and thus when µ_i → 0) would be:

Note the following identities:

If a tensorized correlation kernel is used, the correlation matrix Σ^tens_µ may be written:

where the • above the product symbol denotes the Hadamard (entrywise) product of matrices.
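By definition of a tensorized kernel the correlation factorizes across coordinates, so the entries of Σ^tens_µ multiply coordinatewise, which gives the Hadamard-product representation. The following sketch writes this out, with Σ^{(i)}_{µ_i} denoting the correlation matrix built from the i-th coordinates alone (an assumed notation):

```latex
\[
  \left(\Sigma^{\mathrm{tens}}_\mu\right)^{(mp)}
  \;=\; \prod_{i=1}^{r}
    K_{1,\nu}\!\left(\mu_i\,\bigl|x_i^{(m)} - x_i^{(p)}\bigr|\right)
  \qquad\Longleftrightarrow\qquad
  \Sigma^{\mathrm{tens}}_\mu
  \;=\; \overset{\bullet}{\prod_{i=1}^{r}}\; \Sigma^{(i)}_{\mu_i} ,
\]
% where the product marked with a bullet is taken entrywise (Hadamard product).
```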
In case a Matérn anisotropic geometric kernel is used, define for any nonnegative integer k < ν the matrix D_k(µ) whose (m, p) element is:

Similarly, we may define the matrix D_ν(µ) whose (m, p) element is:

We thus have (if ν ∈ [0, +∞) \ N):

Similar identities to those of Equation (B.31) can be derived to make Equation (B.37) more explicit for small values of ν. Fortunately, for small values of ν, Σ^tens_µ can also be written in a simple form.
Define k_ν as the orthogonal complement in R^n of the vector space spanned by:

There exists c > 0 such that, for any µ ∈ (0, +∞)^r with ‖µ‖ small enough and any v ∈ k_ν:

Proposition B.19. For a Matérn anisotropic geometric or tensorized correlation kernel with smoothness parameter ν ∈ (0, 1) ∪ (1, 2) ∪ (2, 3), for any vector y ∈ R^n not orthogonal to k_ν, when µ → 0, µ^{−2ν} = O(y^⊤ Σ_µ^{−1} y).

Proof. Let d_ν be the dimension of k_ν and let O_ν be an orthogonal n × n matrix whose first n − d_ν columns form an orthonormal basis of k_ν^⊥ and whose last d_ν columns form an orthonormal basis of k_ν.
Write O_ν^⊤ Σ_µ O_ν = (A_µ, B_µ; B_µ^⊤, C_µ) in block form, where the blocks A_µ, B_µ and C_µ are respectively (n − d_ν) × (n − d_ν), (n − d_ν) × d_ν and d_ν × d_ν matrices. Note that A_µ and C_µ represent the restriction of the scalar product defined by Σ_µ to k_ν^⊥ and k_ν respectively. When µ is small enough, defining c > 0 as in Equation (B.46), C_µ ⪯ c‖µ‖^{2ν} I_{d_ν}.
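The way this block decomposition yields Proposition B.19 can be sketched as follows (a sketch under the assumption that the bound from Equation (B.46) reads C_µ ⪯ c‖µ‖^{2ν} I_{d_ν}; the paper's own constants and display may differ):

```latex
% For a positive definite matrix M = (A, B; B^\top, C) and a vector
% \tilde y = (\tilde y_1, \tilde y_2), minimizing the quadratic form
% u \mapsto (u, \tilde y_2)^\top M^{-1} (u, \tilde y_2) over u yields the
% classical bound
%   \tilde y^\top M^{-1} \tilde y \;\ge\; \tilde y_2^\top C^{-1} \tilde y_2 .
% Applied to M = O_\nu^\top \Sigma_\mu O_\nu and \tilde y = O_\nu^\top y, whose
% k_\nu-component \tilde y_2 is nonzero because y is not orthogonal to k_\nu:
\[
  y^\top \Sigma_\mu^{-1} y
  \;\ge\; \tilde y_2^{\,\top}\, C_\mu^{-1}\, \tilde y_2
  \;\ge\; c^{-1}\,\|\mu\|^{-2\nu}\,\|\tilde y_2\|^{2} ,
\]
% whence \mu^{-2\nu} = O\!\left(y^\top \Sigma_\mu^{-1} y\right) as \mu \to 0.
```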

B.4. Lower bound for conditional reference posterior densities
The following Lemma provides the key to proving Theorem 4.2.

Lemma B.23. In a Simple Kriging model with the characteristics described above, there exists a hyperplane H of R^n such that, for any y ∈ R^n \ H and any i ∈ [[1, r]], there exists a measurable function m_{i,y} : (0, +∞) → (0, +∞) such that, for all µ_{-i} ∈ (0, +∞)^{r−1}, the conditional reference posterior density satisfies:

Proof. The proof consists in combining Proposition B.15 and Proposition B.22, which respectively deal with large and small values of ‖µ_{-i}‖.