Efficient sequential experimental design for surrogate modeling of nested codes

Thanks to the increase in computing power, the certification and design of complex systems rely more and more on simulation. To this end, predictive codes are needed, which generally have to be evaluated at a huge number of input points. When the computational cost of these codes is high, surrogate models are introduced to emulate their response. In this paper, we consider the situation where the system response can be modeled by two nested computer codes, meaning that some inputs of the second code are outputs of the first code. More precisely, the idea is to propose sequential designs that improve the accuracy of the nested code's predictor by exploiting the nested structure of the codes. In particular, a selection criterion is proposed to allow the modeler to choose which code to call, depending on the expected learning rate and the computational cost of each code. The sequential designs are based on the minimization of the prediction variance, so adaptations of the Gaussian process formalism are proposed for this particular configuration, in order to quickly evaluate the mean and the variance of the predictor. The proposed methods are then applied to examples.


Introduction
Many industrial issues involve multi-physics phenomena, which can be associated with a series of computer codes. However, when these code networks are used for design, uncertainty quantification, or risk analysis purposes, they are generally considered as a single code. In that case, all the inputs characterizing the system of interest are gathered in a single input vector, and little attention is paid to the potential intermediate results. When trying to emulate such code networks, this is clearly sub-optimal, as much information is lost in the statistical learning, so that too many evaluations of each code are likely to be required to reach a satisfying prediction precision. In this paper, we focus on the case of two nested computer codes, which means that the output of the first code is an input of the second code. We assume that these two computer codes are deterministic, but expensive to evaluate. To predict the value of the nested code at an unobserved point, a Bayesian formalism [23] is adopted in the following. Each computer code is a priori modeled by a Gaussian process, and the idea is to identify the posterior distribution of the combination of these two processes given a limited number of evaluations of the two codes. The Gaussian process hypothesis is widely used in computer experiments ([24,25,22,14,15,4,18,16]), as it allows a very good trade-off between error control, complexity, and efficiency. The two main issues of this approach, also called Kriging, concern the choice of the statistical properties of the Gaussian processes and the choice of the points at which to evaluate the codes. When a single computer code is considered, several methods exist to sequentially add one new point or a batch of new points to an already existing Design of Experiments ([24,25,3,7,6]), in order to minimize the global prediction uncertainty. These methods are generally based on a post-processing of the variance of the code output prediction, whose expression can be explicitly derived under mildly restrictive conditions on the mean and the covariance of the prior Gaussian distribution. The adaptation of these selection criteria to the case of two nested codes is not direct. Indeed, the combination of two Gaussian processes is not Gaussian, so the prediction uncertainty is much more complicated to estimate. Moreover, if the two codes can be launched separately, the selection criterion also has to indicate which of the two codes to launch. In that prospect, the first objective of this paper is to propose several adaptations of the Gaussian process formalism to the nested case, in order to evaluate the first two statistical moments of the code output predictor quickly. Then, original sequential selection criteria are introduced, which try to exploit as much as possible the nested structure of the studied codes. In particular, these criteria are able to take into account the fact that the computational cost associated with the evaluation of each code can be different. The outline of this paper is the following. Section 2 presents the theoretical framework of Gaussian process-based surrogate models and its generalization to the nested case, and introduces several selection criteria based on the prediction variance to sequentially reduce the prediction uncertainty. Section 3 introduces a series of simplifications that allow a quick evaluation of the prediction variance. In Section 4, the presented methods are eventually applied to two examples.
The proofs of the results that will be presented in the following sections have been moved to the appendix.

Notations
In this paper, the following notations are adopted:
• x, y correspond to scalars.
• x, y correspond to vectors.
• X, Y correspond to matrices.
• The entries of a vector x are denoted by (x) i , whereas the entries of a matrix X are denoted by (X) ij .
• X T denotes the transpose of a matrix X.
• N (x, X) corresponds to the multidimensional Gaussian distribution, whose mean vector and covariance matrix are respectively given by x and X.
• GP(m, k) corresponds to the distribution of a Gaussian process whose mean function is m, and whose covariance function is k.
• E [•] and V(•) are the mathematical expectation and the variance respectively.
• For all real-valued functions y and z that are square integrable on X, (·, ·)_X and ‖·‖_X denote respectively the classical scalar product and norm in the space of square-integrable real-valued functions on X:

(y, z)_X := ∫_X y(x) z(x) dx,   ‖y‖²_X := (y, y)_X.   (2.1)

General framework
Let S be a system characterized by a vector of input parameters, x_nest ∈ X_nest. Let y_nest : X_nest → R be a deterministic mapping used to analyze the studied system. In this paper, we focus on the case where the function x_nest ↦ y_nest(x_nest) can be modeled by two nested codes. Two quantities of interest, y_1 and y_2, are thus introduced to characterize these two codes; they are supposed to be real-valued continuous functions on their respective definition domains X_1 and R × X_2. Given these two functions, the nested code is defined as follows:

y_nest(x_nest) := y_2(y_1(x_1), x_2),   where x_nest = (x_1, x_2) ∈ X_nest := X_1 × X_2.   (2.2)

The sets X_1 and X_2 are moreover supposed to be compact subsets of R^{d_1} and R^{d_2} respectively, where d_1 and d_2 are two positive integers. In theory, the definition domains may be unbounded, but the reduction to compact sets ensures the square integrability of y_nest on X_nest. Given a limited number of evaluations of the functions x_1 ↦ y_1(x_1) and (ϕ_1, x_2) ↦ y_2(ϕ_1, x_2), the objective is to build a stochastic predictor of y_nest with the following properties:
• its mean is as close as possible to the true output of the nested code, that is, the bias is small;
• its uncertainty (given by its variance) is as small as possible.
In other words, the mean square error of the stochastic predictor has to be small.
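The nested structure defined above can be sketched with two toy codes (the functions below are hypothetical illustrations, not the paper's examples): the first code maps x_1 to the intermediate quantity ϕ_1 = y_1(x_1), which the second code consumes together with its own inputs x_2.

```python
import numpy as np

def y1(x1):
    # first code: produces the intermediate output phi1 (illustrative formula)
    return np.sin(x1[0]) + x1[0] ** 2 / 10.0

def y2(phi1, x2):
    # second code: takes phi1 and its own inputs x2 (illustrative formula)
    return phi1 * np.exp(-x2[0]) + x2[1]

def y_nest(x_nest):
    # the nested code is the composition y2(y1(x1), x2)
    x1, x2 = x_nest
    return y2(y1(x1), x2)

print(y_nest((np.array([1.0]), np.array([0.5, 2.0]))))
```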

Gaussian process-based surrogate models
Gaussian process regression (GPR), or Kriging, is a technique widely used to replace an expensive computer code by a surrogate model, that is to say a fast-to-evaluate mathematical function. GPR is based on the assumption that the two code outputs, y_1 and y_2, can be seen as sample paths of two stochastic processes, which are supposed to be Gaussian for the sake of tractability, with µ_i and C_i denoting respectively the mean and the covariance functions of the process modeling y_i, for 1 ≤ i ≤ 2. Let x^(1), ..., x^(N_1) be N_1 elements of X_1, and (ϕ^(1), x_2^(1)), ..., (ϕ^(N_2), x_2^(N_2)) be N_2 elements of R × X_2.

Denoting by

y_1^obs := (y_1(x^(1)), ..., y_1(x^(N_1))),   y_2^obs := (y_2(ϕ^(1), x_2^(1)), ..., y_2(ϕ^(N_2), x_2^(N_2))),   (2.4)

the vectors that gather the evaluations of y_1 and y_2 at these points, it can be shown that the conditioned processes y_i^c remain Gaussian, and we refer to [24,25] for further details about the expressions of the conditioned mean functions, µ_i^c, and conditioned covariance functions, C_i^c. According to Eq. (2.2), the nested code, x_nest ↦ y_nest(x_nest), can thus be seen as a particular realization of the conditioned process y_nest^c. Under this Gaussian formalism, the best prediction of y_nest at any unobserved point is given by the mean value of y_nest^c(x_1, x_2), whereas its variance can be used to characterize the trust we can put in that prediction. As explained in the Introduction, there is no reason for y_nest^c to be Gaussian, but according to Proposition 2.1, its first- and second-order moments can be obtained by computing two one-dimensional integrals with respect to a Gaussian measure. This can be done by quadrature rules or by Monte Carlo methods ([2]).
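The Monte Carlo route mentioned above can be sketched as follows. The conditioned code-1 predictor at a point x_1 is Gaussian, so samples of ϕ_1 are drawn from it and propagated through the conditioned mean and variance of code 2; the first two moments of the nested predictor then follow from the laws of total expectation and total variance. The conditioned means and standard deviations below are hypothetical stand-ins, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical conditioned code-1 predictor at a fixed x1: N(mu1c, s1c^2)
mu1c, s1c = 1.2, 0.3
# hypothetical conditioned mean and std of code 2 at (phi1, x2), x2 fixed
mu2c = lambda phi1: np.sin(phi1) + phi1
s2c = lambda phi1: 0.1 * (1 + phi1 ** 2)

# sample the intermediate variable phi1 ~ y1c(x1)
phi1 = rng.normal(mu1c, s1c, size=100_000)

# law of total expectation / total variance for the composition:
mean = mu2c(phi1).mean()
var = (s2c(phi1) ** 2).mean() + mu2c(phi1).var()
print(mean, var)
```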

Parametric representations of the mean and covariance functions
As explained in the Introduction, the relevance of the Gaussian process predictor strongly depends on the definitions of µ_i and C_i. When the only available information about y_i is a finite set of evaluations, these functions are generally chosen within parametric families. In this paper, the functions C_i are supposed to belong to the Matérn-5/2 class (see [25,17] for further details about classical parametric expressions for C_i), with θ_i the hyper-parameters that characterize these covariance functions, whereas linear representations are considered for the mean functions, where h_i is a given M_i-dimensional vector of basis functions (see [21] for further details on the choice of the basis functions). In the following, the framework of "Universal Kriging" is adopted, which consists in:
• assuming an (improper) uniform distribution for the trend coefficients β_i,
• conditioning all the results by the maximum likelihood estimate of θ_i,
• integrating the conditioned distribution of y_i over β_i.
In that case, the distribution of y_i^c, which is defined by Eq. (2.5), is Gaussian, and its statistical moments can be explicitly derived (see [24,5,3,21]).
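A minimal Universal Kriging sketch, under stated assumptions: the trend is h(x) = (1, x) with coefficients estimated by generalized least squares (which is what integrating β out under an improper uniform prior yields for the posterior mean), the covariance is Matérn-5/2 with a fixed correlation length, and the data are synthetic.

```python
import numpy as np

def matern52(a, b, theta):
    # Matern-5/2 correlation between two 1-D point sets
    r = np.abs(a[:, None] - b[None, :]) / theta
    return (1 + np.sqrt(5) * r + 5 * r ** 2 / 3) * np.exp(-np.sqrt(5) * r)

x = np.array([-1.0, -0.3, 0.4, 1.2, 2.0])
y = np.sin(2 * x) + 0.5 * x                         # synthetic code outputs
theta = 0.8                                         # fixed correlation length

K = matern52(x, x, theta) + 1e-10 * np.eye(len(x))  # tiny nugget
H = np.column_stack([np.ones_like(x), x])           # trend basis h(x) = (1, x)

Ki = np.linalg.inv(K)
beta = np.linalg.solve(H.T @ Ki @ H, H.T @ Ki @ y)  # GLS trend estimate

def predict(xs):
    # Universal Kriging posterior mean: trend + correction by the residuals
    k = matern52(xs, x, theta)
    h = np.column_stack([np.ones_like(xs), xs])
    return h @ beta + k @ Ki @ (y - H @ beta)

print(predict(np.array([-1.0, 0.4])))               # interpolates the data
```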

Sequential designs for the improvement of Gaussian process predictors
The relevance of the predictor y_nest^c strongly depends on the space-filling properties of the sets gathering the inputs of the available observations of y_1 and y_2, which are generally called Designs of Experiments (DoE). Space-filling Latin Hypercube Samplings (LHS) or quasi-Monte-Carlo samplings are generally chosen to define such a priori DoE ([9,8,20]). The relevance of the predictor can then be improved by adding new points to an already existing DoE, since the higher the values of N_1 and N_2, the more likely the prediction error over X_nest is to be small.
In the case of a single code, most of the existing selection criteria for adding a new point are based on the minimization of a quantity associated with the predictor variance, such as its integral over the input domain [24,25,7,3,6,19,13,11]. Indeed, if z is a Gaussian process indexed by x in X, and if we denote by k its covariance function, the variance of the conditioned random variable z(x) | z(x_new), where x and x_new are any elements of X, is given by:

V(z(x) | z(x_new)) = k(x, x) − k(x, x_new)² / k(x_new, x_new),

so that it does not depend on the (unknown) value of z(x_new). To minimize the global uncertainty over z at a reduced computational cost, a natural approach consists in searching for the value of x_new such that the integral of this conditioned variance over X is minimal (under the condition that this integral exists).
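Because the conditioned variance does not involve the value z(x_new), the integrated variance can be scanned over candidate points before running the code. A minimal sketch for a single zero-mean GP with a squared-exponential kernel on [0, 1] (synthetic setup, not the paper's examples):

```python
import numpy as np

# squared-exponential covariance (unit variance, unit length-scale)
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

grid = np.linspace(0, 1, 201)          # quadrature grid over X
candidates = np.linspace(0, 1, 21)     # candidate new observation points

def integrated_variance(x_new):
    # V(z(x) | z(x_new)) = k(x,x) - k(x,x_new)^2 / k(x_new,x_new),
    # averaged over the grid as an empirical integral over X
    xn = np.array([x_new])
    var = k(grid, grid).diagonal() - k(grid, xn)[:, 0] ** 2 / k(xn, xn)[0, 0]
    return var.mean()

best = min(candidates, key=integrated_variance)
print(best)
```

For a stationary kernel on a symmetric domain with no prior observations, the criterion picks the center of the domain, as expected.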
In the nested case, we also have to choose on which code to add a new observation point. To this end, let τ_1 and τ_2 be the numerical costs (in CPU time for instance) associated with the evaluations of y_1 and y_2 respectively. For the sake of simplicity, we assume that these numerical costs are independent of the value of the input parameters, and that they are a priori known. Two selection criteria are eventually proposed to sequentially optimize the relevance of the Gaussian process predictor. To simplify the reading, we denote by V(y_nest^c(x_nest) | x̃_i) the variance of y_nest^c(x_nest) under the hypothesis that the code(s) corresponding to the new point x̃_i is (are) evaluated at this point (in practice, we recall that these code evaluations are not required for the estimation of this variance).
• First, the chained I-optimal criterion selects the best point in X_1 × X_2 to minimize the integrated variance of the predictor of the nested code:

x̃^new = arg min_{x̃ ∈ X_1 × X_2} ∫_{X_nest} V(y_nest^c(x_nest) | x̃) dx_nest.   (2.13)

Such a criterion is a priori adapted to the case where codes 1 and 2 cannot be run independently.
• Secondly, the best I-optimal criterion selects the best among candidate points in X_1 and X_2, in order to maximize the decrease, per unit of computational cost, of the integrated predictor variance of the nested code:

i^new = arg max_{i ∈ {1,2}} (1/τ_i) ∫_{X_nest} [ V(y_nest^c(x_nest)) − V(y_nest^c(x_nest) | x̃_i) ] dx_nest.   (2.14)

In that case, the difference in the computational costs is taken into account, and a linear expected improvement per unit of computational cost is assumed for the sake of simplicity.
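The cost-aware trade-off of the best I-optimal criterion can be sketched in a few lines: among one candidate per code, pick the code whose expected decrease of the integrated nested-predictor variance per unit of CPU cost is largest. The decrease values below are placeholders standing for the integrals in Eq. (2.14).

```python
# CPU costs of the two codes (code 1 is ten times more expensive here)
tau = {1: 10.0, 2: 1.0}
# placeholder integrated variance decreases V - V(.|x_i) over X_nest
decrease = {1: 0.40, 2: 0.08}

# best I-optimal choice: largest decrease per unit of computational cost
best_code = max((1, 2), key=lambda i: decrease[i] / tau[i])
print(best_code)
```

Even though code 1 would reduce the variance more in absolute terms (0.40 vs 0.08), code 2 wins here because 0.08/1 > 0.40/10.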

Fast evaluation of the prediction variance
As explained in Section 2.5, to choose the position of the new point, for each potential value of x̃_i in X_i we need to compute the value of V(y_nest^c(x_nest) | x̃_i) for all x_nest in X_nest. If quadrature rules or Monte Carlo approaches are used to evaluate this variance, as proposed in Section 2.3, the optimization procedure quickly becomes extremely demanding, even if discretized approximations of the optimization problems defined by Eqs. (2.13) and (2.14) are considered, that is to say, if the integral over X_nest is replaced by an empirical mean over an N_nest-dimensional set of randomly chosen points of X_nest. To circumvent this problem, we present in this section several approaches that make the evaluation of V(y_nest^c(x_nest) | x̃_i) explicit, and therefore extremely fast.
3.1. Explicit derivation of the first two statistical moments of the nested code predictor

Proposition 3.1, whose precise statement relies on the notations of the Universal Kriging framework introduced in Section 2.4, can be summarized as follows: if the prior Gaussian process modeling the function y_2 can be seen as any derivative of a Gaussian process whose trend is a linear combination of products of polynomials with exponentials of order at most 2, and whose covariance function belongs to the Gaussian class, then, conditionally to some integration criteria, the moments of order 1 and 2 of the coupling of the predictors of the two codes can be computed explicitly at a reduced cost. However, the approach cannot be generalized to the coupling of more than two codes.

Linearized approach
When the conditions of Proposition 3.1 are not fulfilled (or if more than two codes were considered), another approach can be used, which is based on a linearization of the process modeling the nested code. For i ∈ {1, 2}, let ε_i^c be the Gaussian process such that y_i^c = µ_i^c + ε_i^c. By construction, ε_i^c is the residual prediction uncertainty once y_i has been conditioned by N_i evaluations of y_i. We recall that these two Gaussian processes are statistically independent. Under the condition that N_1 is not too small compared to the complexity of y_1, it is therefore reasonable to assume that ε_1^c is small compared to µ_1^c.
Proposition 3.2. If the predictor of the two nested computer codes can be written y_nest^c(x_nest) = y_2^c(y_1^c(x_1), x_2), where the y_i^c = µ_i^c + ε_i^c are Gaussian processes, and if ε_1^c is small enough for the linearization around µ_1^c to be valid, then the predictor of the two nested computer codes can be approximated by a Gaussian process whose mean and covariance functions are explicit. Hence, thanks to the proposed linearization, the variance of y_nest^c(x_nest), but also that of y_nest^c(x_nest) | x̃_i, can be explicitly derived for all (x_nest, x̃_i) in X_nest × X_i. Under the condition that the linearization is valid, this approach can be applied to configurations with more than two nested codes. However, it can be inferred from Eq. (3.3) that the variance depends on y_1^obs through µ_1^c. To circumvent this problem for the evaluation of the forward variance in the sequential designs, we assume that a candidate x̃_1 is associated with the current estimate of the output of the first code, µ_1^c(x̃_1), in accordance with the Kriging Believer strategy proposed in [10].
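The linearization idea can be illustrated numerically with hypothetical conditioned predictors (stand-ins, not the paper's formulas): writing y_1^c = µ_1^c + ε_1^c and expanding y_2^c to first order around µ_1^c gives an explicit mean µ_2^c(µ_1^c, x_2) and an explicit variance (∂µ_2^c/∂ϕ_1)² (σ_1^c)² + (σ_2^c)², which a Monte Carlo reference confirms when ε_1^c is small.

```python
import numpy as np

mu1c, s1c = 1.2, 0.05                      # code-1 predictor at x1 (small eps1c)
mu2c = lambda phi1: np.sin(phi1) + phi1    # hypothetical conditioned mean of code 2
dmu2c = lambda phi1: np.cos(phi1) + 1.0    # its derivative w.r.t. phi1
s2c = 0.1                                  # hypothetical conditioned std of code 2

# linearized (delta-method) moments of the nested predictor
mean_lin = mu2c(mu1c)
var_lin = dmu2c(mu1c) ** 2 * s1c ** 2 + s2c ** 2

# Monte Carlo reference for the exact composition
rng = np.random.default_rng(1)
phi1 = rng.normal(mu1c, s1c, 200_000)
mc = mu2c(phi1) + rng.normal(0.0, s2c, phi1.size)
print(mean_lin, var_lin, mc.mean(), mc.var())
```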

Applications
The previously proposed methods are applied to two examples: an analytical one-dimensional example and a multidimensional one.

Characteristics of the examples

4.1.1. Analytical example
In the analytical example, the properties of the Gaussian process mean functions and of the codes are specified for x_1 ∈ [−7, 7]. In this example, X_2 = ∅.
Figure 1 shows the variations of the outputs of code 1, code 2, and the nested code. The outputs of codes 1 and 2 are relatively smooth compared with that of the nested code, whose amplitude of variation is strongly non-stationary.

Hydrodynamic example
This example consists in the coupling of two computer codes.The objective is to determine the impact point of a conical projectile.
The first code computes the drag coefficient of a cone divided by the height of the cone.Its inputs are the height and the half-angle of the cone, so the dimension of x 1 is 2.
The second code computes the range of the ballistic trajectory of the cone. Its inputs are the output of the first code, associated with ϕ_1, and the initial velocity and angle of the ballistic trajectory, gathered in x_2. The dimension of x_2 is therefore 2. Figure 3 shows the variations of the output with respect to each component of the input for each code. This figure makes it possible to propose a basis of functions for the prior mean of the processes associated with the two codes.
For the first code, the scatter plots highlight a linear variation with respect to (x_1)_1 and a multiplicative-inverse variation with respect to (x_1)_2, which suggests the corresponding basis functions. For the second code, only a multiplicative-inverse variation with respect to y_1 is evident. The associated denominator is bounded below by ϕ_1^min in order to avoid any inversion problem around zero; this bound is small and set arbitrarily.
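The trend bases suggested by the scatter plots can be sketched as follows; the exact bases and the value of ϕ_1^min are illustrative assumptions. Code 1 combines a constant, a linear term in the height (x_1)_1, and an inverse term in the half-angle (x_1)_2; code 2 combines a constant and an inverse term in ϕ_1, with the denominator clamped below by ϕ_1^min.

```python
import numpy as np

PHI1_MIN = 1e-3   # small, arbitrary lower bound on the denominator

def h1(x1):
    # trend basis for code 1: constant, linear in height, inverse in half-angle
    return np.array([1.0, x1[0], 1.0 / x1[1]])

def h2(phi1, x2):
    # trend basis for code 2: constant, inverse in phi1 (clamped near zero)
    return np.array([1.0, 1.0 / max(phi1, PHI1_MIN)])

print(h1(np.array([2.0, 0.5])), h2(0.0, None))
```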

Reference: "blind box" method
In this method, the nested computer code is considered as a single computer code: only the inputs x_nest and the output y_nest are taken into account, and the intermediate information ϕ_1 is discarded. A Gaussian process regression of this single computer code is then performed.
Only the chained I-optimal sequential design can be applied in this framework, since the other proposed sequential design requires access to the partial information.

Choice of the covariance functions and estimation of their hyperparameters
In the analytical example, the covariance functions are Gaussian. This implies that the sample paths of the Gaussian processes associated with the codes are infinitely differentiable, which makes it possible to apply Proposition 3.1 and Proposition 3.2 to this example. In the hydrodynamic example, the covariance functions are Matérn-5/2, which implies that the sample paths of the Gaussian processes associated with the codes are one-time continuously differentiable in the mean-square sense (see [22]). This makes it possible to perform the linearization of Proposition 3.2. In both cases, the covariance functions include a non-zero nugget term (see [12] for further details).
The hyperparameters of the covariance functions are re-estimated for each set of observations, including during the sequential designs. They are estimated by maximizing the leave-one-out log predictive probability (see [22], chapter 5, and [1]).
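The leave-one-out (LOO) criterion can be evaluated without refitting the model N times, using the closed-form identities µ_i = y_i − [K⁻¹y]_i / [K⁻¹]_ii and σ_i² = 1 / [K⁻¹]_ii, with K the covariance matrix of the observations ([22], chapter 5). A sketch of the hyperparameter search, with an illustrative squared-exponential kernel and synthetic data rather than the paper's examples:

```python
import numpy as np

def loo_log_prob(theta, x, y):
    # covariance matrix for correlation length theta, plus a small nugget
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / theta ** 2)
    K += 1e-8 * np.eye(len(x))
    Ki = np.linalg.inv(K)
    s2 = 1.0 / np.diag(Ki)              # LOO predictive variances
    mu = y - (Ki @ y) * s2              # LOO predictive means
    return np.sum(-0.5 * np.log(2 * np.pi * s2) - (y - mu) ** 2 / (2 * s2))

x = np.linspace(0.0, 3.0, 15)
y = np.sin(2.0 * x)                     # synthetic observations
thetas = np.linspace(0.1, 2.0, 39)
theta_hat = max(thetas, key=lambda t: loo_log_prob(t, x, y))
print(theta_hat)                        # length maximizing the LOO criterion
```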

Comparison between the analytical and the linearized method
Figure 4 illustrates the convergence of the first two statistical moments estimated with the Monte Carlo method (see Proposition 2.1) and with the linearized method (see Proposition 3.2) towards their true values, computed with the analytical method described in Proposition 3.1. Both methods converge as the uncertainty of the first code predictor decreases. It can be seen that the linearized method offers a very good compromise between computation time and accuracy compared with the Monte Carlo method. Given a set of validation points and the evaluations of the nested code at these points, the performance criterion of the nested predictor mean, also called the error on the mean, can be defined as the norm of the difference between the nested code output and the predictor mean over the validation set, normalized by the norm of the nested code output.

Comparison between the blind box and the linearized methods
Figure 5 shows that the linearized method better accounts for the non-stationarity of the variations of the nested code output. On the contrary, with the blind box method the magnitude of the prediction interval is the same across the input domain and depends only on the distance to the observation points: the prediction interval is too wide in the areas with small variations and too narrow in the areas with larger variations.
Figure 6 shows that the prediction means computed with the analytical and linearized methods of Proposition 3.1 and Proposition 3.2 have similar accuracies.
For both examples, the precision of the prediction mean is better with the linearized method than with the blind box method, showing the interest of taking into account the intermediary information.
4.7. Performances of the sequential designs with identical computational costs

Figure 7 shows the relevance of the proposed sequential designs for improving the prediction mean of the linearized nested predictor, compared to the maximin LHS design on X_nest.
In the analytical example, the best I-optimal sequential design enables to obtain the most accurate prediction mean at a given computational cost.
In the hydrodynamic example, the different sequential designs give similar results, except for the first new observation points added, where the best I-optimal is better.
In both examples, the new observation points are mostly added on the first code, as shown in Figure 8. The uncertainty propagated from the first code into the second one appears to be predominant at the beginning; the best I-optimal sequential design therefore first reduces this uncertainty by adding new observation points on the first code, before adding observation points on both codes.
4.8. Performances of the sequential designs with different computational costs

Figure 9 shows the prediction mean accuracy of the best I-optimal sequential design when the costs of the two codes are different. At a given total computational cost, the accuracy of the prediction is better when the cost of the first code is lower; in other words, the prediction mean accuracy at a given budget improves when more observation points can be added to the first code for the same computational budget. These results are consistent with those of Figure 8.

Conclusions and future work
In this paper, the Gaussian process formalism has been adapted to the case of two nested computer codes. Two methods to quickly evaluate the mean and variance of the nested code predictor have been proposed. The first one, called "analytical", computes the exact value of the first two moments of the predictor, but cannot be applied to the coupling of more than two codes. The second one, called "linearized", yields a Gaussian predictor of the two nested codes, whose mean and variance can be computed instantly; this approach could be generalized to the coupling of more than two codes. Both proposed methods take into account the intermediate information, that is, the output of the first code. A comparison to the reference method, called "blind box", has been made: in this method, a Gaussian process regression of the block of the two codes is performed without considering the intermediate observations.

Figure 7: Comparison of the linearized predictor mean precision with the maximin LHS design on X_nest and with the sequential designs applied to the two examples. In the hydrodynamic example, the two curves representing the sequential designs are almost superimposed. The initial designs are the same for the three curves, with a size of 10 points for the analytical example and 20 points for the hydrodynamic example. The draw of the chained maximin LHS designs is repeated 50 times and the curves present the median of the associated results. The costs of the two codes are assumed to be the same.

Moreover, two sequential designs have been proposed in order to improve the prediction accuracy of the nested predictor. The first one, the "chained" I-optimal sequential design, corresponds to the case where the two codes cannot be launched separately. The second one, the "best" I-optimal sequential design, makes it possible to choose on which of the two codes to add a new observation point, and to take into account the different computational costs of the two codes. The numerical applications show the interest of the sequential designs compared to a space-filling design (maximin LHS). Furthermore, they illustrate the advantage, in terms of prediction mean accuracy, of choosing on which code to add a new observation point, compared to simply adding new observation points of the nested code. The results obtained show an amplification of the uncertainties along the chain of codes, leading the best I-optimal sequential design to first add observation points on the first code. It can be expected that this behavior would be similar for the coupling of more than two codes; in other words, the uncertainty at the beginning of the chain should be reduced as a priority. This paper has focused on the case of two nested codes with a scalar intermediate variable; considering the case of a functional intermediate variable seems promising for future work.

Since the moments of a Gaussian variable can be calculated analytically, E[x^c g(x)], and therefore E[f(x, a, b, c)], can be computed analytically. So we have shown that if x ∼ N(µ, σ²) and f(x, a, b, c) = x^c exp(ax + bx²), then, under the integrability condition 1 − 2bσ² > 0, the mean of f(x, a, b, c) can be calculated analytically.
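The special case c = 0 of this result can be checked numerically: completing the square in the Gaussian integral gives E[exp(ax + bx²)] = exp((aσ² + µ)²/(2σ²(1 − 2bσ²)) − µ²/(2σ²)) / √(1 − 2bσ²) whenever 1 − 2bσ² > 0. A sketch comparing this closed form against a Gauss-Hermite quadrature reference:

```python
import numpy as np

mu, s, a, b = 0.3, 0.5, 0.7, 0.4
d = 1.0 - 2.0 * b * s ** 2                      # integrability requires d > 0

# closed form obtained by completing the square in the exponent
closed = np.exp((a * s ** 2 + mu) ** 2 / (2 * s ** 2 * d)
                - mu ** 2 / (2 * s ** 2)) / np.sqrt(d)

# Gauss-Hermite reference: E[g(x)] = pi^{-1/2} sum_i w_i g(mu + sqrt(2) s t_i)
t, w = np.polynomial.hermite.hermgauss(60)
xq = mu + np.sqrt(2.0) * s * t
gh = (w * np.exp(a * xq + b * xq ** 2)).sum() / np.sqrt(np.pi)
print(closed, gh)
```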

First moment
In the framework of Universal Kriging, the conditional mean function of the process modeling the second code can be written as the sum of a trend term, involving the mean basis functions h_2 and a deterministic vector v_h, and of a correction term, involving the covariance function C_2 evaluated at the observation points and a deterministic vector v_c.
According to the assumptions of Proposition 3.1, the mean basis functions h_2 can be written as products of deterministic functions m_i with polynomials and exponentials of ϕ_1 of order at most 2.
In the same way, the covariance function C_2 can be written in terms of the function k : x ↦ exp(−x²/2) and of its derivatives, with n_{ϕ_1} and n_i positive integers and k^(n) denoting the n-th derivative of k. We can therefore write µ_2^c as a combination of terms involving a deterministic function l, defined according to the previous equation, and real coefficients a_j.
The terms (1) and (2) of the previous equation can then be handled as follows: since m_i and l are deterministic functions, v_h, v_c, x, and x_2 deterministic vectors, and the ϕ^(i) and a_j deterministic real numbers, the means E[(1)] and E[(2)], and consequently the mean E[µ_2^c((ϕ_1, x_2))], can be calculated analytically.
If the covariance function is an element of the Gaussian class, or corresponds to the covariance function of some derivative of a zero-mean process with covariance function of the Gaussian class, then the conditional moments of order 1 and 2 of y_nest^c(x_1, x_2), which are defined by Eqs. (2.7) and (2.8), can be calculated analytically.

Figure 2 illustrates the inputs and outputs of the two codes.

Figure 2: Hydrodynamic example: Inputs and outputs of the two codes.

Figure 3: Hydrodynamic example: variation of the outputs y_1 and y_2 of the two codes with respect to the components of the inputs x_1 and x_2. The 20 input points are drawn according to a maximin LHS design on X_1 × X_2.

Figure 4: Comparison of the linearized (Proposition 3.2) and Monte Carlo (Proposition 2.1) methods in terms of computation time and accuracy for the evaluation of the first two moments of the process y_nest^c. The Monte Carlo method is run with 100 and 1000 points to compute the one-dimensional integral with respect to a Gaussian measure. The Monte Carlo draws are repeated 50 times and the curves correspond to the median of these repetitions. The reference values are computed with the analytical method (Proposition 3.1). The covariance functions are Gaussian. The predictor of the first code is of the form y_1^c = µ_1^c + σ_1^c u with u ∼ N(0, 1), σ_1^c ∈ {10⁻⁴, 10⁻³, 10⁻², 10⁻¹}, and for each value of σ_1^c, 100 values of µ_1^c on a grid on [−2, 4] are considered. The predictor of the second code is built using 20 observation points drawn on a grid on [−2, 4] for the second code of the analytical example.
The numerical examples illustrate the interest of taking into account the intermediate information in terms of prediction mean accuracy.