QUANTIFYING THE CLOSENESS TO A SET OF RANDOM CURVES VIA THE MEAN MARGINAL LIKELIHOOD

In this paper, we tackle the problem of quantifying the closeness of a newly observed curve to a given sample of random functions, assumed to have been drawn from the same distribution. We define a probabilistic criterion for this purpose, based on the marginal density functions of an underlying random process. For practical applications, a class of estimators based on the aggregation of multivariate density estimators is introduced and proved to be consistent. We illustrate the effectiveness of our estimators, as well as the practical usefulness of the proposed criterion, by applying our method to a dataset of real aircraft trajectories.

Mathematics Subject Classification. 62G07, 62G05, 62P30, 62M09.

Received April 23, 2019. Accepted December 10, 2020.

Density estimation is a longstanding problem in statistics and machine learning, and many parametric and nonparametric techniques have been proposed to address it in a finite-dimensional setting.
For functional data, density estimation has been studied for example by [5], who proposed an extension of the well-known kernel density estimator. Similarly, reference [24] developed a delta-sequence method for functional density estimation.
As the distribution of a functional random variable is hard to grasp and does not have good topological properties, being defined on sets of a space which is "too large" (see e.g. [2,15]), alternatives were proposed for casting this problem into a finite-dimensional setting. For example, reference [25] proposed to project the random curves on basis functions, while [14] suggested studying the structure of the distribution of a functional random variable by estimating its modes and density ascent lines. We propose another finite-dimensional approach: estimating an aggregation of the marginal densities, which reduces to a finite sequence of multivariate density estimation problems.

Core contributions
After introducing our method in Section 2.1, an empirical version of it is presented for practical applications (Sect. 2.2). The obtained statistic is built using marginal density estimators which are shown to be consistent in Section 2.3. In Section 3, we propose an implementation of our approach using the self-consistent kernel density estimator from [1] and extending it to the functional data context. We illustrate the effectiveness of our method, as well as its usefulness as an exploratory analysis tool for functional data, on a dataset of real aircraft trajectories and compare it to more standard approaches (Sect. 4).
The code used for the experimental part can be found at: https://github.com/cedricrommel/mean_marginal_likelihood

Mean marginal likelihood
Let (Ω, F, P) be a probability space and T = [0; t_f] an interval of R. We denote by E a compact subset of R^d, d ∈ N^*, endowed with the Borel σ-field B. Let Z = (Z_t)_{t∈T} be a random variable valued in C(T, E), the set of continuous functions from T to E, and suppose that a training set of m observations of Z is available: T = {z_1, ..., z_m} ⊂ C(T, E). We denote by μ_t the marginal distribution of Z_t for any t ∈ T, and we assume that it has a density f_t relative to the Lebesgue measure on R^d. We also assume that (t, z) ∈ T × E ↦ f_t(z) is continuous. Let y ∈ C(T, E) be an arbitrary new curve that we would like to assess.
Given t ∈ T, we can interpret the quantity f t (y(t)) as the likelihood of observing Z t = y(t). By summarizing in some way the infinite collection {f t (y(t)) : t ∈ T}, which we call the marginal likelihoods of y hereafter, we hope to build a global and simple likelihood indicator.
The first idea for aggregating these quantities is to average them with respect to time:

$$\frac{1}{t_f}\int_0^{t_f} f_t(y(t))\,\mathrm{d}t. \tag{2.1}$$

The main problem with this criterion is that it mixes values from densities which may have very different shapes. Indeed, density values of likely observations at two times t_1, t_2 ∈ T may have completely different orders of magnitude. For this reason, we propose to apply some continuous scaling map ψ : L^1(E, R_+) × E → [0; 1] prior to averaging:

$$\mathrm{MML}(y) := \frac{1}{t_f}\int_0^{t_f} \psi[f_t, y(t)]\,\mathrm{d}t, \tag{2.2}$$

and we call the obtained quantity the mean marginal likelihood of y given Z. A natural choice for ψ is simply the function that normalizes f_t(y(t)) over E:

$$\psi[f_t, y(t)] = \frac{f_t(y(t))}{\sup_{z \in E} f_t(z)}. \tag{2.3}$$

However, we can also consider more meaningful scaling maps, such as the confidence level at y(t):

Definition 2.1. Let X be a random variable on E with continuous density function f. We call the confidence level of f at a ∈ E the probability that f(X) lies in the region of density lower than or equal to f(a):

$$\psi[f, a] := P\big(f(X) \le f(a)\big) = \int_E f(z)\,\mathbf{1}_{\{f(z) \le f(a)\}}\,\mathrm{d}z. \tag{2.4}$$

In this case, ψ[f_t, y(t)] corresponds to the probability of Z_t falling outside the smallest confidence region containing y(t), as illustrated in Figure 1.
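Both scalings are easy to compute once a marginal density has been tabulated on a grid. The sketch below is our own minimal NumPy illustration (not the paper's code): the confidence level of Definition 2.1 is approximated with the trapezoidal rule, as done later in our experiments.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal rule on a grid (used to approximate integrals over E)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def normalized_scaling(f_at_a, f_grid):
    """First scaling: normalize f(a) by the supremum of f over E."""
    return f_at_a / f_grid.max()

def confidence_level(f_at_a, f_grid, grid):
    """Confidence level of Definition 2.1: P(f(X) <= f(a)), i.e. the mass
    of the region where the density is lower than or equal to f(a)."""
    return trapezoid(f_grid * (f_grid <= f_at_a), grid)

# Standard Gaussian marginal density tabulated on a grid covering its support
grid = np.linspace(-6.0, 6.0, 2001)
f = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)

psi_mode = confidence_level(f[1000], f, grid)  # at the mode (z = 0)
psi_tail = confidence_level(f[0], f, grid)     # deep in the tail (z = -6)
```

At the mode, the whole probability mass lies in the region {f ≤ f(a)}, so ψ ≈ 1, while deep in the tail ψ ≈ 0: unusual observations receive scores near zero under both scalings.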
A numerical comparison of these two scalings can be found in Section 4, while a class of estimators of the mean marginal likelihood is presented in the next section.

Empirical version
Usually in FDA, one only has access to discrete observations of the random functions under study. We assume to be in this context: for 1 ≤ r ≤ m, each path z_r of the training set T is assumed to be observed at n ∈ N^* discrete times {t^r_1 < t^r_2 < ... < t^r_n} ⊂ T, drawn independently from some random variable T, supposed independent of Z. Note that the sample sizes are the same for all curves z_r, but the sampling times t^r_j differ. Hence, we denote by T_D the set of all discrete observations:

T_D = {(t^r_j, z^r_j) : 1 ≤ r ≤ m, 1 ≤ j ≤ n},  where z^r_j = z_r(t^r_j).

Likewise, we assume that the new curve y is observed at ñ ∈ N^* discrete times {t̃_j}_{j=1}^{ñ} ⊂ T, and we denote these observations by {(t̃_j, y_j)}_{j=1}^{ñ}, where y_j = y(t̃_j).
Had we enough observations of y, we could approximate the integral in (2.2) by a Riemann sum:

$$\mathrm{MML}(y) \approx \frac{1}{t_f}\sum_{j=1}^{\tilde n} \psi[f_j, y_j]\,\Delta \tilde t_j,$$

where f_j := f_{t̃_j}, Δt̃_j := t̃_j − t̃_{j−1} and t̃_0 := 0. Yet, the marginal densities {f_j}_{j=1}^{ñ} are unknown and need to be estimated for practical use.
Our approach is based on the idea of partitioning T into q_m intervals, or bins, of the same length b_m = t_f/q_m. For 1 ≤ ℓ ≤ q_m, let τ_ℓ := τ_0 + ℓ b_m, where τ_0 = inf T = 0. Denote by B_ℓ := [τ_{ℓ−1}; τ_ℓ) the ℓth bin for ℓ = 1, ..., q_m − 1, and B_{q_m} := [τ_{q_m−1}; τ_{q_m}]. Similarly, T_ℓ = {(t^r_j, z^r_j) : t^r_j ∈ B_ℓ} ⊂ T_D denotes the set of observations whose sampling times fall into B_ℓ. For some m and 1 ≤ j ≤ ñ, let ℓ be such that t̃_j ∈ B_ℓ. For b_m small enough, we estimate f_j by building a density estimator from the partial data contained in T_ℓ. This is done by applying a common statistic Θ mapping samples of points of E to density functions (see Sect. 2.3). Hence, we consider a single density estimator per bin, averaging along the times in it. We denote these estimated quantities by {f̂_j}_{j=1}^{ñ} and, by summing them, we build the following plug-in estimator, called the Empirical Mean Marginal Likelihood hereafter:

$$\mathrm{EMML}(y) := \frac{1}{t_f}\sum_{j=1}^{\tilde n} \psi[\hat f_j, y_j]\,\Delta \tilde t_j. \tag{2.8}$$

In Section 2.3, sufficient conditions are given for the consistent estimation of the marginal densities f_t, while Section 3 presents a possible class of kernel density estimators used to compute {f̂_j}_{j=1}^{ñ}.
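To make the construction concrete, here is a self-contained 1D sketch of the EMML pipeline, with a plain Gaussian KDE standing in for the statistic Θ (the paper uses the self-consistent estimator of Sect. 3 instead) and the normalization scaling; all names and parameter values here are ours.

```python
import numpy as np

def gauss_kde(samples, bandwidth=0.3):
    """Stand-in for the statistic Theta: a 1D Gaussian KDE trained on one bin."""
    def f_hat(z):
        u = (np.atleast_1d(z)[:, None] - samples[None, :]) / bandwidth
        return np.exp(-0.5 * u**2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))
    return f_hat

def emml(train_t, train_z, new_t, new_z, t_f, q, grid):
    """Empirical Mean Marginal Likelihood with the normalization scaling."""
    edges = np.linspace(0.0, t_f, q + 1)
    f_hats = []
    for l in range(q):  # one density estimator per bin
        mask = (train_t >= edges[l]) & (train_t <= edges[l + 1])
        f_hats.append(gauss_kde(train_z[mask]))
    score, t_prev = 0.0, 0.0
    for t_j, y_j in zip(new_t, new_z):
        l = min(int(t_j / t_f * q), q - 1)       # bin containing t_j
        f_hat = f_hats[l]
        psi = f_hat(y_j)[0] / f_hat(grid).max()  # scaled marginal likelihood
        score += psi * (t_j - t_prev)            # Riemann-sum weight
        t_prev = t_j
    return score / t_f

# Synthetic training curves z_r(t) = sin(2 pi t) + noise, sampled at random times
rng = np.random.default_rng(0)
t_f, m, n, q = 1.0, 50, 20, 5
train_t = rng.uniform(0.0, t_f, m * n)
train_z = np.sin(2 * np.pi * train_t) + 0.1 * rng.normal(size=m * n)
grid = np.linspace(-2.5, 2.5, 200)

new_t = np.sort(rng.uniform(0.0, t_f, 15))
s_typical = emml(train_t, train_z, new_t, np.sin(2 * np.pi * new_t), t_f, q, grid)
s_outlier = emml(train_t, train_z, new_t, np.sin(2 * np.pi * new_t) + 1.5, t_f, q, grid)
```

A new curve lying in the bulk of the training sample receives a score close to 1, while a vertically shifted copy of it scores markedly lower.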

General case
In this section we show that, by using a well-chosen statistic to build density estimators in the bins described in Section 2.2, we obtain pointwise consistent estimations of the marginal densities of Z. We describe the main ideas of the proof here, while the technical details can be found in the supplementary material. Our consistency result is summarized in Theorem 2.6 and relies on the following four assumptions.

Assumption 2.2. The random variable T is absolutely continuous, and ν ∈ L^∞(E, R_+), its density relative to the Lebesgue measure, is bounded away from zero on T: inf_{t∈T} ν(t) > 0.

Assumption 2.3. The map (t, z) ∈ T × E ↦ f_t(z) is continuous in both variables and Lipschitz in time with constant L > 0: for any z ∈ E and t_1, t_2 ∈ T,

$$|f_{t_1}(z) - f_{t_2}(z)| \le L\,|t_1 - t_2|. \tag{2.11}$$

Assumption 2.4. The bin sizes decrease to zero,

$$b_m \xrightarrow[m\to\infty]{} 0, \tag{2.12}$$

but not too fast:

$$m\,b_m \xrightarrow[m\to\infty]{} \infty. \tag{2.13}$$

Let S = {(z_k)_{k=1}^N ∈ E^N : N ∈ N^*} be the set of finite sequences with values in the compact set E ⊂ R^d. We also need the statistic Θ : S → L^1(E, R_+) used to build the density estimators to lead to uniformly consistent density estimations in a standard i.i.d. setting, which is summarized in the following assumption.

Assumption 2.5. Let G be an arbitrary family of probability density functions on E. Given a density ρ ∈ G, let S^N_ρ ∈ S be an i.i.d. sample of size N drawn from ρ. The estimator ρ̂_N := Θ(S^N_ρ) obtained by applying Θ to S^N_ρ is a (pointwise) consistent density estimator, uniformly in ρ:

$$\forall z \in E,\ \forall \varepsilon > 0, \qquad \lim_{N\to\infty}\ \sup_{\rho\in G}\ P\big(|\hat\rho_N(z) - \rho(z)| > \varepsilon\big) = 0. \tag{2.15}$$

For m ∈ N^*, let ℓ_m : T → N^* be the function mapping any point t ∈ T = [0; t_f] to the index of the bin containing it. We denote by f̂^m_{ℓ_m(t)} the estimator obtained by applying Θ to the subset of data points T^m_{ℓ_m(t)} whose sampling times fall in the bin containing t.

Theorem 2.6. Under Assumptions 2.2 to 2.5, for any z ∈ E and t ∈ T, f̂^m_{ℓ_m(t)}(z) consistently approximates the marginal density f_t(z) as the number of curves m grows:

$$\forall \varepsilon > 0, \qquad \lim_{m\to\infty} P\big(|\hat f^m_{\ell_m(t)}(z) - f_t(z)| > \varepsilon\big) = 0.$$

Important remark. Assumption 2.5 does not lead directly to Theorem 2.6, whose proof is technically involved and presented in detail in Appendix A. Indeed, note that, unlike Assumption 2.5, the convergence in Theorem 2.6 is written in terms of m.
This is a key difference, since the number of observation points used is controlled in the multivariate setting of Assumption 2.5, while it is a random variable in Theorem 2.6 (the number of observations falling in the bin T^m_{ℓ_m(t)}). Moreover, while Assumption 2.5 lies in the classical i.i.d. framework, this is not the case in Theorem 2.6. Finally, although Assumption 2.5 might seem strong, we show later, in Theorem 2.8, that the consistency of the EMML holds with standard kernel density estimators.
Before explaining the proof of this theorem, notice that the observations falling into a given bin for a given number of curves m follow a distribution whose density function can be explicitly derived. Indeed, for V ⊂ E and B ⊂ T two compact sets, we have

$$P(Z_T \in V,\ T \in B) = \int_B \int_V f_t(z)\,\nu(t)\,\mathrm{d}z\,\mathrm{d}t. \tag{2.18}$$

This shows that the observations whose sampling times fall in the ℓth bin are drawn from the mixture density

$$f^m_{\ell}(z) := \frac{\int_{B_\ell} f_t(z)\,\nu(t)\,\mathrm{d}t}{\int_{B_\ell} \nu(t)\,\mathrm{d}t}.$$

The proof of Theorem 2.6 relies on the fact that f^m_{ℓ_m(t)} converges pointwise to the marginal density f_t as m tends to infinity. It is indeed quite straightforward to show this using the assumptions that f_t is Lipschitz in time (2.11) and that the bin sizes b_m tend to 0 (2.12). From there, the idea is to apply the consistency result from Assumption 2.5 to show that f̂^m_{ℓ_m(t)} converges pointwise in probability to f^m_{ℓ_m(t)}. However, two main difficulties arise here:
1. f̂^m_{ℓ_m(t)} is trained using the observations from T^m_{ℓ_m(t)}, and the number of elements contained in this subset, denoted by N^m_{ℓ_m(t)}, is random;
2. we need to train the estimator on i.i.d. observations whose number tends to infinity in order to apply (2.15).
The first difficulty can be tackled by conditioning on N^m_{ℓ_m(t)}. For the second one, we use the fact that, as the bin size tends to 0 and as the number n of observations per curve is fixed with respect to m, each training subset has, with high probability, at most one observation per curve asymptotically. Hence, because the curves are independent observations of Z, the observations contained in T^m_{ℓ_m(t)} for m large enough will, with high probability, be independently drawn from f^m_{ℓ_m(t)}. Furthermore, we can show that if the bin size does not decrease too fast, as required by (2.13), then N^m_{ℓ_m(t)} diverges to +∞ in probability. The detailed proof of Theorem 2.6 can be found in the supplementary material.

Specific case: example of a kernel estimator with deterministic kernel
In this paragraph we state a stronger consistency result for the particular case of kernel density estimators of the form

$$\hat f(z) = \frac{1}{N}\sum_{k=1}^{N} K_\sigma(z - x_k), \qquad K_\sigma(u) := \frac{1}{\sigma^d}\,K\!\left(\frac{u}{\sigma}\right), \tag{2.20}$$

where K : R^d → R is a symmetric kernel integrating to 1 and σ > 0 is the bandwidth, chosen to be scalar here for simplicity. More details on this type of estimator are given in Section 3. We denote by μ₂(K_σ) and μ₂(K²_σ) the second moments of the random variables with densities K_σ and K²_σ respectively, and by R(K) := ∫ K(u)² du the kernel risk. For this particular setting, we state in Theorem 2.8 that, under certain conditions, f̂^m_{ℓ_m(t)}(z) approximates f_t(z) consistently in expected squared error, which is stronger than the convergence in probability stated in Theorem 2.6.

Assumption 2.7. The function (t, z) ∈ T × E ↦ f_t(z) has continuous partial derivatives, up to order 4 in z and order 1 in t; its Lipschitz constant in time is denoted by L > 0: for any z ∈ E and t_1, t_2 ∈ T,

$$|f_{t_1}(z) - f_{t_2}(z)| \le L\,|t_1 - t_2|.$$

Theorem 2.8. Under Assumptions 2.2, 2.4 and 2.7, for a suitable bandwidth sequence σ_m → 0, the estimator f̂^m_{ℓ_m(t)}(z) converges to f_t(z) in expected squared error as the number of curves m grows.

The previous result applies to any kernel density estimator with a symmetric deterministic kernel. For example, according to (2.13) from Assumption 2.4, Theorem 2.8 applies to the case of a Gaussian kernel with bandwidth σ_m = 1/√(m b_m). Unfortunately, it does not apply to the marginal density estimator presented in the next section, whose kernel is random.

Possible choice of density estimator: the self-consistent estimator
In the previous section we presented a general estimator of a discrepancy criterion, the Mean Marginal Likelihood, used to quantify how close a certain curve is to a set of other curves. As explained, this plug-in estimator is based on the aggregation of consistent density estimators trained on uniform bins. One may wonder which local density estimator to use in this situation.
As with most statistical learning problems, density estimation can be tackled in a parametric or a nonparametric setting. In the first case, a specific class of density functions has to be fixed a priori, with a finite set of unknown parameters to be tuned using the information contained in the data. Such approaches, for example maximum likelihood estimation, are fast to train and evaluate, attain the best possible learning rates, and are usually very scalable and accurate if the model assumptions are correct. Nonparametric density learning techniques, on the other hand, make little to no assumption on the shape of the density to be estimated. As explained in Section 2.1, the marginal densities at different times may greatly vary in shape, which is why we preferred nonparametric estimators in this article, and more precisely the popular kernel density estimators.
For d-dimensional data, kernel density estimators (KDE) have the following general form when trained on N i.i.d. observations {x_k}_{k=1}^N of a random variable X of density f:

$$\hat f(x) = \frac{1}{N\,|\det H|}\sum_{k=1}^{N} K\big(H^{-1}(x - x_k)\big). \tag{3.1}$$

In (3.1), the function K : R^d → R_+ is called a smoothing kernel and H, an invertible d × d matrix, is the kernel's bandwidth matrix.
In general, the main drawback of kernel density estimators (3.1) lies in the subjective choice of the kernel K and bandwidth H. It is folklore in the density estimation literature ([34], Chap. 20.3) that the accuracy of a KDE is not very sensitive to the choice of K and depends mainly on the bandwidth H. Several rules and heuristics have been suggested for choosing this parameter, but they are usually based on quite strict assumptions, such as Silverman's rule of thumb for the estimation of a 1-dimensional Gaussian density [30]. Another possibility is to select H through cross-validation [31], but this approach is computationally intensive, especially if d is larger than 1. For these reasons, we decided to consider the method proposed by [1], called the self-consistent density estimator. It consists of a KDE whose kernel incorporates the bandwidth and is learned directly from the data, hence not requiring any parameter tuning. Its derivation is based on a fixed-point equation approximating the optimal kernel estimator in the sense of the mean integrated squared error (MISE). The obtained estimator takes the form of the inverse Fourier transform of the following characteristic function estimator:

$$\hat\phi_{sc}(\omega) = \frac{N}{2(N-1)}\left[1 + \sqrt{1 - \frac{4(N-1)}{N^2\,|\Delta(\omega)|^2}}\right]\mathbf{1}_{A_N}(\omega)\,\Delta(\omega), \qquad \Delta(\omega) := \frac{1}{N}\sum_{k=1}^{N} e^{i\omega x_k},$$

where Δ is the empirical characteristic function and A_N is a set of accepted frequencies. Reference [1] proved for 1D data that, under mild assumptions on A_N, the density estimator f̂_sc defined as the inverse Fourier transform of φ̂_sc converges almost surely to the true density f as the number of observations N grows, and is hence (strongly) consistent. It has been demonstrated through extensive numerical experiments in [1,21,22] that the self-consistent estimator achieves state-of-the-art MISE accuracy for many types of underlying densities. Furthermore, modern implementations of this estimator proposed by [21,22] are shown to be several times faster to compute than a regular KDE.
This is achieved thanks to the use of the nonuniform fast Fourier transform [12] to compute the empirical characteristic function Δ. This property is particularly important in our case because of the potentially large number of trainings needed to compute the EMML, which equals the number of bins in the partition of T.
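For intuition, the fixed-point filter can be sketched in a few lines of NumPy. This is a naive stand-in of our own, not fastKDE: in particular, we approximate the accepted frequency set A_N by the largest contiguous block of frequencies around 0 where the filter is real, and use dense transforms instead of the NUFFT.

```python
import numpy as np

def self_consistent_kde_1d(x, grid):
    """Naive sketch of the self-consistent estimator of [1] in 1D:
    filter the empirical characteristic function by the fixed-point
    solution, then transform back onto the spatial grid."""
    n = len(x)
    n_grid = len(grid)
    dw = 2 * np.pi / (grid[-1] - grid[0])
    w = (np.arange(n_grid) - n_grid // 2) * dw         # symmetric frequencies
    ecf = np.exp(1j * np.outer(w, x)).mean(axis=1)     # empirical char. function
    with np.errstate(divide="ignore"):
        arg = 1 - 4 * (n - 1) / (n**2 * np.abs(ecf)**2)
    stable = arg >= 0
    lo = hi = n_grid // 2                              # contiguous block around 0
    while lo > 0 and stable[lo - 1]:
        lo -= 1
    while hi < n_grid - 1 and stable[hi + 1]:
        hi += 1
    phi = np.zeros_like(ecf)
    k = slice(lo, hi + 1)
    phi[k] = n / (2 * (n - 1)) * (1 + np.sqrt(arg[k])) * ecf[k]
    # inverse Fourier transform, truncated at negative values
    f = np.real(np.exp(-1j * np.outer(grid, w)) @ phi) * dw / (2 * np.pi)
    return np.clip(f, 0.0, None)

rng = np.random.default_rng(1)
x = rng.normal(size=500)
grid = np.linspace(-6.0, 6.0, 256)
f_hat = self_consistent_kde_1d(x, grid)
```

Note that no bandwidth is supplied anywhere: the frequency cutoff is determined by the data through |Δ(ω)|, which is precisely what makes the method convenient for the many per-bin fits required by the EMML.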
For all these reasons, all EMML numerical experiments presented in Section 4 make use of this density estimator. More details concerning its derivation and implementation can be found in the supplementary material.

Experiments motivation
In this section we illustrate our approach on real data recorded from m = 424 flights of the same medium-haul aircraft, T = {z_1, ..., z_m}, which corresponds to 334 531 observation points. These trajectories are used to estimate the differential system describing the aircraft dynamics. Then, by numerically solving an optimal control problem defined using the estimated aircraft dynamics, a new trajectory y is obtained, designed to minimize the overall fuel consumption of some future flight [27].
Note that:
- The dynamics model is not guaranteed to be valid outside of the region occupied by the data used to estimate it. Hence, it is natural to require the optimized trajectory not to stray too far from this validity region.
- Furthermore, it is desirable for the proposed trajectory not to be too unusual compared to standard climb profiles, for better acceptance by the pilots and Air Traffic Control.
The two previous points motivate the need for an indicator of closeness between the optimized trajectory and the set of recorded flights.

Training set
The training data used for our experiments were extracted from the Quick Access Recorder (QAR) of the aircraft, whose sampling rate is one measurement per second. We only used the recordings of 5 variables: the altitude h, the true airspeed V, the path angle γ, the angle of attack α and the rotational engine speed N_1. These variables are not all directly accessible, and some were computed from other measurements using standard formulas from flight mechanics [27].
Only the portion of these signals corresponding to the climb phase was kept for our experiments, i.e. data corresponding to altitudes between 1524 and 12 000 m. The training set of curves obtained by this procedure is displayed in Figure 2a.

Test set
In order to evaluate the estimated mean marginal likelihood on relevant examples, the following test flights were considered:
- "Real": 50 real flights, extracted from the training set before training;
- "Opt1": 50 simulated trajectories optimized as described in Section 4.1, with constraints limiting the speed V (typically 250 kt below 10 000 ft) and with N_1 fixed to a standard climb profile;
- "Opt2": 50 simulated trajectories optimized without the constraints on V and N_1.
We evaluated the likelihood of these 150 test trajectories using the EMML and the competing methods described in the following section in order to assess and compare their discriminative power and computation efficiency.

Alternative approaches based on standard methods
To our knowledge, the problem of quantifying how close a newly observed random curve is to a set of observations from the same stochastic process has not been treated in the statistical learning literature. However, it is related to more standard approaches from Functional Data Analysis and Conditional Density Estimation, which can be adapted quite straightforwardly for this purpose. For this reason, we discuss in the following paragraphs the characteristics of two such existing methods, before comparing them to our approach in the numerical results.

Functional principal components analysis
Functional Principal Components Analysis (FPCA) is a standard tool in FDA capable of building a small number of descriptors summarizing the structure of a set of random functions. As explained for example in [19], this dimension reduction method can be used to project the training set of infinite-dimensional random trajectories into a finite (low) dimensional space.
Following the same reasoning used to derive the MML, our idea here consists in estimating the density function of these low-dimensional representations of the training set. Then, after projecting the new trajectory y onto the same descriptors, we can evaluate the density estimate at this point and obtain an approximation of its likelihood. This approach was originally proposed in [7].

Least-squares conditional density estimation
From a completely different point of view, we could forget for a moment that we are considering a random process Z and look at (T, Z_T) as a pair of standard random variables valued in the finite-dimensional space T × E. In this case, we can see the marginal densities f_t as the conditional probability density functions

$$f_t(z) = f_{Z_T | T}(z \mid t) = \frac{f_{(T, Z_T)}(t, z)}{f_T(t)}. \tag{4.1}$$

We could hence estimate (4.1) at the observed points of the new trajectory y and use these estimates to compute the EMML indicator (2.8). It is however well known in density ratio estimation that approximating f_{(T,Z_T)}(t, z) and f_T(t) separately before forming the ratio (4.1) is not a good idea, because it magnifies the errors. For this reason, reference [32] proposed to use a linear model for this purpose,

$$\hat f_t(z) = \theta^\top \varphi(t, z) = \sum_{k=1}^{p} \theta_k\,\varphi_k(t, z),$$

where θ = (θ_1, ..., θ_p) is a vector of scalar parameters and φ(t, z) = (φ_1(t, z), ..., φ_p(t, z)) is a family of nonnegative basis functions. The parameters θ are then chosen so as to minimize an L_2-penalized least-squares criterion, which is shown to have a closed-form solution. This method was coined Least-Squares Conditional Density Estimation (LS-CDE) by the authors, and is also known as Unconstrained Least-Squares Importance Fitting (uLSIF) in the density ratio estimation literature [16]. The extensive numerical results presented in [32] indicate that this approach has state-of-the-art accuracy in conditional density estimation.
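As a concrete illustration of the closed-form solution, here is a small self-contained sketch of LS-CDE with Gaussian bases; the expressions for H and h follow the construction in [32], but variable names and hyperparameter values are ours, and no cross-validation is performed.

```python
import numpy as np

def fit_lscde(t, z, sigma=0.3, lam=0.1, p=50, seed=0):
    """Fit theta in the model r(t, z) = theta . phi(t, z), with Gaussian
    bases centred on p points drawn from the data, as in [32]."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(t), size=min(p, len(t)), replace=False)
    tc, zc = t[idx], z[idx]
    # h_k = (1/n) sum_i phi_k(t_i, z_i)
    phi = np.exp(-((t[:, None] - tc)**2 + (z[:, None] - zc)**2) / (2 * sigma**2))
    h = phi.mean(axis=0)
    # H_kl = (1/n) sum_i int phi_k(t_i, z) phi_l(t_i, z) dz  (Gaussian closed form)
    cross = np.exp(-(zc[:, None] - zc)**2 / (4 * sigma**2)) * sigma * np.sqrt(np.pi)
    gt = np.exp(-((t[:, None] - tc)**2) / (2 * sigma**2))
    H = np.einsum("ik,il->kl", gt, gt) / len(t) * cross
    # ridge-regularized solution, truncated at zero
    theta = np.maximum(0.0, np.linalg.solve(H + lam * np.eye(len(tc)), h))
    return theta, tc, zc, sigma

def lscde_pdf(model, t0, z0):
    """Conditional density estimate p(z0 | t0), normalised in z."""
    theta, tc, zc, sigma = model
    num = theta @ np.exp(-((t0 - tc)**2 + (z0 - zc)**2) / (2 * sigma**2))
    den = theta @ (np.exp(-((t0 - tc)**2) / (2 * sigma**2)) * sigma * np.sqrt(2 * np.pi))
    return num / den

# z | t drawn around the line 2t: the estimate should peak near it
rng = np.random.default_rng(2)
t = rng.uniform(0.0, 1.0, 500)
z = 2.0 * t + 0.2 * rng.normal(size=500)
model = fit_lscde(t, z)
p_on = lscde_pdf(model, 0.5, 1.0)   # on the regression line
p_off = lscde_pdf(model, 0.5, 2.5)  # far from it
```

The estimate concentrates near the true conditional mean: the normalized density at (0.5, 1.0) dominates the one at (0.5, 2.5).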

Algorithms settings
For all the methods tested, the altitude h played the role of "time". This is a natural choice when optimizing the climb profile of a civil airliner, since the altitude is an increasing function of time during climb and every other variable depends on it. This also allowed us to reduce the dimension of our problem from 5 to 4.

MML with self-consistent kernel estimator settings
The python library fastKDE [21,22] was used to compute the marginal densities from the bin data. It contains an implementation of the self-consistent kernel estimator described in Section 3. The precision of the density estimations was set to single. The confidence levels were approximated by numerical integration using the trapezoidal rule over a fine grid of approximately 300 points per bin. Concerning bin sizes, we chose to use an uneven partition in our experiments. The reason for this is the climb-steps visible in the trajectories between 3000 and 4000 m, which correspond to phases during which the aircraft considerably decreases its climb rate, leading to slowly increasing altitudes. Such behaviors translate into rapidly increasing speeds V with respect to the altitude, as well as into plummeting values of γ and N_1 (see Fig. 2a). This led us to consider tighter bins around these climb-step altitudes:
- between 1524 and 3000 m and between 4000 and 12 000 m, we partitioned the altitudes homogeneously into bins of size b_m^(1) = 20 m;
- between 3000 and 4000 m, we used a bin size twice as small, b_m^(2) = 10 m.
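Concretely, the uneven altitude partition can be built as follows (a sketch; following the text, the finer bin size of 10 m is simply half of b_m^(1) = 20 m):

```python
import numpy as np

# Bin edges (in metres): 20 m bins everywhere except around the climb-steps
# between 3000 and 4000 m, where the bins are twice as fine (10 m)
edges = np.concatenate([
    np.arange(1524.0, 3000.0, 20.0),          # below the climb-step altitudes
    np.arange(3000.0, 4000.0, 10.0),          # twice finer around them
    np.arange(4000.0, 12000.0 + 20.0, 20.0),  # above them
])
n_bins = len(edges) - 1
```

Each bin then receives its own self-consistent density fit, so the number of per-variable trainings equals `n_bins`.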

FPCA settings
Concerning the Functional Principal Components Analysis method, all training and testing flights were resampled on an equispaced grid of altitudes, using a step size of 5 m. The trajectories were then centered and decomposed in a basis of 128 cubic B-splines. The SVD decomposition was carried out using the PCA class of the scikit-learn python library [23]. We kept 4 components for each variable (V, γ, α and N_1), which was enough to explain more than 90%, 65%, 60% and 75% of their respective variances. The relatively low gain in explained variance obtained by using a higher dimension did not seem worth it here, as it was observed in Section 5.1 of [7] that good results can be obtained with this method even with a small number of components. The density estimation was then carried out using Gaussian mixture models, in order to avoid overfitting. Indeed, as FPCA transforms each training curve into a single point in the 4-dimensional space spanned by the principal functions, other nonparametric density estimation approaches, such as kernel density estimation, may overfit each point of the data set if the number of curves is not large (see e.g. [7]). The Gaussian mixture models were trained using a standard EM algorithm, implemented in scikit-learn as well. The number of components was selected between 1 and 5 using the Bayesian information criterion (BIC).
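The FPCA projection step can be sketched with a plain SVD applied to curves resampled on a common grid, a minimal stand-in for the B-spline + scikit-learn pipeline described above (all names are ours):

```python
import numpy as np

def fpca_scores(curves, n_comp=4):
    """Project centred curves (one per row, resampled on a common grid)
    onto their first n_comp principal functions."""
    centred = curves - curves.mean(axis=0)
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = centred @ vt[:n_comp].T                      # low-dim descriptors
    explained = (s[:n_comp]**2).sum() / (s**2).sum()      # variance fraction
    return scores, vt[:n_comp], explained

# Synthetic curves spanned by two modes plus noise: 4 components suffice
rng = np.random.default_rng(3)
grid = np.linspace(0.0, 1.0, 200)
a, b = rng.normal(size=(2, 100, 1))
curves = (a * np.sin(2 * np.pi * grid) + b * np.cos(2 * np.pi * grid)
          + 0.01 * rng.normal(size=(100, 200)))
scores, components, explained = fpca_scores(curves)
```

The density of the resulting 4-dimensional scores is what the Gaussian mixture model is then fitted on.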

LS-CDE settings
For the Least-Squares Conditional Density Estimation, the python package densratio, due to [18], implementing the uLSIF method from [16], was used and adapted. The basis functions chosen were p = 100 Gaussian kernels with the same variance σ and different centers. These centers were randomly drawn from the training data points using a uniform distribution, as suggested in [32]. The variance σ, as well as the L_2 penalty weight λ needed for minimizing the least-squares criterion, were selected by cross-validation. These seemed like fair settings here, as all 23 experiments conducted in [32] were carried out this way.

Table 1. The confidence-level based and normalized-density based MML are comparable in terms of discrimination power and seem adequate for the task of assessing optimized aircraft climb trajectories.

Results and comments
We notice, for both types of scaling functions, that the test flight categories are nicely separated into three clearly distinct ranges of scores. Looking more closely at the scores, the largest differences between the two types of optimized flights are seen for the variables V and N_1, which makes sense since these are the variables left free for Opt2 flights and constrained for Opt1 flights (see Sect. 4.2.2). We can also see that the scores of α are particularly low for Opt1 and Opt2 when compared to Real. The same can be seen for the N_1 score of Opt2 flights when compared to Opt1 and Real flights. This also makes sense, as α and N_1 are control variables in the optimization problems which generated the trajectories of Opt1 and Opt2. As a matter of fact, these optimal controls tend to be rather nonsmooth, with many discontinuities, unless explicitly constrained otherwise (such as N_1 for Opt1).

Table 1b contains the estimated mean marginal likelihood scores in a 2-dimensional setting, where the pairs (V, γ) and (α, N_1) have been treated together. The average training time needed here was 16 times larger than in the 1D case, i.e. 1 minute 20 seconds. The scores observed are globally very low, and the test flight categories are not well separated. Moreover, we expected to obtain large scores for the real flights, since we used the marginal densities of their category to build the criterion, but this is not the case here. It appears that, in our case, the MML criterion based on the self-consistent estimator is less efficient in 2D than in 1D. It is well known ([34], Chap. 21.3) that, as the dimension grows, the amount of data needed to attain a given accuracy with kernel density estimators skyrockets, which may explain this poor performance. We investigated in [28] a possible remedy based on Gaussian mixture models, which are more robust to an increase in dimension than kernel density estimators.
From a practical point of view, a reference value or threshold is needed if one wants to use our method to determine automatically whether a given optimized flight should be accepted or not. In such a context, a quite straightforward solution is a leave-one-out cross-validation approach: compute the MML score of each real flight while leaving it out of the training data, and then average the obtained scores. These reference values have been computed for our dataset and are summarized in Table 2. We note that the values obtained are very close to the average scores of the real flights shown in Table 1a.

Table 3a contains the scores obtained using the Functional PCA based method presented in Section 4.3.1. The training time needed here was 20 seconds on average. As for the MML in 2D, the real flights' scores are surprisingly low, and the two types of simulated trajectories are not well discriminated by the criterion. This might be caused by the fact that this method encodes each training trajectory as a single point in the 4-dimensional space spanned by the principal functions. The training set obtained hence contains only m = 424 points, which might be too few to attain sufficient accuracy with the Gaussian mixture density estimator in such a high dimension, as already encountered in [7]. The principal functions used and scatter plots of the projected trajectories can be found in the online version.
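The leave-one-out reference value described above amounts to the following generic loop, where `score_fn` stands for any MML-like scoring of one curve against a training set (a sketch with names of ours):

```python
def loo_reference(score_fn, flights):
    """Average score of each flight against a model trained on the others."""
    scores = [
        score_fn(flights[i], flights[:i] + flights[i + 1:])
        for i in range(len(flights))
    ]
    return sum(scores) / len(scores)

# Dummy scorer for illustration: the score is the size of the training set seen,
# so each of the 5 flights is scored against the 4 others
ref = loo_reference(lambda y, train: len(train), list(range(5)))
```

In practice `score_fn` would refit the per-bin densities without the held-out flight before scoring it, so the reference value is not biased by the flight's own contribution.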
Concerning the LS-CDE approach, because the algorithm needs large Gram matrices (of size O(nm²)) to be stored, we encountered several memory problems when trying to run it on our dataset of 334 531 observation points. For this reason, our results were obtained by applying it to 100 uniform batches, obtained by partitioning the data according to the altitude. Although the three categories are well separated in Table 3b, the variances are a bit larger, so that the groups overlap. Also, the total time needed to train the estimators on every batch was much longer, approximately 14 hours. We did not test these two alternative methods in the 2D setting, since the problems observed in 1D (dimension 4 for FPCA, memory and time for LS-CDE) would only be aggravated.
In conclusion, our numerical results indicate that the MML criterion has better discriminative power than FPCA and LS-CDE for the task of assessing curves relative to a set of "good" examples. Furthermore, the training time and memory needed to apply LS-CDE to datasets of this size seem crippling. Concerning the FPCA method, it does not seem applicable to datasets with so few curves, and it presents a higher training time than MML.

Conclusions
In this paper we proposed a new approach to the problem of quantifying the closeness of a curve to a set of random functions. We introduced a class of probabilistic criteria for this context, called the Mean Marginal Likelihood (MML), and analyzed two possible scaling functions used to build them. We also derived a class of estimators of our criteria, which makes use of local density estimators proved to consistently approximate the marginal densities of a random process. For practical applications, we suggested a particular flexible density estimator believed to have the right properties in this setting: the self-consistent kernel estimator.
Numerical experiments using real aircraft data were carried out to compare the MML with other well-established approaches from Functional Data Analysis and Conditional Density Estimation. The results show that, although the MML does not take into account the temporal structure of the data as other standard functional data analysis methods do, it is a good candidate for the type of applications suggested. This seems to be especially the case if the number of training trajectories is too small for FPCA, or if the total number of observation points is too large for conditional density estimation. Moreover, the training times obtained for the MML are by far the shortest among the compared methods, which confirms the relevance of the self-consistent kernel estimator. Furthermore, the ease of visualizing, localizing and interpreting discrepancy zones allowed by the MML makes it a good exploratory analysis tool for functional data (see e.g. Fig. 2). We also note that our method does not perform as well in a multidimensional setting, but that the training time should not be an obstacle there. In future work we intend to test the MML with parametric density estimators, which should be less affected by the curse of dimensionality.

Appendix A. Marginal densities consistency proof
In this section we prove Theorem 2.6 from Section 2.3. The proof relies on five lemmas, stated and proved hereafter.
Lemma A.1. Let t ∈ T and z ∈ E. Under Assumptions 2.3 and 2.4, f^m_{m(t)}(z) converges to f_t(z).

Proof. According to Assumption 2.3, the difference between f^m_{m(t)}(z) and f_t(z) is controlled by the bin width b_m. Since b_m → 0 by Assumption 2.4, the conclusion follows.

Lemma A.2. Let m ∈ N* and t ∈ T. Under Assumption 2.4, the probability that T^m_{m(t)} (the subset of training points whose sampling times fall in the bin containing t) contains at most one observation point per curve is asymptotically equal to 1. Consequently, for large enough m, the observations in T^m_{m(t)} are independent with high probability.

Proof. Denote by P^m_{r,m(t)} the probability of a new observation of the r-th curve falling in T^m_{r,m(t)}. The probability of the corresponding count N^m_{r,m(t)} being at most equal to 1 writes as a binomial probability, from which the claim follows under Assumption 2.4.

Lemma A.3. Let t ∈ T and M > 0. Under Assumption 2.4, lim_{m→∞} P(N^m_{m(t)} ≤ M) = 0; in particular, lim_{m→∞} P(N^m_{m(t)} > 0) = 1.

Proof. Let M > 0. As in the proof of Lemma A.2, we have N^m_{m(t)} ∼ B(mn, P^m_{m(t)}), and hence P(N^m_{m(t)} ≤ M) writes as the finite sum (A.13). As this sum is finite once M is fixed, the limit of expression (A.13) as m tends to infinity is the sum of the limits of its terms, given in (A.14). Using (2.13) from Assumption 2.4, we obtain that (A.14) tends to 0, which proves the result.

In what follows, for any M > 0, we denote by C_M the event defined in (A.15).

Lemma A.4. Let t ∈ T and M > 0. Under Assumption 2.4, the conclusion of Lemma A.3 still holds conditionally on each curve contributing at most one observation point, so that lim_{m→∞} P(C_M) = 1.

Proof. Conditionally on {N^m_{r,m(t)} ≤ 1; r = 1, ..., m}, the count N^m_{m(t)} is still a binomial random variable in which at least m(n − 1) of the Bernoulli trials have failed, leaving m trials. Hence (N^m_{m(t)} | N^m_{r,m(t)} ≤ 1; r = 1, ..., m) ∼ B(m, P^m_{m(t)}). The conclusion then follows from the same limit computations as in the proof of Lemma A.3, the limit of the sum being the sum of the limits of its terms.

Lemma A.5. For any z ∈ E and t ∈ T, if Assumptions 2.4 and 2.5 hold for some function Θ used to compute f̂^m_{m(t)}, then f̂^m_{m(t)}(z) − f^m_{m(t)}(z) converges to 0 in probability.

Proof. The randomness of f̂^m_{m(t)}(z) comes both from the sample of observations drawn from the bin density f^m_{m(t)} and from the random number of observations N^m_{m(t)} falling in the bin. The idea is therefore to separate these two sources of randomness by conditioning on one of them; since we wish to use the result from Assumption 2.5, it is natural to condition on N^m_{m(t)}.

Hence, according to Assumption 2.5, for any ε > 0 and any α_1 > 0, there is some N_{ε,α_1,m} such that, defining C_{N_{ε,α_1,m}} by (A.15), inequality (A.23) holds. As seen in Lemma A.2, the conditioning on N^m_{r,m(t)} ≤ 1 comes from the fact that, when T^m_{m(t)} contains at most one observation per curve, those observations are independent. Furthermore, for ε and α_1 fixed, let M := sup_m N_{ε,α_1,m}, which exists according to Assumption 2.5. In this case, when Assumption 2.4 holds, Lemma A.4 gives that for any α_2 > 0 there is some m_{ε,α_1,α_2} such that inequality (A.25) holds. Moreover, we can also write the useful inequality (A.26). By combining (A.23), (A.25) and (A.26), we get that for any ε, α_1, α_2 > 0 there is m_{ε,α_1,α_2} beyond which the announced convergence bound holds.

Using these lemmas, we can finally prove Theorem 2.6.

Proof of Theorem 2.6. Let (t, z) ∈ T × E and let ε > 0. According to Lemma A.1, under Assumptions 2.3 and 2.4, there is m_ε ∈ N* such that inequality (A.30) holds for any m ≥ m_ε. Combining this with the convergence in probability given by Lemma A.5 allows us to obtain (2.17) from (A.30).
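The bin-based construction analyzed in this appendix is easy to illustrate numerically. The following Monte Carlo sketch uses toy assumptions (curves X_r(t) = W_r·t with standard normal W_r, one uniform observation time per curve, and an off-the-shelf Gaussian KDE in place of the estimators studied here) to check that the pooled-bin estimate approaches the true marginal density.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

# Curves X_r(t) = W_r * t with W_r ~ N(0, 1): the true marginal law at time t
# is N(0, t^2). We pool the observations whose sampling time falls in a small
# bin around t0 and estimate the marginal density from that pooled sample.
rng = np.random.default_rng(1)
m = 2000                          # number of curves
times = rng.uniform(0.0, 1.0, m)  # one random observation time per curve
obs = rng.standard_normal(m) * times

t0, half_width = 0.5, 0.05
in_bin = np.abs(times - t0) < half_width
kde = gaussian_kde(obs[in_bin])   # stand-in for the bin density estimator

z = 0.3
est = kde(z)[0]
true = norm.pdf(z, scale=t0)      # true marginal density at time t0
```

As m grows and the bin shrinks, `est` approaches `true`, in line with the consistency statement.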
In the following section we prove an even stronger convergence (in the L² sense) for standard kernel density estimators.
Appendix B. L² consistency for kernel density estimators
In this section we lighten notation, writing f̃^m_{m(t)} for the kernel estimator and f^m_{m(t)} for the bin density. Furthermore, according to Lemma A.3, lim_{m→∞} P(N^m_{m(t)} > 0) = 1. In the following, Ẽ and Ṽ denote the expectation and variance conditional on this event. In the remainder of this section we derive the conditions under which f̃^m_{m(t)} converges in expected squared error to f_t. We recall that a sufficient condition for this is that its bias and variance both tend to 0. The proof is greatly inspired by the derivations presented in [29] for the standard multivariate case.
Similarly to Lemma A.1, we can prove the following convergence result.
Lemma B.1. Under Assumption 2.7, for any (t, z) ∈ T × E, f^m_{m(t)}(z) converges to f_t(z).
Proof. Same argument as in Lemma A.1.
Furthermore, the following bounds on the estimator's bias and variance hold under the same assumption.
Lemma B.2. If Assumption 2.7 holds, then the bias and variance bounds (B.5) and (B.6) are valid.
Proof. By the law of total expectation ([34], p. 55), we obtain decomposition (B.7).
For large enough m, f̃^m_{m(t)} writes as an empirical average of independent identically distributed random variables Z_{m,t} with density f^m_{m(t)}, and hence (B.8) holds. The second-order integral Taylor expansion of f^m_{m(t)} around z gives (B.9), which leads to the conditional expectation expression (B.10). As K_σ is a symmetric probability density function, we have ∫ K_σ(w) dw = 1 and ∫ w K_σ(w) dw = 0, (B.11), which leads to (B.12). By combining (B.7) and (B.12), we obtain the bias expression (B.13), proving bound (B.5).
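For reference, the classical expansion underlying this step can be written out in standard form (cf. [29]). This is a sketch with generic symbols: f stands for the bin density f^m_{m(t)}, K_σ(w) = σ^{-d} K(w/σ), and the expectation is conditional on the number of observations in the bin.

```latex
\[
\tilde{\mathbb{E}}\big[\tilde f(z)\,\big|\,N\big] - f(z)
  = \int K_\sigma(w)\,\big(f(z-w) - f(z)\big)\,\mathrm{d}w
  = \frac{\sigma^2}{2}\int K(u)\; u^\top \nabla^2 f(z)\, u\;\mathrm{d}u + o(\sigma^2),
\]
```

where the first-order term of the Taylor expansion vanishes because K_σ is symmetric (cf. (B.11)), leaving the usual O(σ²) bias.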
Concerning the variance, we apply the law of total variance ([34], p. 55), which gives decomposition (B.14). As in (B.8), we may express the conditional variance of f̃^m_{m(t)}(z) through the conditional variance of the kernel, which brings the first term in (B.14) to the form (B.15). Furthermore, the kernel variance can be developed as in (B.16), and is hence smaller than its first term, i.e. the kernel's second moment. Using the integral Taylor expansion (B.9) of f^m_{m(t)} around z truncated to the first order, this quantity can be written as in (B.17), where we used the fact that K²_σ is an even function. By combining (B.15), (B.16) and (B.17), we get the bound (B.18).
It remains to bound the second term in (B.14). Expanding it as in (B.19), introducing a convenient shorthand and plugging the bias expressions (B.12)-(B.13) into (B.19), we obtain a bound on the second term in (B.14), which proves inequality (B.6).
As a direct consequence, we get the following bounds in the simpler case where the kernel and the bandwidth are deterministic.
Lemma B.3. If σ = σ_m depends only on m and K is fixed, both being deterministic (i.e. independent of the sample), then under Assumption 2.7 the bias and variance bounds (B.24) and (B.25) hold.
Looking at expressions (B.24) and (B.25), it is clear that the convergence of the bias and the variance to zero strongly depends on the asymptotics of Ẽ[1/N^m_{m(t)}]. This motivates the following lemma, which controls this quantity in terms of the sampling density ν.
Proof. According to the main theorem proved in [4], restricted to the event {N^m_{m(t)} > 0}, we obtain the bound in (B.28). By computing the difference between S_1 and S_0, we conclude that S_0 ≤ 2 S_1, and hence (B.32) follows. Finally, the last equality in (B.28) comes from the fact that P^m_{m(t)} ≥ ν − b_m.
In conclusion, we can now prove Theorem 2.8, which states that the kernel marginal density estimator is consistent in expected squared error (which implies the convergence in probability stated in Theorem 2.6).
where K : R^d → R is a smoothing kernel in L²(R^d, R). One can interpret (C.1) as a kernel density estimator with an implicit bandwidth H, an invertible d × d matrix hidden inside the kernel function K. Denote by E the expectation relative to the random sample S_N. In this context, a common quality measure for density estimators is the Mean Integrated Squared Error (MISE), given in (C.6). As we will show, this criterion becomes relatively easy to minimize with respect to the choice of the kernel K once it is shifted to the Fourier domain. Hence, for any function v ∈ L²(R^d, R), we define its Fourier transform F[v] with the convention below, where i = √−1, its inverse being defined accordingly. As f, f̂ ∈ L², Plancherel's theorem allows the MISE to be rewritten in the Fourier domain, where Φ := F[f], usually called the characteristic function, and Φ̂ := F[f̂]. By noticing that f̂ can be seen as the convolution between the kernel and a sum of Dirac functions centered on the data points, it follows that Φ̂ factorizes as in (C.8), where κ denotes the transformed kernel and the function ∆, defined in (C.10), is commonly called the empirical characteristic function (ECF).
Plugging (C.8) into the MISE expression (C.6) and expanding the square gives (C.11), the argument s being omitted for lighter notation and c* denoting the complex conjugate of any c ∈ C. Furthermore, as shown for example in Section 1.3 and Lemma 1.2 of [33], the ECF is an unbiased estimator of the characteristic function, and its second moment is given by (C.12). The passage from line (C.15) to (C.16) relies on the assumption that the random variables {z_k}_{k=1}^N are independent.
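The unbiasedness of the ECF is easy to check numerically for a Gaussian sample, whose characteristic function is known in closed form. A small sketch, assuming the convention F[v](s) = ∫ v(z) e^{isz} dz used above:

```python
import numpy as np

# ECF Delta(s) = (1/N) sum_k exp(i s z_k); for z_k ~ N(0, 1) its target is
# the Gaussian characteristic function Phi(s) = exp(-s^2 / 2).
rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)

def ecf(s, z):
    """Direct O(N) evaluation of the empirical characteristic function at s."""
    return np.mean(np.exp(1j * s * z))

s = 1.2
delta = ecf(s, z)
phi = np.exp(-s ** 2 / 2)
```

For this sample size, `delta` matches `phi` to within a few multiples of 1/√N, as the unbiasedness and second-moment formula predict.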
Replacing (C.12) and (C.19) in (C.11), we obtain (C.20). As initially shown in [35], expression (C.20) can be minimized with respect to the transformed kernel κ. Indeed, for any s ∈ R^d, differentiating the quadratic integrand in (C.20) with respect to κ(s) yields a first-order optimality condition, leading to the optimal estimator (C.24).

C.2 Self-consistent estimator
The practical problem with estimator (C.24) is that it depends on the true characteristic function Φ, which is unknown. Hence, [1] suggested solving the fixed-point equation (C.25) obtained by substituting the estimator itself for Φ. Some analysis shows that this equation admits two solutions, of which Φ̂_+ is a stable fixed point while Φ̂_− is unstable. [1] suggest keeping only the stable one, which brings us to the self-consistent estimator of the characteristic function.
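In one dimension the resulting estimator is easy to sketch numerically. The closed form below assumes the quadratic fixed-point equation of [1] (whose stable root is Φ̂_+); the hard thresholding of low-|∆| frequencies is only a crude stand-in for the admissible set A_N and the heuristics used to choose it.

```python
import numpy as np

def self_consistent_cf(z, s_grid):
    """Stable root of the fixed-point equation for the transformed estimator.

    Assumes the quadratic fixed-point form of [1]; frequencies where
    N^2 |Delta|^2 < 4 (N - 1) are discarded, a crude stand-in for the
    admissible set A_N.
    """
    N = len(z)
    # empirical characteristic function on the frequency grid (direct O(N*M))
    delta = np.exp(1j * np.outer(s_grid, z)).mean(axis=1)
    disc = 1.0 - 4.0 * (N - 1) / (N ** 2 * np.abs(delta) ** 2)
    phi = np.zeros_like(delta)
    keep = disc >= 0.0
    phi[keep] = N * delta[keep] / (2 * (N - 1)) * (1.0 + np.sqrt(disc[keep]))
    return phi
```

At frequencies where |∆| is large the filter is close to the identity, while noisy high-frequency components are suppressed, which is the intended low-pass behaviour.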

C.3 Practical considerations
Heuristics for choosing A N were proposed in [1,21] for the univariate case and in [22] in a multivariate setting.
One practical problem with the self-consistent estimator is that f̂^N_sc = F^{-1}[Φ̂^N_sc] is not necessarily nonnegative. This can be corrected by translating f̂^N_sc downwards until its positive part integrates to one, and then setting the negative part to 0. Indeed, it was proven in [11] that such a transformation induces no cost in terms of MISE accuracy.
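The correction of [11] is straightforward to implement when the estimate is evaluated on a uniform grid. A sketch under that assumption (`make_density` is a hypothetical helper name; the root bracket assumes the grid covers the support):

```python
import numpy as np
from scipy.optimize import brentq

def make_density(f_vals, dx):
    """Shift a possibly-negative estimate down until its positive part
    integrates to one, then set the negative part to zero (cf. [11]).

    f_vals: estimator values on a uniform grid with spacing dx.
    """
    # mass(c) = integral of the positive part of (f - c), minus one;
    # it is continuous and decreasing in c, so bracketing a root is easy
    def mass(c):
        return np.sum(np.clip(f_vals - c, 0.0, None)) * dx - 1.0
    c = brentq(mass, f_vals.min() - 1.0, f_vals.max())
    return np.clip(f_vals - c, 0.0, None)
```

The output is nonnegative and integrates to one on the grid, as required of a density.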
Another practical drawback of estimator Φ̂^N_sc is that the direct computation of the empirical characteristic function ∆ can be expensive: O(N · M) exponential evaluations, where M is the number of frequency points s ∈ R^d. Noting from definition (C.10) that ∆ takes the form of a discrete Fourier transform Σ_k a_k e^{is·z_k} (C.33), where the Fourier coefficients a_k are all equal to 1, the idea of using the Fast Fourier Transform (FFT) algorithm proposed by [3] seems natural. However, the latter only applies to uniformly spaced data, which is not the case of {z_k}_{k=1}^N. For this reason, [21] proposed to use an implementation of the Nonuniform Fast Fourier Transform (NUFFT) developed by [12].
The NUFFT consists in interpolating the original data points {z_k}_{k=1}^N onto a new equispaced grid {z̄_j}_{j=1}^Ñ by means of another Gaussian kernel density estimator f̄. As computing {f̄(z̄_j)}_{j=1}^Ñ still takes O(Ñ · N) operations, [8] suggested using only N_c < N surrounding points of {z_k}_{k=1}^N to evaluate each new grid point z̄_j. We obtain an overall complexity of O(N_c · Ñ + M log M), which, when N_c < M ≤ N, improves on the O(N · M) of the direct DFT formulation.
The analysis conducted in [12] indicates that single precision can be achieved in (C.35) by normalizing the data {z_k}_{k=1}^N to the range [0, 2π] and setting 2N_c = 12, Ñ = 2N and σ² = 24/N². Hence, for a desired precision, this step of the algorithm introduces no additional hyperparameter to be tuned (see [12] for the double-precision settings).
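As a toy illustration of the gridding idea (not the actual Gaussian-kernel NUFFT of [12]), one can bin the normalized data to the nearest node of a uniform grid and read off the ECF at all integer frequencies from a single FFT. The helper `ecf_via_fft` below is hypothetical and far less accurate than proper Gaussian gridding, but structurally similar.

```python
import numpy as np

def ecf_direct(s, u):
    """O(N * M) direct evaluation of the ECF at frequencies s."""
    return np.exp(1j * np.outer(s, u)).mean(axis=1)

def ecf_via_fft(u, n_grid=4096):
    """Crude gridded ECF at integer frequencies k = 0, 1, ..., n_grid - 1.

    u is assumed normalised to [0, 2*pi] (as recommended in [12]); each point
    is binned to its nearest grid node, so one FFT yields all frequencies.
    Nearest-node binning stands in for the Gaussian gridding of [12].
    """
    idx = np.round(u / (2 * np.pi) * n_grid).astype(int) % n_grid
    w = np.bincount(idx, minlength=n_grid) / len(u)
    # np.fft.fft uses e^{-2*pi*i*k*j/n}; conjugate to match e^{+i*k*u}
    return np.conj(np.fft.fft(w))
```

For low frequencies, the phase error of nearest-node binning is at most k·π/n_grid, so the gridded values agree closely with the direct O(N·M) evaluation.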