A PARTIAL GRAPHICAL MODEL WITH A STRUCTURAL PRIOR ON THE DIRECT LINKS BETWEEN PREDICTORS AND RESPONSES

. This paper is devoted to the estimation of a partial graphical model with a structural Bayesian penalization. Precisely, we are interested in the linear regression setting where the estimation is made through the direct links between potentially high-dimensional predictors and multiple responses, since it is known that Gaussian graphical models enable to exhibit direct links only, whereas coeﬃcients in linear regressions contain both direct and indirect relations (due e.g. to strong correlations among the variables). A smooth penalty reﬂecting a generalized Gaussian Bayesian prior on the covariates is added, either enforcing patterns (like row structures) in the direct links or regulating the joint inﬂuence of predictors. We give a theoretical guarantee for our method, taking the form of an upper bound on the estimation error arising with high probability, provided that the model is suitably regularized. Empirical studies on synthetic data and a real dataset are conducted


Introduction
We are interested in the recovery and estimation of direct links between high-dimensional predictors and a set of responses.Whereas the graphical models seem a natural way to go, we propose to take account of a prior knowledge on the predictors, when possible.This is typically the case when dealing with genetic markers whose joint influence may be anticipated thanks to some kind of genetic distance, or when the predictors are supposed to represent a continuous phenomenon so that consecutive covariates probably act together.In this regard, while taking up the graphical approach, we introduce some Bayesian information in a structural regularization of the estimation procedure, although the inference remains frequentist, thereby following the idea of Chiquet et al.
to the very complete handbook recently edited by Maathuis et al. [16].We also refer the reader to the book of Hastie et al. [11] and to the one of Giraud [10], both related to the standard high-dimensional statistical methods.Before introducing the model and the organization of this work, let us describe the notation used throughout the paper.

Notation
For any matrix A, |A| * = vec(A) * is the elementwise * norm of A and |A| − * is |A| * deprived of the diagonal terms of A. We also note A F = |A| 2 the Frobenius norm of A and A 2 the spectral norm of A. The Frobenius inner product between any matrices A and B of same dimensions is A, B = vec(A), vec(B) = tr(A t B) whereas u, v = u t v is the inner product of the Euclidean real space.For any vector u, |u| 0 is the number of non-zero values in u.For a matrix A, [A] C is to be understood as the matrix A whose elements outside of the set of coordinates C are set to zero and vec(A) is the vectorization of A into a column vector.The eigenvalues of a square matrix A of size d with spectrum sp(A) are λ i (A) taken in decreasing order (from λ 1 (A) = λ max (A) to λ d (A) = λ min (A)).The cones of symmetric positive semi-definite and definite matrices of dimension d are S d + and S d ++ respectively.

The partial graphical model
In the classic Gaussian graphical model (GGM) setting, we aim at estimating the precision matrix Ω = Σ −1 of jointly normally distributed random vectors Y ∈ R q and X ∈ R p with zero mean and covariance Σ.The point is that it induces a graphical structure among the variables and the support of Ω is closely related to the conditional interdependences between them.Let us consider, now and in all the study, the sample covariances of n independent observations (Y i , X i ), denoted by Maximizing the penalized likelihood of a GGM boils down to finding Ω ∈ S p+q ++ that minimizes the convex objective L n (Ω) = − ln det(Ω) + S (n) , Ω + λ pen(Ω) (1.2) where S (n) is the full sample covariance built from the blocks (1.1).The penalty function pen(Ω) is usually |Ω| 1 or even |Ω| − 1 .Efficient algorithms exist to get solutions for (1.2), see e.g.Banerjee et al. [2], Yuan and Lin [28], Lu [15] or the graphical Lasso of Friedman et al. [9].The reader may also look at the theoretical guarantees of Ravikumar et al. [21].However, thinking of X i as a predictor of size p associated with a response Y i of size q, the partial Gaussian graphical model (PGGM), developped e.g. by Sohn and Kim [26] or Yuan and Zhang [29], appears as a powerful tool to exhibit direct relationships between the predictors and the responses.To understand this, consider the decomposition into blocks where Ω yy ∈ S q ++ , Ω yx ∈ R q×p and Ω xx ∈ S p ++ and where the same goes for Σ xx , Σ yx and Σ xx .The precision matrix Ω = Σ −1 satisfies, by blockwise inversion, The conditional distribution peculiar to Gaussian vectors gives a new light on the multiple-output regression Y i = B t X i + E i with Gaussian noise E i ∼ N (0, R), through the reparametrization B = −Ω t yx Ω −1 yy and R = Ω −1 yy .Whereas B contains direct and indirect links between the predictors and the responses (due e.g. to strong correlations among the variables), Ω yx only contains direct links, as it is shown by the graphical models theory.In other words, the direct links are closely related to the concept of partial correlations between X and Y (see Meinshausen and Bülmann [17] or Peng et al. [19], for the univariate case).For example, the direct link between predictor k and response may be evaluated through the partial correlation Corr(Y , X k | Y = , X = k ) contained, apart from a multiplicative coefficient, in the -th row and k-th column of Ω yx (see e.g.Cor.A.6 in [10]) with the particularly interesting consequence that the support of Ω yx is sufficient to identify direct relationships between X and Y .Hence, in the partial setting, the objective reduces to the estimation of the direct links Ω yx together with the conditional precision matrix of the responses Ω yy .Maximizing the penalized conditional log-likelihood of the model now comes down to minimizing the new convex objective over (Ω yy , Ω yx ) ∈ S q ++ × R q×p for some usual penalty functions.It is worth noting that pen(Ω yx ) often plays a crucial role in modern statistics dealing with high-dimensional predictors (and the natural choice is |Ω yx | 1 to get sparsity) while we may choose λ = 0, because the number of responses is generally small.In the seminal papers [26,29], the authors consider |Ω yy | 1 and |Ω yy | − 1 for pen(Ω yy ), respectively.Yuan and Zhang [29] also point out that no estimation of Ω xx is needed anymore.In a graphical model, the estimation of Ω yx and Ω yy depends on the accuracy of the estimation of Ω which, in turn, is strongly affected by the one of Ω xx , especially in a high-dimensional setting.The partial model overrides this issue, the focus is on Ω yx and Ω yy while Ω xx has disappeared from the objective function (1.4).The latter is obtained either by considering the multiple-output Gaussian regression scheme, or, as it is done in [29], by eliminating Ω xx thanks to a first optimization step in (1.2).In this paper, we will consider the penalties which correspond to the PGGM (Gm) of [29].The Spring (Spr) of [7] can also be seen as a PGGM but with no penalty on Ω yy (replaced with an additional structuring one on Ω yx , we will come back to this point thereafter), so for Spr we may consider λ = 0.The generalized procedure (GenGm) at the heart of the study relies on a combination between these two approaches.We will see in due time that we keep both the penalties of Gm and the structuring one of Spr on Ω yx .Finally, the intermediate solution consisting in estimating Ω yy and B through the conditional distribution ) with penalizations both on B and Ω yy , presented and analyzed by Rothman et al. [23] and by Lee and Liu [14], is better known as a multivariate regression with covariance estimation (MRCE).However, it has been shown that the objective function suffers from a lack of convexity and that the optimization procedure may be debatable, in addition to the less convenient setup for statistical interpretation (B contains both direct and indirect influences) compared to PGGM.Without claiming to be exhaustive, let us conclude this quick introduction by citing some related works, like the structural generalization of the Elastic-Net of Slawski et al. [25], the Dantzig approach of Cai et al. [6] put in practice on genomic data [5], the greedy research of the non-zero pattern in Ω of Johnson et al. [13], the approach of Fan et al. [8] using a non-convex SCAD penalty to reduce the bias of the Lasso in the estimation of Ω, the eQTL data analysis of Yin and Li [27] which makes use of a sparse conditional GGM, and so on.All the references inside will complete this concise list.

Organization of the paper
To sum up, we have two goals in this paper: 1. Give some theoretical guarantees to the (slightly modified) model introduced in Chiquet et al. [7].2. Generalize the result of Yuan and Zhang [29] to the case where a structural penalization is added in the estimation step.
In Section 2, we introduce the model, consisting in putting a generalized Gaussian prior on the direct links before the procedure of estimation of Ω yy and Ω yx , and we detail the new convex objective.Then we provide some error bounds for our estimates, useful as theoretical guarantees of performance.Section 3 is devoted to empirical considerations.We explain how we deal with the minimization of the new objective and we test the method on simulations first, and next on a real dataset (a Canadian average annual weather cycle, see e.g.[20]).After a short conclusion in Section 4, we finally prove our results in Section 5.The numerous constants appearing in the results and the proofs are gathered in the Appendix, for the sake of readability.

A generalized Gaussian prior on the direct links
We use the definition given in formulas ( 1)-( 2) of [18] for the so-called d-dimensional multivariate generalized Gaussian GN (0, 1, V, β) distribution with mean 0, scale 1, scatter parameter V ∈ S d ++ and shape parameter β > 0. According to the authors, the density takes the form of where Γ is the Euler Gamma function.
We clearly recognize the Gaussian N (0, V ) setting for β = 1.Moreover, for β = 1/2, it can be seen as a multivariate Laplace distribution whereas it is known to converge to some uniform distribution as β → +∞.The marginal shapes (d = 1 and V = 1) of the distribution are represented in Figure 1, depending on whether β < 1, β = 1 or β > 1.Our results hold for all β 1 but, as will be explained in due course, we shall not theoretically deviate too much from the Gaussianity in the prior (even if we will allow ourselves some exceptions in the practical works).The usual Bayesian approach for multiple-output Gaussian regression having B as matrix of coefficients and R as noise variance consists in a conjugate prior vec(B) ∼ N (b, R ⊗ L −1 ) for some information matrix L ∈ S p ++ and a centering value b (see e.g.Sect.2.8.5 of [22]).In the PGGM reformulation, we have R = Ω −1 yy and B = −Ω t yx Ω −1 yy as explained in Section 1, and of course we shall choose b = 0 to meet our purposes.Thus, is a natural prior for the direct links (this is in particular the choice of the authors of [7]).Following the same logic, let us choose Ω yy ⊗ L −1 for scatter parameter and suppose that vec(Ω t yx ) ∼ GN (0, 1, Ω yy ⊗ L −1 , β). (2.1) In this way, we can play on the intensity of the constraint we want to bring on Ω yx , from a non-informative prior to quasi-boundedness through Laplace and Gaussian distributions.This prior entails an additional smooth term acting as a structural penalization in the objective (1.4) that becomes with three regularization parameters (λ, µ, η).The smooth penalization lends weight to the prior on Ω yx and thereby plays on the extent of shrinkage and structuring through β, whereas |Ω yx | 1 and |Ω yy | − 1 are designed to induce sparsity.One can note that this is closely related to the log-likelihood of a hierarchical model of the form where the emphasis is on Ω yx in the prior and Ω yy remains a fixed parameter, although it is important to see that, in this work, the estimation step does not rely on a posterior distribution.The following proposition is related to the existence of a global minimum for our objective (2.2) with respect to (Ω yy , Ω yx ) as soon as β 1.
Proof.See Section 5.2.Now and throughout the rest of the paper, denote by θ = (Ω yy , Ω yx ) ∈ Θ = S q ++ × R q×p the (q × (q + p))matrix of parameters of the model, with true value θ * = (Ω * yy , Ω * yx ).As it is usually done in studies implying sparsity, we will also consider S of cardinality |S|, the true active set of θ * defined as S = {(i, j), θ * i,j = 0}, and its complement S. Our results also depend on some basic assumptions related to the true covariances of the Gaussian observations, and we will assume that the following holds.
This is a natural hypothesis in our framework, in particular we suppose that there is at least a link between X and Y .
Remark 2.2 (Null model).Even if it is of less interest, our study does not exclude the case where Ω * yx = 0. Indeed, we might as well consider that Ω * yx = 0 and get the same results, but some constants should be refined.On the other hand, Σ * xx ∈ S p ++ and Ω * yy ∈ S q ++ are crucial.
Under (H 1 ), the random matrices and are going to play a fundamental role, especially h a and h b .Let us now provide some theoretical guarantees for the estimation of θ in our model, provided that the regularization parameters are located in a particular area (λ, µ, η) ∈ Λ.Consider the penalized likelihood λ,µ,η (θ) given in (2.2), and estimate θ by the global minimum obtained for β 1.To facilitate reading, we postpone the precise definition of the numerous constants to the Appendix.We recall that p is the number of predictors, q is the number of responses and |S| is the size of the true active set.
1, e λ > 0 and e µ > 0, and assume that the regularization parameters satisfy (λ, µ, η) where for some non-random constants s L , a and b defined in (A.2) and (A.3), and the random constants h a and h b given above.Then, under (H 1 ), there exists absolute constants b 1 > 0 and b 2 > 0 such that, for any 0 < b 3 < 1 and as soon as n > n 0 , with probability no less that 1 − e −b2n − b 3 , the estimator (2.5) satisfies where γ r,η,β,p , c λ,µ and m * are technical constants defined in (A.7), (A.8) and (A.9), respectively, and where the minimal number of observations is given by with s α defined in (A.5) and r * in (A.6).
Among all these constants, we can note that s L , a , b , h a and h b are useful to properly describe and restrict Λ, the domain of validity of (λ, µ, η) for the theorem to hold.Once Λ is fixed, the other constants take part in the upper bound of the estimation error.However, as it stands, the theorem is very difficult to interpret.The next two remarks seem essential to have an overview of the orders of magnitude involved for the number of observations, for p and q, for the estimation error and for the regularization parameters.
Remark 2.4 (Validity band).Of course the degree of sparsity |S| is crucial in the estimation error, but it also plays an indirect role in the probability associated with the theorem and in the numerous constants.In virtue of Lemma 5.12, we can hope that λ and µ have a wide validity band, by adjusting c λ , c µ , d λ and d µ .In turn, η also has a non-negligible area of validity, provided of course that a , b and s L , all depending on combinations between Ω * yx , Ω * −1 yy and L, are small enough.Accordingly, it would be to our advantage if L was both sparse and not chosen with too large elements.As it always appears together with η, we may as well take a normalized version of L (e.g.|L| ∞ 1).
Remark 2.5 (Order of magnitude).Even if the result holds for any β 1, the terms ∝ p β−1 appearing in some upper bounds of the proof clearly argue in favor of a moderate choice β ∈ [1, 1 + ] for a small > 0, depending on p.In other words, we cannot deviate too much from the Gaussianity in the prior on the direct links.For example in a very high-dimensional setting (p ∼ 10 7 ), choosing = 0.1 leads to p β−1 ≈ 5 whereas we may try larger values of for the more common high-dimensional settings p ∼ 10 3 or p ∼ 10 4 .By contrast, we can see that n 0 must (at least) grow like q for the theorem to hold, so high-dimensional responses are excluded.However in multiple-output regressions, even when p is extremely large, q generally remains small.According to all these considerations, we may roughly say that, in a high-dimensional setting with respect to p, with a large probability, under a suitable regularization of the model.We recognize the usual terms appearing in the error bounds of regressions with high-dimensional covariates, like the 2 error of the Lasso (see e.g.Chap. 11 of [11]).This is the same bound as in [29], but our additional structural penalty restricts Λ.

Simulations and real dataset
The minimization problem (2.5) is solved using a coordinate descent procedure, alternating between the computations of Ω yy = arg min Each step is done by an Orthant-Wise Limited-Memory Quasi-Newton (OWL-QN) algorithm (see e.g.[1]).The first subproblem is performed through half-vectorization (vech) to ensure symmetry and we set the objective to +∞ on S q ++ to ensure positive definiteness of the solution.The coordinate descent is stopped when following two consecutive iterations t − 1 and t, where > 0 is a small threshold depending on the desired precision.We are now going to try our method on synthetic data first, and then on a real dataset.We will pay attention to the role played by β, in particular we will see that it can be useful as well as counterproductive, depending on the situations.

Simulations
For each scenario, we first generate i.i.d.standard Gaussian vectors X i ∈ R p , then Y i ∈ R q is simulated according to the setting and we estimate Ω yy and Ω yx .From the relations detailed in Section 1, we recall that In a compact form, we may also write where the i-th row of Y is Y t i and the i-th row of X is X t i .Thus, we can estimate B using the Lasso (Las) and the Group-Lasso (GLas) in the vectorized form, to provide a basis for comparison between our method and the usual penalized methods.The Lasso penalty is obviously vec(B) 1 to promote coordinate sparsity while, for the Group-Lasso, we use the penalty B 1 2 + . . .+ B p 2 where B i is the i-th row of B, to promote row sparsity and exclude altogether some predictors from the model.We also implement some variants of our generalized graphical model (GenGm).The case where Ω yy = R −1 is known and does not need to be estimated is the Oracle (Or) and the case where η = 0 so that β has no influence is the classic PGGM (Gm).The case where λ = 0 and β = 1 is called the Spring (Spr) by the authors of [7].We will focus on structured scenarios.With no structure in Ω yx , there is no reason why our method should outperform the usual PGGM.In a completely random setting, we have observed that all PGGM procedures perform identically.In fact, a slight gain can be obtained compared to Spr and Gm simply due to the flexibility induced by the additional parameter (Spr and Gm are particular cases of GenGm).However, that clearly cannot counterbalance the extended computational times, and GenGm should not be used for such situations.The calibration of the regularization parameters is made using a cross-validation on a training set of size n t = 150 and the accuracy is evaluated thanks to the mean squared prediction error (MSPE) on a validation set of size Due to the large amount of treatments, the grids for cross-validation are not very sharp here but they will be carefully refined for the real dataset of the next section.The covariance between the outputs is R = (r |i−j| ) 1 i, j q for r = 1 2 and we work with p = 100.Each scenario is repeated N = 500 times and GenGm is evaluated with numerous values of β, from 0.25 to 2 with a step of 0.25.The results of the following scenarios are summarized in Figures 2-4, respectively.
-Scenario 2 (q = 2).We draw ω = ± 1 2 and one randomly selected row of Ω yx is filled with ω while the other is identically 0. -Scenario 3 (q = 3).We draw ω i = ± 1  2 and we fill a randomly selected section of size 30 on the i-th row of Ω yx with ω i , for i = 1, 2, 3.The remaining part of Ω yx is 0.
The row structure is promoted by a normalized first finite difference operator which, through Ω yx L Ω t yx , tends to penalize the difference between two consecutive values on a same row (as does Fused-Lasso with 1 penalty).Yet, the Fused-Lasso is not a suitable alternative to GLas and Las in this precise context because B = −Ω t yx Ω −1 yy is not supposed to have a row structure even if Ω yx has one.For this choice of L, one can note that, in the particular case where R = diag(σ 2 1 , . . ., σ 2 q ), where ω i,j is the (i, j)-th element of Ω yx , so we may fairly expect that β 1 is going to strengthen the smoothness of the estimation and to enforce all the more the structuring.
Remark 3.1 (Validity of the hypotheses).We could as well add a small diagonal element in the matrix L defined above, positive semi-definite but not invertible.The resulting effect would be a negligible ridge-like penalization on the elements of Ω yx .This is not required for the estimation procedure but useful for Theorem 2.3 to hold (see e.g.(H 1 )).Likewise, it seemed interesting to test some settings with β < 1 even if the theory developped in the paper does not give any guarantee for them, as a basis for comparison.
First of all, one can observe that Las and GLas are left behind in all our simulations.This is not surprising since the covariance between the outputs cannot be recovered with the standard Lasso, at least for q 2. Generally, GLas remains more robust compared to Las, probably due to the high level of sparsity in Ω yx approximately passed to B (provided that the covariances in R are small enough), and exploited by the grouping effect.In the weakly structured setting (Scenario 1), we also observe that, as expected, all PGGM procedures perform almost identically, with obviously an advantage for Or (although small, illustrating the accuracy of the estimation).In the strongly structured settings (Scenarios 2 and 3), Gm gives results below the expected level, because it is not designed to promote such layouts.On the contrary, thanks to this choice of L showing here great efficiency, GenGm and Spr are doing pretty well.Note that, in this context, GenGm with β = 1 is almost the same as Spr since, q being small, λ does not play a crucial role.However, some empirical facts draw our attention: the prediction error decreases with β to some extent, but the most interesting fact seems to be the simultaneous decrease of its variance.It is likely that the increasing pressure exerted by β on the estimation procedure leads to a higher homogeneity in the numerical results, despite the repetitions of random experiments under random settings.In other words, the structuring seems to be strengthened and we also observe that the convergence of the algorithm is faster, which logically follows from the latter remarks (especially clear when we compare β = 0.25 and β = 2).On the other hand, for the opposite reason, we notice that the predictions are hardly better than Gm (even worse in some cases), both on average and in terms of variability, for β < 1, and these simulations tend to undermine such values of the hyperparameter.On the whole, GenGm with β > 1 might be a sound approach for practitioners who place a high priority on structuring the estimations, even if Remark 3.2 below should probably temper this statement.To conclude, let us consider the strongly structured scenarios  with L = I p (without structuring) in the Oracle setting with β = 2, and let us compare the results with those of Figures 3 and 4, obtained with the correct version of L given in (3.2).The results are displayed in Figure 5 where we can see that the benefit of structuring is manifest.Unsurprisingly, the results without structuring are close to those of Gm since L = I p only strengthens the shrinkage effect with ridge-like additional penalties.Remark 3.2 (Computational time).To estimate (Ω yy , Ω yx ) in the model Spr, the authors of [7] use a very judicious and efficient method relying, in each step of the coordinate descent procedure, on a direct computation of the estimation of Ω yy together with an Elastic-Net estimation of Ω yx .This is possible for λ = 0 and β = 1, but unfortunately cannot be implemented in the general setting.As a result, computational times remain an issue that should be paid attention to.
Remark 3.3 (Oracle-type errors).The mean value of the estimation errors Ω yx − Ω yx 2 F leads to the same kind of observations for the models being compared in the simulations.But the minimal prediction error does not always coincide with an optimal support recovery due to the shrinkage effect on the estimation of Ω yx , when the coefficients or the covariates are not very contrasting.The so-called F -score is given by F = 2 p r r e p r + r e where p r = TP TP + FP and r e = TP TP + FN are the precision and the recall, respectively, and where T/F and P/N stand for true/false and positive/negative.In the strongly structured scenarios, F is generally located between 0.60 and 0.65, and a deeper analysis shows that a proportion of more than 0.99 of true non-zero values are recovered (that is, the part of the true active set S related to Ω yx ).If the models are not calibrated to reach the best prediction error but the best F -score, F regularly exceeds 0.90, at least for the structured procedures.
Nevertheless, Scenarios 2 and 3 are very strongly structured, more than one would expect from an unknown underlying generating process, and the real dataset of the next section is going to highlight the fact that the improvement may be hardly noticeable with respect to β.But we will see that β can still be useful for variable selection.

A real dataset
The dataset available as CanadianWeather in the R package fda contains daily temperature and precipitation at 35 different locations in Canada, averaged over annual reports starting in 1960 and ending in 1994 (see e.g.[20]).We intend to look at the direct links between the minimal and maximal rainfall (on the log 10 scale) and the temperature pattern in the 35 weather stations, so as to identify the times of the year that have a strong effect on rainfall (positive as well as negative).In this context, n = 35, q = 2 and p = 365.Figure 6 shows temperature and log-precipitation measured over a year in Montreal, chosen as an example, together with the empirical distribution of the minimal and maximal log-precipitation for the 35 weather stations.We can note that, since the data are averaged over numerous years, outliers are unlikely even for the extremes (min and max).
Some authors (see e.g.[24]) have already highlighted the pertinence of using the matrix L defined in (3.2) in this dataset, because the predictors are ordered temporally so that the selection of isolated days instead of relevant sequences of days seems an unreliable procedure for statistical interpretation.To assess the models, we repeat N = 100 times the following experiment: n t = 25 observations are randomly selected for calibration (via 2-fold cross-validation) and estimation, the remaining n v = 10 observations are used to compute the MSPE (3.1) related to the prediction of the minimum (min p ) and maximum (max p ) precipitation.We can see in Figure 7 that all structured PGGM perform almost identically, with the phenomenon described in the previous section still visible but to a lesser extent.We can even notice that structuring is hardly beneficial for this dataset, from a purely numerical point of view.This conclusion can also be found in [24], where the author compares the structured Elastic-Net with unstructured alternatives to predict the 0.25-, 0.50-and 0.75-quantiles of the logprecipitation, through independent regressions.But we will see that, in terms of variable selection and statistical interpretation, L and β still have a substancial role to play.
The point is that we have observed that the best prediction error does not usually coincide with a sparse solution (see Rem. 3.3) when the coefficients or the covariates are not very contrasting.In particular, this was the case of our simulation study with ± 1 2 coefficients and N (0, 1) covariates.So, just as they look at the Lasso's regularization paths, practitioners may choose the desired degree of sparsity, depending on p/n, by adjusting the hyperparameters.Here, on the basis of the MSPE, most of the time we must retain µ 10 −2 and only a few direct links are set to zero.To look for sequences of days directly related to min p and max p , we decided to constraint µ 10 −2 and focus on variable selection.The active set of Ω yx is evaluated on the basis of n t = 25 randomly chosen observations.The experiment is repeated N = 100 times, and the locations having a frequency of occurrence that exceeds 0.5 are retained (or, equivalently, those whose estimates have a non-zero median).This can be seen as a measure of variable importance.The results are given in Figures 8 and 9 for min p and max p , respectively, with a fixed set of regularization parameters and increasing values of β.The objective is to show the influence of the latter, all other things being equal.The colored areas highlight the days having a frequency of occurrence, represented by gray crosses, that exceeds 0.5 in the N = 100 repetitions of the experiment.Note that, since we retain λ = 0 in these experiments, GenGm for β = 1 coincides with Spr.We can see that the increasing pressure exerted by β on the estimation procedure tends to refine the selection by giving priority to the most important variables and by dropping the others much more easily, at the cost of  that we observe for the estimated active sets is clearly a guarantee of quality for the selected variables.The median values of the estimated direct links between the temperature of the days and the pair (min p , max p ) are represented in Figure 10 together with the estimated regression coefficients, for β = 2.We recall that the relation B = −Ω t yx Ω −1 yy simply leads to We detect sequences of influent days in November, December, January and February, especially related to min p , positively at the end of the year and negatively at the beggining.This is broadly consistent with the analysis of [24] -even if the responses are not extremes but quantiles in it -with however two differences: the regression coefficients associated with max p are much lower compared to min p whereas it is not that clear in the reference, and an activity is also detected between July and August.The main explanation, at least for the first of them, probably lies in the use of graphical models that take into account the correlation between responses.Indeed, as can be seen in Figure 11 which gives an overview of the estimation of R obtained from the repeated experiments, a non-zero correlation is detected between the responses (≈ 0.32).The influence of November and December on all quantiles and that of January and February on the 0.75-quantile in [24] might actually be an artificial effect of the correlation with the 0.25-quantile.This is what our study suggests by highlighting min p compared to max p : the 'real' effect appears to be on min p whereas max p seems to react only through a phenomenon of correlation with min p .From this point of view, the interest of graphical models instead of independent regressions is particularly obvious.Let us also mention that, interestingly enough, we notice that the role of η tends to depreciate for the large values of β.For example, for the same regularization parameters (λ, µ) = (0, 0.05) and β = 2, the difference between the estimated active sets for η = 0.1 and η = 1 is almost negligible (depending on the experiments, between 1 and 3 days are concerned, on average).Based on these studies and observations, we might conclude that β is insignificant when we are interested in the best prediction error on a validation set (even counterproductive with respect to computational times, e.g.compared to Spr), whereas it seems to have a substancial role  to play when focusing on selection, by accelerating the discrimination of variables.In the first case, η has to be carefully adjusted while in the second case, β will quickly help to reach the desired sparsity.
Remark 3.4 (Structure matrix).For the simulations and the real dataset, we have used the popular first finite difference operator given in (3.2).Other examples can be found in the literature, like the promotion of a genetic distance for genomic selection in Brassica napus [7] or the bidimensional discretization of the Laplacian to work on handwritten digit recognition [24].More generally, L can be used in a classic Bayesian prior supposed to promote some covariance structure on the direct links, with no 'physical' structuring in mind (like temporal, spatial or genetic proximity).

Conclusion
In conclusion, our work is a generalization of [29], using the same technical tools to establish an upper bound on the estimation error when a prior on the direct links generates an additional structural penalty in the objective, provided that the model is suitably regularized.Our work is also an improvement of [7] since, while being inspired by the methodology of the authors, we generalize the prior and give some theoretical guarantees.The empirical study shows that the hyperparametrization in the prior, although more expensive in adjusting the parameters, is likely to refine the selection results but clearly, this does not appear as a crucial improvement compared to the two previous points.Let us conclude the paper by highlighting two weaknesses that might be trails for future studies.On the one hand, the Laplace distribution is often used as a prior in the Bayesian Lasso (see e.g.Sect.6.1 of [11]).However, our reasonings do not allow β = 1/2, which may correspond to a multivariate Laplace distribution on the direct links.Combined with the first finite difference operator L, the choice β = 1/2 could generate a Fused-Lasso-type penalty.In this regard, it would be challenging and interesting to obtain some theoretical guarantees for β 1/2 and not only for β 1, even if our probably too brief simulation study does not encourage the choice of β < 1.On the other hand, λ = 0 is a natural choice when q is small (this is in particular the configuration of [7]), not to mention that it is computionally faster.But, the proof of our theorem needs λ > c λ h a > 0 to hold.We think that a reasoning enabling to deal with λ = 0 should also be beneficial to the study.More generally, it would be instructive to consider a very high-dimensional setting (p n and not only p ∼ 10 2 although always larger than n, as in our experiments).Such studies should follow with omic data.

Technical proofs
We start in a first part by some useful linear algebra lemmas that will be repeatedly used subsequently, well-known for most of them.In a second part, we prove the joint convexity of the objective and our main result.

Linear algebra Lemma Let A ∈ S d
+ and U ∈ R d× .Then, U t AU ∈ S + .Proof.Since A is symmetric with non-negative eigenvalues, there is an orthogonal matrix P such that A = P DP t with D = diag(sp(A)) ∈ S d + .Thus, for all v ∈ R , it follows that v, U t A U v = D 1/2 P t U v 2 0.
Proof.The equality AB = A 1/2 (A 1/2 B A 1/2 ) A −1/2 shows that AB and A 1/2 B A 1/2 are similar, so they must share the same eigenvalues.From Lemma 5.1, λ i (A 1/2 B A 1/2 ) 0 .Proof.On the one hand, λ max (AB) AB 2 A 2 B 2 = λ max (A) λ max (B), since A and B are symmetric and since, from Lemma 5.2 and by hypothesis, all eigenvalues appearing in the relation are non-negative.Suppose now that B is invertible so that both A −1 and B −1 belong to S d ++ .Then, λ max ((AB) −1 ) λ max (A −1 ) λ max (B −1 ) and this immediately gives λ min (AB) λ min (A) λ min (B).If B is not invertible, the relation trivially holds since we still have λ min (AB) 0 from Lemma 5.2.
Lemma 5.5.Let A ∈ S d + and U ∈ R d× .Then, Proof.Denote by u i the i-th column of U .It is not hard to see that the i-th diagonal element of U t AU is The upper bound stems from 0 Lemma 5.6.Let A and B be symmetric matrices of same dimensions.Then, Proof.These are just two special cases of Weyl inequalities.We refer the reader to Theorem 4.3.1 of [12], for example.

Convexity of the objective
We know from Proposition 1 of [29] and the convexity of the elementwise 1 norm that L n (Ω yy , Ω yx ) − η L, Ω t yx Ω −1 yy Ω yx β is itself convex, but it remains to show that this is still the case with the additional smooth penalty.

Proof of Proposition 2.1
Recall that Θ = S q ++ × R q×p and consider the mapping Φ : Θ → S p + defined as We can already note from Lemma 5.1 that tr(Φ(A, B)) 0.Moreover, for all 0 h 1 and all is the Schur complement of hA 1 + (1 − h)A 2 in the matrix But the decomposition 2) is symmetric and positive semi-definite.It is well-known (see e.g.Appendix A.5.5 of [4]) that in that case, the Schur complement (5.1) must also be positive semi-definite.Consequently, for Z i = (Ω i,yy , Ω i,yx L 1/2 ), i = 1, 2, taking the trace of S h (Z 1 , Z 2 ) and considering β 1, where . This convexity inequality concludes the proof.
For the sake of readability, we refer the reader to the Appendix for the numerous constants that are about to appear in the following lemmas and proofs.Thereafter, N r,α (θ * ) will always refer to α in (A.4) and r * in (A.6), while the second hypothesis (H 2 ) given below is to be assumed with the smallest integer greater than s α in (A.5).This is a random hypothesis, which will be controlled with a probability, related to the proximity between the empirical covariance and the true covariance of the predictors, since we recall that S (n) has no reason to be an excellent approximation of Σ * when p n.This is also assumed by the authors of [29], it is a kind of restricted isometry propertie (RIP), well-known in high-dimensional studies.In particular, we will see through Lemma 5.12 that it is satisfied with high probability provided that n is large enough.
In addition, The next two lemmas give some bounds for expressions that will appear repeatedly.
Lemma 5.7.Under (H 1 ) and (H 2 ), for all θ ∈ N r,α (θ * ), we have the bound where ω S is given in (A.1).In addition, Proof.Similar reasonings may be found in the proofs of Lemmas 1-2 of [29].We simply reworked the constants to make them stick to our study.
Lemma 5.9.Assume that λ, µ and η are chosen according to the configuration of the theorem.Then, under (H 1 ), the estimation error satisfies where α > 0 is given in (A.4).
Proof.Taking t = 1 in the Taylor expansion (5.4) with θ = θ and considering the definition of φ in (5.5), by convexity, The first derivative of φ will be explicitely computed in (5.11).For t = 0, we find φ (0) = − Ω * −1 yy , δϑ yy + S (n) yy , δϑ yy + 2 where s L is given in (A.3), where, through the blockwise relations (1.3), we recognize the random matrices A n (with max norm h a ) and B n (with max norm h b ) defined in (2.3) and (2.4), and where, coming from the structural regularization term, Whence it follows from the well-known relation |tr( where M 1 and M 2 are compatible matrices, that making use of the constants (A.3), λ c λ h a and µ c µ h b .For the sake of clarity, let For all θ ∈ Θ, from the triangle inequality and the fact that, as Ω * yy is positive definite, the diagonal must belong to S, i.e. (j, j) ∈ S for all 1 j q so that any square matrix M of size q is such that [M ] S has diagonal elements all equal to zero.A similar bound obviously holds for where Thus, provided that c > 0, which is stated in the configuration of the theorem, it only remains to note that, necessarily, The identification of α given in (A.4) easily follows.
Proof.From the definition of φ in (5.5) and the fact that φ (5.9) To simplify the calculations, let The second derivative is tedious to write but straightforward to establish, (5.12) First, from the combination of Lemmas 5.1 and 5.8, we clearly have u L 0. We also note that 0 F for any c = 0 and any matrices M 1 and M 2 of same dimensions.It follows, after some reorganizations, that for any c = 0 and d = 0, Here we exploited the previous inequality twice, u L 0 and β 1.From Lemmas 5.1, 5.3, 5.7 and 5.8, using sp(M 1 M 2 ) = sp(M 2 M 1 ) for square matrices M 1 and M 2 , we obtain where ω L is defined in (A.1).Replacing L by S xx and ω L by ω S , a similar bound obviously holds.Suppose that c and d are chosen so that c 1 > 0, d 1 > 0, c 2 < 0 and d 2 < 0.Then, Now choose S > 0 and L > 0 small enough so that S ω S + ηβ p where these positive constants are respectively given by The combination of Lemmas 5.
As a corollary, it holds that δϑ F > r * ⇒ c λ,µ |S| max{h a , h b } r * γ r,η,β,p or, conversely written, c λ,µ |S| max{h a , h b } < r * γ r,η,β,p ⇒ δϑ F r * .Lemma 5.12.Assume that λ, µ and η are chosen according to the configuration of the theorem.Then, under (H 1 ), there exists absolute constants b 1 > 0 and b 2 > 0 such that, for any b 3 ∈ ]0, 1[ and as soon as n max b 1 (q + s α ln(p + q)), ln(10(p + q) 2 ) − ln(b 3 ) , with probability no less that 1 − e −b2n − b 3 both the random hypothesis (H 2 ) is satisfied and the upper bound max{h a , h b } 16 m * ln(10(p + q) 2 ) − ln(b 3 ) n holds, where h a and h b are given in (2.3) and (2.4), s α is defined in (A.5) and m * in (A.9).Hence, one can find a minimal number of observations n 0 such that the theorem holds with high probability as soon as n > n 0 .
Proof.All the ingredients of the proof are established in [29].The authors start by recalling that there exists absolute constants b 1 > 0 and b 2 > 0 such that hypothesis (H 2 ) is satisfied with probability no less than 1 − e −b2n as soon as n b 1 (q + s α ln(p + q)).We also refer the reader to Lemma 5.1 and Theorem 5.2 of [3], or to Lemma 7.4 of [10] for the random bounds of the restricted isometry constants.Afterwards, they prove (see Prop. 4) that, as soon as n ln(10(p + q) 2 ) − ln(b 3 ) for some b 3 > 0, with probability 1 − b 3 , max{h a , h b } 16 m * ln(10(p + q) 2 ) − ln(b 3 ) n .
To find the minimal number of observations, we just need to make sure that the above bound is itself smaller than the one of Lemma 5.11.It is then not hard to see that we may retain the minimal size n 0 given in (2.6).

Appendix A. Some constants
This appendix is entirely dedicated to the constants appearing in the theoretical guarantees.Indeed, a centralization seemed necessary to clarify the rest of the paper, especially the understanding of the main theorem.First, we need to define some constants related to L and to the true values of the model. .
Together with α given above, r * is necessary to build the so-called neighborhood N r,α (θ * ) defined in (5.7), which plays a fundamental role in all our reasonings.It is important to note that, under the configuration of the theorem and hypothesis (H 1 ), α > 0 and r * > 0.Then, Lemma 5. is going to play a significative role in the upper bound of the theorem.

Figure 2 .
Figure 2. Mean squared prediction error for N = 500 repetitions of the weakly structured Scenario 1.

Figure 3 .
Figure 3. Mean squared prediction error for N = 500 repetitions of the strongly structured Scenario 2.

Figure 4 .
Figure 4. Mean squared prediction error for N = 500 repetitions of the strongly structured Scenario 3.

Figure 5 .
Figure 5. Mean squared prediction error for N = 500 repetitions of the strongly structured Scenario 2 (left) and Scenario 3 (right) for Or, Gm and the unstructured Or (L = I p ), with β = 2.

Figure 6 .
Figure 6.Temperature and log-precipitation measured over a year in Montreal (left).Empirical distribution of the minimal and maximal log-precipitation for the 35 weather stations (right).