ASYMPTOTIC ANALYSIS OF A MATRIX LATENT DECOMPOSITION MODEL

Abstract. Matrix data sets arise in network analysis for medical applications, where each network belongs to a subject and represents a measurable phenotype. These large dimensional data are often modeled using lower-dimensional latent variables, which explain most of the observed variability and can be used for predictive purposes. In this paper, we provide asymptotic convergence guarantees for the estimation of a hierarchical statistical model for matrix data sets. It captures the variability of matrices by modeling a truncation of their eigendecomposition. We show that this model is identifiable, and that consistent Maximum A Posteriori (MAP) estimation can be performed to estimate the distribution of eigenvalues and eigenvectors. The MAP estimator is shown to be asymptotically normal for a restricted version of the model.

The need to understand the complex structure of the interactions within networks has driven the development of low-dimensional representations of these networks, with methods like sparse dictionary learning or graph auto-encoders [18,39]. In many cases, the core modeling assumption relies on the low rank of the observed matrices [10]. In that regard, such models can be interpreted as constraints on the distribution of the eigenvalues and the eigenvectors. However, although these recent works have achieved great performance on practical tasks, little has been done in the literature so far to analyze their theoretical soundness.
In this paper, we provide an asymptotic analysis for a recently proposed network data set analysis model [43] which, in terms of generative modeling, can be considered a generalization of several current similar models relying on graph auto-encoders [34] and dictionary learning [19]. The model quantifies the variability in the spectral decomposition of network adjacency matrices: the leading eigenvectors, taking values in the Stiefel manifold, and the related eigenvalues are considered as latent variables in a hierarchical generative model. It relies on the classical assumption that the relevant information in a matrix of interaction coefficients can be captured by a low-rank approximation [49]. The model structure introduced in [43] was shown to be able to account for the complex variability of functional brain networks using a restricted number of parameters, and provides an interpretable representation of this variability.
We first show that the model is identifiable, and consider the parameter estimation problem. We show that, although the Maximum Likelihood Estimator may not be defined, the Maximum A Posteriori estimator exists for wide classes of prior distributions. Finally, we show the almost sure consistency of the estimator and its asymptotic normality as the number of samples goes to infinity. The technical difficulties arise from the hierarchical structure of the model: only a few specific such cases have received attention in the literature. For instance, the identifiability of latent variable models remains an open question for most latent variable network analysis models. Although our results rely on the model structure, we believe that they can be transposed without difficulty to many similar models.

Notations
In the next sections, we use the following notations:
- $A^\top$ denotes the transpose of a matrix $A$, $\operatorname{Tr}(A)$ its trace and $\det(A)$ its determinant;
- $\|x\|$ denotes the canonical Euclidean norm for vectors, and the related operator norm for matrices;
- $\|A\|_F$ denotes the Frobenius norm and $\langle A, B\rangle_F = \operatorname{Tr}(A^\top B)$ the related inner product for matrices;
- if $X$ is an $n \times p$ matrix, $x_i \in \mathbb{R}^n$ denotes its $i$-th column, so that $X = (x_1, \ldots, x_p)$;
- $\mathcal{V}_{np}$ is the Stiefel manifold of $n \times p$ matrices $X$ such that $X^\top X = I_p$;
- $\mathcal{O}_n(\mathbb{R})$ is the orthogonal group $\mathcal{V}_{nn}$;
- for $\lambda$ a vector and $X$ a matrix, we define $\lambda \bullet X = X \operatorname{Diag}(\lambda) X^\top$;
- for $A$ an $n \times n$ matrix and $X$ an $n \times p$ matrix, we define $A * X = (x_i^\top A x_i)_{i=1}^p$.
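For concreteness, the two bilinear operations $\lambda \bullet X$ and $A * X$ can be written in a few lines of numpy; the sketch below is ours (including the function names) and only serves to illustrate the notations.

```python
# Minimal numpy sketch of the two notations above (names are ours, not from the paper):
# lam_dot(lam, X) implements lambda . X = X Diag(lambda) X^T, and a_star(A, X)
# implements A * X = (x_i^T A x_i)_{i=1..p}.
import numpy as np

def lam_dot(lam, X):
    """Return X Diag(lam) X^T for X of shape (n, p) and lam of shape (p,)."""
    return X @ np.diag(lam) @ X.T

def a_star(A, X):
    """Return the vector (x_i^T A x_i) for the columns x_i of X."""
    return np.einsum('ni,nm,mi->i', X, A, X)

# Quick sanity check: if X has orthonormal columns, (lam . X) * X recovers lam.
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((6, 3)))   # a point of the Stiefel manifold V_{6,3}
lam = np.array([3.0, 2.0, -1.0])
assert np.allclose(a_star(lam_dot(lam, X), X), lam)
```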

A statistical model for spectral decomposition
2.1. Model definition

Observations distribution
We study the generative model for sets of weighted graph adjacency matrices $A_1, \ldots, A_N \in \mathbb{R}^{n\times n}$ proposed in [43]. It draws symmetric low-rank adjacency matrices $A$ by generating their eigenvectors $X = (x_1, \ldots, x_p) \in \mathbb{R}^{n\times p}$ and eigenvalues $\lambda = (\lambda_1, \ldots, \lambda_p) \in \mathbb{R}^p$, and combining them with an additive noise $\varepsilon \in \mathbb{R}^{n\times n}$.
In practice, the adjacency matrix $A$ represents a network: $n$ corresponds to the number of nodes (e.g. in the case of brain connectivity, the number of brain regions), and $p \ll n$ is chosen such that the residual term $\varepsilon$ is small. The eigenvectors take values in the Stiefel manifold $\mathcal{V}_{np}$ of matrices such that $X^\top X = I_p$; their probability distribution will be described in the next section. The eigenvalues follow a multivariate Gaussian distribution $\lambda \sim \mathcal{N}(\mu, \sigma_\lambda^2 I_p)$. The noise $\varepsilon$ is a symmetric matrix whose coefficients above the diagonal also follow a Gaussian distribution $\mathcal{N}(0, \sigma_\varepsilon^2 I_{n(n+1)/2})$. We assume that the variables $\lambda$, $X$, $\varepsilon$ are independent. This assumption is strong: it might not be satisfied in practice, as the variation of a pattern $x_i$ should naturally be correlated with a variation of the related $\lambda_i$. However, it also allows keeping a small number of parameters, which allows for robust estimation in practice when the number of observed matrices is low. The interpretation of the parameters will be given in Section 2.2.2 through simpler alternative models.
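As an illustration of this generative mechanism, the following minimal sketch draws a matrix $A = \lambda \bullet X + \varepsilon$. For simplicity, $X$ is drawn Haar-uniformly on the Stiefel manifold (the case $F = 0$ of the eigenvector distribution described in the next section), since sampling a general von Mises-Fisher matrix would require, e.g., rejection sampling; all names and numerical values are ours.

```python
# A minimal generative sketch of the model, with our own naming conventions.
# X is drawn (approximately) Haar-uniformly as a placeholder for the vMF draw.
import numpy as np

def sample_adjacency(n, p, mu, sigma_lam, sigma_eps, rng):
    # Eigenvalues: lambda ~ N(mu, sigma_lam^2 I_p)
    lam = mu + sigma_lam * rng.standard_normal(p)
    # Eigenvectors: a Haar-uniform point of V_{np} (placeholder for the vMF draw)
    X, _ = np.linalg.qr(rng.standard_normal((n, p)))
    # Symmetric Gaussian noise with i.i.d. N(0, sigma_eps^2) entries above the diagonal
    E = rng.standard_normal((n, n)) * sigma_eps
    E = np.triu(E)
    E = E + np.triu(E, 1).T
    return X @ np.diag(lam) @ X.T + E, lam, X

rng = np.random.default_rng(1)
A, lam, X = sample_adjacency(n=20, p=3, mu=np.array([5.0, 3.0, 1.0]),
                             sigma_lam=0.5, sigma_eps=0.1, rng=rng)
```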

Eigenvectors distribution
As an element of the Stiefel manifold $\mathcal{V}_{np}$, the eigenvector matrix $X$ is described by a probability distribution over $\mathcal{V}_{np}$. The canonical framework for these distributions is exposed in [13], and consists in taking a measure with a density with respect to the Haar measure over the Stiefel manifold. The Haar measure $[dX]$ is defined, up to a constant, as the only measure invariant to orthogonal transformations, i.e., for $S \subset \mathcal{V}_{np}$ and $O \in \mathcal{O}_n(\mathbb{R})$, $\int_{OS} [dX] = \int_{S} [dX]$. It can be rescaled by a constant factor to correspond to the Hausdorff measure over $\mathcal{V}_{np}$ [29].
The distribution considered for $X$ is the von Mises-Fisher (vMF) distribution, also called the Matrix Langevin distribution in the literature. It was first introduced by [32], who derived basic properties of the distribution and its Maximum Likelihood Estimator (MLE), and was further studied for both theoretical and algorithmic purposes [12,30,35,46]. The von Mises-Fisher distribution over $\mathcal{V}_{np}$ is defined by its probability density function (p.d.f.) with respect to the Haar measure,
$$p_{\mathrm{vMF}}(X \mid F) = \frac{1}{C(F)} \exp\big(\langle F, X\rangle_F\big) = \frac{1}{C(F)} \exp\big(\operatorname{Tr}(F^\top X)\big),$$
with $C(F)$ the normalizing constant and $F = (f_1, \ldots, f_p) = M \operatorname{Diag}(s) = (m_1, \ldots, m_p)\operatorname{Diag}(s_1, \ldots, s_p)$ the parameter of the distribution ($F \in \mathbb{R}^{n\times p}$). In the model considered here, $M \in \mathcal{V}_{np}$ and the $s_i$'s are non-negative to ensure identifiability. By definition, the modal point $M$ has maximal probability. The $s_i$'s control the spread around the modal point, and are called the concentration parameters of the distribution.
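The unnormalized part of this density is straightforward to evaluate; the sketch below (ours) only computes $\langle F, X\rangle_F$, since the constant $C(F)$ involves a hypergeometric function of matrix argument with no standard scipy implementation.

```python
# Unnormalized von Mises-Fisher log-density on the Stiefel manifold (our own sketch):
# log q(X | F) = <F, X>_F - log C(F).  The constant C(F) is a hypergeometric function
# of matrix argument and has no standard scipy implementation, so it is omitted here.
import numpy as np

def vmf_log_density_unnormalized(X, M, s):
    """X, M in V_{np}; s: non-negative concentration parameters. F = M Diag(s)."""
    F = M @ np.diag(s)
    return np.trace(F.T @ X)

# The modal point X = M attains the maximum of <F, X>_F over the Stiefel manifold:
rng = np.random.default_rng(2)
M, _ = np.linalg.qr(rng.standard_normal((10, 2)))
s = np.array([4.0, 1.5])
Xr, _ = np.linalg.qr(rng.standard_normal((10, 2)))
assert vmf_log_density_unnormalized(M, M, s) >= vmf_log_density_unnormalized(Xr, M, s)
```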
The vMF distribution has a simple interpretation and requires few parameters. It imposes no dependency between the columns of $X$, except the orthogonality constraint. It forms an exponential family of distributions, and as such lends itself to efficient numerical estimation procedures. The normalizing constant $C(F)$ has an analytic expression relying on the hypergeometric function of a matrix argument; it represents the main difficulty when analyzing the distribution, as it prevents one from obtaining an explicit expression of its moments.
With this definition, we can write the full density of the model defined in the previous section. The likelihood of an observed matrix $A$ writes
$$p(A \mid \theta) = \int_{\mathbb{R}^p}\int_{\mathcal{V}_{np}} p(A \mid \lambda, X, \theta)\, p_{\mathrm{vMF}}(X \mid F)\, p(\lambda \mid \mu, \sigma_\lambda)\, [dX]\, d\lambda, \qquad (2.1)$$
with $p(A \mid \lambda, X, \theta) \propto \exp\big(-\sum_{i\le j}(A - \lambda\bullet X)_{ij}^2 / 2\sigma_\varepsilon^2\big)$, where we introduced the notation $\lambda \bullet X = X\operatorname{Diag}(\lambda)X^\top$ to lighten the formula, and $\theta = (F, \mu, \sigma_\lambda, \sigma_\varepsilon)$ regroups the model parameters.
Remark 2.1. The overall model structure (2.1) can be compared with equation (1.11) in [23], which states that, for any continuous probability distribution $p(A)$ over the space of symmetric matrices and any bounded continuous function $h$, the expectation of $h(A)$ can be written as an integral of $h(X\operatorname{Diag}(\lambda)X^\top)$ against a joint distribution of eigenvalues $\lambda$ and eigenvectors $X$, with $[dX]$ the normalized Haar measure over the group of orthogonal matrices $\mathcal{O}_n(\mathbb{R})$. In other words, any matrix distribution is equivalently characterized by the joint distribution of its eigenvalues and eigenvectors. In that regard, our main hypotheses consist in constraining the number of non-zero eigenvalues and imposing that the distributions of $X$ and $\lambda$ can be decoupled.

Beyond the graphon model
The graphon [42] is the standard reference model used in network theory to analyze large graphs from a probabilistic perspective. Many pieces of work in both the theoretical [28,33,55] and applied [36,44,50] literature focus on the properties of the model it describes and its statistical estimation.
A graphon is a symmetric function $w : [0, 1]^2 \to [0, 1]$, which is to be understood as a continuous adjacency matrix with an infinite number of nodes. The graphon defines a distribution over $n \times n$ symmetric adjacency matrices by drawing $n$ uniform numbers $U_1, \ldots, U_n \sim \mathcal{U}([0, 1])$, and forming the matrix with coefficients $A_{ij} = w(U_i, U_j)$, or $A_{ij} \sim \mathcal{B}(w(U_i, U_j))$ in the case of binary networks. The graphon inference problem thus consists, given one or several matrices $A$, in determining both the function $w$ and the positions $(U_i)$ of the nodes.
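A minimal graphon sampler can be written as follows; the particular function $w$ used here is an arbitrary choice of ours, for illustration only.

```python
# A minimal graphon sampler, assuming a concrete w of our choosing for illustration.
import numpy as np

def sample_graphon(w, n, rng, binary=True):
    """Draw an n x n symmetric adjacency matrix from the graphon w: [0,1]^2 -> [0,1]."""
    U = rng.uniform(size=n)                      # latent node positions U_1, ..., U_n
    P = w(U[:, None], U[None, :])                # P_ij = w(U_i, U_j)
    if binary:
        A = (rng.uniform(size=(n, n)) < P).astype(float)   # A_ij ~ Bernoulli(w(U_i, U_j))
    else:
        A = P
    A = np.triu(A, 1)                            # keep the upper triangle, no self-loops
    return A + A.T, U

w = lambda u, v: 0.9 * np.exp(-3.0 * np.abs(u - v))        # an arbitrary smooth graphon
A, U = sample_graphon(w, n=50, rng=np.random.default_rng(3))
```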
The main application of the graphon model is the Stochastic Block-Model (SBM), which assumes that $w$ is block-wise constant. It amounts to dividing the set of nodes into clusters with given probabilities, and determining the connections between the nodes through the connections between their clusters. The SBM provides a well-studied [1,45,47] framework which is particularly relevant for a clustering analysis of networks, i.e. finding the most relevant partition of the nodes.
Both the graphon model and the SBM were conceived to analyze networks whose nodes are drawn randomly and play interchangeable roles. They mostly focus on understanding the structure of the hidden graphon dynamic, which requires identifying the $U_i$'s or the cluster labels.
Given a data set of matrices, both the graphon and the SBM would either (1) assume that the $U_i$'s are drawn independently for each matrix or (2) take the same $U_i$'s for each matrix in the data set. The first case yields a distribution whose expectation has constant off-diagonal coefficients, $\mathbb{E}[A_{ij}] = \int_0^1\int_0^1 w(u, v)\, du\, dv$. The second case results in a constant distribution with $A_{ij} = w(U_i, U_j)$ for every sample matrix $A$, or a matrix of independent Bernoulli variables $A_{ij} \sim \mathcal{B}(w(U_i, U_j))$ in the case of binary networks. Both options lead to simplistic distributions which are not relevant from a practical perspective.
In the context considered here, the nodes remain the same from one matrix to another (e.g. brain regions), and cannot be permuted. This makes it easy to estimate the average interactions, which is the main difficulty for the graphon and the SBM. Modeling the matrices' spectral decomposition goes one step further than the SBM, and induces a dependency between the coefficients. It allows, for instance, computing the distribution of a set of matrix coefficients given other observed matrix coefficients.

Accounting for the full network variability
Two similar approaches currently co-exist in the literature to analyze sets of networks. On the one hand, Variational Graph Auto-Encoders (VGAE) [34] assume that each node $i$ is represented by a low-dimensional vector $z_i \in \mathbb{R}^p$, and model the adjacency matrix as $A_{ij} = h(z_i^\top z_j)$, with $h$ a non-linear function. The model thus characterizes $A$ by a low-dimensional representation $Z \in \mathbb{R}^{n\times p}$, and retrieves $A$ by applying $h$ entrywise to $Z Z^\top$. The matrices $1_p \bullet Z = Z Z^\top$ are constrained to having positive eigenvalues. Additionally, the VGAE model considers all variables $z_i$ as independent and identically distributed.
On the other hand, a dictionary model was proposed by [19], which writes each adjacency matrix $A$ as a weighted combination of fixed rank-one matrices, $A \approx \sum_{i=1}^p \lambda_i x_i x_i^\top = \lambda \bullet X$. Here, the goal is to find the best $\lambda$ for each matrix $A$, while the matrix $X$ is the same for all networks. This model thus imposes a strong dimension constraint on the adjacency matrices.
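For a fixed orthonormal dictionary $X$, the least-squares weights of this decomposition have a simple expression, namely $\lambda_i = x_i^\top A x_i = (A * X)_i$; the short derivation behind the sketch below is ours and only serves to connect the dictionary model with the notations of Section 1.

```python
# Sketch of the dictionary-model fit for a single network: with the dictionary X fixed
# and orthonormal, the least-squares weights are lambda_i = x_i^T A x_i, i.e. A * X in
# the notation of Section 1 (our own derivation, for illustration only).
import numpy as np

def fit_weights(A, X):
    """Least-squares lambda minimizing ||A - sum_i lambda_i x_i x_i^T||_F for X in V_{np}."""
    return np.einsum('ni,nm,mi->i', X, A, X)

rng = np.random.default_rng(4)
X, _ = np.linalg.qr(rng.standard_normal((15, 4)))
lam_true = np.array([4.0, 2.0, 1.0, -0.5])
A = X @ np.diag(lam_true) @ X.T + 0.01 * rng.standard_normal((15, 15))
print(fit_weights(A, X))   # close to lam_true, up to the noise level
```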
Each of these two approaches captures one aspect of the variability: for the VGAE, only the "eigenvectors" vary, and for the dictionary model, only the "eigenvalues" depend on the network. The model we study here simultaneously accounts for these two sources of variability, and thus allows for a richer representation, while keeping a latent space dimension comparable to that of the VGAE. From the VGAE perspective, the rows $(M_{ki})_{i=1}^p$ shape the distribution of $z_k$, and the parameters $(\mu, \sigma_\lambda^2)$ determine the (possibly non-positive) inner products between the $z_i$'s. From the dictionary model perspective, the column $m_i = (M_{ki})_{k=1}^n$ gives the $i$-th dictionary element and $s_i$ its concentration; the coefficient $M_{ki}$ gives the strength of the contribution of pattern $i$ to the interactions of node $k$ in the network. The parameters $(\mu, \sigma_\lambda^2)$ give the distribution of the dictionary weights.

Conditional distribution
Summarizing the model definition in Section 2.1, we assume that an observed adjacency matrix $A$ writes as $A = \lambda \bullet X + \varepsilon$, with $(\lambda, X) \in \mathbb{R}^p \times \mathcal{V}_{np}$ independent latent variables and $\varepsilon$ a symmetric matrix of Gaussian distributed noise coefficients. The full model p.d.f. factorizes as
$$p(A, X, \lambda \mid \theta) = p(A \mid \lambda, X, \theta)\, p_{\mathrm{vMF}}(X \mid F)\, p(\lambda \mid \mu, \sigma_\lambda).$$
From this expression, we can derive the conditional distribution of the latent variables $(X, \lambda)$ given $A$. In the remainder of the paper, we denote by $\mu_{AX}$ and $\sigma_p$ the posterior mean and standard deviation of $\lambda$ given $(X, A)$. The expression of the conditional density $p(X, \lambda \mid A, \theta)$ of the latent variables given the observed variable $A$ follows from Bayes' formula; its derivation follows the same lines as that of Lemma C.1 in Appendix C. We will use this conditional distribution in Section 5 on asymptotic normality. The $\frac{1}{2\sigma_p^2}\|\mu_{AX}\|^2$ term in the distribution of $(X \mid A)$ is typically much larger than $\langle X, F\rangle_F$ as long as $n \gg p$, and it thus determines the shape of the distribution. As shown in the following proposition, it is maximized by the eigenvectors of $A$.
Proposition 2.2. For $A \in \mathbb{R}^{n\times n}$, $\|\mu_{AX}\|^2$ is maximized by taking the columns of $X$ among the eigenvectors of $A$. Furthermore, if the eigenvalues of $A$ all have multiplicity one, this maximizer is unique up to permutation and change of sign of the columns.
Proof. Let $A = U D U^\top$ be the eigendecomposition of $A$, with $U^\top U = I_n$. Without loss of generality, we take $\sigma_\lambda = \sigma_\varepsilon = 1$. Writing $X = UY$ with $Y \in \mathcal{V}_{np}$ and completing $Y$ into an orthogonal matrix $Q$, the matrix $S = Q \odot Q$ (the entrywise square of $Q$) is doubly stochastic. Furthermore, the Birkhoff-von Neumann theorem states that the set of doubly stochastic matrices is the convex hull of the set of permutation matrices. As a consequence, the linear function $S \mapsto \langle K, S\rangle_F$ is maximized by taking for $S$ a permutation matrix. Such matrices are orthogonal and verify $S \odot S = S$, and their only square roots for the Hadamard product are permutation matrices with negative coefficients allowed. Therefore, the optimal choice for $Y$ has its columns in the canonical basis, and hence the optimal choice for $X = UY$ is to take its columns among the eigenvectors of $A$. When $Y$ is a permutation matrix, Jensen's inequality becomes an equality, so that taking the related $X = UY$ is also an optimal choice for the original objective $\|\mu_{AX}\|^2$. Furthermore, if $A$ has $n$ distinct eigenvalues, Jensen's inequality is strict except when $y_i$ is a vector of the canonical basis. Therefore, in that case, the optimal subset of eigenvectors of $A$ (up to permutation and change of sign) is the only maximizer of $\|\mu_{AX}\|^2$.
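The statement can be checked numerically in the simplified setting $\mu = 0$, $\sigma_\lambda = \sigma_\varepsilon = 1$, where the criterion reduces (up to a constant factor) to $\|A * X\|^2$; the sanity check below is ours and compares the eigenvector solution with random Stiefel matrices.

```python
# Numerical illustration of Proposition 2.2 (simplified to mu = 0): among Stiefel matrices,
# ||A * X||^2 is maximized by eigenvectors of A (here, those of largest |eigenvalue|).
# Variable names are ours; this is a sanity check, not part of the paper.
import numpy as np

rng = np.random.default_rng(5)
n, p = 12, 3
B = rng.standard_normal((n, n)); A = (B + B.T) / 2           # a symmetric test matrix
d, U = np.linalg.eigh(A)
X_eig = U[:, np.argsort(np.abs(d))[::-1][:p]]                 # eigenvectors of the p largest |d_i|

def crit(X):                                                  # ||A * X||^2
    return np.sum(np.einsum('ni,nm,mi->i', X, A, X) ** 2)

best = crit(X_eig)
for _ in range(1000):
    X, _ = np.linalg.qr(rng.standard_normal((n, p)))
    assert crit(X) <= best + 1e-9
```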
Remark 2.3. When taking $\mu = 0$, the result can be proved more simply by using Ky Fan's principle on eigenvectors [22]. A closely related, yet different, result was recently obtained by [40]. We believe that obtaining a closed-form formula for maximizing the complete conditional density $p(X \mid A, \theta)$ would require significantly more work. The eigenvectors of $A$ are then no longer optimal: the best value of $X$ is obtained as a trade-off between $M$ and the closest optimal eigenvector combination of $A$, with the concentration and variance parameters determining the balance between both.

Model identifiability
Identifiability of a statistical model $p(x \mid \theta)$ refers to the property that, if $\theta_1 \neq \theta_2$, then the distributions $p(\cdot \mid \theta_1)$ and $p(\cdot \mid \theta_2)$ must differ. It is a generally desirable property, as it ensures that the model is well-defined and behaves in an intuitive way. It also has an immediate theoretical interest, since it makes it possible to prove that Maximum Likelihood Estimators converge to the correct value when the data is generated according to the model. It can be proved, for instance, by retrieving the parameter $\theta$ from a set of moments of $p(\cdot \mid \theta)$.
The identifiability of latent variable models is a general, long-standing question, which has been studied and proved for only a few specific models. It relates to the question of identifying the parameters of graphical models where only a fraction of the variables is observed. Much work has been devoted to the identifiability of finite mixture models [27,51,52,56]. In a similar spirit, classes of statistical models with discrete latent variables have also recently been proved to be identifiable [4,25]. Partial results have been shown for mixed-effects models, in particular in a longitudinal setting [37,51]. In a less closely related domain, identifiability results exist for time series models with latent variables [17]. Finally, general identifiability results are available for (possibly infinite) mixtures of exponential models [6,7]. Although the latter result is related to the model we consider here, its necessary theoretical conditions turn out to be hard to verify in practice.
The main difficulty with identifying latent variable models comes from the expression of the observations' likelihood, which is obtained by marginalizing out the latent variables: $p(A \mid \theta) = \int p(A, X, \lambda \mid \theta)\,[dX]\,d\lambda$. Even though our full model $p(A, X, \lambda \mid \theta)$ is identifiable, the marginalized model $p(A \mid \theta)$ may not be: permuting two eigenvalue parameters $\mu_i, \mu_j$ and the related eigenvector parameters $f_i, f_j$, or changing the sign of $f_i$, does not change the distribution of $A$. This first obvious source of non-identifiability is easily overcome by imposing that the normalized columns $(m_1, \ldots, m_p)$ (denoting $m_i = f_i/\|f_i\|$) are sorted according to the lexicographical order and that each column has its first non-zero element positive. An additional constraint allows getting a provably identifiable marginal model: we shall assume that the $f_i$'s are non-zero, i.e. that the concentration parameters $s_i = \|f_i\|$ are positive. These two constraints define the set of identifiable parameters $\Theta_{id}$. With this definition, we have the following result.
Theorem 3.1. The marginal model is identifiable over $\Theta_{id}$: for $\theta_1, \theta_2 \in \Theta_{id}$, if $p(\cdot \mid \theta_1) = p(\cdot \mid \theta_2)$ then $\theta_1 = \theta_2$.
Proof. Given $\theta \in \Theta_{id}$, we show that all parameters $(F, \mu, \sigma_\lambda, \sigma_\varepsilon)$ can be retrieved from the distribution $p(A \mid \theta)$. We first identify the noise variance. This allows identifying the eigenvalue parameters, and finally the eigenvector parameters.
Identifying $\sigma_\lambda$ and $\sigma_\varepsilon$. Using Lemma C.1 and $\alpha I_n * X = \alpha 1_p$, we obtain, for all $\alpha \in \mathbb{R}$, an expression which is a second-order polynomial in $\alpha$; its coefficients $(a_0, a_1, a_2)$ can thus be identified from the distribution of $A$. In particular, the degree-two coefficient gives the value of a first combination of the variance parameters. Similarly, the computation in Lemma C.1 can be used to derive the gradient $\nabla_A p(A \mid \theta)$. For $A = \alpha I$, it writes in terms of the random variable $B = \lambda_p \bullet X$, with $\lambda_p$ Gaussian distributed. The resulting $\alpha$-linear function can be deduced from the distribution of $A$, hence we know its coefficients, and in particular the leading coefficient $a_3$. The formulas of $a_3$ and $a_2$ in equation (3.1) can finally be combined to obtain $\sigma_\lambda$ and $\sigma_\varepsilon$.
Identifying $\mu$.
The moment generating function of $A$ writes as $G_A(T) = G_{\lambda\bullet X}(T)\, G_\varepsilon(T)$, by independence of the latent variables and the noise. Since the distribution of $\varepsilon$ has been characterized, $G_\varepsilon(T)$ is known, and hence $G_{\lambda\bullet X}$ can be deduced as $G_A(T)/G_\varepsilon(T)$. As the moment generating function characterizes the probability distribution, the distribution of $\lambda \bullet X$ is determined by that of $A$; it thus suffices to identify the parameters from the distribution of $\lambda \bullet X$. We thus turn to the problem of identifying $\mu$ given the distribution of $\lambda \bullet X$ (and proceed similarly for the eigenvector parameters in the next paragraph). We have, for $t \in \mathbb{R}$,
$$\mathbb{E}\big[\operatorname{Tr}\exp(t\,\lambda\bullet X)\big] = \sum_{i=1}^p e^{t\mu_i + \frac{1}{2}\sigma_\lambda^2 t^2} + (n-p),$$
since $\lambda\bullet X$ has eigenvalues $\lambda_1, \ldots, \lambda_p$ and $0$ with multiplicity $n-p$. The functions of the form $t \mapsto e^{t\mu_i + \frac{1}{2}\sigma_\lambda^2 t^2}$ are linearly independent for distinct $\mu_i$'s: this allows retrieving both the $\mu_i$'s and the multiplicity count of each eigenvalue.
Identifying $F$. From there, we could use the matrices $\mathbb{E}[x_i x_i^\top]$ to identify the modal directions $m_i$. Indeed, as shown in [32] (Eqs. (2.9)-(2.11)), each $m_k$ is an eigenvector of each $\mathbb{E}[x_i x_i^\top]$. However, the related eigenvalues and the remaining $n - p$ eigenvectors are unknown, and the relevant eigenvectors cannot be identified easily. In the limit of large concentration parameters, $\mathbb{E}[x_i x_i^\top] \approx m_i m_i^\top$, so that the largest eigenvalue is the one corresponding to $m_i$. Yet this argument cannot be quantified, as the eigenvalues involve partial derivatives of $\log C(F)$ which are hard to manipulate.
Instead, we get a better result by expressing the density of the distribution of $B = \lambda \bullet X = D(\lambda, X)$, with support on the set $\operatorname{Im}(D)$ of $n \times n$ square matrices with rank $p$. The distribution of $B$ is characterized by the expectations $\mathbb{E}[h(B)]$ for $h$ continuous and bounded. We want to perform a change of variables so as to express this expectation as an integral over $\operatorname{Im}(D)$. However, this cannot be done directly. First, the mapping $D$ is not injective. Next, the most relevant change of variables formula for this problem is, to the best of our knowledge, the main result of [53], which gives a formula for mappings taking inputs in vector spaces (which is not the case here, as $X \in \mathcal{V}_{np}$).
The first problem can be solved by splitting the integral over domains where $D$ is injective, which means preventing permutations and changes of sign in the columns of $X$. To that end, for $\pi \in S_p$ a permutation and $f \in \{\pm 1\}^p$, we denote $X_{\pi,f} = (f_1 x_{\pi(1)}, \ldots, f_p x_{\pi(p)})$ and $\lambda_\pi = (\lambda_{\pi(1)}, \ldots, \lambda_{\pi(p)})$. We also define the corresponding sets $\Delta_{\pi,f}$, where $\prec$ denotes the lexicographical order over $\mathbb{R}^n$; by construction, these sets cover $\mathcal{V}_{np}$ up to a set of measure zero. Furthermore, the map $X \mapsto X_{\pi,f}$ corresponds to multiplying $X$ by an orthogonal matrix. By construction, the Haar measure over $\mathcal{V}_{np}$ is invariant to this transformation [13]. Moreover, the map $\lambda \mapsto \lambda_\pi$ is also a linear orthogonal transformation, and as such has Jacobian determinant one. Hence we can perform the change of variables $(\lambda, X) \mapsto (\lambda_\pi, X_{\pi,f})$ in each term of the resulting sum. The first problem is now solved, as $D$ is injective over $\mathbb{R}^p \times \Delta_0$. We now need to get to an integral formulation over a vector space. To that end, we consider the inverse of the Cayley transform of $X$: $D = \mathcal{C}^{-1}(X)$. We refer the reader to Appendix A for a definition of the Cayley transform $\mathcal{C}$. It is a smooth injective map from the tangent space at the identity, $T_{I_{np}}\mathcal{V}_{np}$, whose elements stack a $p \times p$ skew-symmetric block $A$ on top of an arbitrary $(n-p)\times p$ block $B$, to the manifold $\mathcal{V}_{np}$, and it covers the entire manifold apart from a set of measure zero. As explained in [29] (Thm. 4.1), a change of variables from $D$ to $X$ can be performed, and amounts to adding a multiplicative factor $J_1(D)$, where $J_1$ is a generalized Jacobian determinant. Given $B$, we denote by $\lambda_B$, $X_B$ and $D_B$ its pre-images by $D$ and $\mathcal{C}$. Since the considered mapping is smooth, the main theorem of [53] applies; letting $J_2(\lambda, D)$ be the generalized Jacobian determinant involved in the formula, we obtain an expression of $\mathbb{E}[h(B)]$ as an integral against $dB$, the Hausdorff measure over $\operatorname{Im}(D)$. Since both maps $\mathcal{C}$ and $D$ are diffeomorphisms onto their images, the generalized Jacobian determinants involved are non-zero. As a consequence, the random variable $B$ has a density over its support with respect to the Hausdorff measure. Therefore, if the distribution of $B$ is known, we can deduce the value of the function $f_\lambda(X)$ given by the corresponding sum over permutations and sign changes. Since this sum is invariant under any permutation $\pi$ and change of sign $f$, the value of this expression is known not only for $X \in \Delta_0$, but over the whole manifold $\mathcal{V}_{np}$. Now, we consider the specific case $\lambda = \mu$. Up to a normalizing constant, $f_\mu(X)$ is a probability distribution over $\mathcal{V}_{np}$: it is a mixture of von Mises-Fisher distributions with parameters $(F_{\pi,f})$ and mixture weights proportional to $p(\mu \mid \theta_{\pi,f})$. This structure allows using the main result of [31], which grants that the von Mises-Fisher densities given by the $F_{\pi,f}$ are linearly independent. This result can be combined with the main theorem of [56], which states that a family of finite mixtures is identifiable if and only if the mixture components form a linearly independent set.
As a consequence, we identify the parameter $F$ up to a column permutation and change of sign. Moreover, in the sum above, the probabilities $p(\mu \mid \theta_{\pi,f})$ with maximal amplitude are given by $\pi = \mathrm{Id}$ and all the other permutations $\pi$ such that $\mu_{\pi(i)} = \mu_i$ for all $i$ (which accounts for eigenvalue multiplicity). Since we assumed that all concentration parameters are positive, all $(F_{\pi,f})$ are distinct, and hence the maximal mixture weights correspond to the matrices $(F_{\pi,f})$ with $\pi$ as just described. This finally allows matching eigenvalues with eigenvectors, completing the identification of $\theta$.
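The permutation and sign invariance that motivates the constraints defining $\Theta_{id}$ can also be illustrated numerically: jointly permuting $(\lambda_i, x_i)$ and flipping column signs leaves $\lambda \bullet X$, and hence the distribution of $A$, unchanged. The sketch below is ours.

```python
# Illustration of the permutation / sign invariance discussed in this section (our sketch):
# permuting (lambda_i, x_i) jointly and flipping column signs leaves lambda . X unchanged,
# so the corresponding parameters cannot be distinguished without the ordering and sign
# constraints defining Theta_id.
import numpy as np

rng = np.random.default_rng(6)
X, _ = np.linalg.qr(rng.standard_normal((8, 3)))
lam = np.array([2.0, -1.0, 0.5])
perm = np.array([2, 0, 1])
signs = np.array([1.0, -1.0, 1.0])
X2 = X[:, perm] * signs          # permute columns, then flip some signs
lam2 = lam[perm]
assert np.allclose(X @ np.diag(lam) @ X.T, X2 @ np.diag(lam2) @ X2.T)
```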

Maximum a posteriori versus maximum likelihood
We turn to the problem of estimating $\theta$ from samples $A_1, \ldots, A_N$ when the number of samples $N$ grows large. In this section, we assume that the samples are distributed according to a distribution $P$, which may not be of the form $p(A \mid \theta)$.
A natural candidate is the Maximum Likelihood Estimator (MLE), which maximizes $\theta \mapsto \prod_{i=1}^N p(A_i \mid \theta)$. However, the MLE may not be defined: the optimal value for $F$ may theoretically be infinite, as the model likelihood does not necessarily decrease at infinity. For instance, if the samples $A_1, \ldots, A_N$ are drawn from a Gaussian distribution with i.i.d. coefficients and mean equal to a rank-$p$ matrix $\lambda_0 \bullet X_0$, the parameters $\sigma_\lambda$ and $s_i$ tend to take extreme values ($\sigma_\lambda$ being very small and $s_i$ being very large), and the distribution of latent variables becomes highly concentrated around $(\lambda_0, X_0)$. This phenomenon occurs because the estimated data distribution asymptotically converges to the true data distribution, which lies at the boundary of the model family (in the sense that taking very large $s_i$'s and a very small $\sigma_\lambda$ yields a distribution close to the true data distribution).
This problem is overcome numerically by adding a prior distribution $p(\theta)$ and considering the Maximum A Posteriori (MAP) estimator over the set $\Theta$ of all parameters, $\hat\theta_N \in \operatorname{argmax}_{\theta\in\Theta} p(\theta \mid A_1, \ldots, A_N)$. In this section, we want to account for the possible convergence of latent variable distributions to constant values. For this purpose, instead of the parameterization $\theta = (F, \mu, \sigma_\lambda, \sigma_\varepsilon)$, we will define the parameter set by $\Theta = \{\theta = (M, s, \mu, \sigma_\lambda, \sigma_\varepsilon) \mid \sigma_\lambda > 0,\ s_i < +\infty\}$, with the equivalence given by $F = M\operatorname{Diag}(s)$. In the next section, this representation will allow us to formally consider an extension of the set $\Theta$ accounting for the case where $s_i = +\infty$ and $\sigma_\lambda = 0$.
We consider inverse-Gamma distributions for the priors $p(\sigma_\lambda)$ and $p(\sigma_\varepsilon)$, the uniform distribution over $\mathcal{V}_{np}$ for $M$, and any p.d.f. decreasing at infinity for $p(s)$ and $p(\mu)$. Unlike the MLE, with this prior specification the MAP estimator is guaranteed to exist.
Theorem 4.1. Given the proposed model, with parameters following the prior distribution described above, for any set of matrices $(A_i)_{i=1}^N$, there exists $\hat\theta_N \in \operatorname{argmax}_{\theta\in\Theta} p(\theta \mid A_1, \ldots, A_N)$.
Proof. The bound obtained in Lemma C.2, combined with Bayes' formula and the assumptions on the prior, shows that $\log p(\theta \mid A_1, \ldots, A_N)$ tends to $-\infty$ as $\theta$ approaches the boundary of $\Theta$ or goes to infinity. Since the manifold $\mathcal{V}_{np}$ is compact and the integrand $f(\theta, X) = \exp(\langle F, X\rangle_F + \|\mu_{AX}\|^2/2\sigma_p^2)$ is smooth on $\Theta \times \mathcal{V}_{np}$, classical integration theorems grant that $\log p(\theta \mid A)$ is smooth over every compact subset of $\Theta$: given a compact set $K$, the domination function $g(X) = \max_{\theta\in K} f(\theta, X)$ is integrable over $\mathcal{V}_{np}$. Hence $\log p(\theta \mid A)$ is smooth over $\Theta$. In particular, the function $\theta \mapsto \log p(\theta \mid A_1, \ldots, A_N)$ is coercive and continuous, and it thus admits a maximizer over $\Theta$.
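For concreteness, a prior of the kind used in Theorem 4.1 can be sketched as follows; the hyper-parameter values and the particular decreasing densities chosen for $s$ and $\mu$ are ours, and the Haar prior on $M$ contributes only a constant to the log-posterior.

```python
# A minimal sketch of a prior of the kind described above (hyper-parameter values are ours):
# inverse-Gamma on sigma_lambda^2 and sigma_eps^2, densities decreasing at infinity for s and
# mu (here, wide Gaussians), and the uniform (Haar) distribution on M, whose log-density is
# constant and can be dropped from the MAP objective.
import numpy as np
from scipy import stats

def log_prior(s, mu, sigma_lam, sigma_eps, a=2.0, b=1.0):
    lp  = stats.invgamma.logpdf(sigma_lam ** 2, a, scale=b)
    lp += stats.invgamma.logpdf(sigma_eps ** 2, a, scale=b)
    lp += stats.norm.logpdf(s, scale=10.0).sum()     # decreasing prior on concentrations
    lp += stats.norm.logpdf(mu, scale=10.0).sum()    # decreasing prior on eigenvalue means
    return lp                                        # + constant for the Haar prior on M

print(log_prior(s=np.array([3.0, 1.0]), mu=np.array([5.0, 2.0]),
                sigma_lam=0.5, sigma_eps=0.1))
```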

MAP consistency
The above result motivates studying the MAP estimator rather than the MLE. However, although adding a prior distribution grants the existence of a maximizer within $\Theta$, the weight of the prior term decreases as the number of samples grows large, and we should expect the MAP estimator to diverge to the boundary of $\Theta$ for some empirical data distributions $P$. This phenomenon is accounted for by considering an extended set of parameters $\Theta_\infty$ allowing null eigenvalue variance (i.e. $\lambda$ constant) and infinite von Mises-Fisher concentrations (i.e. $x_i$ constant for some $i$'s). We prove in Lemma C.3 that the likelihood $p(A \mid \theta)$ extends continuously to $\Theta_\infty$. The extension essentially amounts to considering eigenvalue and eigenvector distributions restricted to a conditional subspace. With this convention, the objective function to be asymptotically maximized can be defined over $\Theta_\infty$ as the almost sure (a.s.) limit $\ell(\theta)$ of the empirical objective function $\frac{1}{N}\sum_{i=1}^N \log p(A_i \mid \theta)$. If $P$ has a density with respect to the Lebesgue measure, the function $\ell$ is equal, up to a constant term which depends only on $P$, to the opposite of the Kullback-Leibler divergence between $P$ and $p(A \mid \theta)$. The MAP estimator is said to be consistent if it converges to the set $\Theta_*$ of maximizers of $\ell(\theta)$. In the case where $P$ corresponds to some $p(A \mid \theta^*)$ for $\theta^* \in \Theta_{id}$, $\ell$ has only one maximizer, which is the true model parameter $\theta^*$. For large classes of sufficiently regular families of statistical models, the MLE and the MAP can be proved to be consistent and, in probability, to minimize the KL divergence to the optimal point [54].
The consistency of the MLE for latent variable models has been studied for several classes of models, like Hidden Markov Models [16], Independent Component Analysis [9] or longitudinal mixed-effects models [3,11]. In line with these results, we obtain the almost sure (a.s.) consistency of the MAP estimator. We study two particular cases: in the first case, we assume that the parameters which may diverge stay bounded, and obtain a.s. convergence to the set of maximizers over the constrained set. In the second case, we show that the unconstrained MAP estimator converges a.s. to the set of maximizers over $\Theta_\infty$.
The convergence to the set of maximizers of $\ell(\theta)$ is quantified by the distance $d(\hat\theta_N, \Theta_*)$. However, the set $\Theta_*^\infty$ of maximizers of $\ell$ over $\Theta_\infty$ may have some elements with infinite coordinates, which prevents quantifying distances. To overcome this issue, we consider the reparameterization $\xi(\theta) = (M, h(s), \mu, \sigma_\lambda, \sigma_\varepsilon)$, with $h : [0, +\infty]^p \to [0, 1]^p$ applying the same continuous increasing transformation to each $s_i$, for instance $h(s)_i = \tanh(s_i)$. Over the new parameter space $\Xi_\infty = \xi(\Theta_\infty)$, we also obtain the almost sure consistency of the MAP $\hat\xi_N = \xi(\hat\theta_N)$.
Theorem 4.2. Let $\Theta_\eta$ be the set of parameters with each $s_i$ and $\sigma_\lambda^{-1}$ upper bounded by $\eta$, and let $\Theta_*^\eta$ be the set of maximizers of $\ell$ over $\Theta_\eta$. Consider the following hypotheses:
H1 The number of latent patterns is strictly lower than the number of nodes: $p < n$.
H2 The samples $(A_i)_{i=1}^N$ are independent and identically distributed.
H3 The true data distribution $P(dA)$ has a density w.r.t. the Lebesgue measure and exponentially decaying tails beyond a compact set: there exist $a, b > 0$ such that, for $x$ large enough, $\sup_{\|A\|_F \ge x} P(A) \le a \exp(-bx)$.
Then, assuming H1, H2 and H3:
1. For all $\eta > 0$, $\Theta_*^\eta \neq \emptyset$ and the MAP estimator $\hat\theta_N^\eta$ on $\Theta_\eta$ is consistent: for every continuous metric $\delta$, almost surely, $\delta(\hat\theta_N^\eta, \Theta_*^\eta) \to 0$ as $N \to +\infty$.
2. $\Xi_*^\infty \neq \emptyset$ and the unconstrained MAP estimator is consistent in the reparameterized space: almost surely, $\delta(\hat\xi_N, \Xi_*^\infty) \to 0$ as $N \to +\infty$.
The proof follows the architecture of [11,54]. The main difficulties and specificities lie in the proofs of the required lemmas, which are specific to the model, and in the possibility of having partially constant latent variable distributions. We thus only present here the structure of the main proof, and refer the reader to Appendix B for the detailed argument. As the proof of the first assertion is a strictly simpler version of the proof of the second assertion, we omit it for the sake of brevity.
Sketch of the proof. The proof is divided into four parts.
A) We define $K_\varepsilon$ as the set of parameters $\theta$ (in a suitable compactification of $\Theta_\infty$, described in Appendix B) such that $\delta(\xi(\theta), \Xi_*^\infty) \ge \varepsilon$, and show that every $\theta_\infty \in K_\varepsilon$ admits an open neighborhood $U$ over which the expected supremum of the log-likelihood stays strictly below the optimal value (equation (B.1)).
B) The set $K_\varepsilon$ described above is compact, and therefore among all the sets $U$ defined in part A we can extract a finite cover of $K_\varepsilon$. This allows proving that $\limsup_N \sup_{\theta\in K_\varepsilon} \frac{1}{N}\sum_{i=1}^N \log p(A_i \mid \theta)$ stays strictly below the optimal value almost surely.
C) Using the definition of $\hat\theta_N$ and the law of large numbers, we show that $\liminf_N \frac{1}{N}\sum_{i=1}^N \log p(A_i \mid \hat\theta_N)$ is at least the optimal value.
D) Finally, combining the two arguments above yields a contradiction if $\hat\theta_N \in K_\varepsilon$ for an infinite number of $N$. As a consequence, for all $\varepsilon > 0$, $\hat\theta_N \notin K_\varepsilon$ almost surely as $N \to +\infty$, which gives precisely $\delta(\hat\xi_N, \Xi_*^\infty) \to 0$.

Asymptotic normality of the MAP estimator
A consequence of Theorem 3.1 is that, if the empirical data distribution $P$ corresponds to $p(A \mid \theta_0)$ for some $\theta_0 \in \Theta_{id}$, we have $\Theta_*^{id} = \{\theta_0\}$: thus, by Theorem 4.2, the MAP estimator over $\Theta_{id}$ converges almost surely to $\theta_0$. A classical question is then to establish the rate of convergence of $\hat\theta_N$ toward $\theta_0$, as well as the limiting asymptotic distribution. An answer for the more general case of M- and Z-estimators is provided in Chapter 5 of [54], which we restate with adapted notations:
Theorem 5.1 (Thm. 5.23 in [54]). Let $m_\theta(A) = \log p(A \mid \theta)$. Assume that $m_\theta$ is a measurable function such that $\theta \mapsto m_\theta(A)$ is differentiable at $\theta_0$ for $P$-almost every $A$, with derivative $\nabla_\theta m_{\theta_0}(A)$. Assume that there exists a function $\dot m$ with $\mathbb{E}_{P(dA)}[\dot m(A)^2] < +\infty$ such that, for every $\theta_1$ and $\theta_2$ in a neighborhood of $\theta_0$,
$$|m_{\theta_1}(A) - m_{\theta_2}(A)| \le \dot m(A)\, \|\theta_1 - \theta_2\|. \qquad (5.1)$$
Furthermore, assume that the map $\ell(\theta) = \mathbb{E}_{P(dA)}[m_\theta(A)]$ admits a second-order Taylor expansion at a point of maximum $\theta_0$ with nonsingular symmetric second derivative $V = \nabla^2\ell(\theta_0)$. If
$$\frac{1}{N}\sum_{i=1}^N m_{\hat\theta_N}(A_i) \ge \sup_\theta \frac{1}{N}\sum_{i=1}^N m_\theta(A_i) - o_P(1/N) \qquad (5.2)$$
and $\hat\theta_N \to \theta_0$ in probability, then
$$\sqrt{N}(\hat\theta_N - \theta_0) = -V^{-1}\,\frac{1}{\sqrt{N}}\sum_{i=1}^N \nabla_\theta m_{\theta_0}(A_i) + o_P(1).$$
In particular, the sequence $\sqrt{N}(\hat\theta_N - \theta_0)$ is asymptotically normal with mean zero and covariance matrix $V^{-1}\,\mathbb{E}_{P(dA)}[\nabla_\theta m_{\theta_0}(A)\,\nabla_\theta m_{\theta_0}(A)^\top]\,V^{-1}$.
Remark 5.2. The notation $o_P(1/N)$ designates a random variable $Z_N$ such that $N Z_N \to 0$ in probability.
The most important condition in the theorem above is the non-singularity of the Hessian matrix at $\theta_0$. In general, $\nabla^2_\theta \ell(\theta)$ is impossible to compute for latent variable models, as it involves the Hessian of $\log p(A \mid \theta)$. However, the problem becomes more tractable when the data distribution $P$ corresponds to $p(A \mid \theta_0)$ for some $\theta_0 \in \Theta$. The Hessian matrix at $\theta_0$ then classically rewrites, up to a sign, as the Fisher information matrix $I(\theta_0) = \mathbb{E}_{p(A\mid\theta_0)}\big[\nabla_\theta \log p(A \mid \theta_0)\, \nabla_\theta \log p(A \mid \theta_0)^\top\big]$ (see for instance Lemma 5.3 in [38]). The non-singularity of the Fisher information matrix remains difficult to prove for general latent variable models. Some papers consider it as a base hypothesis to obtain asymptotic normality, e.g. for Factor Analysis [5] or Hidden Markov Models [8]. In the latter case, the more recent work of [15] provided a condition to obtain the non-singularity of the Fisher information matrix. A recent result was obtained by [48] on the asymptotic normality of the MLE for Gaussian graphical models, with an application to estimation from partial observations. In this specific case, the Fisher information has a simple closed-form expression.
For the model considered here, no closed-form expression can be expected, as the gradient of the log-likelihood involves integrals over $\mathcal{V}_{np}$. Instead, we notice that, since the observation density $p(A \mid \theta_0)$ is continuous and $I(\theta_0)$ writes as the integral of $(\nabla_\theta \log p(A \mid \theta_0))(\nabla_\theta \log p(A \mid \theta_0))^\top$, the matrix will be non-singular if we can find $\dim(\theta_0)$ matrices $A_i$ such that the related gradients $\nabla_\theta \log p(A_i \mid \theta_0)$ are linearly independent. This is formalized in the following lemma.
Lemma 5.3. Let $d = \dim(\theta_0)$. If matrices $A_1, \ldots, A_d$ can be found such that the related gradients $\nabla_\theta \log p(A_i \mid \theta_0)$ are linearly independent, then $I(\theta_0)$ is positive definite.
Proof. Let $x \in \mathbb{R}^d$. We have $x^\top I(\theta_0)\, x = \mathbb{E}_{p(A\mid\theta_0)}\big[\langle x, \nabla_\theta \log p(A \mid \theta_0)\rangle^2\big] \ge 0$. If $x^\top I(\theta_0)\, x = 0$, then $\langle x, \nabla_\theta \log p(A \mid \theta_0)\rangle^2$ must be zero everywhere. Therefore, since $\theta \mapsto \log p(A \mid \theta)$ is infinitely smooth, $x$ is orthogonal to all the gradients $\nabla_\theta \log p(A_i \mid \theta_0)$, and thus to their linear span, which covers the full space; this implies $x = 0$. As a consequence, $I(\theta_0)$ is positive definite.
In the case of our model, it turns out that, although the expression of $\nabla_\theta \log p(A \mid \theta)$ is intractable, it simplifies as $\|A\|_F$ grows large. This simplification comes from the so-called Fisher identity,
$$\nabla_\theta \log p(A \mid \theta) = \mathbb{E}\big[\nabla_\theta \log p(A, X, \lambda \mid \theta) \mid A, \theta\big],$$
so that the gradient rewrites as an expectation of the gradient of the complete log-likelihood over the latent variables. Given the complete expression of the model, this yields for instance $\nabla_F \log p(A \mid \theta) = \mathbb{E}[X \mid A, \theta] - \nabla_F \log C(F)$. As $\|A\|_F \to +\infty$, we show in the upcoming Proposition 5.4 that the distribution of $(X \mid A)$ concentrates around the permutations of the $p$ eigenvectors of $A$ related to the $p$ largest eigenvalues. As a consequence, $\nabla_F \log p(A \mid \theta)$ writes as the sum of the $\nabla_F \log C(F)$ term and a linear combination of all $(X_A)_{\pi,f}$, with $X_A$ the $n \times p$ eigenvector matrix of $A$ and $\pi \in S_p$, $f \in \{\pm 1\}^p$. However, although $X_A$ can be chosen freely, the subsequent linear combination turns out to be hard to compute and manipulate, which ultimately prevents one from getting an explicit expression for the gradient in $F$. The same phenomenon happens with the other gradients, which all rely on an expectation given $A$.
This observation motivates the main hypothesis for our normality result. We shall consider a restricted variant $\tilde p(A, X, \lambda)$ of the main model $p(A, X, \lambda)$, where the $X$ variable is constrained to the set $\Delta_0$ defined in Section 3: the vMF density of $X$ is restricted to $\Delta_0$ and renormalized accordingly. This constraint does not fundamentally change the model in the limits where $s_i \to 0$ and $s_i \to +\infty$. For intermediate values, it truncates the other sections $\Delta_{\pi,f}$ of the vMF distribution, but does not change the support of the distribution of $\lambda \bullet X$, as it still covers the set of rank-$p$ matrices. The resemblance between $p$ and $\tilde p$ is maximized when the maximum of $\langle X, F\rangle_F$ is reached in $\Delta_0$, i.e. when choosing the normalized columns of $F$ to be in $\Delta_0$. We adopt this convention in the remainder of the section, as it also facilitates proving the identifiability of the restricted model.
In the remainder of this section, the notations $\ell(\theta)$, $\hat\theta_N$, ... refer to the densities and estimators obtained for the restricted model. We also assume that the empirical data distribution is given by $\tilde p(A \mid \theta_0)$ rather than $p(A \mid \theta_0)$. With this restricted model, we have the following result.
Proposition 5.4. Let $A \in \mathbb{R}^{n\times n}$ with rank at least $p$ and distinct eigenvalues, and let $A_t = tA$ for $t \in \mathbb{R}$. In the restricted model with $X \in \Delta_0$, the distribution of $(X \mid A = A_t)$ converges, as $t \to +\infty$, to the constant value $X_A$, with $X_A \in \Delta_0$ the list of eigenvectors of $A$ corresponding to the $p$ largest eigenvalues. In particular, $\mathbb{E}[X \mid A = A_t] \to X_A$.
Proof. The conditional distribution of $X$ given $A = A_t$ has a density proportional to $\exp\big(\langle F, X\rangle_F + \frac{1}{2\sigma_p^2}\|\mu_{A_t X}\|^2\big)$ over $\Delta_0$. As $t \to +\infty$, the function $g_t(X) = \frac{1}{2\sigma_p^2}\|\mu_{tA, X}\|^2$ dominates and reaches its maximum at a point which converges to $X_A$. Indeed, by Proposition 2.2, $\|A * X\|^2$ is maximized only at $X = X_A$ on $\Delta_0$ (unicity is guaranteed as the eigenvalues of $A$ are distinct). Let $D$ be a region of $\mathcal{V}_{np}$ with non-zero Haar measure such that $X_A \notin D$; then the conditional probability of $D$ vanishes as $t \to +\infty$. Hence, by the Portmanteau theorem, the sequence of probability distributions $(X \mid A = A_t)$ converges in distribution to the constant $X_A$.
Remark 5.5. Proposition 5.4 can be compared to the decreasing uncertainty on the normalized position $x/\|x\|$ of a point $x$ going to infinity. If we used the complete model, the distribution of $X$ would instead converge to a sum of Dirac masses at the $(X_A)_{\pi,f}$, weighted by $p((X_A)_{\pi,f} \mid \theta)$.
With the result above, we can prove that $\dim\Theta$ linearly independent gradients $\nabla_\theta \log p(A \mid \theta_0)$ can be obtained.
Lemma 5.6. The log-likelihood gradient $\nabla_\theta \log p(A \mid \theta)$ of the restricted model takes $\dim\Theta = np + p + 2$ linearly independent values.
Proof. As explained above, the Fisher identity allows computing the gradients as $\|A\|_F$ grows large. In order to alleviate the notations, the expectations $\mathbb{E}$ below refer to the distribution $p(A, X, \lambda \mid \theta_0)$. Writing the complete log-likelihood explicitly, we obtain the gradients with respect to $(F, \mu, \sigma_\lambda, \sigma_\varepsilon)$ as conditional expectations given $A$. Let $t \in \mathbb{R}$, consider the matrix $A_t = tA$ and denote by $X_A \in \Delta_0$ the matrix of eigenvectors of $A$ for the $p$ largest eigenvalues. The expressions simplify as $t \to +\infty$:
1. For $F$: Proposition 5.4 applies for $A$ with $p$ distinct non-zero leading eigenvalues, and the conditional expectation of $X$ converges to $X_A$.
2. For $\mu$: as seen in Section 2.3, $(\lambda \mid X, A_t) \sim \mathcal{N}(\mu_{A_t X}, \sigma_p^2)$, so that the corresponding conditional expectation can be computed explicitly.
3. For $\sigma_\varepsilon^2$: similarly, we get an explicit limit.
4. For $\sigma_\lambda^2$: the same argument applies.
In these expressions, $A * X_A = (\lambda_1, \ldots, \lambda_p)$ is the vector of the $p$ leading eigenvalues of $A$. In the remainder of the proof, we call these asymptotic rescaled values limit gradients. Using the formulas above, we derive the following limit gradients.
Taking $X \in \mathcal{V}_{np}$, we consider the limit gradient at the matrix $\mu \bullet X$. Up to factors $t$ and $t^2$, which do not affect the linear independence, we obtain an explicit expression for this limit gradient.
Finally, we take the matrix $\mu \bullet I_{np}$ and compute the corresponding limit gradient. Subtracting the limit gradient at $\mu \bullet I_{np}$, we can get linear combinations of gradients arbitrarily close to vectors of three forms: vectors supported on the $F$-block and given by $X - I_{np}$ with $X \in \mathcal{V}_{np}$, vectors supported on the $\mu$-block and given by $\lambda - \mu$ with $\|\lambda\| = \|\mu\|$, and vectors supported on the third coordinate. We can now use Lemma C.8, which states that the elements of the form $X - I_{np}$ span $\mathbb{R}^{np}$. Similarly, elements of the form $\lambda - \mu$ span $\mathbb{R}^p$: taking $\lambda = -\mu$, the space contains $-2\mu$ and thus $\mu$; hence it also contains all elements $\lambda$ with norm $\|\mu\|$, which can be rescaled to obtain the entire space. As a consequence, the three vector families above span the entire linear hyperplane $\mathbb{R}^{np} \times \mathbb{R}^p \times \mathbb{R} \times \{0\}$. Furthermore, the limit gradient at $\mu \bullet I_{np}$ has a non-zero last coordinate and does not belong to this hyperplane. As a consequence, the set of all limit gradients spans the entire gradient space. Therefore, the set of all gradients, which gets arbitrarily close to the limit gradients, also spans the entire gradient space. We can thus find $\dim\Theta$ matrices whose gradients are linearly independent, which concludes the proof.
With the result of Lemma 5.6, we can now obtain the asymptotic normality result.
Theorem 5.7. Assume that the empirical data distribution is given by the restricted model for some parameter $\theta_0 \in \Theta_{id}$. Then the MAP estimator $\hat\theta_N$ over $\Theta_{id}$ for the restricted model converges almost surely to $\theta_0$, and $\hat\theta_N$ is asymptotically normal: $\sqrt{N}(\hat\theta_N - \theta_0)$ converges in distribution to a centered Gaussian vector with covariance matrix $I(\theta_0)^{-1}$.
Proof. As verified in Lemma C.9, the restricted model is identifiable on $\Theta_{id}$, so that the only maximizer of $\ell(\theta)$ over $\Theta_{id}$ is $\theta_0$. The proof of the consistency Theorem 4.2 adapts without difficulty to the restricted model, proving that $\hat\theta_N$ converges to $\theta_0$ almost surely.
We can now check the conditions to apply Theorem 5.1. Since $\theta \mapsto \ell(\theta)$ is smooth over $\Theta$, it admits a second-order Taylor expansion at $\theta_0$, and Lemma 5.3 combined with Lemma 5.6 ensures that the Hessian matrix at this point is non-singular. Lemma C.10 shows that the Lipschitz condition (5.1) is satisfied by $\log p(A \mid \theta)$. Finally, condition (5.2) is satisfied, as the MAP estimator maximizes the log-posterior and the contribution of the prior term becomes negligible as $N$ grows. Theorem 5.1 thus applies, and grants the convergence in distribution of $\sqrt{N}(\hat\theta_N - \theta_0)$ to the announced Gaussian limit.

Conclusion
This paper provides theoretical guarantees for the estimation of the eigenvalue and eigenvector distributions of the adjacency matrix decomposition model introduced in [43]. The considered model is identifiable, its MAP estimator exists, and it converges almost surely to the points minimizing the Kullback-Leibler divergence to the empirical data distribution. By considering an alternate restricted model, we obtain the usual $1/\sqrt{N}$ convergence rate and the asymptotic normality of the MAP estimator using the theory of [54]. Our results show that asymptotic statistical analysis can be performed on manifold-valued latent variable models to obtain classical guarantees. Arguments similar to those we presented should allow obtaining results in related models where little theoretical work has been done. State-space models on Stiefel and Grassmann manifolds [14], eigendecomposition models for a single network matrix [26] or mixture models [2] could lend themselves to such an analysis.
The model considered here, like most of the literature on statistics for Stiefel manifolds, is estimated with the MLE or the MAP. Recently, [46] proposed a Bayesian framework for von Mises-Fisher distributions which allows computing the posterior distribution of $F$ given observations of $X$. An interesting question would be to analyze the behavior of this posterior distribution in a hierarchical model where $X$ is a latent variable, in a direction similar to the works of [41] and [20].
Finally, another important question on the model we studied is the analysis of its estimation error. In practice, [43] rely on a variant of the EM algorithm to estimate the model parameters. EM-based methods are known to produce local maxima of the likelihood, which prevents a rigorous theoretical analysis of the estimation error. However, even assuming that no local maximum is found, the E-step of the EM algorithm behaves in an undesirable way, as the conditional distribution of $(X, \lambda)$ given $A$ is multimodal (one mode per permutation and change of sign of the columns of $X$). This conditional distribution yields a very low estimated vMF concentration, far from the real one, as the samples of $X$ are spread over the manifold. A heuristic thus has to be employed in practice to ensure that $X$ stays close to $\Delta_0$ and to get a better estimate of the MAP. This question will be part of our future work.

Appendix A. The Cayley transform

Closed-form expressions are available for the exponential map on $\mathcal{V}_{np}$ (see again [21]), but they rely on the matrix exponential and little is known about the properties of the inverse mapping.
In contrast, the Cayley transform $\mathcal{C}_X$ behaves better in that regard. It also sends elements from $T_X\mathcal{V}_{np}$ to $\mathcal{V}_{np}$ and behaves like the exponential map close to $X$, in the sense that $\mathcal{C}_X(\varepsilon H) = X + \varepsilon H + O(\varepsilon^2)$. Denoting $K = HX^\top - XH^\top$, the Cayley transform at $X$ is obtained by applying the matrix Cayley transform of the skew-symmetric matrix $K$ to $X$. $\mathcal{C}_X$ was studied in more detail for $X = I_{np}$ in [29]. In practice, the Cayley transform is used in optimization to perform gradient descent [24], as it allows projecting the descent direction $\nabla_V f(X)$ onto the manifold and requires only simple linear algebra computations. We prefer it to the exponential map because it has a simple expression, is invertible, and covers the entire manifold apart from a set of measure zero.
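A minimal implementation of a Cayley-type retraction is given below; it follows the Wen-Yin formulation used in optimization (which may differ from the exact convention of this paper in the choice of $K$), and simply checks the two properties mentioned above: the image stays on the manifold and agrees with $X + H$ to first order.

```python
# A minimal numpy sketch of a Cayley-type retraction on the Stiefel manifold, following the
# Wen-Yin formulation (variable names ours).  It maps a tangent vector H at X back onto
# V_{np}, agrees with X + H to first order, and only uses basic linear algebra.
import numpy as np

def cayley_retraction(X, H):
    n = X.shape[0]
    Hhat = H - 0.5 * X @ (X.T @ H)           # correction so that the curve's derivative is H
    K = Hhat @ X.T - X @ Hhat.T              # skew-symmetric generator
    return np.linalg.solve(np.eye(n) - K / 2, (np.eye(n) + K / 2) @ X)

rng = np.random.default_rng(7)
X, _ = np.linalg.qr(rng.standard_normal((6, 2)))
Z = rng.standard_normal((6, 2))
H = Z - X @ (X.T @ Z + Z.T @ X) / 2          # project Z onto the tangent space at X
eps = 1e-4
Y = cayley_retraction(X, eps * H)
assert np.allclose(Y.T @ Y, np.eye(2))        # the image stays on the manifold
assert np.allclose(Y, X + eps * H, atol=1e-6) # first-order agreement with X + eps * H
```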

Appendix B. Proof of the consistency of the MAP estimator
We define $E_* = \max_{\theta\in\Theta_\infty} \ell(\theta)$, the optimal value of the limit objective $\ell$. The proof relies on the Alexandrov compactification $\overline\Theta_\infty$ of $\Theta_\infty$, which adds an infinity point for the coordinates $\sigma_\varepsilon$ (for the cases $\sigma_\varepsilon \in \{0, +\infty\}$), $\sigma_\lambda$ (for the case $\sigma_\lambda = +\infty$) and $\mu$ (for all the cases where $\|\mu\| = +\infty$).
Part A. We prove that, for all $\theta_\infty \in \overline\Theta_\infty$ such that $\delta(\xi(\theta_\infty), \Xi_*^\infty) \ge \varepsilon$, there exists an open neighborhood $U \subset \overline\Theta_\infty$ of $\theta_\infty$ such that
$$\mathbb{E}_{P(dA)}\Big[\sup_{\theta\in U\cap\Theta_\infty} \log p(A \mid \theta)\Big] < E_*. \qquad \text{(B.1)}$$
Let $U_h$ be a decreasing sequence of open sets such that $\cap_{h\ge 0} U_h = \{\theta_\infty\}$, and let $f_h(A) = \sup_{\theta\in U_h\cap\Theta_\infty} \log p(A \mid \theta)$. Two cases arise.
1. If $\theta_\infty \in \Theta_\infty$: since $\theta \mapsto \log p(A \mid \theta)$ is continuous, we have $f_h(A) \to \log p(A \mid \theta_\infty)$ as $h \to \infty$, and the sequence $f_h(A)$ is decreasing for every $A$. Furthermore, Lemma C.6 ensures that the sequence is bounded from above (with the upper bound obtained by taking the whole space for $U$). Hence the monotone convergence theorem applies, and $\mathbb{E}_{P(dA)}[f_h(A)] \to \mathbb{E}_{P(dA)}[\log p(A \mid \theta_\infty)] < E_*$. Therefore, it is sufficient to take $h$ large enough to have equation (B.1) satisfied.
2. If $\theta_\infty \notin \Theta_\infty$, i.e. the variance parameters $(\sigma_\lambda, \sigma_\varepsilon)$ take extreme values, we prove by contradiction that $\lim_{h\to\infty} f_h(A) = -\infty$ a.s. Let us assume that there exists a measurable set $E \in \mathcal{B}(\mathbb{R}^{n\times n})$ such that $P(A \in E) > 0$ and, for all $A \in E$, $\inf_h f_h(A) > -\infty$. Since $f_h(A)$ is decreasing for every $A$ in $E$, the infimum is reached at infinity. For each $h$, let $(\theta_{h,m})_m \in (U_h \cap \Theta_\infty)^{\mathbb{N}}$ be a sequence approaching the supremum defining $f_h$. By taking for each $h$ a value $\theta_{h,m_h}$ which is $h^{-1}$-close to the supremum, we obtain a sequence converging to $\theta_\infty$ along which the log-likelihood stays bounded from below on $E$; this contradicts Lemma C.7. Therefore, $P(dA)$-almost surely, $f_h(A) \to -\infty$. We can again apply Lemma C.6 and use the monotone convergence theorem, which grants that $\mathbb{E}_{P(dA)}[f_h(A)] \to -\infty$. Therefore, whether $\theta_\infty$ is in $\Theta_\infty$ or not, there exists an open neighborhood $U$ of $\theta_\infty$ such that equation (B.1) holds.
Part B. Define $K_\varepsilon$ as the set of parameters $\theta \in \overline\Theta_\infty$ such that $\delta(\xi(\theta), \Xi_*^\infty) \ge \varepsilon$. By definition of the Alexandrov compactification and by the continuity of $\delta$, $K_\varepsilon$ is a compact set, hence we can find a finite open cover $(U_h)_{h\le H}$ of it, where each $U_h$ satisfies equation (B.1). Let $N \in \mathbb{N}$. For all $\theta \in K_\varepsilon$, $\frac{1}{N}\sum_{i=1}^N \log p(A_i \mid \theta) \le \max_{h\le H} \frac{1}{N}\sum_{i=1}^N \sup_{\theta'\in U_h\cap\Theta_\infty} \log p(A_i \mid \theta')$. Since the observations $A_i$ are independent (H2), by the law of large numbers and by the definition of the $U_h$, $\limsup_N \sup_{\theta\in K_\varepsilon} \frac{1}{N}\sum_{i=1}^N \log p(A_i \mid \theta) < E_*$.
Proof. From the definition of our model, the likelihood writes as an integral of the complete density over the latent variables; marginalizing the Gaussian variable $\lambda$ analytically and keeping the integral over $X$ gives the claimed expression.
Lemma C.2 (Bound on the log-likelihood). For every matrix $A$ and all parameters $\theta$, the log-likelihood $\log p(A \mid \theta)$ admits an explicit upper bound.
Proof. The bound is obtained from the expression of Lemma C.1. Furthermore, from the definition of $\sigma_p$ we have $\sigma_p \le \sigma_\varepsilon$, which gives the desired bound.
Lemma C.3 (Continuous extension of the likelihood). The likelihood $p(A \mid \theta)$ extends continuously when $s_i = +\infty$ for a subset $I$ of $r$ indices or when $\sigma_\lambda = 0$. In other words, $\theta \mapsto p(A \mid \theta)$ is continuous over $\Theta_\infty$. We use the following notations: $J$ is the complement of $I$ in $\{1, \ldots, p\}$; $X_I$ is the $n \times r$ matrix $(x_{i_1}, \ldots, x_{i_r})$; $M_I^\perp$ denotes an $n \times (n-r)$ matrix such that $M_I^\top M_I^\perp = 0$ and $M_I^\perp \in \mathcal{V}_{n,n-r}$; $q_{\mathrm{vMF}}(X, F)$ is the von Mises-Fisher density with parameter $F$ and variable $X$; and $F = M\operatorname{Diag}(s)$ is the parameterization of $F$ described in Section 2. If all latent variables are constant, the extension yields the Gaussian likelihood $A \sim \mathcal{N}(\mu \bullet M, \sigma_\varepsilon^2 I_{n\times n})$.
Proof. For notational convenience, we suppose that $I$ is composed of the first $r$ indices of $\{1, \ldots, p\}$. Let $\{X_I = M_I\} \subset \mathcal{V}_{np}$ be the set of values of $X$ such that $X$ and $M$ match on the columns of $I$. The continuity at infinity comes from the expression of the likelihood as a conditional expectation. This conditional expectation, computed below, is continuous (as the parameters appearing in it remain finite). Furthermore, in distribution, $X_I \to M_I$ and $\lambda \to \mu$ as $s_I \to +\infty$ and $\sigma_\lambda \to 0$. Therefore, in the limit, the expression reduces to the conditional expectation taken at the limiting values, where the measure for $X$ corresponds to the Hausdorff measure over $\{X_I = M_I\}$. Furthermore, this set can be mapped to a lower-dimensional Stiefel manifold by an isometry with respect to the Haar measures (which coincide with the Hausdorff measures for Stiefel manifolds, as noted in [29]). We thus get the claimed limit, and similarly for the limit as $\sigma_\lambda \to 0$.
Lemma C.4 (Better bound on the likelihood). For $\theta \in \Theta$ and $A \in \mathbb{R}^{n\times n}$ such that $\|\mu\| > \max(2\|A\|_F, 2\sigma_\lambda\sqrt{p/2 - 1})$, the likelihood $p(A \mid \theta)$ satisfies a sharper upper bound than that of Lemma C.2.
Proof. Using Proposition 2.2, which in particular grants that $\|A * X\| \le \|A\|_F$, and since $\|\mu\| > 2\|A\|_F$, the eigenvalue vector $\lambda$ must be far from its mean $\mu$ for $\lambda \bullet X$ to be close to $A$. Since by definition $\lambda \sim \mathcal{N}(\mu, \sigma_\lambda^2 I_p)$, $\frac{1}{\sigma_\lambda^2}\|\lambda - \mu\|^2$ follows a chi-squared distribution with $p$ degrees of freedom, whose CDF is given, up to a normalizing constant, by the incomplete Gamma integral $\int_0^{x/2} t^{p/2-1}e^{-t}\,dt$. Furthermore, the function $t \mapsto t^{p/2-1}e^{-t}$ is decreasing for $t > p/2 - 1$, and $\|\mu\|^2 > 4\sigma_\lambda^2(p/2 - 1)$, so that the corresponding tail probability can be bounded explicitly. This finally yields the claimed result.
Lemma C.5 ([11], Lem. 1). Let $p < q$ be two integers. Then, for any differentiable map $f : \mathbb{R}^p \to \mathbb{R}^q$ and any compact subset $K$ of $\mathbb{R}^p$, there exists a constant $\lambda$ depending only on $p$ and $q$ such that the integral bound used in the proof of Lemma C.6 below holds.
Lemma C.6. Assume hypotheses H1 and H3. Then $\sup_{\theta\in\Theta_\infty}\log p(A \mid \theta)$ is bounded by a function of $A$ which is $P$-integrable.
Proof. For an observation $A$ and all $\theta \in \Theta$, the likelihood can be bounded in terms of the distance from $A$ to the set $\mathcal{R}_{np}$ of $n \times n$ matrices with rank less than $p$. This inequality remains true for $\theta \in \Theta_\infty$, as both sides extend continuously to $\Theta_\infty$. The bound is maximized over $\sigma_\varepsilon$, which gives, after taking the positive part and up to a finite additive constant $\alpha$, a bound involving $\log^+ d(A, \mathcal{R}_{np})^{-1}$. We now want to apply Lemma C.5 to integrate this bound over $A$. To that end, we need to parameterize $\mathcal{R}_{np}$ with a map from a lower-dimensional space. The naive mapping $\mathbb{R}^p \times \mathbb{R}^{n\times p} \to \mathcal{R}_{np}$ sending $(\lambda, X)$ to $\lambda \bullet X$ does not work directly, as it is not "coercive", in the sense that $(\lambda, X)$ can go to infinity while $\lambda \bullet X$ possibly stays bounded. This problem is overcome by restricting the $X$ domain of the map $(\lambda, X) \mapsto \lambda \bullet X$ to a set of points close to $\mathcal{V}_{np}$.
Let $f : \mathbb{R}^{n\times p} \times \mathbb{R}^p \to \mathbb{R}^{n\times n}$ be defined by $f(U, v) = v \bullet U = U\operatorname{Diag}(v)U^\top$. Then $\mathcal{R}_{np} = f(\mathcal{V}_{np} \times \mathbb{R}^p)$. We have furthermore $Df_{U,v}(H, w) = U\operatorname{Diag}(v)H^\top + H\operatorname{Diag}(v)U^\top + U\operatorname{Diag}(w)U^\top$, so the operator norm of the differential (for the matrix operator norm) satisfies $\|Df_{U,v}\|^2 \le C_{np}\,\|(U, v)\|^2$ (with $C_{np}$ a generic product of norm equivalence constants, whose definition may implicitly vary depending on the equation).
Let $\beta \in\, ]0, 1]$. Since $\mathcal{V}_{np}$ is a compact subset of $\mathbb{R}^{n\times p}$, there exist $X_1, \ldots, X_H \in \mathcal{V}_{np}$ such that the union of Frobenius balls $\cup_{h=1}^H B_F(X_h, \beta)$ covers $\mathcal{V}_{np}$. In particular, the previous bound can be applied on each of these balls and then summed against $P(dA)$.
Since $P$ has an exponentially decaying tail beyond some compact set (Hypothesis H3), this sum converges to a finite value. Since the sequence $\big(\log^+ d(A, f(D_T))^{-1}\big)_{T\in\mathbb{N}}$ is non-negative, non-decreasing and converges to $\log^+ d(A, \mathcal{R}_{np})^{-1}$, the monotone convergence theorem concludes the proof.

We use the gradient of $p(X \mid \theta)$ to identify the concentration parameters $(s_i)$. Since the function is defined over $\mathcal{V}_{np}$, we only have access to the projection of its gradient onto the tangent spaces. If we denote by $G(X)$ the Euclidean gradient, the projected manifold gradient writes $G_V(X) = G(X) - XG(X)^\top X$ [21]. In the case of the function $p(X \mid \theta)$, the manifold gradient thus is $G_V(X) = p(X \mid \theta)(F - XF^\top X)$. As a consequence, the function $h(X) = F - XF^\top X$ is known over $\Delta_0$. Coherently, we have $h(M) = M\operatorname{Diag}(s) - M\operatorname{Diag}(s)M^\top M = 0$. We will use the first-order variations of $h(X)$ around $M$ to retrieve $s$.
These variations are retrieved by using the Cayley transform on tangent vectors at $M$ (any other smooth retraction map could be used here). As recalled in Appendix A, such tangent vectors $H \in T_M\mathcal{V}_{np}$ write as $H = MA + M^\perp B$, with $A^\top = -A$. Denoting $K = HM^\top - MH^\top$, the Cayley transform at $M$ satisfies, as in Lemma C.8, $\mathcal{C}_M(\varepsilon H) = M + \varepsilon H + O(\varepsilon^2)$. This gives the first-order variation of $h$ around $M$. Taking $B = 0$ and normalizing by $\varepsilon$, we obtain the value of $M[\operatorname{Diag}(s)A + A\operatorname{Diag}(s)]$ for every $p \times p$ skew-symmetric matrix $A$, which gives $\operatorname{Diag}(s)A + A\operatorname{Diag}(s)$ when multiplying by $M^\top$. For every $i, j$, taking for $A$ the matrix with $A_{ij} = -A_{ji} = 1$ and zeros everywhere else gives the value of $s_i + s_j$. This yields an over-determined system of equations which allows identifying the $s_i$'s.
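This last step can be made concrete: given the values $s_i + s_j$ for all $i < j$, the concentrations are recovered by solving a small linear system, as in the sketch below (ours); the system determines $s$ uniquely as soon as $p \ge 3$.

```python
# Small sketch of the last step above: given the values s_i + s_j for all i < j, the
# concentrations s can be recovered by least squares (names and set-up are ours).
import numpy as np

def recover_s(pair_sums, p):
    """pair_sums: dict {(i, j): s_i + s_j} for i < j.  Returns the least-squares s."""
    rows, rhs = [], []
    for (i, j), v in pair_sums.items():
        row = np.zeros(p); row[i] = 1.0; row[j] = 1.0
        rows.append(row); rhs.append(v)
    s, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return s

s_true = np.array([4.0, 2.5, 1.0])
sums = {(i, j): s_true[i] + s_true[j] for i in range(3) for j in range(i + 1, 3)}
print(recover_s(sums, p=3))   # recovers [4.0, 2.5, 1.0]
```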
Lemma C.10. If the empirical data distribution is given by $P(A) = \tilde p(A \mid \theta_0)$, then condition (5.1) for the asymptotic normality theorem of [54] is satisfied by the restricted model on a neighborhood of $\theta_0$.
Proof. We are looking for a function $L : \mathbb{R}^{n\times n} \to \mathbb{R}_+$ with $\mathbb{E}[L(A)^2] < +\infty$ and such that, for $\theta_1$ and $\theta_2$ sufficiently close to $\theta_0$, $|\log \tilde p(A \mid \theta_1) - \log \tilde p(A \mid \theta_2)| \le L(A)\,\|\theta_1 - \theta_2\|$. Transposed to the restricted model, Lemma C.1 gives the marginalized expression of the likelihood. For two parameters $\theta_1$ and $\theta_2$, all terms apart from the integral over $\Delta_0$ are Lipschitz in $\theta$ with constants controlled by a polynomial in $\|A\|_F$, which has a finite second moment under $P$ by Hypothesis H3; the integral term is handled with the same bounds, which yields the desired function $L$.

Part C. For each $\theta_* \in \Theta_*^\infty$, the law of large numbers gives $\lim_{N\to+\infty} \frac{1}{N}\sum_{i=1}^N \log p(A_i \mid \theta_*) = E_*$. Let $\theta_k$ be a sequence of parameters with finite values such that $\theta_k \to \theta_*$; the same limit holds along this sequence, and since $\hat\theta_N$ maximizes the posterior, whose prior term is negligible after normalization, we get $\liminf_N \frac{1}{N}\sum_{i=1}^N \log p(A_i \mid \hat\theta_N) \ge E_*$.

Lemma C.1. The model likelihood rewrites as a marginal expectation over the eigenvector variable, where $\mathbb{E}_X$ denotes the expectation taken with respect to $X$ only.