SQUARED QUADRATIC WASSERSTEIN DISTANCE: OPTIMAL COUPLINGS AND LIONS DIFFERENTIABILITY

Abstract. In this paper, we remark that any optimal coupling for the quadratic Wasserstein distance W_2^2(µ, ν) between two probability measures µ and ν with finite second order moments on R^d is the composition of a martingale coupling with an optimal transport map T. We check the existence of an optimal coupling in which this map gives the unique optimal coupling between µ and T#µ. Next, we give a direct proof that σ ↦ W_2^2(σ, ν) is differentiable at µ in the Lions (Cours au Collège de France, 2008) sense iff there is a unique optimal coupling between µ and ν and this coupling is given by a map. It was known, by combining results of Ambrosio, Gigli and Savaré (Lectures in Mathematics ETH Zürich, Birkhäuser Verlag, Basel, 2005) and Ambrosio and Gangbo (Comm. Pure Appl. Math., 61:18–53, 2008), that geometric differentiability holds under the latter condition. Moreover, the two notions of differentiability are equivalent according to the recent paper of Gangbo and Tudorascu (J. Math. Pures Appl., 125:119–174, 2019). Besides, we give a self-contained probabilistic proof that mere Fréchet differentiability of a law invariant function F on L^2(Ω, P; R^d) is enough for the Fréchet differential at X to be a measurable function of X.


Introduction
In this paper, we are interested in the structure of optimal couplings for the squared quadratic Wasserstein distance W_2^2(µ, ν) between µ and ν in the set P_2(R^d) of probability measures with finite second order moments on R^d, and in the differentiability of W_2^2(µ, ν) with respect to µ. By definition,

W_2^2(µ, ν) = inf_{π ∈ Π(µ,ν)} ∫_{R^d×R^d} |y − x|^2 π(dx, dy),

where Π(µ, ν) denotes the set of coupling measures on R^d × R^d with first and second marginals respectively equal to µ and ν, and |·| denotes the Euclidean norm on R^d. There always exists an optimal coupling and we denote by Π_opt(µ, ν) the set of optimal couplings. According to [11], there exists only one W_2-optimal coupling π between µ and each ν ∈ P_2(R^d) and this coupling is given by a map T (i.e. π = (I_d, T)#µ, where I_d denotes the identity function on R^d) iff µ gives zero mass to the c−c hypersurfaces of dimension d − 1. Even when µ does not satisfy this condition, which is implied by absolute continuity with respect to the Lebesgue measure, according to Proposition 5.13 in [8], if ϕ : R^d → R is a C^2 strictly convex function such that ∫_{R^d} |∇ϕ(x)|^2 µ(dx) < ∞, then there is a unique W_2-optimal coupling between µ and ν = ∇ϕ#µ and this coupling is given by the map ∇ϕ. But there also exist measures ν ∈ P_2(R^d) such that either the unique optimal coupling (uniqueness holds in dimension d = 1 for instance) is not given by a map, or there exist distinct optimal couplings. In the latter case, any strictly convex combination of these couplings is an optimal coupling which is not given by a map.
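To make the definition of W_2^2 concrete, one can compute it for equal-weight empirical measures in dimension 1, where the monotone (quantile) coupling is known to be optimal. The sketch below is our illustration, not part of the paper; the helper names are ours, and it compares the sorted-sample formula with a brute-force minimisation over all couplings induced by bijections.

```python
from itertools import permutations

def w2_squared_1d(xs, ys):
    # Monotone (quantile) coupling: pair the order statistics.
    # This coupling is W_2-optimal on the real line.
    xs, ys = sorted(xs), sorted(ys)
    return sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs)

def w2_squared_bruteforce(xs, ys):
    # For equal-weight empirical measures with the same number of atoms,
    # it suffices to minimise over couplings induced by bijections.
    n = len(xs)
    return min(
        sum((xs[i] - ys[p[i]]) ** 2 for i in range(n)) / n
        for p in permutations(range(n))
    )

xs = [0.0, 1.0, 3.0, 4.5]
ys = [-1.0, 0.5, 2.0, 5.0]
print(w2_squared_1d(xs, ys))          # 0.625
print(w2_squared_bruteforce(xs, ys))  # 0.625, the same value
```

For atoms with unequal weights the infimum over bijections no longer suffices, and one must solve a linear program over Π(µ, ν).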
In Section 2, we study optimal couplings π which are not given by a map. By disintegration, π(dx, dy) = µ(dx)k(x, dy) for some Markov kernel k on R^d (which is µ(dx) a.e. unique). Setting T(x) = ∫_{R^d} y k(x, dy) and using the bias-variance decomposition under the kernel k, we obtain that π is the composition of a martingale coupling between T#µ and ν with the map T, which gives a W_2-optimal coupling between µ and T#µ. Note that couplings of this form have recently been studied by Gozlan and Juillet [12] in connection with the barycentric optimal cost problem. For φ : R^d → R a strictly convex function such that ∫_{R^d} φ(y)ν(dy) < ∞, by minimizing ∫_{R^d} φ(T(x))µ(dx) over the W_2-optimal couplings between µ and ν, we obtain optimal couplings such that the associated map T_φ gives the only optimal coupling between µ and T_φ#µ. There is a unique such coupling when φ(x) = |x|^2.
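The bias-variance decomposition invoked here states that, for each x, ∫_{R^d} |y − x|^2 k(x, dy) = |T(x) − x|^2 + ∫_{R^d} |y − T(x)|^2 k(x, dy), where T(x) is the barycentre of k(x, ·). A minimal numerical check on a toy discrete kernel (our illustration; the chosen kernel is arbitrary):

```python
# Bias-variance decomposition under a Markov kernel k(x, .):
#   int |y - x|^2 k(x, dy) = |T(x) - x|^2 + int |y - T(x)|^2 k(x, dy),
# where T(x) = int y k(x, dy) is the barycentric projection.
x = 1.0
atoms = [(-2.0, 0.2), (0.5, 0.5), (3.0, 0.3)]  # pairs (y, k(x, {y}))

T_x = sum(w * y for y, w in atoms)               # barycentre T(x)
cost = sum(w * (y - x) ** 2 for y, w in atoms)   # int |y - x|^2 k(x, dy)
bias2 = (T_x - x) ** 2
var = sum(w * (y - T_x) ** 2 for y, w in atoms)  # variance of k(x, .)

assert abs(cost - (bias2 + var)) < 1e-12
print(T_x, cost, bias2 + var)
```

Integrating the decomposition against µ(dx) is exactly what splits the quadratic cost of π into the cost of the map T plus the cost of the martingale stage.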
In Section 3, we are interested in the differentiability of W_2^2(µ, ν) in the Lions sense with respect to µ. Gangbo and Tudorascu have recently proved in Corollary 3.22 of [10] that the Lions differentiability [15] of a function f : P_2(R^d) → R is equivalent to its geometric differentiability, and that the Fréchet derivative of the lift at X ∼ µ is then given by ∇_µ f(X), where ∇_µ f ∈ L^2(R^d, µ; R^d) is the geometric (or Wasserstein) gradient of f at µ. While the lifted space that they consider is the ball centered at the origin with unit volume in R^d endowed with the Lebesgue measure, the result can be transferred to any atomless lifted space by considering an almost isomorphism between those spaces (1). Theorem 10.2.6 of [4] states that σ ↦ W_2^2(σ, ν) is subdifferentiable in the geometric sense at µ when Π_opt(µ, ν) = {(I_d, T)#µ} for some measurable transport map T. On the other hand, Proposition 4.3 of [3] states that σ ↦ W_2^2(σ, ν) is always superdifferentiable in the geometric sense at µ, with x ↦ 2(x − ∫_{R^d} y k(x, dy)) belonging to the superdifferential for each Markov kernel k on R^d such that µ(dx)k(x, dy) ∈ Π_opt(µ, ν). Since geometric differentiability amounts to simultaneous geometric sub- and superdifferentiability, as soon as Π_opt(µ, ν) = {(I_d, T)#µ}, the function σ ↦ W_2^2(σ, ν) is differentiable in the geometric sense at µ. Conversely, geometric differentiability implies that the geometric sub- and superdifferentials, considered as subsets of L^2(R^d, µ; R^d), coincide and contain one element only (see for instance [8], Prop. 5.63). The fact that the quotient of {x ↦ ∫_{R^d} y k(x, dy) : µ(dx)k(x, dy) ∈ Π_opt(µ, ν)} for the µ(dx) a.e. equality is a singleton is therefore necessary for the geometric differentiability of σ ↦ W_2^2(σ, ν) at µ. We prove that σ ↦ W_2^2(σ, ν) is differentiable at µ in the Lions sense iff Π_opt(µ, ν) = {(I_d, T)#µ}.
We give a direct probabilistic proof of the sufficient condition, which also follows from the just mentioned results. To prove the necessary condition, we use the fact that Fréchet differentiability at X ∼ µ of the lift on an atomless probability space is enough for the Fréchet derivative at X to be a.s. equal to a measurable function of X, a consequence of [10] that we prove again using simple probabilistic arguments. Let us emphasize that the quotient of {x ↦ ∫_{R^d} y k(x, dy) : µ(dx)k(x, dy) ∈ Π_opt(µ, ν)} for the µ(dx) a.e. equality may be a singleton while Π_opt(µ, ν) is not equal to {(I_d, T)#µ} for any measurable map T.

Structure of quadratic Wasserstein optimal couplings
In this section, we are interested in characterizing the set

Π_opt(µ, ν) = {π ∈ Π(µ, ν) : ∫_{R^d×R^d} |y − x|^2 π(dx, dy) = W_2^2(µ, ν)}

of optimal couplings between two probability measures µ, ν ∈ P_2(R^d) for the quadratic cost. This set is not empty: see e.g. [4], page 133. The refined version of the Brenier theorem in [11] ensures that Π_opt(µ, ν) contains a single element (I_d, T)#µ given by a measurable transport map T : R^d → R^d for each ν ∈ P_2(R^d) iff µ does not give mass to the c−c hypersurfaces, namely the sets obtained, for some index i ∈ {0, . . . , d − 1} and two convex functions f and g from R^{d−1} to R, as the graph along the corresponding coordinate direction of the difference f − g. The next lemma deals with the case where Π_opt(µ, ν) = {(I_d, T)#µ} for some measurable transport map.

(1) We thank one of the referees for pointing out this argument to us.
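Proposition 5.13 in [8], recalled in the introduction, can be illustrated in dimension 1: for a strictly convex ϕ, the coupling (I_d, ∇ϕ)#µ between µ and ∇ϕ#µ should be the optimal one. A brute-force check on a toy example (ours; ϕ(x) = x^4/4 + x^2/2 is an arbitrary strictly convex choice, so ∇ϕ(x) = x^3 + x is strictly increasing):

```python
from itertools import permutations

# Toy check that the coupling given by the map grad_phi is optimal:
# nu = grad_phi # mu, and the best assignment for the quadratic cost
# between the atoms of mu and nu should be the identity matching.
grad_phi = lambda x: x ** 3 + x
xs = [-1.5, -0.2, 0.3, 1.1, 2.0]        # atoms of mu (equal weights)
ys = [grad_phi(x) for x in xs]          # atoms of nu = grad_phi # mu

best = min(permutations(range(len(xs))),
           key=lambda p: sum((xs[i] - ys[p[i]]) ** 2 for i in range(len(xs))))
assert best == tuple(range(len(xs)))    # the map grad_phi wins
print(best)
```

In dimension 1 this is just the optimality of the monotone coupling: since ∇ϕ is increasing, it matches the order statistics of µ with those of ∇ϕ#µ.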
Lemma 2.1. Let µ, ν ∈ P_2(R^d). One of the two following conditions holds:

Moreover, if any coupling in Π_opt(µ, ν) is given by a map, i.e. is of the form (I_d, T)#µ for some measurable function T :

The second statement easily follows.
Remarking that if ν is the Dirac mass at x ∈ R^d and ν_ε the uniform distribution on the ball centered at x with radius ε, then W_2(ν, ν_ε) ≤ ε, we deduce from the next proposition that for any µ, ν ∈ P_2(R^d), we can always find µ_ε, ν_ε ∈ P_2(R^d) such that W_2(µ, µ_ε) ≤ ε, W_2(ν, ν_ε) ≤ ε and there exists µ_ε(dx)k_ε(x, dy) ∈ Π_opt(µ_ε, ν_ε) such that k_ε(x, dy) is not a Dirac mass. Then for all µ ∈ P_2(R^d), there exists a sequence (µ_n)_n of elements of P_2(R^d) such that lim_{n→∞} W_2(µ_n, µ) = 0 and, for each n, there does not exist a measurable map T_n : R^d → R^d such that T_n#µ_n = ν.

Proof. Let (X_i)_{i≥1} be an i.i.d. sequence of random variables with law µ, and (Y_i)_{i≥1} an independent i.i.d. sequence of uniform random variables on the unit ball {x ∈ R^d : |x| ≤ 1}. We set µ̂_n = (1/n) Σ_{i=1}^n δ_{X_i} the empirical measure and µ_n = (1/n) Σ_{i=1}^n δ_{X_i + Y_i/n}. By construction, we have W_2^2(µ_n, µ̂_n) ≤ (1/n) Σ_{i=1}^n |Y_i/n|^2 ≤ 1/n^2 and P(∃i ≠ j, X_i + Y_i/n = X_j + Y_j/n) = 0, which means that for each n ∈ N*, µ_n is a.s. supported on exactly n points. The law of large numbers gives the almost sure weak convergence of µ̂_n towards µ together with the almost sure convergence of the second order moments, so that [4] ensures that W_2(µ̂_n, µ) → 0 almost surely as n → +∞. By the triangle inequality, we get W_2(µ_n, µ) → 0 almost surely as n → +∞. Now, we consider (p_n)_{n≥1} the increasing sequence of prime numbers. Suppose that there exist n_0 ∈ N* and a measurable map T such that T#µ_{p_{n_0}} = ν. Then ν weights at most p_{n_0} points and the masses are equal to k/p_{n_0} with 1 ≤ k ≤ p_{n_0} − 1, since ν is not a Dirac mass. Then, if we had T'#µ_{p_n} = ν for some n > n_0 and some measurable map T', we would have k/p_{n_0} = k'/p_n with 1 ≤ k' ≤ p_n − 1. This would imply that p_{n_0} divides kp_n and thus k, which is impossible since 1 ≤ k ≤ p_{n_0} − 1. Thus, there is at most one n_0 ∈ N* such that there is a transport map T_{n_0} satisfying T_{n_0}#µ_{p_{n_0}} = ν, and the subsequence (µ_{p_n})_n, with at most one term removed, satisfies the announced property.
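The arithmetic step of this proof — the atom masses of T#µ are multiples of 1/p when µ is uniform on p points, and k/p = k'/q is impossible for distinct primes p, q with 1 ≤ k ≤ p − 1 — can be checked exactly with rational arithmetic (our illustration):

```python
from fractions import Fraction

# If mu is uniform on p atoms, every atom of a pushforward T#mu has mass
# k/p for some 1 <= k <= p.  Excluding the Dirac case (k = p), the possible
# masses for two distinct primes p and q never coincide: k/p = k'/q would
# force p to divide k, which is impossible for 1 <= k <= p - 1.
p, q = 7, 11
masses_p = {Fraction(k, p) for k in range(1, p)}
masses_q = {Fraction(k, q) for k in range(1, q)}
assert masses_p.isdisjoint(masses_q)
print(len(masses_p & masses_q))  # 0
```

This is why a non-Dirac ν can be reached by a transport map from at most one of the empirical measures µ_{p_n} indexed by primes.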
Proof. Let X ∼ µ and U be an independent random variable uniform on [0, 1]. The random variable

Remark 2.5. Lemma 2.3 still holds true for µ, ν probability measures on R with finite moments of order ρ ≥ 1 and a transport cost c(x, y) = h(|y − x|) with h : R_+ → R strictly convex and such that ∃C < ∞, ∀x ∈ R, h(|x|) ≤ C(1 + |x|^ρ). The same proof applies since, by Theorem 2.9 in [16], the only optimal coupling for such a cost is the image of the Lebesgue measure on [0, 1] by (F_µ^{−1}, F_ν^{−1}).
The next proposition, which is one of the main results of this section, shows that any W_2-optimal coupling can be written as the composition of a transport map and a martingale kernel, i.e. a Markov kernel m on R^d such that for all x ∈ R^d, ∫_{R^d} y m(x, dy) = x. Let us now give the definition of the convex order on probability measures before recalling its link with the existence of martingale couplings.

Definition 2.6. Let η, ν be two probability measures on R^d. We say that η is smaller than ν in the convex order, and write η ≤_cx ν, if for each convex function φ : R^d → R such that the integrals make sense,

∫_{R^d} φ(x)η(dx) ≤ ∫_{R^d} φ(y)ν(dy).

Notice that since a convex function φ on R^d is bounded from below by an affine function, for a probability measure η on R^d with finite first order moment (and in particular for η ∈ P_2(R^d)), ∫_{R^d} φ(x)η(dx) always makes sense, possibly equal to +∞.
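Definition 2.6 can be made concrete on a two-point example (our illustration): take η = δ_0 and let m(0, ·) put mass 1/3 on −2 and 2/3 on 1. Then m is a martingale kernel (its barycentre at 0 is 0), ν = ηm, and Jensen's inequality gives η ≤_cx ν, which we test on a few convex functions:

```python
# eta = delta_0; the martingale kernel m(0, .) has mean zero, so
# nu = eta m and eta <=_cx nu: phi(0) <= int phi dnu for convex phi.
nu = [(-2.0, 1.0 / 3.0), (1.0, 2.0 / 3.0)]   # pairs (y, nu({y}))
mean = sum(w * y for y, w in nu)
assert abs(mean) < 1e-12                     # martingale property

convex_tests = [abs,
                lambda y: y * y,
                lambda y: max(y - 0.5, 0.0),
                lambda y: y ** 4]
for phi in convex_tests:
    assert phi(0.0) <= sum(w * phi(y) for y, w in nu) + 1e-12
print("eta <=_cx nu verified on sample convex functions")
```

Of course a finite family of test functions proves nothing in general; here the martingale property of the kernel is what guarantees the convex order for every convex φ.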
The first part of this proposition is also a consequence of Theorem 12.4.4 in [4]: the barycentric projection of µ(dx)k(x, dy) is precisely (I_d, T)#µ. Here, we present this result in a probabilistic fashion. For µ(dx)k(x, dy) as in the first statement and (X, Y) distributed according to this coupling, T(X) = E[Y | X] a.s., and this optimal coupling is the composition of the martingale coupling given by the law of (T(X), Y) and the transport map T. Notice that since it relies on the bias-variance decomposition, this structure of optimal couplings does not seem to generalize to other Wasserstein distances. Nevertheless, Gozlan and Juillet [12] have recently obtained optimal couplings that are the composition of a martingale coupling and a deterministic transport map by considering the barycentric optimal cost problem, which consists in minimizing, for a given cost function θ, the cost of the barycentric projections of the couplings between µ and ν.

Proof. Let us first prove the second statement. Let η ≤_cx ν, q be a Markov kernel such that µ(dx)q(x, dz) ∈ Π_opt(µ, η) and m be any martingale kernel such that ηm = ν. Then µ(dx)qm(x, dy) is a coupling between µ and ν such that

∫_{R^d×R^d} |y − x|^2 µ(dx)qm(x, dy) = W_2^2(µ, η) + ∫_{R^d} |y|^2 ν(dy) − ∫_{R^d} |z|^2 η(dz),

where we used the variance-bias decomposition under the martingale kernel m. Hence, if (2.1) holds, then µ(dx)qm(x, dy) ∈ Π_opt(µ, ν).
Jensen's inequality immediately gives η ≤_cx ν and thus η ∈ P_2(R^d). We have

∫_{R^d×R^d} |y − x|^2 µ(dx)k(x, dy) = ∫_{R^d} |T(x) − x|^2 µ(dx) + ∫_{R^d×R^d} |y − T(x)|^2 µ(dx)k(x, dy),

where we used the variance-bias decomposition with respect to k(x, ·). With (2.2), we deduce that W_2^2(µ, η) = ∫_{R^d} |T(x) − x|^2 µ(dx) and T is a W_2-optimal transport map between µ and η.
Proof. By the second assertion in Lemma 2.8, the characterization of I_µ^ν easily follows from the one of Ĩ_µ^ν, which, with the definition of Ĩ_µ^ν, the first statement in Proposition 2.7 and the uniqueness of the optimal coupling in dimension d = 1, also implies that Π_opt(µ, T#µ) = {(I_1, T)#µ}. Let U, U' be two independent uniform random variables on [0, 1], from which we define a uniform random variable V. According to Theorem 2.9 in [16], the law of (F_µ^{−1}(V), F_ν^{−1}(V)) is the single element of Π_opt(µ, ν) and, by (2.6), the single element of Ĩ_µ^ν is the law T#µ of T(F_µ^{−1}(V)). Since T is nondecreasing, F_{T#µ}^{−1}(V) = T(F_µ^{−1}(V)) a.s. Hence the law of (F_{T#µ}^{−1}(V), F_ν^{−1}(V)), which is the single element of Π_opt(T#µ, ν), is a martingale coupling. Since all the martingale couplings share the quadratic cost ∫_R y^2 ν(dy) − ∫_R (T(x))^2 µ(dx), each martingale coupling belongs to Π_opt(T#µ, ν) and is therefore equal to the previous one.
According to the next theorem, we can find elements η in Ĩ_µ^ν such that Π_opt(µ, η) = {(I_d, T)#µ} for some measurable transport map T by minimizing over I_µ^ν the integral of a strictly convex function.
This theorem allows us to select extreme elements of I_µ^ν and provides the following characterization of the existence of a minimal element for the convex order in this set.

Corollary 2.12. For µ, ν ∈ P_2(R^d), there exists η_0 ∈ P_2(R^d) such that I_µ^ν = {η ∈ P_2(R^d) : η_0 ≤_cx η ≤_cx ν} if and only if

{η_φ : φ : R^d → R strictly convex and such that ∫_{R^d} φ(y)ν(dy) < ∞} = {η},

and then η_0 = η. Let us show the corollary before proving the theorem.
Proof of Corollary 2.12. The necessary condition is obvious. Let us show that it is sufficient. It is enough to check that η is smaller in the convex order than any element of I_µ^ν, for which strictly convex functions suffice (see [1], Lem. A.1). For such a function φ and for ε > 0, φ_ε(x) := φ(x) + ε|x|^2 is strictly convex and, since η_{φ_ε} = η, we have the corresponding comparison of integrals. We conclude by letting ε → 0 using the dominated convergence theorem.
To prove Theorem 2.11, we will need the following lemma.

Lemma 2.13. Let ν be a probability measure on R^d such that ∫_{R^d} |y|ν(dy) < ∞ and φ : R^d → R a convex function such that ∫_{R^d} φ(y)ν(dy) < ∞. Then the family of probability measures {φ#η : η ≤_cx ν} is uniformly integrable.
Proof of Lemma 2.13. Let us first suppose that φ is nonnegative. Let M ∈ (0, +∞), η ≤_cx ν and m be a martingale kernel such that ∫_{x∈R^d} η(dx)m(x, dy) = ν(dy). Using Jensen's inequality for the first inequality and the Markov inequality combined with η ≤_cx ν for the third one, we obtain the desired uniform bound. In particular, the family {|x|#η : η ≤_cx ν} is uniformly integrable. When the sign of φ is not constant, we obtain a nonnegative convex function φ̃ such that ∫_{R^d} φ̃(y)ν(dy) < ∞ by adding to φ a suitable affine function ψ. The conclusion follows from the uniform integrability of both families {ψ#η : η ≤_cx ν} and {φ̃#η : η ≤_cx ν}.

Differentiability of the squared quadratic Wasserstein distance
We now present the notion of differentiability introduced by Lions [15]. Let f : P_2(R^d) → R. We consider an atomless probability space (Ω, A, P) and denote by L^2(Ω, P; R^d) the set of R^d-valued square integrable random variables on this space. The lift of the function f on L^2(Ω, P; R^d) is the function F : L^2(Ω, P; R^d) → R such that F(X) = f(L(X)), where L(X) ∈ P_2(R^d) is the probability distribution of X. The atomless property is equivalent to the existence of a random variable U : Ω → R uniformly distributed on [0, 1] (see e.g. [9], Prop. A.27). By the fundamental theorem of simulation (see e.g. Bouleau and Lépingle [6], Thm. A.3.1, p. 38), it ensures the existence on (Ω, A, P) of a random variable distributed according to any given probability measure on any Polish space, and in particular of X : Ω → R^d distributed according to µ, for each µ ∈ P_2(R^d).
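The key point is that the lift F depends on X only through its law L(X). This law invariance can be checked on a finite probability space (our illustration, with the simple choice f(µ) = ∫ |x|^2 µ(dx)):

```python
from collections import Counter

# Omega = {0,...,5} with the uniform probability P; a random variable is
# encoded as the list of its 6 values.  X and X_tilde are distinct maps
# Omega -> R with the same distribution.
X = [0, 0, 1, 1, 2, 2]
X_tilde = [2, 1, 0, 2, 1, 0]
assert Counter(X) == Counter(X_tilde)        # L(X) = L(X_tilde)

# Lift of f(mu) = int |x|^2 mu(dx): F(X) = E|X|^2.
F = lambda Z: sum(z * z for z in Z) / len(Z)
assert F(X) == F(X_tilde)                    # F is law invariant
print(F(X))
```

The same invariance holds for any f : P_2(R^d) → R, since F(X) = f(L(X)) by definition; what is not obvious, and is the object of Lemma 3.4 below, is that the Fréchet derivative of such an F is itself a function of X.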
Definition 3.1. The function f is said to be differentiable in the Lions sense (or L-differentiable) at µ ∈ P_2(R^d) if there exists X ∈ L^2(Ω, P; R^d) such that X ∼ µ and F is Fréchet differentiable at X.
Let f : P_2(R^d) → R and F(X) = f(L(X)) for X ∈ L^2(Ω, P; R^d). The Fréchet differentiability of F at X amounts to the existence of a bounded linear operator DF_X : L^2(Ω, P; R^d) → R such that F(X + Y) = F(X) + DF_X(Y) + o(‖Y‖_2) as ‖Y‖_2 → 0. By the Riesz representation theorem, there is a unique DF(X) ∈ L^2(Ω, P; R^d) such that ∀Y ∈ L^2(Ω, P; R^d), DF_X(Y) = E[DF(X)·Y], and we will later on call DF(X) the Fréchet derivative of F at X. From Theorem 6.2 in [7], if f is L-differentiable at µ ∈ P_2(R^d), then F is Fréchet differentiable at X for all X ∈ L^2(Ω, P; R^d) such that µ = L(X). Besides, the law of (X, DF(X)) does not depend on the choice of X by Proposition 5.24 in [8]. According to Theorem 6.5 in [7], under an additional continuous differentiability assumption, the Fréchet derivative DF(X) is equal to g(X) for some measurable function g. According to Corollary 3.22 in [10], the continuous differentiability assumption is not needed. We will provide a simple and direct proof of this result, see Lemma 3.4.
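For the lift F(X) = E|X|^2 of µ ↦ ∫ |x|^2 µ(dx), expanding the square gives F(X + Y) = F(X) + E[2X·Y] + E|Y|^2, so DF(X) = 2X with an exact remainder ‖Y‖_2^2 = o(‖Y‖_2). This identity can be verified on a finite probability space (our illustration; the chosen values are arbitrary):

```python
# Finite uniform probability space with 5 points; F(Z) = E|Z|^2 is the
# lift of mu -> int |x|^2 mu(dx).  Check the exact second order expansion
#   F(X + Y) = F(X) + E[2X.Y] + E|Y|^2,
# which identifies the Frechet derivative DF(X) = 2X.
n = 5
X = [0.3, -1.2, 0.7, 2.0, -0.5]
Y = [0.01, -0.02, 0.005, 0.0, 0.015]

E = lambda Z: sum(Z) / n
F = lambda Z: E([z * z for z in Z])
lhs = F([x + y for x, y in zip(X, Y)]) - F(X)
rhs = E([2 * x * y for x, y in zip(X, Y)]) + E([y * y for y in Y])
assert abs(lhs - rhs) < 1e-12
print(lhs)
```

Note that DF(X) = 2X is indeed of the form g(X) with g(x) = 2x, in accordance with Lemma 3.4 below.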
We now state the main result of this section, which characterizes the differentiability of the squared quadratic Wasserstein distance. To do so, we first exhibit the lift of the Wasserstein distance: by the atomless property, there exist random variables X ∼ µ and Y ∼ ν on (Ω, A, P) for µ, ν ∈ P_2(R^d), and the lift of σ ↦ W_2^2(σ, ν) is Z ↦ W_2^2(L(Z), ν), whose regularity can be studied through the dual formulation (see for instance [18], Thm. 5.10).

Theorem 3.2. Let µ, ν ∈ P_2(R^d). The function σ ↦ W_2^2(σ, ν) is differentiable in the Lions sense at µ iff Π_opt(µ, ν) = {(I_d, T)#µ} for some measurable transport map T : R^d → R^d, and then the Fréchet derivative of the function Z ↦ W_2^2(L(Z), ν) at X ∼ µ is given by 2(X − T(X)).

Remark 3.3.
We are going to give a probabilistic proof of Theorem 3.2 by working with the L-differentiability. The two following lemmas are needed: the first one is used for the necessary condition while the second is used for the sufficient condition. Their proofs are postponed until after the proof of the theorem.

Lemma 3.4. Let F : L^2(Ω, P; R^d) → R be law invariant. If F is Fréchet differentiable at X ∼ µ, then its Fréchet derivative is equal to g(X) for some measurable function g ∈ L^2(R^d, µ; R^d), and F is differentiable with Fréchet derivative g(X̃) at each X̃ ∼ µ in L^2(Ω, P; R^d).
Let us note that this result is also a consequence of the work by Gangbo and Tudorascu [10]. Here, we provide an alternative simple probabilistic proof of this fact. Wu and Zhang ([19], Prop. 1) already gave a different probabilistic proof when X is discrete.

Lemma 3.5. Let µ, ν ∈ P_2(R^d) be such that there exists T : R^d → R^d measurable with Π_opt(µ, ν) = {(I_d, T)#µ}. Let also (µ_n)_n be a sequence of elements of P_2(R^d) converging weakly to µ and such that lim_{n→∞} W_2(µ_n, ν) = W_2(µ, ν). If (on a single probability space) X ∼ µ and, for n ∈ N, (X_n, Y_n) is such that X_n ∼ µ_n, Y_n ∼ ν, W_2^2(µ_n, ν) = E[|X_n − Y_n|^2] and X_n → X in probability as n → ∞, then lim_{n→∞} E[|Y_n − T(X)|^2] = 0.

Remark 3.6. The fact that lim_{n→∞} E[|X_n − X|^2] = 0 implies that lim_{n→∞} W_2(µ_n, µ) = 0. On the other hand, denoting by µ_n the distribution of X + ξ_n, where the perturbations ξ_n are built from a fixed random variable ξ, we obtain, using (3.2) for the second equality and the definition of ξ for the third, that the first order expansion of W_2^2(µ_n, ν) fails along this perturbation. Hence, if σ ↦ W_2^2(σ, ν) were L-differentiable at µ, then (3.2) combined with Lemma 3.4 would lead to a contradiction as n → ∞.

Now, we assume that Π_opt(µ, ν) = {(I_d, T)#µ} for some measurable transport map T : R^d → R^d. Let, on the lifted probability space, X ∼ µ, Y ∼ ν and (ξ_n)_n be a sequence of square integrable R^d-valued random vectors such that ‖ξ_n‖_2 := E^{1/2}[|ξ_n|^2] tends to 0 as n → ∞. We denote by µ_n the law of X + ξ_n. Let Y_n ∼ ν be such that W_2^2(µ_n, ν) = E[|X + ξ_n − Y_n|^2], defined on a possible enlargement of the lifted probability space. We have

W_2^2(µ_n, ν) ≤ E[|X + ξ_n − T(X)|^2] = W_2^2(µ, ν) + 2E[(X − T(X))·ξ_n] + E[|ξ_n|^2].

On the other hand,

W_2^2(µ_n, ν) = E[|X + ξ_n − Y_n|^2] = E[|X − Y_n|^2] + 2E[(X + ξ_n − Y_n)·ξ_n] − E[|ξ_n|^2] ≥ W_2^2(µ, ν) + 2E[(X + ξ_n − Y_n)·ξ_n] − E[|ξ_n|^2].

With the Cauchy-Schwarz inequality, we deduce that |W_2^2(µ_n, ν) − W_2^2(µ, ν) − 2E[(X − T(X))·ξ_n]| ≤ ‖ξ_n‖_2(‖ξ_n‖_2 + ‖Y_n − T(X)‖_2).
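The derivative 2(X − T(X)) appearing in Theorem 3.2 can be checked by finite differences in dimension 1, where T is the monotone rearrangement (our illustration with arbitrary atoms; the perturbation is taken small enough not to change the ordering of the atoms):

```python
# 1D finite-difference check that the Frechet derivative of
# Z -> W_2^2(L(Z), nu) at X ~ mu is 2(X - T(X)), with T the monotone
# optimal map between the (equal-weight) atoms of mu and nu.
def w2_squared(xs, ys):
    xs, ys = sorted(xs), sorted(ys)
    return sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs)

X = [0.0, 1.0, 2.5, 4.0]               # atoms of mu (sorted, distinct)
nu = [-0.5, 0.8, 3.0, 5.0]             # atoms of nu
T = dict(zip(sorted(X), sorted(nu)))   # monotone optimal map x_i -> y_i

xi = [1.0, -1.0, 0.5, 0.25]            # direction of perturbation
eps = 1e-6
num = (w2_squared([x + eps * d for x, d in zip(X, xi)], nu)
       - w2_squared(X, nu)) / eps
pred = sum(2 * (x - T[x]) * d for x, d in zip(X, xi)) / len(X)

assert abs(num - pred) < 1e-5          # derivative = 2 E[(X - T(X)) xi]
print(num, pred)
```

Here the agreement is exact up to a term ε·E[|ξ|^2], matching the remainder ‖ξ_n‖_2(‖ξ_n‖_2 + ‖Y_n − T(X)‖_2) in the proof above.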