On strict sub-Gaussianity, optimal proxy variance and symmetry for bounded random variables

We investigate the sub-Gaussian property for almost surely bounded random variables. Although sub-Gaussianity per se is ensured by the bounded support of such random variables, several research questions remain open. One question is how to characterize the optimal sub-Gaussian proxy variance. Another is how to characterize strict sub-Gaussianity, defined by a proxy variance equal to the (standard) variance. We address these questions by proposing conditions based on the study of variations of functions. A particular focus is given to the relationship between strict sub-Gaussianity and symmetry of the distribution. In particular, we demonstrate that symmetry is neither sufficient nor necessary for strict sub-Gaussianity. In contrast, simple necessary conditions on the one hand, and simple sufficient conditions on the other hand, for strict sub-Gaussianity are provided. These results are illustrated via applications to a number of bounded random variables, including Bernoulli, beta, binomial, uniform, Kumaraswamy, and triangular distributions.


Introduction
Sub-Gaussian distributions are probability distributions whose tail probabilities are upper bounded by Gaussian tails. More specifically, a random variable X with finite mean µ = E[X] is sub-Gaussian if there exists $\sigma^2 > 0$ such that:
\[ \mathbb{E}\left[\exp(\lambda(X - \mu))\right] \le \exp\left(\frac{\lambda^2 \sigma^2}{2}\right) \quad \text{for all } \lambda \in \mathbb{R}. \tag{1} \]
The constant $\sigma^2$ is called a proxy variance and X is termed $\sigma^2$-sub-Gaussian. For a sub-Gaussian random variable X, the smallest proxy variance is called the optimal proxy variance and is denoted $\sigma^2_{\mathrm{opt}}(X)$, or simply $\sigma^2_{\mathrm{opt}}$. The variance always provides a lower bound on the optimal proxy variance: $\mathbb{V}[X] \le \sigma^2_{\mathrm{opt}}(X)$. When $\sigma^2_{\mathrm{opt}}(X) = \mathbb{V}[X]$, X is said to be strictly sub-Gaussian.
This paper focuses on the study of almost surely bounded random variables, where Bernoulli, beta, binomial, Kumaraswamy (Jones, 2009) or triangular (Kotz and Van Dorp, 2004) distributions are taken as standard and common examples. Although sub-Gaussianity per se is ensured because the support of such random variables is bounded, several research questions remain open in the area. Among these questions are (a) how to obtain the optimal sub-Gaussian proxy variance, and (b) how to characterize strict sub-Gaussianity.
Regarding question (a), we propose general conditions characterizing the optimal sub-Gaussian proxy variance, thus generalizing previous work (Marchal and Arbel, 2017) that was tailored to the beta and Dirichlet distributions. Several techniques based on studying variations of functions are proposed. In illustrating our results with the Bernoulli distribution, we prove as a by-product of Proposition 4.1 the uniqueness of a global maximum of a function that was observed by Berend and Kontorovich (2013) "as an intriguing open problem".
As for question (b), it turns out that the symmetry of the distribution plays a crucial role. By symmetry, we mean symmetry with respect to the mean µ = E [X]. That is, we say that X is symmetrically distributed if X and 2µ − X have the same distribution. Thus, if X has a density, this means that the density is symmetric with respect to µ. A simple, and remarkable, equivalence holds for most of the standard bounded random variables.
Proposition 1.1. Let X be a Bernoulli, beta, binomial, Kumaraswamy or triangular random variable. Then, X is symmetric ⇐⇒ X is strictly sub-Gaussian.
The result is known for the beta distribution (Marchal and Arbel, 2017). In this article, we provide proofs for the Bernoulli, binomial, Kumaraswamy and triangular distributions.
From Proposition 1.1, it may be tempting to conjecture that the equivalence holds true for any random variable having a bounded support. However, we establish that this is not the case. This was actually one of the starting points for the present work. More precisely, we shall provide a proof of the following result.
Proposition 1.2. Symmetry of X is neither (i) a sufficient condition, nor (ii) a necessary condition, for the strict sub-Gaussian property.
The proof of this result is presented in Section 3.2, where we demonstrate that (i) there exist simple symmetric mixtures of distributions (e.g., a two-component mixture of beta distributions and a three-component mixture of Dirac masses) which are not strictly sub-Gaussian, and that (ii) there exists an asymmetric three-component mixture of Dirac masses which is strictly sub-Gaussian.
Before delving into detailing the strict sub-Gaussianity property in Section 3, we first investigate some conditions that characterize the optimal proxy variance σ 2 opt , in Section 2. The results of Sections 2 and 3 are then illustrated on a number of standard random variables on bounded supports, in Section 4. Technical results are presented in Appendix A.
2 Characterizations of the optimal proxy variance $\sigma^2_{\mathrm{opt}}$

Let X be an almost surely bounded random variable with mean µ = E[X]. Then, X is sub-Gaussian and satisfies Definition 1 for some $\sigma^2 > 0$.
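An equivalent definition is that
\[ K(\lambda) \le \frac{\lambda^2 \sigma^2}{2} \quad \text{for all } \lambda \in \mathbb{R}, \]
where the function K, defined on R by
\[ K(\lambda) := \ln \mathbb{E}\left[\exp(\lambda(X - \mu))\right], \]
corresponds to the cumulant generating function of X − µ. Thus the optimal proxy variance $\sigma^2_{\mathrm{opt}}$ can be defined as the supremum
\[ \sigma^2_{\mathrm{opt}} = \sup_{\lambda \in \mathbb{R}} \frac{2K(\lambda)}{\lambda^2}. \]
If X is almost surely bounded, then this supremum is attained; see Lemma A.1 for details. Note that the function h, defined on R by
\[ h(\lambda) := \frac{2K(\lambda)}{\lambda^2} \ \text{ for } \lambda \neq 0, \qquad h(0) := \mathbb{V}[X], \tag{3} \]
is continuous at λ = 0, since a standard series expansion demonstrates that:
\[ h(\lambda) = \mathbb{V}[X] + \frac{\lambda}{3}\,\mathbb{E}\left[(X - \mu)^3\right] + O(\lambda^2). \tag{4} \]
Moreover, h never vanishes when X is not almost surely constant. In fact, since the logarithm function is strictly concave, Jensen's inequality implies that for any $\lambda \in \mathbb{R}^*$,
\[ K(\lambda) = \ln \mathbb{E}\left[\exp(\lambda(X - \mu))\right] > \mathbb{E}\left[\lambda(X - \mu)\right] = 0. \]
Equation (4) also explains directly why $\sigma^2_{\mathrm{opt}} \ge \mathbb{V}[X]$, since the variance is the value of the right-hand side (r.h.s.) function at λ = 0 and thus the maximum is always greater than or equal to it. We therefore have the following result.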
Proposition 2.1 (Characterization of $\sigma^2_{\mathrm{opt}}$ by h). The optimal proxy variance is given by:
\[ \sigma^2_{\mathrm{opt}} = \max_{\lambda \in \mathbb{R}} h(\lambda). \]

We may now present a necessary (but not always sufficient) system of equations for $\sigma^2_{\mathrm{opt}}$. Indeed, since the maximum is achieved at a finite point, this point must necessarily be a zero of the derivative of h, if h is differentiable (we denote by $D^k$ the space of functions that are k times differentiable on R and by $C^k$ the space of functions that are k times differentiable on R and for which the kth derivative is continuous on R).
Thus, we obtain the following corollary.
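Corollary 2.2 (Necessary condition for $\sigma^2_{\mathrm{opt}}$, with respect to h). Let $\sigma^2_{\mathrm{opt}}$ be the optimal proxy variance, and assume that h and K are $D^1$. Then there exists a finite $\lambda_0$ such that
\[ h(\lambda_0) = \sigma^2_{\mathrm{opt}} \quad \text{and} \quad h'(\lambda_0) = 0, \]
which is equivalent to
\[ K(\lambda_0) = \frac{\lambda_0^2}{2}\,\sigma^2_{\mathrm{opt}} \quad \text{and} \quad K'(\lambda_0) = \lambda_0\,\sigma^2_{\mathrm{opt}}, \]
using only the centered cumulant generating function K.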
In practice, the previous set of equations has to be used with caution, since there may be more than one solution to the second equation involving the derivative of h (or that of K), and a global maximizer must be picked among the stationary points, rather than a minimizer or a local maximizer. On a case-by-case basis, the following approach based on ordinary differential equations (ODEs) satisfied by h can be used to demonstrate that h has a unique global maximum.
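The caution above can be illustrated numerically. The following is a minimal sketch, assuming a finite discrete distribution given by hypothetical lists `xs` (atoms) and `ps` (weights): it evaluates h(λ) = 2K(λ)/λ² on a grid and returns the grid point with the largest value, rather than trusting a single stationary point. The function names, grid bounds and resolution are illustrative choices, not part of the original text.

```python
import math

def h_discrete(lam, xs, ps):
    """h(lam) = 2*K(lam)/lam**2 for a finite discrete law; h(0) is the variance."""
    mu = sum(p * x for x, p in zip(xs, ps))
    if lam == 0.0:
        return sum(p * (x - mu) ** 2 for x, p in zip(xs, ps))
    # K(lam) = ln E[exp(lam * (X - mu))], the centered cumulant generating function
    k = math.log(sum(p * math.exp(lam * (x - mu)) for x, p in zip(xs, ps)))
    return 2.0 * k / lam ** 2

def grid_max_h(xs, ps, lam_max=20.0, n=4001):
    """Return (lam, h(lam)) maximizing h over a symmetric grid that contains 0."""
    grid = [-lam_max + 2.0 * lam_max * i / (n - 1) for i in range(n)]
    return max(((lam, h_discrete(lam, xs, ps)) for lam in grid), key=lambda t: t[1])

# Usage: Rademacher variable, whose maximum of h is reached at lam = 0
print(grid_max_h([-1.0, 1.0], [0.5, 0.5]))
```

A grid search only approximates the maximizer and gives no guarantee outside the chosen range; it is meant as a sanity check alongside the analytical arguments.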
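Proposition 2.3. If the function h is $C^2$, then it is the unique solution of the ordinary differential equations:
\[ h'(\lambda) = -\frac{2}{\lambda}\,h(\lambda) + \frac{2}{\lambda^2}\,K'(\lambda), \tag{9} \]
or
\[ h''(\lambda) + \frac{4}{\lambda}\,h'(\lambda) = \frac{2}{\lambda^2}\left(K''(\lambda) - h(\lambda)\right), \tag{10} \]
for $\lambda \in \mathbb{R}^*$, with $h(0) = \mathbb{V}[X]$.

Proof. The result is directly obtained by differentiating the identity $\lambda^2 h(\lambda) = 2K(\lambda)$ and via standard analysis theorems.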
Remark. For cases such as the Bernoulli and uniform distributions, we may prove that the r.h.s. of (10) is strictly negative on $\mathbb{R}^* := \mathbb{R} \setminus \{0\}$. This implies that if $\lambda_0$ is extremal (i.e., $h'(\lambda_0) = 0$), then it satisfies $h''(\lambda_0) < 0$, so that it is a local maximum. This implies that h has no local minimum and thus may only have one critical point, which is necessarily the unique global maximum.
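We conclude this section with another possible methodology for deriving a necessary and sufficient condition for $\sigma^2_{\mathrm{opt}}$. To this end, the problem needs to be addressed from a different point of view, by studying the difference of the terms of Definition 1:
\[ \Delta(\sigma^2, \lambda) := \exp\left(\frac{\lambda^2 \sigma^2}{2}\right) - \mathbb{E}\left[\exp(\lambda(X - \mu))\right]. \tag{11} \]

Proposition 2.4 (Characterization of $\sigma^2_{\mathrm{opt}}$, with respect to ∆). If ∆ is $C^1$, then the optimal proxy variance is characterized by:
\[ \lambda \mapsto \Delta(\sigma^2_{\mathrm{opt}}, \lambda) \ge 0 \quad \text{and} \quad \exists\, \lambda_0 \in \mathbb{R} \text{ such that } \Delta(\sigma^2_{\mathrm{opt}}, \lambda_0) = 0 \text{ and } \partial_\lambda \Delta(\sigma^2_{\mathrm{opt}}, \lambda_0) = 0. \tag{12} \]

Proof. See Section A.2, in Appendix.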
This proof technique was used by Marchal and Arbel (2017) for obtaining the optimal proxy variance of the beta and Dirichlet distributions. However, we find it more convenient to use the conditions stated in Proposition 2.3, using the function h, to address the issues presented in this article, except for the triangular distribution in Section 4.2, where this method is employed for a numerical evaluation of $\sigma^2_{\mathrm{opt}}$.

Remark. In general, we would like to remove the condition $\lambda \mapsto \Delta(\sigma^2, \lambda) \ge 0$ from Proposition 2.4, in order to have a simpler (and local) characterization of the optimal proxy variance as a solution of (12). However, this is not possible, since we cannot exclude the existence of a value $\sigma^2 < \sigma^2_{\mathrm{opt}}$ for which $\Delta(\sigma^2, \cdot)$ has a double zero $\lambda_0$ around which it remains locally non-negative, while being strictly negative on a whole interval far from $\lambda_0$.

Conditions based on the cumulants
Strict sub-Gaussianity is fulfilled when the optimal proxy variance equals the variance. In view of Equation (4), Proposition 2.1 can be rewritten as the following corollary in order to characterize the strict sub-Gaussianity property.

Corollary 3.1. X is strictly sub-Gaussian if and only if the maximum of the function h, defined in (3), is attained in zero (and is automatically equal to V[X]). That is:
\[ \max_{\lambda \in \mathbb{R}} h(\lambda) = h(0) = \mathbb{V}[X]. \]
This characterization provides necessary conditions, based on cumulants, that are required for strict sub-Gaussianity to hold.
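Proposition 3.2 (Necessary conditions based on cumulants). If X is strictly sub-Gaussian, then the 3rd and 4th cumulants of X must satisfy
\[ \kappa_3 = \mathbb{E}\left[(X - \mu)^3\right] = 0, \tag{14} \]
\[ \kappa_4 = \mathbb{E}\left[(X - \mu)^4\right] - 3\,\mathbb{V}[X]^2 \le 0. \tag{15} \]

Proof. By definition of the cumulant generating function K(λ) of X − µ,
\[ K(\lambda) = \sum_{i \ge 1} \kappa_i \frac{\lambda^i}{i!}, \]
where $\kappa_i$ are the cumulants of X − µ. Since $\kappa_1 = \mu - \mu = 0$ and $\kappa_2 = \mathbb{V}[X]$, and using the values for the third and fourth cumulants given in (14) and (15), we may write (locally around λ → 0):
\[ h(\lambda) = \mathbb{V}[X] + \frac{\lambda}{3}\,\mathbb{E}\left[(X - \mu)^3\right] + \frac{\lambda^2}{12}\left(\mathbb{E}\left[(X - \mu)^4\right] - 3\,\mathbb{V}[X]^2\right) + o(\lambda^2). \]
Therefore, if $\mathbb{E}[(X - \mu)^3] \neq 0$, the maximum of h(λ) cannot be h(0) and thus strict sub-Gaussianity cannot be achieved. We conclude the proof by noting that if $\mathbb{E}[(X - \mu)^3] = 0$, then λ = 0 can be a local maximum only if $\mathbb{E}[(X - \mu)^4] - 3\,\mathbb{V}[X]^2 \le 0$.

Condition (14) requires that the third centered moment is zero and Condition (15) imposes a relation between the second and fourth centered moments. Note that the latter condition can be compactly formulated via an alternative condition on the kurtosis of X:
\[ \mathrm{Kurt}[X] := \frac{\mathbb{E}\left[(X - \mu)^4\right]}{\mathbb{V}[X]^2} \le 3. \]
More specifically, strict sub-Gaussianity requires that the random variable has kurtosis less than or equal to three, which is the kurtosis of a standard Gaussian random variable. Such distributions are referred to as platykurtic. The fourth cumulant defined in (15), once normalized by $\mathbb{V}[X]^2$, is also termed excess kurtosis. Thus, strict sub-Gaussianity requires non-positive excess kurtosis.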
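The two necessary conditions (vanishing third cumulant, non-positive fourth cumulant) are easy to check exactly for a finite discrete distribution. The sketch below is an illustration only: the helper names are hypothetical, and exact rational arithmetic is used so that the test "third cumulant equals zero" is meaningful.

```python
from fractions import Fraction

def central_moment(xs, ps, k):
    """k-th central moment of the discrete law with atoms xs and weights ps."""
    mu = sum(p * x for x, p in zip(xs, ps))
    return sum(p * (x - mu) ** k for x, p in zip(xs, ps))

def cumulant_conditions(xs, ps):
    """Return (kappa3, kappa4, ok) where ok means kappa3 == 0 and kappa4 <= 0."""
    m2 = central_moment(xs, ps, 2)
    kappa3 = central_moment(xs, ps, 3)
    kappa4 = central_moment(xs, ps, 4) - 3 * m2 ** 2
    return kappa3, kappa4, (kappa3 == 0 and kappa4 <= 0)

# Usage: Ber(1/2) passes both conditions, Ber(1/4) fails the first one
half = Fraction(1, 2)
print(cumulant_conditions([Fraction(0), Fraction(1)], [half, half]))
print(cumulant_conditions([Fraction(0), Fraction(1)], [Fraction(3, 4), Fraction(1, 4)]))
```

Passing this check does not establish strict sub-Gaussianity, since the conditions are only necessary.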
When the above necessary conditions (14) and (15) hold, we are not able to obtain simple additional necessary conditions on the next cumulants. In particular, note that strict sub-Gaussianity does not imply symmetry (i.e., E[(X − E[X]) 2j+1 ] = 0, for any j ≥ 0), as will be discussed in the next section.
In contrast, more can be said when the distribution is symmetric. In fact, in the symmetric case, the moments of odd order are zero, and a simple sufficient condition can be readily obtained by comparing the Taylor expansions at λ = 0 of both terms of inequality (1), as stated in the following proposition.
Proposition 3.3 (Sufficient condition based on moments). If X is symmetric with respect to its mean µ = E[X], then a sufficient condition for X to be strictly sub-Gaussian can be stated in terms of all its even moments. That is, for X to be strictly sub-Gaussian, it is sufficient that
\[ \frac{\mathbb{E}\left[(X - \mu)^{2j}\right]}{(2j)!} \le \frac{\mathbb{V}[X]^j}{2^j\, j!} \quad \text{for all integers } j \ge 2 \tag{18} \]
holds.
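Proof. The proof is based on series expansions at λ = 0 of both terms of inequality (1), when the proxy variance $\sigma^2$ is set to the variance V[X]. Namely, since the odd centered moments vanish by symmetry, the expansions
\[ \mathbb{E}\left[\exp(\lambda(X - \mu))\right] = \sum_{j \ge 0} \mathbb{E}\left[(X - \mu)^{2j}\right] \frac{\lambda^{2j}}{(2j)!} \quad \text{and} \quad \exp\left(\frac{\lambda^2\, \mathbb{V}[X]}{2}\right) = \sum_{j \ge 0} \mathbb{V}[X]^j \frac{\lambda^{2j}}{2^j\, j!}, \]
when compared term-by-term, lead to inequality (1) under assumption (18). Note that inequality (18) needs to be checked only for j ≥ 2, as it trivially holds for j = 0, 1.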
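The term-by-term comparison can be explored numerically for a symmetric finite discrete distribution. The sketch below checks the even-moment inequality E[(X − µ)^{2j}]/(2j)! ≤ V[X]^j/(2^j j!) for 2 ≤ j ≤ `j_max` in exact arithmetic. The function name and the truncation level `j_max` are assumptions made for illustration, and a finite truncation is a sanity check rather than a proof.

```python
from fractions import Fraction
from math import factorial

def moment_condition_holds(xs, ps, j_max=30):
    """Check E[(X-mu)^(2j)]/(2j)! <= V^j/(2^j j!) for 2 <= j <= j_max (exact arithmetic)."""
    mu = sum(p * x for x, p in zip(xs, ps))
    var = sum(p * (x - mu) ** 2 for x, p in zip(xs, ps))
    for j in range(2, j_max + 1):
        m2j = sum(p * (x - mu) ** (2 * j) for x, p in zip(xs, ps))
        # Cross-multiplied form of the inequality, avoiding divisions
        if m2j * 2 ** j * factorial(j) > var ** j * factorial(2 * j):
            return False
    return True

# Usage: the Rademacher distribution satisfies the condition
half = Fraction(1, 2)
print(moment_condition_holds([Fraction(-1), Fraction(1)], [half, half]))
```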
This technique was used by Marchal and Arbel (2017) (Section 2.2) for showing that a (symmetric) Beta(α, α) random variable is strictly sub-Gaussian. We also use it to address the cases of the Bernoulli, binomial, and triangular distributions in Section 4.

Link with symmetry
The relationship between strict sub-Gaussianity and symmetry was discussed in the Introduction. Here, we provide a proof of Proposition 1.2, while the proof of Proposition 1.1 is deferred to Section 4.

Symmetry is neither a sufficient condition...
Simple symmetric distributions which break the necessary condition of non-positive excess kurtosis can easily be constructed by hand. One such construction is by means of mixtures of Dirac masses. For instance, consider the discrete random variable
\[ X \sim \frac{\eta}{2}\,\delta_{-1} + (1 - \eta)\,\delta_{0} + \frac{\eta}{2}\,\delta_{1}, \]
which is a three-component mixture of Dirac masses at locations −1, 0 and 1, with η ∈ [0, 1]. It is symmetric, by construction, and its excess kurtosis equals
\[ \frac{\mathbb{E}[X^4]}{\mathbb{V}[X]^2} - 3 = \frac{1}{\eta} - 3, \]
which is strictly positive for all values $\eta \in (0, \tfrac{1}{3})$; hence X is not strictly sub-Gaussian for these values by virtue of Proposition 3.2. On the other hand, when η → 1, the distribution of X degenerates to that of the so-called Rademacher random variable, which leads to the least possible excess kurtosis of −2.
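The excess kurtosis of this three-point mixture can be checked in exact arithmetic. The following sketch assumes the mixture weights η/2, 1 − η, η/2 on the atoms −1, 0, 1 (the weights consistent with the Rademacher limit at η = 1); the function name is hypothetical.

```python
from fractions import Fraction

def excess_kurtosis_three_point(eta):
    """Excess kurtosis of (eta/2)*delta_{-1} + (1-eta)*delta_0 + (eta/2)*delta_{+1}."""
    eta = Fraction(eta)
    xs = (-1, 0, 1)
    ps = (eta / 2, 1 - eta, eta / 2)
    m2 = sum(p * x ** 2 for x, p in zip(xs, ps))  # variance (the mean is 0)
    m4 = sum(p * x ** 4 for x, p in zip(xs, ps))  # fourth central moment
    return m4 / m2 ** 2 - 3

# Usage: positive below eta = 1/3, zero at 1/3, and -2 in the Rademacher limit
for eta in (Fraction(1, 5), Fraction(1, 3), Fraction(1)):
    print(eta, excess_kurtosis_three_point(eta))
```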

... nor a necessary condition for strict sub-Gaussianity
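Although most typical bounded random variables that are strictly sub-Gaussian are symmetric (see, e.g., Proposition 1.1), the symmetry of the distributions of such variables is not a necessary condition for strict sub-Gaussianity. Examples of such distributions include mixtures of Dirac masses. For example, consider
\[ X \sim \sum_{i=1}^{3} p_i\, \delta_{x_i}, \tag{22} \]
with $(x_1, x_2, x_3) = \left(-2, -\tfrac{1}{2}, \tfrac{5}{4}\right)$ and $(p_1, p_2, p_3) = \left(\tfrac{1}{13}, \tfrac{4}{7}, \tfrac{32}{91}\right)$. This distribution is asymmetric, has mean 0 and variance 1, and its third and fourth cumulants equal 0 and −7/8 respectively, so that the necessary conditions of Proposition 3.2 hold. The function h for the random variable characterized by (22) is plotted in Figure 1b. Note that it attains its maximum at λ = 0.

4 Results and applications to standard distributions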

Bernoulli and binomial distributions
Consider a Bernoulli random variable, X ∼ Ber(µ) with µ ∈ (0, 1) and a binomial random variable, Y ∼ Bin(n, µ) which can be obtained as the sum of n independent Ber(µ) random variables, n a positive integer.
Proof of Proposition 1.1 for the Bernoulli and binomial distributions.
Starting with the Bernoulli: the third cumulant is equal to
\[ \kappa_3 = \mu(1 - \mu)(1 - 2\mu), \]
thus, by virtue of Proposition 3.2, a non-degenerate Bernoulli random variable may only be strictly sub-Gaussian when $\mu = \tfrac{1}{2}$. That is, when it is symmetric. Conversely, verifying the sufficient condition for the symmetric Bernoulli distribution Ber(1/2) is equivalent to assessing the condition for the Rademacher distribution instead. That is, the distribution of a random variable X for which the events X = −1 and X = 1 have equal probability 1/2. Since $X^2 = 1$, the variance of X and all of its even moments are $\mathbb{V}[X] = \mathbb{E}[X^{2j}] = 1$. Therefore, to verify the sufficient condition of Proposition 3.3, we are required to demonstrate that $(2j)! \ge 2^j j!$, for each j ≥ 2, which follows from the expansion
\[ (2j)! = \prod_{i=1}^{j} (2i)(2i - 1) = 2^j\, j! \prod_{i=1}^{j} (2i - 1) \ge 2^j\, j!. \]
Thus, the sufficient condition is verified for the Rademacher distribution and hence, as a consequence, for the symmetric Bernoulli distribution.
Turning to the binomial distribution, we observe that the optimal proxy variance of a sum of i.i.d. (independent and identically distributed) variables is the sum of the optimal proxy variances. Thus, we immediately obtain the result that
\[ \sigma^2_{\mathrm{opt}}(Y) = n\, \sigma^2_{\mathrm{opt}}(X). \]
In particular, Y ∼ Bin(n, µ) is strictly sub-Gaussian if and only if $\mu = \tfrac{1}{2}$.
We now turn to the optimal proxy variance of a Bernoulli, which has the form
\[ \sigma^2_{\mathrm{opt}}(\mu) = \frac{1 - 2\mu}{2 \ln\frac{1 - \mu}{\mu}} \quad \text{for } \mu \neq \tfrac{1}{2}, \]
with the limiting value 1/4 at $\mu = \tfrac{1}{2}$. This fact is known via Theorem 2.1 and Theorem 3.1 of Buldygin and Moskvichova (2013); see also the discussion in the introduction of Marchal and Arbel (2017). Here, we focus on a rather different approach, based on the function h and Corollary 2.2, where
\[ h(\lambda) = \frac{2}{\lambda^2}\left[\ln\left(\mu e^{\lambda} + 1 - \mu\right) - \mu\lambda\right]. \]
Note that the variations of h were examined by Berend and Kontorovich (2013) (cf. their function g; Equation (2.1)). However, a formal proof that h has a single global maximum is left "as an intriguing open problem" by Berend and Kontorovich. This is stated in the next proposition, and formally proved, below. An illustration of this result is presented in Figure 2.
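Proposition 4.1. If X ∼ Ber(µ), then the function $h : \lambda \mapsto \frac{2}{\lambda^2}\left[\ln(\mu e^{\lambda} + 1 - \mu) - \mu\lambda\right]$ admits a unique critical point, which is a global maximum. The global maximizer is obtained at $\lambda_0 = 2 \ln\frac{1 - \mu}{\mu}$, which leads to the optimal proxy variance of form
\[ \sigma^2_{\mathrm{opt}}(\mu) = \frac{1 - 2\mu}{2 \ln\frac{1 - \mu}{\mu}}. \]

Proof. Let us first prove that h admits a unique critical point, which is a global maximum, by using Proposition 2.3 and the remark that follows. With $K'(\lambda) = \frac{\mu(1 - \mu)(e^{\lambda} - 1)}{\mu e^{\lambda} + 1 - \mu}$ and $K''(\lambda) = \frac{\mu(1 - \mu)e^{\lambda}}{(\mu e^{\lambda} + 1 - \mu)^2}$, ODEs (9) and (10) are respectively
\[ h'(\lambda) = -\frac{2}{\lambda}\,h(\lambda) + \frac{2\mu(1 - \mu)(e^{\lambda} - 1)}{\lambda^2(\mu e^{\lambda} + 1 - \mu)} \]
and
\[ h''(\lambda) + \frac{4}{\lambda}\,h'(\lambda) = \frac{2}{\lambda^2}\left(\frac{\mu(1 - \mu)e^{\lambda}}{(\mu e^{\lambda} + 1 - \mu)^2} - h(\lambda)\right), \]
with $h(0) = \mu(1 - \mu)$ and $h'(0) = \frac{\mu(1 - \mu)(1 - 2\mu)}{3}$.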
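By differentiating h, we observe that, for $\mu \neq \tfrac{1}{2}$, the global maximizer of h is obtained as the unique solution (in $\lambda_0 \in \mathbb{R}^*$) of the equation
\[ \lambda_0\left(\frac{\mu e^{\lambda_0}}{\mu e^{\lambda_0} + 1 - \mu} + \mu\right) = 2 \ln\left(\mu e^{\lambda_0} + 1 - \mu\right). \]
It is easy to verify that $\lambda_0 = 2 \ln\frac{1 - \mu}{\mu}$ solves this equation, and that this leads to the optimal proxy variance as stated.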
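The closed form can be compared with a direct evaluation of h. The sketch below assumes the expressions h(λ) = (2/λ²)[ln(µe^λ + 1 − µ) − µλ], λ0 = 2 ln((1 − µ)/µ) and σ²_opt = (1 − 2µ)/(2 ln((1 − µ)/µ)), with the value 1/4 at µ = 1/2; the function names are illustrative.

```python
import math

def bernoulli_h(lam, mu):
    """h(lam) = 2/lam^2 * (ln(mu*e^lam + 1 - mu) - mu*lam); h(0) = mu*(1 - mu)."""
    if lam == 0.0:
        return mu * (1.0 - mu)
    return 2.0 * (math.log(mu * math.exp(lam) + 1.0 - mu) - mu * lam) / lam ** 2

def bernoulli_optimal_proxy_variance(mu):
    """Closed form (1 - 2mu) / (2 ln((1 - mu)/mu)), with the limit 1/4 at mu = 1/2."""
    if mu == 0.5:
        return 0.25
    return (1.0 - 2.0 * mu) / (2.0 * math.log((1.0 - mu) / mu))

# Usage: h evaluated at lam0 matches the closed form for mu = 0.2
mu = 0.2
lam0 = 2.0 * math.log((1.0 - mu) / mu)
print(bernoulli_h(lam0, mu), bernoulli_optimal_proxy_variance(mu))
```

For µ very close to 1/2 the closed form suffers from floating-point cancellation, so a series expansion would be preferable there.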

Triangular distribution
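We say that X ∼ Tri(a, b) is a triangular random variable on (−a, b), for any a, b > 0, if it is characterized by a density equal to
\[ f(x) = \begin{cases} \dfrac{2(x + a)}{a(a + b)} & \text{if } -a \le x \le 0, \\[2mm] \dfrac{2(b - x)}{b(a + b)} & \text{if } 0 < x \le b, \\[2mm] 0 & \text{otherwise.} \end{cases} \]
See Kotz and Van Dorp (2004) for details and properties of such distributions. A recent review of research developments regarding the triangular distribution appears in Nguyen and McLachlan (2017).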
Proof of Proposition 1.1 for the triangular distribution. The third cumulant is equal to
\[ \kappa_3 = \frac{(b - a)(2a + b)(a + 2b)}{270}, \]
so by virtue of Proposition 3.2, a triangular random variable may only be strictly sub-Gaussian when a = b. That is, when it is symmetric.
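Conversely, when the distribution is symmetric with a = b, we can easily express the moments of even order in the form
\[ \mathbb{E}\left[X^{2j}\right] = \frac{a^{2j}}{(j + 1)(2j + 1)}, \]
so that, with $\mathbb{V}[X] = a^2/6$, the sufficient moment condition of Proposition 3.3 is equivalent to
\[ \frac{1}{(j + 1)(2j + 1)\,(2j)!} \le \frac{1}{12^j\, j!}. \]
In other words, the only remaining inequality to show is that
\[ 12^j\, j! \le (j + 1)\,(2j + 1)!. \]
The result is true for j ∈ {1, 2, 3, 4} by direct computation. The remaining cases follow by induction on j, since passing from j to j + 1 multiplies the left-hand side by 12(j + 1) and the right-hand side by 2(j + 2)(2j + 3) ≥ 12(j + 1), thus verifying the sufficient condition in the symmetric case.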
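As a sanity check of the symmetric case, the even-moment condition of Proposition 3.3 can be evaluated exactly for a range of j. The sketch below assumes the symmetric triangular law on (−a, a) with mode 0, for which E[X^{2j}] = a^{2j}/((j + 1)(2j + 1)) and V[X] = a²/6; the function names are hypothetical.

```python
from fractions import Fraction
from math import factorial

def even_moment(j, a=Fraction(1)):
    """E[X^(2j)] = a^(2j) / ((j+1)(2j+1)) for the symmetric triangular law on (-a, a)."""
    return Fraction(a) ** (2 * j) / ((j + 1) * (2 * j + 1))

def condition_18_holds(j, a=Fraction(1)):
    """Exact check of E[X^(2j)]/(2j)! <= V[X]^j / (2^j j!), with V[X] = a^2/6."""
    variance = even_moment(1, a)
    lhs = even_moment(j, a) / factorial(2 * j)
    rhs = variance ** j / (2 ** j * factorial(j))
    return lhs <= rhs

# Usage: the condition holds for the first few dozen values of j
print(all(condition_18_holds(j) for j in range(1, 60)))
```

The check covers finitely many j only; it complements, and does not replace, the analytical argument.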
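For the general case, we first observe that the moment generating function is available in closed form:
\[ \mathbb{E}\left[\exp(\lambda X)\right] = \frac{2\left[b\left(e^{-a\lambda} - 1\right) + a\left(e^{b\lambda} - 1\right)\right]}{ab(a + b)\lambda^2}. \]
We further observe, numerically, that the difference ∆ introduced in (11) admits a unique minimum, and also that the h function (3) admits a unique (global) maximum; see Figure 3. The optimal proxy variance can be obtained numerically, by minimizing (11).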
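A numerical evaluation of this kind can be sketched by maximizing h on a grid instead of working with ∆. The code below is an illustration under stated assumptions: it takes the triangular law on (−a, b) to have its mode at 0, so that the mean is (b − a)/3, the variance is (a² + ab + b²)/18 and the moment generating function is 2[b(e^{−aλ} − 1) + a(e^{bλ} − 1)]/(ab(a + b)λ²). Function names, grid range and step are hypothetical choices.

```python
import math

def triangular_h(lam, a, b):
    """h(lam) for the triangular law on (-a, b) with mode 0 (assumed parametrization)."""
    mean = (b - a) / 3.0
    if abs(lam) < 1e-8:
        return (a * a + a * b + b * b) / 18.0  # variance, the limit of h at 0
    # Moment generating function of X, written with expm1 to limit cancellation
    num = b * math.expm1(-a * lam) + a * math.expm1(b * lam)
    mgf = 2.0 * num / (a * b * (a + b) * lam ** 2)
    return 2.0 * (math.log(mgf) - lam * mean) / lam ** 2

def triangular_proxy_variance(a, b, lam_max=30.0, step=0.05):
    """Grid approximation of max_lam h(lam); a numerical sketch, not a certified bound."""
    n = int(round(lam_max / step))
    grid = [k * step for k in range(-n, n + 1)]
    return max(triangular_h(lam, a, b) for lam in grid)

# Usage: symmetric case versus an asymmetric case
print(triangular_proxy_variance(1.0, 1.0), 1.0 / 6.0)
print(triangular_proxy_variance(1.0, 2.0), 7.0 / 18.0)
```

In the symmetric case the grid maximum coincides with the variance, while in the asymmetric case it exceeds it, in line with Proposition 1.1.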

Uniform distribution
In this section, we prove that the uniform distribution is strictly sub-Gaussian, using a proof similar to that used for obtaining the optimal proxy variance in the Bernoulli case (i.e., Proposition 4.1). First, we observe that after translation/dilation, we may always reduce the problem to the case of X ∼ Unif([0, 1]). In this case, the moment generating function is straightforward to compute and we may write
\[ h(\lambda) = \frac{2}{\lambda^2} \ln\left(\frac{\sinh(\lambda/2)}{\lambda/2}\right) = \frac{2}{\lambda^2} \ln\left(\frac{e^{\lambda/2} - e^{-\lambda/2}}{\lambda}\right), \]
which is a symmetric $C^\infty$ function. It remains to prove that it attains its global maximum at λ = 0.
Obviously $s^{(4)}$ is strictly positive on $\mathbb{R}^*$, thus $s^{(3)}$ is a strictly increasing function on R. Since $s^{(3)}(0) = 0$, we conclude that $s^{(3)}$ is strictly negative on $\mathbb{R}^*_-$ and strictly positive on $\mathbb{R}^*_+$. Thus, $s''$ is strictly decreasing on $\mathbb{R}^*_-$ and strictly increasing on $\mathbb{R}^*_+$. Finally, since $s''(0) = 0$, we conclude that $s''$ is strictly positive on $\mathbb{R}^*$, therefore $s'$ is a strictly increasing function on R. Since $s'(0) = 0$, $s'$ is strictly negative on $\mathbb{R}^*_-$ and strictly positive on $\mathbb{R}^*_+$, so that s is strictly decreasing on $\mathbb{R}^*_-$ and strictly increasing on $\mathbb{R}^*_+$. Since s(0) = 0, we conclude that s is strictly positive on $\mathbb{R}^*$. This proves that h has a unique critical point, which is therefore the global maximizer.
In conclusion, for X ∼ Unif([a, b]) with a < b, we have the well-known result that
\[ \sigma^2_{\mathrm{opt}}(X) = \mathbb{V}[X] = \frac{(b - a)^2}{12}. \]
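The conclusion can be checked numerically from the closed form of h. The sketch below assumes that, for X ∼ Unif([a, b]), h(λ) = (2/λ²) ln(sinh(w)/w) with w = λ(b − a)/2, and uses the first terms of the series of ln(sinh(w)/w) near zero for numerical stability; the function name and the switching threshold are illustrative choices.

```python
import math

def uniform_h(lam, a=0.0, b=1.0):
    """h(lam) for Unif([a, b]): (2/lam^2) * ln(sinh(w)/w) with w = lam*(b - a)/2."""
    w = 0.5 * lam * (b - a)
    if abs(w) < 1e-4:
        # Series ln(sinh(w)/w) = w^2/6 - w^4/180 + ..., used near 0 for stability
        return (b - a) ** 2 / 12.0 - (b - a) ** 2 * w * w / 360.0
    return 2.0 * math.log(math.sinh(w) / w) / lam ** 2

# Usage: h stays below its value at 0, which is the variance (b - a)^2 / 12
print(uniform_h(0.0), max(uniform_h(k / 10.0) for k in range(1, 501)))
```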

Sum of independent uniform variables
We may now consider the sum of independent (but not necessarily identically distributed) uniform random variables. Let $(X_1, \ldots, X_n)$ be independent variables with $X_i \sim \mathrm{Unif}([a_i, b_i])$ for i ∈ {1, . . . , n}, with $a_i < b_i$, and denote $S_n = \sum_{i=1}^{n} X_i$. Since the family of uniform distributions is invariant under translation and multiplication by a constant, we have the standard result that
\[ X_i \overset{d}{=} a_i + (b_i - a_i)\,U_i, \qquad U_i \sim \mathrm{Unif}([0, 1]), \]
and then, since the h function of a sum of independent random variables is the sum of the h functions of the variables and by independence of the variables $(X_i)_{i \le n}$, we obtain
\[ h_{S_n}(\lambda) = \sum_{i=1}^{n} (b_i - a_i)^2\, h_U\left((b_i - a_i)\lambda\right), \]
where $h_U$ denotes the h function of the Unif([0, 1]) distribution. The sum on the r.h.s. of the equation above is composed of functions that are all strictly increasing on $\mathbb{R}_-$ and all strictly decreasing on $\mathbb{R}_+$. Thus, it too is strictly increasing on $\mathbb{R}_-$ and strictly decreasing on $\mathbb{R}_+$. In particular, the global maximum is unique and obtained at λ = 0, for which we find:
\[ \sigma^2_{\mathrm{opt}}(S_n) = \sum_{i=1}^{n} \frac{(b_i - a_i)^2}{12} = \mathbb{V}[S_n]. \]

Remark. Note in particular that the sum of two independent uniform variables is generically a trapezoid distribution, with symmetric triangular parts, or a symmetric (up to translation) triangular distribution. However, the general asymmetric triangular case, considered in Section 4.2, cannot be expressed as a sum of independent uniform distributions.
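The additivity argument can be illustrated numerically. The sketch below assumes that the h function of Unif([a_i, b_i]) equals (b_i − a_i)² h_U((b_i − a_i)λ), where h_U is the h function of Unif([0, 1]), and sums these contributions; the function names and the list `intervals` of (a_i, b_i) pairs are hypothetical.

```python
import math

def unit_uniform_h(lam):
    """h function of Unif([0, 1]); equals 1/12 at lam = 0."""
    w = 0.5 * lam
    if abs(w) < 1e-4:
        # Leading terms of the series of (2/lam^2) * ln(sinh(w)/w) near 0
        return 1.0 / 12.0 - w * w / 360.0
    return 2.0 * math.log(math.sinh(w) / w) / lam ** 2

def sum_uniform_h(lam, intervals):
    """h of S_n = X_1 + ... + X_n for independent X_i ~ Unif([a_i, b_i])."""
    return sum((b - a) ** 2 * unit_uniform_h((b - a) * lam) for a, b in intervals)

# Usage: the value at 0 is the variance of the sum, and other values are smaller
intervals = [(0.0, 1.0), (-1.0, 3.0), (2.0, 2.5)]
print(sum_uniform_h(0.0, intervals), sum_uniform_h(1.0, intervals))
```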

Kumaraswamy distribution
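The Kumaraswamy distribution is characterized by the density on (0, 1):
\[ f(x; \alpha, \beta) = \alpha\beta\, x^{\alpha - 1}\left(1 - x^{\alpha}\right)^{\beta - 1}, \]
for α, β > 0, which yields the simple distribution function of form
\[ F(x; \alpha, \beta) = 1 - \left(1 - x^{\alpha}\right)^{\beta}. \]
The distribution was first studied in Kumaraswamy (1980) and was considered in detail by Jones (2009).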
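For numerical experiments with this family, the density, the distribution function and the raw moments can be coded directly. The sketch below relies on the standard moment formula E[X^k] = β B(1 + k/α, β), obtained from the density by the substitution u = x^α; the function names are illustrative, and the third cumulant is computed from raw moments.

```python
import math

def kumaraswamy_pdf(x, alpha, beta):
    """Density alpha*beta*x^(alpha-1)*(1 - x^alpha)^(beta-1) on (0, 1)."""
    return alpha * beta * x ** (alpha - 1.0) * (1.0 - x ** alpha) ** (beta - 1.0)

def kumaraswamy_cdf(x, alpha, beta):
    """Distribution function 1 - (1 - x^alpha)^beta on (0, 1)."""
    return 1.0 - (1.0 - x ** alpha) ** beta

def kumaraswamy_raw_moment(k, alpha, beta):
    """E[X^k] = beta * B(1 + k/alpha, beta), via log-gamma for stability."""
    log_beta_fn = (math.lgamma(1.0 + k / alpha) + math.lgamma(beta)
                   - math.lgamma(1.0 + k / alpha + beta))
    return beta * math.exp(log_beta_fn)

def kumaraswamy_third_cumulant(alpha, beta):
    """Third central moment m3 - 3*m1*m2 + 2*m1^3 computed from raw moments."""
    m1, m2, m3 = (kumaraswamy_raw_moment(k, alpha, beta) for k in (1, 2, 3))
    return m3 - 3.0 * m1 * m2 + 2.0 * m1 ** 3

# Usage: the third cumulant vanishes in the uniform case alpha = beta = 1
print(kumaraswamy_third_cumulant(1.0, 1.0), kumaraswamy_third_cumulant(1.0, 3.0))
```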
Proof of Proposition 1.1 for the Kumaraswamy distribution. The Kumaraswamy distribution is symmetric if and only if α = β = 1 (Jones, 2009). In this case, it reduces to the uniform distribution, which is strictly sub-Gaussian, as was proved in Section 4.3.
Conversely, let us now consider any potentially strictly sub-Gaussian Kumaraswamy distribution. It must then satisfy the necessary conditions of Proposition 3.2. The third cumulant $\kappa_3$ vanishes if and only if the parameters (α, β) satisfy relation (34). In such a case, a numerical evaluation of the 4th cumulant demonstrates that it is negative, thus both necessary conditions of Proposition 3.2 hold. However, a numerical evaluation also shows that the maximizer of the h function is never located at zero (i.e., the condition of Corollary 3.1 is not satisfied), except for α = β = 1 (i.e., the uniform distribution). This is illustrated in Figure 4, where the function h is plotted for (α, β) satisfying relation (34), with β varying in the interval $[10^{-3}, 5]$. The maximum of h is illustrated with the red curve, showing that the global maximizer always deviates from zero, except for the case of the uniform distribution (the black curve) and the degenerate symmetric Bernoulli distribution (the blue curve). This establishes, numerically, the necessity of symmetry and concludes the proof.

Figure 4: Curves of the h function for Kumaraswamy distributions with (α, β) satisfying relation (34), with β varying from $10^{-3}$ (top gray curve) to 5 (bottom gray curve). The black curve represents the h function of the uniform distribution. The blue curve represents the h function of the symmetric Ber(1/2). Maxima of h, corresponding to the optimal proxy variances, are depicted by the red curve. In particular, the maxima are located at zero only for the symmetric distributions (i.e., the uniform distribution and Ber(1/2)). Varying λ on the x-axis. Log-scale on the y-axis.

Beta distribution
The optimal proxy variance for the Beta(α, β) distribution was derived in Marchal and Arbel (2017), Theorem 2.1. In particular, this theorem states that the optimal proxy variance is equal to the variance if and only if α = β. That is, if and only if the beta distribution is symmetric. This proves Proposition 1.1 for the beta distribution. The h function and the optimal proxy variance are illustrated in Figure 5.

For every $\lambda \in \mathbb{R}$, we have $K(\lambda) \le 2M|\lambda|$, where M is the maximum value of |X| (which is finite since X is almost surely bounded). Thus we obtain
\[ \frac{2K(\lambda)}{\lambda^2} \le \frac{4M}{|\lambda|} \longrightarrow 0 \quad \text{as } |\lambda| \to \infty. \]
Therefore, the supremum is not at infinity and since the function $\lambda \in \mathbb{R} \mapsto \frac{2K(\lambda)}{\lambda^2}$ is continuous and positive, it must achieve its maximal value at finite values of λ.

A.2 Proof of Proposition 2.4
The proof of Proposition 2.4 is based on the study of the variations of the ∆ function, defined in Equation (11), which is the object of the next lemma.
We are now ready to prove Proposition 2.4.