UNIVERSAL CONSISTENCY OF THE k-NN RULE IN METRIC SPACES AND NAGATA DIMENSION. II

Abstract. We continue to investigate the k nearest neighbour (k-NN) learning rule in complete separable metric spaces. Thanks to the results of Cérou and Guyader (2006) and Preiss (1983), this rule is known to be universally consistent in every such metric space that is sigma-finite dimensional in the sense of Nagata. Here we show that the rule is strongly universally consistent in such spaces in the absence of ties. Under the tie-breaking strategy applied by Devroye, Györfi, Krzyżak, and Lugosi (1994) in the Euclidean setting, we manage to show the strong universal consistency in non-Archimedean metric spaces (that is, those of Nagata dimension zero). Combining the theorem of Cérou and Guyader with results of Assouad and Quentin de Gromard (2006), one deduces that the k-NN rule is universally consistent in metric spaces having finite dimension in the sense of de Groot. In particular, the k-NN rule is universally consistent in the Heisenberg group, which is not sigma-finite dimensional in the sense of Nagata, as follows from an example independently constructed by Korányi and Reimann (1995) and Sawyer and Wheeden (1992).


Introduction
The problem of describing those (separable, complete) metric spaces in which the k nearest neighbour classifier is universally (weakly) consistent still remains open. The same applies to the strong universal consistency under some reasonable tie-breaking strategy. In this paper, we are motivated by those two problems and closely related questions.
The main tool in this direction is the theorem of Cérou and Guyader [1], who have shown that the k-NN classifier is (weakly) consistent under the assumption that the regression function η(x) satisfies the weak Lebesgue-Besicovitch differentiation property. While it is unknown whether this property actually follows from the consistency of the k-NN classifier, it does make it possible to deduce the universal consistency for every metric space having the weak Lebesgue-Besicovitch property for every probability measure. A large class of such metric spaces was previously isolated by Preiss [2]: the so-called sigma-finite dimensional metric spaces in the sense of Nagata [3,4]. Thus, it follows that in every separable metric space that is sigma-finite dimensional in the sense of Nagata the k-NN classifier is universally consistent. In Part I of this work [5], we have given a direct proof of the result in the spirit of the original argument of Stone for Euclidean spaces [6], illustrating the similarities and differences of the argument in this more general setting.
One observation of the present paper is that the conclusion of the result holds for a strictly more general class of metric spaces. Assouad and Quentin de Gromard have shown [7] that the Lebesgue-Besicovitch differentiation property is true for metric spaces that are finite dimensional in the sense of de Groot. In particular, modulo the results of [1], the k-NN classification rule is universally consistent in such spaces. Among the most studied examples of such metric spaces is the Heisenberg group H. It is known that the Heisenberg group has infinite Nagata dimension (this was shown independently by Korányi and Reimann [8] and Sawyer and Wheeden [9]). In fact, their argument also implies that H is not sigma-finite dimensional in the sense of Nagata. Thus, the k-NN classifier is universally consistent in the Heisenberg group, and the property of being sigma-finite dimensional in the sense of Nagata is not a necessary condition. This observation, the subject of Section 3, refutes the conjecture made by us in Part I [5].
It is also noteworthy that the example of the Heisenberg group answers in the negative a question asked by Preiss in 1983 [2]: if a metric space Ω satisfies the Lebesgue-Besicovitch differentiation property for every sigma-finite locally finite measure, will it satisfy the strong Lebesgue-Besicovitch differentiation property for every such measure too? While this must be well known to the experts, we are unaware of this being mentioned explicitly anywhere.
In the remaining part of the article we proceed to the strong universal consistency of the k-NN classifier in metric spaces. In Section 4 we show that in the absence of distance ties, the k-NN rule is strongly universally consistent in every separable metric space that is sigma-finite dimensional in the sense of Nagata. The argument closely follows the proof in the Euclidean case, due originally to Devroye and Györfi [10] and Zhao [11], as presented in the book [12] (Thm. 11.1). Naturally, the key geometric lemma using the Nagata dimension is somewhat different. Section 4 is a revised version of a part of the PhD thesis of the first-named author [13].
Adopting a specific paradigm of uniform tie-breaking due to Devroye, Györfi, Krzyżak, and Lugosi [14], who applied it in the Euclidean case, we show that the k-NN classifier is strongly universally consistent in non-Archimedean metric spaces, that is, those satisfying the strong triangle inequality: d(x, z) ≤ max{d(x, y), d(y, z)}. The same holds in a slightly more general class of metric spaces of Nagata dimension zero. We were unable to extend the result to all (sigma-)finite dimensional metric spaces in the sense of Nagata, but already the non-Archimedean case is, we believe, important, as it is, intuitively, where the distance ties occur most often. It is worth noting that a direct analogue of a crucial technical geometric lemma proved in [14] in the Euclidean case fails in non-Archimedean metric spaces with measure, revealing a rather interesting difference in their underlying geometries. This is the subject of our Section 5.
In the concluding short Section 6, we propose a new version of the conjecture aimed at describing those complete separable metric spaces in which the k-NN classifier is universally consistent.

Learning in a measurable space
Let Ω = (Ω, A) be a measurable space, that is, a non-empty set Ω equipped with a sigma-algebra A of subsets. The product Ω × {0, 1} becomes a measurable space in a natural way. The elements x ∈ Ω are known as unlabelled points, and elements (x, y) ∈ Ω × {0, 1} are labelled points. A finite sequence of labelled points, σ = (x_1, x_2, ..., x_n, y_1, y_2, ..., y_n) ∈ Ω^n × {0, 1}^n, is a labelled sample. Here it is probably important to stress that a sample is a sequence and not a subset, as it may have repetitions.
A classifier in Ω is a mapping assigning a label to every point.The mapping is usually assumed to be measurable (or, more generally, universally measurable, that is, measurable with regard to the intersection of all possible completions of the sigma-algebra).
This assumption is necessary in order for things like the misclassification error to be well defined, although some authors allow for non-measurable maps, working with the outer measure instead. Let μ be a probability measure defined on the measurable space Ω × {0, 1}. Denote by (X, Y) a random element of Ω × {0, 1} following the law μ. The misclassification error of a classifier T is the quantity

err_μ(T) = μ{(x, y) ∈ Ω × {0, 1} : T(x) ≠ y}.

The misclassification error cannot be smaller than the Bayes error, which is the infimum (in fact, the minimum) of the errors of all the classifiers T defined on Ω:

ℓ*(μ) = inf_T err_μ(T).

A (supervised binary classification) learning rule in (Ω, A) is a mapping, g, that, when shown a labelled sample, σ, produces a classifier, g(σ). In other words, a learning rule determines a label of each point x on the basis of a labelled learning sample σ:

(σ, x) ↦ g(σ)(x) ∈ {0, 1}.

Again, the map above is usually assumed to be (universally) measurable with regard to the natural sigma-algebra generated by A through the finite products and then countable unions.
We denote the restriction of g to Ω^n × {0, 1}^n by g_n. This way, one can think of a learning rule g as a sequence of maps and write g = (g_n).
The labelled datapoints are modelled by a sequence of independent, identically distributed random elements (X_n, Y_n) ∈ Ω × {0, 1} following the law μ. For each n, the misclassification error of the rule g restricted to Ω^n × {0, 1}^n, that is, of g_n, is the random variable

err_μ(g_n) = μ{(x, y) ∈ Ω × {0, 1} : g_n(X_1, ..., X_n, Y_1, ..., Y_n)(x) ≠ y}.

Define the measure µ = μ ∘ π^{-1}, where π is the first coordinate projection of Ω × {0, 1}. This is a probability measure on (Ω, A). Now define a finite measure µ_1 on Ω by µ_1(A) = μ(A × {1}). Clearly, µ_1 is absolutely continuous with regard to µ. Define the regression function, η : Ω → [0, 1], as the corresponding Radon-Nikodým derivative,

η = dµ_1/dµ,

that is, the conditional probability for x to be labelled 1. (For the Radon-Nikodým theorem in our abstract setting, see [15], 232E and 232B.) Notice that since the regression function η, together with the measure µ, allows one to fully reconstruct the measure μ, a learning problem in a measurable space (Ω, A) can be alternatively given either by the measure μ or by the pair (µ, η). We will sometimes denote the corresponding Bayes error by ℓ*_{µ,η}.
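On a finite domain, the passage from the measure μ on Ω × {0, 1} to the pair (µ, η) can be made completely explicit. The following minimal sketch is ours and purely illustrative (the toy distribution `mu_tilde` and all names are assumptions, not from the text); it recovers µ, µ_1, the regression function η = dµ_1/dµ, and the Bayes error, which on a discrete space reduces to the sum of min(η(x), 1 − η(x)) µ(x) over the atoms.

```python
from collections import defaultdict

# Toy distribution mu_tilde on Omega x {0,1}, Omega = {"a", "b"} (illustrative).
mu_tilde = {("a", 1): 0.3, ("a", 0): 0.1, ("b", 1): 0.1, ("b", 0): 0.5}

mu = defaultdict(float)   # push-forward of mu_tilde along the first projection
mu1 = defaultdict(float)  # mu_1(A) = mu_tilde(A x {1})
for (x, y), mass in mu_tilde.items():
    mu[x] += mass
    if y == 1:
        mu1[x] += mass

# Radon-Nikodym derivative eta = d mu_1 / d mu (a ratio of atoms here).
eta = {x: mu1[x] / mu[x] for x in mu}

# Bayes error: integrate min(eta, 1 - eta) against mu.
bayes_error = sum(min(eta[x], 1 - eta[x]) * mu[x] for x in mu)

print(eta)          # eta["a"] = 0.75, eta["b"] approx 0.1667 (up to rounding)
print(bayes_error)  # approx 0.2
```

The Bayes classifier here labels "a" with 1 and "b" with 0, and no classifier can do better than the error 0.2.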
Given a classifier T = χ_C, the misclassification error can be written as

err_μ(T) = ∫_C (1 − η) dµ + ∫_{Ω∖C} η dµ.

Now it is easy to see that the Bayes error ℓ* = ℓ*_{µ,η} is achieved at exactly those classifiers T satisfying T(x) = 1 for µ-almost all x such that η(x) > 1/2, and T(x) = 0 for µ-almost all x such that η(x) < 1/2.
(At the points where η equals 1/2, the value of a Bayes classifier, or indeed of any classifier, does not affect the error.) Such classifiers are known as Bayes classifiers.
A rule g is consistent (or weakly consistent) under μ if

err_μ(g_n) → ℓ*(μ) as n → ∞,

where the convergence is in probability, and universally consistent if g is consistent under every probability measure μ on Ω × {0, 1}. In this paper, consistency will be synonymous with weak consistency. In a similar way, one defines the strong consistency. A labelled sample path is an infinite sequence of i.i.d. elements of Ω × {0, 1}, each one following the law μ. A rule g is strongly consistent under μ if

err_μ(g_n) → ℓ*(μ) almost surely,

where the convergence is along almost every infinite labelled sample path D_∞ = (X_1, Y_1), (X_2, Y_2), ..., and D_n denotes the initial segment of the path D_∞. A rule is strongly universally consistent if it is strongly consistent under every probability measure on the space of labelled points. Clearly, strong consistency implies consistency.
Recall that the Borel sigma-algebra (or Borel structure) of a topological space Ω is the smallest sigma-algebra containing all open sets. In particular, every metric on a set generates a Borel sigma-algebra. A standard Borel space is a set equipped with a sigma-algebra that is the Borel sigma-algebra generated by some complete separable metric. The usual setting for statistical learning is a standard Borel space as Ω. This will be the setting for our paper as well. However, a priori there are no restrictions for studying learning problems in more general measurable spaces.

The k nearest neighbour classification rule
Let now Ω be a metric space. The k-NN classifier in Ω is a learning rule, defined by selecting the label g_n(σ)(x) ∈ {0, 1} for a point x on the basis of a labelled n-sample σ = σ_n = (x_1, x_2, ..., x_n, y_1, y_2, ..., y_n), x_i ∈ Ω, y_i ∈ {0, 1}, by the majority vote among the values of y_i corresponding to the k = k_n nearest neighbours of x in the learning sample σ.
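The majority-vote rule can be sketched in a few lines. The following toy illustration is ours (the name `knn_predict` and the data are assumptions, not notation from the text); distance ties are resolved here simply by the sample order, while the tie-breaking rules considered in the text are discussed separately below.

```python
import math
from collections import Counter

def knn_predict(x, sample, k):
    """Majority vote among the k nearest neighbours of x.

    `sample` is a labelled sample: a list of (point, label) pairs, where a
    point is a tuple of coordinates and a label is 0 or 1.  The sort is
    stable, so distance ties are broken by the (arbitrary) sample order.
    """
    by_distance = sorted(sample, key=lambda pair: math.dist(x, pair[0]))
    votes = Counter(label for _, label in by_distance[:k])
    # Voting ties (k even) are broken in favour of the label 1, as in the text.
    return 1 if votes[1] >= votes[0] else 0

# Toy usage: five labelled points on the line, k = 3.
sample = [((0.0,), 1), ((0.1,), 1), ((2.0,), 0), ((2.1,), 0), ((0.2,), 1)]
print(knn_predict((0.05,), sample, 3))  # 1: the three nearest points carry label 1
```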
There is an issue of possibly occurring ties, which come in two types. One is the voting tie, when k is even and we may have a split vote. This can be broken, in fact, in any way, without affecting the consistency of the classifier. For instance, in such cases one can always choose the label 1 (as we do below), or just assign the label in a random way. Or else one can only work with odd values of k_n.
It may also be that more than k points of σ are among the nearest neighbours of x because several of them are at the same distance. This requires a tie-breaking rule. Given k and n ≥ k, define

r_{k-NN}^{σ_n}(x) = min{ r ≥ 0 : |{ i ≤ n : d(x, x_i) ≤ r }| ≥ k }.

In other words, this is the smallest radius of a closed ball around x containing at least k elements of the sample σ_n.
A k nearest neighbour mapping is a function which, given an unlabelled n-sample σ and a point x, selects a k-subsample N_k^σ(x) ⊆ σ so that (1) all elements of N_k^σ(x) are at a distance ≤ r_{k-NN}^{σ_n}(x) from x, and (2) all points x_i in σ that are at a distance strictly less than r_{k-NN}^{σ_n}(x) from x are in N_k^σ(x). The k nearest neighbour mapping N_k^σ(x) (which we will sometimes shorten to N_k(x)) can be deterministic or stochastic, in which case it will depend on an additional random variable, independent of the sample path. An example of the kind would be to give the sample σ a random order, under a uniform distribution on the group of n-permutations, and break the distance ties by selecting among the tied neighbours on the sphere the first ones under the order selected.
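The radius r_{k-NN}^{σ_n}(x) and a random-order tie-breaking of the kind just described can be sketched as follows. This is an illustrative sketch of ours (the function name and data layout are assumptions): all points strictly inside the radius are taken, and the remaining slots are filled from the sphere of that radius according to a uniformly random order.

```python
import math
import random

def knn_select(x, points, k, rng=None):
    """Select the indices of a k-subsample N_k(x) of nearest neighbours,
    breaking distance ties by a uniformly random order on the sample."""
    rng = rng or random.Random(0)
    d = [math.dist(x, p) for p in points]
    # r_k(x): smallest radius of a closed ball around x holding >= k points.
    r_k = sorted(d)[k - 1]
    inside = [i for i in range(len(points)) if d[i] < r_k]       # must be taken
    on_sphere = [i for i in range(len(points)) if d[i] == r_k]   # the tied points
    rng.shuffle(on_sphere)            # uniform random order breaks the ties
    return inside + on_sphere[: k - len(inside)]

# No ties: the two closest points are selected deterministically.
print(knn_select((0.0, 0.0), [(0.1, 0.0), (0.2, 0.0), (3.0, 0.0)], 2))  # [0, 1]
# All four points tied at distance 1: a random 2-subset of the sphere is chosen.
print(knn_select((0.0, 0.0), [(1, 0), (0, 1), (-1, 0), (0, -1)], 2))
```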
Here is a formal definition of the k-NN learning rule:

g_n(σ)(x) = θ( Σ_{x_i ∈ N_k(x)} y_i − k/2 ).

Above, θ is the Heaviside function: θ(t) = 1 for t ≥ 0 and θ(t) = 0 for t < 0 (so that voting ties are broken in favour of the label 1). The k-NN rule was historically the first classification learning rule in a standard Borel space whose universal consistency was established, by Charles J. Stone [6].
The k-NN classifier is no longer universally consistent in more general separable metric spaces, in fact already in the infinite-dimensional Hilbert space ℓ_2, as noted in [1]. An example of this kind (constructed for the needs of real analysis) belongs to Preiss [16]. (See this example adapted for the k-NN classifier in [5], Sect. 2.) This brings up the question of characterizing those metric spaces in which the k-NN classifier is universally consistent, and so far the problem remains open.

Strong consistency
Under the randomized method of tie-breaking which is possibly the most natural one, the k-NN classifier is never strongly universally consistent. Let (Z_n) be a sequence of i.i.d. random variables distributed uniformly in the unit interval I = [0, 1] and independent of the data. In the case of distance ties, we choose among the points x_{n_1}, x_{n_2}, ..., x_{n_m} at an equal distance from x those points whose corresponding instances z_{n_i} are the smallest. (See for example [1], bottom of p. 341.)

Proposition 2.2. If a sequence of values of k, (k_n), goes to infinity sufficiently slowly, then the k-NN classifier, under the uniform random tie-breaking using the auxiliary variables Z_i ∈ I as above, is not strongly universally consistent in any metric space.
Proof. Let the underlying probability measure µ on Ω be a Dirac measure concentrated at one point, and let the regression function η take a value p ∈ (0, 1), p ≠ 1/2, at the unique point of the measure support. This way, the nature of the metric space becomes totally irrelevant, as everything reduces to a trivial one-point domain, Ω = {*}. A sample path in this context is just a Bernoulli sequence (Y_n) of random labels 0 and 1 with probability of success p, together with an i.i.d. sequence Z_n ∈ I of tie-breaking values, the two sequences being independent. The Bayes error for our problem equals min{p, 1 − p}. It is achieved at the Bayes (optimal) classifier, returning the label 1 if p > 1/2 and the label 0 if p < 1/2. (Here, we need the assumption p ≠ 1/2: for p = 1/2 any prediction would achieve the Bayes error 1/2.) Strong universal consistency would require that for a.e. Bernoulli sequence (Y_n) and a.e. tie-breaking sequence (Z_n), the k-NN classifier always predicts the Bayes label from some n on.
Fix a summable sequence (δ_i), δ_i ∈ (0, 1). Choose recursively sequences n_i ↑ ∞ and ϵ_i ↓ 0 in such a way that for every i, if we randomly choose n_i i.i.d. uniform elements of the interval, Z_1, Z_2, ..., Z_{n_i}, then with confidence > 1 − δ_i, (1) at least ⌈ln i⌉ of the elements Z_j belong to the interval [0, ϵ_i), while (2) none of the Z_j belong to [0, ϵ_{i+1}). Now define, for each n with n_i ≤ n < n_{i+1},

k_n = ⌈ln i⌉.

The first Borel-Cantelli lemma implies that almost surely, for some j, there occurs the event A_j: "a sample path (Z_i) satisfies the conditions (1) and (2) for all i ≥ j."

Denote by Θ the event "the k-NN classifier returns a wrong label infinitely often". We will show that at least for some values of p, it is an almost sure event. We will condition on the tie-breaking path (Z_i). Almost surely, (Z_i) is in A_j for some j. So let us fix j and a path (z_i) belonging to A_j. The properties of A_j imply that, for all i, m such that j ≤ i < m, the k = ⌈ln i⌉ smallest elements among z_1, z_2, ..., z_{n_i} belong to the interval (ϵ_{i+1}, ϵ_i), while the k = ⌈ln m⌉ smallest elements among z_1, z_2, ..., z_{n_m} belong to the interval (0, ϵ_m). Since ϵ_m ≤ ϵ_{i+1}, the two intervals are disjoint, so the sets of tie-breaking values at the steps n_i and n_m are disjoint too, and the subsamples N(*) of nearest neighbours selected by the classifier to make a prediction at the steps n_i and n_m are disjoint (they are indexed by disjoint sets of integers). We conclude: the sets of labels of the k nearest neighbours chosen at the moments n_i, i ≥ j, according to our procedure form a sequence of independent random variables with values in {0, 1}^{k_{n_i}}. Consequently, the predictions made at the steps n_i, i ≥ j, also form an independent sequence. Denote by W_i the event "the k-NN classifier returns the wrong label at the step n_i when using the sequence (z_i) for tie-breaking". According to the above, the sequence of events (W_i), i ≥ j, is independent. The probability for the k-NN classifier to return the wrong label (that is, 1 if p < 1/2 and 0 if p > 1/2) at the step n_i, i ≥ j, is at least min{p, 1 − p}^{k_{n_i}} = min{p, 1 − p}^{⌈ln i⌉} (this is the probability of the event that all k nearest neighbours have the same label, opposite to the Bayes one). Now let p = e^{−1} ≈ 0.368. Then the series Σ_{i ≥ j} P(W_i) diverges. Since the events (W_i), i ≥ j, are independent, the second Borel-Cantelli lemma implies that, almost surely, the W_i occur infinitely often. In other words, if p = e^{−1} and our (z_i) is used for tie-breaking, the k-NN rule will return the wrong label infinitely often for almost all labelling sequences (Y_i). Since the sequences (Z_i) and (Y_i) are mutually independent, we conclude by the Fubini theorem that our event Θ occurs with probability one (the same holds in fact whenever p belongs to [e^{−1}, 1/2) ∪ (1/2, 1 − e^{−1}]).
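The divergence invoked at the end of the proof can be spelled out. With $k_{n_i} = \lceil \ln i \rceil$ and $p = e^{-1}$ (so that $\min\{p, 1-p\} = e^{-1}$), one has

```latex
\Pr[W_i] \;\ge\; \min\{p,\,1-p\}^{\,k_{n_i}}
        \;=\; e^{-\lceil \ln i \rceil}
        \;\ge\; e^{-(\ln i + 1)}
        \;=\; \frac{1}{e\, i},
\qquad\text{hence}\qquad
\sum_{i \ge j} \Pr[W_i] \;\ge\; \frac{1}{e}\sum_{i \ge j}\frac{1}{i} \;=\; \infty,
```

so the second Borel-Cantelli lemma applies to the independent events $(W_i)_{i \ge j}$. The same lower bound works for any $p \in [e^{-1}, 1/2)$, since then $\min\{p, 1-p\}^{\lceil \ln i \rceil} \ge e^{-\lceil \ln i \rceil}$.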
In view of this observation, one way to get strong consistency results is to make k grow fast enough. For some results obtained in this direction, see [17]. We do not touch upon this approach in our paper.
Another possibility is to assume that there are no distance ties, that is, there are no atoms and all the spheres have measure zero. This happens in the Euclidean case, for instance, if the underlying distribution has a Lebesgue density. Under this assumption, strong consistency of the k-NN classifier in the Euclidean space is a result due to Devroye and Györfi [10] and to Zhao [11]. We will extend the same conclusion to all sigma-finite dimensional metric spaces in the sense of Nagata in Section 4.
Finally, a modified randomized tie-breaking approach to the k-NN classifier was proposed by Devroye, Györfi, Krzyżak, and Lugosi in [14]. As before, the data path is enlarged by adding an independent i.i.d. sequence of tie-breaking variables (Z_n) taking values in I. The difference with the previous approach is that the test data point is also modelled not by a single random variable X ∼ µ but by a pair of random variables, (X, Z), where Z is independent of X and of the data and follows the uniform distribution on I. In the case of distance ties, the points X_i, i ∈ J, all at the same distance from X, are ordered in accordance with the corresponding values of Z_i, i ∈ J, the closest ones to Z being chosen first. (The previously described approach corresponds to the case of Z taking the constant value zero.) Under this mode of tie-breaking, the classifier is built not in Ω proper but rather in the extended domain Ω × I, equipped with the product of µ and the uniform measure λ, and whose regression function is the composition of η with the projection on the first coordinate. In the Euclidean case Ω = R^d it was shown by Devroye et al. [14] that the resulting classifier, which is, strictly speaking, not the k-NN classifier but a modification thereof, converges along almost every sample path to the Bayes classifier on Ω × I, obtained by composing the Bayes classifier for Ω with the first coordinate projection. Even if for any fixed value Z = z the same argument as in our Proposition 2.2 shows that the wrong predictions may occur infinitely often, the expected error averaged over Z ∈ I converges to zero for almost all sample paths. Thus, if one now wants to obtain a strongly consistent learning rule on Ω proper, one has to average the predictions along every fibre {x} × I, that is, take the majority vote over all values of the auxiliary variable Z. In this approach, essentially, one combines the k-NN rule with ensemble learning.
In Section 5, we will establish strong consistency within the above approach for non-Archimedean metric spaces, and the proof shows interesting geometric differences from the Euclidean case.

Dimension in the sense of de Groot and the Heisenberg group
The aim of this section is to observe that a complete separable metric space in which the k-NN classifier is universally consistent need not be sigma-finite dimensional in the sense of Nagata. We begin by recalling the important result of Cérou and Guyader.

Theorem 3.1 (Cérou and Guyader, [1]). Let Ω be a separable complete metric space equipped with a probability measure µ (the distribution law of data) and a regression function η : Ω → [0, 1] (the conditional probability for a point to be labelled 1). Suppose further that the regression function satisfies the weak Lebesgue-Besicovitch differentiation property:

(1/µ(B(x, r))) ∫_{B(x, r)} η dµ → η(x) as r ↓ 0,

where the convergence is in measure, that is, for each ϵ > 0,

µ{ x ∈ Ω : | (1/µ(B(x, r))) ∫_{B(x, r)} η dµ − η(x) | > ϵ } → 0 as r ↓ 0.

Then the k-NN classifier is (weakly) consistent for the supervised learning problem (µ, η) in Ω.

Now, some necessary concepts and results related to the Nagata dimension. (For a more detailed presentation with many examples, see Part I of our work [5].) The following definition is Preiss' generalization [2] of Nagata's original concept. Recall that a family γ of subsets of a set Ω has multiplicity ≤ n if every point of Ω is contained in at most n elements of γ.

Definition 3.2. Let Ω be a metric space and X a metric subspace of Ω, and let δ ∈ N and s > 0. Then X has Nagata dimension ≤ δ on the scale s inside of Ω if every finite family of closed balls in Ω with centres in X and radii < s admits a subfamily having multiplicity ≤ δ + 1 in Ω which covers all the centres of the original balls. The Nagata dimension of X within Ω on the scale s > 0, denoted dim_{Nag}^s(X, Ω) or sometimes simply dim_{Nag}(X, Ω), is the smallest δ such that X has Nagata dimension ≤ δ on the scale s inside Ω. We say that a subspace X has finite Nagata dimension in Ω if X has finite Nagata dimension in Ω on some suitable scale s > 0.
Here is a reformulation that we will use. A family of balls in a metric space is disconnected if the centre of each ball of the family does not belong to any other ball.

Proposition 3.3. For a subspace X of a metric space Ω, one has dim_{Nag}^s(X, Ω) ≤ δ if and only if every disconnected family of closed balls in Ω of radii < s with centres in X has multiplicity ≤ δ + 1.
For a proof, see e.g. [5], Proposition 7.2. Here is another important property: the Nagata dimension does not increase when we form the closure of a subspace.

Proposition 3.4 (See [5], Prop. 7.4). Let X be a subspace of a metric space Ω satisfying dim_{Nag}^s(X, Ω) ≤ δ. Then dim_{Nag}^s(X̄, Ω) ≤ δ, where X̄ is the closure of X in Ω.

Definition 3.5 (Preiss, [2]). A metric space Ω is said to be sigma-finite dimensional in the sense of Nagata if Ω = ∪_{n=1}^∞ X_n, where every subspace X_n has finite Nagata dimension in Ω on some scale s_n > 0 (where the scales s_n are possibly all different).

Remark 3.6. Because of Proposition 3.4, we can assume all the X_n to be closed. Also, it is easy to see that the union of two subspaces, each having finite Nagata dimension, also has finite Nagata dimension (Prop. 7.5 in [5]), so we can in addition assume that the X_n form an increasing chain.
Remark 3.7. In view of the preceding remark, the Baire category argument implies that every complete metric space Ω that is sigma-finite dimensional in the sense of Nagata contains a non-empty open subspace that has finite Nagata dimension in Ω. Now we can recall the theorem of Preiss.
Theorem 3.8 (Preiss [2]). Let Ω be a complete separable metric space. Then the following two properties are equivalent:
(1) every sigma-finite locally finite Borel measure on Ω satisfies the strong Lebesgue-Besicovitch differentiation property;
(2) Ω is sigma-finite dimensional in the sense of Nagata.
It should be noted that the original note of Preiss [2] only contained a brief sketch of the proof of the implication (1)⇒(2). The implication (2)⇒(1) was worked out in detail by Assouad and Quentin de Gromard in [7] for the case of finite Nagata dimension (from this, the deduction of the sigma-finite dimensional case is straightforward).
By combining Theorems 3.8 and 3.1, one obtains:

Corollary 3.9. The k-nearest neighbour classifier is (weakly) universally consistent in every complete separable metric space that is sigma-finite dimensional in the sense of Nagata.
In Part I [5] we have given a direct proof of this result along the geometric ideas of the original proof of Stone [6].
Note that Preiss' result asserts a strong version of the Lebesgue-Besicovitch property, while the result of Cérou and Guyader only requires the weak version of it as an assumption. It turns out that there is a class of metric spaces that "fills the gap" between the two. For that, we need to give some more definitions.

Definition 3.10 ([18]; [7], 3.5). Let δ ∈ N. A metric space Ω has de Groot dimension ≤ δ if it satisfies the following property. For every closed ball B(a, r) in Ω with centre a and radius r > 0, if x_1, ..., x_{δ+1} ∈ B(a, r), then there are i ≠ j with d(x_i, x_j) ≤ r.

Proposition 3.11 (Prop. 3.1 in [7]). A metric space Ω has de Groot dimension ≤ δ if and only if every finite family of closed balls having the same radius admits a subfamily covering all the centres of the original balls and having multiplicity ≤ δ + 1.
Proof. Necessity: let B(x_1, r), ..., B(x_N, r) be a finite family of closed balls having the same radius. Take any maximal disconnected subfamily of those balls. It covers all the centres by maximality (here we use the fact that the radii of all the balls are the same). Also, this maximal disconnected subfamily has multiplicity ≤ δ + 1 because of our assumption on the de Groot dimension: if there were a point x belonging to δ + 2 balls, the closed ball B(x, r) would contain δ + 2 points pairwise at a distance > r from each other.
Sufficiency: apply the property to the family of balls B(x_i, r), i = 1, 2, ..., δ + 1, where x_i ∈ B(x, r). All of the above closed balls contain x, so at least one of those balls, say B(x_i, r), will be missing from a subfamily containing all the centres; then x_i ∈ B(x_j, r) for some j ≠ i, so d(x_i, x_j) ≤ r.
Thus, in view of Proposition 3.3, the de Groot dimension of a metric space is always bounded by the Nagata dimension on the scale +∞. For the space R^n equipped with an arbitrary norm, the two dimensions are equal ([7], 4.9). In a more general case, in fact already in the infinite-dimensional Hilbert space ℓ_2, distinguishing examples are easy to construct.

Example 3.12. The convergent sequence 2^{−n} e_n, n ≥ 0, where the e_n are elements of the standard orthonormal basis in the Hilbert space ℓ_2, together with the limit 0, equipped with the induced metric, has infinite Nagata dimension on every scale s > 0. Indeed, each closed ball of radius 2^{−n}, centred at 2^{−n} e_n, contains 0 as the only other element of the space, and so admits no subfamily of finite multiplicity containing all the centres.
At the same time, this sequence has de Groot dimension 2. Call n the index of a point x = 2^{−n} e_n, and let the index of zero be infinite. Denote the index i(x). Given a closed ball with centre a in this space and three points inside the ball, order them according to increasing index, x_1, x_2, x_3. If now i(a) ≤ i(x_1), then x_2 and x_3 are closer to each other than x_3 is to a. And if i(x_1) < i(a), then the distance between x_2 and x_3 is smaller than that between a and x_1. (And notice that the de Groot dimension is not equal to one, as the example of a ball of radius 1/2 centred at x = 2^{−3} e_3 and containing the two points x_1 = 2^{−1} e_1 and x_2 = 2^{−2} e_2 shows.) This space is complete (even compact) and sigma-finite dimensional in the sense of Nagata, being the union of countably many singletons: a singleton trivially has Nagata dimension zero in every ambient metric space.
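The infinite-Nagata-dimension claim of Example 3.12 is easy to verify numerically on a finite truncation of the sequence. In ℓ_2 the only distances that occur are d(p_n, p_m) = (4^{−n} + 4^{−m})^{1/2} for n ≠ m and d(p_n, 0) = 2^{−n}, so we need no actual Hilbert-space vectors. The sketch below is ours (the truncation bound N is an assumption; it is kept small enough that floating-point rounding cannot blur the strict inequalities).

```python
import math

N = 20  # indices 0..N-1 of the points p_n = 2^{-n} e_n, plus the point 0

def dist(n, m):
    """Distance between p_n and p_m; the index None stands for the point 0."""
    if n == m:
        return 0.0
    if n is None:
        return 2.0 ** -m
    if m is None:
        return 2.0 ** -n
    return math.sqrt(4.0 ** -n + 4.0 ** -m)

# The closed ball of radius 2^{-n} around p_n contains 0 but no other p_m,
# so the family of these balls is disconnected yet every ball contains 0.
for n in range(N):
    assert dist(n, None) <= 2.0 ** -n            # 0 lies in the ball
    for m in range(N):
        if m != n:
            assert dist(n, m) > 2.0 ** -n        # no p_m lies in the ball
print("each ball B(p_n, 2^-n) meets the space only in p_n and 0")
```

By Proposition 3.3, a disconnected family of arbitrarily many balls all containing 0 forces the Nagata dimension to exceed every δ.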
A source of metric spaces of finite de Groot dimension is provided by the doubling metric spaces.

Definition 3.13. A metric space X is doubling if there is a constant C > 0 such that for every x ∈ X and r > 0, the closed ball B(x, r) can be covered with at most C closed balls of radius r/2.
The following is a simple exercise. (Cover a closed r-ball with ≤ C balls of radius r/2 and notice that among any C + 1 points, at least two belong to the same closed r/2-ball.)

Proposition 3.14. Every doubling metric space has finite de Groot dimension (bounded by the constant C from Def. 3.13).
Theorem 3.15 (Assouad and Quentin de Gromard [7]). Every separable metric space having finite de Groot dimension satisfies the weak Lebesgue-Besicovitch differentiation property for every sigma-finite locally finite Borel measure.

Combining this result with that of Cérou and Guyader (Thm. 3.1 above), we arrive at:

Corollary 3.16. The k-nearest neighbour classifier is universally consistent in every complete separable metric space having finite de Groot dimension.
It would certainly be interesting to give a direct proof of the result in the spirit of Stone. Moreover, the versions of de Groot dimension on a given scale and of sigma-finite dimensional spaces in the sense of de Groot that exactly parallel the definitions of Preiss can easily be stated, so it is natural to ask a number of questions about such spaces. For instance, is it true that a metric space has the weak Lebesgue-Besicovitch property if and only if it is sigma-finite dimensional in the sense of de Groot? See the concluding Section 6 for an exact formulation.
An example of a complete separable metric space of finite de Groot dimension that is not sigma-finite dimensional in the sense of Nagata is provided by the Heisenberg group H equipped with one of the natural metrics that we now proceed to describe.
Topologically, the Heisenberg group H is identified with the Euclidean space R^3, and is equipped with the following group multiplication:

(x, y, z) • (x′, y′, z′) = (x + x′, y + y′, z + z′ + C(xy′ − yx′)).   (3.4)

Here x, x′, y, y′, z, z′ ∈ R, and C ≠ 0 is a real constant. Different choices of C result in algebraically isomorphic groups: a group isomorphism from the above version to the one determined by the constant C′ is given by a linear map multiplying each vector by C/C′. The operation in (3.4) clearly makes H into a topological group, in fact a Lie group, when equipped with the Euclidean topology.
For all values of C with |C| ≤ 4, the formula

‖(x, y, z)‖ = ((x² + y²)² + z²)^{1/4}

defines a group norm on H, in the sense that ‖p • q‖ ≤ ‖p‖ + ‖q‖, ‖p^{−1}‖ = ‖p‖, and ‖p‖ = 0 if and only if p is the identity. The triangle inequality here is a consequence of the following particular case of a result of Cygan [19] (using notation and concepts from, and best looked at jointly with, the article [20]): given an expression on the right-hand side of equation (3.4), denote by ε = ±1 the product of the signs of z + z′ and of Cxy′ − Cyx′; assuming |C| ≤ 4, Cygan's inequality then bounds the norm of the product (x, y, z) • (x′, y′, z′) by the sum of the norms of (x, y, z) and (x′, y′, z′).

Consequently, a left-invariant metric on H is defined by

d(p, q) = ‖p^{−1} • q‖,

and is clearly compatible with the Euclidean topology. This distance is known as a (Cygan-)Korányi distance. Thus, it is the unique left-invariant metric for which the distance from a point p to the identity equals ‖p‖.

It is well known and readily seen that the group H equipped with a Cygan-Korányi distance is doubling. In fact, the doubling property holds for any compatible left-invariant metric on H that is homogeneous in the sense that if we apply to the group the transformation (x, y, z) → (tx, ty, t²z) for t > 0, then the distance between any pair of points is multiplied by t. (It can actually be shown that every such metric is automatically compatible with the Euclidean topology, see [21].) In this form, the doubling property is enough to establish for a single ball, say the ball of radius r = 1 centred at the identity, and there it follows from local compactness of the Euclidean space. As the Cygan-Korányi metric is both left-invariant and homogeneous (an easy calculation), the statement follows.
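The left-invariance and homogeneity properties of the Cygan-Korányi distance lend themselves to a quick numerical sanity check. The sketch below is ours and assumes one standard convention, namely the complex coordinates (z, t) with group law (z, t) • (z′, t′) = (z + z′, t + t′ + 2 Im(z z̄′)) and gauge (|z|⁴ + t²)^{1/4} (the case |C| = 2, within Cygan's range); all function names are illustrative.

```python
import random

# Heisenberg group H in complex coordinates (z, t), convention assumed here:
#   (z, t) * (z', t') = (z + z', t + t' + 2 Im(z * conj(z')))
#   ||(z, t)||        = (|z|**4 + t**2) ** (1/4)

def mul(p, q):
    (z, t), (w, s) = p, q
    return (z + w, t + s + 2.0 * (z * w.conjugate()).imag)

def inv(p):
    z, t = p
    return (-z, -t)

def norm(p):
    z, t = p
    return (abs(z) ** 4 + t ** 2) ** 0.25

def dist(p, q):
    return norm(mul(inv(p), q))

def dilate(s, p):
    z, t = p
    return (s * z, s * s * t)   # the transformation (x, y, z) -> (sx, sy, s^2 z)

rng = random.Random(1)
rand = lambda: (complex(rng.uniform(-1, 1), rng.uniform(-1, 1)), rng.uniform(-1, 1))

for _ in range(1000):
    g, p, q, h = rand(), rand(), rand(), rand()
    s = rng.uniform(0.1, 3.0)
    assert abs(dist(mul(g, p), mul(g, q)) - dist(p, q)) < 1e-9        # left-invariance
    assert abs(dist(dilate(s, p), dilate(s, q)) - s * dist(p, q)) < 1e-9  # homogeneity
    assert dist(p, h) <= dist(p, q) + dist(q, h) + 1e-12              # triangle inequality
print("left-invariance, homogeneity and the triangle inequality hold on the sample")
```

Of course, a random check is no proof; the triangle inequality is exactly Cygan's theorem quoted above.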
In particular, we conclude from the result of Assouad and Quentin de Gromard (Thm. 3.15):

Corollary 3.17. The Heisenberg group H equipped with a Cygan-Korányi metric satisfies the weak Lebesgue-Besicovitch property for every Borel probability measure µ and every L¹(µ)-function.
According to the result of Cérou and Guyader (Thm. 3.1), we now have:

Corollary 3.18. The k-NN learning rule is universally consistent in the Heisenberg group H equipped with a Cygan-Korányi metric.
At the same time, the metric space H with a Cygan-Korányi distance need not be sigma-finite dimensional in the sense of Nagata. For the next result, we choose the version of the group law corresponding to the value C = −2 in the multiplication formula (3.4), following [8]. Thus,
\[ (x, y, z) \bullet (x', y', z') = \bigl( x + x',\; y + y',\; z + z' - 2(xy' - yx') \bigr). \tag{3.5} \]
We find it useful to present a proof, following [8] and somewhat expanding the argument.
Proof. By identifying R² with the complex plane C, we can write the multiplication law (3.5) in the form
\[ (z, t) \bullet (z', t') = \bigl( z + z',\; t + t' + 2\,\mathrm{Im}(z \bar z') \bigr). \]
The neutral element of the group is (0, 0), and the inverse of (z, t) is simply (−z, −t). Consequently, the formula for the left-invariant Cygan-Korányi metric becomes
\[ d\bigl( (z, t), (z', t') \bigr) = \Bigl( |z - z'|^4 + \bigl( t' - t - 2\,\mathrm{Im}(z \bar z') \bigr)^2 \Bigr)^{1/4}. \]
Let (z, t) and (z′, t′) be two points on the unit sphere of H around the neutral element that are different from (0, 0, ±1) (so that z, z′ ≠ 0). Notice that Re(z z̄′) is the inner product of z and z′ as vectors of R². As r ↓ 0, we have, up to second-order terms in r, an expansion of d((rz, r²t), (z′, t′)). If the bracketed term on its right-hand side is strictly negative (condition (3.6)), then for sufficiently small r > 0, d((rz, r²t), (z′, t′)) > 1, so for any ρ > 0, the homogeneity property of the metric yields the analogous inequality at scale ρ (equation (3.7)). Since the complex number t′ + |z′|²i has modulus one, it can be written as e^{ψi}, so the condition in equation (3.6) becomes
\[ \mathrm{Im}\bigl( e^{\psi i} z \bar z' \bigr) < 0. \tag{3.8} \]
Now we define two sequences of reals and a sequence of points on the unit sphere in H,
\[ (z_j, t_j) = \bigl( e^{\theta_j i} \sqrt{\sin \psi_j},\; \cos \psi_j \bigr). \]
Notice that e^{ψ_j i} = t_j + |z_j|² i.
Since for n > j we have (there is a small typo in the second displayed formula on p. 18 in [8]) it follows that
\[ \mathrm{Im}\bigl( e^{\psi_j i} z_n \bar z_j \bigr) \le \mathrm{Im}\bigl( e^{\psi_j i} z_{j+1} \bar z_j \bigr) < 0. \]
Now the radii r_j > 0 are chosen recursively, using equation (3.7), in such a way that each element (r_j z_j, r_j² t_j) is outside the finitely many closed balls already selected. Since all of the above closed balls B̄(p_n, r_n) contain zero (the identity of H), the Nagata dimension of H is infinite by Proposition 3.3, as was noted by Assouad and Quentin de Gromard [7], 4.7(f). But in fact, the construction implies more.
Corollary 3.20. The group H equipped with the Cygan-Korányi metric is not sigma-finite dimensional in the sense of Nagata.
Proof. If H were sigma-finite dimensional then, by our Remark 3.7, it would contain a non-empty open subset U having finite Nagata dimension in H. Select any p ∈ U. Since the metric is left-invariant, the left translation q → p^{-1} • q is an isometry, and so the set p^{-1} • U also has finite Nagata dimension. Since this set is a neighbourhood of the identity, it contains all elements of the sequence (p_n) chosen as in Lemma 3.19, beginning with n large enough. This contradicts the finite dimensionality of the set p^{-1} • U in H in the sense of Nagata.
Thus, the Heisenberg group H provides an example of a metric space possessing the weak Lebesgue-Besicovitch property (in particular, one in which the k-NN classifier is universally (weakly) consistent) and which is not sigma-finite dimensional.
Remark 3.21. The influential 1983 paper by Preiss [2] mentioned that it was unknown whether a complete separable metric space Ω satisfies the weak Lebesgue-Besicovitch differentiation property for every Borel locally finite measure if and only if it satisfies the strong Lebesgue-Besicovitch differentiation property for every such measure. Later developments have shown the answer to be negative; in fact, the Heisenberg group with the Cygan-Korányi metric provides a distinguishing example, in view of Corollary 3.17, Corollary 3.20, and Preiss's Theorem 3.8, (1)⇒(2). This fact must be well known to specialists, even though we have not found it mentioned explicitly anywhere.

Strong consistency in the absence of distance ties
A probability measure µ on a metric space Ω has zero probability of distance ties if the measure of every sphere S_r(x), x ∈ Ω, r ≥ 0, is zero. In particular, such a measure is non-atomic (the case r = 0). In this section, we show that the result of Devroye and Györfi [10] and Zhao [11] on the strong universal consistency of the k-NN classifier in the Euclidean space in the absence of distance ties is valid in all complete separable metric spaces that are sigma-finite dimensional in the sense of Nagata, again in the case where distance ties occur with zero probability. We follow the presentation of the proof of Theorem 11.1 in [12]; however, as is to be expected, the extension requires certain technical modifications, not all of which concern Lemma 4.6 below.

Theorem 4.1. Under the zero probability of distance ties, the k-NN learning rule is strongly universally consistent in every complete separable metric space that is sigma-finite dimensional in the sense of Nagata.
Remark 4.2. The result is certainly of interest in the setting of all finite-dimensional normed spaces (not just the Euclidean ones), because in such a space there are no distance ties whenever the underlying distribution has a density with regard to the Lebesgue measure. It is hard to think of a similar natural condition for sigma-finite dimensional metric spaces beyond the normed-space case. One of the most interesting classes, and one in which distance-based classifiers are of practical interest [22], is given by the non-Archimedean metric spaces, satisfying the strong triangle inequality, which are essentially the metric spaces of Nagata dimension zero. It is not difficult to see that a non-Archimedean metric on a separable space takes only countably many distinct values. (Indeed, given such a space Ω, choose a countable dense subset X and apply the strong triangle inequality to deduce that for any x, y ∈ Ω there are a, b ∈ X with d(x, y) = d(a, b).) This means the distance ties will always occur with strictly positive probability. A rather natural example where the ties are overwhelming was worked out by us in Part I [5], Example 6.4.
Recall from Section 2 that strong consistency of a learning rule (g_n) means that along µ̃^∞-almost every infinite labelled sample path σ_∞ ∈ Ω^∞ × {0, 1}^∞, the learning error converges to the Bayes error:
\[ \mathrm{err}_{\mu,\eta}\bigl( g_n(\sigma_n) \bigr) \to \ell^{*}_{\mu,\eta}. \]
Here ℓ*_{µ,η} is the Bayes error of the learning problem (µ, η), and err_{µ,η}(g_n(σ_n)) is the error of the classifier produced by the learning rule on the sample input σ_n, the initial n-segment of the path σ_∞. The convergence here is that of a sequence of reals.
Getting back to the k-NN learning rule, denote by η_n the empirical approximation to the regression function,
\[ \eta_n(x) = \frac{1}{k} \sum_{i \colon X_i \in N_k(x)} Y_i, \]
where the sum is over the k nearest neighbours of x. We have a classical estimate valid in all metric spaces (see [1], Prop. 1.1):
\[ \mathrm{err}_{\mu,\eta}(g_n) - \ell^{*}_{\mu,\eta} \le 2\, \mathbb{E}\, |\eta(X) - \eta_n(X)|. \]
Therefore, the strong consistency would follow if we could show that along almost every sample path, E|η(X) − η_n(X)| → 0. A sigma-finite dimensional metric space Ω can be represented as the union of a countable increasing chain of measurable (even closed, should we wish, see Rem. 3.6) subspaces (F_m), each having finite Nagata dimension in Ω, in such a way that µ(F_m) → 1. Thus, the strong consistency would follow if we could prove that for each fixed m, along almost every sample path,
\[ \mathbb{E}\bigl[\, |\eta(X) - \eta_n(X)| \ \big|\ X \in F_m \,\bigr] \to 0, \]
where the expectation is conditional, that is, essentially, a normalized integral over F_m. The way to prove this is through the Borel-Cantelli lemma: we want to show that the expected value of the difference |η(X) − η_n(X)| over F_m concentrates in n. We have no control over the rate of convergence of this difference to zero, so it may be very slow, but what matters is that it should be roughly uniform: if for every ϵ > 0, starting with n sufficiently large, the probability of a deviation larger than ϵ is of the order exp(−nϵ²), we are done: for almost every sample path, beginning with some n, the deviation over F_m will be below ϵ. Thus, the following lemma, modelled on Theorem 11.1 in [12], will settle the proof of Theorem 4.1, and the rest of the section is devoted to its proof.

Lemma 4.3. Let Ω be a complete separable metric space, and let Q be a Borel subset. Suppose Q has Nagata dimension ≤ β in Ω on a scale s. Let µ be a probability measure on Ω with zero probability of ties, and let η : Ω → [0, 1] be a regression function. Suppose µ(Q) > 0.
Let µ̃ be the probability measure on Ω × {0, 1} corresponding to (µ, η). For every ε > 0, whenever k, n → ∞ and k/n → 0, there is an n₀ such that for n > n₀ the probability of an ε-deviation of the conditional expected error over Q admits an upper bound of the order exp(−nε²), summable in n.

Let µ be a Borel probability measure on a complete separable metric space Ω. Let 0 < α ≤ 1. We define
\[ r_\alpha(x) = \inf\{\, r > 0 : \mu(B(x, r)) \ge \alpha \,\}. \tag{4.1} \]

Lemma 4.4. Let µ be a probability measure with zero probability of ties. Then µ(B(x, r_α(x))) = µ(B̄(x, r_α(x))) = α for every x.
Proof. Clearly, r_α(x) > 0. The measure of every open ball of radius < r_α(x) is strictly less than α. By the sigma-additivity of µ, the measure of the open ball of radius r_α(x) is ≤ α, and the measure of the corresponding closed ball B̄(x, r_α(x)) is ≥ α. By our assumption, the sphere S(x, r_α(x)) is a null set, so the two values are equal.
Lemma 4.5.The real-valued function r α defined as in (4.1) is 1-Lipschitz continuous and converges to zero as α → 0 at each point of the support of the measure.
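Lemma 4.5 can be illustrated on an empirical measure, for which the infimum in (4.1) is realized by a nearest-neighbour distance. The sketch below (with a hypothetical Gaussian sample on the line) checks the 1-Lipschitz property directly:

```python
# A sketch of the function r_alpha of equation (4.1) for an empirical
# measure on the real line: r_alpha(x) is the distance from x to its
# ceil(alpha * n)-th nearest sample point, which realizes the infimum in
# (4.1) for the empirical measure. We check that r_alpha is 1-Lipschitz.
import math
import random

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(200)]

def r_alpha(x, alpha, pts=sample):
    k = math.ceil(alpha * len(pts))  # need empirical mass >= alpha
    return sorted(abs(p - x) for p in pts)[k - 1]

alpha = 0.1
for _ in range(100):
    x, y = random.uniform(-3.0, 3.0), random.uniform(-3.0, 3.0)
    # |r_alpha(x) - r_alpha(y)| <= d(x, y), as in Lemma 4.5
    assert abs(r_alpha(x, alpha) - r_alpha(y, alpha)) <= abs(x - y) + 1e-12
```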
The following technical result is an analogue of Lemma 11.1 in [12].
Lemma 4.6. Let Ω be a complete separable metric space and let Q be a Borel subset having Nagata dimension ≤ β in Ω on the scale s. Assume that µ is a probability measure on Ω with zero probability of ties. For y ∈ Ω, define
\[ D(y, \alpha) = \{\, x \in \Omega : d(x, y) < r_\alpha(x) \,\}. \]
Then µ(D(y, α) ∩ Q) ≤ (β + 1)α for all α small enough.
Proof. First of all, notice that the set D(y, α) is open, so it makes sense to talk of its measure. Indeed, if x ∈ D(y, α), then the open ball of radius δ = (1/2)[r_α(x) − d(x, y)] > 0 around x is also contained in D(y, α): every element x′ of such a ball satisfies
\[ r_\alpha(x') \ge r_\alpha(x) - d(x, x') > r_\alpha(x) - \delta = d(x, y) + \delta > d(x, y) + d(x, x') \ge d(x', y) \]
(the first inequality is due to Lem. 4.5, the rest follow from the triangle inequality). Now let ε > 0. By Luzin's theorem, there is a compact set K ⊆ D(y, α) ∩ Q with µ((D(y, α) ∩ Q) \ K) < ε. As ε > 0 is arbitrary, we only need to get the desired upper bound for µ(K).
It follows from Lemma 4.5 that r_α converges to 0 uniformly on K as α → 0. Choose α₀ > 0 such that for 0 < α ≤ α₀ we have r_α(x) < s for all x ∈ K.
As K is compact, we can recursively select a subset of indices I ⊆ N so that each sequence of centres x^n_i, i = 1, 2, …, β + 1, n ∈ I, converges to some point x_i ∈ K. We claim that the union of the closed balls B̄(x_i, d(x_i, y)), 1 ≤ i ≤ β + 1, covers K, which will finish the proof in view of the inclusion (4.2).
As the closure of a finite union is the union of the closures, and since the balls are closed, it is enough to show that a countable dense subset D = {a_m}_{m∈N} of K is contained in the union of the balls B̄(x_i, d(x_i, y)), 1 ≤ i ≤ β + 1. Fix m. There are i₀ ∈ {1, 2, …, β + 1} and an infinite set of indices J ⊆ I such that a_m belongs to all the balls B̄(x^n_{i₀}, d(x^n_{i₀}, y)), n ∈ J. Letting n → ∞ along J, it follows that d(a_m, x_{i₀}) ≤ d(x_{i₀}, y), that is, a_m ∈ B̄(x_{i₀}, d(x_{i₀}, y)).

Now, to the proof of Lemma 4.3. As in equation (4.1), denote by r_{k/n}(x) the unique solution to the equation µ(B(x, r)) = k/n (see Lemma 4.4). By the triangle inequality,
\[ |\eta(X) - \eta_n(X)| \le |\eta(X) - \eta^{*}_n(X)| + |\eta^{*}_n(X) - \eta_n(X)|, \tag{4.4} \]
where η*_n(X) = (1/k) Σ_{i=1}^n Y_i 1_{{ρ(X_i, X) < r_{k/n}(X)}}. Like in equation (2.2), denote by r_{k-NN}(x) the smallest radius of a closed ball around x containing at least k nearest neighbours of x (we suppress the symbol of the sample). In the absence of distance ties, the closed r_{k-NN}(x)-ball a.s. contains exactly k nearest neighbours. Of the two closed balls around x, one of radius r_{k-NN}(x) and the other of radius r_{k/n}(x), one is necessarily contained in the other, so the symmetric difference, which we tentatively denote ∆(x), is just the set-theoretic difference of the two balls, though we do not know in which order. With this in mind, we have
\[ |\eta^{*}_n(X) - \eta_n(X)| \le \frac{1}{k}\, \sharp\{\, i \le n : X_i \in \Delta(X) \,\} \]
for the second term on the right-hand side of equation (4.4), because N_k(X) contains exactly k points.
Next we show that the latter term converges to zero. Let
\[ \bar\eta_n(X) = \frac{1}{k} \sum_{i=1}^n \mathbb{1}_{\{ \rho(X_i, X) < r_{k/n}(X) \}}, \]
and let η̄(X) be identically equal to 1. Conditionally on X = x, the expected value of the random variable under the absolute value sign is zero (law of large numbers), which allows us to pass to the variance. Using the Cauchy-Schwarz inequality, we bound the expression by a term of order k^{−1/2}, which goes to zero as k → ∞.
For the first term on the right-hand side of equation (4.4), we use the fact that E_{µ^n}{|η(X) − η_n(X)|} → 0, because the k-NN rule in our setting is weakly consistent due to the results of Preiss and Cérou-Guyader (Cor. 3.9).
The random variables |η(X) − η*_n(X)| and |η_n(X) − η(X)| admit realisations as Borel measurable functions on Ω^∞ × {0, 1}^∞ × Ω taking values in [0, 1]. Thus, convergence in expectation implies convergence in measure, and consequently their restrictions to Ω^∞ × {0, 1}^∞ × Q, where by our assumption Q ⊆ Ω has strictly positive measure, converge to zero as well, in measure and in expectation. So, for a given ε > 0 we can choose n, k so large that the corresponding conditional expectations over Q are as small as required (4.5). Observe that if random variables X, Y satisfy EX ≤ ϵ₁ and EY ≤ ϵ₂, then the probability of a large deviation of X + Y can be split accordingly; applying this observation with ϵ₁ = ε/2 and ϵ₂ = ε/6, we arrive at the decomposition (4.7), where we used the inequality (4.6). Now we will separately estimate the probability of deviations in the two last terms.
For the first term, let θ be the function defined on labelled samples by θ(σ_n) = E[|η(X) − η*_n(X)| | X ∈ Q]. Let a new sample σ′_n be formed by replacing (x_i, y_i) with (x_i, ŷ_i). The difference of the values of η*_n computed on the original sample and on the altered one is at most 1/k, and, for elements of Q, the value can only change at the points of the set D(x_i, k/n) ∩ Q. According to Lemma 4.6, the µ-measure of the latter set is bounded by (β + 1)k/n whenever r_{k/n} is sufficiently small (smaller than the scale s, in fact). Therefore, the normalized (conditional) measure of this set in Q is bounded by (β + 1)k/(µ(Q)n), and θ satisfies the bounded differences condition with constants c_i = (β + 1)/(µ(Q)n). Let us recall a classical concentration inequality.
Theorem 4.7 (Azuma, McDiarmid). Let X₁, X₂, …, X_n be i.i.d. random variables taking values in a space Ω, and let a function f : Ω^n → R satisfy the following Lipschitz condition with regard to the Hamming distance: whenever just the i-th coordinate of the argument (x₁, x₂, …, x_n) is changed, the value of the function changes by at most c_i > 0. Then the probability that the random variable f(X₁, X₂, …, X_n) deviates from its expected value by at least t > 0 is bounded by
\[ 2 \exp\Bigl( -\frac{2t^2}{\sum_{i=1}^n c_i^2} \Bigr). \]
We conclude that the probability of a deviation of θ by more than ε/6 is bounded by a quantity of the form 2 exp(−cnε²), with c = c(β, µ(Q)) > 0. An identical argument applied to η̄_n results in a similar concentration estimate for the second term in equation (4.7), and we are done.
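For the record, the way Theorem 4.7 produces a summable bound can be spelled out as follows (a routine sketch, with c standing for whatever common Hamming-Lipschitz constant the argument above yields; here all the c_i are taken equal):

```latex
\[
  c_i = \frac{c}{n} \ (1 \le i \le n)
  \quad\Longrightarrow\quad
  \Pr\bigl\{\, |f(X_1,\dots,X_n) - \mathbb{E} f| \ge t \,\bigr\}
  \;\le\; 2\exp\Bigl(-\frac{2t^2}{n\,(c/n)^2}\Bigr)
  \;=\; 2\exp\Bigl(-\frac{2t^2 n}{c^2}\Bigr),
\]
and \(\sum_n 2e^{-2t^2 n/c^2} < \infty\), so by the first Borel--Cantelli
lemma the deviations exceed \(t\) only finitely often, almost surely.
```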

Strong consistency in the non-Archimedean case
Here we show that the randomized tie-breaking approach to the k-NN classifier in the presence of distance ties, adopted by Devroye, Györfi, Krzyżak, and Lugosi [14] (see our Sect. 2.3) and used by them to prove the strong universal consistency of the k-NN classifier in the Euclidean setting, also works in the case of non-Archimedean metric spaces: those whose metric satisfies the strong triangle inequality
\[ d(x, z) \le \max\{\, d(x, y),\; d(y, z) \,\}. \]
However, the proof becomes somewhat trickier, revealing some interesting geometric features of non-Archimedean spaces with measure.
Theorem 5.1. The k-NN classifier is strongly universally consistent in every complete separable non-Archimedean metric space, under the tie-breaking strategy of Devroye, Györfi, Krzyżak, and Lugosi.

Remark 5.2. A slightly more general class of metric spaces is formed by those of Nagata dimension zero: a metric space is non-Archimedean if and only if it has Nagata dimension zero on the scale s = +∞, see [5], Example 5.3. Our result above requires a minimal amount of adjustments to be extended to the complete separable metric spaces of Nagata dimension zero on some scale s > 0. We decided to avoid technicalities in order to make the argument in the proof of Lemma 5.7 below a little clearer.
We begin with combinatorial preparations. For z ∈ I and b ≥ 0, denote by N(z, b) the set of w ∈ I with |w − z| ≤ b. Let Ω be a metric space. Given x ∈ Ω, z ∈ I, r, b > 0, define, just like in [14], the set
\[ B(x, z, r, b) = \bigl( B(x, r) \times I \bigr) \cup \bigl( S(x, r) \times N(z, b) \bigr). \]
(See Fig. 1.) Now let α > 0. Given (x, z) ∈ Ω × I, denote r_α(x) as before (Eq. (4.1)), the infimum of all radii r > 0 such that the open ball of radius r around x has measure ≥ α. In the presence of atoms, it may happen that r_α(x) = 0; we adopt the convention that B(x, 0) is the empty set, while B̄(x, 0) = S(x, 0) = {x}.
Proof. The statement is trivially true if r_α(x) = 0. Otherwise, approximate r_α(x) > 0 with a strictly increasing sequence of radii r_n ↑ r_α(x), and use sigma-additivity.

Proof. One has … Thus, it suffices to prove that b_α(x, 1/2) is measurable as a function of x ∈ Ω. This can be written as … Everything now reduces to proving the measurability of the maps x ↦ µ(B_{r_α(x)}(x)) and x ↦ µ(B̄_{r_α(x)}(x)).
As the function x ↦ r_α(x) is continuous (in fact 1-Lipschitz, see Lem. 4.5), when x is fixed and x_n → x, we have r_α(x_n) → r_α(x). For every ϵ > 0, from the triangle inequality, when n is large enough, the closed ball B̄_{r_α(x_n)}(x_n) is contained in the ϵ-neighbourhood of the ball B̄_{r_α(x)}(x). When ϵ ↓ 0, the measure of this ϵ-neighbourhood converges to the measure of B̄_{r_α(x)}(x) by sigma-additivity of µ, so we conclude
\[ \limsup_n \mu\bigl( \bar B_{r_\alpha(x_n)}(x_n) \bigr) \le \mu\bigl( \bar B_{r_\alpha(x)}(x) \bigr). \]
Thus, the function x ↦ µ(B̄_{r_α(x)}(x)) is upper semi-continuous, hence Borel measurable. An identical argument works for the sphere in place of the closed ball, and this suffices.
For r = r_α(x) we have … In the case where the two values are different, the function is continuous and surjective, so the value α is achieved. In the Euclidean setting, we have an estimate of the form
\[ (\mu \otimes \lambda)\bigl( D(x, z, \alpha) \bigr) \le C\alpha, \tag{5.3} \]
where C = C(d) is a constant depending on the dimension of the space. However, this kind of bound does not hold in more general finite-dimensional spaces in the sense of Nagata. In fact, it already fails in non-Archimedean metric spaces. Here is a counter-example.
Example 5.6. Let Ω be any infinite complete non-Archimedean metric space having at least one non-isolated point. For example, one can take any of the classical examples, such as the space of p-adic numbers Q_p, or the Cantor space {0, 1}^ω with the metric d(x, y) = 2^{−min{i : x_i ≠ y_i}}.
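For the Cantor-space example, the strong triangle inequality can be verified mechanically on truncated words (a sketch; the truncation to finite length is our simplification):

```python
# The Cantor-space metric d(x, y) = 2^(-min{i : x_i != y_i}) on finite
# 0/1 words (indices start at 1, and d(x, x) = 0).
import itertools
import random

def d(x, y):
    for i, (a, b) in enumerate(zip(x, y), start=1):
        if a != b:
            return 2.0 ** (-i)
    return 0.0

random.seed(1)
words = [tuple(random.randint(0, 1) for _ in range(12)) for _ in range(40)]

# the strong (ultrametric) triangle inequality d(x,z) <= max(d(x,y), d(y,z))
for x, y, z in itertools.islice(itertools.product(words, repeat=3), 20000):
    assert d(x, z) <= max(d(x, y), d(y, z))

# the metric takes only countably many distinct values: powers of 2, and 0
values = {0.0} | {2.0 ** (-i) for i in range(1, 13)}
assert all(d(x, y) in values for x in words for y in words)
```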
Fix a non-isolated point x ∈ Ω and a sequence x_n converging to x such that r_n = d(x, x_n) is strictly decreasing. Denote S_n = S(x_n, r_n) the spheres, B_n = B(x_n, r_n) the open balls, and B̄_n = B̄(x_n, r_n) the closed balls.
Then ∪B_n has full measure. Denote … Fix a sufficiently small α > 0, in the sense to be defined shortly. Denote … Therefore, for every y ∈ B_n and w ∈ [0, 1], whenever n ≥ α^{−1/2}. Now let x be the same non-isolated point as above, and set z = 0. By the above reasoning, the set D(x, z, 3α) contains every pair (y, w) with y ∈ B_n and w ≤ ζ_n, provided that … When α > 0 is sufficiently small, we have … Since the balls B_n are pairwise disjoint, we have …, which is ω(α) as α → 0. Thus, unlike in the Euclidean case, there is no upper bound on the size of the set D(x, z, α) that is linear in α.
However, this example does not contradict the strong consistency of the k-NN rule. Indeed, the proof in [14] proceeds as follows. The inequality (5.3), taken with α = k/n, implies, as in our earlier argument (Eq. (4.8)), that the misclassification error of an auxiliary rule is a Lipschitz function on the cube Ω^n × {0, 1}^n equipped with the normalized Hamming distance, with C as the Lipschitz constant. The Azuma inequality bounds the probability that the error deviates by more than ε > 0 from its expectation by an expression of the form 2 exp(−ε²n/C²), and the sequence of such upper bounds is summable, allowing one to use the Borel-Cantelli lemma. If we now assume that the upper bound on the size of D(x, z, α) is of the form −α ln α and substitute α = k/n, the Azuma inequality bounds the probability of a large deviation by something like 2 exp(−ε²n/(ln n)²). This sequence is still summable over n, so the Borel-Cantelli argument applies.
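The last summability claim admits a one-line verification:

```latex
\[
  \frac{n}{(\ln n)^2} \ge \sqrt{n} \quad \text{for all } n \text{ large enough},
  \qquad\text{hence}\qquad
  2\exp\Bigl(-\frac{\varepsilon^2 n}{(\ln n)^2}\Bigr)
  \le 2e^{-\varepsilon^2 \sqrt{n}},
\]
and \(\sum_n e^{-\varepsilon^2\sqrt{n}} < \infty\) by the integral test,
since \(\int_0^\infty e^{-c\sqrt{t}}\,dt = 2/c^2\) (substitute \(u = \sqrt{t}\)).
```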
It turns out that in fact the upper bound in the above example is (up to a constant) exact.
Proof. Fix α > 0. We will estimate the measure of the set D(x, z, α) \ ({x} × I). If x is not an atom, this makes no difference. If µ{x} > 0, then for any pair of the form (x, w) ∈ D(x, z, α) the product measure of the set {x} × N(z, |w − z|) does not exceed α, which means |z − w| ≤ α µ{x}^{−1}. Consequently, the measure of the set D(x, z, α) ∩ ({x} × I) is bounded by 2α, and we will just add this value to our estimate at the end. It simplifies things to estimate separately the measure of the intersection of the above set, D(x, z, α) \ ({x} × I), with supp µ × (0, z) and with supp µ × (z, 1); as the arguments are identical, we will only do the former.
By approximating the measurable set (D(x, z, α) \ ({x} × I)) ∩ (supp µ × (0, z)) from inside by a compact subset K to any given accuracy (Luzin's theorem), we can concentrate on bounding the measure of K.
Denote by V the family of all open subsets of Ω × I of the form B(y, r) × (w, z), where (y, w) ∈ K, w < z, and r = d(y, x). Notice that they cover K. Choose a finite subcover of K by sets of this form, say B(y_i, r_i) × (w_i, z), i = 1, 2, …, N. Because the metric is non-Archimedean, we can assume these open balls B(y_i, r_i) to be pairwise disjoint. (Indeed, assume B(y_i, r_i) and B(y_j, r_j) intersect, i ≠ j. Then one of them is entirely contained in the other, say B(y_i, r_i) ⊆ B(y_j, r_j), and as r_i = d(y_i, x) = d(y_j, x) = r_j by the strong triangle inequality, we have B(y_i, r_i) = B(y_j, r_j). Now, of the two sets B(y_i, r_i) × (w_i, z) and B(y_j, r_j) × (w_j, z), one contains the other, depending on whether w_i or w_j is smaller, so one can be discarded.) Also, we can discard all balls of zero µ-measure. Order them in such a way that the radii r_i decrease, and whenever r_i = r_{i+1}, we have w_i ≤ w_{i+1}. Now, some more non-Archimedean geometry. For every i, d(y_i, y_{i+1}) = r_i: it cannot be strictly smaller because the open balls B(y_i, r_i) and B(y_{i+1}, r_{i+1}) are disjoint, and it cannot be strictly larger because both points are at a distance ≤ r_i from x. As a consequence, B(y_{i+1}, r_{i+1}) ⊆ S(y_i, r_i). Indeed, if y ∈ B(y_{i+1}, r_{i+1}), then d(y_i, y) ≤ max{d(y_i, y_{i+1}), d(y_{i+1}, y)} = r_i, and the strict inequality is again impossible because the open balls do not meet. Notice that it is possible that r_i = r_{i+1}, in which case the closed ball B̄(y_{i+1}, r_{i+1}) coincides with B̄(y_i, r_i). Write for short B_i = B(y_i, r_i), S_i = S(y_i, r_i). To sum up, the open balls B_i are pairwise disjoint, and if i < j, then B_j ⊆ S_i. Also, write ξ_i = z − w_i.
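The disjoint-or-nested dichotomy of open balls used in the argument above can be checked on the same kind of finite ultrametric model (a sketch, not a proof):

```python
# In an ultrametric space, two open balls are either disjoint or nested.
# We verify this on finite 0/1 words with d(x,y) = 2^(-min{i : x_i != y_i}).
import itertools
import random

def d(x, y):
    for i, (a, b) in enumerate(zip(x, y), start=1):
        if a != b:
            return 2.0 ** (-i)
    return 0.0

random.seed(2)
words = [tuple(random.randint(0, 1) for _ in range(8)) for _ in range(30)]

def ball(centre, r):
    """Trace of the open ball B(centre, r) on the finite point set."""
    return frozenset(w for w in words if d(centre, w) < r)

radii = [2.0 ** (-i) for i in range(0, 9)]
balls = {(c, r): ball(c, r) for c in words for r in radii}
for (c1, c2), r1, r2 in itertools.product(
        itertools.combinations(words, 2), radii, radii):
    b1, b2 = balls[(c1, r1)], balls[(c2, r2)]
    assert b1.isdisjoint(b2) or b1 <= b2 or b2 <= b1
```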
Denote b_i = µ(B_i), i = 1, 2, …, N, and s_i = Σ_{j=i+1}^N b_j, i = 0, 1, 2, …. These b_i and s_i are all strictly positive by the choice of K. Also, s_i ≤ µ(S_i) for i ≥ 1. For all i = 1, 2, …, N, … We have γ_i > 0. Let n ≤ N be the largest integer satisfying s_n ≥ α. (If it does not exist, then µ(K) ≤ α and we are done.) Clearly, … where δ_i > 0. Thus, … Notice that if for some i ≤ N we have s_{i−1}/s_i > 2, then … As ln(1 − t) ≤ −t for all t ∈ [0, 1), we get …, whence we obtain the desired estimate, using (5.5) and the remark at the start of the proof.
Remark 5.8. The main result of this section, Theorem 5.1, would be established in the general case of a complete separable metric space sigma-finite dimensional in the sense of Nagata if we could verify the following. Suppose a subspace Q of a complete metric space Ω has Nagata dimension β on a scale s > 0 in Ω. Is it true that, for some absolute constant C > 0 and all sufficiently small α, … ? Of course, one could think of weaker estimates that would also suffice.
We will model the proof of Theorem 5.1 on the proof of Theorem 1 in [14]. First, we recall that, by Lemma 5.5, for every pair (x, z) there is a unique pair (r_{k/n}(x), b_{k/n}(x, z)) defined as in equations (4.1) and (5.2) with α = k/n. This leads us to define the regression function approximation η*_n. We also have the regression function approximation η_n(X, Z), where the choice of the set N_k(X, Z) of k nearest neighbours of X using the auxiliary random variable Z is made with the same tie-breaking strategy as described at the beginning of the section. We need to prove, first, that the difference η(X) − η_n(X, Z) converges to zero in expectation (or in probability), which would mean the (weak) consistency of the algorithm, and second, that for every ϵ > 0 the probabilities of an ϵ-deviation of η(X) − η_n(X, Z) from its expected value taken over all n-samples form a summable sequence. This will allow us to apply the first Borel-Cantelli lemma and deduce the strong consistency.
We have, taking the expectation over random samples (i.e., E stands for E_{σ∼µ^n⊗λ^n}), the decomposition (5.7). We will verify the convergence to zero in expectation for all three terms. As to the deviation bound, the first term does not depend on the random sample, so an ε-deviation is improbable. We will deduce a summable bound for the second term, while the conclusion for the third will come for free as a particular case.
Notice that whenever a metric space with measure satisfies the strong Lebesgue-Besicovitch property (Eq. 3.2), that is, for a.e. x, the averages of an integrable function over small open balls around x converge to its value at x (cf. (5.8) below), one can average over closed balls as well. Now let f : Ω → R be an L¹(µ)-function. By the main theorem of Preiss from [2] (reproduced above as Thm. 3.8), combined with our observation in the previous paragraph, almost every x ∈ Ω has the property: given ϵ > 0, one can select ρ > 0 so small that whenever 0 < r < ρ, the average value of f over the r-ball around x, either open or closed, is ϵ-close to f(x). We can also view f as a function on Ω × I that depends only on the first argument, x ∈ Ω. Now let x, ϵ > 0, and r be as above, and z, ξ ∈ [0, 1]. Thus, for µ-a.e. x ∈ Ω, the average value of f over B(x, z, r, ξ) converges to f(x) as r ↓ 0. In particular, this conclusion applies to the regression function η and its average value over B(X, Z, r_{k/n}(X), b_{k/n}(X, Z)) when n, k → ∞ and k/n → 0. It follows that the expected value E_{σ∼µ^n⊗λ^n} η*_n(x, z) of the approximation η*_n(x, z), taken over all random labelled n-samples, converges to η(x) for a.e. (x, z) as n, k → ∞ and k/n → 0. This is a summable sequence.
For the second term in equation (5.7) it remains to show convergence to zero in expectation. We perform the familiar trick with the Cauchy-Schwarz inequality and the variance (5.9). We have used the following three observations: the empirical measure of the symmetric difference of the sets B(X, Z, r_{k/n}(X), b_{k/n}(X, Z)) and B(X, Z, R_n, B_n) bounds the error; for the latter set this empirical measure is always one; and of the two intersections of the sample with these sets, one always contains the other. Now, the last line of equation (5.9) becomes |η* − Eη*| and is therefore just a special case of the second term, corresponding to the constant regression function η ≡ 1.

The revised conjecture
We propose the following conjecture (a revised version of the conjecture previously stated by us in [5]).

Conjecture 5.1. For a complete separable metric space Ω, the following are equivalent.
1. The k-NN classifier is (weakly) universally consistent in Ω.
2. For every sigma-finite locally finite Borel measure µ on Ω, every L¹(µ)-function f : Ω → R satisfies the weak Lebesgue-Besicovitch differentiation property:
\[ \frac{1}{\mu(B(x, r))} \int_{B(x,r)} f \, d\mu \to f(x) \tag{6.1} \]
in probability.
3. The space Ω is sigma-finite dimensional in the sense of de Groot, that is, one can represent Ω as a union of subspaces W_n in such a way that for each n and some δ_n ∈ N and s_n > 0, every finite family of closed balls with centres in W_n having the same radius < s_n admits a subfamily covering all the centres of the original balls and having multiplicity ≤ δ_n + 1 in Ω.

Essentially, by fixing C, we select a version of the Cygan-Korányi metric, because the groups are all isomorphic between themselves for different values of C ≠ 0.

Lemma 3.19 (Korányi and Reimann, [8], p. 17; Sawyer and Wheeden, [9], Lem. 4.4, p. 863). Let C = −2. There exists a sequence (p_n) of elements of H with r_n = |p_n|_H → 0 such that the family of balls B̄(p_n, r_n) is disconnected.

d(x_n, y) ≤ max{d(x_n, x), d(x, y)} = r_n, and at the same time r_n = d(x_n, x) ≤ max{d(x_n, y), d(y, x)}. So we must have d(x_n, y) = r_n, proving equation (5.4). In particular, the open balls B_n are pairwise disjoint: if n < m, then B_n ∩ S_n = ∅, while B_m ⊆ S_n. Also, the spheres S_n form a nested sequence: S_1 ⊇ S_2 ⊇ ….

\[ \frac{1}{\mu(B(x, r))} \int_{B(x,r)} f \, d\mu \to f(x), \tag{5.8} \]
one can use closed balls in place of open balls, and the a.e. convergence will still take place. Indeed, as every closed ball is the intersection of a sequence of open balls with the same centre, we have, by sigma-additivity,
\[ \frac{1}{\mu(\bar B(x, r))} \int_{\bar B(x,r)} f \, d\mu = \lim_{\epsilon \downarrow r} \frac{1}{\mu(B(x, \epsilon))} \int_{B(x,\epsilon)} f \, d\mu, \]
from where the statement follows.