Chain-referral sampling on Stochastic Block Models

We are motivated by the study of `hidden populations', for which no sampling frame is available: neither the size nor the membership of the population is known. The discovery of such a hidden population is made possible by assuming that its members are connected in a social network by their relationships. Chain-referral Sampling (CRS) makes use of the graph structure by randomly following the edges of the underlying social network. This leads to the study of a Markov chain on a random graph where vertices represent individuals and edges describe the relationships between the corresponding people. The interviewees are asked for their peers, and a number of coupons are then delivered to some of the people mentioned. One random graph model that has received a lot of attention lately is the Stochastic Block Model (SBM), which extends the well-known Erd\"os-R\'enyi graph to populations partitioned into communities. The SBM considered here is characterized by a number of vertices $N$ (the size of the population), a number of communities $m$, a block distribution $\pi=(\pi_1,...,\pi_m)$ representing the proportion of each community (block), and a pattern of connection between blocks given by the matrix $P=(\lambda_{kl}/N)_{(k,l) \in \{1,...,m\}^2}$. In this paper, we give a rigorous description of the dynamics of the CRS process in discrete time on an SBM. The difficulty lies in handling the heterogeneity of the graph. In our model, the graph and the random walk are constructed simultaneously. We then study the evolution of this chain by considering the normalized process on the time scale $[0,1]$. We prove that when the population size is large, the normalized stochastic process of the referral chain behaves like a deterministic curve, the unique solution of a system of ODEs.


Introduction
In sociology, some populations may be hidden because their members share common attributes that are illegal or stigmatized. These hidden groups may be hard to approach because their members try to conceal their identities from legal authorities (e.g. drug users) or because of social pressure (e.g. men having sex with men). In such populations, all the information is unknown: there is no sampling frame such as a list of the members of the population or of the relationships between them. This creates many challenges for researchers trying to identify these groups. The discovery of hidden populations is made possible by assuming that their members are connected by a social network. The population is described by a graph (network) where each individual is represented by a vertex and any interaction or relationship (e.g. friendship, partnership) between a pair of individuals is represented by an edge joining the corresponding vertices. Thanks to this important feature, we can investigate these populations by using a Chain-referral Sampling (CRS) technique, such as snowball sampling, targeted sampling, respondent-driven sampling, etc. (see the review of [25] or [16][17][18]). CRS consists in detecting hidden individuals in a population structured as a random graph, and is modeled by the stochastic process that we study here. The principle of CRS is that, from a group of initially recruited individuals, we follow their connections in the social network to recruit the subsequent participants. The exploration proceeds from node to node along the edges of the graph. The interviewees induce a sub-tree of the underlying real graph, and the information coming from the interviews gives knowledge on other non-interviewed individuals and edges, providing a larger sub-graph. We aim at understanding this recruitment process from the properties of the explored random graph. CRS has shown its practicality and efficiency in recruiting diverse samples of drug users (see [4]).
CRS models are hard to study from a theoretical point of view without any assumption on the graph structure. In this paper, we consider a particular model with latent community structure: the stochastic block model (SBM) proposed by Holland et al. [19]. This model is a useful benchmark for statistical tasks such as recovering the community structure (communities are also called blocks or types in the sequel) in network science [14,15,23]. By block structure, we mean that the set of vertices in the graph is partitioned into subsets called blocks, and that nodes connect to each other with probabilities depending only on their types, i.e. the blocks to which they belong. For example, edges may be more common within a block than between blocks (e.g. a group of people having sexual contacts). We recall here the definition of the SBM (we refer the reader to the survey [1]):

Definition 1.1. Let N be a positive integer (number of vertices), m be a positive integer (number of blocks or types), π = (π_1, . . ., π_m) be a probability distribution on {1, . . ., m} (the probabilities of the m types, i.e. a vector of [0, 1]^m such that ∑_{k=1}^m π_k = 1) and P = (p_kl)_{(k,l)∈{1,...,m}^2} be a symmetric matrix with entries p_kl ∈ [0, 1] (connectivity probabilities). The pair (Γ, G) is drawn under the distribution SBM(N, π, P) if the vector of types Γ is an N-dimensional random vector whose components are i.i.d., {1, . . ., m}-valued with law π, and G is a simple graph of size N where vertices i and j are connected, independently of the other pairs of vertices, with probability p_{Γ_i Γ_j}. We also denote the blocks (community sets) by [l] := {v ∈ {1, . . ., N} : Γ_v = l}, with size N_l := #[l].

Notice that when m = 1, i.e. there is only one type, any pair of vertices is connected independently of the others with the same probability p_11, and the SBM reduces to the Erdös-Rényi graph, which is studied in [10].
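As an illustration of Definition 1.1, the model can be sampled directly. The following sketch (function and variable names are ours, not from the paper, and blocks are 0-indexed) draws the type vector Γ i.i.d. from π and connects each pair independently with probability p_{Γ_i Γ_j}:

```python
import random

def sample_sbm(N, pi, P, rng=random.Random(0)):
    """Draw (types, edges) from SBM(N, pi, P).

    pi: block proportions; P[k][l]: connection probability between
    blocks k and l. Illustrative sketch of Definition 1.1."""
    m = len(pi)
    # i.i.d. types with law pi
    types = [rng.choices(range(m), weights=pi)[0] for _ in range(N)]
    edges = set()
    for i in range(N):
        for j in range(i + 1, N):
            # each pair is connected independently of the other pairs
            if rng.random() < P[types[i]][types[j]]:
                edges.add((i, j))
    return types, edges
```

With P = (λ_kl/N) this produces the sparse Poisson regime considered below.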
Here, we consider the Poisson case where the connectivity probabilities p_kl depend on N and are given by p_kl = λ_kl/N. This means that each individual of block k contacts on average λ_kl π_l individuals of block l, and implies that the network examined is sparse. In the present work, we give a rigorous description of a CRS on such an SBM and study the propagation of the referral chain on this sparse model.
The CRS relies on a random peer-recruitment process. To handle the two sources of randomness, the graph and the exploring process on it are constructed simultaneously. In the construction, the vertices of the graph can be in 3 different states: inactive vertices that have not been contacted for interviews, active vertices that constitute the next interviewees, and off-mode vertices that have already been interviewed. The idea of describing the random graph via a Markov exploration process with active, explored and unexplored nodes is classical in random graph theory. It has been used as a convenient technique to expose the connections inside a cluster, especially to discover the giant component in random graph models; see for example [11,27]. In our case, there is a slight difference in the recruiting process: the number of nodes being switched to the active mode is bounded by a constant. This trick helps to reduce the bias towards high-degree nodes in the population (see [18]). At the beginning of the survey, all individuals in the population are hidden and are marked as inactive vertices. We choose some people as seeds of the investigation and activate them. During the interview, these individuals name their contacts and a maximum number c of coupons are distributed to the latter, who become active nodes. One by one, every carrier of a coupon can come to a private interview and is asked in turn to give the names of her/his peers. Whenever a new person is named, one edge connecting the interviewee and her/his contact is added, but the contact remains inactive until she/he receives a coupon. After the interview, a maximum number c of new contacts receive one coupon each and are activated. So if the interviewee names more than c people, some of them are not given any coupon and can still be explored later, provided another interviewee mentions them. After that, the node associated to the person who has just been interviewed is switched to off-mode and is never recruited again, see Figure 1. We repeat the procedure of interviewing, referring and distributing coupons until there is no more active vertex in the graph (no more coupon is returned). Each person returning a coupon receives some money as a reward for her/his participation, and an extra bonus depending on the number of contacts that will later return their coupons. Notice that each individual in the population is interviewed just once, and we assume here that there is no restriction on the total number of coupons.
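The coupon mechanics just described can be summarized in code. The sketch below is our own illustration, run on a fixed known graph rather than one revealed on the fly, and with a FIFO interview order instead of the paper's uniform choice among coupon holders; it tracks the three vertex states (active coupon holders, found-but-inactive, off-mode):

```python
from collections import deque
import random

def chain_referral(adj, seeds, c, rng=random.Random(1)):
    """Chain-referral exploration of a graph `adj` (vertex -> set of
    neighbours). Returns interviewed vertices and the found-but-inactive ones."""
    active = deque(seeds)          # coupon holders waiting to be interviewed
    has_coupon = set(seeds)
    found = set()                  # named individuals without a coupon (inactive)
    off = []                       # interviewed individuals (off-mode)
    while active:
        v = active.popleft()
        off.append(v)
        # contacts named during the interview that can still receive a coupon
        named = [u for u in adj[v] if u not in has_coupon and u not in off]
        rng.shuffle(named)
        for u in named[:c]:        # at most c coupons handed out
            has_coupon.add(u)
            active.append(u)
            found.discard(u)       # a found individual may be activated later
        for u in named[c:]:
            found.add(u)
    return off, sorted(found)
```

On a path graph with c = 1 the walk traverses the whole path; on a star, the centre distributes a single coupon and the survey stops after two interviews, illustrating how small c can stall the exploration.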
The process of interest counts the number of coupons present in the population. We also want to know how many people are detected, which leads us to count the people explored but without coupons. Denoting by n ∈ N = {0, 1, 2, . . .} the discrete time counting the number of interviews completed, A_n corresponds to the number of individuals that have received coupons but have not been interviewed yet (number of active vertices); B_n to the number of individuals cited in the interviews but who have not been given any coupon (number of found but still inactive vertices); and U_n to the total number of individuals having been interviewed (number of off-mode nodes).
Because of the connectivity properties of SBM graphs, we need to keep track of the types of the interviewees and of the coupons distributed, not only in one community but in each of the m communities at every step. We then associate to the chain-referral process the stochastic vector process X_n = (A_n, B_n, U_n), where A_n = (A_n^(1), . . ., A_n^(m)), B_n = (B_n^(1), . . ., B_n^(m)) and U_n = (U_n^(1), . . ., U_n^(m)): A_n^(l) (resp. B_n^(l) and U_n^(l)) corresponds to the number of active nodes (resp. of found but inactive nodes, and of off-mode nodes) of type l at step n. The main object of the paper is to establish an approximation result when the size N of the SBM graph tends to infinity. In this case, the correctly renormalized chain-referral process is X^N_t := X_{⌊Nt⌋}/N, for t ∈ [0, 1]. In all the paper, we consider the spaces R^d equipped with the L^1-norm, defined for x = (x_1, . . ., x_d) as ‖x‖ = ∑_{k=1}^d |x_k|. For all N, the process X^N_• lives in the space of càdlàg processes D([0, 1], [0, 1]^{3×m}) equipped with the Skorokhod topology (see [13,20,22]).
There exist, to our knowledge, only a few works studying CRS from a probabilistic point of view; see for example Athreya and Röllin [3]. They obtained a result in a slightly different framework: they consider random walks on the limiting graphon to construct a sequence of sub-graphs, which converges almost surely, in the cut-metric, to the graphon underlying the network. Here, by contrast, we take both the graph and its exploring random walk to the limit simultaneously. The main result of this paper is that the process (X^N_•)_N converges to the solution of a system of ordinary differential equations (ODEs). There has also been literature on random walks exploring graphs, possibly with different mechanisms (see [7,12] for instance); here we allow the exploring Markov process to branch. Also, our process bears similarities with epidemics spreading on graphs (see [6,9,21,26]), but with the additional constraint of a maximum number of distributed coupons.
The CRS is constructed on principles similar to those of an epidemic spread and starts with a single individual. There are two main phases of evolution (see [6]): the initial phase, which is well approximated by a branching process (and which we neglect here), and the second phase, in which the stochastic process is approximated by a deterministic curve. In this paper, we focus on the second phase, but let us comment quickly on the first one. In the sequel, we will assume that:

Assumption 1.2. For each ℓ, k ∈ {1, . . ., m}, denote µ_ℓk = λ_ℓk π_k. We assume that the matrix µ = (µ_ℓk)_{ℓ,k∈{1,...,m}} is irreducible and that the largest eigenvalue of µ is larger than 1.
Remark 1.3. Under Assumption 1.2, following the proof of Theorem 3.2 of Barbour and Reinert [6], the early stages of the CRS can be approximated by a multitype branching process with offspring distributions determined by the matrix µ. Thanks to Assumption 1.2, the multitype branching process associated with the offspring matrix µ is supercritical. Results analogous to those for the extinction probability and for the number of offspring at the nth generation of a single-type branching process have been proved in Chapter 5 of [2]: the mean matrix of the population size at time n is proportional to µ^n. Following claim (3.11) of Barbour and Reinert [6], we can deduce that if we start with a single individual, then after finitely many steps we reach, with positive probability, a positive fraction of explored individuals in the population.
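The early-stage approximation of Remark 1.3 can be simulated directly. The sketch below is our own illustration; we assume Poisson offspring numbers with means µ_ℓk, consistent with the sparse SBM regime, and iterate a multitype branching process started from a single individual:

```python
import math
import random

def poisson_sample(lam, rng):
    # Knuth's inversion-by-multiplication algorithm for Poisson(lam)
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def branching_counts(mu, pi, n_gen, rng=random.Random(2)):
    """Per-type population sizes over n_gen generations of a multitype
    branching process: an individual of type k has Poisson(mu[k][l])
    children of type l (assumed offspring law, as an approximation)."""
    m = len(mu)
    counts = [0] * m
    counts[rng.choices(range(m), weights=pi)[0]] = 1   # one ancestor, type ~ pi
    history = [counts[:]]
    for _ in range(n_gen):
        new = [0] * m
        for k in range(m):
            for _ in range(counts[k]):
                for l in range(m):
                    new[l] += poisson_sample(mu[k][l], rng)
        counts = new
        history.append(counts[:])
    return history
```

When the largest eigenvalue of µ exceeds 1 (Assumption 1.2), the simulated generation sizes grow geometrically on the survival event, mirroring the mean behaviour proportional to µ^n.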
This means that the initial number of individuals of type i at the beginning of the survey is approximately a_0^(i) N. A possible way of initializing the process is to draw A_0 from a multinomial distribution M(⌊a_0 N⌋; π_1, . . ., π_m).
Theorem 1.5. Under Assumptions 1.2 and 1.4, when N tends to infinity, the process (X^N_•)_N converges in distribution to a deterministic process x = (a, b, u), which is the unique solution of the system of differential equations x_t = x_0 + ∫_0^t f(x_s) ds, where f(x_s) := (f_il(x_s))_{1≤i≤3, 1≤l≤m} admits an explicit formula.

Remark 1.6. Notice that in this model, the time corresponds to the fraction of the population interviewed. The time t_0 is the first time at which |a_t| reaches 0 and can be seen as the proportion of the population interviewed when there is no more coupon to keep the CRS going. Necessarily, t_0 ≤ 1. The solution of the system of ODEs (1.5) then becomes constant over the interval [t_0, 1].
The rest of this paper is organized in the following manner. First, in Section 2, we give a precise description of the chain-referral process on an SBM random graph. This relies heavily on the structure of the random graph, which we construct progressively as the exploration process spreads on it. In Section 3, we prove the limit theorem. The proof uses the limit theory of càdlàg semi-martingale vector processes equipped with the Skorokhod topology (see [13]) and Poisson approximations (see [5]). Then, in Section 4, we present simulation results for the stochastic process and the solution of the system of limiting ODEs. We conclude with some discussion of the impact of changing the parameters of the model on the evolution of the chain-referral process.

Definition of the chain-referral process
Let us describe the dynamics of (X_n)_n; recall that |A_n| is the total number of individuals having coupons but who have not yet been interviewed. We start with A_0 seeds, whose types are chosen independently according to π: A_0 is an m-dimensional random vector with multinomial distribution M(⌊a_0 N⌋; π_1, . . ., π_m), and Assumption 1.4 is satisfied. Also B_0 = U_0 = (0, . . ., 0), and we set X_0 = (A_0, B_0, U_0). We now define X_n given the state X_n−1 previous to the nth interview and given the numbers N_1, . . ., N_m of nodes of each type.

At step n ≥ 1, the type of the upcoming interviewee is chosen at random according to the number of active coupons of each type at the present time. To choose the type of the next interviewee, we define an m-dimensional vector I_n := (I_n^(1), . . ., I_n^(m)), which takes value 1 at coordinate l and 0 elsewhere if the nth interviewee belongs to block l. This nth interviewee is chosen uniformly among the |A_n−1| active coupons of the m types, i.e. I_n has multinomial distribution M(1; A_n−1^(1)/|A_n−1|, . . ., A_n−1^(m)/|A_n−1|). If the chosen one belongs to block [l], A_n^(l) is reduced by 1 and the new coupons distributed are added up, depending on how many new contacts he/she has. In the meantime, the number of interviewees of type l is increased by 1, i.e. U_n^(l) = U_n−1^(l) + I_n^(l). Among the new contacts of the nth interviewee, let H_n^(l) denote the number of new contacts of type l who have not been mentioned before, and K_n^(l) the number of new contacts of type l whose identities are already known but who are still inactive. Conditionally on (N_1, . . ., N_m) and X_n−1, H_n^(l) follows the binomial distribution (2.2), and K_n^(l) also has a binomial distribution, (2.3). In total, there are Z_n := H_n + K_n candidates who can possibly receive coupons at step n. Notice that, conditionally on (N_1, . . ., N_m) and X_n−1, the variables (H_n^(l), K_n^(l))_{l=1,...,m} are independent. Let C_n = (C_n^(1), . . ., C_n^(m)) be the numbers of coupons that are distributed at step n. By the setting of the survey, the total number of coupons |C_n| is at most c. If the number |Z_n| of candidates is less than or equal to c, we deliver exactly |Z_n| coupons. Otherwise, we choose the new people to be enrolled in the study by an m-dimensional random variable C_n with a multivariate hypergeometric distribution. We also define the first step at which |A_n| reaches zero. The dynamics of X_n can then be described by the recursion (2.7). The random network is progressively discovered as the referral chain process explores it.
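A single transition of the chain can be sketched as follows. This is our own illustrative code; in particular, the pool of never-named type-l individuals is taken to be N_l minus all type-l individuals already active, found or interviewed, an assumption made where the paper's formula (2.2) is not reproduced. Distributing the at most c coupons uniformly among the candidates realizes the multivariate hypergeometric variable C_n over the types:

```python
import random

def crs_step(A, B, U, Ns, P, c, rng=random.Random(3)):
    """One step of (A_n, B_n, U_n) following Section 2 (a sketch).
    A, B, U: per-type counts; Ns: block sizes; P[k][l]: connection probs."""
    m = len(A)
    # type of the interviewee, drawn proportionally to active coupons per type
    i = rng.choices(range(m), weights=A)[0]
    A, B, U = A[:], B[:], U[:]
    A[i] -= 1
    U[i] += 1
    candidates = []                 # (type, origin) of potential coupon receivers
    for l in range(m):
        hidden = Ns[l] - A[l] - B[l] - U[l]     # never-named type-l pool (assumed)
        h = sum(rng.random() < P[i][l] for _ in range(hidden))   # H_n^(l)
        k = sum(rng.random() < P[i][l] for _ in range(B[l]))     # K_n^(l)
        candidates += [(l, "H")] * h + [(l, "K")] * k
    # at most c coupons, distributed uniformly among the Z_n candidates
    rng.shuffle(candidates)
    for l, origin in candidates[:c]:
        A[l] += 1
        if origin == "K":
            B[l] -= 1               # a found individual becomes active
    for l, origin in candidates[c:]:
        if origin == "H":
            B[l] += 1               # newly named, no coupon: found but inactive
    return A, B, U
```

Iterating `crs_step` until `sum(A) == 0` reproduces the full chain on one realization of the block sizes.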
Proposition 2.1. Consider the discrete-time process (X_n)_{1≤n≤N} defined in (2.7). For n ∈ N, denote by F_n := σ{X_i, i ≤ n, (N_1, . . ., N_m)} the canonical filtration associated with (X_n)_{1≤n≤N}. Then the process (X_n)_n is an inhomogeneous Markov chain with respect to the filtration (F_n)_n.
Proof. The proposition follows from the recursion (2.7) for (X_n)_{1≤n≤N} and from the fact that the random variables C_n, I_n, H_n are defined conditionally on X_n−1 and (N_1, . . ., N_m). The fact that the Markov chain is inhomogeneous comes from the setting of the CRS survey: there is no replacement in the recruitment procedure. For example, when m = 1, the binomial distribution (2.2) has parameters that depend on the time step n.

Asymptotic behavior of the chain-referral process
Let us now consider the renormalized chain-referral process given in (1.1) on the time interval [0, t_0]. The main theorem (Thm. 1.5) shows the convergence of the sequence (X^N_•)_N to a deterministic process. For this, we rewrite the equations (2.7) as a vector of semi-martingales. We start by writing the Markov chain (X_n)_{1≤n≤N} as the sum of its increments in discrete time.
Each component of the increment X_n+1 − X_n is a binomial variable conditioned on the events that have occurred up to step n. When we fix n and let N tend to infinity, these conditional binomial random variables can be approximated by Poisson random variables. The Doob decomposition of the renormalized processes (X^N_t)_{t∈[0,t_0]}, given in Section 3.1, consists of a finite variation process and an L^2-martingale. We use the Aldous criterion (conditionally on the past, see e.g. [13,24]) to show the tightness of the distributions of these processes in Section 3.2. Once tightness is established, we identify the limiting values of this tight sequence and finally prove that the limiting values of all converging subsequences coincide, which gives the limit of the processes (X^N_•)_N. This proves Theorem 1.5.
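The binomial-to-Poisson approximation invoked here can be checked numerically: for Bin(N, λ/N) versus Poisson(λ), the total variation distance is at most λ²/N (Le Cam's inequality), so it vanishes as N grows. A small self-contained check (our own illustration, with exact pmf computations):

```python
import math

def binom_pmf(n, p, k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam**k / math.factorial(k)

def tv_binom_poisson(n, lam, kmax=60):
    """Total variation distance between Binomial(n, lam/n) and Poisson(lam),
    truncated at kmax (the neglected tail is negligible for moderate lam)."""
    p = lam / n
    return 0.5 * sum(abs(binom_pmf(n, p, k) - poisson_pmf(lam, k))
                     for k in range(kmax + 1))
```

For λ = 2 the distance already drops below 10⁻² at n = 1000, consistent with the λ²/n bound.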

Doob's decomposition
Lemma 3.1. The process (X^N_t)_{t∈[0,1]} admits the Doob decomposition X^N_t = X^N_0 + ∆^N_t + M^N_t, where (∆^N_t)_t is an F^N_t-predictable process with finite variations and (M^N_t)_t is an F^N_t-martingale whose quadratic variation is given, for every (l, k) ∈ {1, . . ., m}^2, by (3.2), where X denotes a column vector and X^T its transpose.
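For reference, the discrete-time Doob decomposition behind Lemma 3.1 is the standard identity below (a generic restatement with the paper's filtration, not the paper's explicit formula for ∆^N):

```latex
X_n = X_0
  + \underbrace{\sum_{k=1}^{n} \mathbb{E}\!\left[X_k - X_{k-1}\mid \mathcal{F}_{k-1}\right]}_{\text{predictable part}}
  + \underbrace{\sum_{k=1}^{n} \Big(X_k - X_{k-1} - \mathbb{E}\!\left[X_k - X_{k-1}\mid \mathcal{F}_{k-1}\right]\Big)}_{\text{martingale part}}
```

The renormalized processes ∆^N and M^N are obtained from the two sums by evaluating at n = ⌊Nt⌋ and dividing by N.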
Proof. In order to obtain the Doob decomposition, we write, for t ∈ [0, 1], the telescopic sum of the conditional increments. The conditional expectations involved are all well-defined since the components of X_n and X_n−1 are bounded by N; moreover, ∆^N_t is F^N_t-predictable and (M^N_t)_{t∈[0,1]} is an F^N_t-martingale. We first check that (∆^N_•)_N is a sequence of finite variation processes, so that we indeed obtain the Doob decomposition. Denote λ := max_{l,k∈{1,...,m}} λ_kl. A direct computation shows that the total variation of (∆^N_t)_{t∈[0,1]} is bounded, hence finite, so that (∆^N_t)_{t∈[0,1]} is F^N_t-predictable with finite variations. The quadratic variation of (M^N_t)_{t∈[0,1]} is computed as follows. For every k, l = 1, . . ., m, we rewrite the term L^N_t so as to recognize the form (3.2): the integrand in the right-hand side is the conditional covariance between the components of X_n; since the components are vectors, this covariance is a 3 × 3 matrix. For all 0 < δ < 1 and every s, t ∈ [0, 1] such that |t − s| < δ, we obtain a bound on ‖∆^N_t − ∆^N_s‖ proportional to |t − s|. Thus, for each ε > 0, choosing δ_0 ≤ ε/(2(2c + mλ + 1)), we conclude that the sequence (∆^N_•)_N is tight, which finishes the proof of the lemma.
To complete the proof of Lemma 3.2, we now prove the following: Proposition 3.3. The sequence (M^N_•)_N converges locally uniformly in t to 0 in L^2, as N goes to infinity.

Proof. Consider the quadratic variation of (M^N_•)_N. Starting from formula (3.2), we apply the Cauchy–Schwarz inequality and then use inequality (3.8) to bound, for every t ∈ [0, 1], the conditional variances of the components of X_n = (A_n, B_n, U_n). From (3.3)–(3.5) and (3.9), we deduce that the expected quadratic variation tends to 0 as N → ∞. Applying Doob's inequality for martingales, for every t ∈ [0, 1], the supremum of ‖M^N_s‖ over s ≤ t converges to 0 in L^2. This concludes the proof of Proposition 3.3 and hence of Lemma 3.2.

Identification of the limiting value
Since the sequence (X^N_•)_N is tight, for any limiting value x = (a, b, u) of the sequence (X^N_•)_N, there exists an increasing sequence (ϕ_N)_N in N such that (X^{ϕ_N}_•)_N converges in distribution to x in D([0, 1], [0, 1]^{3×m}). Because the sizes of the jumps converge to zero with N, the limit is in fact in C([0, 1], [0, 1]^{3×m}). We want to identify this limit. In order to simplify the notation, we also write (X^N_•)_N for the subsequence. We consider separately the martingale and finite variation parts. Proposition 3.3 implies that the martingale sequence (M^N_•)_N converges to 0 in distribution, and hence (M^N_•)_N converges to zero in probability. It remains to find the limit of the finite variation processes (∆^N_•)_N given in equation (3.1) and to prove that the limit found is the same for every convergent subsequence extracted from the tight sequence (X^N_•)_N (which is done later in the proof of the uniqueness of the solution of the system of ODEs (1.5)).

Proposition 3.4. When N goes to infinity, the finite variation parts converge in distribution in D([0, 1], [0, 1]^{3×m}) to deterministic limits expressed in terms of λ^{k,l}_s, Λ^k_s and µ^{k,l}_s, defined as in Theorem 1.5. This provides the convergence of (∆^N_•)_N to a solution x_• of (1.2).
Since the limits are deterministic, the convergences hold in probability. Moreover, the uniqueness of the solution of (1.2) will be proved later, which will imply the convergence of the whole sequence (X^N_•)_N to this solution.
Proof. Recall that since the sequence (X^N_•)_N is tight, we have extracted a converging subsequence, also denoted by (X^N_•)_N, whose limit we study. The proof of Proposition 3.4 is divided into three steps.
Step 1: We consider the most complicated term, E[C_n | F_n−1]. We prove (3.14): for each l ∈ {1, . . ., m}, this conditional expectation admits the stated limit. It follows that we need to deal with the Poisson random variables Z^(l)_n (l ∈ {1, . . ., m}). Since the sum of two independent Poisson random variables is a Poisson random variable whose parameter is the sum of the two parameters, the sum ∑_{j≠l} Z^(j)_n has a Poisson distribution with parameter ∑_{j≠l} λ^(j)_n. Using (3.16), we then obtain the desired limit, which finishes Step 1.
Step 2: We decompose the second term on the left-hand side of (3.14) and obtain, for every t ∈ [0, 1], the corresponding limit. This shows that x^1_t ≡ x^2_t for all t ∈ [0, t_0] for any two limiting values x^1 and x^2; it also follows that the associated times t_0 coincide.

Figure 2. Plots of the proportions of the classes in a population of size N = 10000 when c varies from 1 to 6 and all the other parameters are fixed: A_0 = 100, π = (1/3, 2/3), λ_11 = 2, λ_12 = 3, λ_22 = 4.

Simulation
The simulations show that the deterministic solution of the system of ODEs (1.2) fits well with our stochastic process, see Figure 2. The sequence of stochastic processes (X^N_•)_N that we have constructed describes how the chain-referral process works on a network. When we consider a population with a very large number of people, the process (X^N_•)_N is asymptotically a deterministic function, which is a solution of the system (1.2). To see the convergence numerically, we perform a simulation: for c = 3, we vary N from 500 to 50000 and plot, as a function of N, the logarithm of the distance d_1(X^N, x), see Figure 3. The speed of convergence has been studied in the case of Erdös-Rényi graphs in the Ph.D. thesis [28], by establishing a central limit theorem. By studying the solution of (1.2), we can obtain an approximation of the fraction of the population that has been interviewed when the CRS process stops. The proportion of the population discovered is then approximated by t_0.
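The quantity t_0 can also be estimated by direct Monte Carlo simulation. The sketch below is self-contained and our own illustration: it draws one SBM realization with p_kl = λ_kl/N, runs the CRS with a uniform choice among coupon holders, and returns the fraction of the population interviewed when the coupons run out:

```python
import random

def simulate_t0(N, pi, lam, c, n_seeds, rng=random.Random(5)):
    """Fraction of the population interviewed when no active coupon remains,
    on one realization of SBM(N, pi, (lam[k][l]/N)) explored by CRS."""
    m = len(pi)
    types = [rng.choices(range(m), weights=pi)[0] for _ in range(N)]
    # adjacency lists of the sparse SBM
    adj = [[] for _ in range(N)]
    for i in range(N):
        for j in range(i + 1, N):
            if rng.random() < lam[types[i]][types[j]] / N:
                adj[i].append(j)
                adj[j].append(i)
    active = set(rng.sample(range(N), n_seeds))
    done = set()
    while active:
        v = rng.choice(sorted(active))     # uniform among coupon holders
        active.remove(v)
        done.add(v)
        # named contacts that may still be activated (found individuals included)
        fresh = [u for u in adj[v] if u not in active and u not in done]
        rng.shuffle(fresh)
        active.update(fresh[:c])           # at most c coupons
    return len(done) / N
```

Averaging `simulate_t0` over several runs for the parameters of Table 1 gives empirical counterparts of the t_0 values computed from the ODE limit.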
The maximum number of coupons c plays an important role in how many people we can explore before no more coupons are distributed (when a_t = 0). Keeping all other parameters fixed and changing c in the simulations of Figure 2, we see that the times t_0 differ. For example, with m = 2, π = (1/3, 2/3), λ_11 = 2, λ_22 = 4, λ_12 = 3, we obtain Table 1. If c = 1, even though the average number of neighbors is bigger than 1, the simple random walk describing the survey reaches only a very small number of people, see Figure 2a. The random walk stops when it encounters a node of degree 1 and cannot propagate any more.
Furthermore, the parameter c also impacts the peaks (time and size) of the curves corresponding to the number of distributed coupons. In case of a limited budget with a fixed number of interviews, a higher value of c can imply that we discover a larger fraction of the population, since it allows more flexibility in the choice of interviewees. From Figure 4, we observe that the proportion of people receiving coupons gets bigger as c increases. If c = 1, the fraction of the discovered population is small, which means that the survey is not very efficient. When c takes values from 4 to 6, the corresponding curves of a_t are "close", and so are the times t_0. However, in these cases, the number of coupons spent during the CRS survey is large. We can also be interested in how c impacts the part of the population discovered when the survey stops after a fixed number of interviewed individuals. For example, consider the case N = 1000 and assume that we start with A_0 = 10. The parameters of the SBM are π = (1/3, 2/3), λ_11 = 2, λ_22 = 4, λ_12 = 3. Then, when approximately 0.2N individuals have been interviewed, the proportion of explored individuals A^N_{0.2} + B^N_{0.2} for each c varying from 1 to 6 is given in Table 2.
Changing the parameters λ_kl impacts the discovered proportions of the types. For instance, let us take a bipartite random model with π = (1/3, 2/3), c = 3 and λ_11 = λ_22 = 0, λ_12 = 4, which means that people of different communities are highly connected while there is no connection within each community. In this case, the number of explored people without coupons of type 1 is quite small compared to that of type 2, see Figure 5.

Figure 1. Description of how the chain-referral sampling works. In our model, the random network and the CRS are constructed simultaneously. For example, at Step 3, an edge between two vertices who are already known at Step 2 is revealed.

H^(l)_n denotes the number of new contacts of type l who have not been mentioned before; K^(l)_n the number of new contacts of type l whose identities are already known but who are still inactive. The H^(l)_n new connections are chosen independently among the not-yet-discovered individuals of type l in the hidden population, each successful connection having probability ∑_{k=1}^m I^(k)_n p_kl. Hence, conditioning on (N_1, . . ., N_m) and X_n−1, the random variable H^(l)_n is binomial with success probability ∑_{k=1}^m I^(k)_n p_kl. In the same way, conditioning on (N_1, . . ., N_m), X_n−1 and Z_n, the vector (C^(1)_n, . . ., C^(m)_n) has the multivariate hypergeometric distribution with parameters (m; c, (Z^(1)_n, . . ., Z^(m)_n)) and support {(c_1, . . ., c_m) ∈ N^m : ∀l ≤ m, c_l ≤ Z^(l)_n}.

Notice that Λ_n = 0 only if λ^(l)_n = 0 for each l ∈ {1, . . ., m}. This happens when A^(l)_n−1 + U^(l)_n = N_l, meaning that all the nodes of type l have been discovered. In this case, C^(l)_n = 0 and (3.14) is satisfied.

Figure 3. Scatter plot of ln d_1(X^N, x) along with a smoothing line, suggesting a linear relationship between ln d_1(X^N, x) and N. The plot is done for c = 3; the number of initial individuals is 1% of the population and the size N varies from 500 to 10000. All other parameters are fixed: π = (1/3, 2/3), λ_11 = 2, λ_12 = 3, λ_22 = 4.

Table 1. Numerical computation of t_0 for varying values of the parameter c.

Table 2. Numerical computation of A^N_{0.2} + B^N_{0.2}.