License: arXiv.org perpetual non-exclusive license
arXiv:2310.16136v2 [stat.AP] 07 Dec 2023

Analyzing Disparity and Temporal Progression of Internet Quality through Crowdsourced Measurements with Bias-Correction

Hyeongseong Lee, Udit Paul, Arpit Gupta, Elizabeth Belding, and Mengyang Gu
   
University of California, Santa Barbara, California, USA
Corresponding author ([email protected])
Abstract

Crowdsourced speed test measurements are an important tool for studying internet performance from the end user perspective. Nevertheless, despite the accuracy of individual measurements, simplistic aggregation of these data points is problematic due to their intrinsic sampling bias. In this work, we utilize a dataset of nearly 1 million individual Ookla Speedtest® measurements, correlate each datapoint with 2019 Census demographic data, and develop new methods to present a novel analysis to quantify regional sampling bias and the relationship of internet performance to demographic profile. We find that the crowdsourced Ookla Speedtest data points contain significant sampling bias across different census block groups based on a statistical test of homogeneity. We introduce two methods to correct the regional bias by the population of each census block group. Whereas the sampling bias leads to a small discrepancy in the overall cumulative distribution function of internet speed in a city between estimation from original samples and bias-corrected estimation, the discrepancy is much smaller compared to the size of the sampling heterogeneity across regions. Further, we show that the sampling bias is strongly associated with a few demographic variables, such as income, education level, age, and ethnic distribution. Through regression analysis, we find that regions with higher income, younger population, and lower representation of Hispanic residents tend to measure faster internet speeds along with substantial collinearity amongst socioeconomic attributes and ethnic composition. Finally, we find that average internet speed increases over time based on both linear and nonlinear analysis from state space models, though the regional sampling bias may result in a small overestimation of the temporal increase of internet speed.

1 Introduction

The allocation of large US federal grants and subsidies to expand Internet infrastructure, such as through the $42.5 billion Broadband Equity Access and Deployment (BEAD) program initiated in November 2021, is an important step in addressing digital inequity and providing high-quality Internet access to all Americans. To properly allocate this funding to regions of greatest need, there must first be a methodology for measuring the current state of Internet access and quality at a given location. Indeed, the US Federal Communications Commission (FCC) recently undertook a significant effort to update publicly available information about US broadband infrastructure. First, the FCC worked to improve accuracy in the national Broadband Serviceable Location Fabric, defined as “a common data set of all residential and business locations (or structures) in the U.S. where fixed broadband internet access service is or can be installed.” Based on this new Fabric, they updated the National Broadband Map with ISP-provided data on the maximum download speed available at each location [10]. While this is a significant advancement in publicly accessible data on broadband infrastructure, the Fabric and the map have a number of drawbacks. In particular, the Broadband Map does not report actual or average performance but theoretical maximums. It also does not provide information about the reliability or stability of the access over time. Finally, providers have been repeatedly found to overstate coverage claims [1, 27].

To address these drawbacks, the research community, individual users, and governments and policymakers have turned to crowdsourced “speed test” active measurement data, such as Ookla’s speedtest.net [11], Measurement Lab’s speed.measurementlab.net [9], FAST [5] and Xfinity’s speed test [15]. These platforms are in wide use globally. For instance, Ookla reports a daily average of over 10 million Speedtests [16]. Speed test platforms offer a variety of measurements, such as instantaneous upload/download speed and latency and loaded latency data, through either an app or a website. As such, these platforms provide an important snapshot of the network state from the vantage point of the end users. Further, because they are active measurements, they provide data on actual performance instead of the theoretical maximum performance reported by the providers. Because of the inherent benefits, numerous U.S. government initiatives (e.g. [18, 4, 6, 14, 7, 3]) utilize crowdsourced speed test data to help map the internet access landscape.

However, despite the broad use of speed tests, there are multiple critical limitations that can affect the estimation accuracy of these crowdsourced measurements. In particular, due to the nature of crowdsourced data, the sampling mechanism is often uncontrollable [35]. As a result, inherent in this data is a variety of biases along multiple dimensions that can skew the measurement results in unanticipated ways. For instance, the sampling biases can lead to inappropriate conclusions if they are treated as simple random samples [29]. Another barrier is that consumers’ information, such as broadband plans and demographic profiles, is not directly available from the speed tests. Even if the broadband plans can be correctly inferred for the speed test with the use of other data sets [31], the lack of a demographic profile associated with each measurement makes it challenging to correct the sampling bias directly. Finally, speed test performance may depend on the choice of test server; prior work has found that some servers consistently report speeds 10% lower than other servers [26]. Nevertheless, the vast number and diversity of speed test data points present abundant information for understanding the disparity of internet quality, particularly as related to different socio-demographic groups and the change/evolution of internet quality over time.

It is within this context that we perform our analysis. Specifically, we utilize a vast corpus of nearly 1 million individual Ookla Speedtest® measurements, to which we have access through an Ookla for Good™ Data Use Agreement, to correlate network performance measurements and trends with demographic profiles of census block groups from the 2019 U.S. census data. Our goals are to characterize disparities in internet quality based on socio-demographic groups and to analyze the change in internet quality over time. We illustrate our novel methods using four representative cities and two device types (iOS and Android) as examples, corresponding to 978,101 Speedtest data points over 580 days. Specifically, the contributions of our work are threefold:

  1.

    We develop novel methods for correcting regional sampling bias and correlating this sampling bias with demographic profiles in the population. By utilizing the chi-squared test, we find that the proportion of the Speedtests in different census block groups significantly deviates from the baseline proportions in the population. We introduce re-weighing and re-sampling methods for correcting the regional sampling bias. Even though we found a visible discrepancy between the original and corrected estimation of the internet speed distribution, this difference is much smaller than the regional bias, meaning that the sampling bias itself may not substantially affect the estimation of the internet speed distribution in a city. (Throughout this paper, we use the terms “internet speed” and “measured internet speed” interchangeably; in either case, we are specifically referring to the download speed of the connection to the user end device as measured through an Ookla Speedtest.) We study the underlying reasons for disproportionate sampling and find that the sampling bias is strongly associated with a few demographic variables, including income, education level, age, and ethnic distribution.

  2.

    We introduce variable selection and regression techniques to fuse data at different spatial granular levels for understanding how internet quality varies across demographic profiles. By studying the individual and aggregated data, we show that the use of individual-level measurements is more statistically efficient in estimating the variability of the speed test measurements than the aggregated data. Through backward variable selection for the original and bias-corrected samples, we find that regions with higher income, younger populations, and smaller Hispanic populations tend to have higher internet speeds. Furthermore, we find strong collinearity between a few demographic variables, which tend to affect the outcomes of the fitted model jointly.

  3.

    We analyze temporal changes of measured internet performance. We fit both a linear regression model and a state space model to study the temporal change of the internet measurements. We utilize the state space representation of the Gaussian process with a Matérn covariance function having a half-integer roughness parameter, which reduces the cost of computing the likelihood and making predictions from $\mathcal{O}(n^{3})$ to $\mathcal{O}(n)$ operations without approximation, with $n$ being the number of observations. The acceleration with no approximation enables the approach to be computationally feasible with a massive number of crowdsourced measurements. Both linear and nonlinear analyses suggest that measured internet speed gradually improves over time.

The rest of this paper is organized as follows. In Section 2, we introduce the two primary data sets used in this study. In Section 3, we investigate the regional sampling bias and its association with demographic variables. We introduce two ways of correcting the sampling bias and compare the estimates from the original and bias-corrected samples. Section 4 introduces regression analysis of measured internet speed using demographic profiles at different census block groups for both the original and bias-corrected samples. Section 5 studies the change of measured internet speeds over time using both linear and nonlinear estimation. We conclude the paper in Section 6. We provide proofs and report additional results in the Appendix.

2 Description of data

We combine Ookla® internet speed measurement data [11] and demography data provided by the U.S. Census Bureau for our study. In the following sections, we describe each dataset in detail.

2.1 Ookla’s Speedtest

Ookla’s Speedtest (http://speedtest.net; data provided through Ookla’s Speedtest Intelligence®) has over 16k measurement servers worldwide [13] and allows users to assess the quality of their Internet connection using either a web-based portal or native mobile application [11]. For each Speedtest, a nearby test server is selected, and (potentially multiple) TCP connections are used to calculate the throughput of the path. Ookla’s Speedtest Intelligence dataset contains individual Speedtest measurements that include QoS metrics (up/down throughput, latency, packet loss, jitter), as well as meta-features such as ISP, device type, and access type. Ookla provides performance data aggregated over time and space to the public [12]. However, our Data Use Agreement (DUA) with Ookla provides us access to nearly 1 million individual Speedtest measurements from four major metropolitan cities in the U.S., which we use for this study. Each of these cities has a population in the range of 370k–650k.

We use the data obtained through our DUA to investigate the download speed data from two different device types, iOS and Android. We focus on these two device types due to the richness of metadata and geographical information associated with each data point, as well as due to the widespread popularity and usage of these platforms. For demonstration purposes, we use two representative cities for our detailed analysis. Due to our DUA, we maintain the confidentiality of these cities and only refer to them as City A and City B. We also provide numerical analysis of two additional cities, referred to as City C and City D, in Appendix B. Since precise geographic location is crucial for our study, we ensure each measurement includes latitude/longitude data recorded at the time the Speedtest is conducted; truncated GPS coordinates are reported for users who allow location sharing on their equipment. We utilize the geographic location of each Speedtest to associate the test with the relevant census block group. Table 1 shows summary statistics for the data used in this paper. We observe that in every listed city, iOS devices outnumber Android devices in our dataset; across the four cities, iOS yields about 1.7 times as many Speedtest data points as Android, with the per-city ratio ranging from roughly 1.3 (City B) to 2.3 (City A).

Table 1: Overview of Ookla data.
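| City | Device type | # of unique devices | # of tests | Date range |
|------|-------------|---------------------|------------|------------|
| A | iOS | 23,649 | 165,229 | May 31, 2020 – Dec 31, 2021 |
| A | Android | 8,438 | 73,094 | May 31, 2020 – Dec 31, 2021 |
| B | iOS | 15,507 | 92,526 | May 31, 2020 – Dec 31, 2021 |
| B | Android | 8,553 | 70,039 | May 31, 2020 – Dec 31, 2021 |
| C | iOS | 34,005 | 213,770 | Jan 1, 2021 – Dec 31, 2021 |
| C | Android | 14,730 | 117,806 | Jan 1, 2021 – Dec 31, 2021 |
| D | iOS | 20,607 | 146,703 | Jan 1, 2021 – Dec 31, 2021 |
| D | Android | 11,122 | 98,934 | Jan 1, 2021 – Dec 31, 2021 |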

2.2 Demographic data from the U.S. Census Bureau

To study demographic trends with performance data, we leverage data from the U.S. Census Bureau. In particular, we select the 2019 Census demographic information due to the inaccuracy of data since the COVID-19 pandemic [8]. For City A and B, we obtain population data for each census block group as well as the following demographic attributes: income, age, sex, education level, internet penetration rate, and ethnicity distribution. Income and age are represented by the median household income and median age, respectively. Sex information is characterized by the proportion of males within the total population of each census block group. The level of education is quantified as the proportion of individuals with a bachelor’s degree or higher among the population aged 25 and older. The internet penetration rate is the percentage of households with an internet subscription out of the total number of households of each census block group. Ethnic distribution is summarized by the proportion of white, black, Asian, and Hispanic populations within the total population of each census block group.

The demographic information from the U.S. Census Bureau is only available at the level of the census block group. Therefore, we build an integrated dataset with a spatial granular structure, where speed measurements identified in each census block group are correlated with the demographic profiles for that census block group based on U.S. Census Bureau data.
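The integration step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each speed test has already been mapped from its latitude/longitude to a block-group identifier, and every field name and value (`block_group`, `median_income`, `download_mbps`, and so on) is hypothetical.

```python
from collections import defaultdict

# Hypothetical census table: block-group id -> demographic attributes.
census = {
    "bg_001": {"population": 1200, "median_income": 85000, "median_age": 34.5},
    "bg_002": {"population": 800, "median_income": 52000, "median_age": 41.0},
}

# Hypothetical speed tests, each already assigned to a block group.
speedtests = [
    {"block_group": "bg_001", "device": "iOS", "download_mbps": 212.4},
    {"block_group": "bg_001", "device": "Android", "download_mbps": 95.1},
    {"block_group": "bg_002", "device": "iOS", "download_mbps": 48.7},
    {"block_group": "bg_999", "device": "iOS", "download_mbps": 10.0},  # outside the study area
]


def attach_demographics(tests, census_by_bg):
    """Return one record per test, merged with its block group's attributes."""
    merged = []
    for test in tests:
        demo = census_by_bg.get(test["block_group"])
        if demo is None:
            continue  # drop tests that fall outside the studied block groups
        merged.append({**test, **demo})
    return merged


def count_tests_per_block_group(tests):
    """Sample size n_i for each block group that has at least one test."""
    counts = defaultdict(int)
    for test in tests:
        counts[test["block_group"]] += 1
    return dict(counts)


merged = attach_demographics(speedtests, census)
sizes = count_tests_per_block_group(merged)
```

Every test in the same block group carries identical demographic values, which is the spatial granular structure the text refers to: the response varies per test while the covariates vary only per block group.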

3 Regional sampling bias detection and correction

In this section, we explore the presence of sampling bias across regions within cities A and B. In the context of a given city, a “region” refers to a census block group, which constitutes a sub-area within that city. If the distribution of the number of samples does not align proportionally with the population, then we say there exists regional sampling bias within our sample, which can render our statistical analysis divergent from the truth at the city level.

3.1 Comparison of sample and population distributions at the census block group level

Each Ookla Speedtest measurement is associated with a certain census block group based on the geographic location of the test. By comparing the distribution of the sample sizes across census block groups with their population distribution, we can test whether the sample sizes are homogeneous with the population across regions. As described in Section 2, the population of each census block group is based on data from the U.S. Census Bureau. We plot the cumulative distributions and probability mass functions between the sample and population at the level of census block group in Fig. 1 for City A and B, where the census block groups are ordered according to their population sizes. We observe that there is a recognizable discrepancy between the distribution of sample sizes from both Android and iOS in Ookla Speedtests and the population at each census block group. The comparison suggests that there is a regional sampling bias in the Speedtest measurements, as certain census block groups contain disproportionately larger samples than others.

Figure 1: Comparison of the distributions of population and sample sizes. Census block groups are ordered left to right from largest to smallest population. Cumulative distribution functions are represented for City A and B in (a) and (b), respectively, while probability mass functions for each city are in (c) and (d), respectively. The x-axis indicates the rank of the population size in each census block group, and the census block group with the largest population has rank 1.

We utilize the chi-squared test, one of the most frequently used tests for homogeneity of the samples, to quantify whether the discrepancy between the population and sample size distributions across census block groups is significant. The two assumptions of the test are summarized as follows:

Assumption 1.

The internet speed tests are independent within and between census block groups.

Assumption 2.

The internet speed tests are identically distributed within each census block group, and they are representative for each census block group.

Assumption 1 approximately holds in the data set, but it can be affected by internet traffic and availability. For instance, an internet outage event may affect more than one census block group, making speed tests not independent. Assumption 2 may also be violated, since certain groups of internet users may be more likely to utilize Speedtest; the relationship between internet speed and demographic profiles of the census block groups will be studied in Section 4.1.

We collect the sample of Ookla Speedtests from $k$ census block groups. For the $i$-th census block group, define $n_{i}$ and $N_{i}$ as the sizes of its sample and population, respectively, for $i=1,\cdots,k$. Furthermore, let $n$ and $N$ respectively be the total number of samples and the population among all the regions, i.e., $n=\sum_{i=1}^{k}n_{i}$ and $N=\sum_{i=1}^{k}N_{i}$. When the samples are drawn according to the population distribution, i.e., the null distribution, the expected number of samples in the $i$-th region is $\frac{nN_{i}}{N}$. Then, the $\chi^{2}$ test statistic is defined by:

$$W=\sum_{i=1}^{k}\frac{(n_{i}-nN_{i}/N)^{2}}{nN_{i}/N}. \tag{1}$$

Under the null hypothesis, where there is no regional sampling bias across different census block groups, $W\sim\chi^{2}_{k-1}$, i.e., the test statistic $W$ is distributed as a chi-squared distribution with $k-1$ degrees of freedom.

Table 2: Results from the $\chi^{2}$ homogeneity test.
| City | # of regions | Device type | Test statistic $W$ | p-value |
|------|--------------|-------------|--------------------|---------|
| A | 465 | iOS | 342,661 | $<10^{-16}$ |
| A | 465 | Android | 169,012 | $<10^{-16}$ |
| B | 868 | iOS | 398,812 | $<10^{-16}$ |
| B | 868 | Android | 167,411 | $<10^{-16}$ |

We find that the disparity observed in Fig. 1 is in line with the results from the statistical testing as shown in Table 2. The p-values for both iOS and Android devices in City A and B are extremely small, suggesting substantial inhomogeneity in the sample sizes relative to the population across regions.
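As an illustration of the test statistic in (1), the following minimal sketch computes $W$ and its degrees of freedom from per-block-group sample sizes and populations. The function name and the toy numbers are assumptions made for this example; they are not taken from the Ookla or Census data.

```python
def chi_squared_homogeneity(sample_sizes, populations):
    """Compute W from Eq. (1) and its degrees of freedom k - 1.

    sample_sizes[i] is n_i and populations[i] is N_i for block group i.
    """
    if len(sample_sizes) != len(populations):
        raise ValueError("need one sample size and one population per region")
    n = sum(sample_sizes)  # total number of tests
    N = sum(populations)   # total population
    w = 0.0
    for n_i, N_i in zip(sample_sizes, populations):
        expected = n * N_i / N  # expected count under the null hypothesis
        w += (n_i - expected) ** 2 / expected
    return w, len(sample_sizes) - 1


# Toy example with three regions (illustrative numbers only).
w, dof = chi_squared_homogeneity([50, 30, 20], [1000, 1000, 2000])
# Expected counts are 25, 25, 50, so W = 25 + 1 + 18 = 44 with 2 degrees of freedom.

# If SciPy is available, the p-value can be obtained as scipy.stats.chi2.sf(w, dof).
```

When the sample sizes are exactly proportional to the populations, every term vanishes and $W=0$; large values of $W$ relative to $k-1$ indicate regional sampling bias.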

3.2 Cumulative distribution functions of internet speed: before and after regional bias correction

Because we found significant evidence of the Speedtest distribution being disproportionate across regions, we introduce two ways to correct the regional sampling bias. By comparing the estimation of the cumulative distribution function from the original and bias-corrected samples, we can assess whether this sampling bias affects the estimation of the distribution of internet speed.

To define the empirical CDF of internet speed, let $y_{ij}$ represent the internet speed measurement from the $j$-th unit of the $i$-th census block group for $j=1,\cdots,n_{i}$ and $i=1,\cdots,k$. Since our estimation covers $k$ different census block groups, we define $y$ as a simple random sample of internet speed from a collection of these $k$ census block groups. Based on Assumptions 1 and 2, the simple random sample $y$ can be represented by a mixture of $k$ random variables $y_{1},y_{2},y_{3},\cdots,y_{k}$, which denote the internet speed measurement from the 1st, 2nd, 3rd, $\cdots$, and $k$-th census block group, respectively. Without any regional sampling bias, a simple random sample $y$ of internet speed can be expressed as

$$y=\sum_{i=1}^{k}y_{i}I(z_{i}=1),\ \text{with }\mathbf{z}=(z_{1},\cdots,z_{k})^{T}\sim\text{Multinom}\left(1,(N_{1}/N,\cdots,N_{k}/N)\right), \tag{2}$$

where $I(\cdot)$ refers to an indicator function, $z_{i}$ only takes either 0 or 1 for $i=1,\cdots,k$, $N=\sum_{i=1}^{k}N_{i}$ indicates the total population over all $k$ census block groups, and $\sum_{i=1}^{k}z_{i}=1$. A simple random sample is critical for inferring the population, and sampling bias can lead to misleading conclusions, such as those in the 2016 U.S. presidential election [29].

For a given internet speed $x$, we are interested in the cumulative probability of internet speed for the entire city:

$$F(x)=\mathbb{P}\left[y\leq x\right]. \tag{3}$$

Similarly, we define $F_{i}(\cdot)$ to be the cumulative distribution at the $i$-th census block group, for $i=1,\cdots,k$. The cumulative probability in (3) is a weighted sum of the regional cumulative probability by the corresponding population, as shown in Lemma 1.

Lemma 1.

The probability in (3) is equal to the weighted sum of $F_{i}(x)$, where $F_{i}(x)=\mathbb{P}\left[y_{i}\leq x\right]$ for $i=1,\cdots,k$, by its population proportion, i.e.

$$F(x)=\sum_{i=1}^{k}\frac{N_{i}}{N}F_{i}(x). \tag{4}$$

The proof is given in Appendix A.1. To estimate the cumulative distribution of the internet speed of a city given in (3), a conventional approach is to pool all data points and use the empirical CDF:

$$\hat{F}_{b}(x)=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}I(y_{ij}\leq x), \tag{5}$$

where $n=\sum_{i=1}^{k}n_{i}$ is the total number of samples. The subscript ‘$b$’ denotes bias, as this empirical CDF in (5) is a weighted sum of regional empirical CDFs by corresponding sample proportions:

$$\hat{F}_{b}(x)=\sum_{i=1}^{k}\frac{n_{i}}{n}\hat{F}_{i}(x), \tag{6}$$

where $\hat{F}_{i}(x)=\frac{1}{n_{i}}\sum_{j=1}^{n_{i}}I(y_{ij}\leq x)$ is the empirical CDF of the $i$-th census block group for $i=1,\cdots,k$. Note that $\hat{F}_{i}(x)$ converges to the CDF of census block group $i$ when the sample size goes to infinity, whereas $\hat{F}_{b}(x)$ typically does not converge to the CDF in (4) when the ratio of the sample $n_{i}/n$ does not converge to the ratio of the population $N_{i}/N$ for some census block group $i$.
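The equivalence between the pooled empirical CDF in (5) and the sample-proportion-weighted form in (6) can be checked numerically. The sketch below uses hypothetical helper names and toy speeds; it only illustrates the identity, under the assumption that every block group has at least one measurement.

```python
def ecdf(values, x):
    """Empirical CDF of one block group evaluated at speed x."""
    if not values:
        raise ValueError("block group has no measurements")
    return sum(v <= x for v in values) / len(values)


def pooled_ecdf(groups, x):
    """Eq. (5): pool every measurement and ignore block-group membership."""
    pooled = [v for group in groups for v in group]
    return ecdf(pooled, x)


def sample_weighted_ecdf(groups, x):
    """Eq. (6): weight each regional ECDF by its sample proportion n_i / n."""
    n = sum(len(group) for group in groups)
    return sum(len(group) / n * ecdf(group, x) for group in groups)


# Toy download speeds (Mbps) for two block groups; illustrative values only.
groups = [[10, 20, 30, 40], [100, 200]]
lhs = pooled_ecdf(groups, 35)           # 3 of the 6 pooled tests are <= 35
rhs = sample_weighted_ecdf(groups, 35)  # (4/6) * 0.75 + (2/6) * 0.0
```

Both expressions give 0.5 here. The over-sampled first block group dominates the estimate regardless of how many residents each block group actually has, which is the source of the bias.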

3.2.1 Re-weighing the sample for bias correction

A direct approach to correct the regional sampling bias is to construct an unbiased estimator for (3) simply by re-weighing the regional empirical CDFs. The revision of weights shall be based on the population proportions so that the unbiased estimator $\hat{F}_{u}$ for (3) is given by:

$$\hat{F}_{u}(x)=\sum_{i=1}^{k}\frac{N_{i}}{N}\hat{F}_{i}(x). \tag{7}$$

One can show that $\mathbb{E}\left[\hat{F}_{u}(x)\right]=\sum_{i=1}^{k}\frac{N_{i}}{N}F_{i}(x)=F(x)$, meaning that $\hat{F}_{u}(x)$ is the unbiased estimator of the CDF for the entire city when the samples at each census block group are simple random samples. The proof of unbiasedness of Equation (7) is provided in Appendix A.2.
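A minimal sketch of the re-weighted estimator in (7) follows. The function names, toy speeds, and populations are assumptions made for illustration, and the sketch presumes that every block group contributes at least one measurement.

```python
def ecdf(values, x):
    """Empirical CDF of one block group evaluated at speed x."""
    if not values:
        raise ValueError("block group has no measurements")
    return sum(v <= x for v in values) / len(values)


def reweighted_ecdf(groups, populations, x):
    """Eq. (7): weight each regional ECDF by its population share N_i / N."""
    if len(groups) != len(populations):
        raise ValueError("need one population per block group")
    N = sum(populations)
    return sum(N_i / N * ecdf(group, x) for group, N_i in zip(groups, populations))


# Toy data: the first block group is over-sampled relative to its population.
groups = [[10, 20, 30, 40], [100, 200]]  # download speeds in Mbps
populations = [1000, 3000]               # hypothetical N_1 and N_2

corrected = reweighted_ecdf(groups, populations, 35)  # 0.25 * 0.75 + 0.75 * 0.0
```

In this toy case the pooled estimate at 35 Mbps is 0.5, while the population-weighted estimate is 0.1875, because the slower block group holds only a quarter of the population but supplies two thirds of the tests.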

3.2.2 Re-sampling for bias correction

In some scenarios, having an unbiased or representative sample is important. Thus we introduce a re-sampling approach that provides a set of unbiased samples by correcting the regional sampling bias in the original sample. For the $i$-th census block group, we set the expected number $n_{i}^{*}$ of measurements in the re-sampled data to be:

$$n_{i}^{*}=\left[\frac{nN_{i}}{N}\right],\ \text{ for } i=1,\cdots,k, \tag{8}$$

where $[x]$ denotes the integer closest to a given real number $x$. In other words, we set the number of re-sampled measurements at each census block group according to that group’s share of the population. Then, for the $i$-th census block group, we draw $n_{i}^{*}$ internet speed measurements at random with replacement. Denote by $y_{i1}^{r},\cdots,y_{in_{i}^{*}}^{r}$ the re-sampled internet speed measurements at the $i$-th census block group for $i=1,\cdots,k$. Then, one can define the estimated CDF at a given internet speed $x$ from the re-sampled data as:

$$\begin{aligned}
\hat{F}^{*}_{u}(x) &= \frac{1}{n^{*}}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}^{*}}I\left(y_{ij}^{r}\leq x\right) \\
&= \sum_{i=1}^{k}\frac{n_{i}^{*}}{n^{*}}\hat{F}^{*}_{i}(x),
\end{aligned} \tag{9}$$

where $n^{*}=\sum_{i=1}^{k}n_{i}^{*}$ and $\hat{F}^{*}_{i}(x)=\frac{1}{n_{i}^{*}}\sum_{j=1}^{n_{i}^{*}}I\left(y_{ij}^{r}\leq x\right)$ is the empirical CDF of the re-sampled data in the $i$-th census block group. The following lemma shows that both the re-sampled estimator in (9) and the re-weighted estimator in (7) converge to the true CDF of the internet speed in a city as the sample size goes to infinity.

Lemma 2.

Assume that Assumptions 1 and 2 hold and that the number $k$ of census block groups and the population $N_{i}$ for each $i=1,\cdots,k$ remain fixed. Then, for a given $x\in\mathbb{R}$, we have

$$\begin{aligned}
\hat{F}_{u}(x) &\xrightarrow{\mathbb{P}} F(x) \text{ as well as} \\
\hat{F}^{*}_{u}(x) &\xrightarrow{\mathbb{P}} F(x),
\end{aligned} \tag{10}$$

as $n_{i}\rightarrow\infty$ for each $i=1,\cdots,k$, so that $n\rightarrow\infty$, with $\xrightarrow{\mathbb{P}}$ denoting convergence in probability.

Lemma 2 is proved in Appendix A.3. The assessment of the uncertainty of the different methods for estimating the CDF is provided in Appendix A.4. According to Lemma 2, as long as Assumptions 1 and 2 hold, the re-weighted estimator in (7) and the re-sampled estimator in (9) are both consistent for the CDF of internet speed in a city, given a sufficiently large sample for each census block group. This lets us construct a data set adjusted for regional sampling bias, so that reference results for other analyses can be generated by applying the same analytic procedures to the re-sampled data.
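The re-sampling procedure of Section 3.2.2 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names and toy data are hypothetical, ties in the "closest integer" rule of (8) are rounded up because the text does not specify tie-breaking, and every census block group is assumed to contain at least one original measurement.

```python
import math
import random

def target_sizes(sample_sizes, populations):
    """Equation (8): n_i* = [n * N_i / N], the closest integer."""
    n = sum(sample_sizes.values())
    total_pop = sum(populations.values())
    # floor(x + 0.5) rounds halves up; the tie-breaking rule is an assumption.
    return {g: math.floor(n * populations[g] / total_pop + 0.5) for g in sample_sizes}

def resample(samples_by_group, populations, seed=0):
    """Draw n_i* measurements with replacement from each group's original sample."""
    rng = random.Random(seed)
    sizes = target_sizes({g: len(ys) for g, ys in samples_by_group.items()}, populations)
    return {g: rng.choices(samples_by_group[g], k=sizes[g]) for g in samples_by_group}

def pooled_cdf(samples_by_group, x):
    """Equation (9): empirical CDF of all re-sampled measurements taken together."""
    pooled = [y for ys in samples_by_group.values() for y in ys]
    return sum(y <= x for y in pooled) / len(pooled)

# Hypothetical toy data: group "a" is over-sampled, group "b" is under-sampled.
speeds = {"a": [10, 20, 30, 40, 50, 60, 70, 80], "b": [5, 15]}
population = {"a": 1000, "b": 4000}

sizes = target_sizes({g: len(ys) for g, ys in speeds.items()}, population)
resampled = resample(speeds, population)
print(sizes)  # {'a': 2, 'b': 8}
print(pooled_cdf(resampled, 25))
```

Because group "b" has only two original measurements, its eight re-sampled values are repeats of those two. The consistency result in Lemma 2 relies on each $n_{i}$ being large, so a group this small would be poorly represented in practice.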

Figure 2: Comparison of the empirical CDF from the original samples with the re-weighted empirical CDF from the original samples and the empirical CDF from the re-sampled data. (a) iOS devices in City A; (b) Android devices in City A; (c) iOS devices in City B; and (d) Android devices in City B. The insets show zoomed-in plots of the estimated CDFs of the internet download speed above the 98th percentile from the original samples.

In Fig. 2, we compare three different CDF estimates: the empirical CDF from the original samples in (6), the re-weighted empirical CDF in (7), and the empirical CDF from the re-sampled data in (9). We find that the estimates from the three methods are not identical, but the differences are much smaller than the discrepancy arising from regional sampling bias shown in Fig. 1. This result indicates that even though the Ookla data contain substantial regional sampling bias among census block groups, such bias does not have a large impact on estimating the overall distribution of the internet speed in a city.

3.3 Association of regional bias with demographic variables

While the impact of regional sampling bias over different census block groups on the estimation of the CDF is not as pronounced as the regional bias itself, this impact depends on whether the samples within the census block groups are representative and independent, as outlined in Assumptions 1 and 2. These assumptions may not strictly hold in practice. For instance, if a certain demographic group is over-sampled, the samples may not be representative in each census block group, violating Assumption 2. Thus, it is crucial to discover potential sampling bias among other sub-groups. We collect the demographic variable data for each census block group for the following variables: income, age, gender, educational level, internet penetration rate, and ethnicity. The variables that characterize the demographic profile of the $i$-th census block group, for each $i=1,\cdots,k$, are summarized in Table 3.

Table 3: List of demographic variables.
Notation    Description
$x_{1i}$    median household income
$x_{2i}$    median age
$x_{3i}$    % of people who are male
$x_{4i}$    % of people with a bachelor’s or higher degree
$x_{5i}$    % of households with internet subscription plans
$x_{6i}$    % of people who are identified as white
$x_{7i}$    % of people who are identified as black
$x_{8i}$    % of people who are identified as Asian
$x_{9i}$    % of people who are identified as Hispanic

3.3.1 Characterization and comparison of regions

For each demographic feature $x_{hi}$ where $h=1,\cdots,9$, we have $k$ corresponding values with the index set $\{1,2,3,\cdots,k\}$, as we collect samples from $k$ different census block groups. For each index $i$ where $i=1,\cdots,k$, we classify the census block group into one of two mutually exclusive groups:

  • the $i$-th census block group is called over-sampled if $n_{i}>n^{*}_{i}$; and

  • the $i$-th census block group is called under-sampled if $n_{i}<n^{*}_{i}$.

We note that there is no census block group where $n_{i}=n_{i}^{*}$ in either City A or City B, for $i=1,\cdots,k$. Let $\{o_{1},o_{2},\cdots,o_{k_{o}}\}$ and $\{u_{1},u_{2},\cdots,u_{k_{u}}\}$ denote the sets of indices for the over-sampled and under-sampled regions, with ‘$o$’ and ‘$u$’ denoting over-sampled and under-sampled, respectively, such that $\{o_{1},o_{2},\cdots,o_{k_{o}}\}\cup\{u_{1},u_{2},\cdots,u_{k_{u}}\}=\{1,\cdots,k\}$, $\{o_{1},o_{2},\cdots,o_{k_{o}}\}\cap\{u_{1},u_{2},\cdots,u_{k_{u}}\}=\emptyset$, and $k_{o}+k_{u}=k$. Accordingly, for the $h$-th demographic feature with $h=1,\cdots,9$, we have two separate groups, denoted as $\{x_{ho_{1}},x_{ho_{2}},\cdots,x_{ho_{k_{o}}}\}$ and $\{x_{hu_{1}},x_{hu_{2}},\cdots,x_{hu_{k_{u}}}\}$, for the over-sampled and under-sampled census block groups, respectively.

For the $h$-th demographic variable with $h=1,\cdots,9$, define $\bar{x}_{ho}=\frac{1}{k_{o}}\sum_{i=1}^{k_{o}}x_{ho_{i}}$ and $\bar{x}_{hu}=\frac{1}{k_{u}}\sum_{i=1}^{k_{u}}x_{hu_{i}}$, which represent the sample means of the over-sampled and under-sampled regions, respectively. To evaluate the difference in location between the two distributions, we apply a two-sample t-test with the test statistic $T$ given by:

$$T=\frac{\bar{x}_{ho}-\bar{x}_{hu}}{\sqrt{s^{2}\left(1/k_{o}+1/k_{u}\right)}}, \tag{11}$$

where $s^{2}=\left(\sum_{i=1}^{k_{o}}(x_{ho_{i}}-\bar{x}_{ho})^{2}+\sum_{i=1}^{k_{u}}(x_{hu_{i}}-\bar{x}_{hu})^{2}\right)/(k-2)$ is the pooled sample variance. Under the null hypothesis that the distributions of the over-sampled and under-sampled census block groups have the same location, the statistic $T\sim t_{k-2}$, a Student’s t-distribution [19] with $k-2$ degrees of freedom.
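The classification into over-sampled and under-sampled groups and the pooled-variance statistic in (11) can be sketched as follows. The function names and the six-group toy data are hypothetical and serve only to show the arithmetic; the sketch returns the statistic and its degrees of freedom and leaves the p-value to a statistics library.

```python
import math

def split_by_sampling(feature, n_obs, n_target):
    """Split one demographic feature by comparing n_i with n_i* for each block group."""
    over = [feature[i] for i in feature if n_obs[i] > n_target[i]]
    under = [feature[i] for i in feature if n_obs[i] < n_target[i]]
    return over, under

def pooled_t_statistic(over, under):
    """Equation (11): pooled-variance two-sample t statistic and degrees of freedom."""
    k_o, k_u = len(over), len(under)
    mean_o = sum(over) / k_o
    mean_u = sum(under) / k_u
    ss = sum((x - mean_o) ** 2 for x in over) + sum((x - mean_u) ** 2 for x in under)
    dof = k_o + k_u - 2   # k - 2 in the text
    s2 = ss / dof         # pooled variance s^2
    t = (mean_o - mean_u) / math.sqrt(s2 * (1 / k_o + 1 / k_u))
    return t, dof

# Hypothetical values of one feature (median age) in six census block groups.
median_age = {1: 41.0, 2: 39.0, 3: 43.0, 4: 33.0, 5: 35.0, 6: 31.0}
n_obs = {1: 120, 2: 90, 3: 150, 4: 20, 5: 35, 6: 10}    # observed n_i
n_target = {1: 70, 2: 60, 3: 80, 4: 75, 5: 70, 6: 70}   # population-based n_i*

over, under = split_by_sampling(median_age, n_obs, n_target)
t_stat, dof = pooled_t_statistic(over, under)
print(t_stat, dof)
```

A two-sided p-value would then come from the $t_{k-2}$ distribution. For instance, `scipy.stats.ttest_ind(over, under, equal_var=True)` computes the same pooled-variance test, assuming SciPy is available.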

Figure 3: Comparison of distributions of demographic variables between over-sampled and under-sampled census block groups. Significance is based on the p-values from the two-sample t-test: *** (p-value < 0.001); ** (p-value < 0.01); * (p-value < 0.05); and no mark for non-significant cases. (a) iOS devices from City A; (b) iOS devices from City B.
Figure 4: Comparison of distributions of demographic variables between over-sampled and under-sampled census block groups. Significance is based on the p-values from the two-sample t-test: *** (p-value < 0.001); ** (p-value < 0.01); * (p-value < 0.05); and no mark for non-significant cases. (a) Android devices from City A; (b) Android devices from City B.

In Fig. 3 and 4, we compare the distributions of demographic variables between over-sampled and under-sampled regions using boxplots, where the stars represent the significance level from the two-sample t-test. In both cities, we find that the sampling of Ookla’s Speedtest is indeed inhomogeneous over the demographic variables. Specifically, we observe in both cities that over-sampled census block groups, for either iOS or Android devices, tend to have a higher median age and a larger proportion of individuals with a bachelor’s degree or higher. However, City A (part (a) in Fig. 3 and 4) presents more prominent asymmetry than City B in many aspects. First, we find in City A that the over-sampled census block groups predominantly have higher income, a larger proportion of the population with a bachelor’s degree or higher, and a greater prevalence of households with internet subscriptions. Furthermore, the comparison of Cities A and B reveals the ethnic variety within and between cities. City A shows a contrast between census block groups with a higher percentage of white residents and those with a higher percentage of black residents, whereas most of the census block groups in City B have a high representation of black residents. This contrast in City A is related to Speedtest sampling: over-sampled regions tend to have a significantly higher representation of white residents and a lower representation of black residents, which is not the case in City B. Additionally, the over-sampled regions for iOS devices in City A tend to have a larger Hispanic population, whereas the difference is not significant in City B.

4 Correlating internet quality with demographic variables

Section 3.3 showed that Ookla Speedtests from our cities under study are unevenly sampled across different demographic groups. This result is important if there is a relationship between internet quality (e.g., speed) and demographic variables. Hence we now ask: does internet access quality have a relationship to demographic variables? To answer this question, we correlate the internet speed with the demographic profiles in each census block group for both the original data set and the re-sampled data set after correcting the regional sampling bias. Because demographic profiles of individuals who conducted Speedtests are not available, we will integrate the Speedtests and demographic variables at two different levels of spatial granularity in the next section.

4.1 Multiple linear regression

We begin with a multiple linear regression model that relates internet speed $y_{ij}$, the $j$-th Speedtest at census block group $i$, to the demographic variables $\mathbf{x}_{i}$:

$$y_{ij}=\mathbf{x}_{i}^{T}\boldsymbol{\beta}+\epsilon_{ij}, \tag{12}$$

where $\mathbf{x}_{i}=(1,x_{1i},x_{2i},\cdots,x_{pi})^{T}$ denotes a $(p+1)$-dimensional vector of predictors (such as income, age, level of education, etc.) with $p=9$, $\boldsymbol{\beta}=(\beta_{0},\beta_{1},\cdots,\beta_{p})^{T}\in\mathbb{R}^{p+1}$ is a vector of linear coefficients, and $\epsilon_{ij}\sim\mathcal{N}(0,\sigma^{2})$ denotes Gaussian white noise with variance $\sigma^{2}$, for $j=1,\cdots,n_{i}$ and $i=1,\cdots,k$. Note that the observations from the $i$-th census block group, denoted as $y_{i1},\cdots,y_{in_{i}}$, share the same predictor vector $\mathbf{x}_{i}$ for $i=1,\cdots,k$. It follows that the model (12) is equivalent to:

$$\mathbf{y}=\mathbf{V}\boldsymbol{\beta}+\boldsymbol{\epsilon}, \tag{13}$$

where $\mathbf{y}=(y_{11},\cdots,y_{1n_{1}},\cdots,y_{k1},\cdots,y_{kn_{k}})^{T}\in\mathbb{R}^{n}$ with $n=\sum_{i=1}^{k}n_{i}$, $\mathbf{V}=[\mathbf{x}_{1}\mathbf{1}_{n_{1}}^{T},\mathbf{x}_{2}\mathbf{1}_{n_{2}}^{T},\cdots,\mathbf{x}_{k}\mathbf{1}_{n_{k}}^{T}]^{T}\in\mathbb{R}^{n\times(p+1)}$, $\mathbf{1}_{d}$ denotes a $d$-dimensional column vector with all elements being 1, and $\boldsymbol{\epsilon}=(\epsilon_{11},\cdots,\epsilon_{1n_{1}},\cdots,\epsilon_{k1},\cdots,\epsilon_{kn_{k}})^{T}\sim\mathcal{MN}(\mathbf{0},\sigma^{2}\mathbf{I}_{n})$ with $\mathbf{I}_{n}$ being the $n\times n$ identity matrix.

Because model (12) involves two different levels of spatial granularity (individual Speedtests and census block groups), an aggregated version of the model can be used to equalize the levels without affecting the estimation of $\boldsymbol{\beta}$:

$$\bar{y}_{i}=\mathbf{x}^{T}_{i}\boldsymbol{\beta}+\epsilon_{i}, \tag{14}$$

where $\bar{y}_{i}=\frac{1}{n_{i}}\sum_{j=1}^{n_{i}}y_{ij}$, and $\epsilon_{i}$ independently follows $\mathcal{N}(0,\sigma^{2}/n_{i})$ for $i=1,\cdots,k$. Model (14) can be written in matrix form:

$$\bar{\mathbf{y}}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}_{\text{agg}}, \tag{15}$$

where $\bar{\mathbf{y}}=(\bar{y}_{1},\cdots,\bar{y}_{k})^{T}\in\mathbb{R}^{k}$, $\mathbf{X}=[\mathbf{x}_{1},\cdots,\mathbf{x}_{k}]^{T}\in\mathbb{R}^{k\times(p+1)}$, $\boldsymbol{\epsilon}_{\text{agg}}=(\epsilon_{1},\cdots,\epsilon_{k})^{T}\sim\mathcal{MN}(\mathbf{0},\sigma^{2}\mathbf{W}^{-2})$, and $\mathbf{W}$ is a $k\times k$ diagonal matrix with diagonal entries $\sqrt{n_{1}},\cdots,\sqrt{n_{k}}$. We have the following lemma that connects the individual-level model (12) and the aggregated-level model (14).
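Before the formal statement, the link between models (12) and (14) can be checked numerically. The sketch below is a simplified illustration, not the paper's analysis: it uses a single hypothetical predictor plus an intercept instead of the nine demographic variables, toy data, and a closed-form weighted least-squares fit. It compares ordinary least squares on the individual measurements with weighted least squares on the group means using weights $n_{i}$, which matches the noise variance $\sigma^{2}/n_{i}$ in (14).

```python
def weighted_line_fit(x, y, w):
    """Weighted least squares for y = b0 + b1 * x with weights w (closed form)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Hypothetical data: one predictor per block group, several speed tests per group.
x_group = [1.0, 2.0, 4.0]
y_group = [[10.0, 14.0], [15.0, 17.0, 19.0, 21.0], [30.0]]

# Individual-level model (12): each test repeats its group's predictor, unit weights.
x_ind = [xg for xg, ys in zip(x_group, y_group) for _ in ys]
y_ind = [y for ys in y_group for y in ys]
beta_individual = weighted_line_fit(x_ind, y_ind, [1.0] * len(y_ind))

# Aggregated model (14): group means with weights n_i, since Var(eps_i) = sigma^2 / n_i.
y_mean = [sum(ys) / len(ys) for ys in y_group]
n_i = [float(len(ys)) for ys in y_group]
beta_aggregated = weighted_line_fit(x_group, y_mean, n_i)

print(beta_individual, beta_aggregated)
```

The two coefficient pairs agree up to floating-point error, because the within-group deviations $y_{ij}-\bar{y}_{i}$ do not involve $\boldsymbol{\beta}$ and therefore drop out of the least-squares problem.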

Lemma 3.

Suppose the predictor vectors are given, i.e., we know the vector $\mathbf{x}_{i}$ for $i=1,\cdots,k$. Let $l$ and $\bar{l}$ be the natural logarithm of the likelihood of models (12) and (14), respectively. Then, we have:

$$l(\boldsymbol{\beta},\sigma^{2})=\bar{l}(\boldsymbol{\beta},\sigma^{2})+c_{\sigma^{2}}, \tag{16}$$

where $\bar{l}(\boldsymbol{\beta},\sigma^{2})=-k\log\sqrt{2\pi\sigma^{2}}+\sum_{i=1}^{k}\log\sqrt{n_{i}}-\sum_{i=1}^{k}n_{i}(\bar{y}_{i}-\mathbf{x}_{i}^{T}\boldsymbol{\beta})^{2}/(2\sigma^{2})$ and $c_{\sigma^{2}}=-(n-k)\log\sqrt{2\pi\sigma^{2}}-\sum_{i=1}^{k}\log\sqrt{n_{i}}-\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}(y_{ij}-\bar{y}_{i})^{2}/(2\sigma^{2})$.

Lemma 3 is derived in Appendix A.5. Equation (16) in Lemma 3 means that the difference between the log-likelihoods of models (12) and (14) does not depend on the linear coefficients $\boldsymbol{\beta}$, and only depends on the variance $\sigma^{2}$. Thus the sufficient statistics of the linear coefficients [19] are $(\bar{\mathbf{y}},\sigma^{2})$, with $\bar{\mathbf{y}}=(\bar{y}_{1},\cdots,\bar{y}_{k})^{T}$ being the aggregated data. Note that the maximum likelihood estimator of the linear coefficients does not depend on $\sigma^{2}$, whereas the uncertainty of the estimation depends on the noise variance.
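The statement of Lemma 3 can be checked numerically. The following is a minimal sketch, assuming Gaussian noise and a toy data set with hypothetical group sizes and covariates; the function names (`loglik_individual`, `loglik_aggregated`, `c_sigma2`) are illustrative and not from the paper. It evaluates both log-likelihoods at two very different coefficient vectors and compares the gap with the constant computed from the within-group sums of squares.

```python
import math
import random

def linear(beta, x_i):
    """Linear predictor x_i^T beta."""
    return sum(b * v for b, v in zip(beta, x_i))

def loglik_individual(groups, x, beta, sigma2):
    """Log-likelihood when every y_ij ~ N(x_i^T beta, sigma2)."""
    total = 0.0
    for y_i, x_i in zip(groups, x):
        mu = linear(beta, x_i)
        for y in y_i:
            total += -0.5 * math.log(2 * math.pi * sigma2) - (y - mu) ** 2 / (2 * sigma2)
    return total

def loglik_aggregated(groups, x, beta, sigma2):
    """Log-likelihood when every group mean ybar_i ~ N(x_i^T beta, sigma2 / n_i)."""
    total = 0.0
    for y_i, x_i in zip(groups, x):
        n_i = len(y_i)
        ybar = sum(y_i) / n_i
        total += (-0.5 * math.log(2 * math.pi * sigma2) + 0.5 * math.log(n_i)
                  - n_i * (ybar - linear(beta, x_i)) ** 2 / (2 * sigma2))
    return total

def c_sigma2(groups, sigma2):
    """-(n-k) log sqrt(2 pi sigma2) - sum_i log sqrt(n_i) - within-group SS / (2 sigma2)."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    within = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    return (-(n - k) * 0.5 * math.log(2 * math.pi * sigma2)
            - sum(0.5 * math.log(len(g)) for g in groups)
            - within / (2 * sigma2))

# Hypothetical toy data: four groups, an intercept and one covariate per group.
random.seed(0)
x = [(1.0, 0.2), (1.0, 1.5), (1.0, -0.7), (1.0, 3.0)]
sizes = [5, 12, 3, 8]
groups = [[random.gauss(2.0 + 0.5 * xi[1], 1.0) for _ in range(n)] for xi, n in zip(x, sizes)]

sigma2 = 1.3
beta_a, beta_b = (2.0, 0.5), (-1.0, 4.0)
gap_a = loglik_individual(groups, x, beta_a, sigma2) - loglik_aggregated(groups, x, beta_a, sigma2)
gap_b = loglik_individual(groups, x, beta_b, sigma2) - loglik_aggregated(groups, x, beta_b, sigma2)
c_const = c_sigma2(groups, sigma2)
print(gap_a, gap_b, c_const)
```

The three printed numbers should agree up to floating-point error, which illustrates that the gap between the two log-likelihoods is free of the regression coefficients.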

In practice, however, the variance parameter $\sigma^{2}$ of the noise is also unknown. One may be tempted to use the aggregated model (15) to estimate $\sigma^{2}$. The lemma below indicates that the estimation of the variance of the noise by individual-level data is more efficient than that by the aggregated data.

Lemma 4.

Define $\hat{\sigma}^{2}=\mathbf{y}^{T}(\mathbf{I}_{n}-\mathbf{J})\mathbf{y}/(n-p-1)$ and $\hat{\sigma}^{2}_{\text{agg}}=\bar{\mathbf{y}}^{T}\mathbf{W}(\mathbf{I}_{k}-\mathbf{H})\mathbf{W}\bar{\mathbf{y}}/(k-p-1)$, where $\mathbf{J}=\mathbf{V}(\mathbf{V}^{T}\mathbf{V})^{-1}\mathbf{V}^{T}$ and $\mathbf{H}=\mathbf{W}\mathbf{X}(\mathbf{X}^{T}\mathbf{W}^{2}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{W}$, following the notation from (13) and (15). Note that both $\hat{\sigma}^{2}$ and $\hat{\sigma}^{2}_{\text{agg}}$ are unbiased for $\sigma^{2}$, i.e., $\mathbb{E}\left[\hat{\sigma}^{2}\right]=\mathbb{E}\left[\hat{\sigma}^{2}_{\text{agg}}\right]=\sigma^{2}$. However,

$$\text{Var}\left[\hat{\sigma}^{2}\right]=\frac{2\sigma^{4}}{n-p-1}<\frac{2\sigma^{4}}{k-p-1}=\text{Var}\left[\hat{\sigma}^{2}_{\text{agg}}\right],$$

as long as $n>k$.
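A small Monte Carlo experiment can illustrate Lemma 4. The sketch below is not the authors' code; it assumes one covariate plus an intercept ($p=1$), six hypothetical groups of ten measurements each, and Gaussian noise with $\sigma=2$. It uses the fact that ordinary least squares on the individual-level data and weighted least squares on the group means (weights $n_i$) give the same coefficient estimates, and then compares the two residual-based variance estimators.

```python
import random
import statistics

def variance_estimates(x, sizes, beta0, beta1, sigma, rng):
    """Simulate one data set and return (sigma2_hat, sigma2_hat_agg)."""
    groups = [[rng.gauss(beta0 + beta1 * xi, sigma) for _ in range(n)]
              for xi, n in zip(x, sizes)]
    n, k, p = sum(sizes), len(sizes), 1
    ybar = [sum(g) / len(g) for g in groups]
    # Weighted least squares on the group means; this coincides with
    # ordinary least squares on the individual-level observations.
    xw = sum(ni * xi for ni, xi in zip(sizes, x)) / n
    yw = sum(ni * yi for ni, yi in zip(sizes, ybar)) / n
    b1 = (sum(ni * (xi - xw) * (yi - yw) for ni, xi, yi in zip(sizes, x, ybar))
          / sum(ni * (xi - xw) ** 2 for ni, xi in zip(sizes, x)))
    b0 = yw - b1 * xw
    # Residual sums of squares at the individual and at the aggregated level.
    rss_ind = sum((y - b0 - b1 * xi) ** 2 for g, xi in zip(groups, x) for y in g)
    rss_agg = sum(ni * (yi - b0 - b1 * xi) ** 2 for ni, xi, yi in zip(sizes, x, ybar))
    return rss_ind / (n - p - 1), rss_agg / (k - p - 1)

rng = random.Random(1)
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]       # hypothetical group-level covariate
sizes = [10, 10, 10, 10, 10, 10]         # n = 60, k = 6
sigma = 2.0
draws = [variance_estimates(x, sizes, 1.0, 0.5, sigma, rng) for _ in range(4000)]
ind = [d[0] for d in draws]
agg = [d[1] for d in draws]

print("means:", statistics.mean(ind), statistics.mean(agg))
print("simulated variances:", statistics.variance(ind), statistics.variance(agg))
print("Lemma 4 variances:", 2 * sigma ** 4 / (60 - 2), 2 * sigma ** 4 / (6 - 2))
```

Both estimators should average close to $\sigma^{2}=4$, while the simulated variance of the aggregated estimator should be far larger than that of the individual-level estimator, in line with the ratio $(n-p-1)/(k-p-1)$.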

Thus, we adopt the formulation of model (12) to assess the linear relationship between internet download speed (Mbps) and demographic features. We apply the regression approaches for both the original samples and the re-sampled data studied in Section 3.2.2. Similarly, for the re-sampled data, we construct the linear regression as follows:

$$y_{ij}^{r}=\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{r}+\epsilon_{ij}^{r}, \tag{17}$$

where $\boldsymbol{\beta}^{r}=(\beta_{0}^{r},\beta_{1}^{r},\cdots,\beta_{p}^{r})^{T}\in\mathbb{R}^{(p+1)}$, and $\epsilon_{ij}^{r}$ independently follows $\mathcal{N}(0,\sigma_{r}^{2})$ for $j=1,\cdots,n_{i}^{*}$, and $i=1,\cdots,k$.

4.2 Model selection

Instead of directly comparing the full models presented in (12) and (17), we employ a backward model selection procedure to select demographic variables that significantly impact measured internet speeds based on the Akaike Information Criterion (AIC) [17]. Let $\mathcal{M}_{m}$ denote a model with index $m$, and define $d_{m}$ as the dimension of model $\mathcal{M}_{m}$. The AIC for $\mathcal{M}_{m}$ is defined as follows:

$$\text{AIC}_{m}=-2\hat{l}_{n,m}+2d_{m}, \tag{18}$$

where $\hat{l}_{n,m}$ represents the maximum log-likelihood of model $\mathcal{M}_{m}$ given the $n$ observations. Commencing with a full model featuring $p$ predictors, we consider $p$ distinct submodels created by eliminating one variable at a time. For each submodel, we compute the corresponding AIC using (18). If any submodel exhibits a smaller AIC than the existing model, we select that submodel for the subsequent stage and repeat the same process. If no submodel has a smaller AIC, we conclude with the existing model as the final selected model.
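The backward elimination loop can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it uses the standard AIC, $-2\hat{l}+2d$, for a Gaussian linear model with an intercept, counts the noise variance in the model dimension, and, when several submodels improve the AIC, keeps the one with the smallest value. The helper names (`solve`, `aic`, `backward_select`) and the simulated data are hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x

def aic(rows, y, cols):
    """AIC of the least-squares fit of y on an intercept and the columns in cols."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    q = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(q)] for i in range(q)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(q)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2 for d, yi in zip(design, y))
    n = len(y)
    # Maximised Gaussian log-likelihood with the variance estimated by rss / n.
    loglik = -0.5 * n * (math.log(2 * math.pi * rss / n) + 1.0)
    return -2.0 * loglik + 2.0 * (q + 1)   # q coefficients plus the noise variance

def backward_select(rows, y):
    """Drop one predictor at a time for as long as doing so lowers the AIC."""
    current = list(range(len(rows[0])))
    best = aic(rows, y, current)
    while current:
        scores = [(aic(rows, y, [c for c in current if c != drop]), drop) for drop in current]
        score, drop = min(scores)
        if score >= best:
            break
        best, current = score, [c for c in current if c != drop]
    return current, best

# Hypothetical data: four candidate predictors, only the first two affect y.
random.seed(3)
rows = [[random.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [1.0 + 2.0 * r[0] - 1.5 * r[1] + random.gauss(0, 1) for r in rows]
selected, score = backward_select(rows, y)
print(selected, score)
```

In this toy setting the two informative predictors (indices 0 and 1) are retained; whether the two noise predictors are dropped depends on the simulated sample, since AIC tolerates some overfitting.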

Figure 5: Comparison of regression coefficient estimates (dots) and 95% confidence intervals (bars) from original data and re-sampled data. Multiple linear regression is conducted with backward variable selection by AIC for both iOS and Android from (a) City A, and (b) City B. Only the variables selected from the model selection step are shown in the figure.

4.3 Disparity of internet quality among demographic groups

Fig. 5 provides the estimated linear coefficients from the regression analysis conducted on both the original and re-sampled data after backward elimination using AIC. It presents the coefficient estimates and their corresponding 95% confidence intervals for the selected variables; the variables dropped by the backward selection procedure are not shown. From the graphs, we can make multiple observations.

First, there is a significant negative correlation between measured internet speed and age in City A, as well as in City C and City D, which are shown in Fig. 12 in the Appendix. This means that internet speed is greater on average in census block groups with a younger population, given the other variables in the model. City B is an exception, as the effect of age is not clear.

Second, regions with a larger percentage of Hispanic population generally have lower measured internet speed for both cities and device types shown in Fig. 5. For the two cities shown in Fig. 12 in Appendix B.2, the effect of the percentage of the Hispanic population is not as clear as for the two cities shown in Fig. 5; however, after examining the pairwise correlation plots between the covariates in Figs. 14-15, we find that the percentage of the Hispanic population is negatively correlated with the percentage of residents with a bachelor’s degree and with the availability of the internet. Thus the effect of the Hispanic percentage of the population can be partly offset by these two effects. For instance, the coefficient of bachelor’s degree in the re-sampled data is significantly larger than zero in part (a) of Fig. 12, which implicitly suggests that the regions with a higher Hispanic percentage may have comparatively lower internet speed, as these regions tend to have a lower percentage of residents with bachelor’s degrees.

Third, we find that the regression coefficients for median income are positive for the original data of both device types in cities A and B, meaning that regions with higher income tend to have faster measured internet speed. This can likely be attributed to the availability of faster internet plans, as well as the higher purchasing power of local residents; prior work found that the median income of a census block group plays a critical role in determining whether a region gets a fiber deployment and consequently faster internet speeds [30]. The coefficients of median income in the re-sampled data are positive for City B but negative for City A, as shown in Fig. 5. In both cities, the linear coefficients of income in the re-sampled data are smaller than the ones from the original data. To further explore the difference, we find that the pairwise correlation between income and bachelor’s degree is strongly positive in both cities, as shown in Fig. 6. The estimated linear coefficients of the bachelor’s degree in the re-sampled data are larger than those in the original data for both devices and cities. Because bachelor’s degree and income are strongly positively correlated, the larger coefficient of the bachelor’s degree accounts for part of the positive impact on internet speed, which makes the coefficients of income smaller in the re-sampled data. Note that an estimated coefficient represents the conditional effect of a covariate given all other covariates in multiple linear regression. The effect of a covariate typically depends on the effects of other variables, as multicollinearity of the covariates is common in practice [28].

The regression analysis of Speedtest data from the two cities shows that measured internet speed critically depends on demographic profiles of regions, such as the income, education level and ethnic composition. Future work should study the reasons behind these associations, such as the availability of faster internet plans, and their cost per bit, e.g. carriage value [30].

Figure 6: Heatmap of pair-wise correlation coefficients between demographic covariates from re-sampled data for iOS devices in (a) City A and (b) City B.

5 Temporal progression of internet speed

Estimating time-dependent internet speed is important for assessing the change of internet quality over time. In this section, we investigate the temporal trend of measured internet speed using both linear regression analysis and Gaussian processes for modeling time sequences. We also compare the original samples and the bias-corrected samples to evaluate whether the regional sampling bias affects the estimation of temporal analysis. For both cities, we have data from 05-31-2020 to 12-31-2021.

5.1 Assessing the linear trend of internet speed

We first analyze the linear trends of internet speed. Let $y(t_{i})$ denote the measured internet speed at time point $t_{i}$ for $i=1,\cdots,n$; i.e., we have $n$ distinct time points and each $t_{i}$ corresponds to a positive real number indicating the time lapse from the starting date measured by days. The linear regression model of internet speed with an intercept and time as a covariate is given by:

$$y(t_{i})=\beta_{0t}+\beta_{1t}t_{i}+\varepsilon_{i}, \tag{19}$$

where $\varepsilon_{i}$ represents Gaussian white noise, with variance $\sigma_{\varepsilon}^{2}$ for $i=1,\cdots,n$. We focus on the temporal change rate measured by the linear coefficient over time $\beta_{1t}$, where the maximum likelihood estimator is the least squares estimator below:

$$\hat{\beta}_{1t}=\frac{\sum_{i=1}^{n}(t_{i}-\bar{t})(y(t_{i})-\bar{y})}{\sum_{i=1}^{n}(t_{i}-\bar{t})^{2}}, \tag{20}$$

where $\bar{t}=\frac{1}{n}\sum_{i=1}^{n}t_{i}$ and $\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y(t_{i})$.
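As a brief illustration of estimator (20), the sketch below computes the least squares slope for a hypothetical series that grows by exactly 0.05 Mbps per day; the function name `slope` and the numbers are made up for the example and are not taken from the Speedtest data.

```python
def slope(t, y):
    """Least squares slope of y on t, following the form of equation (20)."""
    t_bar = sum(t) / len(t)
    y_bar = sum(y) / len(y)
    num = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    den = sum((ti - t_bar) ** 2 for ti in t)
    return num / den

# Hypothetical measurement days (elapsed days since the start date) and speeds in Mbps.
days = [0, 30, 61, 92, 122]
speeds = [100.0 + 0.05 * d for d in days]
beta1_hat = slope(days, speeds)
print(beta1_hat)
```

Because the toy series is exactly linear, the estimate recovers the daily rate of change up to floating-point error; with noisy data the same formula returns the average change per day over the observation window.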

To evaluate the impact of regional sampling bias, we obtain a re-sampled data set with bias-correction introduced in Section 3.2.2. Let $t_{i}^{r}$ denote the distinct time point after re-sampling, for $i=1,\cdots,n^{r}$. We define $y(t_{i}^{r})$ as the re-sampled internet speed at time point $t_{i}^{r}$. The linear regression model using the re-sampled data is given by:

$$y(t_{i}^{r})=\beta_{0t}^{r}+\beta_{1t}^{r}t_{i}^{r}+\varepsilon_{i}^{r}, \tag{21}$$

where $\varepsilon_{i}^{r}$ represents a Gaussian white noise with variance $\sigma_{\varepsilon^{r}}^{2}$ for $i=1,\cdots,n^{r}$. The maximum likelihood estimator of $\beta_{1t}^{r}$ in model (21) follows:

$$\hat{\beta}_{1t}^{r}=\frac{\sum_{i=1}^{n^{r}}(t_{i}^{r}-\bar{t}^{r})(y(t_{i}^{r})-\bar{y}^{r})}{\sum_{i=1}^{n^{r}}(t_{i}^{r}-\bar{t}^{r})^{2}}, \tag{22}$$

where $\bar{t}^{r}=\frac{1}{n^{r}}\sum_{i=1}^{n^{r}}t_{i}^{r}$ and $\bar{y}^{r}=\frac{1}{n^{r}}\sum_{i=1}^{n^{r}}y(t_{i}^{r})$.

There are two limitations of the linear regression analysis. First, the estimated linear coefficients in (19) and (21) can only capture average change over a time period. Second, the assumption is that the residuals are independent over time, whereas the Speedtest measurements are temporally correlated. To avoid these limitations, we introduce Gaussian processes for modeling the time sequences and accelerate the computation by state space representation without approximation.

5.2 Modeling the internet speed by state space models

Internet speeds are temporally correlated. A common way to model the temporal or spatio-temporal data is by Gaussian processes (GPs) [33]. However, the complexity of computing the likelihood function and making predictions by GPs increases cubically fast along with the number of observations, due to computing the inversion and log determinant of the covariance matrix. In our study, the number of measurements for each device in a city is between $10^{5}$ and $10^{6}$, which makes directly computing the likelihood by GPs prohibitively slow. Fortunately, GPs with some widely used covariance functions, such as the Matérn covariance function [23] with half-integer roughness parameters, can be equivalently represented by linear state space models, which makes computational complexity linearly increase with respect to the number of observations without making any approximations [24]. We briefly introduce a GP model of Speedtest measurements and relate it to the state space model for fast computation for the original Speedtest observations. The fast algorithm through the state space representation can be similarly applied to the regional bias-corrected samples.

Suppose any internet speed measurement is modeled by a noisy Gaussian process, meaning that any marginal distribution at time $\{t_{1},...,t_{n}\}$ follows a multivariate normal distribution $(y(t_{1}),...,y(t_{n}))^{T}\sim\mathcal{N}(\mathbf{0},\sigma^{2}(\mathbf{R}+\eta\mathbf{I}_{n}))$, where $\sigma^{2}$ and $\eta$ are variance and nugget parameters, respectively, and $\mathbf{R}$ is a correlation matrix with the $(i,j)$th term parameterized by a kernel function $K(t_{i},t_{j})$. Denote $d=|t-t^{\prime}|$ as the distance between any time points $t$ and $t^{\prime}$. We focus on the Matérn covariance function, which has the expression:

$$\sigma^{2}K(d)=\sigma^{2}\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}d}{\gamma}\right)^{\nu}\mathcal{K}_{\nu}\left(\frac{\sqrt{2\nu}d}{\gamma}\right), \tag{23}$$

where $\Gamma(\cdot)$ is the gamma function, $\mathcal{K}_{\nu}(\cdot)$ is the modified Bessel function of the second kind with a positive parameter $\nu$, and $\gamma$ is a range or lengthscale parameter of the correlation. The Matérn covariance has a closed form expression when the roughness parameter is a half-integer, $\nu=\frac{2m+1}{2}$ for $m\in\mathbf{N}$. For instance, the Matérn with $\nu=5/2$ has the expression:

$$\sigma^{2}K(d)=\sigma^{2}\left(1+\frac{\sqrt{5}d}{\gamma}+\frac{5d^{2}}{3\gamma^{2}}\right)\exp\left(-\frac{\sqrt{5}d}{\gamma}\right). \tag{24}$$

An appealing feature of the Matérn covariance is that the process is $\lceil\nu\rceil-1$ times mean squared differentiable [21], so the smoothness of the process can be controlled by the roughness parameter.
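To make the linear-time state space computation concrete, the sketch below treats the simplest half-integer case, $\nu=1/2$, for which the Matérn correlation reduces to $K(d)=\exp(-d/\gamma)$ and the latent process is a first-order Markov (Ornstein-Uhlenbeck) process. This is a deliberate simplification and not the $\nu=5/2$ example given above, whose state vector has more than one component. A scalar Kalman filter evaluates the exact Gaussian log-likelihood in $O(n)$ operations, and the result is compared against a direct $O(n^{3})$ evaluation through a Cholesky factorization. The function names, parameter values, and toy data are hypothetical, and the time points are assumed to be sorted and distinct.

```python
import math

def kalman_loglik(t, y, sigma2, gamma, eta):
    """O(n) log-likelihood of a zero-mean GP with kernel sigma2 * exp(-d / gamma),
    observed with independent noise of variance sigma2 * eta."""
    mean, var = 0.0, sigma2               # stationary prior of the latent state at t[0]
    loglik = 0.0
    for i, yi in enumerate(y):
        if i > 0:                          # predict step between consecutive time points
            rho = math.exp(-(t[i] - t[i - 1]) / gamma)
            mean = rho * mean
            var = rho * rho * var + sigma2 * (1.0 - rho * rho)
        s = var + sigma2 * eta             # innovation variance
        resid = yi - mean
        loglik += -0.5 * (math.log(2.0 * math.pi * s) + resid * resid / s)
        gain = var / s                     # update step
        mean += gain * resid
        var *= 1.0 - gain
    return loglik

def direct_loglik(t, y, sigma2, gamma, eta):
    """O(n^3) reference computed from the full covariance matrix."""
    n = len(y)
    cov = [[sigma2 * (math.exp(-abs(t[i] - t[j]) / gamma) + (eta if i == j else 0.0))
            for j in range(n)] for i in range(n)]
    chol = [[0.0] * n for _ in range(n)]   # lower-triangular Cholesky factor
    for i in range(n):
        for j in range(i + 1):
            s = cov[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    z = []                                 # forward substitution: chol z = y
    for i in range(n):
        z.append((y[i] - sum(chol[i][k] * z[k] for k in range(i))) / chol[i][i])
    logdet = 2.0 * sum(math.log(chol[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + logdet + sum(v * v for v in z))

# Hypothetical, irregularly spaced time points and observations.
t = [0.0, 0.7, 1.5, 4.0, 4.2, 9.0]
y = [0.3, -0.1, 0.8, 1.2, 1.0, -0.5]
fast = kalman_loglik(t, y, sigma2=1.5, gamma=2.0, eta=0.1)
slow = direct_loglik(t, y, sigma2=1.5, gamma=2.0, eta=0.1)
print(fast, slow)
```

The two values should agree up to floating-point error, since the filter is an exact reformulation rather than an approximation. For larger half-integer $\nu$ the same recursion applies with a vector-valued state in place of the scalar one.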

Suppose we have internet speed measurements at $n$ time points, denoted as $\mathbf{y}=(y(t_{1}),...,y(t_{n}))^{T}$. A conventional method is to estimate the parameters by the maximum likelihood estimator. Note that given the range and nugget parameters $(\gamma,\eta)$, the maximum likelihood estimator for the variance is $\hat{\sigma}^{2}=S^{2}/n$, where $S^{2}=\mathbf{y}^{T}\mathbf{\tilde{R}}^{-1}\mathbf{y}$ with $\mathbf{\tilde{R}}=\mathbf{R}+\eta\mathbf{I}_{n}$. Plugging $\hat{\sigma}^{2}$ into the likelihood function, the profile likelihood [21] follows:

$$p(\mathbf{y}\mid\gamma,\eta,\hat{\sigma}^{2})\propto|\tilde{\mathbf{R}}|^{-1/2}|S^{2}|^{-n/2}. \tag{25}$$

The parameters can be obtained by maximizing the log profile likelihood: $(\hat{\gamma},\hat{\eta})=\operatorname{argmax}_{\gamma,\eta}\log p(\mathbf{y}\mid\gamma,\eta,\hat{\sigma}^{2})$. After obtaining the MLE, the predictive distribution at any time point $t$ follows a normal distribution:

$$(y(t)\mid\mathbf{y},\hat{\sigma}^{2},\hat{\eta},\hat{\gamma})\sim\mathcal{N}(\hat{y}(t),\hat{\sigma}^{2}K^{*}(t)), \tag{26}$$

where $\hat{y}(t)=\mathbf{r}^{T}(t)\tilde{\mathbf{R}}^{-1}\mathbf{y}$ with $\mathbf{r}(t)=(K(t,t_{1}),\ldots,K(t,t_{n}))^{T}$ and $K^{*}(t)=K(t,t)+\eta-\mathbf{r}^{T}(t)\tilde{\mathbf{R}}^{-1}\mathbf{r}(t)$. The predictive mean $\hat{y}(t)$ is often used for predicting $y(t)$, and the predictive intervals can be obtained from (26) for quantifying the uncertainty in prediction.
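As a rough illustration of Equations (24) to (26), the sketch below evaluates the profile likelihood and the predictive distribution with dense linear algebra. It is not the authors' implementation: the synthetic data, the function names (`matern52`, `r_tilde`, `log_profile_lik`, `predict`), and the coarse grid search used in place of a numerical optimizer are all assumptions made for the example.

```python
import numpy as np

def matern52(d, gamma):
    """Matern correlation K(d) with roughness 5/2, as in Eq. (24)."""
    a = np.sqrt(5.0) * np.abs(d) / gamma
    return (1.0 + a + a**2 / 3.0) * np.exp(-a)

def r_tilde(t, gamma, eta):
    """R~ = R + eta * I_n for the observation times t."""
    return matern52(t[:, None] - t[None, :], gamma) + eta * np.eye(len(t))

def log_profile_lik(t, y, gamma, eta):
    """Log of Eq. (25), up to an additive constant."""
    n = len(y)
    L = np.linalg.cholesky(r_tilde(t, gamma, eta))   # R~ = L L^T
    z = np.linalg.solve(L, y)                        # S^2 = z^T z
    log_det = 2.0 * np.log(np.diag(L)).sum()
    return -0.5 * log_det - 0.5 * n * np.log(z @ z)

def predict(t, y, t_new, gamma, eta):
    """Predictive mean and variance of Eq. (26) at the times t_new."""
    Rt = r_tilde(t, gamma, eta)
    sigma2_hat = y @ np.linalg.solve(Rt, y) / len(y)
    r = matern52(t_new[:, None] - t[None, :], gamma)  # row m is r(t_new[m])^T
    mean = r @ np.linalg.solve(Rt, y)
    # K*(t) = K(t, t) + eta - r^T R~^{-1} r, with K(t, t) = 1
    k_star = 1.0 + eta - np.sum(r * np.linalg.solve(Rt, r.T).T, axis=1)
    return mean, sigma2_hat * k_star

# Synthetic data (not from the paper): a smooth trend plus noise.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 10.0, size=60))
y = np.sin(t) + 0.2 * rng.standard_normal(t.size)

# Coarse grid search standing in for a numerical optimizer.
grid = [(g, e) for g in (0.5, 1.0, 2.0, 4.0) for e in (0.01, 0.05, 0.2)]
gamma_hat, eta_hat = max(grid, key=lambda p: log_profile_lik(t, y, *p))

t_new = np.linspace(0.0, 10.0, 5)
mean, var = predict(t, y, t_new, gamma_hat, eta_hat)
lower, upper = mean - 1.96 * np.sqrt(var), mean + 1.96 * np.sqrt(var)
```

Every call here factorizes an $n\times n$ matrix, so this version is only practical for small $n$.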

Figure 7: Comparison of temporal trend of internet download speed from original versus re-sampled data based on linear and Gaussian process regression. (a) iOS devices in City A; (b) Android devices in City A; (c) iOS devices in City B; and (d) Android devices in City B. The solid curves are the predictive means and the shaded areas are the 95% intervals of the estimation.

Directly computing the likelihood function or predictive distribution requires inverting an $n\times n$ covariance matrix, which has computational complexity $\mathcal{O}(n^{3})$. Here $n$ can be on the order of $10^{6}$, which makes direct computation of GP models prohibitive. Fortunately, a GP with a Matérn covariance with half-integer roughness parameter can be written as a set of stochastic differential equations, whose solution follows a continuous-time state space model [24]. For instance, the GP with a Matérn covariance with roughness 5/2 in (24) can be written as the state space model below [22]:

$$\begin{aligned} y(t_{i})&=\mathbf{F}\boldsymbol{\theta}(t_{i})+\epsilon_{i},\\ \boldsymbol{\theta}(t_{i})&=\mathbf{G}(t_{i})\boldsymbol{\theta}(t_{i-1})+\mathbf{w}(t_{i}), \end{aligned} \tag{27}$$

where $\mathbf{F}=(1,0,0)$, $\mathbf{w}(t_{i})\sim\mathcal{N}(0,\mathbf{W}(t_{i}))$ for $i=2,\ldots,n$, and the initial state follows $\boldsymbol{\theta}(t_{1})\sim\mathcal{MN}(\mathbf{0},\mathbf{W}(t_{1}))$. The closed-form expressions of $\mathbf{G}(t_{i})$ and $\mathbf{W}(t_{i})$ in (27) can be found in Appendix A in [20].

With the state space representation, we can compute the likelihood function and predictive distribution by the Kalman filter [25] and the RTS smoother [34]. This algorithm is commonly known as the forward filtering and backward smoothing (FFBS) algorithm [36, 32]. The details for computing the likelihood and predictive distribution for the state space representation of the GP with the covariance in (24) are provided in Lemma 2 in [20]. Computing the likelihood in (25) and the predictive distribution in (26) using the FFBS algorithm reduces the computational cost from $\mathcal{O}(n^{3})$ to $\mathcal{O}(n)$ operations without approximation. This computational advance enables us to estimate the nonlinear temporal trend of internet speed with a massive number of crowdsourced observations.
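The forward filtering step can be sketched as follows. The source defers the closed forms of $\mathbf{G}$ and $\mathbf{W}$ to [20], so the matrices below are an assumption: they come from the standard stochastic differential equation form of the Matérn 5/2 process, $(d/dt+\lambda)^{3}f=w$ with $\lambda=\sqrt{5}/\gamma$, and may be parameterized differently from [20]. The function names and the synthetic data are likewise hypothetical. The sketch computes the Gaussian log likelihood for given $(\sigma^{2},\gamma,\eta)$ in one pass over the data and compares it with the dense $\mathcal{O}(n^{3})$ computation.

```python
import numpy as np

def matern52_ssm(gamma, sigma2):
    """Drift matrix J and stationary state covariance P_inf (assumed form)."""
    lam = np.sqrt(5.0) / gamma
    J = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [-lam**3, -3.0 * lam**2, -3.0 * lam]])
    P_inf = sigma2 * np.array([[1.0, 0.0, -lam**2 / 3.0],
                               [0.0, lam**2 / 3.0, 0.0],
                               [-lam**2 / 3.0, 0.0, lam**4]])
    return lam, J, P_inf

def kalman_loglik(t, y, sigma2, gamma, eta):
    """Log likelihood via the Kalman filter: constant work per observation."""
    lam, J, P_inf = matern52_ssm(gamma, sigma2)
    N = J + lam * np.eye(3)          # nilpotent (N^3 = 0), so exp(J dt) is closed form
    F = np.array([1.0, 0.0, 0.0])
    m, P, ll = np.zeros(3), P_inf.copy(), 0.0
    for i in range(len(t)):
        if i > 0:                    # predict from t[i-1] to t[i]
            dt = t[i] - t[i - 1]
            G = np.exp(-lam * dt) * (np.eye(3) + N * dt + (N @ N) * dt**2 / 2.0)
            W = P_inf - G @ P_inf @ G.T
            m, P = G @ m, G @ P @ G.T + W
        resid = y[i] - F @ m         # one-step-ahead prediction error
        q = F @ P @ F + sigma2 * eta # its variance, including the nugget
        ll -= 0.5 * (np.log(2.0 * np.pi * q) + resid**2 / q)
        gain = P @ F / q             # update with observation i
        m, P = m + gain * resid, P - np.outer(gain, gain) * q
    return ll

def dense_loglik(t, y, sigma2, gamma, eta):
    """Reference log likelihood from the full covariance sigma2 * (R + eta I)."""
    a = np.sqrt(5.0) * np.abs(t[:, None] - t[None, :]) / gamma
    C = sigma2 * ((1.0 + a + a**2 / 3.0) * np.exp(-a) + eta * np.eye(len(t)))
    L = np.linalg.cholesky(C)
    z = np.linalg.solve(L, y)
    return -0.5 * z @ z - np.log(np.diag(L)).sum() - 0.5 * len(t) * np.log(2.0 * np.pi)

# Synthetic data (not from the paper), irregularly spaced in time.
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0.0, 20.0, size=200))
y = np.sin(t) + 0.3 * rng.standard_normal(t.size)
ll_fast = kalman_loglik(t, y, sigma2=1.0, gamma=2.0, eta=0.1)
ll_dense = dense_loglik(t, y, sigma2=1.0, gamma=2.0, eta=0.1)
```

Under these assumptions the two values agree up to floating-point error, which is the sense in which the state space route involves no approximation. The backward smoothing pass needed for predictions at arbitrary time points is omitted here.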

5.3 Temporal progression of measured internet speeds

We plot the estimated temporal progression of the download speed based on both linear and state space models in Fig. 7. Both models show that download speeds measured by Speedtest improve over time for both cities and device types. However, the improvements in download speed do not appear to be homogeneous over time. We find that a comparatively large improvement occurs at the beginning of 2021 for iOS devices in both cities, shown in parts (a) and (c) in Fig. 7. Speed improvement can also be found for Android devices, as shown in parts (b) and (d); however, the variation of the estimation for Android devices seems larger, as the sample size from Android devices is much smaller than that from iOS devices, particularly for City A, as shown in Table 1. The increase may be partly due to the acceleration of the deployment and marketing of fiber internet since 2020 [2], as we find that a high proportion of the Speedtest measurements have substantially faster speeds than others since late 2020, and the proportion grows over time.

Second, the estimation from re-sampled data suggests that the trend from iOS devices tends to be overestimated (parts (a) and (c) in Fig. 7). This overestimation is larger in City B, particularly during 2021, as both the linear and state space models show a noticeable difference in fit between the original and re-sampled data. We suspect that the overestimation is due to more Speedtests from high-speed internet plans, such as fiber plans, as subscribers of these plans may tend to submit more Speedtests to validate the speed they receive. Re-sampling among census block groups can address a part of this bias, as it samples more from regions with relatively smaller numbers of tests compared to their population, which may have lower-speed internet plans. The temporal trends of internet download speed for the two other cities are plotted in Figure 16 in Appendix B.3. The estimation by both the original and bias-corrected samples shows the improvement of download speeds over time, and the difference between the estimates is not large.

Table 4: Linear trends of measured internet download speed (Mbps) per day.

| City | Device type | Estimate $\hat{\beta}_{1t}$ (95% CI) | Estimate $\hat{\beta}_{1t}^{r}$ (95% CI) |
| --- | --- | --- | --- |
| A | iOS | 0.1149 (0.1088, 0.1209) | 0.0928 (0.0857, 0.0999) |
| A | Android | 0.1879 (0.1701, 0.2058) | 0.1788 (0.1546, 0.2029) |
| B | iOS | 0.1661 (0.1567, 0.1755) | 0.1225 (0.1110, 0.134) |
| B | Android | 0.1309 (0.1200, 0.1417) | 0.1370 (0.1232, 0.1509) |

Finally, we compare the estimates of the linear coefficients $\beta_{1t}$ and $\beta_{1t}^{r}$ for both iOS and Android devices from Cities A and B in Table 4. For both device types, the estimates are positive, which suggests that the download speed increases over time. The estimates of $\beta_{1t}^{r}$ from the bias-corrected samples are smaller than those of $\beta_{1t}$ in the first three rows, indicating that the improvement of internet speed may be slightly overestimated by the original data for these two cities. The estimates of $\beta_{1t}^{r}$ are similar to $\beta_{1t}$ for Android device download speeds in City B, whereas the intercept $\beta_{0t}^{r}$ is smaller than $\beta_{0t}$, as shown in part (d) of Fig. 7. The results from Fig. 7 and Table 4 consistently suggest that measured internet speed improves over the study period.
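The comparison in Table 4 can be made concrete with a few lines of arithmetic. The sketch below copies the estimates and intervals from Table 4 and reports, for each row, whether the two 95% intervals overlap, the relative gap between the original and bias-corrected slopes, and the original slope converted to Mbps per year (assuming 365 days). Interval overlap is only a rough heuristic here, not a formal test, since the two fits are based on overlapping data; the helper name `intervals_overlap` is hypothetical.

```python
# (estimate, CI lower, CI upper) in Mbps per day, copied from Table 4.
table4 = {
    ("A", "iOS"):     ((0.1149, 0.1088, 0.1209), (0.0928, 0.0857, 0.0999)),
    ("A", "Android"): ((0.1879, 0.1701, 0.2058), (0.1788, 0.1546, 0.2029)),
    ("B", "iOS"):     ((0.1661, 0.1567, 0.1755), (0.1225, 0.1110, 0.1340)),
    ("B", "Android"): ((0.1309, 0.1200, 0.1417), (0.1370, 0.1232, 0.1509)),
}

def intervals_overlap(a, b):
    """True if the (lower, upper) parts of two (est, lower, upper) tuples overlap."""
    return a[1] <= b[2] and b[1] <= a[2]

summary = {}
for key, (orig, resampled) in table4.items():
    summary[key] = {
        "overlap": intervals_overlap(orig, resampled),
        # positive when the original data give the steeper trend
        "relative_gap": (orig[0] - resampled[0]) / resampled[0],
        "orig_mbps_per_year": 365 * orig[0],
    }

for (city, device), row in summary.items():
    print(f"City {city} {device:8s} overlap={row['overlap']!s:5s} "
          f"gap={row['relative_gap']:+.1%} "
          f"per-year={row['orig_mbps_per_year']:.1f}")
```

In this reading, the iOS rows are the ones where the intervals do not overlap, which matches the discussion of parts (a) and (c) of Fig. 7.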

6 Conclusion

In this paper, we integrated Ookla Speedtest measurements with regional demographic profiles for analyzing disparities of measured internet quality and the temporal evolution of internet speed. We developed re-weighing and re-sampling methods to correct the large regional sampling bias across census block groups. Through regression analysis of integrated data, we found that census block groups with higher income, younger population, and fewer Hispanic residents tend towards higher measured internet speeds. Furthermore, we discerned an encouraging trend of internet speed improvement through temporal modeling of Speedtest measurements. Nevertheless, it is essential to approach these findings with caution, as they are susceptible to different biases inherent in Ookla Speedtest measurements. We anticipate that our new methods can be applied to different crowdsourced data and we outline a few directions for further study.

First, while our current investigation primarily concentrates on urban areas, the realm of speed test measurements in rural and sparsely populated subareas within cities remains largely unexplored. Consequently, a comprehensive analysis of internet performance profiles between urban and rural areas presents an intriguing prospect. The principal challenge in this pursuit is the scarcity or absence of crowdsourced samples in rural areas, rendering accurate statistical inference on these regions difficult. However, it is likely that spatial proximity and the similarity of demographic profiles can be used to infer internet speed through interpolation.

Second, our study naturally stimulates further inquiry into internet speed across other cities or states. Given the expansive coverage of crowdsourced measurements across the United States, researchers can leverage data on a larger scale to generalize internet speed characteristics throughout the country. To accommodate a more diverse range of states and cities, one plausible modeling approach involves incorporating mixed effects into the model, accounting for spatial variations. The nature of these mixed effects may vary depending on the hierarchical structure amongst states, counties, and census block groups. Identifying a suitable metric to define correlations between regions may be challenging, but demographic similarity stands as a viable option to gauge correlated structures. This methodology enables the construction of a more comprehensive model alongside city- and state-specific explanatory terms.

Third, our analysis of the temporal progression of internet speed motivates future studies in this direction. Our study employing state space models reveals a notable degree of volatility in the time series of internet speed, suggesting that the distribution of internet speed comprises multiple heterogeneous groups over time. An essential latent factor in this context is the internet subscription plan. To address this, one may consider employing a mixture of Gaussian processes or state space models to analyze the temporal evolution of internet speed while accounting for different subscription plans.

Lastly, it is essential to consider potential sources of bias other than the sampling bias among census block groups. While our study has identified an association between regional sampling bias and demographic disparities, further investigation is needed because demographic profiles corresponding to individual speed tests are not available. Additional data for calibrating the model can be obtained through anonymous surveys to address this limitation. The collection of paired data encompassing internet speed measurements and demographic data of the speed test taker can be used for regression analysis to understand whether other forms of bias affect the estimated relationship between internet quality and demographic features.

References

Appendix A Proofs and derivations

A.1 Proof of Lemma 1

Proof.

Consider a fixed $x\in\mathbb{R}$. Note that the indicator function $I(y_{i}\leq x)$ follows a Bernoulli distribution with success probability $\mathbb{P}(y_{i}\leq x)$, for $i=1,\cdots,k$. Based on the law of total expectation, we have:

$$\begin{aligned} \mathbb{P}(y\leq x)&=\mathbb{E}\left[I(y\leq x)\right]\\ &=\mathbb{E}\left[\mathbb{E}\left[I(y\leq x)\mid\mathbf{z}\right]\right]\\ &=\sum_{i=1}^{k}\mathbb{P}(z_{i}=1)\,\mathbb{E}\left[I(y\leq x)\mid z_{i}=1\right]\\ &=\sum_{i=1}^{k}\mathbb{P}(z_{i}=1)\,\mathbb{E}\left[I(y_{i}\leq x)\right]\\ &=\sum_{i=1}^{k}\frac{N_{i}}{N}\mathbb{E}\left[I(y_{i}\leq x)\right]\\ &=\sum_{i=1}^{k}\frac{N_{i}}{N}\mathbb{P}(y_{i}\leq x)\\ &=\sum_{i=1}^{k}\frac{N_{i}}{N}F_{i}(x), \end{aligned}$$

where $\mathbf{z}$ follows the multinomial distribution in Equation (2). ∎

A.2 Proof of unbiased estimator in Equation (7)

Proof.

Consider a fixed $x\in\mathbb{R}$. Note that $\hat{F}_{i}(x)$ is unbiased for $F_{i}(x)$ for any census block group $i$, since the indicators $I(y_{ij}\leq x)$ are independent Bernoulli random variables with success probability $F_{i}(x)=\mathbb{P}(y_{i}\leq x)$, i.e.,

$$\begin{aligned} \mathbb{E}\left[\hat{F}_{i}(x)\right]&=\mathbb{E}\left[\frac{1}{n_{i}}\sum_{j=1}^{n_{i}}I\left(y_{ij}\leq x\right)\right]\\ &=\frac{1}{n_{i}}\sum_{j=1}^{n_{i}}\mathbb{E}\left[I\left(y_{ij}\leq x\right)\right]\\ &=\frac{1}{n_{i}}\sum_{j=1}^{n_{i}}F_{i}(x)\\ &=\frac{1}{n_{i}}n_{i}F_{i}(x)=F_{i}(x). \end{aligned}$$

Based on the linearity of the expectation operator $\mathbb{E}[\cdot]$, we then have:

$$\mathbb{E}\left[\hat{F}_{u}(x)\right]=\sum_{i=1}^{k}\frac{N_{i}}{N}\mathbb{E}\left[\hat{F}_{i}(x)\right]=\sum_{i=1}^{k}\frac{N_{i}}{N}F_{i}(x)=F(x).$$
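To make the difference between the pooled and the population-weighted estimators tangible, here is a minimal sketch with two made-up census block groups. The data and populations are invented for illustration only. The pooled estimator weights each group's empirical CDF by its share of samples, $n_{i}/n$, whereas $\hat{F}_{u}$ uses the population share, $N_{i}/N$, as in the derivation above.

```python
def ecdf(samples, x):
    """Empirical CDF of one group evaluated at x."""
    return sum(1 for y in samples if y <= x) / len(samples)

def pooled_cdf(groups, x):
    """Weights each group by its share of samples, n_i / n."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * ecdf(g, x) for g in groups)

def reweighted_cdf(groups, populations, x):
    """Weights each group by its share of population, N_i / N."""
    N = sum(populations)
    return sum(N_i / N * ecdf(g, x) for g, N_i in zip(groups, populations))

# Hypothetical download speeds (Mbps): group 1 is over-sampled and faster.
groups = [
    [120, 150, 90, 200, 60, 40, 180, 110],  # n_1 = 8 tests
    [20, 70],                               # n_2 = 2 tests
]
populations = [100, 300]                    # N_1, N_2 (made up)

x = 50
F_pooled = pooled_cdf(groups, x)
F_u = reweighted_cdf(groups, populations, x)
print(F_pooled, F_u)
```

Because the slower group holds three quarters of the population but only a fifth of the tests, the pooled estimate understates the share of speeds at or below 50 Mbps relative to the population-weighted one.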

A.3 Proof of Lemma 2

Proof.

By the weak law of large numbers, we have $\hat{F}_{i}(x)\xrightarrow{\mathbb{P}}F_{i}(x)$ and $\hat{F}^{*}_{i}(x)\xrightarrow{\mathbb{P}}F_{i}(x)$ for any real number $x$ when $n_{i}\rightarrow\infty$, for $i=1,\cdots,k$. Since $k$ is a finite number, it follows from Slutsky's Theorem that:

$$\hat{F}_{u}(x)=\sum_{i=1}^{k}\frac{N_{i}}{N}\hat{F}_{i}(x)\xrightarrow{\mathbb{P}}\sum_{i=1}^{k}\frac{N_{i}}{N}F_{i}(x)=F(x),$$

when $n_{i}\rightarrow\infty$ for each $i$. It suffices to show that:

$$\lim_{n\rightarrow\infty}\frac{n_{i}^{*}}{n^{*}}=\frac{N_{i}}{N}. \tag{28}$$

By the definition of $n_{i}^{*}$ in (8), we have:

$$\begin{aligned} \frac{nN_{i}}{N}-\frac{1}{2}&\leq n_{i}^{*}=\left[\frac{nN_{i}}{N}\right]\leq\frac{nN_{i}}{N}+\frac{1}{2},\text{ so}\\ n-\frac{1}{2}k&\leq n^{*}=\sum_{i=1}^{k}\left[\frac{nN_{i}}{N}\right]\leq n+\frac{1}{2}k. \end{aligned} \tag{29}$$

Then, from (29), we obtain:

$$\frac{nN_{i}/N-1/2}{n+k/2}\leq\frac{n_{i}^{*}}{n^{*}}\leq\frac{nN_{i}/N+1/2}{n-k/2}. \tag{30}$$

Letting $n\rightarrow\infty$, we apply the Squeeze Theorem to the inequality in (30) to yield the result of (28). ∎
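The rounding argument in (28) to (30) can be checked numerically. The sketch below assumes that $[\cdot]$ denotes rounding to the nearest integer, implemented as $\lfloor x+1/2\rfloor$ in exact integer arithmetic. The populations are hypothetical. For several values of $n$ it verifies the bounds in (29) and records the largest gap between $n_{i}^{*}/n^{*}$ and $N_{i}/N$.

```python
from fractions import Fraction

def resample_sizes(n, populations):
    """n_i* = [n N_i / N], computed as floor(n N_i / N + 1/2) with integers."""
    N = sum(populations)
    return [(2 * n * N_i + N) // (2 * N) for N_i in populations]

populations = [1213, 799, 2503, 517]   # hypothetical N_i
N, k = sum(populations), len(populations)

gaps = []
for n in (10, 100, 10_000, 1_000_000):
    sizes = resample_sizes(n, populations)
    n_star = sum(sizes)
    # First line of (29): each n_i* is within 1/2 of n N_i / N.
    assert all(abs(s - Fraction(n * N_i, N)) <= Fraction(1, 2)
               for s, N_i in zip(sizes, populations))
    # Second line of (29): n* is within k/2 of n.
    assert abs(n_star - n) <= Fraction(k, 2)
    # Largest deviation of the re-sampled share from the population share.
    gaps.append(max(abs(Fraction(s, n_star) - Fraction(N_i, N))
                    for s, N_i in zip(sizes, populations)))

for n, gap in zip((10, 100, 10_000, 1_000_000), gaps):
    print(n, float(gap))
```

The gap need not shrink monotonically in $n$, but the bounds in (30) force it to vanish as $n$ grows.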

A.4 Derivation of confidence intervals for the methods in Section 3.2

We derive the confidence intervals for the different estimators of the CDF. For all methods, we let $x\in\mathbb{R}$ be any fixed input.

A.4.1 Empirical CDF from original data

Our empirical CDF for $F_{i}(x)$ is given by:

$$\hat{F}_{i}(x)=\frac{1}{n_{i}}\sum_{j=1}^{n_{i}}I(y_{ij}\leq x). \tag{31}$$

For a large number of samples in each region (with sufficiently large $n_{i}$ for every $i$), we apply the Central Limit Theorem under Assumptions 1 and 2. Then,

$$\hat{F}_{i}(x)\overset{\text{approx}}{\sim}\mathcal{N}\left(F_{i}(x),\frac{F_{i}(x)(1-F_{i}(x))}{n_{i}}\right), \tag{32}$$

for $i=1,\cdots,k$. Under Assumption 1 and by the fact that $I(y_{ij}\leq x)$ independently follows a Bernoulli distribution with success probability $F_{i}(x)$ for all $i,j$, the variance of $\hat{F}(x)$ follows:

$$\begin{aligned} \text{Var}(\hat{F}(x))&=\text{Var}\left(\sum_{i=1}^{k}\frac{n_{i}}{n}\hat{F}_{i}(x)\right)\\ &=\sum_{i=1}^{k}\text{Var}\left(\frac{n_{i}}{n}\hat{F}_{i}(x)\right)\\ &=\sum_{i=1}^{k}\frac{n_{i}^{2}}{n^{2}}\text{Var}\left(\hat{F}_{i}(x)\right)\\ &=\sum_{i=1}^{k}\frac{n_{i}^{2}}{n^{2}}\frac{F_{i}(x)(1-F_{i}(x))}{n_{i}}\\ &=\frac{1}{n^{2}}\sum_{i=1}^{k}n_{i}F_{i}(x)(1-F_{i}(x)). \end{aligned} \tag{33}$$

Based on the asymptotic normality in (32) and independence based on Assumption 1, we have:

\begin{equation}
\hat{F}(x)\sim^{\text{approx}}\mathcal{N}\left(\sum_{i=1}^{k}\frac{n_i}{n}F_i(x),\ \frac{1}{n^2}\sum_{i=1}^{k}n_i F_i(x)(1-F_i(x))\right),
\tag{34}
\end{equation}

where its asymptotic variance is obtained by (33). Note that $\hat{F}_i(x)$ converges in probability to $F_i(x)$, i.e. $\hat{F}_i(x)\xrightarrow{\mathbb{P}}F_i(x)$, under Assumptions 1 and 2 for $i=1,\cdots,k$. By Slutsky’s theorem, we have the following expression:

\[
\hat{F}(x)\sim^{\text{approx}}\mathcal{N}\left(\sum_{i=1}^{k}\frac{n_i}{n}F_i(x),\ \frac{1}{n^2}\sum_{i=1}^{k}n_i\hat{F}_i(x)(1-\hat{F}_i(x))\right).
\]

Therefore, the 95% confidence interval for the simple empirical CDF follows:

\[
\hat{F}(x)\pm 1.96\frac{\sqrt{\sum_{i=1}^{k}n_i\hat{F}_i(x)(1-\hat{F}_i(x))}}{n}.
\]
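The interval above can be sketched numerically as follows. This is only an illustrative sketch using the Python standard library: the function name `simple_ecdf_ci`, the list-of-lists data layout, and the toy speeds are assumptions made for the example, not part of the paper's code.

```python
import math

def simple_ecdf_ci(samples_by_region, x, z=1.96):
    """Pooled empirical CDF at x with a normal-approximation interval.

    samples_by_region[i] holds the measurements y_i1, ..., y_in_i of
    region i (hypothetical layout; every region must be non-empty).
    """
    n = sum(len(ys) for ys in samples_by_region)
    f_hat = 0.0    # sum_i (n_i / n) * F_i_hat(x)
    var_sum = 0.0  # sum_i n_i * F_i_hat(x) * (1 - F_i_hat(x))
    for ys in samples_by_region:
        n_i = len(ys)
        f_i = sum(1 for y in ys if y <= x) / n_i  # regional empirical CDF
        f_hat += (n_i / n) * f_i
        var_sum += n_i * f_i * (1.0 - f_i)
    half_width = z * math.sqrt(var_sum) / n
    return f_hat, f_hat - half_width, f_hat + half_width

# Toy data: two regions with 4 and 2 measurements (made-up speeds).
regions = [[10.0, 20.0, 30.0, 40.0], [5.0, 50.0]]
est, lo, hi = simple_ecdf_ci(regions, x=25.0)
print(est, lo, hi)
```

Because the weights are $n_i/n$, the point estimate coincides with the fraction of all pooled measurements that are at most $x$; only the variance term uses the regional structure.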

A.4.2 Re-weighted empirical CDF from original data

Under Assumption 1 and by the fact that $I(y_{ij}\leq x)$ independently follows a Bernoulli distribution with success probability $F_i$ for all $i,j$, the variance of (7) follows:

\begin{equation}
\begin{aligned}
\text{Var}\left(\hat{F}_u(x)\right) &= \text{Var}\left(\sum_{i=1}^{k}\frac{N_i}{N}\hat{F}_i(x)\right) \\
&= \sum_{i=1}^{k}\text{Var}\left(\frac{N_i}{N}\hat{F}_i(x)\right) \\
&= \sum_{i=1}^{k}\frac{N_i^2}{N^2}\text{Var}\left(\hat{F}_i(x)\right) \\
&= \sum_{i=1}^{k}\frac{N_i^2}{N^2}\frac{F_i(x)(1-F_i(x))}{n_i}.
\end{aligned}
\tag{35}
\end{equation}

Since the empirical CDFs from (31) are asymptotically normal, as shown in (32), with independence between regions (Assumption 1), the re-weighted sum of these CDFs is asymptotically normal, i.e.

\begin{equation}
\hat{F}_u(x)\sim^{\text{approx}}\mathcal{N}\left(\sum_{i=1}^{k}\frac{N_i}{N}F_i(x),\ \frac{1}{N^2}\sum_{i=1}^{k}\frac{N_i^2}{n_i}F_i(x)(1-F_i(x))\right),
\tag{36}
\end{equation}

where the asymptotic variance is obtained by (35).

Since $\hat{F}_i(x)$ converges in probability to $F_i(x)$, i.e. $\hat{F}_i(x)\xrightarrow{\mathbb{P}}F_i(x)$, under Assumptions 1 and 2 for $i=1,\cdots,k$, we apply Slutsky’s theorem to (36) to obtain:

\[
\hat{F}_u(x)\sim^{\text{approx}}\mathcal{N}\left(\sum_{i=1}^{k}\frac{N_i}{N}F_i(x),\ \frac{1}{N^2}\sum_{i=1}^{k}\frac{N_i^2}{n_i}\hat{F}_i(x)(1-\hat{F}_i(x))\right).
\]

Then the 95% confidence interval for the CDF $F(x)$ is given by:

\[
\hat{F}_u(x)\pm 1.96\frac{\sqrt{\sum_{i=1}^{k}\frac{N_i^2}{n_i}\hat{F}_i(x)(1-\hat{F}_i(x))}}{N}.
\]
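A minimal sketch of the re-weighted interval, again using only the Python standard library. The function name `reweighted_ecdf_ci`, the `populations` argument (standing in for $N_1,\cdots,N_k$), and the toy numbers are assumptions introduced for illustration.

```python
import math

def reweighted_ecdf_ci(samples_by_region, populations, x, z=1.96):
    """Population-weighted empirical CDF at x with a normal-approximation interval.

    samples_by_region[i]: measurements of region i (must be non-empty).
    populations[i]: population N_i of region i (hypothetical input).
    """
    N = sum(populations)
    f_u = 0.0      # sum_i (N_i / N) * F_i_hat(x)
    var_sum = 0.0  # sum_i (N_i^2 / n_i) * F_i_hat(x) * (1 - F_i_hat(x))
    for ys, pop_i in zip(samples_by_region, populations):
        n_i = len(ys)
        f_i = sum(1 for y in ys if y <= x) / n_i  # regional empirical CDF
        f_u += (pop_i / N) * f_i
        var_sum += (pop_i ** 2 / n_i) * f_i * (1.0 - f_i)
    half_width = z * math.sqrt(var_sum) / N
    return f_u, f_u - half_width, f_u + half_width

# Toy data: equal sample sizes, but region 2 has three times the population.
regions = [[10.0, 20.0, 30.0, 40.0], [5.0, 50.0, 60.0, 70.0]]
populations = [100, 300]
est_u, lo_u, hi_u = reweighted_ecdf_ci(regions, populations, x=25.0)
print(est_u, lo_u, hi_u)
```

In this toy case the pooled estimate would be $3/8$, while the re-weighted estimate shifts toward the more populous region's regional CDF. The factor $N_i^2/n_i$ shows that a populous region with few samples contributes the most to the interval width.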

A.4.3 Simple empirical CDF from re-sampled data

Following the notation in Section 3.2.2 and the derivation in Appendix A.4.1, we obtain the 95% confidence interval for $\hat{F}^{*}$ as:

\[
\hat{F}^{*}(x)\pm 1.96\frac{\sqrt{\sum_{i=1}^{k}n_i^{*}\hat{F}_i^{*}(x)(1-\hat{F}_i^{*}(x))}}{n^{*}}.
\]

A.5 Proof of Lemma 3

Proof.

From models (12) and (14), we have $y_{ij}\sim^{\text{indep.}}\mathcal{N}(\mathbf{x}_i^T\boldsymbol{\beta},\sigma^2)$ and $\bar{y}_i\sim^{\text{indep.}}\mathcal{N}(\mathbf{x}_i^T\boldsymbol{\beta},\sigma^2/n_i)$ for $j=1,\cdots,n_i$ and $i=1,\cdots,k$. First, note that:

\begin{equation}
\bar{l}(\boldsymbol{\beta},\sigma^2)=-\frac{k}{2}\log(2\pi\sigma^2)+\frac{1}{2}\sum_{i=1}^{k}\log n_i-\frac{1}{2\sigma^2}\sum_{i=1}^{k}n_i(\bar{y}_i-\mathbf{x}_i^T\boldsymbol{\beta})^2.
\tag{37}
\end{equation}

On the other hand, we also have:

\begin{equation}
\begin{split}
l(\boldsymbol{\beta},\sigma^2) &= -\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\mathbf{x}_i^T\boldsymbol{\beta})^2 \\
&= -\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i+\bar{y}_i-\mathbf{x}_i^T\boldsymbol{\beta})^2 \\
&= -\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)^2-\frac{1}{2\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(\bar{y}_i-\mathbf{x}_i^T\boldsymbol{\beta})^2 \\
&\quad -\frac{1}{\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)(\bar{y}_i-\mathbf{x}_i^T\boldsymbol{\beta}) \\
&= -\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)^2-\frac{1}{2\sigma^2}\sum_{i=1}^{k}n_i(\bar{y}_i-\mathbf{x}_i^T\boldsymbol{\beta})^2 \\
&\quad -\frac{1}{\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)(\bar{y}_i-\mathbf{x}_i^T\boldsymbol{\beta}) \\
&= -\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)^2-\frac{1}{2\sigma^2}\sum_{i=1}^{k}n_i(\bar{y}_i-\mathbf{x}_i^T\boldsymbol{\beta})^2,
\end{split}
\tag{38}
\end{equation}

as $\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)(\bar{y}_i-\mathbf{x}_i^T\boldsymbol{\beta})=(\bar{y}_i-\mathbf{x}_i^T\boldsymbol{\beta})\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)=(\bar{y}_i-\mathbf{x}_i^T\boldsymbol{\beta})(n_i\bar{y}_i-n_i\bar{y}_i)=0$ for each $i=1,\cdots,k$. Comparing with Equation (37), both log-likelihoods depend on $\boldsymbol{\beta}$ only through the common term $-\frac{1}{2\sigma^2}\sum_{i=1}^{k}n_i(\bar{y}_i-\mathbf{x}_i^T\boldsymbol{\beta})^2$, and the results follow.
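The decomposition in (38) can be checked numerically. The sketch below, written with the Python standard library only, evaluates both log-likelihoods on made-up data and confirms that their difference does not change with $\boldsymbol{\beta}$. The function names, the toy responses, and the toy predictor vectors are all assumptions for illustration.

```python
import math

def linear_mean(x_i, beta):
    """Inner product x_i^T beta."""
    return sum(a * b for a, b in zip(x_i, beta))

def full_loglik(y_by_region, X, beta, sigma2):
    """Log-likelihood l(beta, sigma^2) of the individual measurements y_ij."""
    n = sum(len(ys) for ys in y_by_region)
    sse = 0.0
    for ys, x_i in zip(y_by_region, X):
        mu_i = linear_mean(x_i, beta)
        sse += sum((y - mu_i) ** 2 for y in ys)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sse / (2 * sigma2)

def averaged_loglik(y_by_region, X, beta, sigma2):
    """Log-likelihood l_bar(beta, sigma^2) of the regional means, as in (37)."""
    k = len(y_by_region)
    total = -0.5 * k * math.log(2 * math.pi * sigma2)
    for ys, x_i in zip(y_by_region, X):
        n_i = len(ys)
        y_bar = sum(ys) / n_i
        mu_i = linear_mean(x_i, beta)
        total += 0.5 * math.log(n_i) - n_i * (y_bar - mu_i) ** 2 / (2 * sigma2)
    return total

# Toy data: k = 3 regions, intercept plus one predictor (made-up values).
y_by_region = [[1.0, 2.0, 4.0], [3.0, 5.0], [0.5, 1.5, 2.5, 3.5]]
X = [[1.0, 0.2], [1.0, 1.0], [1.0, -0.5]]
sigma2 = 1.7

def gap(beta):
    return (full_loglik(y_by_region, X, beta, sigma2)
            - averaged_loglik(y_by_region, X, beta, sigma2))

gap_a = gap([0.3, -1.2])
gap_b = gap([2.0, 0.7])

# Closed form of the gap implied by (37) and (38); it is free of beta.
n = sum(len(ys) for ys in y_by_region)
k = len(y_by_region)
within_ss = sum(sum((y - sum(ys) / len(ys)) ** 2 for y in ys) for ys in y_by_region)
expected_gap = (-0.5 * (n - k) * math.log(2 * math.pi * sigma2)
                - within_ss / (2 * sigma2)
                - 0.5 * sum(math.log(len(ys)) for ys in y_by_region))
print(gap_a, gap_b, expected_gap)
```

Since the gap is constant in $\boldsymbol{\beta}$, maximizing either log-likelihood over $\boldsymbol{\beta}$ for a fixed $\sigma^2$ gives the same maximizer. The gap does depend on $\sigma^2$, so the two likelihoods are not interchangeable for estimating the variance.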

A.6 Proof of Lemma 4

We first introduce the following lemma, which is used to derive estimators of $\sigma^2$ in models (12) and (14) and their variability. Here, for generality, let $\mathbf{x}_i=(1,x_{i1},x_{i2},\cdots,x_{ip})^T\in\mathbb{R}^{p+1}$ represent the predictor vector of the $i$-th region with $p$ different predictors for $i=1,\cdots,k$, where $k\geq 2$ and $p>2$. We assume that $n\gg k>p+1$, indicating that the total number of samples across the collected regions is much greater than the number of regions, and that the dimension of the feature space is smaller than the number of regions.

Lemma 5.

Let $\mathbf{y}$ be an $n$-dimensional random vector with $\mathbf{y}\sim\mathcal{N}(\boldsymbol{\mu},\mathbf{I}_n)$, where $\boldsymbol{\mu}\in\mathbb{R}^n$ and $\mathbf{I}_n$ denotes the $n\times n$ identity matrix. If $\mathbf{M}\in\mathbb{R}^{n\times n}$ is an orthogonal projection matrix, then

\[
\mathbf{y}^T\mathbf{M}\mathbf{y}\sim\chi^2(r(\mathbf{M}),\boldsymbol{\mu}^T\mathbf{M}\boldsymbol{\mu}/2),
\]

where $r(\mathbf{A})$ indicates the rank of a given square matrix $\mathbf{A}$ and $\chi^2(d,\gamma)$ refers to the noncentral $\chi^2$ distribution with $d$ degrees of freedom and noncentrality parameter $\gamma$. A noncentral chi-squared distribution $\chi^2(d,\gamma)$ is generated by a sum of squared independent Gaussian random variables $z_i\sim\mathcal{N}(\mu_i,1)$, $i=1,\cdots,d$, i.e. $\sum_{i=1}^{d}z_i^2$. Here, the noncentrality parameter $\gamma$ is defined by $\gamma=\sum_{i=1}^{d}\mu_i^2/2$.

Proof.

Let $r(\mathbf{M})=r$ and let $\mathbf{b}_{1},\cdots,\mathbf{b}_{r}\in\mathbb{R}^{n}$ be an orthonormal basis for the column space of $\mathbf{M}$, denoted $\mathcal{C}(\mathbf{M})$. Let $\mathbf{B}=[\mathbf{b}_{1},\cdots,\mathbf{b}_{r}]\in\mathbb{R}^{n\times r}$ so that $\mathbf{M}=\mathbf{B}\mathbf{B}^{T}$. We have $\mathbf{y}^{T}\mathbf{M}\mathbf{y}=\mathbf{y}^{T}\mathbf{B}\mathbf{B}^{T}\mathbf{y}=(\mathbf{B}^{T}\mathbf{y})^{T}(\mathbf{B}^{T}\mathbf{y})$, where $\mathbf{B}^{T}\mathbf{y}\sim\mathcal{N}(\mathbf{B}^{T}\boldsymbol{\mu},\mathbf{B}^{T}\mathbf{B})$.
Since the columns of $\mathbf{B}$ are orthonormal, $\mathbf{B}^{T}\mathbf{B}=\mathbf{I}_{r}$. By the definition of the noncentral $\chi^{2}$ distribution, $(\mathbf{B}^{T}\mathbf{y})^{T}(\mathbf{B}^{T}\mathbf{y})\sim\chi^{2}(r,\boldsymbol{\mu}^{T}\mathbf{B}\mathbf{B}^{T}\boldsymbol{\mu}/2)$, where $\boldsymbol{\mu}^{T}\mathbf{B}\mathbf{B}^{T}\boldsymbol{\mu}=\boldsymbol{\mu}^{T}\mathbf{M}\boldsymbol{\mu}$. ∎
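The identity at the heart of the proof can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the matrix `A`, the dimensions, and the helper name `projection_onto_columns` are illustrative choices and do not come from the paper. It builds an orthogonal projection $\mathbf{M}=\mathbf{B}\mathbf{B}^{T}$ from an orthonormal basis and confirms that $\mathbf{y}^{T}\mathbf{M}\mathbf{y}$ equals the sum of squares of $\mathbf{B}^{T}\mathbf{y}$.

```python
import numpy as np


def projection_onto_columns(A):
    """Return (M, B): the orthogonal projection onto C(A) and an orthonormal basis B.

    Assumes A has full column rank (an assumption of this sketch).
    """
    B, _ = np.linalg.qr(A)  # reduced QR: columns of B are orthonormal and span C(A)
    return B @ B.T, B       # M = B B^T, as in the proof


rng = np.random.default_rng(0)
n, r = 6, 3                      # illustrative dimensions
A = rng.normal(size=(n, r))      # hypothetical matrix whose column space we project onto
M, B = projection_onto_columns(A)

mu = rng.normal(size=n)
y = mu + rng.normal(size=n)      # one draw from N(mu, I_n)

quad_form = y @ M @ y                       # y^T M y
sum_of_squares = np.sum((B.T @ y) ** 2)     # (B^T y)^T (B^T y)
noncentrality = mu @ M @ mu / 2             # gamma, using the /2 convention of the lemma
```

Under the convention $\gamma=\boldsymbol{\mu}^{T}\mathbf{M}\boldsymbol{\mu}/2$ used in the lemma, the quadratic form has mean $r+2\gamma$, so averaging `quad_form` over many independent draws of `y` should approach that value.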

A.6.1 Unbiasedness of $\hat{\sigma}^{2}$

Without loss of generality, we consider only the case where $\mathbf{V}$ has full column rank, which is reasonable given the distinctive features among regions. Define $\mathbf{J}=\mathbf{V}(\mathbf{V}^{T}\mathbf{V})^{-1}\mathbf{V}^{T}$ as the orthogonal projection matrix onto $\mathcal{C}(\mathbf{V})$ with $r(\mathbf{J})=p+1$. It follows that $(\mathbf{I}_{n}-\mathbf{J})$ is the orthogonal projection matrix onto $\mathcal{C}(\mathbf{V})^{\perp}$ with $r(\mathbf{I}_{n}-\mathbf{J})=n-p-1$. By Lemma 5 applied to $\mathbf{y}/\sigma$, we have

$$\mathbf{y}^{T}(\mathbf{I}_{n}-\mathbf{J})\mathbf{y}/\sigma^{2}\sim\chi^{2}(n-p-1),$$

where $\chi^{2}(n-p-1)$ is a chi-squared distribution with $n-p-1$ degrees of freedom and noncentrality parameter zero, because $(\mathbf{I}_{n}-\mathbf{J})\mathbf{V}=0$. Then, one can construct an unbiased estimator:

$$\hat{\sigma}^{2}=\frac{\mathbf{y}^{T}(\mathbf{I}_{n}-\mathbf{J})\mathbf{y}}{n-p-1},\tag{39}$$

for $\sigma^{2}$, as $\mathbb{E}\left[\mathbf{y}^{T}(\mathbf{I}_{n}-\mathbf{J})\mathbf{y}/\sigma^{2}\right]=n-p-1$.
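As an illustration of estimator (39), the sketch below computes $\hat{\sigma}^{2}$ from the projection matrix and cross-checks it against the residual sum of squares of an ordinary least-squares fit. It assumes NumPy and an individual-level model of the form $\mathbf{y}=\mathbf{V}\boldsymbol{\beta}+\boldsymbol{\epsilon}$ with an intercept column; the simulated design, the coefficients, and the function name `sigma2_hat` are hypothetical and used only for demonstration.

```python
import numpy as np


def sigma2_hat(V, y):
    """Estimator (39): y^T (I_n - J) y / (n - p - 1), with J = V (V^T V)^{-1} V^T."""
    n, q = V.shape                            # q = p + 1 columns (intercept plus p covariates)
    J = V @ np.linalg.solve(V.T @ V, V.T)     # projection onto C(V); needs full column rank
    return y @ (np.eye(n) - J) @ y / (n - q)


# Simulated data (all values here are illustrative assumptions).
rng = np.random.default_rng(1)
n, p, sigma = 200, 3, 2.0
V = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 0.5, -0.3, 2.0])
y = V @ beta + sigma * rng.normal(size=n)

estimate = sigma2_hat(V, y)

# Cross-check: the same quantity obtained from least-squares residuals.
coef, *_ = np.linalg.lstsq(V, y, rcond=None)
rss_check = np.sum((y - V @ coef) ** 2) / (n - p - 1)
```

Forming the $n\times n$ matrix $\mathbf{J}$ explicitly is convenient for matching the notation but costly for large $n$; the residual-based computation gives the same value without building it.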

A.6.2 Unbiasedness of $\hat{\sigma}^{2}_{\text{agg}}$

From the model (15), we have $\mathbf{W}\mathbf{y}\sim\mathcal{N}(\mathbf{W}\mathbf{X}\boldsymbol{\beta},\sigma^{2}\mathbf{I}_{k})$. Assume that $\mathbf{X}$ is of full column rank. Define $\mathbf{H}=\mathbf{W}\mathbf{X}(\mathbf{X}^{T}\mathbf{W}^{2}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{W}$ so that $\mathbf{H}$ is the orthogonal projection matrix onto $\mathcal{C}(\mathbf{WX})$ with $r(\mathbf{WX})=p+1$. Similarly, $(\mathbf{I}_{k}-\mathbf{H})$ is the orthogonal projection matrix onto $\mathcal{C}(\mathbf{WX})^{\perp}$ with $r(\mathbf{I}_{k}-\mathbf{H})=k-p-1$. By Lemma 5, it follows that

$$\mathbf{y}^{T}\mathbf{W}(\mathbf{I}_{k}-\mathbf{H})\mathbf{W}\mathbf{y}/\sigma^{2}\sim\chi^{2}(k-p-1),$$

as $(\mathbf{I}_{k}-\mathbf{H})\mathbf{W}\mathbf{X}=0$. Then, we can use the unbiased estimator

$$\hat{\sigma}^{2}_{\text{agg}}=\frac{\mathbf{y}^{T}\mathbf{W}(\mathbf{I}_{k}-\mathbf{H})\mathbf{W}\mathbf{y}}{k-p-1}\tag{40}$$

of $\sigma^{2}$, since $\mathbb{E}\left[\mathbf{y}^{T}\mathbf{W}(\mathbf{I}_{k}-\mathbf{H})\mathbf{W}\mathbf{y}/\sigma^{2}\right]=k-p-1$.

A.6.3 Quantification of efficiency

Note that if a random variable $W$ follows a $\chi^{2}$ distribution with $d$ degrees of freedom, then $\text{Var}(W)=2d$. As such, although the estimators in (39) and (40) are both unbiased for $\sigma^{2}$, the estimator in (39) is more efficient than the one in (40) because $\text{Var}(\hat{\sigma}^{2})=2\sigma^{4}/(n-p-1)<2\sigma^{4}/(k-p-1)=\text{Var}\left(\hat{\sigma}^{2}_{\text{agg}}\right)$, and this difference is noticeable since the number of regions $k$ is much smaller than the total number of samples $n$.
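To give a sense of scale, the variance formulas above can be evaluated directly. The sketch below is plain Python; the function names are hypothetical, and the values of `n` and `p` are illustrative assumptions rather than figures from the paper (only `k = 525` matches the number of regions reported for City C in Table 5).

```python
def var_sigma2_estimator(sigma2, dof):
    """Variance of sigma2 * chi2(dof) / dof, which equals 2 * sigma2**2 / dof."""
    if dof <= 0:
        raise ValueError("degrees of freedom must be positive")
    return 2.0 * sigma2 ** 2 / dof


def variance_ratio(n, k, p):
    """Var(sigma2_agg) / Var(sigma2_hat) = (n - p - 1) / (k - p - 1).

    The ratio does not depend on sigma2, so sigma2 = 1 is used internally.
    """
    return var_sigma2_estimator(1.0, k - p - 1) / var_sigma2_estimator(1.0, n - p - 1)


# Hypothetical sizes: n samples, k regions, p covariates.
ratio = variance_ratio(n=100_000, k=525, p=8)
```

With these assumed sizes the aggregated estimator has a variance on the order of a couple of hundred times larger than the individual-level one, which is the sense in which the difference is "noticeable" when $k\ll n$.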

Figure 8: Comparison of the distributions of population and sample sizes. Census block groups are ordered from the one with the largest population to the one with the smallest population. Cumulative distribution functions are of population/sample sizes from iOS and Android devices at different census block groups in City C (part (a)) and City D (part (b)). The probability mass functions are of population/sample sizes from iOS and Android devices in City C (part (c)) and City D (part (d)). The number on the x-axis indicates the rank of the population size in each census block group; the census block group with the largest population has rank 1.

Appendix B Other cities and device types

B.1 Regional sampling bias detection and correction

Here we provide additional numerical results of the regional sampling bias in two other cities, referred to as City C and City D. We also examine whether the regional sampling bias affects the estimation of internet speed and its association with regional demographic profiles, as discussed in Section 3. We compare the cumulative distribution of sample sizes with population in Fig. 8 for City C and D, and the difference seems to be more noticeable for City C.

Figure 9: Comparison of empirical CDFs of the internet download speed from the original samples with re-weighted empirical CDF from original samples, and empirical CDF from re-sampled data. (a) iOS devices in City C; (b) Android devices in City C; (c) iOS devices in City D; and (d) Android devices in City D. The insets show zoom-in plots of the estimated CDFs of the internet download speed above the 98th percentile from the original samples.

Table 5: Results of the $\chi^{2}$ homogeneity test for City C and City D.

| City | # of regions | Device type | Test statistic $W$ | p-value |
|------|--------------|-------------|--------------------|---------|
| C | 525 | iOS | 780,884 | $<10^{-16}$ |
| C | 525 | Android | 128,545 | $<10^{-16}$ |
| D | 397 | iOS | 88,953 | $<10^{-16}$ |
| D | 397 | Android | 142,934 | $<10^{-16}$ |

The difference between the distributions of sample sizes and population is statistically evaluated by the chi-squared homogeneity test, as shown in Table 5. For each device type in either City C or City D, we find that the proportion of sample sizes is significantly different from the proportion of the population. We further compare the cumulative distribution functions of internet speed in Fig. 9 based on the three different empirical CDFs introduced in Section 3.2. Neither the re-weighting nor the re-sampling method shows a notable difference from the empirical CDF with regional sampling bias, which is consistent with the findings for City A and City B.
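The idea behind such a test can be sketched as follows. This is a minimal plain-Python illustration that assumes the statistic takes Pearson's goodness-of-fit form, comparing observed per-region sample counts with the counts expected under population-proportional sampling; the exact definition of $W$ is given in Section 3 and may differ. The function name and the toy numbers are hypothetical.

```python
def pearson_homogeneity_statistic(sample_counts, populations):
    """Pearson chi-squared statistic for 'samples are proportional to population'.

    Returns (statistic, degrees_of_freedom). Assumes one entry per region and
    strictly positive populations (assumptions of this sketch).
    """
    if len(sample_counts) != len(populations):
        raise ValueError("need one sample count and one population per region")
    if any(pop <= 0 for pop in populations):
        raise ValueError("populations must be positive")

    n_total = sum(sample_counts)
    pop_total = sum(populations)
    statistic = 0.0
    for count, pop in zip(sample_counts, populations):
        expected = n_total * pop / pop_total      # expected count under homogeneity
        statistic += (count - expected) ** 2 / expected
    return statistic, len(sample_counts) - 1


# Toy example with three regions (numbers are illustrative only).
toy_counts = [50, 30, 20]
toy_populations = [1000, 1000, 2000]
toy_stat, toy_dof = pearson_homogeneity_statistic(toy_counts, toy_populations)
```

In the toy example the third region holds half of the population but contributes only a fifth of the samples, so the statistic is large relative to its two degrees of freedom. A p-value would follow from the upper tail of the $\chi^{2}$ distribution with `toy_dof` degrees of freedom, for instance via `scipy.stats.chi2.sf` if SciPy is available.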

The demographic disparity between the over- and under-sampled census block groups of City C and City D for iOS devices is visualized in Fig. 10. We find that, in both City C and City D, the over-sampled regions tend to have higher income, greater age, a larger proportion of the population with a bachelor’s degree or higher, a greater percentage of households with internet subscription plans, and higher representation of white and Asian residents, along with lower representation of black and Hispanic residents. The differences in demographic profiles between over-sampled and under-sampled regions generally agree across all four cities.

Figure 10: Comparison of distributions of demographic variables between over-/under-sampled census block groups. Significance is based on the p-values from the two-sample t-test: *** (p-value < 0.001); ** (p-value < 0.01); * (p-value < 0.05); and empty mark for non-significant cases. (a) iOS devices from City C; (b) iOS devices from City D.
Figure 11: Comparison of distributions of demographic variables between over-/under-sampled census block groups. Significance is based on the p-values from the two-sample t-test: *** (p-value < 0.001); ** (p-value < 0.01); * (p-value < 0.05); and empty mark for non-significant cases. (a) Android devices from City C; (b) Android devices from City D.

The comparison for Android devices in City C and City D is given in Fig. 11. We find that the over-sampled census block groups tend to have demographic characteristics similar to those found for the iOS devices in each city. The difference in demographic profiles between over-sampled and under-sampled regions tends to be smaller for the Android devices than for the iOS devices.

B.2 Correlating internet quality with demographic variables

Figure 12: Comparison of regression coefficient estimates (dots) and 95% confidence intervals (bars) from original data and re-sampled data. Multiple linear regression is conducted with backward variable selection by AIC for both iOS and Android from (a) City C and (b) City D. The variables dropped from model selection are not shown in the plot.
Figure 13: Heat-map of pair-wise correlation coefficients between demographic covariates from re-sampled data for Android devices in (a) City A and (b) City B.
Figure 14: Heat-map of pair-wise correlation coefficients between demographic covariates from re-sampled data for iOS devices in (a) City C and (b) City D.
Figure 15: Heat-map of pair-wise correlation coefficients between demographic covariates from re-sampled data for Android devices in (a) City C and (b) City D.

We applied the regression models (12) and (14) in Section 4 for both iOS and Android devices in City C and City D with backward variable selection based on AIC. The estimated coefficients of the selected variables and their 95% confidence intervals are shown in Fig. 12. First, income is positively correlated with internet speed for both cities and device types, although the income coefficients from the re-sampled data indicate a smaller impact on internet speed. Second, the coefficients of age are negative from both the original and re-sampled data in City C and City D, suggesting that regions with a younger population tend to have faster internet. Third, we observe in part (a) of Fig. 14 that the proportion of Hispanic residents is negatively correlated with the percentage of residents with a bachelor’s degree or higher, which can partially explain the increased coefficient of Hispanic after re-sampling along with the coefficient of bachelor’s degree, particularly for iOS devices. Furthermore, the internet penetration rate has a positive correlation with income and with the proportion of individuals with a bachelor’s degree or higher, as shown in Fig. 14 and Fig. 15. Such correlation can potentially explain the coefficients of the internet penetration rate for iOS devices in City C and City D in the re-sampled data, as the positive coefficients of bachelor’s degree and income offset the impact of the internet penetration rate.

The results on the impact of demographic profiles in Cities C and D are generally consistent with those for Cities A and B. Census block groups with higher income and a younger population tend to have higher internet speed. Such effects can be offset by other positively correlated variables, such as internet penetration rates.

B.3 Temporal progression of internet speed

Figure 16: Comparison of the temporal trend of internet download speed from original versus re-sampled data based on linear and Gaussian process (GP) regression represented by the state space model. (a) iOS devices in City C; (b) Android devices in City C; (c) iOS devices in City D; and (d) Android devices in City D. The solid curves are the predictive means and the shaded areas are the 95% confidence intervals of the estimation.

We analyze the temporal progression of internet speed for Cities C and D based on the methods discussed in Section 5, where in this case the available data spans from 01-01-2021 to 12-31-2021 for both cities. The estimated means and 95% confidence intervals from both linear models and state space models are shown in Fig. 16. All models indicate that internet download speed increases over the 1-year period. Estimates from the original data and the re-sampled data are similar for these two cities, indicating that the regional sampling bias across census block groups does not lead to a large difference in the estimated temporal trend. The data for both cities contains large variability, potentially due to differences in internet subscription plans. The availability of fiber internet plans may be one of the main drivers of the increase in internet speed.

Table 6: Linear trends of measured internet download speed (Mbps) per day for Cities C and D.

| City | Device type | Estimate $\hat{\beta}_{1t}$ (95% CI) | Estimate $\hat{\beta}_{1t}^{r}$ (95% CI) |
|------|-------------|--------------------------------------|------------------------------------------|
| C | iOS | 0.1384 (0.1310, 0.1458) | 0.1509 (0.1403, 0.1615) |
| C | Android | 0.2944 (0.2827, 0.3062) | 0.2824 (0.2659, 0.2988) |
| D | iOS | 0.1052 (0.0950, 0.1154) | 0.0919 (0.0786, 0.1053) |
| D | Android | 0.0552 (0.0428, 0.0676) | 0.0633 (0.0460, 0.0805) |

We further compare the estimated linear coefficients $\hat{\beta}_{1t}$ and $\hat{\beta}_{1t}^{r}$ of the linear trend of internet download speed for City C and City D in Table 6. All estimates are positive for both device types, whereas City C may have a faster increasing trend than City D.
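The per-day slopes in Table 6 can be put on a more interpretable scale with a small calculation. The sketch below is plain Python using only the values reported in Table 6; the helper names are hypothetical, the 365-day multiplier is a rough annualization that assumes the linear trend holds over the whole year, and checking whether two confidence intervals overlap is an informal comparison rather than a formal test of equality.

```python
# (estimate, lower, upper) for the original and re-sampled data, copied from Table 6.
TABLE_6 = {
    ("C", "iOS"): ((0.1384, 0.1310, 0.1458), (0.1509, 0.1403, 0.1615)),
    ("C", "Android"): ((0.2944, 0.2827, 0.3062), (0.2824, 0.2659, 0.2988)),
    ("D", "iOS"): ((0.1052, 0.0950, 0.1154), (0.0919, 0.0786, 0.1053)),
    ("D", "Android"): ((0.0552, 0.0428, 0.0676), (0.0633, 0.0460, 0.0805)),
}


def annualized_increase(slope_per_day, days=365):
    """Implied change in download speed (Mbps) over `days` under a linear trend."""
    return slope_per_day * days


def intervals_overlap(first, second):
    """True if the (estimate, lower, upper) intervals share at least one point."""
    return first[1] <= second[2] and second[1] <= first[2]


annual_original = {key: annualized_increase(orig[0]) for key, (orig, _) in TABLE_6.items()}
overlap = {key: intervals_overlap(orig, resampled) for key, (orig, resampled) in TABLE_6.items()}
```

For example, the original-data slope for iOS devices in City C corresponds to an increase of roughly 50 Mbps over the year, and the confidence intervals from the original and re-sampled data overlap for every city and device type in the table.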