Analyzing Disparity and Temporal Progression of Internet Quality through Crowdsourced Measurements with Bias-Correction

Hyeongseong Lee, Udit Paul, Arpit Gupta, Elizabeth Belding, and Mengyang Gu

University of California, Santa Barbara, California, USA Corresponding author ([email protected])

Abstract

Crowdsourced speed test measurements are an important tool for studying internet performance from the end user perspective. Nevertheless, despite the accuracy of individual measurements, simplistic aggregation of these data points is problematic due to their intrinsic sampling bias. In this work, we utilize a dataset of nearly 1 million individual Ookla Speedtest^® measurements, correlate each datapoint with 2019 Census demographic data, and develop new methods to present a novel analysis to quantify regional sampling bias and the relationship of internet performance to demographic profile. We find that the crowdsourced Ookla Speedtest data points contain significant sampling bias across different census block groups based on a statistical test of homogeneity. We introduce two methods to correct the regional bias by the population of each census block group. Whereas the sampling bias leads to a small discrepancy in the overall cumulative distribution function of internet speed in a city between estimation from original samples and bias-corrected estimation, the discrepancy is much smaller compared to the size of the sampling heterogeneity across regions. Further, we show that the sampling bias is strongly associated with a few demographic variables, such as income, education level, age, and ethnic distribution. Through regression analysis, we find that regions with higher income, younger population, and lower representation of Hispanic residents tend to measure faster internet speeds along with substantial collinearity amongst socioeconomic attributes and ethnic composition. Finally, we find that average internet speed increases over time based on both linear and nonlinear analysis from state space models, though the regional sampling bias may result in a small overestimation of the temporal increase of internet speed.

1 Introduction

The allocation of large US federal grants and subsidies to expand Internet infrastructure, such as through the $42.5 billion Broadband Equity Access and Deployment (BEAD) program initiated in November 2021, is an important step in addressing digital inequity and providing high-quality Internet access to all Americans. To properly allocate this funding to regions of greatest need, there must first be a methodology for measuring the current state of Internet access and quality at a given location. Indeed, the US Federal Communications Commission (FCC) recently undertook a significant effort to update publicly available information about US broadband infrastructure. First, the FCC worked to improve accuracy in the national Broadband Serviceable Location Fabric, defined as “a common data set of all residential and business locations (or structures) in the U.S. where fixed broadband internet access service is or can be installed.” Based on this new Fabric, they updated the National Broadband Map with ISP-provided data on the maximum download speed available at each location [10]. While this is a significant advancement in publicly accessible data on broadband infrastructure, the Fabric and the map have a number of drawbacks. In particular, the Broadband Map does not report actual or average performance but theoretical maximums. It also does not provide information about the reliability or stability of the access over time. Finally, providers have been repeatedly found to overstate coverage claims [1, 27].

To address these drawbacks, the research community, individual users, and governments and policymakers have turned to crowdsourced “speed test” active measurement data, such as Ookla’s speedtest.net [11], Measurement Lab’s speed.measurementlab.net [9], FAST [5] and Xfinity’s speed test [15]. These platforms are in wide use globally. For instance, Ookla reports a daily average of over 10 million Speedtests [16]. Speed test platforms offer a variety of measurements, such as instantaneous upload/download speed and latency and loaded latency data, through either an app or a website. As such, these platforms provide an important snapshot of the network state from the vantage point of the end users. Further, because they are active measurements, they provide data on actual performance instead of the theoretical maximum performance reported by the providers. Because of the inherent benefits, numerous U.S. government initiatives (e.g. [18, 4, 6, 14, 7, 3]) utilize crowdsourced speed test data to help map the internet access landscape.

However, despite the broad use of speed tests, there are multiple critical limitations that can affect the estimation accuracy of these crowdsourced measurements. In particular, due to the nature of crowdsourced data, the sampling mechanism is often uncontrollable [35]. As a result, inherent in this data is a variety of biases along multiple dimensions that can skew the measurement results in unanticipated ways. For instance, biased samples can lead to inappropriate conclusions if they are treated as simple random samples [29]. Another barrier is that consumers’ information, such as broadband plans and demographic profiles, is not directly available from the speed tests. Even if the broadband plans can be correctly inferred for the speed test with the use of other data sets [31], the lack of a demographic profile associated with each measurement makes it challenging to correct the sampling bias directly. Finally, speed test performance may depend on the choice of test server; prior work has found that some servers consistently report speeds 10% lower than other servers [26]. Nevertheless, the vast number and diversity of speed test data points present abundant information for understanding the disparity of internet quality, particularly as related to different socio-demographic groups and the change/evolution of internet quality over time.

It is within this context that we perform our analysis. Specifically, we utilize a vast corpus of nearly 1 million individual Ookla Speedtest^® measurements, to which we have access through an Ookla for Good^TM Data Use Agreement, to correlate network performance measurements and trends with demographic profiles of census block groups from the 2019 U.S. census data. Our goals are to characterize disparities in internet quality based on socio-demographic groups and to analyze the change in internet quality over time. We illustrate our novel methods using four representative cities and two device types (iOS and Android) as examples, corresponding to 978,101 Speedtest data points over 580 days. Specifically, the contributions of our work are threefold:

1. We develop novel methods for correcting regional sampling bias and correlating this sampling bias with demographic profiles in the population. By utilizing the chi-squared test, we find that the proportion of the Speedtests in different census block groups significantly deviates from the baseline proportions in the population. We introduce re-weighing and re-sampling methods for correcting the regional sampling bias. Even though we found a visible discrepancy between the original and corrected estimation of the internet speed distribution, this difference is much smaller than the regional bias, meaning that the sampling bias itself may not substantially affect the estimation of the internet speed¹ distribution in a city. We study the underlying reasons for disproportionate sampling and find that the sampling bias is strongly associated with a few demographic variables, including income, education level, age, and ethnic distribution.

2. We introduce variable selection and regression techniques to fuse data at different spatial granular levels for understanding how internet quality varies across demographic profiles. By studying the individual and aggregated data, we show that the use of individual-level measurements is more statistically efficient in estimating the variability of the speed test measurements than the aggregated data. Through backward variable selection for the original and bias-corrected samples, we find that regions with higher income, younger populations, and smaller Hispanic populations tend to have higher internet speeds. Furthermore, we find strong collinearity between a few demographic variables, which tend to affect the outcomes of the fitted model jointly.

3. We analyze temporal changes of measured internet performance. We fit both a linear regression model and a state space model to study the temporal change of the internet measurements. We utilize the state space representation of the Gaussian process with a Matérn function having a half-integer roughness parameter, which reduces the cost of computing the likelihood and making predictions from $\mathcal{O}(n^{3})$ to $\mathcal{O}(n)$ operations without approximation, with $n$ being the number of observations. The acceleration with no approximation enables the approach to be computationally feasible with a massive number of crowdsourced measurements. Both linear and nonlinear analyses suggest that internet speed gradually improves over time.

¹ In this paper, we use the terms “internet speed” and “measured internet speed” interchangeably. In either case, we are specifically referring to the download speed of the connection to the user end device as measured through an Ookla Speedtest.

The rest of this paper is organized as follows. In Section 2, we introduce the two primary data sets used in this study. In Section 3, we investigate the regional sampling bias and the association with demographic variables. We introduce two ways for correcting the sampling bias and compare the difference of estimation between the original and bias-corrected samples. Section 4 introduces regression analysis of measured internet speed using demographic profiles at different census block groups for both the original and bias-corrected samples. Section 5 studies the change of measured internet speeds over time using both linear and nonlinear estimation. We conclude the paper in Section 6. We provide proofs and report additional results in the Appendix.

2 Description of data

We combine Ookla^® internet speed measurement data [11] and demographic data provided by the U.S. Census Bureau for our study. In the following subsections, we describe each dataset in detail.

2.1 Ookla’s Speedtest

Ookla’s Speedtest² (data provided through Ookla’s Speedtest Intelligence^®) has over 16k measurement servers worldwide [13] and allows users to assess the quality of their Internet connection using either a web-based portal or native mobile application [11]. For each Speedtest, a nearby test server is selected, and (potentially multiple) TCP connections are used to calculate the throughput of the path. Ookla’s Speedtest Intelligence dataset contains individual Speedtest measurements that include QoS metrics (up/down throughput, latency, packet loss, jitter), as well as meta-features such as ISP, device type, and access type. Ookla provides performance data aggregated over time and space to the public [12]. However, our Data Use Agreement (DUA) with Ookla provides us access to nearly 1 million individual Speedtest measurements from four major metropolitan cities in the U.S., which we use for this study. Each of these cities has a population between 370k and 650k.

² http://speedtest.net

We use the data obtained through our DUA to investigate the download speed data from two different device types: iOS and Android. We focus on these two device types due to the richness of metadata and geographical information associated with each data point, as well as due to the widespread popularity and usage of these platforms. For demonstration purposes, we use two representative cities for our detailed analysis. Due to our DUA, we maintain the confidentiality of these cities and only refer to them as City A and City B. We also provide numerical analysis of two additional cities, referred to as City C and City D, in Appendix B. Since precise geographic location is crucial for our study, we ensure each measurement includes latitude/longitude data recorded at the time the Speedtest is conducted; truncated GPS coordinates are reported for users who allow location sharing on their equipment. We utilize the geographic location of each Speedtest to associate the test with the relevant census block group. Table 1 shows summary statistics for the data used in this paper. We observe that in every listed city, iOS devices outnumber Android devices in our dataset, yielding roughly 1.7 times as many Speedtest data points from iOS as from Android overall.

Table 1: Overview of Ookla data.

City	Device type	# of unique devices	# of tests	Date range
A	iOS	23,649	165,229	May 31, 2020 – Dec 31, 2021
A	Android	8,438	73,094	May 31, 2020 – Dec 31, 2021
B	iOS	15,507	92,526	May 31, 2020 – Dec 31, 2021
B	Android	8,553	70,039	May 31, 2020 – Dec 31, 2021
C	iOS	34,005	213,770	Jan 1, 2021 – Dec 31, 2021
C	Android	14,730	117,806	Jan 1, 2021 – Dec 31, 2021
D	iOS	20,607	146,703	Jan 1, 2021 – Dec 31, 2021
D	Android	11,122	98,934	Jan 1, 2021 – Dec 31, 2021

2.2 Demographic data from the U.S. Census Bureau

To study demographic trends with performance data, we leverage data from the U.S. Census Bureau. In particular, we select the 2019 Census demographic information due to the inaccuracy of data since the COVID-19 pandemic [8]. For City A and B, we obtain population data for each census block group as well as the following demographic attributes: income, age, sex, education level, internet penetration rate, and ethnicity distribution. Income and age are represented by the median household income and median age, respectively. Sex information is characterized by the proportion of males within the total population of each census block group. The level of education is quantified as the proportion of individuals with a bachelor’s degree or higher among the population aged 25 and older. The internet penetration rate is the percentage of households with an internet subscription out of the total number of households of each census block group. Ethnic distribution is summarized by the proportion of white, black, Asian, and Hispanic populations within the total population of each census block group.

The demographic information from the U.S. Census Bureau is only available at the level of the census block group. Therefore, we build an integrated dataset with a spatial granular structure, where speed measurements identified in each census block group are correlated with the demographic profiles for that census block group based on U.S. Census Bureau data.

3 Regional sampling bias detection and correction

In this section, we explore the presence of sampling bias across regions within cities A and B. In the context of a given city, a “region” refers to a census block group, which constitutes a sub-area within that city. If the distribution of the number of samples does not align proportionally with the population, then we say there exists regional sampling bias within our sample, which can render our statistical analysis divergent from the truth at the city level.

3.1 Comparison of sample and population distributions at the census block group level

Each Ookla Speedtest measurement is associated with a certain census block group based on the geographic location of the test. By comparing the distribution of the sample sizes across census block groups with their population distribution, we can test whether the sample sizes are homogeneous with the population across regions. As described in Section 2, the population of each census block group is based on data from the U.S. Census Bureau. We plot the cumulative distributions and probability mass functions of the sample and the population at the level of census block group in Fig. 1 for Cities A and B, where the census block groups are ordered according to their population sizes. We observe that there is a recognizable discrepancy between the distribution of sample sizes from both Android and iOS in Ookla Speedtests and the population at each census block group. The comparison suggests that there is a regional sampling bias in the Speedtest measurements, as certain census block groups contain disproportionately larger samples than others.

Figure 1: Comparison of the distributions of population and sample sizes. Census block groups are ordered left to right from largest to smallest population. Cumulative distribution functions are represented for City A and B in (a) and (b), respectively, while probability mass functions for each city are in (c) and (d), respectively. The x-axis indicates the rank of the population size in each census block group, and the census block group with the largest population has rank 1.

We utilize the chi-squared test, one of the most frequently used tests for homogeneity of the samples, to quantify whether the discrepancy between the population and sample size distributions across census block groups is significant. The two assumptions of the test are summarized as follows:

Assumption 1.

The internet speed tests are independent within and between census block groups.

Assumption 2.

The internet speed tests are identically distributed within each census block group, and they are representative for each census block group.

Assumption 1 approximately holds in the data set, but it can be affected by internet traffic and availability. For instance, an internet outage may affect more than one census block group, so that the affected speed tests are no longer independent. Assumption 2 can also be violated, because certain groups of internet users may be more likely to utilize Speedtest; the relationship between internet speed and demographic profiles of the census block groups will be studied in Section 4.1.

We collect the sample of Ookla Speedtests from $k$ census block groups. For the $i$ -th census block group, define $n_{i}$ and $N_{i}$ as the sizes of its sample and population, respectively, for $i=1,\cdots,k$ . Furthermore, let $n$ and $N$ respectively be the total number of samples and the population among all the regions, i.e., $n=\sum_{i=1}^{k}n_{i}$ and $N=\sum_{i=1}^{k}N_{i}$ . When the samples are drawn according to the population distribution, i.e., the null distribution, the expected number of samples in the $i$ -th region is $\frac{nN_{i}}{N}$ . Then, the $\chi^{2}$ test statistic is defined by:

W=\sum_{i=1}^{k}\frac{(n_{i}-nN_{i}/N)^{2}}{nN_{i}/N}.

(1)

Under the null hypothesis, where there is no regional sampling bias across different census block groups, $W\sim\chi^{2}_{k-1}$ , i.e., the test statistic $W$ is distributed as a chi-squared distribution with $k-1$ degrees of freedom.
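The statistic in (1) is straightforward to compute from per-region counts. The following is a minimal sketch in standard-library Python, not the authors' implementation; the function name `chi2_homogeneity_statistic` and the toy counts are hypothetical and chosen only for illustration.

```python
def chi2_homogeneity_statistic(sample_counts, populations):
    """Return (W, degrees of freedom) for the statistic in Eq. (1).

    sample_counts[i] plays the role of n_i and populations[i] of N_i.
    Hypothetical helper for illustration only.
    """
    if len(sample_counts) != len(populations):
        raise ValueError("need one sample count and one population per region")
    n = sum(sample_counts)  # total number of tests, n
    N = sum(populations)    # total population, N
    W = 0.0
    for n_i, N_i in zip(sample_counts, populations):
        expected = n * N_i / N  # expected count under the null hypothesis
        W += (n_i - expected) ** 2 / expected
    return W, len(sample_counts) - 1


# Toy example with made-up numbers: three regions of equal population,
# but an uneven number of tests per region.
W, df = chi2_homogeneity_statistic([50, 30, 20], [1000, 1000, 1000])
```

The p-value is the upper-tail probability of a $\chi^{2}_{k-1}$ distribution at $W$; if SciPy is available, one option is `scipy.stats.chi2.sf(W, df)`.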

Table 2: Results of the $\chi^{2}$ homogeneity test.

City	# of regions	Device type	Test statistic $W$	p-value
A	465	iOS	342,661	$<10^{-16}$
A	465	Android	169,012	$<10^{-16}$
B	868	iOS	398,812	$<10^{-16}$
B	868	Android	167,411	$<10^{-16}$

We find that the disparity observed in Fig. 1 is in line with the results from the statistical testing as shown in Table 2. The p-values for both iOS and Android devices in Cities A and B are below $10^{-16}$ , suggesting substantial inhomogeneity in the sample sizes relative to the population across regions.

3.2 Cumulative distribution functions of internet speed: before and after regional bias correction

Because we found significant evidence of the Speedtest distribution being disproportionate across regions, we introduce two ways to correct the regional sampling bias. By comparing the estimation of the cumulative distribution function from the original and bias-corrected samples, we can assess whether this sampling bias affects the estimation of the distribution of internet speed.

To delineate the empirical CDF of internet speed, let $y_{ij}$ represent the internet speed measurement from the $j$ -th unit of the $i$ -th census block group for $j=1,\cdots,n_{i}$ and $i=1,\cdots,k$ . Since our estimation covers $k$ different census block groups, we define $y$ as a simple random sample of internet speed from a collection of these $k$ census block groups. Based on Assumptions 1 and 2, the simple random sample $y$ can be represented by a mixture of $k$ random variables $y_{1},y_{2},y_{3},\cdots,y_{k}$ , which denote the internet speed measurement from the 1st, 2nd, 3rd, $\cdots,$ and $k$ -th census block group, respectively. Without any regional sampling bias, a simple random sample $y$ of internet speed can be expressed as

y=\sum_{i=1}^{k}y_{i}I(z_{i}=1),\ \text{with }\mathbf{z}=(z_{1},\cdots,z_{k})^{T}\sim\text{Multinom}\left(1,(N_{1}/N,\cdots,N_{k}/N)\right),

(2)

where $I(\cdot)$ refers to an indicator function, $z_{i}$ takes the value 0 or 1 for $i=1,\cdots,k$ , $N=\sum_{i=1}^{k}N_{i}$ indicates the total population over all $k$ census block groups, and $\sum_{i=1}^{k}z_{i}=1$ . A simple random sample is critical for inferring the population, and sampling bias can lead to misleading conclusions, such as those in the 2016 U.S. presidential election [29].

For a given internet speed $x$ , we are interested in the cumulative probability of internet speed for the entire city:

F(x)=\mathbb{P}\left[y\leq x\right].

(3)

Similarly, we define $F_{i}(\cdot)$ to be the cumulative distribution at the $i$ -th census block group, for $i=1,...,k$ . The cumulative probability in (3) is a weighted sum of the regional cumulative probability by the corresponding population, as shown in Lemma 1.

Lemma 1.

The probability in (3) is equal to the weighted sum of $F_{i}(x)$ , where $F_{i}(x)=\mathbb{P}\left[y_{i}\leq x\right]$ for $i=1,\cdots,k$ , by its population proportion, i.e.

F(x)=\sum_{i=1}^{k}\frac{N_{i}}{N}F_{i}(x).

(4)

The proof is given in Appendix A.1. To estimate the cumulative distribution of the internet speed of a city given in (3), a conventional approach is to pool all data points and use the empirical CDF:

\hat{F}_{b}(x)=\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}I(y_{ij}\leq x),

(5)

where $n=\sum_{i=1}^{k}n_{i}$ is the total number of samples. The subscript ‘ $b$ ’ denotes bias, as this empirical CDF in (5) is a weighted sum of regional empirical CDFs by corresponding sample proportions:

\hat{F}_{b}(x)=\sum_{i=1}^{k}\frac{n_{i}}{n}\hat{F}_{i}(x),

(6)

where $\hat{F}_{i}(x)=\frac{1}{n_{i}}\sum_{j=1}^{n_{i}}I(y_{ij}\leq x)$ is the simple empirical CDF of the $i$ -th census block group for $i=1,\cdots,k$ . Note that $\hat{F}_{i}(x)$ converges to the CDF of the census block group $i$ when the sample size goes to infinity, whereas $\hat{F}_{b}(x)$ typically does not converge to the CDF in (4) when the sample ratio $n_{i}/n$ does not converge to the population ratio $N_{i}/N$ for some census block group $i$ .

3.2.1 Re-weighing the sample for bias correction

A direct approach to correct the regional sampling bias is to construct an unbiased estimator for (3) simply by re-weighing the regional empirical CDFs. The revision of weights shall be based on the population proportions so that the unbiased estimator $\hat{F}_{u}$ for (3) is given by:

\hat{F}_{u}(x)=\sum_{i=1}^{k}\frac{N_{i}}{N}\hat{F}_{i}(x).

(7)

One can show that $\mathbb{E}\left[\hat{F}_{u}(x)\right]=\sum_{i=1}^{k}\frac{N_{i}}{N}F_{i}(x)=F(x)$ , meaning that $\hat{F}_{u}(x)$ is an unbiased estimator of the CDF for the entire city when the samples at each census block group are simple random samples. The proof of unbiasedness of Equation (7) is provided in Appendix A.2.
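To make the difference between the two weightings concrete, the sketch below evaluates the pooled empirical CDF in (5) and (6) and the re-weighted estimator in (7) on a toy data set. It is an illustrative sketch under stated assumptions, not the authors' code: the function names and the numbers are hypothetical, and every region is assumed to contain at least one measurement.

```python
def ecdf(values, x):
    """Regional empirical CDF: share of measurements in one region <= x."""
    return sum(v <= x for v in values) / len(values)


def pooled_ecdf(samples_by_region, x):
    """Eq. (5)/(6): regions are weighted by their sample share n_i / n."""
    n = sum(len(s) for s in samples_by_region)
    return sum(len(s) / n * ecdf(s, x) for s in samples_by_region)


def reweighted_ecdf(samples_by_region, populations, x):
    """Eq. (7): regions are weighted by their population share N_i / N."""
    N = sum(populations)
    return sum(N_i / N * ecdf(s, x)
               for s, N_i in zip(samples_by_region, populations))


# Toy data (made-up): region 1 is slow and over-sampled,
# region 2 is fast, more populous, and under-sampled.
speeds = [[10, 20, 30, 40], [100, 200]]
populations = [1000, 3000]

pooled = pooled_ecdf(speeds, 50)                       # 4/6 * 1 + 2/6 * 0
reweighted = reweighted_ecdf(speeds, populations, 50)  # 1/4 * 1 + 3/4 * 0
```

In this toy case the pooled estimate of the share of speeds at or below 50 is about 0.67, while the population-weighted estimate is 0.25, because the slow region contributes two thirds of the tests but only a quarter of the population.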

3.2.2 Re-sampling for bias correction

In some scenarios, having an unbiased or representative sample is important. Thus we introduce a re-sampling approach to provide a set of unbiased samples by correcting the regional sampling bias in the original sample. For the $i$ -th census block group, we determine the expected number $n_{i}^{*}$ in the re-sampled data to be:

n_{i}^{*}=\left[\frac{nN_{i}}{N}\right],\ \mbox{ for }i=1,\cdots,k,

(8)

where $[x]$ denotes the integer nearest to a given real number $x$ . In other words, we set the number of re-sampled measurements in each census block group according to the census block group’s proportion of the population. Then, for the $i$ -th census block group, we draw $n_{i}^{*}$ internet speed measurements at random with replacement from its original sample. Denote $y_{i1}^{r},\cdots,y_{in_{i}^{*}}^{r}$ as the re-sampled internet speed measurements at the $i$ -th census block group for $i=1,\cdots,k$ . Then, one can define the estimated CDF at a given internet speed $x$ from the re-sampled data as:

\hat{F}^{*}_{u}(x)=\frac{1}{n^{*}}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}^{*}}I\left(y_{ij}^{r}\leq x\right)=\sum_{i=1}^{k}\frac{n_{i}^{*}}{n^{*}}\hat{F}^{*}_{i}(x),

(9)

where $n^{*}=\sum_{i=1}^{k}n_{i}^{*}$ and $\hat{F}^{*}_{i}(x)=\frac{1}{n_{i}^{*}}\sum_{j=1}^{n_{i}^{*}}I\left(y_{ij}^{r}\leq x\right)$ . The following lemma shows that both the re-sampled estimator in (9) and the re-weighted estimator in (7) converge to the true CDF of the internet speed in a city when the sample size goes to infinity.

Lemma 2.

Assume that Assumptions 1 and 2 hold and that the number $k$ of census block groups and population $N_{i}$ for each $i=1,\cdots,k$ remain fixed as constants. Then, for a given $x\in\mathbb{R}$ , we have

\hat{F}_{u}(x)\xrightarrow[]{\mathbb{P}}F(x)\quad\text{and}\quad\hat{F}^{*}_{u}(x)\xrightarrow[]{\mathbb{P}}F(x),

(10)

as $n_{i}\rightarrow\infty$ for each $i=1,\cdots,k$ so that $n\rightarrow\infty$ with $\xrightarrow[]{\mathbb{P}}$ denoting convergence in probability.

Lemma 2 is proved in Appendix A.3. The assessment of the uncertainty of the different methods for estimating the CDF is provided in Appendix A.4. According to Lemma 2, as long as Assumptions 1 and 2 hold, the re-weighted estimator in (7) and the re-sampled estimator in (9) are both consistent for estimating the CDF of internet speed of a city with a sufficiently large sample for each census block group. This enables us to recover a set of data with regional sampling bias adjusted so that we can easily generate reference results from various sorts of analysis by applying the same analytic tactics to re-sampled data.
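The re-sampling procedure of Section 3.2.2 can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names, the seed, and the toy data are hypothetical, each region is assumed to have at least one measurement, and the empirical CDF of the re-sampled data is normalized by $n^{*}$ so that it takes values in $[0,1]$.

```python
import random


def resample_by_population(samples_by_region, populations, seed=0):
    """Draw n_i* = [n * N_i / N] measurements with replacement per region.

    Hypothetical helper. Python's round() sends exact .5 ties to the
    nearest even integer, which is one possible reading of "nearest integer".
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    n = sum(len(s) for s in samples_by_region)
    N = sum(populations)
    resampled = []
    for region_samples, N_i in zip(samples_by_region, populations):
        n_star = round(n * N_i / N)  # target size from Eq. (8)
        resampled.append(rng.choices(region_samples, k=n_star))
    return resampled


def resampled_ecdf(resampled, x):
    """Pooled empirical CDF of the re-sampled data, normalized by n*."""
    n_star_total = sum(len(r) for r in resampled)
    hits = sum(v <= x for r in resampled for v in r)
    return hits / n_star_total


# Toy data (made-up): region 1 holds 4 of 6 tests but 1/3 of the population.
speeds = [[10, 20, 30, 40], [100, 200]]
populations = [1000, 2000]

resampled = resample_by_population(speeds, populations)
share_below_50 = resampled_ecdf(resampled, 50)
```

After re-sampling, region 1 contributes 2 of the 6 draws and region 2 contributes 4, matching their population shares, so any analysis written for the original sample can be rerun unchanged on `resampled`.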

In Fig. 2, we compare three different CDFs: the empirical CDF from the original samples in (6), the re-weighted empirical CDF in (7), and the empirical CDF from the re-sampled data in (9). We find that the estimates from the three methods are not identical, but the differences are much smaller than the discrepancy found from regional sampling bias, as shown in Fig. 1. This result indicates that even though the Ookla data contains substantial regional sampling bias among census block groups, such bias does not have a large impact on estimating the overall distribution of the internet speed in a city.

3.3 Association of regional bias with demographic variables

While the impact of regional sampling bias over different census block groups on the estimation of the CDF is not as pronounced as the regional bias itself, this impact depends on whether the samples within the census block groups are representative and independent, as outlined in Assumptions 1-2. These assumptions may not strictly hold in practice. For instance, if a certain demographic group is over-sampled, the samples may not be representative in each census block group, violating Assumption 2. Thus, it is crucial to discover potential sampling bias among other sub-groups. We collect the demographic variable data for each census block group for the following variables: income, age, gender, educational level, internet penetration rate, and ethnicity. The variables that characterize the demographic profile of the $i$ -th census block group, for each $i=1,\cdots,k$ , are summarized in Table 3.

Table 3: List of demographic variables.

Notation	Description
$x_{1i}$	median household income
$x_{2i}$	median age
$x_{3i}$	% of male people
$x_{4i}$	% of people with bachelor’s or higher degree
$x_{5i}$	% of households with internet subscription plans
$x_{6i}$	% of people who are identified as white
$x_{7i}$	% of people who are identified as black
$x_{8i}$	% of people who are identified as Asian
$x_{9i}$	% of people who are identified as Hispanic

3.3.1 Characterization and comparison of regions

For each demographic feature $x_{hi}$ where $h=1,\cdots,9$ , we have $k$ corresponding values with the index set $\{1,2,3,\cdots,k\}$ as we collect samples from $k$ different census block groups. For each index $i$ where $i=1,\cdots,k$ , we classify the census block group into two mutually exclusive groups:

• the $i$ -th census block group is called over-sampled if $n_{i}>n^{*}_{i}$ ; and
• the $i$ -th census block group is called under-sampled if $n_{i}<n^{*}_{i}$ .

We note that there is no census block group where $n_{i}=n_{i}^{*}$ in either City A or City B, for $i=1,\cdots,k$ . Let $\{o_{1},o_{2},\cdots,o_{k_{o}}\}$ and $\{u_{1},u_{2},\cdots,u_{k_{u}}\}$ denote the sets of indices for over-sampled and under-sampled regions, respectively (‘ $o$ ’ for over-sampled and ‘ $u$ ’ for under-sampled), such that $\{o_{1},o_{2},\cdots,o_{k_{o}}\}\cup\{u_{1},u_{2},\cdots,u_{k_{u}}\}=\{1,\cdots,k\}$ , $\{o_{1},o_{2},\cdots,o_{k_{o}}\}\cap\{u_{1},u_{2},\cdots,u_{k_{u}}\}=\emptyset$ , and $k_{o}+k_{u}=k$ . Accordingly, for the $h$ -th demographic feature with $h=1,\cdots,9$ , we have two separate groups, denoted as $\{x_{ho_{1}},x_{ho_{2}},\cdots,x_{ho_{k_{o}}}\}$ and $\{x_{hu_{1}},x_{hu_{2}},\cdots,x_{hu_{k_{u}}}\}$ for over-/under-sampled census block groups, respectively.

For the $h$ -th demographic variable with $h=1,\cdots,9$ , define $\bar{x}_{ho}=\frac{1}{k_{o}}\sum_{i=1}^{k_{o}}x_{ho_{i}}$ , and $\bar{x}_{hu}=\frac{1}{k_{u}}\sum_{i=1}^{k_{u}}x_{hu_{i}}$ , which represent the sample mean of the over-/under-sampled regions, respectively. To evaluate the difference in location between the two distributions, we apply a two-sample t-test with the test statistic $T$ given by:

T=\frac{\bar{x}_{ho}-\bar{x}_{hu}}{\sqrt{s^{2}\left(1/k_{o}+1/k_{u}\right)}},

(11)

where $s^{2}=\left(\sum_{i=1}^{k_{o}}(x_{ho_{i}}-\bar{x}_{ho})^{2}+\sum_{i=1}^{k_{u}}(x_{hu_{i}}-\bar{x}_{hu})^{2}\right)/(k-2)$ . Under the null hypothesis where the distributions of over-/under-sampled census block groups have the same location, the statistic $T\sim t_{k-2}$ , a Student’s t-distribution [19] with $k-2$ degrees of freedom.
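As a worked illustration of (11), the sketch below splits the regions into over-sampled and under-sampled groups and computes the pooled-variance statistic $T$ for one demographic variable. The helper names and the toy values are hypothetical and not taken from the paper; each group is assumed to be non-empty.

```python
from statistics import mean


def split_by_sampling(feature, sample_counts, target_counts):
    """Split one demographic variable into over- and under-sampled regions.

    feature[i] is x_hi, sample_counts[i] is n_i and target_counts[i] is n_i*.
    Hypothetical helper; regions with n_i == n_i* fall in neither group.
    """
    rows = list(zip(feature, sample_counts, target_counts))
    over = [x for x, n_i, n_star in rows if n_i > n_star]
    under = [x for x, n_i, n_star in rows if n_i < n_star]
    return over, under


def pooled_t_statistic(over, under):
    """Return (T, degrees of freedom) for the statistic in Eq. (11)."""
    k_o, k_u = len(over), len(under)
    mean_o, mean_u = mean(over), mean(under)
    sum_sq = (sum((x - mean_o) ** 2 for x in over)
              + sum((x - mean_u) ** 2 for x in under))
    s2 = sum_sq / (k_o + k_u - 2)  # pooled variance s^2
    T = (mean_o - mean_u) / (s2 * (1 / k_o + 1 / k_u)) ** 0.5
    return T, k_o + k_u - 2


# Toy example (made-up): six regions, one demographic variable.
feature = [3, 4, 5, 1, 2, 3]
n_obs = [12, 15, 11, 4, 6, 5]      # observed tests per region, n_i
n_target = [9, 9, 9, 9, 9, 8]      # population-proportional targets, n_i*

over, under = split_by_sampling(feature, n_obs, n_target)
T, df = pooled_t_statistic(over, under)
```

Here the group means are 4 and 2 and the pooled variance is 1, so $T=2/\sqrt{2/3}=\sqrt{6}\approx 2.45$ with 4 degrees of freedom. If SciPy is available, `scipy.stats.ttest_ind` with `equal_var=True` computes the same pooled-variance statistic together with its p-value.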

In Fig. 3 and 4, we compare the distributions of demographic variables between over-sampled and under-sampled regions by boxplots, where the stars represent the significance level from the two-sample t-test. In both cities, we find that the sampling of Ookla’s Speedtest is indeed inhomogeneous over the demographic variables. Specifically, we observe in both cities that over-sampled census block groups, for either iOS or Android devices, tend to have a greater age and a larger proportion of individuals with bachelor’s degrees or higher. However, City A (part (a) in Fig. 3 and 4) presents more prominent asymmetry than City B in many aspects. First, we find in City A that the over-sampled census block groups dominantly have higher income, a larger proportion of the population with a bachelor’s degree or higher, and a greater prevalence of households with internet subscriptions. Furthermore, the comparison of Cities A and B reveals the ethnic variety within and between cities. City A shows the contrast between census block groups with a higher percentage of white or black residents, whereas most of the census block groups in City B have a high representation of black residents. This contrast in City A has a relationship with Speedtest sampling; over-sampled regions tend to have a significantly higher representation of white residents and a lower representation of black residents, which is not the case in City B. Additionally, the over-sampled regions for iOS devices in City A tend to have more Hispanic population whereas the difference is not significant in City B.

4 Correlating internet quality with demographic variables

Section 3.3 showed that Ookla Speedtests from our cities under study are unevenly sampled across different demographic groups. This result is important if there is a relationship between internet quality (e.g., speed) and demographic variables. Hence we now ask the question, does internet access quality have a relationship to demographic variables? To answer this question, we correlate the internet speed with the demographic profiles in each census block group for both the original data set and the re-sampled data set after correcting the regional sampling bias. Because demographic profiles of individuals who conducted Speedtests are not available, we will integrate the Speedtests and demographic variables at two different spatial granular levels in the next section.

4.1 Multiple linear regression

We begin with a multiple linear regression model that relates the internet speed $y_{ij}$, measured in Speedtest $j$ at census block group $i$, to the demographic variables $\mathbf{x}_{i}$:

y_{ij}=\mathbf{x}_{i}^{T}\boldsymbol{\beta}+\epsilon_{ij},

(12)

where $\mathbf{x}_{i}=(1,x_{1i},x_{2i},\cdots,x_{pi})^{T}$ denotes a $(p+1)$-dimensional vector of predictors (such as income, age, and level of education) with $p=9$, $\boldsymbol{\beta}=(\beta_{0},\beta_{1},\cdots,\beta_{p})^{T}\in\mathbb{R}^{p+1}$ is a vector of linear coefficients, and $\epsilon_{ij}\sim\mathcal{N}(0,\sigma^{2})$ denotes Gaussian white noise with variance $\sigma^{2}$, for $j=1,\cdots,n_{i}$ and $i=1,\cdots,k$. Note that the observations from the $i^{th}$ census block group, denoted as $y_{i1},\cdots,y_{in_{i}}$, share the same predictor vector $\mathbf{x}_{i}$ for $i=1,\cdots,k$. It follows that the model (12) is equivalent to:

\mathbf{y}=\mathbf{V}\boldsymbol{\beta}+\boldsymbol{\epsilon},

(13)

where $\mathbf{y}=(y_{11},\cdots,y_{1n_{1}},\cdots,y_{k1},\cdots,y_{kn_{k}})^{T}\in\mathbb{R}^{n}$ with $n=\sum_{i=1}^{k}n_{i}$, $\mathbf{V}=[\mathbf{x}_{1}\mathbf{1}_{n_{1}}^{T},\mathbf{x}_{2}\mathbf{1}_{n_{2}}^{T},\cdots,\mathbf{x}_{k}\mathbf{1}_{n_{k}}^{T}]^{T}\in\mathbb{R}^{n\times(p+1)}$, $\mathbf{1}_{d}$ denotes a $d$-dimensional column vector with all elements being 1, and $\boldsymbol{\epsilon}=(\epsilon_{11},\cdots,\epsilon_{1n_{1}},\cdots,\epsilon_{k1},\cdots,\epsilon_{kn_{k}})^{T}\sim\mathcal{MN}(\mathbf{0},\sigma^{2}\mathbf{I}_{n})$ with $\mathbf{I}_{n}$ being an $n$-dimensional identity matrix.

Because model (12) involves two different spatial granular levels, we can consider an aggregated version of the model that equalizes the granular levels without affecting the estimation of $\boldsymbol{\beta}$:

\bar{y}_{i}=\mathbf{x}^{T}_{i}\boldsymbol{\beta}+\epsilon_{i},

(14)

where $\bar{y}_{i}=\frac{1}{n_{i}}\sum_{j=1}^{n_{i}}y_{ij}$ , and $\epsilon_{i}$ independently follows $\mathcal{N}(0,\sigma^{2}/n_{i})$ for $i=1,\cdots,k$ . Model (14) can be written as a matrix form:

\bar{\mathbf{y}}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}_{\text{agg}},

(15)

where $\bar{\mathbf{y}}=(\bar{y}_{1},\cdots,\bar{y}_{k})^{T}\in\mathbb{R}^{k}$, $\mathbf{X}=[\mathbf{x}_{1},\cdots,\mathbf{x}_{k}]^{T}\in\mathbb{R}^{k\times(p+1)}$, $\boldsymbol{\epsilon}_{\text{agg}}=(\epsilon_{1},\cdots,\epsilon_{k})^{T}\sim\mathcal{MN}(\mathbf{0},\sigma^{2}\mathbf{W}^{-2})$, and $\mathbf{W}$ is a $k\times k$ diagonal matrix with diagonal entries $\sqrt{n_{1}},\cdots,\sqrt{n_{k}}$. We have the following lemma that connects the individual-level model (12) and the aggregated-level model (14).

Lemma 3.

Suppose our predictor vectors are given, i.e., we know the vector $\mathbf{x}_{i}$ for $i=1,\cdots,k$ . Let $l$ and $\bar{l}$ be the natural logarithm of the likelihood of model (12) and (14), respectively. Then, we have:

l(\boldsymbol{\beta},\sigma^{2})=\bar{l}(\boldsymbol{\beta},\sigma^{2})+c_{\sigma^{2}},

(16)

where $\bar{l}(\boldsymbol{\beta},\sigma^{2})=-k\log\sqrt{2\pi\sigma^{2}}+\sum_{i=1}^{k}\log\sqrt{n_{i}}-\sum_{i=1}^{k}n_{i}(\bar{y}_{i}-\mathbf{x}_{i}^{T}\boldsymbol{\beta})^{2}/(2\sigma^{2})$ and $c_{\sigma^{2}}=-(n-k)\log\sqrt{2\pi\sigma^{2}}-\sum_{i=1}^{k}\log\sqrt{n_{i}}-\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}(y_{ij}-\bar{y}_{i})^{2}/(2\sigma^{2})$.

Lemma 3 is derived in Appendix A.5. Equation (16) in Lemma 3 means that the difference between the log-likelihoods of models (12) and (14) does not depend on the linear coefficients $\boldsymbol{\beta}$ and depends only on the variance $\sigma^{2}$. Thus, for a given $\sigma^{2}$, the aggregated data $\bar{\mathbf{y}}=(\bar{y}_{1},\cdots,\bar{y}_{k})^{T}$ form a sufficient statistic for the linear coefficients [19]. Note that the maximum likelihood estimator of the linear coefficients does not depend on $\sigma^{2}$, whereas the uncertainty of the estimation depends on the noise variance.
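The equivalence stated in Lemma 3 can be checked numerically. The sketch below is a simplified illustration with a single predictor ($p=1$) rather than the nine predictors used in our analysis: it fits the individual-level model by ordinary least squares and the aggregated model by weighted least squares with weights $n_{i}$, and the two coefficient estimates coincide. The function names and data values are hypothetical.

```python
def ols_individual(x, groups):
    """OLS fit of y_ij = b0 + b1 * x_i + e_ij using every individual test."""
    xs = [xi for xi, g in zip(x, groups) for _ in g]  # repeat x_i for each test
    ys = [y for g in groups for y in g]
    n = len(ys)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
          / sum((a - x_bar) ** 2 for a in xs))
    return y_bar - b1 * x_bar, b1


def wls_aggregated(x, groups):
    """WLS fit on the group means, weighting group i by its sample size n_i."""
    w = [len(g) for g in groups]
    means = [sum(g) / len(g) for g in groups]
    n = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / n
    y_bar = sum(wi * mi for wi, mi in zip(w, means)) / n
    b1 = (sum(wi * (xi - x_bar) * (mi - y_bar) for wi, xi, mi in zip(w, x, means))
          / sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x)))
    return y_bar - b1 * x_bar, b1


# Hypothetical data: one covariate per block group, unequal numbers of tests.
income = [30.0, 55.0, 80.0]
speeds = [[50.0, 62.0, 58.0], [90.0, 110.0], [150.0, 130.0, 145.0, 160.0]]
print(ols_individual(income, speeds))
print(wls_aggregated(income, speeds))
```

Both calls return the same intercept and slope up to floating-point error, since the cross-product sums over individual tests reduce to $n_{i}$-weighted sums over group means.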

In practice, however, the variance parameter $\sigma^{2}$ of the noise is also unknown. One may be tempted to use the aggregated model (15) to estimate $\sigma^{2}$ . The lemma below indicates that the estimation of the variance of the noise by individual-level data is more efficient than that by the aggregated data.

Lemma 4.

Define $\hat{\sigma}^{2}=\mathbf{y}^{T}(\mathbf{I}_{n}-\mathbf{J})\mathbf{y}/(n-p-1)$ and $\hat{\sigma}^{2}_{\text{agg}}=\bar{\mathbf{y}}^{T}\mathbf{W}(\mathbf{I}_{k}-\mathbf{H})\mathbf{W}\bar{\mathbf{y}}/(k-p-1)$, where $\mathbf{J}=\mathbf{V}(\mathbf{V}^{T}\mathbf{V})^{-1}\mathbf{V}^{T}$ and $\mathbf{H}=\mathbf{W}\mathbf{X}(\mathbf{X}^{T}\mathbf{W}^{2}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{W}$, following the notation from (13) and (15). Note that both $\hat{\sigma}^{2}$ and $\hat{\sigma}^{2}_{\text{agg}}$ are unbiased for $\sigma^{2}$, i.e., $\mathbb{E}\left[\hat{\sigma}^{2}\right]=\mathbb{E}\left[\hat{\sigma}^{2}_{\text{agg}}\right]=\sigma^{2}$. However,

\text{Var}\left[\hat{\sigma}^{2}\right]=\frac{2\sigma^{4}}{n-p-1}<\frac{2\sigma^{4}}{k-p-1}=\text{Var}\left[\hat{\sigma}^{2}_{\text{agg}}\right],

as long as $n>k$ .
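A small Monte Carlo experiment can illustrate Lemma 4. The sketch below uses an intercept-only model ($p=0$), which is a simplifying assumption, so that the two estimators reduce to the usual sample variance of all tests and the $n_{i}$-weighted variance of the group means. All sizes and names are hypothetical.

```python
import random


def sigma2_individual(groups):
    """Unbiased variance estimate from all individual tests (p = 0)."""
    ys = [y for g in groups for y in g]
    n = len(ys)
    y_bar = sum(ys) / n
    return sum((y - y_bar) ** 2 for y in ys) / (n - 1)


def sigma2_aggregated(groups):
    """Unbiased variance estimate from the group means only (p = 0)."""
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand = sum(len(g) * m for g, m in zip(groups, means)) / n
    weighted_ss = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    return weighted_ss / (len(groups) - 1)


def simulate(reps=2000, k=5, n_i=40, sigma=1.0, seed=0):
    """Return (mean, variance) of each estimator over repeated data sets."""
    rng = random.Random(seed)
    ind, agg = [], []
    for _ in range(reps):
        groups = [[rng.gauss(0.0, sigma) for _ in range(n_i)] for _ in range(k)]
        ind.append(sigma2_individual(groups))
        agg.append(sigma2_aggregated(groups))

    def mean(v):
        return sum(v) / len(v)

    def var(v):
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)

    return mean(ind), var(ind), mean(agg), var(agg)


m_ind, v_ind, m_agg, v_agg = simulate()
print(f"individual: mean {m_ind:.3f}, variance {v_ind:.4f}")
print(f"aggregated: mean {m_agg:.3f}, variance {v_agg:.4f}")
```

With $n=200$ and $k=5$, the lemma gives variances of $2\sigma^{4}/199$ and $2\sigma^{4}/4$, so the simulated variance of the aggregated estimator should be far larger while both means stay close to $\sigma^{2}$.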

Thus, we adopt the formulation of model (12) to assess the linear relationship between internet download speed (Mbps) and demographic features. We apply the regression approaches for both the original samples and the re-sampled data studied in Section 3.2.2. Similarly, for the re-sampled data, we construct the linear regression as follows:

y_{ij}^{r}=\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{r}+\epsilon_{ij}^{r},

(17)

where $\boldsymbol{\beta}^{r}=(\beta_{0}^{r},\beta_{1}^{r},\cdots,\beta_{p}^{r})^{T}\in\mathbb{R}^{p+1}$, and $\epsilon_{ij}^{r}$ independently follows $\mathcal{N}(0,\sigma_{r}^{2})$ for $j=1,\cdots,n_{i}^{*}$ and $i=1,\cdots,k$.

4.2 Model selection

Instead of directly comparing the full models presented in (12) and (17), we employ a backward model selection procedure to select demographic variables that significantly impact measured internet speeds based on the Akaike Information Criterion (AIC) [17]. Let $\mathcal{M}_{m}$ denote a model with index $m$ , and define $d_{m}$ as the dimension of model $\mathcal{M}_{m}$ . The AIC for $\mathcal{M}_{m}$ is defined as follows:

\text{AIC}_{m}=-2\hat{l}_{n,m}+2d_{m},

(18)

where $\hat{l}_{n,m}$ represents the maximum log-likelihood of model $\mathcal{M}_{m}$ given $n$ observations. Starting with the full model with $p$ predictors, we consider the $p$ distinct submodels created by eliminating one variable at a time. For each submodel, we compute the corresponding AIC using (18). If any submodel has a smaller AIC than the current model, we select that submodel for the subsequent stage and repeat the same process. If no submodel has a smaller AIC, we take the current model as the final selected model.
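The backward procedure can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the code used for our analysis: the AIC penalizes each parameter by 2 (the standard convention), the model dimension counts the coefficients plus the noise variance, and when several submodels improve the AIC the one with the smallest AIC is chosen. The helper names and the synthetic data are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def gaussian_aic(rows, y, cols):
    """AIC of the Gaussian linear model with an intercept and the given columns."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    p = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2
              for d, yi in zip(design, y))
    n = len(y)
    # Maximized log-likelihood with the variance set to its MLE, rss / n.
    max_loglik = -0.5 * n * (math.log(2.0 * math.pi * rss / n) + 1.0)
    return -2.0 * max_loglik + 2.0 * (p + 1)  # p coefficients plus sigma^2


def backward_select(rows, y):
    """Drop one predictor at a time while doing so lowers the AIC."""
    cols = list(range(len(rows[0])))
    best = gaussian_aic(rows, y, cols)
    while cols:
        trials = [(gaussian_aic(rows, y, [c for c in cols if c != d]), d)
                  for d in cols]
        aic, drop = min(trials)
        if aic >= best:
            break  # no submodel improves on the current model
        best = aic
        cols.remove(drop)
    return cols, best


# Synthetic data: only column 0 influences the response.
random.seed(0)
rows = [[random.uniform(0, 10), random.gauss(0, 1), random.gauss(0, 1)]
        for _ in range(200)]
y = [3.0 + 2.0 * r[0] + random.gauss(0, 1) for r in rows]
kept, aic = backward_select(rows, y)
print("kept predictor columns:", kept)
```

Whether the pure-noise columns are dropped depends on the simulated data, because AIC retains an irrelevant predictor with non-negligible probability; the informative column is always retained in this setting.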

4.3 Disparity of internet quality among demographic groups

Fig. 5 provides estimated linear coefficients from the regression analysis conducted on both the original and re-sampled data after the backward elimination technique using AIC. It presents estimates of coefficients and their corresponding 95% confidence intervals for variables, whereas the variables dropped from the backward selection procedure are not shown. From the graphs, we can make multiple observations.

First, there is a significant negative correlation between measured internet speed and age in City A, as well as in City C and City D, which are shown in Fig. 12 in the Appendix. This means that internet speed is greater on average in census block groups with a younger population, given the other estimated variables. City B is an exception, as the effect of age does not seem to be clear.

Second, regions with a larger percentage of Hispanic population generally have lower measured internet speed for both cities and device types shown in Fig. 5. For the two cities shown in Fig. 12 in Appendix B.2, the effect of the percentage of the Hispanic population is not as clear as for the two cities shown in Fig. 5; however, after examining the pairwise correlation plots between the covariates in Figs. 14-15, we find that the percentage of the Hispanic population is negatively correlated with the percentage of residents with a bachelor’s degree and with the availability of the internet. Thus the effect of the Hispanic percentage can be partly absorbed by these two covariates. For instance, the coefficient of bachelor’s degree in the re-sampled data is significantly larger than zero in part (a) of Fig. 12, which implicitly suggests that the regions with a higher Hispanic percentage may have comparatively lower internet speed, as these regions tend to have a lower percentage of residents with bachelor’s degrees.

Third, we find that the regression coefficients for median income are positive for the original data of both device types in Cities A and B, meaning that the regions with higher income tend to have faster measured internet speed. This can likely be attributed to the availability of faster internet plans, as well as the higher purchasing power of local residents; prior work found that the median income of a census block group plays a critical role in determining whether a region gets a fiber deployment and consequently faster internet speeds [30]. The coefficients of median income in the re-sampled data are positive for City B but negative for City A, as shown in Fig. 5. In both cities, the linear coefficients of income in the re-sampled data are smaller than the ones from the original data. To further explore the difference, we examine the pairwise correlation between income and bachelor’s degree, which is strongly positive in both cities, as shown in Fig. 6. The estimated linear coefficients of the bachelor’s degree in the re-sampled data are larger than those in the original data for both devices and cities. Because the bachelor’s degree and income are strongly positively correlated, the larger coefficient of the bachelor’s degree accounts for part of the positive effect on internet speed, which makes the coefficients of income smaller in the re-sampled data. Note that in multiple linear regression an estimated coefficient represents the conditional effect of a covariate given all other covariates. The effect of a covariate typically depends on the effects of other variables, as multicollinearity of the covariates is common in practice [28].

The regression analysis of Speedtest data from the two cities shows that measured internet speed critically depends on demographic profiles of regions, such as the income, education level and ethnic composition. Future work should study the reasons behind these associations, such as the availability of faster internet plans, and their cost per bit, e.g. carriage value [30].

5 Temporal progression of internet speed

Estimating time-dependent internet speed is important for assessing the change of internet quality over time. In this section, we investigate the temporal trend of measured internet speed using both linear regression analysis and Gaussian processes for modeling time sequences. We also compare the original samples and the bias-corrected samples to evaluate whether the regional sampling bias affects the estimation of temporal analysis. For both cities, we have data from 05-31-2020 to 12-31-2021.

5.1 Assessing the linear trend of internet speed

We first analyze the linear trends of internet speed. Let $y(t_{i})$ denote the measured internet speed at time point $t_{i}$ for $i=1,\cdots,n$ ; i.e., we have $n$ distinct time points and each $t_{i}$ corresponds to a positive real number indicating the time lapse from the starting date measured by days. The linear regression model of internet speed with an intercept and time as a covariate is given by:

y(t_{i})=\beta_{0t}+\beta_{1t}t_{i}+\varepsilon_{i},

(19)

where $\varepsilon_{i}$ represents Gaussian white noise with variance $\sigma_{\varepsilon}^{2}$ for $i=1,\cdots,n$. We focus on the temporal change rate measured by the linear coefficient of time, $\beta_{1t}$, whose maximum likelihood estimator is the least squares estimator below:

\hat{\beta}_{1t}=\frac{\sum_{i=1}^{n}(t_{i}-\bar{t})(y(t_{i})-\bar{y})}{\sum_{i=1}^{n}(t_{i}-\bar{t})^{2}},

(20)

where $\bar{t}=\frac{1}{n}\sum_{i=1}^{n}t_{i}$ and $\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y(t_{i})$ .
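For concreteness, the estimator in (20) can be computed directly as in the short sketch below. The function name `ls_trend` and the example speeds are hypothetical; time is measured in days since the first observation, so the slope is in Mbps per day.

```python
def ls_trend(times, speeds):
    """Least squares intercept and slope of speed against time, as in Eq. (20)."""
    n = len(times)
    t_bar = sum(times) / n
    y_bar = sum(speeds) / n
    num = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, speeds))
    den = sum((t - t_bar) ** 2 for t in times)
    slope = num / den
    return y_bar - slope * t_bar, slope


# Hypothetical download speeds (Mbps) at five time points, in days.
days = [0.0, 30.0, 60.0, 90.0, 120.0]
speeds = [100.0, 104.0, 106.5, 110.0, 113.5]
intercept, slope = ls_trend(days, speeds)
print(f"estimated change: {slope:.4f} Mbps per day")
```

The same function applies unchanged to the re-sampled data by passing the re-sampled time points and speeds.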

To evaluate the impact of regional sampling bias, we obtain a re-sampled data set with bias-correction introduced in Section 3.2.2. Let $t_{i}^{r}$ denote the distinct time point after re-sampling, for $i=1,\cdots,n^{r}$ . We define $y(t_{i}^{r})$ as the re-sampled internet speed at time point $t_{i}^{r}$ . The linear regression model using the re-sampled data is given by:

y(t_{i}^{r})=\beta_{0t}^{r}+\beta_{1t}^{r}t_{i}^{r}+\varepsilon_{i}^{r},

(21)

where $\varepsilon_{i}^{r}$ represents a Gaussian white noise with variance $\sigma_{\varepsilon^{r}}^{2}$ for $i=1,\cdots,n^{r}$ . The maximum likelihood estimator of $\beta_{1t}^{r}$ in model (21) follows:

\hat{\beta}_{1t}^{r}=\frac{\sum_{i=1}^{n^{r}}(t_{i}^{r}-\bar{t}^{r})(y(t_{i}^{r})-\bar{y}^{r})}{\sum_{i=1}^{n^{r}}(t_{i}^{r}-\bar{t}^{r})^{2}},

(22)

where $\bar{t}^{r}=\frac{1}{n^{r}}\sum_{i=1}^{n^{r}}t_{i}^{r}$ and $\bar{y}^{r}=\frac{1}{n^{r}}\sum_{i=1}^{n^{r}}y(t_{i}^{r})$ .

There are two limitations of the linear regression analysis. First, the estimated linear coefficients in (19) and (21) can only capture the average change over a time period. Second, the model assumes that the residuals are independent over time, whereas the Speedtest measurements are temporally correlated. To address these limitations, we introduce Gaussian processes for modeling the time sequences and accelerate the computation through a state space representation without approximation.

5.2 Modeling the internet speed by state space models

Internet speeds are temporally correlated. A common way to model temporal or spatio-temporal data is with Gaussian processes (GPs) [33]. However, the complexity of computing the likelihood function and making predictions with GPs grows cubically with the number of observations, because the covariance matrix must be inverted and its log determinant computed. In our study, the number of measurements for each device in a city is between $10^{5}$ and $10^{6}$, which makes directly computing the likelihood by GPs prohibitively slow. Fortunately, GPs with some widely used covariance functions, such as the Matérn covariance function [23] with half-integer roughness parameters, can be equivalently represented by linear state space models, which makes the computational complexity grow linearly with the number of observations without making any approximations [24]. We briefly introduce a GP model of Speedtest measurements and relate it to the state space model for fast computation with the original Speedtest observations. The fast algorithm through the state space representation can be similarly applied to the regional bias-corrected samples.

Suppose any internet speed measurement is modeled by a noisy Gaussian process, meaning that any marginal distribution at time $\{t_{1},...,t_{n}\}$ follows a multivariate normal distribution $(y(t_{1}),...,y(t_{n}))^{T}\sim\mathcal{N}(\mathbf{0},\sigma^{2}(\mathbf{R}+\eta\mathbf{I}_{n}))$, where $\sigma^{2}$ and $\eta$ are variance and nugget parameters, respectively, and $\mathbf{R}$ is a correlation matrix with the $(i,j)$th term parameterized by a kernel function $K(t_{i},t_{j})$. Denote $d=|t-t^{\prime}|$ as the distance between any time points $t$ and $t^{\prime}$. We focus on the Matérn covariance function, which has the expression:

\sigma^{2}K(d)=\sigma^{2}\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}d}{\gamma}\right)^{\nu}\mathcal{K}_{\nu}\left(\frac{\sqrt{2\nu}d}{\gamma}\right),

(23)

where $\Gamma(\cdot)$ is the gamma function, $\mathcal{K}_{\nu}(\cdot)$ is the modified Bessel function of the second kind with a positive parameter $\nu$ and $\gamma$ is a range or lengthscale parameter of the correlation. The Matérn covariance has a closed form expression when the roughness parameter is a half-integer, $\nu=\frac{2m+1}{2}$ for $m\in\mathbf{N}$ . For instance, the Matérn with $\nu=5/2$ has the expression:

\sigma^{2}K(d)=\sigma^{2}\left(1+\frac{\sqrt{5}d}{\gamma}+\frac{5d^{2}}{3\gamma^{2}}\right)\exp\left(-\frac{\sqrt{5}d}{\gamma}\right).

(24)

An appealing feature of the Matérn covariance is that the process is $\lceil\nu-1\rceil$ times mean squared differentiable [21], so the smoothness of the process can be controlled by the roughness parameter.
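The closed form in (24) is straightforward to evaluate. The following sketch implements the Matérn correlation with roughness 5/2 (that is, (24) with $\sigma^{2}=1$) and builds the correlation matrix $\mathbf{R}$ for a few time points; the function names and the time points are hypothetical.

```python
import math


def matern52(d, gamma):
    """Matern correlation with roughness 5/2 and range parameter gamma."""
    a = math.sqrt(5.0) * abs(d) / gamma
    # a * a / 3 equals 5 d^2 / (3 gamma^2), the quadratic term in Eq. (24).
    return (1.0 + a + a * a / 3.0) * math.exp(-a)


def correlation_matrix(times, gamma):
    """Correlation matrix R with (i, j) entry K(|t_i - t_j|)."""
    return [[matern52(ti - tj, gamma) for tj in times] for ti in times]


# Hypothetical time points in days with a range parameter of 10 days.
for row in correlation_matrix([0.0, 1.0, 5.0, 20.0], gamma=10.0):
    print(["%.4f" % v for v in row])
```

The correlation equals 1 at distance zero and decays monotonically as the time lag grows relative to $\gamma$.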

Suppose we have internet speed measurements at $n$ time points, denoted as $\mathbf{y}=(y(t_{1}),...,y(t_{n}))^{T}$. A conventional method is to estimate the parameters by the maximum likelihood estimator. Note that given the range and nugget parameters $(\gamma,\eta)$, the maximum likelihood estimator for the variance is $\hat{\sigma}^{2}=S^{2}/n$, where $S^{2}=\mathbf{y}^{T}\mathbf{\tilde{R}}^{-1}\mathbf{y}$ with $\mathbf{\tilde{R}}=\mathbf{R}+\eta\mathbf{I}_{n}$. Plugging $\hat{\sigma}^{2}$ into the likelihood function, the profile likelihood [21] follows:

p(\mathbf{y}\mid\gamma,\eta,\hat{\sigma}^{2})\propto|\mathbf{\tilde{R}}|^{-1/2}|S^{2}|^{-n/2}.

(25)

The parameters can be obtained by maximizing the log profile likelihood: $(\hat{\gamma},\hat{\eta})=\operatorname{argmax}_{\gamma,\eta}\log(p(\mathbf{y}\mid\gamma,\eta,\hat{\sigma}^{2}))$. After obtaining the MLE, the predictive distribution at any time point $t$ follows a normal distribution:

(y(t)\mid\mathbf{y},\hat{\sigma}^{2},\hat{\eta},\hat{\gamma})\sim\mathcal{N}(\hat{y}(t),\hat{\sigma}^{2}K^{*}(t)),

(26)

where $\hat{y}(t)=\mathbf{r}^{T}(t)\mathbf{\tilde{R}}^{-1}\mathbf{y}$ with $\mathbf{r}(t)=(K(t,t_{1}),...,K(t,t_{n}))^{T}$ and $K^{*}(t)=K(t,t)+\eta-\mathbf{r}^{T}(t)\mathbf{\tilde{R}}^{-1}\mathbf{r}(t)$ . The predictive mean $\hat{y}(t)$ is often used for predicting $y(t)$ and the predictive intervals can be obtained from (26) for quantifying the uncertainty in prediction.
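The predictive distribution in (26) can be evaluated directly for a small number of observations, as in the sketch below. It is a dense implementation that solves the $n\times n$ system by Gaussian elimination, so it is meant only to illustrate the formulas and does not scale to the sample sizes in our study. The range and nugget parameters are treated as given rather than estimated, and all names and data values are hypothetical.

```python
import math


def matern52(d, gamma):
    """Matern correlation with roughness 5/2, Eq. (24) with sigma^2 = 1."""
    a = math.sqrt(5.0) * abs(d) / gamma
    return (1.0 + a + a * a / 3.0) * math.exp(-a)


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def gp_predict(times, y, t_new, gamma, eta):
    """Predictive mean and variance at t_new, plus the variance MLE S^2 / n."""
    n = len(times)
    r_tilde = [[matern52(ti - tj, gamma) + (eta if i == j else 0.0)
                for j, tj in enumerate(times)] for i, ti in enumerate(times)]
    alpha = solve(r_tilde, y)                                # R~^{-1} y
    sigma2_hat = sum(a * v for a, v in zip(alpha, y)) / n    # S^2 / n
    r = [matern52(t_new - ti, gamma) for ti in times]
    mean = sum(ri * ai for ri, ai in zip(r, alpha))
    # K*(t) = K(t, t) + eta - r^T R~^{-1} r, with K(t, t) = 1.
    k_star = 1.0 + eta - sum(ri * vi for ri, vi in zip(r, solve(r_tilde, r)))
    return mean, sigma2_hat * k_star, sigma2_hat


# Hypothetical centered speeds at four time points (days).
times = [0.0, 1.0, 2.0, 3.0]
speeds = [1.0, 2.0, 0.5, -1.0]
mean, variance, sigma2_hat = gp_predict(times, speeds, 1.5, gamma=1.0, eta=0.01)
print(f"predictive mean {mean:.3f}, predictive variance {variance:.3f}")
```

The two linear solves are the $\mathcal{O}(n^{3})$ steps in this dense formulation.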

Directly computing the likelihood function or predictive distribution requires inverting an $n\times n$ covariance matrix, which has computational complexity $\mathcal{O}(n^{3})$. Here $n$ can be on the order of $10^{6}$, which prohibits direct computation with GP models. Fortunately, the Matérn covariance with half-integer parameters can be written as a set of stochastic differential equations and the solution follows a continuous-time state space model [24]. For instance, the GP with a Matérn covariance with roughness 5/2 in (24) can be written as the state space model below [22]:

\begin{split}y(t_{i})&=\mathbf{F}\bm{\theta}(t_{i})+\epsilon_{i},\\ \bm{\theta}(t_{i})&=\mathbf{G}(t_{i})\bm{\theta}(t_{i-1})+\mathbf{w}(t_{i}),\end{split}

(27)

where $\mathbf{F}=(1,0,0)$, $\mathbf{w}(t_{i})\sim\mathcal{MN}(\mathbf{0},\mathbf{W}(t_{i}))$ for $i=2,...,n$, and the initial state follows $\bm{\theta}(t_{1})\sim\mathcal{MN}(\mathbf{0},\mathbf{W}(t_{1}))$. The closed-form expressions of $\mathbf{G}(t_{i})$ and $\mathbf{W}(t_{i})$ in (27) can be found in Appendix A in [20].

With the state space representation, we can compute the likelihood function and predictive distribution by Kalman filter [25] and RTS smoother [34]. This algorithm is commonly known as the forward filtering and backward smoothing (FFBS) algorithm [36, 32]. The details for computing the likelihood and predictive distribution for state space representation of GP with the covariance in (24) are provided in Lemma 2 in [20]. Computing the likelihood in (25) and predictive distribution in (26) using the FFBS algorithm reduces the computational operations from $\mathcal{O}(n^{3})$ to $\mathcal{O}(n)$ operations without approximation. The computational advance enables us to estimate the nonlinear temporal trend of internet speed with a massive number of crowdsourced observations.
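To illustrate how the filtering recursion evaluates the likelihood in a single pass, the sketch below uses the simpler Matérn covariance with roughness 1/2 (the exponential kernel) instead of the roughness 5/2 used in our analysis. This substitution is made only because the roughness-1/2 process has a scalar state with transition $\rho_{i}=\exp(-(t_{i}-t_{i-1})/\gamma)$ and innovation variance $\sigma^{2}(1-\rho_{i}^{2})$, whereas the matrices $\mathbf{G}$ and $\mathbf{W}$ for roughness 5/2 are given in [20] and are not reproduced here. The function name and data are hypothetical.

```python
import math


def kalman_loglik(times, y, sigma2, gamma, eta):
    """Exact GP log-likelihood in O(n) for the exponential kernel.

    State space form (scalar state theta):
        y(t_i)     = theta(t_i) + eps_i,          eps_i ~ N(0, sigma2 * eta)
        theta(t_i) = rho_i * theta(t_{i-1}) + w_i, w_i ~ N(0, sigma2 * (1 - rho_i^2))
    with rho_i = exp(-(t_i - t_{i-1}) / gamma) and theta(t_1) ~ N(0, sigma2).
    `times` must be sorted in increasing order.
    """
    m, c = 0.0, sigma2          # mean and variance of the state before seeing y(t_1)
    loglik = 0.0
    prev_t = None
    for t, obs in zip(times, y):
        if prev_t is not None:
            rho = math.exp(-(t - prev_t) / gamma)
            m = rho * m                                     # predict the state mean
            c = rho * rho * c + sigma2 * (1.0 - rho * rho)  # predict the state variance
        q = c + sigma2 * eta                                # one-step-ahead variance of y
        resid = obs - m
        loglik += -0.5 * (math.log(2.0 * math.pi * q) + resid * resid / q)
        gain = c / q                                        # Kalman gain
        m = m + gain * resid                                # filtered mean
        c = c * (1.0 - gain)                                # filtered variance
        prev_t = t
    return loglik


# Hypothetical centered speeds at irregularly spaced days.
times = [0.0, 0.5, 1.5, 2.0, 4.0]
speeds = [0.8, 0.3, -0.4, -0.1, 0.9]
print(kalman_loglik(times, speeds, sigma2=1.0, gamma=1.0, eta=0.1))
```

Each observation is processed once with a constant amount of work, which is the source of the linear complexity; for small $n$ the result can be checked against a dense evaluation of the multivariate normal log-density with covariance $\sigma^{2}(\mathbf{R}+\eta\mathbf{I}_{n})$.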

5.3 Temporal progression of measured internet speeds

We plot the estimated temporal progression of the download speed based on both linear and state space models in Fig. 7. First, both models show that download speeds measured by Speedtest improve over time for both cities and device types. However, the improvements of the download speed do not appear to be homogeneous over time. We find that a comparatively large improvement occurs at the beginning of 2021 for iOS devices in both cities, shown in parts (a) and (c) in Fig. 7. Speed improvement can also be found for Android devices, as shown in parts (b) and (d); however, the variation of the estimation for Android devices seems larger, as the sample size from Android devices is much smaller than that from iOS devices, particularly for City A, as shown in Table 1. The increase may be partly due to the acceleration of the deployment and marketing of fiber internet since 2020 [2], as we find that a high proportion of the Speedtest measurements have substantially faster speeds than others since late 2020, and the proportion grows over time.

Second, the estimation from re-sampled data suggests that the trend from iOS devices tends to be overestimated (parts (a) and (c) in Fig. 7). This overestimation is larger in City B, particularly during 2021, as both linear and state space models show a noticeable difference in fit between the original and re-sampled data. We suspect that the overestimation is due to more Speedtests from high-speed internet plans, such as fiber internet plans, as subscribers of these plans may tend to submit more Speedtests to validate the speed from these high-speed internet plans. Re-sampling among census block groups can address a part of this bias, as it samples more from regions with relatively smaller numbers of tests compared to their population, which may have lower-speed internet plans. The temporal trends of internet download speed for the two other cities are plotted in Fig. 16 in Appendix B.3. The estimation by both the original and bias-corrected samples shows the improvement of download speeds over time, and the difference between the two estimates is not large.

Table 4: Linear trends of measured internet download speed (Mbps) per day.

City	Device type	Estimates $\hat{\beta}_{1t}$ (95% CI)	Estimates $\hat{\beta}_{1t}^{r}$ (95% CI)
A	iOS	0.1149 (0.1088, 0.1209)	0.0928 (0.0857, 0.0999)
A	Android	0.1879 (0.1701, 0.2058)	0.1788 (0.1546, 0.2029)
B	iOS	0.1661 (0.1567, 0.1755)	0.1225 (0.1110, 0.134)
B	Android	0.1309 (0.1200, 0.1417)	0.1370 (0.1232, 0.1509)

Finally, we compare the estimates of linear coefficients $\beta_{1t}$ and $\beta_{1t}^{r}$ for both iOS and Android devices from Cities A and B in Table 4. For both device types, the estimates are positive, which suggests that the download speed increases over time. The estimates of $\beta_{1t}^{r}$ from the bias-corrected samples are smaller than those of $\beta_{1t}$ in the first three rows, indicating that the improvement of internet speed may be slightly overestimated by the original data for these two cities. The estimates of $\beta_{1t}^{r}$ are similar to $\beta_{1t}$ for Android device download speeds in City B, whereas the intercept $\beta_{0t}^{r}$ is smaller than $\beta_{0t}$, as shown in part (d) of Fig. 7. The results from Fig. 7 and Table 4 consistently suggest that measured internet speed improves over the time range.

6 Conclusion

In this paper, we integrated Ookla Speedtest measurements with regional demographic profiles for analyzing disparities of measured internet quality and the temporal evolution of internet speed. We developed re-weighting and re-sampling methods to correct the large regional sampling bias across census block groups. Through regression analysis of integrated data, we found that census block groups with higher income, younger population, and fewer Hispanic residents tend towards higher measured internet speeds. Furthermore, we discerned an encouraging trend of internet speed improvement through temporal modeling of Speedtest measurements. Nevertheless, it is essential to approach these findings with caution, as they are susceptible to different biases inherent in Ookla Speedtest measurements. We anticipate that our new methods can be applied to different crowdsourced data and we outline a few directions for further study.

First, while our current investigation primarily concentrates on urban areas, the realm of speed test measurements in rural and sparsely populated subareas within cities remains largely unexplored. Consequently, a comprehensive analysis of internet performance profiles between urban and rural areas presents an intriguing prospect. The principal challenge in this pursuit is the scarcity or absence of crowdsourced samples in rural areas, rendering accurate statistical inference on these regions difficult. However, it is likely that spatial proximity and the similarity of demographic profiles can be used to infer internet speed through interpolation.

Second, our study naturally stimulates further inquiry into internet speed across other cities or states. Given the expansive coverage of crowdsourced measurements across the United States, researchers can leverage data on a larger scale to generalize internet speed characteristics throughout the country. To accommodate a more diverse range of states and cities, one plausible modeling approach involves incorporating mixed effects into the model, accounting for spatial variations. The nature of these mixed effects may vary depending on the hierarchical structure amongst states, counties, and census block groups. Identifying a suitable metric to define correlations between regions may be challenging, but demographic similarity stands as a viable option to gauge correlated structures. This methodology enables the construction of a more comprehensive model alongside city- and state-specific explanatory terms.

Third, our analysis of the temporal progression of internet speed motivates future studies in this direction. Our study employing state space models reveals a notable degree of volatility in the time series of internet speed, suggesting that the distribution of internet speed comprises multiple heterogeneous groups over time. An essential latent factor in this context is the internet subscription plan. To address this, one may consider employing a mixture of Gaussian processes or state space models to analyze the temporal evolution of internet speed while accounting for different subscription plans.

Lastly, it is essential to consider potential sources of bias other than the sampling bias among census block groups. While our study has identified an association between regional sampling bias and demographic disparities, further investigation is needed because demographic profiles corresponding to individual speed tests are not available. Additional data for calibrating the model can be obtained by anonymous surveys to address this limitation. The collection of paired data encompassing internet speed measurements and demographic data of the speed test taker can be used for regression analysis to understand whether other forms of bias affect the estimated relationship between internet quality and demographic features.

References

[1] Broadband Speed: FCC Map Vs. Experience on the Ground. https://dailyyonder.com/broadband-speed-fcc-map-vs-experience-ground/2018/07/25/, 2018.
[2] Fiber trends: What 2021 promises for the broadband industry. https://www.bbcmag.com/multifamily-broadband/fiber-trends-what-2021-promises-for-the-broadband-industry, 2021.
[3] Broadband availability and access. https://www.rural.pa.gov/publications/broadband, 2022.
[4] Calspeed a home broadband study. https://www.calspeed.net/index.html, 2022.
[5] FAST. https://fast.com/, 2022.
[6] Grow north encourages people to take internet speed test to help improve broadband infrastructure in the region. https://www.wxpr.org/business-economics/2022-03-29/grow-north-encourages-people-to-take-internet-speed-test-to-help-improve-broadband-infrastructure-in-the-region, 2022.
[7] Ingham county asks residents and businesses to participate in survey on broadband internet access and speed. https://www.prnewswire.com/news-releases/ingham-county-asks-residents-and-businesses-to-participate-in-survey-on-broadband-internet-access-and-speed-301518936.html, 2022.
[8] Key facts about the quality of the 2020 census. https://www.pewresearch.org/short-reads/2022/06/08/key-facts-about-the-quality-of-the-2020-census/, 2022.
[9] MLab: Test Your Speed. https://speed.measurementlab.net/#/, 2022.
[10] New broadband maps are finally here. https://www.fcc.gov/news-events/notes/2022/11/18/new-broadband-maps-are-finally-here, 2022.
[11] Speedtest. https://www.speedtest.net/, 2022.
[12] Speedtest by Ookla Global Fixed and Mobile Network Performance Map Tiles. https://github.com/teamookla/ookla-open-data, 2022.
[13] The Speedtest Server Network^™. https://www.ookla.com/network, 2022.
[14] Welcome to speedsurvey: presented by the state of alabama. https://www.google.com/search?q=ctc+alabama+speed+test&oq=ctc+alabama+speed+test&aqs=chrome..69i57j33i160.6405j0j4&sourceid=chrome&ie=UTF-8, 2022.
[15] Xfinity Speed Test. https://speedtest.xfinity.com/, 2022.
[16] About ookla speed test. https://www.speedtest.net/about, 2023.
[17] Akaike, H. Information theory and an extension of the maximum likelihood principle. In Selected papers of hirotugu akaike. Springer, 1998, pp. 199–213.
[18] Brindisi, A. Report on the State of Broadband Access in New York’s 22nd Congressional District, 2020.
[19] Casella, G., and Berger, R. L. Statistical Inference, 2nd ed. Duxbury Press, 2002.
[20] Gu, M., Liu, X., Fang, X., and Tang, S. Scalable marginalization of correlated latent variables with applications to learning particle interaction kernels. Accepted in New England Journal of Statistics in Data Science, arXiv preprint arXiv:2203.08389 (2022).
[21] Gu, M., Wang, X., and Berger, J. O. Robust Gaussian stochastic process emulation. Annals of Statistics 46, 6A (2018), 3038–3066.
[22] Gu, M., and Xu, Y. Fast nonseparable Gaussian stochastic process with application to methylation level interpolation. Journal of Computational and Graphical Statistics 29, 2 (2020), 250–260.
[23] Handcock, M. S., and Stein, M. L. A Bayesian analysis of kriging. Technometrics 35, 4 (1993), 403–410.
[24] Hartikainen, J., and Sarkka, S. Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In Machine Learning for Signal Processing (MLSP), 2010 IEEE International Workshop on (2010), IEEE, pp. 379–384.
[25] Kalman, R. E. A new approach to linear filtering and prediction problems. Journal of basic Engineering 82, 1 (1960), 35–45.
[26] MacMillan, K., Mangla, T., Saxon, J., Marwell, N. P., and Feamster, N. A Comparative Analysis of Ookla Speedtest and Measurement Labs Network Diagnostic Test (NDT7). Proceedings of the ACM on Measurement and Analysis of Computing Systems 7, 1 (2023), 1–26.
[27] Major, D., Teixeira, R., and Mayer, J. No WAN’s land: Mapping U.S. broadband coverage with millions of address queries to ISPs. In Proceedings of the 2020 Internet Measurement Conference (IMC ’20) (2020).
[28] Mela, C. F., and Kopalle, P. K. The impact of collinearity on regression analysis: the asymmetric effect of negative and positive correlations. Applied Economics 34, 6 (2002), 667–677.
[29] Meng, X.-L. Statistical paradises and paradoxes in big data (i): Law of large populations, big data paradox, and the 2016 us presidential election. The Annals of Applied Statistics 12, 2 (2018), 685–726.
[30] Paul, U., Gunasekaran, V., Liu, J., Narechania, T. N., Gupta, A., and Belding, E. Decoding the Divide: Analyzing Disparities in Broadband Plans Offered by Major US ISPs. ACM SIGCOMM 2023 Conference (2023).
[31] Paul, U., Liu, J., Gu, M., Gupta, A., and Belding, E. The importance of contextualization of crowdsourced active speed test measurements. In Proceedings of the 22nd ACM Internet Measurement Conference (New York, NY, USA, 2022), IMC ’22, Association for Computing Machinery, p. 274–289.
[32] Petris, G., Petrone, S., and Campagnoli, P. Dynamic linear models. In Dynamic linear models with R. Springer, 2009.
[33] Rasmussen, C. E. Gaussian processes for machine learning. MIT Press, 2006.
[34] Rauch, H. E., Tung, F., and Striebel, C. T. Maximum likelihood estimates of linear dynamic systems. AIAA journal 3, 8 (1965), 1445–1450.
[35] Saxon, J., and Black, D. A. What we can learn from selected, unmatched data: Measuring internet inequality in Chicago. Computers, Environment and Urban Systems 98 (2022), 101874.
[36] West, M., and Harrison, P. J. Bayesian Forecasting & Dynamic Models, 2nd ed. Springer Verlag, 1997.

Appendix A Proofs and derivations

A.1 Proof of Lemma 1

Proof.

Consider a fixed $x\in\mathbb{R}$. Note that the indicator function $I(y_{i}\leq x)$ follows a Bernoulli distribution with success probability $\mathbb{P}(y_{i}\leq x)$, for $i=1,\cdots,k$. Based on the law of total expectation, we have:

	$\displaystyle\mathbb{P}(y\leq x)$	$\displaystyle=\mathbb{E}\left[I(y\leq x)\right]$
		$\displaystyle=\mathbb{E}\left[\mathbb{E}\left[I(y\leq x)\mid\mathbf{z}\right]\right]$
		$\displaystyle=\sum_{i=1}^{k}\mathbb{P}(z_{i}=1)\mathbb{E}\left[I(y\leq x)\mid z_{i}=1\right]$
		$\displaystyle=\sum_{i=1}^{k}\mathbb{P}(z_{i}=1)\mathbb{E}\left[I(y_{i}\leq x)\right]$
		$\displaystyle=\sum_{i=1}^{k}\frac{N_{i}}{N}\mathbb{E}\left[I(y_{i}\leq x)\right]$
		$\displaystyle=\sum_{i=1}^{k}\frac{N_{i}}{N}\mathbb{P}(y_{i}\leq x)$
		$\displaystyle=\sum_{i=1}^{k}\frac{N_{i}}{N}F_{i}(x),$

where $\mathbf{z}$ follows the multinomial distribution in Equation (2). ∎

A.2 Proof of unbiased estimator in Equation (7)

Proof.

Consider a fixed $x\in\mathbb{R}$. For any census block group $i$, the indicators $I(y_{ij}\leq x)$, $j=1,\cdots,n_{i}$, are independent Bernoulli random variables with success probability $F_{i}(x)=\mathbb{P}(y_{i}\leq x)$, so $\hat{F}_{i}(x)$ is unbiased for $F_{i}(x)$, i.e.

	$\displaystyle\mathbb{E}\left[\hat{F}_{i}(x)\right]$	$\displaystyle=\mathbb{E}\left[\frac{1}{n_{i}}\sum_{j=1}^{n_{i}}I\left(y_{ij}\leq x\right)\right]$
		$\displaystyle=\frac{1}{n_{i}}\sum_{j=1}^{n_{i}}\mathbb{E}\left[I\left(y_{ij}\leq x\right)\right]$
		$\displaystyle=\frac{1}{n_{i}}\sum_{j=1}^{n_{i}}F_{i}(x)$
		$\displaystyle=\frac{1}{n_{i}}n_{i}F_{i}(x)=F_{i}(x).$

Based on the linearity of expectation operator $\mathbb{E}[\cdot]$ , we then have:

\mathbb{E}\left[\hat{F}_{u}(x)\right]=\sum_{i=1}^{k}\frac{N_{i}}{N}\mathbb{E}\left[\hat{F}_{i}(x)\right]=\sum_{i=1}^{k}\frac{N_{i}}{N}F_{i}(x)=F(x).

∎
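As an illustration of the re-weighted estimator in Equation (7), the following minimal Python sketch contrasts the pooled empirical CDF with the population-weighted one on toy data. The function names, the toy speeds, and the populations are assumptions made for illustration only; they are not taken from the study's data or code.

```python
def ecdf(samples, x):
    """Empirical CDF of one region: fraction of samples <= x."""
    return sum(1 for y in samples if y <= x) / len(samples)


def pooled_ecdf(samples_by_region, x):
    """Simple empirical CDF that pools all samples (weights n_i / n)."""
    pooled = [y for samples in samples_by_region for y in samples]
    return ecdf(pooled, x)


def reweighted_ecdf(samples_by_region, populations, x):
    """Population-weighted CDF estimate: sum_i (N_i / N) * F_hat_i(x)."""
    total = sum(populations)
    return sum(
        N_i / total * ecdf(samples, x)
        for samples, N_i in zip(samples_by_region, populations)
    )


# Hypothetical toy data: region 1 is over-sampled relative to its population.
samples_by_region = [[10, 20, 30, 40], [100, 200]]  # speeds in Mbps
populations = [1000, 3000]                          # N_1, N_2

# 4 of the 6 pooled samples are <= 50.
print(pooled_ecdf(samples_by_region, 50))
# Weights 0.25 and 0.75 applied to regional ECDFs 1.0 and 0.0.
print(reweighted_ecdf(samples_by_region, populations, 50))
```

In this toy case the pooled estimate is dominated by the over-sampled region, whereas the re-weighted estimate gives each region a weight proportional to its population.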

A.3 Proof of Lemma 2

Proof.

By the weak law of large numbers, we have $\hat{F}_{i}(x)\xrightarrow[]{\mathbb{P}}F_{i}(x)$ and $\hat{F}^{*}_{i}(x)\xrightarrow[]{\mathbb{P}}F_{i}(x)$ for any real number $x$ , when $n_{i}\rightarrow\infty$ for $i=1,\cdots,k$ . Since $k$ is a finite number, it follows from Slutsky’s Theorem that:

\hat{F}_{u}(x)=\sum_{i=1}^{k}\frac{N_{i}}{N}\hat{F}_{i}(x)\xrightarrow[]{\mathbb{P}}\sum_{i=1}^{k}\frac{N_{i}}{N}F_{i}(x)=F(x),

when $n_{i}\rightarrow\infty$ for each $i$. For the re-sampled estimator, whose regional weights are $n_{i}^{*}/n^{*}$, it therefore suffices to show that:

\lim_{n\rightarrow\infty}\frac{n_{i}^{*}}{n^{*}}=\frac{N_{i}}{N}.

(28)

By the definition of $n_{i}^{*}$ in (8), we have:

	$\displaystyle\frac{nN_{i}}{N}-\frac{1}{2}$	$\displaystyle\leq n_{i}^{*}=\left[\frac{nN_{i}}{N}\right]\leq\frac{nN_{i}}{N}+\frac{1}{2},\text{ so}$
	$\displaystyle n-\frac{1}{2}k$	$\displaystyle\leq n^{*}=\sum_{i=1}^{k}\left[\frac{nN_{i}}{N}\right]\leq n+\frac{1}{2}k.$		(29)

Then, from (29), we obtain:

\frac{nN_{i}/N-1/2}{n+k/2}\leq\frac{n_{i}^{*}}{n^{*}}\leq\frac{nN_{i}/N+1/2}{n-k/2}.

(30)

Letting $n\rightarrow\infty$ , we apply the Squeeze Theorem to the inequality in (30) to yield the result of (28). ∎
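The limit in (28) can be checked numerically. The sketch below assumes, as in (29), that $[\cdot]$ denotes rounding to the nearest integer; the populations and the helper names are hypothetical and serve only to illustrate how the gap between $n_{i}^{*}/n^{*}$ and $N_{i}/N$ shrinks as $n$ grows.

```python
import math


def resample_sizes(n, populations):
    """Target sample size per region: n * N_i / N rounded to the nearest integer."""
    total = sum(populations)
    return [math.floor(n * N_i / total + 0.5) for N_i in populations]


def max_weight_gap(n, populations):
    """Largest |n_i* / n* - N_i / N| over all regions."""
    total = sum(populations)
    sizes = resample_sizes(n, populations)
    n_star = sum(sizes)
    return max(
        abs(size / n_star - N_i / total)
        for size, N_i in zip(sizes, populations)
    )


# Hypothetical populations N_i of k = 4 census block groups.
populations = [1237, 851, 2311, 653]

for n in (10, 1_000, 100_000):
    print(n, resample_sizes(n, populations), max_weight_gap(n, populations))
```

Because each $n_{i}^{*}$ is within $1/2$ of $nN_{i}/N$ and $n^{*}$ is within $k/2$ of $n$, the printed gap is bounded by a quantity of order $1/n$, in line with the squeeze argument above.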

A.4 Derivation of confidence intervals for the methods in Section 3.2

We derive confidence intervals for the different estimators of the CDF. Throughout, $x\in\mathbb{R}$ denotes an arbitrary fixed input.

A.4.1 Empirical CDF from original data

The empirical CDF for $F_{i}(x)$ is given by:

\hat{F}_{i}(x)=\frac{1}{n_{i}}\sum_{j=1}^{n_{i}}I(y_{ij}\leq x).

(31)

For a large number of samples in each region (with sufficiently large $n_{i}$ for every $i$ ), we apply the Central Limit Theorem under Assumptions 1 and 2. Then,

\hat{F}_{i}(x)\sim^{\text{approx}}\mathcal{N}\left(F_{i}(x),\frac{F_{i}(x)(1-F_{i}(x))}{n_{i}}\right),

(32)

for $i=1,\cdots,k$. Under Assumption 1, and because the indicators $I(y_{ij}\leq x)$ are independent Bernoulli random variables with success probability $F_{i}(x)$ for all $i,j$, the variance of $\hat{F}(x)$ is:

$\displaystyle\text{Var}(\hat{F}(x))$	$\displaystyle=\text{Var}\left(\sum_{i=1}^{k}\frac{n_{i}}{n}\hat{F}_{i}(x)\right)$	(33)
	$\displaystyle=\sum_{i=1}^{k}\text{Var}\left(\frac{n_{i}}{n}\hat{F}_{i}(x)\right)$
	$\displaystyle=\sum_{i=1}^{k}\frac{n_{i}^{2}}{n^{2}}\text{Var}\left(\hat{F}_{i}(x)\right)$
	$\displaystyle=\sum_{i=1}^{k}\frac{n_{i}^{2}}{n^{2}}\frac{F_{i}(x)(1-F_{i}(x))}{n_{i}}$
	$\displaystyle=\frac{1}{n^{2}}\sum_{i=1}^{k}n_{i}F_{i}(x)(1-F_{i}(x)).$

Based on the asymptotic normality in (32) and independence based on Assumption 1, we have:

\hat{F}(x)\sim^{\text{approx}}\mathcal{N}\left(\sum_{i=1}^{k}\frac{n_{i}}{n}F_{i}(x),\frac{1}{n^{2}}\sum_{i=1}^{k}n_{i}F_{i}(x)(1-F_{i}(x))\right),

(34)

where its asymptotic variance is obtained by (33). Note that $\hat{F}_{i}(x)$ converges in probability to $F_{i}(x)$ , i.e. $\hat{F}_{i}(x)\xrightarrow{\mathbb{P}}F_{i}(x)$ , under Assumptions 1 and 2 for $i=1,\cdots,k$ . By Slutsky’s theorem, we have the following expressions:

\hat{F}(x)\sim^{\text{approx}}\mathcal{N}\left(\sum_{i=1}^{k}\frac{n_{i}}{n}F_{i}(x),\frac{1}{n^{2}}\sum_{i=1}^{k}n_{i}\hat{F}_{i}(x)(1-\hat{F}_{i}(x))\right).

Therefore, the 95% confidence interval based on the simple empirical CDF is:

\hat{F}(x)\pm 1.96\frac{\sqrt{\sum_{i=1}^{k}n_{i}\hat{F}_{i}(x)(1-\hat{F}_{i}(x))}}{n}.

A.4.2 Re-weighted empirical CDF from original data

Under Assumption 1, and because the indicators $I(y_{ij}\leq x)$ are independent Bernoulli random variables with success probability $F_{i}(x)$ for all $i,j$, the variance of (7) is:

$\displaystyle\text{Var}\left(\hat{F}_{u}(x)\right)$	$\displaystyle=\text{Var}\left(\sum_{i=1}^{k}\frac{N_{i}}{N}\hat{F}_{i}(x)\right)$	(35)
	$\displaystyle=\sum_{i=1}^{k}\text{Var}\left(\frac{N_{i}}{N}\hat{F}_{i}(x)\right)$
	$\displaystyle=\sum_{i=1}^{k}\frac{N_{i}^{2}}{N^{2}}\text{Var}\left(\hat{F}_{i}(x)\right)$
	$\displaystyle=\sum_{i=1}^{k}\frac{N_{i}^{2}}{N^{2}}\frac{F_{i}(x)(1-F_{i}(x))}{n_{i}}.$

Since the empirical CDFs from (31) are asymptotically normal, as shown in (32), and independent across regions (Assumption 1), the re-weighted sum of these CDFs is asymptotically normal, i.e.

\hat{F}_{u}(x)\sim^{\text{approx}}\mathcal{N}\left(\sum_{i=1}^{k}\frac{N_{i}}{N}F_{i}(x),\frac{1}{N^{2}}\sum_{i=1}^{k}\frac{N_{i}^{2}}{n_{i}}F_{i}(x)(1-F_{i}(x))\right),

(36)

where the asymptotic variance is obtained by (35).

Since $\hat{F}_{i}(x)$ converges in probability to $F_{i}(x)$, i.e. $\hat{F}_{i}(x)\xrightarrow{\mathbb{P}}F_{i}(x)$, under Assumptions 1 and 2 for $i=1,\cdots,k$, we apply Slutsky’s theorem to (36) to obtain:

\hat{F}_{u}(x)\sim^{\text{approx}}\mathcal{N}\left(\sum_{i=1}^{k}\frac{N_{i}}{N}F_{i}(x),\frac{1}{N^{2}}\sum_{i=1}^{k}\frac{N_{i}^{2}}{n_{i}}\hat{F}_{i}(x)(1-\hat{F}_{i}(x))\right).

Then the 95% confidence interval for the CDF $F(x)$ is obtained by:

\hat{F}_{u}(x)\pm 1.96\frac{\sqrt{\sum_{i=1}^{k}\frac{N_{i}^{2}}{n_{i}}\hat{F}_{i}(x)(1-\hat{F}_{i}(x))}}{N}.
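The interval for the re-weighted estimator can be computed directly from regional samples. The following Python sketch is one possible implementation under Assumptions 1 and 2; the function names and the toy data are hypothetical, and the normal approximation is only meaningful when every $n_{i}$ is reasonably large.

```python
import math


def ecdf(samples, x):
    """Empirical CDF of one region: fraction of samples <= x."""
    return sum(1 for y in samples if y <= x) / len(samples)


def reweighted_cdf_ci(samples_by_region, populations, x, z=1.96):
    """Return (lower, estimate, upper) for the population-weighted CDF at x.

    The variance is sum_i (N_i / N)^2 * F_hat_i (1 - F_hat_i) / n_i,
    i.e. the plug-in version of the asymptotic variance in (36).
    """
    total = sum(populations)
    estimate = 0.0
    variance = 0.0
    for samples, N_i in zip(samples_by_region, populations):
        f_i = ecdf(samples, x)
        weight = N_i / total
        estimate += weight * f_i
        variance += weight ** 2 * f_i * (1.0 - f_i) / len(samples)
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate, estimate + half_width


# Hypothetical toy data: 100 samples in region 1, 40 samples in region 2.
samples_by_region = [list(range(1, 101)), list(range(31, 71))]
populations = [1000, 3000]

lower, estimate, upper = reweighted_cdf_ci(samples_by_region, populations, 50)
print(lower, estimate, upper)
```

The sketch does not clip the interval to $[0,1]$; near the tails of the distribution, where $\hat{F}_{i}(x)$ is close to 0 or 1, the normal approximation is less reliable.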

A.4.3 Simple empirical CDF from re-sampled data

Following the notation in Section 3.2.2 and the derivation in Appendix A.4.1, we obtain the 95% confidence interval for $\hat{F}^{*}$ as:

\hat{F}^{*}(x)\pm 1.96\frac{\sqrt{\sum_{i=1}^{k}n_{i}^{*}\hat{F}_{i}^{*}(x)(1-\hat{F}^{*}_{i}(x))}}{n^{*}}.

A.5 Proof of Lemma 3

Proof.

From models (12) and (14), we have $y_{ij}\sim^{\text{indep.}}\mathcal{N}(\mathbf{x}_{i}^{T}\boldsymbol{\beta},\sigma^{2})$ and $\bar{y}_{i}\sim^{\text{indep.}}\mathcal{N}(\mathbf{x}_{i}^{T}\boldsymbol{\beta},\sigma^{2}/n_{i})$ for $j=1,\cdots,n_{i}$ and $i=1,\cdots,k$. First, note that:

\bar{l}(\boldsymbol{\beta},\sigma^{2})=-\frac{k}{2}\log(2\pi\sigma^{2})+\frac{1}{2}\sum_{i=1}^{k}\log n_{i}-\frac{1}{2\sigma^{2}}\sum_{i=1}^{k}n_{i}(\bar{y}_{i}-\mathbf{x}_{i}^{T}\boldsymbol{\beta})^{2}.

(37)

On the other hand, we also have:

\begin{split}&l(\boldsymbol{\beta},\sigma^{2})=-\frac{n}{2}\log(2\pi\sigma^{2})-\frac{1}{2\sigma^{2}}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}(y_{ij}-\mathbf{x}_{i}^{T}\boldsymbol{\beta})^{2}\\ &=-\frac{n}{2}\log(2\pi\sigma^{2})-\frac{1}{2\sigma^{2}}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}(y_{ij}-\bar{y}_{i}+\bar{y}_{i}-\mathbf{x}_{i}^{T}\boldsymbol{\beta})^{2}\\ &=-\frac{n}{2}\log(2\pi\sigma^{2})-\frac{1}{2\sigma^{2}}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}(y_{ij}-\bar{y}_{i})^{2}-\frac{1}{2\sigma^{2}}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}(\bar{y}_{i}-\mathbf{x}_{i}^{T}\boldsymbol{\beta})^{2}-\frac{1}{\sigma^{2}}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}(y_{ij}-\bar{y}_{i})(\bar{y}_{i}-\mathbf{x}_{i}^{T}\boldsymbol{\beta})\\ &=-\frac{n}{2}\log(2\pi\sigma^{2})-\frac{1}{2\sigma^{2}}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}(y_{ij}-\bar{y}_{i})^{2}-\frac{1}{2\sigma^{2}}\sum_{i=1}^{k}n_{i}(\bar{y}_{i}-\mathbf{x}_{i}^{T}\boldsymbol{\beta})^{2}-\frac{1}{\sigma^{2}}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}(y_{ij}-\bar{y}_{i})(\bar{y}_{i}-\mathbf{x}_{i}^{T}\boldsymbol{\beta})\\ &=-\frac{n}{2}\log(2\pi\sigma^{2})-\frac{1}{2\sigma^{2}}\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}(y_{ij}-\bar{y}_{i})^{2}-\frac{1}{2\sigma^{2}}\sum_{i=1}^{k}n_{i}(\bar{y}_{i}-\mathbf{x}_{i}^{T}\boldsymbol{\beta})^{2},\end{split}

(38)

as $\sum_{j=1}^{n_{i}}(y_{ij}-\bar{y}_{i})(\bar{y}_{i}-\mathbf{x}_{i}^{T}\boldsymbol{\beta})=(\bar{y}_{i}-\mathbf{x}_{i}^{T}\boldsymbol{\beta})\sum_{j=1}^{n_{i}}(y_{ij}-\bar{y}_{i})=(\bar{y}_{i}-\mathbf{x}_{i}^{T}\boldsymbol{\beta})(n_{i}\bar{y}_{i}-n_{i}\bar{y}_{i})=0$ for each $i=1,\cdots,k$. Comparing with Equation (37), both log-likelihoods depend on $\boldsymbol{\beta}$ only through the term $-\frac{1}{2\sigma^{2}}\sum_{i=1}^{k}n_{i}(\bar{y}_{i}-\mathbf{x}_{i}^{T}\boldsymbol{\beta})^{2}$, and the results follow.

∎
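The decomposition in (38) can be verified numerically: the difference between the individual-level log-likelihood $l$ and the aggregated log-likelihood $\bar{l}$ should not change with $\boldsymbol{\beta}$. The sketch below is a simplified check that assumes a single scalar predictor without an intercept; the data values and function names are hypothetical.

```python
import math


def full_loglik(y, x, beta, sigma2):
    """Log-likelihood l(beta, sigma^2) from individual measurements y[i][j]."""
    n = sum(len(group) for group in y)
    rss = sum(
        (y_ij - x_i * beta) ** 2
        for group, x_i in zip(y, x)
        for y_ij in group
    )
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)


def aggregated_loglik(y, x, beta, sigma2):
    """Log-likelihood l_bar(beta, sigma^2) from regional means, as in (37)."""
    k = len(y)
    means = [sum(group) / len(group) for group in y]
    weighted_rss = sum(
        len(group) * (mean - x_i * beta) ** 2
        for group, mean, x_i in zip(y, means, x)
    )
    log_sizes = sum(math.log(len(group)) for group in y)
    return (
        -0.5 * k * math.log(2 * math.pi * sigma2)
        + 0.5 * log_sizes
        - weighted_rss / (2 * sigma2)
    )


def gap(y, x, beta, sigma2):
    """Difference l - l_bar; by (37) and (38) it is free of beta."""
    return full_loglik(y, x, beta, sigma2) - aggregated_loglik(y, x, beta, sigma2)


# Hypothetical toy data: k = 3 regions with unequal sample sizes.
y = [[1.0, 2.0, 3.0], [4.0, 6.0], [5.0, 7.0, 9.0, 11.0]]
x = [1.0, 2.0, 3.0]

print(gap(y, x, beta=0.5, sigma2=2.0))
print(gap(y, x, beta=2.0, sigma2=2.0))
```

The two printed values agree up to floating-point error, so both likelihoods are maximized by the same value of the coefficient for a given $\sigma^{2}$.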

A.6 Proof of Lemma 4

We first introduce the following lemma, which is used to derive the estimators of $\sigma^{2}$ in models (12) and (14) and to quantify their variability. For generality, let $\mathbf{x}_{i}=(1,x_{i1},x_{i2},\cdots,x_{ip})^{T}\in\mathbb{R}^{p+1}$ denote the predictor vector of the $i$-th region with $p$ predictors, for $i=1,\cdots,k$, where $k\geq 2$ and $p>2$. We assume that $n\gg k>p+1$, meaning that the total number of samples across the regions is much greater than the number of regions, and that the dimension of the feature space does not exceed the number of regions.

Lemma 5.

Let $\mathbf{y}$ be an $n$-dimensional random vector with $\mathbf{y}\sim\mathcal{N}(\boldsymbol{\mu},\mathbf{I}_{n})$, where $\boldsymbol{\mu}\in\mathbb{R}^{n}$ and $\mathbf{I}_{n}$ denotes the $n\times n$ identity matrix. If $\mathbf{M}\in\mathbb{R}^{n\times n}$ is an orthogonal projection matrix, then

\mathbf{y}^{T}\mathbf{M}\mathbf{y}\sim\chi^{2}(r(\mathbf{M}),\boldsymbol{\mu}^{T}\mathbf{M}\boldsymbol{\mu}/2),

where $r(\mathbf{A})$ denotes the rank of a square matrix $\mathbf{A}$ and $\chi^{2}(d,\gamma)$ refers to the noncentral $\chi^{2}$ distribution with $d$ degrees of freedom and noncentrality parameter $\gamma$. A noncentral chi-squared distribution $\chi^{2}(d,\gamma)$ is generated by the sum of squares of independent Gaussian random variables $z_{i}\sim\mathcal{N}(\mu_{i},1)$, $i=1,\cdots,d$, i.e. $\sum_{i=1}^{d}z_{i}^{2}$. Here, the noncentrality parameter $\gamma$ is defined by $\gamma=\sum_{i=1}^{d}\mu_{i}^{2}/2$.

Proof.

Let $r(\mathbf{M})=r$ and let $\mathbf{b}_{1},\cdots,\mathbf{b}_{r}\in\mathbb{R}^{n}$ be an orthonormal basis for the column space of $\mathbf{M}$, denoted $\mathcal{C}(\mathbf{M})$. Let $\mathbf{B}=[\mathbf{b}_{1},\cdots,\mathbf{b}_{r}]\in\mathbb{R}^{n\times r}$ so that $\mathbf{M}=\mathbf{B}\mathbf{B}^{T}$. We have $\mathbf{y}^{T}\mathbf{M}\mathbf{y}=\mathbf{y}^{T}\mathbf{B}\mathbf{B}^{T}\mathbf{y}=(\mathbf{B}^{T}\mathbf{y})^{T}(\mathbf{B}^{T}\mathbf{y})$, where $\mathbf{B}^{T}\mathbf{y}\sim\mathcal{N}(\mathbf{B}^{T}\boldsymbol{\mu},\mathbf{B}^{T}\mathbf{B})$. Since the columns of $\mathbf{B}$ are orthonormal, $\mathbf{B}^{T}\mathbf{B}=\mathbf{I}_{r}$. By the definition of the noncentral $\chi^{2}$ distribution, $(\mathbf{B}^{T}\mathbf{y})^{T}(\mathbf{B}^{T}\mathbf{y})\sim\chi^{2}(r,\boldsymbol{\mu}^{T}\mathbf{B}\mathbf{B}^{T}\boldsymbol{\mu}/2)$, where $\boldsymbol{\mu}^{T}\mathbf{B}\mathbf{B}^{T}\boldsymbol{\mu}=\boldsymbol{\mu}^{T}\mathbf{M}\boldsymbol{\mu}$. ∎

A.6.1 Unbiasedness of $\hat{\sigma}^{2}$

W.l.o.g., we consider only the case where $\mathbf{V}$ has full column rank, given that the regions have distinct features. Define $\mathbf{J}=\mathbf{V}(\mathbf{V}^{T}\mathbf{V})^{-1}\mathbf{V}^{T}$ as the orthogonal projection matrix onto $\mathcal{C}(\mathbf{V})$ with $r(\mathbf{J})=p+1$. It follows that $(\mathbf{I}_{n}-\mathbf{J})$ is the orthogonal projection matrix onto $\mathcal{C}(\mathbf{V})^{\perp}$ with $r(\mathbf{I}_{n}-\mathbf{J})=n-p-1$. By Lemma 5, we have

\mathbf{y}^{T}(\mathbf{I}_{n}-\mathbf{J})\mathbf{y}/\sigma^{2}\sim\chi^{2}(n-p-1),

where $\chi^{2}(n-p-1)$ is a chi-squared distribution with $n-p-1$ degrees of freedom and noncentrality parameter zero, because $(\mathbf{I}_{n}-\mathbf{J})\mathbf{V}=0$. Then, one can construct an unbiased estimator:

\hat{\sigma}^{2}=\frac{\mathbf{y}^{T}(\mathbf{I}_{n}-\mathbf{J})\mathbf{y}}{n-p-1},

(39)

for $\sigma^{2}$ as $\mathbb{E}\left[\mathbf{y}^{T}(\mathbf{I}_{n}-\mathbf{J})\mathbf{y}/\sigma^{2}\right]=n-p-1$.

A.6.2 Unbiasedness of $\hat{\sigma}^{2}_{\text{agg}}$

From the model (15), we have $\mathbf{W}\mathbf{y}\sim\mathcal{N}(\mathbf{W}\mathbf{X}\boldsymbol{\beta},\sigma^{2}\mathbf{I}_{k})$. Assume that $\mathbf{X}$ is of full column rank. Define $\mathbf{H}=\mathbf{W}\mathbf{X}(\mathbf{X}^{T}\mathbf{W}^{2}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{W}$ so that $\mathbf{H}$ is the orthogonal projection matrix onto $\mathcal{C}(\mathbf{WX})$ with $r(\mathbf{WX})=p+1$. Similarly, $(\mathbf{I}_{k}-\mathbf{H})$ is the orthogonal projection matrix onto $\mathcal{C}(\mathbf{WX})^{\perp}$ with $r(\mathbf{I}_{k}-\mathbf{H})=k-p-1$. By Lemma 5, it follows that

\mathbf{y}^{T}\mathbf{W}(\mathbf{I}_{k}-\mathbf{H})\mathbf{W}\mathbf{y}/\sigma^{2}\sim\chi^{2}(k-p-1),

as $(\mathbf{I}_{k}-\mathbf{H})\mathbf{W}\mathbf{X}=0$. Then, we can use the unbiased estimator

\hat{\sigma}^{2}_{\text{agg}}=\frac{\mathbf{y}^{T}\mathbf{W}(\mathbf{I}_{k}-\mathbf{H})\mathbf{W}\mathbf{y}}{k-p-1}

(40)

for $\sigma^{2}$, since $\mathbb{E}\left[\mathbf{y}^{T}\mathbf{W}(\mathbf{I}_{k}-\mathbf{H})\mathbf{W}\mathbf{y}/\sigma^{2}\right]=k-p-1$.

A.6.3 Quantification of efficiency

Note that if a random variable $W$ follows a $\chi^{2}$ distribution with $d$ degrees of freedom, then $\text{Var}(W)=2d$. As such, even though the estimators in (39) and (40) are both unbiased for $\sigma^{2}$, the estimator in (39) is more efficient than the one in (40) because $\text{Var}(\hat{\sigma}^{2})=2\sigma^{4}/(n-p-1)<2\sigma^{4}/(k-p-1)=\text{Var}\left(\hat{\sigma}^{2}_{\text{agg}}\right)$, and this difference is noticeable since the number of regions $k$ is much smaller than the total number of samples $n$.
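The efficiency gap can be illustrated by simulation. The sketch below makes a deliberate simplification that is not part of the paper's setting: it uses an intercept-only model ($p=0$) with equal regional sample sizes, so that the two estimators reduce to the pooled sample variance (divisor $n-1$) and the weighted between-region sum of squares (divisor $k-1$). All parameter values and the function name are hypothetical.

```python
import random
import statistics


def simulate(k=5, n_i=40, sigma=2.0, mu=10.0, reps=1000, seed=0):
    """Monte Carlo draws of the two variance estimators (intercept-only model)."""
    rng = random.Random(seed)
    n = k * n_i
    full, agg = [], []
    for _ in range(reps):
        groups = [[rng.gauss(mu, sigma) for _ in range(n_i)] for _ in range(k)]
        pooled = [y for group in groups for y in group]
        grand_mean = sum(pooled) / n
        # Individual-level estimator: residual sum of squares / (n - 1).
        full.append(sum((y - grand_mean) ** 2 for y in pooled) / (n - 1))
        # Aggregated estimator: weighted residuals of regional means / (k - 1).
        means = [sum(group) / n_i for group in groups]
        agg.append(sum(n_i * (m - grand_mean) ** 2 for m in means) / (k - 1))
    return full, agg


full, agg = simulate()
print("mean:", statistics.mean(full), statistics.mean(agg))
print("variance:", statistics.variance(full), statistics.variance(agg))
```

Both simulated means should be close to $\sigma^{2}=4$, while the variance of the aggregated estimator should be far larger, roughly in the ratio $(n-1)/(k-1)$ suggested by the formula above.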

Appendix B Other cities and device types

B.1 Regional sampling bias detection and correction

Here we provide additional numerical results of the regional sampling bias in two other cities, referred to as City C and City D. We also examine whether the regional sampling bias affects the estimation of internet speed and its association with regional demographic profiles, as discussed in Section 3. We compare the cumulative distribution of sample sizes with population in Fig. 8 for City C and D, and the difference seems to be more noticeable for City C.

Table 5: Results of the $\chi^{2}$ homogeneity test for City C and D.

City	# of regions	Device type	Test statistic $W$	p-value
C	525	iOS	780,884	$<10^{-16}$
C	525	Android	128,545	$<10^{-16}$
D	397	iOS	88,953	$<10^{-16}$
D	397	Android	142,934	$<10^{-16}$

This discrepancy is statistically evaluated by the chi-squared homogeneity test, as shown in Table 5. For each device type in either City C or D, we find that the proportion of sample sizes is significantly different from the proportion of the population. We further compare the cumulative distribution functions of internet speed in Fig. 9 based on the three empirical CDFs introduced in Section 3.2. Neither the re-weighing nor the re-sampling method shows a notable difference from the empirical CDF with regional sampling bias, which is consistent with the findings for City A and B.
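For intuition about the magnitude of the test statistics in Table 5, the sketch below computes a Pearson-type chi-squared statistic that compares the observed regional sample sizes $n_{i}$ with the sizes $nN_{i}/N$ expected under population-proportional sampling. This standard form is an assumption here; the definition in Section 3.1 is authoritative. The counts, populations, and function name are hypothetical.

```python
def homogeneity_statistic(sample_counts, populations):
    """Pearson chi-squared statistic and degrees of freedom.

    Compares observed counts n_i with expected counts n * N_i / N.
    """
    n = sum(sample_counts)
    total = sum(populations)
    expected = [n * N_i / total for N_i in populations]
    statistic = sum(
        (observed - exp) ** 2 / exp
        for observed, exp in zip(sample_counts, expected)
    )
    return statistic, len(sample_counts) - 1


# Hypothetical toy example with k = 3 regions.
sample_counts = [50, 30, 20]      # observed n_i
populations = [1000, 1000, 2000]  # N_i

statistic, dof = homogeneity_statistic(sample_counts, populations)
# Expected counts are 25, 25, 50, giving 25 + 1 + 18 = 44 with 2 degrees of freedom.
print(statistic, dof)
```

A p-value would then follow from the upper tail of a $\chi^{2}$ distribution with $k-1$ degrees of freedom, for which a statistics library can be used.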

The demographic disparity between the over- and under-sampled census block groups of City C and City D for iOS devices is visualized in Fig. 10. We find that, in both City C and D, the over-sampled regions tend to have higher income, greater age, a larger proportion of the population with a bachelor’s degree or higher, a greater percentage of households with internet subscription plans, and higher representation of white and Asian residents, along with lower representation of black and Hispanic residents. The differences in demographic profiles between over-sampled and under-sampled regions generally agree across all four cities.

The comparison for Android devices in City C and D is given in Fig. 11. We find that the over-sampled census block groups tend to have demographic characteristics similar to those found for iOS devices in each city. The difference in demographic profiles between over-sampled and under-sampled regions tends to be smaller for Android devices than for iOS devices.

B.2 Correlating internet quality with demographic variables

We applied the regression models (12) and (14) in Section 4 for both iOS and Android devices in City C and D with backward variable selection based on AIC. The estimated coefficients from the selected variables and $95\%$ confidence intervals of the estimation are shown in Fig. 12. First, income is positively correlated with internet speed for both cities and device types, whereas the coefficients of income from the re-sampled data have a smaller impact on the internet speed. Second, the coefficients of age are negative from both the original and re-sampled data in City C and D, suggesting that regions with a younger population tend to have faster internet. Third, we observe in part (a) of Fig. 14 that the proportion of Hispanic residents is negatively correlated with the percentage of residents with a bachelor’s degree or higher, which can partially explain the increased coefficient of Hispanic after re-sampling along with the coefficient of bachelor’s degree, particularly for iOS devices. Furthermore, the internet penetration rate has a positive correlation with income and the proportion of individuals with bachelor’s degrees or higher, as shown in Fig. 14 and Fig. 15. Such correlation can potentially explain the coefficients of the internet penetration rates for iOS devices for City C and City D in the re-sampled data, as the positive coefficients of bachelor’s degree and income offset the impact of internet penetration rates.

The results from the impact of demographic profiles in Cities C and D are generally consistent with Cities A and B. Census block groups with higher income and younger population tend to have higher internet speed. Such effects can be offset by other positively correlated variables, such as internet penetration rates.

B.3 Temporal progression of internet speed

We analyze the temporal progression of internet speed for Cities C and D based on the methods discussed in Section 5, where in this case the available data spans from 01-01-2021 to 12-31-2021 for both cities. The estimated mean and $95\%$ confidence intervals from both linear models and state space models are shown in Fig. 16. All models indicate that internet download speed increases over the 1-year period. Estimates from the original data and the re-sampled data are similar for these two cities, indicating that the regional sampling bias across census block groups does not lead to a large difference in the estimated temporal trend. The data for both cities contain large variability, potentially due to the different internet subscription plans. The availability of fiber internet plans may be one of the main drivers of the increase in internet speed.

Table 6: Linear trends of measured internet download speed (Mbps) per day for Cities C and D.

City	Device type	Estimates $\hat{\beta}_{1t}$ (95% CI)	Estimates $\hat{\beta}_{1t}^{r}$ (95% CI)
C	iOS	0.1384 (0.1310, 0.1458)	0.1509 (0.1403, 0.1615)
C	Android	0.2944 (0.2827, 0.3062)	0.2824 (0.2659, 0.2988)
D	iOS	0.1052 (0.0950, 0.1154)	0.0919 (0.0786, 0.1053)
D	Android	0.0552 (0.0428, 0.0676)	0.0633 (0.0460, 0.0805)

We further compare the estimated linear coefficients $\hat{\beta}_{1t}$ and $\hat{\beta}_{1t}^{r}$ of the linear trend of internet download speed for City C and City D in Table 6. All estimates are positive for both device types, whereas City C may have a faster increasing trend than City D.
