-
EM Estimation of the B-spline Copula with Penalized Log-Likelihood Function
Authors:
Xiaoling Dou,
Satoshi Kuriki,
Gwo Dong Lin,
Donald Richards
Abstract:
The B-spline copula function is defined by a linear combination of elements of the normalized B-spline basis. We develop a modified EM algorithm to maximize the penalized log-likelihood function, wherein we use the smoothly clipped absolute deviation (SCAD) penalty function for the penalization term. We conduct simulation studies to demonstrate the stability of the proposed numerical procedure, show that penalization yields estimates with smaller mean-square errors when the true parameter matrix is sparse, and provide methods for determining tuning parameters and for model selection. As an example, we analyze a data set consisting of birth and death rates from 237 countries, available at the website "Our World in Data," and we estimate the marginal density and distribution functions of those rates together with all parameters of our B-spline copula model.
Submitted 12 February, 2024;
originally announced February 2024.
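For readers unfamiliar with the SCAD penalty named in the abstract above, its commonly cited form (due to Fan and Li) is sketched below. The symbols $\theta$, $\lambda > 0$ and $a > 2$ are notation assumed here for illustration; the paper's own parameterization and the way the penalty enters its log-likelihood may differ.

```latex
p_{\lambda}(\theta) =
\begin{cases}
\lambda\,|\theta|, & |\theta| \le \lambda, \\[4pt]
\dfrac{2a\lambda\,|\theta| - \theta^{2} - \lambda^{2}}{2(a-1)}, & \lambda < |\theta| \le a\lambda, \\[8pt]
\dfrac{(a+1)\,\lambda^{2}}{2}, & |\theta| > a\lambda.
\end{cases}
```

The penalty grows linearly near zero (encouraging sparsity), then flattens to a constant, so large coefficients are not shrunk further. Fan and Li suggested $a = 3.7$ as a default.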
-
Sparse Gaussian Processes with Spherical Harmonic Features Revisited
Authors:
Stefanos Eleftheriadis,
Dominic Richards,
James Hensman
Abstract:
We revisit the Gaussian process model with spherical harmonic features and study connections between the associated RKHS, its eigenstructure and deep models. Based on this, we introduce a new class of kernels which correspond to deep models of continuous depth. In our formulation, depth can be estimated as a kernel hyper-parameter by optimizing the evidence lower bound. Further, we introduce sparseness in the eigenbasis by variational learning of the spherical harmonic phases. This enables scaling to larger input dimensions than previously, while also allowing for learning of high frequency variations. We validate our approach on machine learning benchmark datasets.
Submitted 28 March, 2023;
originally announced March 2023.
-
A Continuous-Time Markov Chain Model for the Spread of COVID-19
Authors:
Armine Bagyan,
Donald Richards
Abstract:
Since late 2019 the novel coronavirus, also known as COVID-19, has caused a pandemic that persists. This paper shows how a continuous-time Markov chain model for the spread of COVID-19 can be used to explain, and justify to undergraduate students, strategies now being used in attempts to control the virus. The material in the paper is written at the level of students who are taking an introductory course on the theory and applications of stochastic processes.
Submitted 21 June, 2022;
originally announced June 2022.
-
Comparing Classes of Estimators: When does Gradient Descent Beat Ridge Regression in Linear Models?
Authors:
Dominic Richards,
Edgar Dobriban,
Patrick Rebeschini
Abstract:
Methods for learning from data depend on various types of tuning parameters, such as penalization strength or step size. Since performance can depend strongly on these parameters, it is important to compare classes of estimators (by considering prescribed finite sets of tuning parameters), not just particularly tuned methods. In this work, we investigate classes of methods via the relative performance of the best method in the class. We consider the central problem of linear regression (with a random isotropic ground truth) and investigate the estimation performance of two fundamental methods, gradient descent and ridge regression. We unveil the following phenomena. (1) For general designs, constant step size gradient descent outperforms ridge regression when the eigenvalues of the empirical data covariance matrix decay slowly, as a power law with exponent less than unity. If instead the eigenvalues decay quickly, as a power law with exponent greater than unity or exponentially, we show that ridge regression outperforms gradient descent. (2) For orthogonal designs, we compute the exact minimax optimal class of estimators (achieving min-max-min optimality), showing it is equivalent to gradient descent with decaying learning rate. We find the sub-optimality of ridge regression and gradient descent with constant step size. Our results highlight that statistical performance can depend strongly on tuning parameters. In particular, while optimally tuned ridge regression is the best estimator in our setting, it can be outperformed by gradient descent by an arbitrary/unbounded amount when both methods are only tuned over finitely many regularization parameters.
Submitted 12 June, 2022; v1 submitted 26 August, 2021;
originally announced August 2021.
-
Stability & Generalisation of Gradient Descent for Shallow Neural Networks without the Neural Tangent Kernel
Authors:
Dominic Richards,
Ilja Kuzborskij
Abstract:
We revisit on-average algorithmic stability of GD for training overparameterised shallow neural networks and prove new generalisation and excess risk bounds without the NTK or PL assumptions. In particular, we show oracle type bounds which reveal that the generalisation and excess risk of GD are controlled by an interpolating network with the shortest GD path from initialisation (in a sense, an interpolating network with the smallest relative norm). While this was known for kernelised interpolants, our proof applies directly to networks trained by GD without intermediate kernelisation. At the same time, by relaxing oracle inequalities developed here we recover existing NTK-based risk bounds in a straightforward way, which demonstrates that our analysis is tighter. Finally, unlike most of the NTK-based analyses we focus on regression with label noise and show that GD with early stopping is consistent.
Submitted 9 November, 2021; v1 submitted 27 July, 2021;
originally announced July 2021.
-
Learning with Gradient Descent and Weakly Convex Losses
Authors:
Dominic Richards,
Mike Rabbat
Abstract:
We study the learning performance of gradient descent when the empirical risk is weakly convex, namely, the smallest negative eigenvalue of the empirical risk's Hessian is bounded in magnitude. By showing that this eigenvalue can control the stability of gradient descent, generalisation error bounds are proven that hold under a wider range of step sizes compared to previous work. Out of sample guarantees are then achieved by decomposing the test error into generalisation, optimisation and approximation errors, each of which can be bounded and traded off with respect to algorithmic parameters, sample size and magnitude of this eigenvalue. In the case of a two layer neural network, we demonstrate that the empirical risk can satisfy a notion of local weak convexity, specifically, the Hessian's smallest eigenvalue during training can be controlled by the normalisation of the layers, i.e., network scaling. This allows test error guarantees to then be achieved when the population risk minimiser satisfies a complexity assumption. By trading off the network complexity and scaling, insights are gained into the implicit bias of neural network scaling, which are further supported by experimental findings.
Submitted 1 June, 2021; v1 submitted 13 January, 2021;
originally announced January 2021.
-
Decentralised Learning with Random Features and Distributed Gradient Descent
Authors:
Dominic Richards,
Patrick Rebeschini,
Lorenzo Rosasco
Abstract:
We investigate the generalisation performance of Distributed Gradient Descent with Implicit Regularisation and Random Features in the homogeneous setting where a network of agents is given data sampled independently from the same unknown distribution. Along with reducing the memory footprint, Random Features are particularly convenient in this setting as they provide a common parameterisation across agents that allows one to overcome previous difficulties in implementing Decentralised Kernel Regression. Under standard source and capacity assumptions, we establish high probability bounds on the predictive performance for each agent as a function of the step size, number of iterations, inverse spectral gap of the communication matrix and number of Random Features. By tuning these parameters, we obtain statistical rates that are minimax optimal with respect to the total number of samples in the network. The algorithm provides a linear improvement over single machine Gradient Descent in memory cost and, when agents hold enough data with respect to the network size and inverse spectral gap, a linear speed-up in computational runtime for any network topology. We present simulations that show how the number of Random Features, iterations and samples impact predictive performance.
Submitted 1 July, 2020;
originally announced July 2020.
-
Asymptotics of Ridge (less) Regression under General Source Condition
Authors:
Dominic Richards,
Jaouad Mourtada,
Lorenzo Rosasco
Abstract:
We analyze the prediction error of ridge regression in an asymptotic regime where the sample size and dimension go to infinity at a proportional rate. In particular, we consider the role played by the structure of the true regression parameter. We observe that the case of a general deterministic parameter can be reduced to the case of a random parameter from a structured prior. The latter assumption is a natural adaptation of classic smoothness assumptions in nonparametric regression, which are known as source conditions in the context of regularization theory for inverse problems. Roughly speaking, we assume the large coefficients of the parameter are in correspondence to the principal components. In this setting a precise characterisation of the test error is obtained, depending on the input covariance and regression parameter structure. We illustrate this characterisation in a simplified setting to investigate the influence of the true parameter on optimal regularisation for overparameterized models. We show that interpolation (no regularisation) can be optimal even with bounded signal-to-noise ratio (SNR), provided that the parameter coefficients are larger on high-variance directions of the data, corresponding to a more regular function than posited by the regularization term. This contrasts with previous work considering ridge regression with isotropic prior, in which case interpolation is only optimal in the limit of infinite SNR.
Submitted 8 March, 2021; v1 submitted 11 June, 2020;
originally announced June 2020.
-
Distributed Machine Learning with Sparse Heterogeneous Data
Authors:
Dominic Richards,
Sahand N. Negahban,
Patrick Rebeschini
Abstract:
Motivated by distributed machine learning settings such as Federated Learning, we consider the problem of fitting a statistical model across a distributed collection of heterogeneous data sets whose similarity structure is encoded by a graph topology. Precisely, we analyse the case where each node is associated with fitting a sparse linear model, and edges join two nodes if the difference of their solutions is also sparse. We propose a method based on Basis Pursuit Denoising with a total variation penalty, and provide finite sample guarantees for sub-Gaussian design matrices. Taking the root of the tree as a reference node, we show that if the sparsity of the differences across nodes is smaller than the sparsity at the root, then recovery is successful with fewer samples than by solving the problems independently, or by using methods that rely on a large overlap in the signal supports, such as the group Lasso. We consider both the noiseless and noisy settings, and numerically investigate the performance of distributed methods based on Distributed Alternating Direction Methods of Multipliers (ADMM) and hyperspectral unmixing.
Submitted 27 November, 2021; v1 submitted 3 December, 2019;
originally announced December 2019.
-
Optimal Statistical Rates for Decentralised Non-Parametric Regression with Linear Speed-Up
Authors:
Dominic Richards,
Patrick Rebeschini
Abstract:
We analyse the learning performance of Distributed Gradient Descent in the context of multi-agent decentralised non-parametric regression with the square loss function when i.i.d. samples are assigned to agents. We show that if agents hold sufficiently many samples with respect to the network size, then Distributed Gradient Descent achieves optimal statistical rates with a number of iterations that scales, up to a threshold, with the inverse of the spectral gap of the gossip matrix divided by the number of samples owned by each agent raised to a problem-dependent power. The presence of the threshold comes from statistics. It encodes the existence of a "big data" regime where the number of required iterations does not depend on the network topology. In this regime, Distributed Gradient Descent achieves optimal statistical rates with the same order of iterations as gradient descent run with all the samples in the network. Provided the communication delay is sufficiently small, the distributed protocol yields a linear speed-up in runtime compared to the single-machine protocol. This is in contrast to decentralised optimisation algorithms that do not exploit statistics and only yield a linear speed-up in graphs where the spectral gap is bounded away from zero. Our results exploit the statistical concentration of quantities held by agents and shed new light on the interplay between statistics and communication in decentralised methods. Bounds are given in the standard non-parametric setting with source/capacity assumptions.
Submitted 13 November, 2019; v1 submitted 8 May, 2019;
originally announced May 2019.
-
Graph-Dependent Implicit Regularisation for Distributed Stochastic Subgradient Descent
Authors:
Dominic Richards,
Patrick Rebeschini
Abstract:
We propose graph-dependent implicit regularisation strategies for distributed stochastic subgradient descent (Distributed SGD) for convex problems in multi-agent learning. Under the standard assumptions of convexity, Lipschitz continuity, and smoothness, we establish statistical learning rates that retain, up to logarithmic terms, centralised statistical guarantees through implicit regularisation (step size tuning and early stopping) with appropriate dependence on the graph topology. Our approach avoids the need for explicit regularisation in decentralised learning problems, such as adding constraints to the empirical risk minimisation rule. Particularly for distributed methods, the use of implicit regularisation allows the algorithm to remain simple, without projections or dual methods. To prove our results, we establish graph-independent generalisation bounds for Distributed SGD that match the centralised setting (using algorithmic stability), and we establish graph-dependent optimisation bounds that are of independent interest. We present numerical experiments to show that the qualitative nature of the upper bounds we derive can be representative of real behaviours.
Submitted 18 September, 2018;
originally announced September 2018.
-
Long-Term Implications of the Revenue Transfer Methodology in the Affordable Care Act
Authors:
Ishan Muzumdar,
Donald Richards
Abstract:
The Affordable Care Act introduced a revenue transfer formula that requires insurance plans with generally healthier enrollees to pay funds into a revenue transfer pool to reimburse plans with generally less healthy enrollees. For a given plan, the issue arises of whether the plan will be a payer into or a receiver from the pool in a chosen future year. To examine that issue, we analyze data from The Actuary Magazine on transfer payments for 2014-2015, and we infer strong evidence of a statistical relationship between year-to-year transfer payments. We also apply to the data a Markov transition model to study annual changes in the payer-receiver statuses of insurance plans. We estimate that the limiting conditional probability that an insurance plan will pay into the pool, given that the plan had paid into the pool in 2014, is 55.6 percent. Further, that limiting probability is attained quickly because the conditional probability that an insurance plan will pay into the pool in 2024, given that the plan had paid into the pool in 2014, is estimated to be 55.7 percent. We also find the revenue transfer system to have the disturbing feature that once a plan enters the "state" of paying into the pool then it will stay in that state for an average period of 4.87 years; moreover, once a plan has received funds from the pool then it will stay in that state for an average period of 3.89 years.
Submitted 19 March, 2019; v1 submitted 3 March, 2018;
originally announced March 2018.
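The figures quoted in the abstract above can be cross-checked with a small sketch. It assumes a two-state (payer/receiver), time-homogeneous Markov chain in which the mean time spent in a state is the reciprocal of the annual probability of leaving it. That modelling choice, and all function and variable names below, are assumptions made for illustration and are not taken from the paper.

```python
# Illustrative cross-check only: assumes a two-state, time-homogeneous
# Markov chain (an assumption, not necessarily the authors' exact model).

MEAN_YEARS_AS_PAYER = 4.87      # average stay in the "payer" state (from the abstract)
MEAN_YEARS_AS_RECEIVER = 3.89   # average stay in the "receiver" state (from the abstract)

# With geometric sojourn times, P(leave state in a year) = 1 / mean stay.
leave_payer = 1.0 / MEAN_YEARS_AS_PAYER
leave_receiver = 1.0 / MEAN_YEARS_AS_RECEIVER


def limiting_payer_probability(a: float, b: float) -> float:
    """Stationary probability of the payer state, where a = P(payer -> receiver)
    and b = P(receiver -> payer)."""
    return b / (a + b)


def payer_probability_after(years: int, a: float, b: float) -> float:
    """P(payer after `years` steps | payer now), using the closed form for the
    n-step transition probability of a two-state chain."""
    pi = limiting_payer_probability(a, b)
    second_eigenvalue = 1.0 - a - b  # controls how fast the chain forgets its start
    return pi + (1.0 - pi) * second_eigenvalue ** years


limit = limiting_payer_probability(leave_payer, leave_receiver)
after_ten_years = payer_probability_after(10, leave_payer, leave_receiver)
print(f"limiting: {limit:.3f}, after 10 years: {after_ten_years:.3f}")
```

Under these assumptions the two printed values come out at roughly 0.556 and 0.557, in line with the 55.6 and 55.7 percent reported in the abstract.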
-
Distance Correlation: A New Tool for Detecting Association and Measuring Correlation Between Data Sets
Authors:
Donald St. P. Richards
Abstract:
The difficulties of detecting association, measuring correlation, and establishing cause and effect have fascinated mankind since time immemorial. Democritus, the Greek philosopher, underscored well the importance and the difficulty of proving causality when he wrote, "I would rather discover one cause than gain the kingdom of Persia." To address the difficulties of relating cause and effect, statisticians have developed many inferential techniques. Perhaps the most well-known method stems from Karl Pearson's coefficient of correlation, which Pearson introduced in the late 19th century based on ideas of Francis Galton.
I will describe in this lecture the recently-devised distance correlation coefficient and describe its advantages over the Pearson and other classical measures of correlation. We will examine an application of the distance correlation coefficient to data drawn from large astrophysical databases, where it is desired to classify galaxies according to various types. Further, the lecture will analyze data arising in the ongoing national discussion of the relationship between state-by-state homicide rates and the stringency of state laws governing firearm ownership.
The lecture will also describe a remarkable singular integral which lies at the core of the theory of the distance correlation coefficient. We will see that this singular integral admits generalizations to the truncated Maclaurin expansions of the cosine function and to the theory of spherical functions on symmetric cones.
Submitted 14 August, 2017;
originally announced September 2017.
-
Statistical Implications of the Revenue Transfer Methodology in the Affordable Care Act
Authors:
Michelle Li,
Donald Richards
Abstract:
The Affordable Care Act (ACA) includes a permanent revenue transfer methodology which provides financial incentives to health insurance plans that have higher than average actuarial risk. In this paper, we derive some statistical implications of the revenue transfer methodology in the ACA. We treat as random variables the revenue transfers between individual insurance plans in a given marketplace, where each plan's revenue transfer amount is measured as a percentage of the plan's total premium. We analyze the means and variances of those random variables, and deduce from the zero sum nature of the revenue transfers that there is no limit to the magnitude of revenue transfer payments relative to plans' total premiums. Using data provided by the American Academy of Actuaries and by the Centers for Medicare and Medicaid Services, we obtain an explanation for the empirical phenomenon that revenue transfers are more variable, and can be substantially greater, for insurance plans with smaller market shares. We show that it is often the case that an insurer which has decreasing market share will also have increased volatility in its revenue transfers.
Submitted 7 June, 2018; v1 submitted 2 March, 2017;
originally announced March 2017.
-
Gaussian Random Particles with Flexible Hausdorff Dimension
Authors:
Linda V. Hansen,
Thordis L. Thorarinsdottir,
Evgeni Ovcharov,
Tilmann Gneiting,
Donald Richards
Abstract:
Gaussian particles provide a flexible framework for modelling and simulating three-dimensional star-shaped random sets. In our framework, the radial function of the particle arises from a kernel smoothing, and is associated with an isotropic random field on the sphere. If the kernel is a von Mises-Fisher density, or uniform on a spherical cap, the correlation function of the associated random field admits a closed form expression. The Hausdorff dimension of the surface of the Gaussian particle reflects the decay of the correlation function at the origin, as quantified by the fractal index. Under power kernels we obtain particles with boundaries of any Hausdorff dimension between 2 and 3.
Submitted 12 February, 2015; v1 submitted 5 February, 2015;
originally announced February 2015.
-
Distance Correlation Methods for Discovering Associations in Large Astrophysical Databases
Authors:
Elizabeth Martinez-Gomez,
Mercedes T. Richards,
Donald St. P. Richards
Abstract:
High-dimensional, large-sample astrophysical databases of galaxy clusters, such as the Chandra Deep Field South COMBO-17 database, provide measurements on many variables for thousands of galaxies and a range of redshifts. Current understanding of galaxy formation and evolution rests sensitively on relationships between different astrophysical variables; hence an ability to detect and verify associations or correlations between variables is important in astrophysical research. In this paper, we apply a recently defined statistical measure called the distance correlation coefficient which can be used to identify new associations and correlations between astrophysical variables. The distance correlation coefficient applies to variables of any dimension; it can be used to determine smaller sets of variables that provide equivalent astrophysical information; it is zero only when variables are independent; and it is capable of detecting nonlinear associations that are undetectable by the classical Pearson correlation coefficient. Hence, the distance correlation coefficient provides more information than the Pearson coefficient. We analyze numerous pairs of variables in the COMBO-17 database with the distance correlation method and with the maximal information coefficient. We show that the Pearson coefficient can be estimated with higher accuracy from the corresponding distance correlation coefficient than from the maximal information coefficient. For given values of the Pearson coefficient, the distance correlation method has a greater ability than the maximal information coefficient to resolve astrophysical data into highly concentrated V-shapes, which enhances classification and pattern identification. These results are observed over a range of redshifts beyond the local universe and for galaxies from elliptical to spiral.
Submitted 3 December, 2013; v1 submitted 19 August, 2013;
originally announced August 2013.
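To illustrate the claim in the abstract above that distance correlation can detect nonlinear associations that the Pearson coefficient misses, here is a minimal sketch of the sample distance correlation for one-dimensional data, following the usual double-centred distance-matrix definition. It is a plain-Python illustration with O(n^2) cost; the function names and the toy data are assumptions for this example and have nothing to do with the COMBO-17 analysis.

```python
import math


def _double_centered_distances(values):
    """Pairwise |v_i - v_j| with row means and column means subtracted and
    the grand mean added back."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_mean = [sum(row) / n for row in dist]   # equals column means (symmetric matrix)
    grand_mean = sum(row_mean) / n
    return [[dist[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def _mean_product(p, q):
    n = len(p)
    return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length 1-D sequences."""
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    dcov2 = _mean_product(a, b)                       # squared distance covariance
    dvar_product = _mean_product(a, a) * _mean_product(b, b)
    if dvar_product <= 0.0:
        return 0.0                                    # one variable is constant
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_product))


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy data: y depends on x exactly, but not linearly.
xs = [i / 20 - 1 for i in range(41)]   # evenly spaced on [-1, 1]
ys = [v * v for v in xs]               # parabola

print("Pearson:", pearson(xs, ys))
print("Distance correlation:", distance_correlation(xs, ys))
```

On this symmetric parabola the Pearson coefficient is zero up to rounding, while the distance correlation is clearly positive, which is the behaviour the abstract describes.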
-
EM algorithms for estimating the Bernstein copula
Authors:
Xiaoling Dou,
Satoshi Kuriki,
Gwo Dong Lin,
Donald Richards
Abstract:
A method that uses order statistics to construct multivariate distributions with fixed marginals and which utilizes a representation of the Bernstein copula in terms of a finite mixture distribution is proposed. Expectation-maximization (EM) algorithms to estimate the Bernstein copula are proposed, and a local convergence property is proved. Moreover, asymptotic properties of the proposed semiparametric estimators are provided. Illustrative examples are presented using three real data sets and a 3-dimensional simulated data set. These studies show that the Bernstein copula is able to represent various distributions flexibly and that the proposed EM algorithms work well for such data.
Submitted 15 January, 2014; v1 submitted 12 January, 2013;
originally announced January 2013.
-
Counting and Locating the Solutions of Polynomial Systems of Maximum Likelihood Equations, II: The Behrens-Fisher Problem
Authors:
Max-Louis G. Buot,
Serkan Hosten,
Donald St. P. Richards
Abstract:
Let $μ$ be a $p$-dimensional vector, and let $Σ_1$ and $Σ_2$ be $p \times p$ positive definite covariance matrices. On being given random samples of sizes $N_1$ and $N_2$ from independent multivariate normal populations $N_p(μ,Σ_1)$ and $N_p(μ,Σ_2)$, respectively, the Behrens-Fisher problem is to solve the likelihood equations for estimating the unknown parameters $μ$, $Σ_1$, and $Σ_2$. We shall prove that for $N_1, N_2 > p$ there are, almost surely, exactly $2p+1$ complex solutions of the likelihood equations. For the case in which $p = 2$, we utilize Monte Carlo simulation to estimate the relative frequency with which a typical Behrens-Fisher problem has multiple real solutions; we find that multiple real solutions occur infrequently.
Submitted 6 September, 2007;
originally announced September 2007.