
Infinite-dimensional Likelihood Methods in Statistics

Research Programme: Statistics

Researcher: A.W. van der Vaart

Much of the statistical theory developed before 1980 was concerned with so-called parametric models. These are models that allow only finitely many degrees of freedom (unknowns) to the phenomenon being modelled. Thus, they tend to fit the phenomenon badly, unless it is observed under closely controlled and previously studied conditions. One of the most important directions in current statistical research is the study of infinite-dimensional models, for which there is both practical and theoretical motivation. Many large or badly structured data-sets simply cannot be reliably analyzed with the classical techniques; this holds in particular for data that result from observational rather than experimentally controlled studies, and/or are subject to several types of ``censoring'' (missing or partially observed data). An intrinsic mathematical motivation is that the research leads to interesting mathematics. The revolution in computing power of the past years was a precondition for these new techniques, because statistical techniques for infinite-dimensional models are typically computer-intensive.

If we restrict ourselves to independent replications of an experiment, leading to observations $X_1,\ldots,X_n$, then a model is precisely the set $\mathcal{P}$ of possible probability distributions $P$ of a single observation. For a classical parametric model this set of distributions is ``nicely'' parametrized by a Euclidean vector. The simplest type of infinite-dimensional model is the nonparametric model, in which we observe a random sample from a completely unknown distribution. Then $\mathcal{P}$ is the collection of all probability measures on the sample space, and, as is intuitively clear, the empirical distribution $\mathbb{P}_n = n^{-1}\sum_{i=1}^n \delta_{X_i}$ (with $\delta_x$ the Dirac measure at $x$) is an optimal estimator of the underlying distribution. More interesting are the intermediate models, which are not ``nicely'' parametrized by a Euclidean parameter, as the standard classical models are, but do restrict the distribution in an important way. Such models are often parametrized by infinite-dimensional parameters, such as distribution functions or densities, that express the structure under study. In particular, the model may have a natural parametrization $(\theta, \eta) \mapsto P_{\theta,\eta}$, where $\theta$ is a Euclidean parameter and $\eta$ runs through a nonparametric class of distributions, or some other infinite-dimensional set. This gives a semiparametric model, in which we aim at estimating $\theta$ and consider $\eta$ as a nuisance parameter. More generally, we focus on estimating the value $\psi(P)$ of some function $\psi$ on the model, with values in the real line or in some other Banach space.
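As a small illustration of the nonparametric case (a sketch, not part of the original text), the empirical distribution can be computed directly from a sample; the Python fragment below evaluates the empirical distribution function at a point and compares it with the true value, which here is known because the sample is simulated.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_cdf(sample):
    """Empirical distribution function of a one-dimensional sample:
    F_n(x) = (number of observations <= x) / n."""
    data = np.sort(np.asarray(sample))
    n = len(data)
    return lambda x: np.searchsorted(data, x, side="right") / n

# A random sample from a distribution we pretend is completely unknown.
x = rng.normal(size=2000)
F_n = empirical_cdf(x)
# The empirical distribution puts mass 1/n on each data point; its
# distribution function is close to the truth (here F(0) = 0.5) for large n.
print(F_n(0.0))
```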

The precise study of the properties of statistical experiments for a fixed number $n$ of observations is often intractable, and therefore many theoretical investigations concern asymptotics as $n \to \infty$. This is particularly true for infinite-dimensional models. In estimation theory we wish to find functions $T_n = T_n(X_1,\ldots,X_n)$ of the observations that approximate the quantity of interest $\psi(P)$ as well as possible. Asymptotically we should at least have that $T_n$ converges in probability to $\psi(P)$ if the probabilities are calculated according to $P$, for every $P \in \mathcal{P}$. This consistency property is comforting, but in practice we wish to know much more, for instance a rate of convergence $r_n$, and preferably limits of probabilities of the type $P\bigl(r_n\|T_n - \psi(P)\| \le x\bigr)$. These lead to confidence statements of the form: $\psi(P)$ is in the ball of radius $x/r_n$ around $T_n$, with probability (or confidence) $1-\alpha$. The number $x$ is determined from the limit distribution of $r_n\bigl(T_n - \psi(P)\bigr)$.

For classical parametric models the most important general method of constructing estimators is the method of maximum likelihood. The likelihood function is the joint density of the observations (relative to some dominating measure), viewed as a function of the parameter, and the maximum likelihood estimator is the point of maximum of this function. The asymptotic theory of this estimator for parametric models is well known. In the most common case the scaling rate $r_n$ is equal to $\sqrt{n}$ and the probabilities $P\bigl(\sqrt{n}\,\|T_n - \psi(P)\| \le x\bigr)$ converge to probabilities under the normal distribution, with a certain variance that can be expressed in the Fisher information $I_\theta$. A good first introduction to statistics should introduce approximate $1-\alpha$ confidence statements of the type $\hat\theta_n \pm z_{\alpha}/\sqrt{n I_{\hat\theta_n}}$, based on the maximum likelihood estimator $\hat\theta_n$.
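A concrete parametric illustration (hypothetical, not taken from the text): in the exponential model both the maximum likelihood estimator and the Fisher-information-based confidence interval have closed forms, so the recipe above can be carried out in a few lines.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: the exponential model with density
# p_lam(x) = lam * exp(-lam * x) on (0, infinity).
true_lam = 2.0
n = 5000
x = rng.exponential(scale=1.0 / true_lam, size=n)

# The log-likelihood n*log(lam) - lam*sum(x) is maximized at lam = 1/mean(x).
lam_hat = 1.0 / x.mean()

# Fisher information per observation is I(lam) = 1/lam^2, so an approximate
# 95% confidence interval is lam_hat +/- 1.96 / sqrt(n * I(lam_hat)).
half_width = 1.96 * lam_hat / np.sqrt(n)
ci = (lam_hat - half_width, lam_hat + half_width)
print(round(lam_hat, 3), tuple(round(c, 3) for c in ci))
```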

Given the results for the classical, parametric models, it is natural to try the same maximum likelihood recipe for infinite-dimensional models. Here the situation turns out to be much more complicated, and a lot is still unknown.

To begin with, it is not always clear how a ``likelihood function'' should be defined. Many infinite-dimensional models are not dominated, in the sense that not every $P \in \mathcal{P}$ possesses a density relative to a fixed measure. Even if the model is dominated, it may happen that the supremum of the likelihood over all $P \in \mathcal{P}$ is infinite, and a maximum likelihood estimator does not exist. One way to overcome such problems is to use the empirical likelihood, defined as the map
\[
P \mapsto \prod_{i=1}^n P\{X_i\},
\]
with domain $\mathcal{P}$, where $P\{x\}$ denotes the probability of the point $x$ under $P$. Other possibilities are to introduce a penalty term $J(P)$ in the likelihood, which disqualifies the $P$ that were causing trouble before; or to restrict the maximization to an approximating set $\mathcal{P}_n$, which will need to grow with $n$ to induce consistency and ensure a good rate of convergence.
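In the fully nonparametric model the empirical likelihood has a well-known maximizer: among all distributions supported on the data points, the product $\prod_i P\{X_i\}$ is largest for the empirical distribution, which puts mass $1/n$ on each point. The sketch below (a numerical check, not a proof) compares the uniform weight vector against randomly drawn probability vectors.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 50
# Empirical likelihood of a discrete P supported on the observed points
# x_1, ..., x_n: writing w_i = P{x_i}, the criterion is prod_i w_i.
def log_empirical_likelihood(w):
    return float(np.sum(np.log(np.clip(w, 1e-300, None))))

# The empirical distribution puts mass 1/n on every observed point.
uniform = np.full(n, 1.0 / n)
best = log_empirical_likelihood(uniform)

# Any other probability vector does worse (by the AM-GM inequality),
# so the empirical distribution is the maximum likelihood estimator here.
rivals = max(log_empirical_likelihood(rng.dirichlet(np.ones(n)))
             for _ in range(1000))
print(round(best, 3), round(rivals, 3))
```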

Once a likelihood is defined, we wish to compute its point of maximum. Analytic solutions are rare, so we search for an efficient numerical algorithm. This may be nontrivial, because the function to be optimized may be of high dimension. Next, we wish to obtain confidence statements using the maximum likelihood estimator. Again the situation is considerably more complicated than for classical models. Some aspects $\psi(\hat P_n)$ of the maximum likelihood estimator $\hat P_n$ resemble the approximation properties of classical maximum likelihood estimators. In particular, their convergence rate is $\sqrt{n}$, and asymptotically their distribution follows the normal distribution, with a variance involving a generalization of the Fisher information. However, other aspects may show a totally different and novel behaviour, of which very little is known at the present time.

In any case, the mathematical arguments needed to derive these results involve completely different tools. Very important are entropy calculations for the statistical models. Roughly, the size of $\mathcal{P}$ is measured by the number $N(\varepsilon)$ of balls of a fixed radius $\varepsilon$, in a suitable metric, needed to cover $\mathcal{P}$. It is required that this number is finite for every $\varepsilon > 0$ (at least locally), so that $\mathcal{P}$ must be totally bounded, and the speed at which these numbers go up as $\varepsilon$ decreases to $0$ should be bounded as well (by roughly $(1/\varepsilon)^K$) for some constant $K$. The rate at which the entropy grows is a measure of the size of the model, and is connected to the rate of convergence of the maximum likelihood estimator.
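To make the covering idea concrete, here is a toy computation (an illustration, not from the text) for the unit interval, where the $\varepsilon$-covering number is known exactly: halving $\varepsilon$ doubles the number of balls needed, i.e. the covering numbers grow polynomially in $1/\varepsilon$.

```python
import math

def covering_number_interval(eps):
    """Minimal number of closed intervals (balls) of radius eps needed
    to cover [0, 1]: centers at eps, 3*eps, 5*eps, ... suffice."""
    return math.ceil(1.0 / (2.0 * eps))

# Finite for every eps > 0 (total boundedness), and growing like
# (1/eps)^K with K = 1 as eps decreases to 0.
for eps in (0.5, 0.25, 0.125, 0.0625):
    print(eps, covering_number_interval(eps))
```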

Likelihood inference in statistics is not limited to the maximum likelihood estimator. In practice the likelihood ratio statistic is perhaps considered even more important. This is the ratio of the likelihood function at a given point and its maximum value. Fortunately, as for classical models, the asymptotic behaviour of this statistic is closely related to the behaviour of the maximum likelihood estimator. The statistic is used both for testing certain hypotheses, and as an alternative for obtaining confidence statements.
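The relation between the likelihood ratio statistic and confidence statements can be sketched in a toy parametric setting (hypothetical, not from the text): evaluate twice the log of the ratio of the maximum likelihood to the likelihood at a hypothesized parameter value, and compare it with a chi-squared quantile.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical illustration: the likelihood ratio statistic in the
# exponential model, evaluated at the true parameter value.
n = 1000
lam0 = 1.5
x = rng.exponential(scale=1.0 / lam0, size=n)
lam_hat = 1.0 / x.mean()          # maximum likelihood estimator

def loglik(lam):
    return n * np.log(lam) - lam * x.sum()

# Twice the log of the ratio of the maximum likelihood to the likelihood
# at lam0; asymptotically chi-squared with 1 degree of freedom under lam0.
lr = 2.0 * (loglik(lam_hat) - loglik(lam0))

# The set {lam : 2*(loglik(lam_hat) - loglik(lam)) <= 3.84} is an
# approximate 95% confidence set for lam.
print(round(lr, 3))
```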

The study of the efficiency of likelihood methods is of considerable interest. Again, for classical parametric models, this question has been solved, and the likelihood methods are efficient in an asymptotic sense. This means that for large n no better estimators or confidence statements are possible, by any method. Much progress has been made for infinite-dimensional models, but much is still unknown. It is not excluded that maximum likelihood is not optimal for certain purposes, even though it has already gained a strong position in practice.

As an example, consider the proportional odds model, which is used in the analysis of life times. The observations are a random sample from the distribution of $X = (T \wedge C,\, 1\{T \le C\},\, Z)$, where, given $Z$, the variables $T$ and $C$ are independent with unspecified probability distributions, apart from the requirement that the conditional distribution function $F_z$ of $T$ given $Z = z$ satisfies
\[
\frac{F_z(t)}{1 - F_z(t)} = e^{\theta^T z}\,\eta(t).
\]
The left side is the conditional odds, given $z$, of failure before time $t$. The unknown parameter $\eta$ is a nondecreasing, cadlag function from $[0,\infty)$ into itself with $\eta(0) = 0$. It is the odds of failure when $z = 0$ and $T$ is independent of $Z$. In a classical parametric model this function would have been modelled, for instance, linearly, or as a power function, but presently we only impose monotonicity.
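A small sketch of how $\theta$ and $\eta$ pin down the conditional distribution, assuming the odds-of-failure form $F_z(t)/(1 - F_z(t)) = e^{\theta z}\eta(t)$ with a scalar covariate; the baseline odds $\eta(t) = t$ and the coefficient value are hypothetical choices for illustration only.

```python
import math

# Hypothetical ingredients: a scalar covariate with coefficient THETA, and
# baseline odds eta(t) = t, which is nondecreasing, cadlag, with eta(0) = 0.
THETA = 0.5
eta = lambda t: t

def F(t, z):
    """Conditional distribution function implied by the proportional odds
    model, assuming the form F/(1 - F) = exp(THETA * z) * eta(t)."""
    odds = math.exp(THETA * z) * eta(t)
    return odds / (1.0 + odds)

# F(., z) is a distribution function for every z: it starts at 0, increases,
# and larger theta*z shifts probability mass toward earlier failure.
print(F(0.0, 1.0), F(1.0, 0.0), F(1.0, 2.0))
```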

In this example we cannot use the density of the observations as a likelihood, for the supremum would be infinite unless we restrict $\eta$ in an important way. Instead, we use the empirical likelihood. The probability that $X = x$, for $x = (y, \delta, z)$, is given by
\[
\Bigl(F_z\{y\}\, P(C \ge y \mid Z = z)\Bigr)^{\delta}\, \Bigl(\bigl(1 - F_z(y)\bigr)\, P(C = y \mid Z = z)\Bigr)^{1-\delta}\, P(Z = z).
\]
For likelihood inference concerning $(\theta, \eta)$ only, we may drop the terms involving the distribution of $C$ and the distribution of $Z$, and define the likelihood for one observation as
\[
l(\theta, \eta)(x) = F_z\{y\}^{\delta}\, \bigl(1 - F_z(y)\bigr)^{1-\delta}.
\]
The numerical problem is to compute the maximizer of the function $(\theta, \eta) \mapsto \prod_{i=1}^n l(\theta, \eta)(X_i)$, given fixed observations $X_1,\ldots,X_n$. The mathematical problem is to characterize probabilities of the type $P\bigl(\sqrt{n}(\hat\theta_n - \theta) \le x\bigr)$.


The figure shows levels of the profile likelihood function
\[
\theta \mapsto \sup_{\eta} \prod_{i=1}^n l(\theta, \eta)(X_i)
\]
(for a two-dimensional $\theta$ in this case), for a given data-set from a study of the survival of lung cancer patients. The centre of the ellipses is an estimate of $\theta$, and the ellipses are confidence sets that give an indication of the precision of the estimate. For instance, one should allow for the true value of the parameter to be outside the dark contour ellipse with ``probability'' 5%. The two coordinates of $\theta$ correspond to tumor type (horizontally) and general condition (vertically), and the plot shows that with large confidence the second has a small negative effect while the first has a larger positive effect (with the signs relative to the measurement scales).
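The profile likelihood construction can be mimicked in a toy parametric setting (a hypothetical normal model, not the lung cancer data): for each value of the parameter of interest the nuisance parameter is maximized out, and the resulting function of $\theta$ alone is examined.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data from a normal model with mean theta (the parameter of
# interest) and variance eta (the nuisance parameter).
x = rng.normal(loc=1.0, scale=2.0, size=400)
n = len(x)

def profile_loglik(theta):
    # For fixed theta the likelihood is maximized over eta in closed form,
    # at eta = mean((x - theta)^2); substitute it back into the log-likelihood.
    eta_hat = np.mean((x - theta) ** 2)
    return -0.5 * n * (np.log(2.0 * np.pi * eta_hat) + 1.0)

# Maximizing the profile over a grid recovers the maximizer of the full
# likelihood, which for this model is the sample mean.
grid = np.linspace(0.0, 2.0, 201)
values = np.array([profile_loglik(t) for t in grid])
theta_hat = grid[values.argmax()]
print(round(theta_hat, 2), round(x.mean(), 2))
```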

Statistics for infinite-dimensional parameters is a unifying theme for a large part of the research by the Programme ``Statistics'' of the Stieltjes Institute. Likelihood-based methods, in all their varieties and in different settings, form an important subtheme in this research.


Fri Mar 20 16:01:06 MET 1998