An Introduction to Bayesian Analysis: Theory and Methods

Springer Texts in Statistics

Advisors: George Casella, Stephen Fienberg, Ingram Olkin

Springer Texts in Statistics

Alfred: Elements of Statistics for the Life and Social Sciences
Berger: An Introduction to Probability and Stochastic Processes
Bilodeau and Brenner: Theory of Multivariate Statistics
Blom: Probability and Statistics: Theory and Applications
Brockwell and Davis: Introduction to Time Series and Forecasting, Second Edition
Carmona: Statistical Analysis of Financial Data in S-Plus
Chow and Teicher: Probability Theory: Independence, Interchangeability, Martingales, Third Edition
Christensen: Advanced Linear Modeling: Multivariate, Time Series, and Spatial Data; Nonparametric Regression and Response Surface Maximization, Second Edition
Christensen: Log-Linear Models and Logistic Regression, Second Edition
Christensen: Plane Answers to Complex Questions: The Theory of Linear Models, Third Edition
Creighton: A First Course in Probability Models and Statistical Inference
Davis: Statistical Methods for the Analysis of Repeated Measurements
Dean and Voss: Design and Analysis of Experiments
du Toit, Steyn, and Stumpf: Graphical Exploratory Data Analysis
Durrett: Essentials of Stochastic Processes
Edwards: Introduction to Graphical Modelling, Second Edition
Finkelstein and Levin: Statistics for Lawyers
Flury: A First Course in Multivariate Statistics
Ghosh, Delampady and Samanta: An Introduction to Bayesian Analysis: Theory and Methods
Gut: Probability: A Graduate Course
Heiberger and Holland: Statistical Analysis and Data Display: An Intermediate Course with Examples in S-PLUS, R, and SAS
Jobson: Applied Multivariate Data Analysis, Volume I: Regression and Experimental Design
Jobson: Applied Multivariate Data Analysis, Volume II: Categorical and Multivariate Methods
Kalbfleisch: Probability and Statistical Inference, Volume I: Probability, Second Edition
Kalbfleisch: Probability and Statistical Inference, Volume II: Statistical Inference, Second Edition
Karr: Probability
Keyfitz: Applied Mathematical Demography, Second Edition
Kiefer: Introduction to Statistical Inference
Kokoska and Nevison: Statistical Tables and Formulae
Kulkarni: Modeling, Analysis, Design, and Control of Stochastic Systems
Lange: Applied Probability
Lange: Optimization
Lehmann: Elements of Large-Sample Theory

(continued after index)

Jayanta K. Ghosh Mohan Delampady Tapas Samanta

An Introduction to Bayesian Analysis Theory and Methods

With 13 Illustrations

Springer

Jayanta K. Ghosh
Department of Statistics, Purdue University, 150 N. University Street, West Lafayette, IN 47907-2067, USA (ghosh@stat.purdue.edu)
and Indian Statistical Institute, 203 B.T. Road, Kolkata 700108, India

Mohan Delampady
Indian Statistical Institute, 8th Mile, Mysore Road, R.V. College Post, Bangalore 560059, India

Tapas Samanta
Indian Statistical Institute, 203 B.T. Road, Kolkata 700108, India

Editorial Board

George Casella
Department of Statistics, University of Florida, Gainesville, FL 32611-8545, USA

Stephen Fienberg
Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213-3890, USA

Ingram Olkin
Department of Statistics, Stanford University, Stanford, CA 94305, USA

Library of Congress Control Number: 2006922766 ISBN-10: 0-387-40084-2 ISBN-13: 978-0387-40084-6

e-ISBN: 0-387-35433-6

Printed on acid-free paper. ©2006 Springer Science+Business Media, LLC. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed in the United States of America.


To Ira, Shobha, and Shampa

Preface

Though there are many recent additions to graduate-level introductory books on Bayesian analysis, none has quite our blend of theory, methods, and applications. We believe a beginning graduate student taking a Bayesian course or just trying to find out what it means to be a Bayesian ought to have some familiarity with all three aspects. More specialization can come later. Each of us has taught a course like this at Indian Statistical Institute or Purdue. In fact, at least partly, the book grew out of those courses. We would also like to refer to the review (Ghosh and Samanta (2002b)) that first made us think of writing a book.

The book contains somewhat more material than can be covered in a single semester. We have done this intentionally, so that an instructor has some choice as to what to cover as well as which of the three aspects to emphasize. Such a choice is essential for the instructor. The topics include several results or methods that have not appeared in a graduate text before. In fact, the book can be used also as a second course in Bayesian analysis if the instructor supplies more details.

Chapter 1 provides a quick review of classical statistical inference. Some knowledge of this is assumed when we compare different paradigms. Following this, an introduction to Bayesian inference is given in Chapter 2, emphasizing the need for the Bayesian approach to statistics. Objective priors and objective Bayesian analysis are also introduced here. We use the terms objective and nonsubjective interchangeably. After briefly reviewing an axiomatic development of utility and prior, a detailed discussion on Bayesian robustness is provided in Chapter 3. Chapter 4 is mainly on convergence of posterior quantities and large sample approximations. In Chapter 5, we discuss Bayesian inference for problems with low-dimensional parameters, specifically objective priors and objective Bayesian analysis for such problems.
This covers a whole range of possibilities including uniform priors, Jeffreys' prior, other invariant objective priors, and reference priors. After this, in Chapter 6 we discuss some aspects of testing and model selection, treating these two problems as equivalent. This mostly involves Bayes factors and bounds on these computed over large classes of priors. Comparison with classical P-value is


also made whenever appropriate. Bayesian P-value and nonsubjective Bayes factors such as the intrinsic and fractional Bayes factors are also introduced. Chapter 7 is on Bayesian computations. Analytic approximation and the E-M algorithm are covered here, but most of the emphasis is on Markov chain based Monte Carlo methods including the M-H algorithm and Gibbs sampler, which are currently the most popular techniques. Following this, in Chapter 8 we cover the Bayesian approach to some standard problems in statistics. The next chapter covers more complex problems, namely, hierarchical Bayesian (HB) point and interval estimation in high-dimensional problems and parametric empirical Bayes (PEB) methods. Superiority of HB and PEB methods to classical methods and advantages of HB methods over PEB methods are discussed in detail. Akaike information criterion (AIC), Bayes information criterion (BIC), and other generalized Bayesian model selection criteria, high-dimensional testing problems, microarrays, and multiple comparisons are also covered here. The last chapter consists of three major methodological applications along with the required methodology.

We have marked those sections that are either very technical or very specialized. These may be omitted at first reading, and they need not be part of a standard one-semester course. Several problems have been provided at the end of each chapter. More problems and other material will be placed at http://www.isical.ac.in/~tapas/book

Many people have helped: our mentors, both friends and critics, from whom we have learnt, our family and students at ISI and Purdue, and the anonymous referees of the book. Special mention must be made of Arijit Chakrabarti for Sections 9.7 and 9.8, Sudipto Banerjee for Section 10.1, Partha P. Majumder for Appendix D, and Kajal Dihidar and Avranil Sarkar for help in several computations.
We alone are responsible for our philosophical views, however tentatively held, as well as presentation. Thanks to John Kimmel, whose encouragement and support, as well as advice, were invaluable.

Indian Statistical Institute and Purdue University
Indian Statistical Institute
Indian Statistical Institute
February 2006

Jayanta K. Ghosh Mohan Delampady Tapas Samanta

Contents

1 Statistical Preliminaries
  1.1 Common Models
    1.1.1 Exponential Families
    1.1.2 Location-Scale Families
    1.1.3 Regular Family
  1.2 Likelihood Function
  1.3 Sufficient Statistics and Ancillary Statistics
  1.4 Three Basic Problems of Inference in Classical Statistics
    1.4.1 Point Estimates
    1.4.2 Testing Hypotheses
    1.4.3 Interval Estimation
  1.5 Inference as a Statistical Decision Problem
  1.6 The Changing Face of Classical Inference
  1.7 Exercises

2 Bayesian Inference and Decision Theory
  2.1 Subjective and Frequentist Probability
  2.2 Bayesian Inference
  2.3 Advantages of Being a Bayesian
  2.4 Paradoxes in Classical Statistics
  2.5 Elements of Bayesian Decision Theory
  2.6 Improper Priors
  2.7 Common Problems of Bayesian Inference
    2.7.1 Point Estimates
    2.7.2 Testing
    2.7.3 Credible Intervals
    2.7.4 Testing of a Sharp Null Hypothesis Through Credible Intervals
  2.8 Prediction of a Future Observation
  2.9 Examples of Cox and Welch Revisited
  2.10 Elimination of Nuisance Parameters
  2.11 A High-dimensional Example
  2.12 Exchangeability
  2.13 Normative and Descriptive Aspects of Bayesian Analysis, Elicitation of Probability
  2.14 Objective Priors and Objective Bayesian Analysis
  2.15 Other Paradigms
  2.16 Remarks
  2.17 Exercises

3 Utility, Prior, and Bayesian Robustness
  3.1 Utility, Prior, and Rational Preference
  3.2 Utility and Loss
  3.3 Rationality Axioms Leading to the Bayesian Approach
  3.4 Coherence
  3.5 Bayesian Analysis with Subjective Prior
  3.6 Robustness and Sensitivity
  3.7 Classes of Priors
    3.7.1 Conjugate Class
    3.7.2 Neighborhood Class
    3.7.3 Density Ratio Class
  3.8 Posterior Robustness: Measures and Techniques
    3.8.1 Global Measures of Sensitivity
    3.8.2 Belief Functions
    3.8.3 Interactive Robust Bayesian Analysis
    3.8.4 Other Global Measures
    3.8.5 Local Measures of Sensitivity
  3.9 Inherently Robust Procedures
  3.10 Loss Robustness
  3.11 Model Robustness
  3.12 Exercises

4 Large Sample Methods
  4.1 Limit of Posterior Distribution
    4.1.1 Consistency of Posterior Distribution
    4.1.2 Asymptotic Normality of Posterior Distribution
  4.2 Asymptotic Expansion of Posterior Distribution
    4.2.1 Determination of Sample Size in Testing
  4.3 Laplace Approximation
    4.3.1 Laplace's Method
    4.3.2 Tierney-Kadane-Kass Refinements
  4.4 Exercises

5 Choice of Priors for Low-dimensional Parameters
  5.1 Different Methods of Construction of Objective Priors
    5.1.1 Uniform Distribution and Its Criticisms
    5.1.2 Jeffreys Prior as a Uniform Distribution
    5.1.3 Jeffreys Prior as a Minimizer of Information
    5.1.4 Jeffreys Prior as a Probability Matching Prior
    5.1.5 Conjugate Priors and Mixtures
    5.1.6 Invariant Objective Priors for Location-Scale Families
    5.1.7 Left and Right Invariant Priors
    5.1.8 Properties of the Right Invariant Prior for Location-Scale Families
    5.1.9 General Group Families
    5.1.10 Reference Priors
    5.1.11 Reference Priors Without Entropy Maximization
    5.1.12 Objective Priors with Partial Information
  5.2 Discussion of Objective Priors
  5.3 Exchangeability
  5.4 Elicitation of Hyperparameters for Prior
  5.5 A New Objective Bayes Methodology Using Correlation
  5.6 Exercises

6 Hypothesis Testing and Model Selection
  6.1 Preliminaries
    6.1.1 BIC Revisited
  6.2 P-value and Posterior Probability of H_0 as Measures of Evidence Against the Null
  6.3 Bounds on Bayes Factors and Posterior Probabilities
    6.3.1 Introduction
    6.3.2 Choice of Classes of Priors
    6.3.3 Multiparameter Problems
    6.3.4 Invariant Tests
    6.3.5 Interval Null Hypotheses and One-sided Tests
  6.4 Role of the Choice of an Asymptotic Framework
    6.4.1 Comparison of Decisions via P-values and Bayes Factors in Bahadur's Asymptotics
    6.4.2 Pitman Alternative and Rescaled Priors
  6.5 Bayesian P-value
  6.6 Robust Bayesian Outlier Detection
  6.7 Nonsubjective Bayes Factors
    6.7.1 The Intrinsic Bayes Factor
    6.7.2 The Fractional Bayes Factor
    6.7.3 Intrinsic Priors
  6.8 Exercises

7 Bayesian Computations
  7.1 Analytic Approximation
  7.2 The E-M Algorithm
  7.3 Monte Carlo Sampling
  7.4 Markov Chain Monte Carlo Methods
    7.4.1 Introduction
    7.4.2 Markov Chains in MCMC
    7.4.3 Metropolis-Hastings Algorithm
    7.4.4 Gibbs Sampling
    7.4.5 Rao-Blackwellization
    7.4.6 Examples
    7.4.7 Convergence Issues
  7.5 Exercises

8 Some Common Problems in Inference
  8.1 Comparing Two Normal Means
  8.2 Linear Regression
  8.3 Logit Model, Probit Model, and Logistic Regression
    8.3.1 The Logit Model
    8.3.2 The Probit Model
  8.4 Exercises

9 High-dimensional Problems
  9.1 Exchangeability, Hierarchical Priors, Approximation to Posterior for Large p, and MCMC
    9.1.1 MCMC and E-M Algorithm
  9.2 Parametric Empirical Bayes
    9.2.1 PEB and HB Interval Estimates
  9.3 Linear Models for High-dimensional Parameters
  9.4 Stein's Frequentist Approach to a High-dimensional Problem
  9.5 Comparison of High-dimensional and Low-dimensional Problems
  9.6 High-dimensional Multiple Testing (PEB)
    9.6.1 Nonparametric Empirical Bayes Multiple Testing
    9.6.2 False Discovery Rate (FDR)
  9.7 Testing of a High-dimensional Null as a Model Selection Problem
  9.8 High-dimensional Estimation and Prediction Based on Model Selection or Model Averaging
  9.9 Discussion
  9.10 Exercises

10 Some Applications
  10.1 Disease Mapping
  10.2 Bayesian Nonparametric Regression Using Wavelets
    10.2.1 A Brief Overview of Wavelets
    10.2.2 Hierarchical Prior Structure and Posterior Computations
  10.3 Estimation of Regression Function Using Dirichlet Multinomial Allocation
  10.4 Exercises

A Common Statistical Densities
  A.1 Continuous Models
  A.2 Discrete Models

B Birnbaum's Theorem on Likelihood Principle

C Coherence

D Microarray

E Bayes Sufficiency

References
Author Index
Subject Index

1

Statistical Preliminaries

We review briefly some of the background that is common to both classical statistics and Bayesian analysis. More details are available in Casella and Berger (1990), Lehmann and Casella (1998), and Bickel and Doksum (2001). The reader interested in Bayesian analysis can go directly to Chapter 2 after reading Section 1.1.

1.1 Common Models

A statistician, who has been given some data for analysis, begins by providing a probabilistic model of the way his data have been generated. Usually the data can be treated as generated by random sampling or some other random mechanism. Once a model is chosen, the data are treated as a random vector X = (X_1, X_2, ..., X_n). The probability distribution of X is specified by f(x|θ), which stands for a joint density (or a probability mass function), and θ is an unknown constant or a vector of unknown constants called a parameter. The parameter θ may be the unknown mean and variance of a population from which X is a random sample, e.g., the mean life of an electric bulb or the probability of doing something, vide Examples 1.1, 1.2, and 1.3 below. Often the data X are collected to learn about θ, i.e., the modeling precedes collection of data. The set of possible values of θ, called the parameter space, is denoted by Θ, which is usually a p-dimensional Euclidean space R^p or some subset of it, p being a positive integer. Our usual notation for the data vector and parameter vector are the boldface X and θ, respectively, but we may use plain X and θ if there is no fear of confusion.

Example 1.1. (normal distribution). X_1, X_2, ..., X_n are heights of n adults (all males or all females) selected at random from some population. A common model is that they are independently, normally distributed with mean μ and variance σ², where −∞ < μ < ∞ and σ² > 0, i.e., with θ = (μ, σ²),

$$ f(x_i \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x_i - \mu)^2}{2\sigma^2} \right\}. $$


We write this as: the X_i's are i.i.d. (independently and identically distributed) N(μ, σ²). If one samples both genders, the model would be much more complicated: the X_i's would be i.i.d. but the distribution of each X_i would be a mixture of two normals, N(μ_F, σ_F²) and N(μ_M, σ_M²), where F and M refer to females and males.

Example 1.2. (exponential distribution). Suppose a factory is producing some electric bulbs or electronic components, say, switches. If the data are a random sample of lifetimes of one kind of items being produced, we may model them as i.i.d. with common exponential density

$$ f(x_i \mid \theta) = \frac{1}{\theta}\, e^{-x_i/\theta}, \quad x_i > 0, \ \theta > 0. $$

Example 1.3. (Bernoulli, binomial distribution). Suppose we have n students in a class with

$$ X_i = \begin{cases} 1 & \text{if the } i\text{th student has passed a test;} \\ 0 & \text{otherwise.} \end{cases} $$

We model the X_i's as i.i.d. with the Bernoulli distribution:

$$ f(x_i \mid \theta) = \begin{cases} \theta & \text{if } x_i = 1; \\ 1 - \theta & \text{if } x_i = 0, \end{cases} $$

which may be written more compactly as θ^{x_i} (1 − θ)^{1 − x_i}. The parameter θ is the probability of passing. The joint probability function of X_1, X_2, ..., X_n is

$$ f(x \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i}, \quad \theta \in (0, 1). $$

If Y = \sum_{i=1}^{n} X_i, the number of students who pass, then P(Y = y) = \binom{n}{y} \theta^y (1-\theta)^{n-y}, which is a binomial distribution, denoted B(n, θ).
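The reduction from the joint Bernoulli probability to the binomial pmf of Y can be checked numerically. A minimal sketch, not part of the book; the data and the value of θ are made up for illustration:

```python
from math import comb

def bernoulli_joint(x, theta):
    """Joint pmf of an i.i.d. Bernoulli(theta) sequence x."""
    p = 1.0
    for xi in x:
        p *= theta**xi * (1 - theta)**(1 - xi)
    return p

def binomial_pmf(n, y, theta):
    """P(Y = y) for Y = sum of n i.i.d. Bernoulli(theta) variables."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

x = [1, 0, 1, 1, 0]          # hypothetical pass/fail data
theta = 0.6
n, y = len(x), sum(x)

# Each 0-1 sequence with total y has probability theta^y (1-theta)^(n-y),
# and there are comb(n, y) such sequences, giving the binomial pmf:
assert abs(comb(n, y) * bernoulli_joint(x, theta) - binomial_pmf(n, y, theta)) < 1e-12
```

The same check works for any data vector, since the joint pmf depends on x only through its total.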

Example 1.4. (binomial distribution with unknown n's and unknown p). Suppose Y_1, Y_2, ..., Y_k are the numbers of reported burglaries in a place in k years. One may model the Y_i's as independent B(n_i, p), where n_i is the number of actual burglaries (some reported, some not) in the ith year and p is the probability that a burglary is reported. Here θ is (n_1, ..., n_k, p).

Example 1.5. (Poisson distribution). Let X_1, X_2, ..., X_n be the numbers of accidents on a given street in n years. The X_i's are modeled as i.i.d. P(λ), i.e., Poisson with mean λ,

$$ P(X_i = x_i) = f(x_i \mid \lambda) = \exp(-\lambda)\, \frac{\lambda^{x_i}}{x_i!}, \quad x_i = 0, 1, 2, \ldots, \ \lambda > 0. $$


Example 1.6. (relation between binomial and Poisson). It is known that B(n, p) is well approximated by P(λ) if n is large and p is small but np = λ is nearly constant, or, more precisely, n → ∞, p → 0 in such a way that np → λ. This is used in modeling the distribution of defective items among some particular products, e.g., bulbs or switches or clothes. Suppose a lot size n is large. Then the number of defective items, say X, is assumed to have a Poisson distribution. Closely related to the binomial are three other distributions, namely, the geometric, the negative binomial, which includes the geometric distribution, and the multinomial. All three, especially the last, are important.
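The quality of the Poisson approximation in Example 1.6 is easy to verify numerically. A small sketch, our own illustration; n = 1000 and λ = 2 are arbitrary choices:

```python
from math import comb, exp, factorial

def binom_pmf(n, p, x):
    """B(n, p) probability mass at x."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(lam, x):
    """P(lambda) probability mass at x."""
    return exp(-lam) * lam**x / factorial(x)

# Large n, small p, with np = lambda held fixed: the two pmfs agree closely.
n, lam = 1000, 2.0
p = lam / n
max_err = max(abs(binom_pmf(n, p, x) - poisson_pmf(lam, x)) for x in range(20))
assert max_err < 1e-3
```

Shrinking p further (with np fixed) drives `max_err` toward zero, in line with the limit n → ∞, np → λ.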

Example 1.7. (geometric). Consider an experiment or trial with two possible outcomes: success with probability p and failure with probability 1 − p. For example, one may be trying to hit a bull's eye with a dart. Let X be the number of failures in a sequence of independent trials until the first success is observed. Then

$$ P\{X = x\} = (1 - p)^x p, \quad x = 0, 1, \ldots $$

This is a discrete analogue of the exponential distribution.

Example 1.8. (negative binomial). In the same setup as above, let k be given and let X be the number of failures until k successes are observed. Then

$$ P\{X = x\} = \binom{x + k - 1}{k - 1} p^k (1 - p)^x, \quad x = 0, 1, \ldots $$

This is the negative binomial distribution. The geometric distribution is a special case.

Example 1.9. (multinomial). Suppose an urn has N balls of k colors, the number of balls of the jth color being N_j = N p_j, where 0 ≤ p_j ≤ 1 and \sum_{j=1}^{k} p_j = 1. We take a random sample of n balls, one by one and with replacement of the drawn ball before the next draw. Let X_i = j if the ith ball drawn is of the jth color, and let n_j = the frequency of balls of the jth color in the sample. Then the joint probability function of X_1, X_2, ..., X_n is

$$ f(x \mid p) = \prod_{j=1}^{k} p_j^{n_j}, $$

and the joint probability function of n_1, ..., n_k is

$$ f(n_1, \ldots, n_k \mid p) = \frac{n!}{n_1! \cdots n_k!} \prod_{j=1}^{k} p_j^{n_j}. $$

The latter is called a multinomial distribution. We would also refer to the joint distribution of the X's as multinomial.


Instead of considering specific models, we introduce now three families of models that unify many theoretical discussions. In the following, X is a k-dimensional random vector unless stated otherwise, and f has the same connotation as before.

1.1.1 Exponential Families

Consider a family of probability models specified by f(x|θ), θ ∈ Θ. The family is said to be an exponential family if f(x|θ) has the representation

$$ f(x \mid \theta) = \exp\left\{ c(\theta) + \sum_{j=1}^{p} t_j(x)\, A_j(\theta) \right\} h(x), \qquad (1.1) $$

where c(·) and A_j(·) depend only on θ and t_j(·) depends only on x. Note that the support of f(x|θ), namely, the set of x where f(x|θ) > 0, is the same as the set where h(x) > 0 and hence does not depend on θ. To avoid trivialities, we assume that the support does not reduce to a single point. Problem 1 invites you to verify that Examples 1.1 through 1.3 and Example 1.5 are exponential families. It is easy to verify that if X_i, i = 1, ..., n, are i.i.d. with density f(x|θ), then their joint density is also exponential:

$$ f(x_1, \ldots, x_n \mid \theta) = \exp\left\{ n\, c(\theta) + \sum_{j=1}^{p} T_j\, A_j(\theta) \right\} \prod_{i=1}^{n} h(x_i), $$

with T_j = \sum_{i=1}^{n} t_j(x_i). There are two convenient reparameterizations. Using new parameters θ we may assume A_j(θ) = θ_j. Then

$$ f(x \mid \theta) = \exp\left\{ c(\theta) + \sum_{j=1}^{p} t_j(x)\, \theta_j \right\} h(x). \qquad (1.2) $$

The general theory of exponential families, see, e.g., Brown (1986), ensures one can interchange differentiation and integration. Differentiation once under the integral sign leads to

$$ 0 = E_\theta \left( \frac{\partial}{\partial \theta_j} \log f(X \mid \theta) \right) = \frac{\partial c}{\partial \theta_j} + E_\theta\, t_j(X), \quad j = 1, \ldots, p. \qquad (1.3) $$

In a similar way, differentiating once more under the integral sign,

$$ \operatorname{Cov}_\theta\big( t_j(X),\, t_{j'}(X) \big) = -\frac{\partial^2 c}{\partial \theta_j\, \partial \theta_{j'}}. \qquad (1.4) $$


In the second parameterization, we set η_j = E_θ(t_j(X)), i.e.,

$$ \eta_j = -\frac{\partial c}{\partial \theta_j}, \quad j = 1, \ldots, p. \qquad (1.5) $$

In Problem 3, you are asked to verify for p = 1 that η_1 is a one-one function of θ. A similar argument shows η = (η_1, ..., η_p) is a one-one function of θ. The parameters θ are convenient mathematically, while the usual statistical parameters are closer to η. You may wish to calculate the η's and verify this for Examples 1.1 through 1.3 and Example 1.5.
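Identity (1.5) can be checked numerically for the Bernoulli family of Example 1.3: in the canonical form (1.2) with t(x) = x and θ = log(p/(1−p)), the normalizer is c(θ) = −log(1 + e^θ). A minimal sketch, not from the book; the value of θ and the finite-difference step h are arbitrary choices:

```python
from math import exp, log

def c(theta):
    # Normalizer of Bernoulli in canonical form (1.2):
    # f(x|theta) = exp{c(theta) + x*theta}, x in {0, 1},
    # with theta = log(p / (1 - p)), so c(theta) = -log(1 + e^theta).
    return -log(1 + exp(theta))

theta = 0.7
p = exp(theta) / (1 + exp(theta))   # E_theta t(X) with t(x) = x

h = 1e-6
deriv = (c(theta + h) - c(theta - h)) / (2 * h)   # numerical dc/dtheta

# (1.5): eta = E_theta t(X) = -dc/dtheta
assert abs(-deriv - p) < 1e-8
```

The same finite-difference check applies to the Poisson family of Example 1.5 with c(θ) = −e^θ.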

1.1.2 Location-Scale Families

Definition 1.10. Let X be a real-valued random variable with density

$$ f(x \mid \mu, \sigma) = \frac{1}{\sigma}\, g\!\left( \frac{x - \mu}{\sigma} \right), $$

where g is also a density function, −∞ < μ < ∞, σ > 0. The parameters μ and σ are called location and scale parameters.

With X as above, Z = (X − μ)/σ has density g. The normal N(μ, σ²) is a location-scale family with Z being the standard normal, N(0, 1). Example 1.2 is a scale family with μ = 0, σ = θ. We can make it a location-scale family if we set

$$ f(x \mid \mu, \sigma) = \begin{cases} \frac{1}{\sigma} \exp\left( -\frac{x - \mu}{\sigma} \right) & \text{for } x \ge \mu; \\ 0 & \text{otherwise,} \end{cases} $$

but then it ceases to be an exponential family, for its range depends on μ. The other examples, namely, the Bernoulli, binomial, and Poisson, are not location-scale families.
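Definition 1.10 says that any density g generates a location-scale family through the transformation X = μ + σZ with Z ~ g. A minimal sketch, our own illustration (the sampler, parameter values, and sample size are arbitrary choices):

```python
import random

def location_scale_sample(g_sampler, mu, sigma, n, seed=0):
    """Draw n values from f(x|mu, sigma) = (1/sigma) g((x - mu)/sigma)
    by transforming draws Z ~ g via X = mu + sigma * Z."""
    rng = random.Random(seed)
    return [mu + sigma * g_sampler(rng) for _ in range(n)]

# Standard-normal g yields the N(mu, sigma^2) location-scale family.
xs = location_scale_sample(lambda rng: rng.gauss(0, 1), mu=5.0, sigma=2.0, n=50_000)
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)

# Sample moments should be near mu = 5 and sigma^2 = 4.
assert abs(mean - 5.0) < 0.05 and abs(var - 4.0) < 0.2
```

Swapping in a different `g_sampler` (e.g. standard exponential shifted by nothing, or standard Cauchy) produces the corresponding location-scale family without changing any other code.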

Example 1.11. Let X have the uniform distribution over (θ_1, θ_2), so that

$$ f(x \mid \theta) = \begin{cases} \frac{1}{\theta_2 - \theta_1} & \text{if } \theta_1 < x < \theta_2; \\ 0 & \text{otherwise.} \end{cases} $$

This is also a location-scale family, with a reparameterization, which is not an exponential family.

Example 1.12. The Cauchy distribution specified by the density

$$ f(x \mid \mu, \sigma) = \frac{1}{\pi \sigma} \cdot \frac{\sigma^2}{\sigma^2 + (x - \mu)^2}, \quad -\infty < x < \infty, $$

is a location-scale family which is not an exponential family.

1.1.3 Regular Family

A family f(x|θ) is said to be regular if the support {x : f(x|θ) > 0} does not depend on θ, f is k times continuously differentiable with respect to θ (with k usually equal to two or three), and one can differentiate under the integral sign as indicated below for real-valued θ:

$$ E_\theta \left( \frac{\partial}{\partial \theta} \log f(X \mid \theta) \right) = \int_{-\infty}^{\infty} \left\{ \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right\} f(x \mid \theta)\, dx = \int_{-\infty}^{\infty} \frac{d}{d\theta} f(x \mid \theta)\, dx = \frac{d}{d\theta} \int_{-\infty}^{\infty} f(x \mid \theta)\, dx = 0, \qquad (1.6) $$

and similarly,

$$ E_\theta \left( \frac{\partial^2}{\partial \theta^2} \log f(X \mid \theta) \right) = -\int_{-\infty}^{\infty} \left( \frac{\partial}{\partial \theta} \log f(x \mid \theta) \right)^2 f(x \mid \theta)\, dx. \qquad (1.7) $$

The condition that the support of f(·|θ) is free of θ is required for the last two relations to hold. The results of Chapter 4 require regularity conditions of this kind. The exponential families satisfy these regularity conditions. Location-scale families may or may not satisfy them; usually the critical assumption is that relating to the support of f. Thus the Cauchy location-scale family satisfies these conditions, but not the uniform or the exponential density

$$ f(x \mid \mu, \sigma) = \frac{1}{\sigma} \exp\left( -\frac{x - \mu}{\sigma} \right), \quad x > \mu. $$
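Identity (1.6) can be illustrated by Monte Carlo for the exponential density of Example 1.2, where the score is ∂/∂θ log f(x|θ) = −1/θ + x/θ². A sketch, our own illustration; θ, the sample size, and the seed are arbitrary choices:

```python
import random

rng = random.Random(42)
theta = 2.0

def score(x, theta):
    # d/dtheta log f(x|theta) for f(x|theta) = (1/theta) exp(-x/theta)
    return -1.0 / theta + x / theta**2

# Average the score over draws X ~ Exp(mean theta); by (1.6) the
# expectation is exactly zero, so the sample average should be near zero.
n = 200_000
draws = [rng.expovariate(1.0 / theta) for _ in range(n)]
avg_score = sum(score(x, theta) for x in draws) / n
assert abs(avg_score) < 0.01
```

Repeating the experiment with the support-violating shifted exponential above (drawing X ~ μ + Exp and differentiating in μ) would not average to zero, which is the point of the support condition.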

1.2 Likelihood Function

A concept of fundamental importance is the likelihood function. Informally, for fixed x, the joint density or probability mass function (p.m.f.) f(x|θ), regarded as a function of θ, is called the likelihood function. When we think of f as the likelihood function we often suppress x and write f as L(θ). The likelihood function is not unique in that for any c(x) > 0 that may depend on x but not on θ, c(x) f(x|θ) is also a likelihood function. What is unique are the likelihood ratios L(θ_2)/L(θ_1), which indicate how plausible θ_2 is, relative to θ_1, in the light of the given data x. In particular, if the ratio is large, we have a lot of confidence in θ_2 relative to θ_1, and the reverse situation holds if the ratio is small. Of course, the threshold for what is large or small isn't easy to determine. It is important to note that the likelihood is a point function. It can provide information on the relative plausibility of two points θ_1 and θ_2, but not of two θ-sets, say, two non-degenerate intervals. If the sample size n is large, usually the likelihood function has a sharp peak as shown in the following figure. Let the value of θ where the maximum is attained be denoted the maximum likelihood estimate (MLE) θ̂; we define it formally later. In situations like this, one feels θ̂ is very plausible as an estimate of θ relative to any other points outside a small interval around θ̂. One would then expect θ̂ to be a good estimate of the unknown θ, at least in the sense of being close to it in some way (e.g., of being consistent, i.e., converging to θ in probability). We discuss these things more carefully below.

1 Statistical Preliminaries

Fig. 1.2. L(θ) for the double exponential model when the data come from a normal mixture.

Classical statistics also asserts that under regularity conditions and for large n, the maximum likelihood estimate minimizes the variance approximately within certain classes of estimates. Problem 10 provides a counter-example due to Basu (1988) when regularity conditions do not hold.

Definition 1.14. The maximum likelihood estimate (MLE) θ̂ is a value of θ where the likelihood function L(θ) = f(x|θ) attains its supremum, i.e., sup_θ f(x|θ) = f(x|θ̂).

Usually, the MLE can be found by solving the likelihood equation

(∂/∂θⱼ) log f(x|θ) = 0, j = 1, …, p. (1.8)

In Problem 4(b), you are asked to show the likelihood function is log-concave, i.e., its logarithm is a concave function. In this case, if (1.8) has a solution, it is unique and provides a global maximum. There are well-known theorems, see, e.g., Rao (1973), which show the existence of a solution of (1.8) that converges in probability to the unknown true θ if the dimension is fixed and Cramér-Rao type regularity conditions hold. If (1.8) has multiple roots, one has to be careful. A simple remedy is to first find a √n-consistent estimate Tₙ, i.e., an estimate Tₙ such that √n(Tₙ − θ) is bounded in probability, and then choose the solution nearest to Tₙ.
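The multiple-roots remedy just described can be sketched numerically. Below is a minimal illustration (not from the book; the Cauchy location model, sample size, and seed are all assumptions of the demo): the score function of a Cauchy(θ, 1) sample may have several roots, so we bracket a root near the sample median, a √n-consistent estimate, and solve the likelihood equation by bisection.

```python
import numpy as np

def cauchy_score(theta, x):
    # derivative of the Cauchy(theta, 1) log-likelihood:
    # sum of 2*(x_i - theta) / (1 + (x_i - theta)^2)
    d = x - theta
    return np.sum(2.0 * d / (1.0 + d ** 2))

def cauchy_mle(x, n_iter=100):
    # bracket a root near the sample median (a root-n-consistent
    # estimate) and solve the likelihood equation by bisection;
    # near the relevant root the score changes sign from + to -
    lo, hi = np.median(x) - 2.0, np.median(x) + 2.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if cauchy_score(mid, x) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
x = rng.standard_cauchy(500) + 1.0  # true theta = 1, assumed for the demo
theta_hat = cauchy_mle(x)
```

With a sample of this size, the root found this way lands close to the true θ, and the score is essentially zero there.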


1.3 Sufficient Statistics and Ancillary Statistics

Given the importance of the likelihood function, it is interesting and useful to know the smallest set of statistics T₁(x), …, Tₘ(x) in terms of which one can write down the likelihood function. As expected, this makes it necessary to introduce sufficient statistics.

Definition 1.15. Let X be distributed with density f(x|θ). Then T = T(X) = (T₁(X), …, Tₘ(X)) is sufficient for θ if the conditional distribution of X given T is free of θ.

A basic fact for verifying whether T is sufficient is the following factorization theorem: T is sufficient for θ iff f(x|θ) = g(T₁(x), …, Tₘ(x), θ)h(x). Using this, you are invited to prove (Problem 20) that the likelihood function can be written in terms of T iff T is sufficient. Thus the problem of finding the smallest T in terms of which one can write down the likelihood function reduces to the problem of finding what are called minimal sufficient statistics.
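The factorization theorem can be checked concretely. The following sketch (an illustration, not from the book; the particular data values are assumptions of the demo) builds two different samples sharing the sufficient statistics (ΣXᵢ, ΣXᵢ²) of the normal model and confirms that their likelihoods agree at every (μ, σ), as sufficiency demands.

```python
import numpy as np

def normal_loglik(x, mu, sigma):
    # log-likelihood of i.i.d. N(mu, sigma^2) data; by the factorization
    # theorem it depends on x only through n, sum(x) and sum(x**2)
    n = len(x)
    return (-n * np.log(sigma) - 0.5 * n * np.log(2 * np.pi)
            - np.sum((x - mu) ** 2) / (2 * sigma ** 2))

# two different samples engineered to share (sum X_i, sum X_i^2)
x1 = np.array([0.0, 1.0, 2.0])
d = np.array([1.0, 1.0, -2.0]) / np.sqrt(3.0)  # sums to 0, squares sum to 2
x2 = 1.0 + d

same_sum = np.isclose(x1.sum(), x2.sum())
same_sumsq = np.isclose((x1 ** 2).sum(), (x2 ** 2).sum())

# identical sufficient statistics, hence identical likelihoods
# at every parameter value (mu, sigma)
diffs = [normal_loglik(x1, m, s) - normal_loglik(x2, m, s)
         for m in (-1.0, 0.5, 3.0) for s in (0.5, 1.0, 2.0)]
```

For the normal family the factor h(x) is constant, so the two likelihood functions are not just proportional but equal.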

Definition 1.16. A sufficient statistic T 0 is minimal sufficient (or smallest among sufficient statistics) if T 0 is a function of every sufficient statistic. Clearly, a one-one function of a minimal sufficient statistic is also minimal sufficient. In spite of the somewhat abstract definition, minimal sufficient statistics are usually easy to find by inspection. Most examples in this book would be covered by the following fact (Problem 19).

Fact. Suppose Xᵢ, i = 1, 2, …, n are i.i.d. from an exponential family. Then (Tⱼ = Σᵢ₌₁ⁿ tⱼ(Xᵢ), j = 1, …, p) together form a minimal sufficient statistic and hence the smallest set of statistics in terms of which we may write down the likelihood function.

Using this, you can prove (Σᵢ₌₁ⁿ Xᵢ, Σᵢ₌₁ⁿ Xᵢ²) is minimal sufficient for μ and σ² if X₁, X₂, …, Xₙ are i.i.d. N(μ, σ²). This in turn implies (X̄, s² = (1/(n−1)) Σ(Xᵢ − X̄)²) is also minimal sufficient for (μ, σ²), being a one-one function of (Σᵢ₌₁ⁿ Xᵢ, Σᵢ₌₁ⁿ Xᵢ²). In the same way, X̄ is minimal sufficient for both i.i.d. B(1, p) and P(λ). In Problem 10, one has to show X(1) = min(X₁, X₂, …, Xₙ) and X(n) = max(X₁, X₂, …, Xₙ) are together minimal sufficient for U(θ, 2θ). A bad case is that of i.i.d. Cauchy(μ, σ²). It is known (see, e.g., Lehmann and Casella (1998)) that the minimal sufficient statistic is the set of all order statistics (X(1), X(2), …, X(n)), where X(1) and X(n) have been defined earlier and X(r) is the rth X when the Xᵢ's are arranged in ascending order (assuming all Xᵢ's are distinct). This is a bad case because the order statistics together are always sufficient when the Xᵢ's are i.i.d., so if they form the minimal sufficient statistic, the density is so complicated that the likelihood cannot be expressed in terms of a smaller set of statistics. The advantage of sufficiency is that we can replace the original data set x by the minimal


sufficient statistic. Such reduction works well for i.i.d. random variables with an exponential family of distributions or special examples like U(θ₁, θ₂). It doesn't work well in other cases, including location-scale families. There are various results in classical statistics that show a sufficient statistic contains all the information about θ in the data X. At the other end is a statistic whose distribution does not depend on θ and so contains no information about θ. Such a statistic is called ancillary. Ancillary statistics are easy to exhibit if X₁, …, Xₙ are i.i.d. with a location-scale family of densities. In fact, for any four integers a, b, c, and d, the ratio

(X(a) − X(b))/(X(c) − X(d)) = (Z(a) − Z(b))/(Z(c) − Z(d))

is ancillary because the right-hand side is expressed in terms of order statistics of the Zᵢ's, where Zᵢ = (Xᵢ − μ)/σ, i = 1, …, n, are i.i.d. with a distribution free of μ and σ. There is an interesting technical theorem, due to Basu, which establishes independence of a complete sufficient statistic and an ancillary statistic. The result is useful in many calculations. Before we state Basu's theorem, we need to introduce the notion of completeness.

Definition 1.17. A statistic T or its distribution is said to be complete if for any real-valued function ψ(T), E_θ ψ(T(X)) = 0 ∀ θ implies ψ(T(X)) = 0 (with probability one under all θ).

Suppose T is discrete. The condition then simply means the family of p.m.f.'s f_T(t|θ) of T is rich enough that there is no non-zero ψ(t) orthogonal to f_T(t|θ) for all θ, in the sense Σₜ ψ(t) f_T(t|θ) = 0 for all θ.

Theorem 1.18. (Basu). Suppose T is a complete sufficient statistic and U is any ancillary statistic. Then T and U are independent for all θ.

Proof. Because T is sufficient, the conditional probability of U being in some set B given T is free of θ and may be written as P_θ(U ∈ B|T) = φ(T). Since U is ancillary, E_θ(φ(T)) = P_θ(U ∈ B) = c, where c is a constant. Let ψ(T) = φ(T) − c. Then E_θ ψ(T) = 0 for all θ, implying ψ(T) = 0 (with probability one), i.e., P_θ(U ∈ B|T) = P_θ(U ∈ B). □

It can be shown that a complete sufficient statistic is minimal sufficient. In general, the converse isn't true. For exponential families, the minimal sufficient statistic (T₁, …, Tₚ) = (Σᵢ₌₁ⁿ t₁(Xᵢ), …, Σᵢ₌₁ⁿ tₚ(Xᵢ)) is complete. For X₁, X₂, …, Xₙ i.i.d. U(θ₁, θ₂), (X(1), X(n)) is a complete sufficient statistic. Here are a couple of applications of Basu's theorem.


Example 1.19. Suppose X₁, X₂, …, Xₙ are i.i.d. N(μ, σ²). Then X̄ and s² = (1/(n−1)) Σ(Xᵢ − X̄)² are independent. To prove this, treat σ² as fixed to start with and μ as the parameter. Then X̄ is complete sufficient and s² is ancillary. Hence X̄ and s² are independent by Basu's theorem.

Example 1.20. Suppose X₁, X₂, …, Xₙ are i.i.d. U(θ₁, θ₂). Then for any 1 < r < n, Y = (X(r) − X(1))/(X(n) − X(1)) is independent of (X(1), X(n)). This follows because Y is ancillary.
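Example 1.19 lends itself to a quick simulation check (an illustration, not part of the text; the sample size, number of replications, and seed are assumed values):

```python
import numpy as np

rng = np.random.default_rng(42)
# for each of many N(0, 1) samples of size n, compute the sample mean
# and the sample variance; Basu's theorem says the two are independent
n, reps = 10, 200_000
x = rng.normal(0.0, 1.0, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)

# independence implies zero correlation; empirically it is close to 0
corr = np.corrcoef(xbar, s2)[0, 1]
```

Zero correlation is of course weaker than independence, but for this pair the exact theory guarantees full independence, and the simulation is consistent with it.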

A somewhat different notion of sufficiency appears in Bayesian analysis. Its usefulness and relation to (classical) sufficiency is discussed in Appendix E.

1.4 Three Basic Problems of Inference in Classical Statistics

For simplicity, we take p = 1, so θ is a real-valued parameter. Informally, inference is an attempt to learn about θ. There are three natural things one may wish to do. One may wish to estimate θ by a single number. A classical estimate used in large samples is the MLE θ̂. Secondly, one may wish to choose an interval that covers θ with high probability. Thirdly, one may test hypotheses about θ, e.g., test what is called a null hypothesis H₀: θ = 0 against a two-sided alternative H₁: θ ≠ 0. More generally, one can test H₀: θ = θ₀ against H₁: θ ≠ θ₀, where θ₀ is a value of some importance. For example, θ is the effect of some new drug on one of the two blood pressures, θ₀ is the effect of an alternative drug in the market, and one is trying to test whether the new drug has different effects. If one wants to test whether the new drug is better, then instead of H₁: θ ≠ θ₀, one may like to consider one-sided alternatives H₁: θ < θ₀ or H₁: θ > θ₀.

1.4.1 Point Estimates

In principle, any statistic T(X) is an estimate, though the context usually suggests some special reasonable candidates, like the sample mean X̄ or the sample median for a population mean such as μ of N(μ, σ²). To choose a satisfactory or optimal estimate, one looks at the properties of its distribution. The two most important quantities associated with a distribution are its mean and variance, or mean and standard deviation, the latter usually called the standard error of the estimate. One would usually report a good estimate along with an estimate of its standard error. So one judges an estimate T by its mean E(T|θ) and variance Var(T|θ). If we are trying to estimate θ, we calculate the bias E(T|θ) − θ. One prefers small absolute values of bias; one possibility is to consider only unbiased estimates of θ, for which one requires E(T|θ) = θ. Problem 17 requires you to show both X̄ and the sample median are unbiased estimates for μ in


N(μ, σ²). If the object is to estimate some real-valued function τ(θ) of θ, one would require E(T|θ) = τ(θ). For unbiased estimates of τ, Var(T|θ) = E{(T − τ(θ))² | θ} measures how dispersed T is around τ(θ). The smaller the variance the better, so one may search for an unbiased estimate that minimizes the variance. Because θ is not known, one would have to try to minimize the variance for all θ. This is a very strong condition, but there is a good theory that applies to several classical examples. In general, however, it would be unrealistic to expect that such an optimal estimate exists. We will see the same difficulty in other problems of classical inference. We now summarize the basic theory in a somewhat informal manner.
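A small simulation makes the bias and standard error comparison concrete. This sketch (not from the book; the values of μ, n, and the seed are assumptions) estimates both quantities for X̄ and the sample median of normal data, anticipating Problem 17:

```python
import numpy as np

rng = np.random.default_rng(1)
# sampling distributions of the mean and the median as estimates of mu
# for i.i.d. N(mu, 1) data
mu, n, reps = 2.0, 25, 100_000
x = rng.normal(mu, 1.0, size=(reps, n))
means = x.mean(axis=1)
medians = np.median(x, axis=1)

bias_mean = means.mean() - mu      # ~0: X-bar is unbiased
bias_median = medians.mean() - mu  # ~0: the median is unbiased here too
se_mean = means.std()              # ~1/sqrt(n) = 0.2
se_median = medians.std()          # ~sqrt(pi/2)/sqrt(n), noticeably larger
```

Both estimates are unbiased, but the median has the larger standard error for normal data, which is the kind of comparison the variance criterion below formalizes.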

Theorem 1.21. Cramér-Rao Inequality (information inequality). Let T be an unbiased estimate of τ(θ). Suppose we can interchange differentiation and integration to get

(d/dθ) E(T|θ) = ∫ ⋯ ∫ T(x) f′(x|θ) dx

and

∫ ⋯ ∫ f′(x|θ) dx = 0.

Then

Var(T|θ) ≥ [τ′(θ)]² / Iₙ(θ),

where the ′ in τ′ and f′ indicates a derivative with respect to θ and Iₙ(θ) is the Fisher information in x, namely,

Iₙ(θ) = E{[(d/dθ) log f(X|θ)]² | θ}.

Proof. Let ψ(X, θ) = (d/dθ) log f(X|θ). The second relation above implies E(ψ(X, θ)|θ) = 0, and then Var(ψ(X, θ)|θ) = Iₙ(θ). The first relation implies Cov(T, ψ(X, θ)|θ) = τ′(θ). It then follows that

Var(T|θ) ≥ [Cov(T, ψ(X, θ)|θ)]² / Var(ψ(X, θ)|θ) = [τ′(θ)]² / Iₙ(θ). □

If X₁, …, Xₙ are i.i.d. f(x|θ), then

Iₙ(θ) = n I(θ),

where I(θ) is the Fisher information in a single observation,

I(θ) = E{[(d/dθ) log f(X₁|θ)]² | θ}.


To get a feeling for Iₙ(θ), consider an extreme case where f(x|θ) is free of θ. Clearly, in this case there can be no information about θ in X. On the other hand, if Iₙ(θ) is large, then on average a small change in θ leads to a big change in log f(x|θ), i.e., f depends strongly on θ, and one expects there is a lot that can be learned about θ and hence about τ(θ). A large value of Iₙ(θ) diminishes the lower bound, making it plausible that one may be able to get an unbiased estimate with small variance. Finally, if the lower bound is attained at all θ by T, then clearly T is a uniformly minimum variance unbiased estimate (UMVUE). We would call such estimates best unbiased estimates. A more powerful method of getting best unbiased estimates is via the Rao-Blackwell theorem.

Theorem 1.22. (Rao-Blackwell). If T is an unbiased estimate of τ(θ) and S is a sufficient statistic, then T′ = E(T|S) is also unbiased for τ(θ) and

Var(T′|θ) ≤ Var(T|θ) ∀ θ.

Corollary 1.23. If S is complete and sufficient, then T′ as constructed above is the best unbiased estimate for τ(θ).

Proof. By the property of conditional expectations,

E(T′|θ) = E{E(T|S) | θ} = E(T|θ).

(You may want to verify this at least for the discrete case.) Also,

Var(T|θ) = E[{(T − T′) + (T′ − τ(θ))}² | θ]
         = E{(T − T′)² | θ} + E{(T′ − τ(θ))² | θ},

because

Cov{T − T′, T′ − τ(θ) | θ} = E{(T − T′)(T′ − τ(θ)) | θ}
  = E[E{(T′ − τ(θ))(T − T′) | S} | θ]
  = E[(T′ − τ(θ)) E(T − T′|S) | θ] = 0.

The decomposition of Var(T|θ) above shows that it is greater than or equal to Var(T′|θ). □

The theorem implies that in our search for the best unbiased estimate, we may confine attention to unbiased estimates of τ(θ) based on S. Moreover, under completeness, T′ is the only such estimate.
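The Rao-Blackwell construction can be watched in action. In this sketch (not from the book; p, n, and the seed are assumed values), the crude unbiased estimate T = X₁ for a Bernoulli(p) sample is conditioned on the sufficient statistic S = ΣXᵢ, giving T′ = E(T|S) = S/n = X̄:

```python
import numpy as np

rng = np.random.default_rng(7)
# Rao-Blackwellization for i.i.d. Bernoulli(p): start from the crude
# unbiased estimate T = X_1 and condition on the sufficient statistic
# S = sum(X_i); here T' = E(T | S) = S/n, the sample mean
p, n, reps = 0.3, 20, 100_000
x = rng.binomial(1, p, size=(reps, n))
t_crude = x[:, 0].astype(float)  # T  = X_1, unbiased for p
t_rb = x.mean(axis=1)            # T' = E(T | S), also unbiased

bias_crude = t_crude.mean() - p
bias_rb = t_rb.mean() - p
var_crude = t_crude.var()        # ~p(1-p) = 0.21
var_rb = t_rb.var()              # ~p(1-p)/n, n times smaller
```

Both estimates have essentially zero bias, but conditioning on S cuts the variance by the factor n, exactly the decomposition in the proof above.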


Example 1.24. Consider a random sample from N(μ, σ²), σ² assumed known. Note that by either of the two previous results, X̄ is the best unbiased estimate for μ. The best unbiased estimate for μ² is X̄² − σ²/n by the Rao-Blackwell theorem. You can show it does not attain the Cramér-Rao lower bound. If a T attains the Cramér-Rao lower bound, it has to be a linear function (with θ = μ),

T(x) = a(θ) + b(θ) (d/dθ) log f(x|θ),

i.e., it must be of the form

T(x) = c(θ) + d(θ)x̄.

But T, being a statistic, cannot depend on θ, so

T(x) = c + d x̄,

where c, d are constants. A similar argument holds for any exponential family. Conversely, suppose a parametric model f(x|θ) allows a statistic T to attain the Cramér-Rao lower bound. Then,

T(x) = a(θ) + b(θ) (d/dθ) log f(x|θ),

which implies

(T(x) − a(θ))/b(θ) = (d/dθ) log f(x|θ).

Integrating both sides with respect to θ,

T(x) ∫ (b(θ))⁻¹ dθ − ∫ a(θ)(b(θ))⁻¹ dθ + d(x) = log f(x|θ),

where d(x) is the constant of integration. If we write A(θ) = ∫ (b(θ))⁻¹ dθ, c(θ) = −∫ a(θ)(b(θ))⁻¹ dθ, and d(x) = log h(x), we get an exponential family. The Cramér-Rao inequality remains important because it provides information about the variance of T. Also, even if a best unbiased estimate can't be found, one may be able to find an unbiased estimate with variance close to the lower bound. A fascinating recent application is Liu and Brown (1992). An unpleasant feature of the inequality as formulated above is that it involves conditions on T rather than only conditions on f(x|θ). A considerably more technical version without this drawback may be found in Pitman (1979). The theory for getting best unbiased estimates breaks down when there is no complete sufficient statistic. Except for the examples we have already seen, complete sufficient statistics rarely exist. Even when a complete sufficient statistic S exists, one has to find an unbiased estimate based on S. This can be hard. Two heuristic methods work sometimes. One is the method of indicator functions, illustrated in Problem 5.
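Example 1.24 can be verified numerically. The sketch below (an illustration with assumed values of μ, σ, n, and seed) checks that Var(X̄) matches the Cramér-Rao bound σ²/n for estimating μ, while the best unbiased estimate X̄² − σ²/n of μ² has variance strictly above the corresponding bound (2μ)²σ²/n:

```python
import numpy as np

rng = np.random.default_rng(3)
# For i.i.d. N(mu, sigma^2) with sigma known, I(mu) = 1/sigma^2, so the
# Cramer-Rao bound for unbiased estimates of mu is sigma^2/n, attained
# by X-bar.  For tau(mu) = mu^2 the bound is (2*mu)^2 * sigma^2 / n,
# and the best unbiased estimate X-bar^2 - sigma^2/n does not attain it.
mu, sigma, n, reps = 1.0, 2.0, 30, 200_000
x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)

var_xbar = xbar.var()                         # ~sigma^2/n, the bound
t = xbar ** 2 - sigma ** 2 / n                # unbiased for mu^2
cr_bound_mu2 = (2 * mu) ** 2 * sigma ** 2 / n
var_t = t.var()                               # strictly above its bound
```

The gap between var_t and the bound is real, not simulation noise: no unbiased estimate of μ² attains the Cramér-Rao bound in this model.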


The other is to start with a plausible estimate and then make a suitable adjustment to make it unbiased. Thus, to get an unbiased estimate of μ² for N(μ, σ²), one would start with X̄². We know for sure X̄² can't be unbiased since E(X̄²|μ, σ²) = μ² + σ²/n. So if σ² is known, we can use X̄² − σ²/n. If σ² is unknown, we can use X̄² − s²/n, where s² = Σ(Xᵢ − X̄)²/(n − 1) is an unbiased estimate of σ². Note that X̄² − s²/n is a function of the complete sufficient statistic (X̄, s²) but may be negative even though μ² is a positive quantity. For all these reasons, unbiasedness isn't as important in classical statistics as it used to be. Exceptions are in unbiased estimation of risk (see Berger and Robert (1990), Lu and Berger (1989a, b)) with various applications and occasionally in variance estimation, especially in high-dimensional problems. See Chapter 9 for an application. We note finally that for relatively small values of p and relatively large values of n, it is easy to find estimates that are approximately unbiased and approximately attain the Cramér-Rao lower bound in a somewhat weak sense. An informal introduction to such results appears below. Under regularity conditions, it can be shown that

√(n I(θ)) (θ̂ − θ) → N(0, 1) in distribution.

This implies θ̂ is approximately normal with mean θ and variance (nI(θ))⁻¹, which is the Cramér-Rao lower bound when we are estimating θ. Thus θ̂ is approximately normal with expectation equal to θ and variance equal to the Cramér-Rao lower bound for τ(θ) = θ. For a general differentiable τ(θ), we show τ(θ̂) has similar properties. Observe that τ(θ̂) = τ(θ) + (θ̂ − θ)τ′(θ) + smaller terms, which exhibits τ(θ̂) as an approximately linear function of θ̂. Hence τ(θ̂) is also approximately normal with

mean = τ(θ) + [(approximate) mean of (θ̂ − θ)] · τ′(θ) = τ(θ),

and

variance = (τ′(θ))² × [approximate variance of (θ̂ − θ)] = (τ′(θ))² / (n I(θ)).

The last expression is the Cramér-Rao lower bound for τ(θ). The method of approximating τ(θ̂) by a linear function based on a Taylor expansion is called the delta method. For N(μ, σ²) and fixed x, let τ(θ) = τ(θ, x) = P{X ≤ x | μ, σ}. An approximately best unbiased estimate is P{X ≤ x | μ̂, σ̂} = Φ((x − X̄)/s′), where s′ = √((1/n) Σ(Xᵢ − X̄)²).

In this formulation, the null hypothesis represents the status quo, as in the drug example. It could also mean an accepted scientific hypothesis, e.g., about the value of the gravitational constant or the velocity of light in some medium. This suggests that one should not reject the null hypothesis unless there is compelling evidence in the data in favor of H₁. This fact will be used below. A test is a rule that tells us for each possible data set (under our model f(x|θ)) whether to accept or reject H₀. Let W be the set of x's for which a given test rejects H₀ and Wᶜ the set where the test accepts H₀. The region W, called a critical region or rejection region, completely specifies the test. Sometimes one works with the indicator of W rather than W itself. The collection of all subsets W of Rⁿ or their indicators corresponds to all possible tests. How does one evaluate them in principle or choose one in some optimal manner? The error committed by rejecting H₀ when H₀ is true is called the error of the first kind. Avoiding this is considered more important than the so-called error of the second kind, committed when H₀ is accepted even though H₁ is true. For any given W,

Probability of error of the first kind = P_{θ₀}(X ∈ W) = E_{θ₀}(I(X)),

where I(x) is the indicator of W:

I(x) = 1 if x ∈ W; I(x) = 0 if x ∈ Wᶜ.

Probability of error of the second kind = P_θ(X ∈ Wᶜ) = 1 − E_θ(I(X)), for θ as in H₁. One also defines the power of detecting H₁ as 1 − P_θ(X ∈ Wᶜ) = E_θ(I(X)) for θ as in H₁. It turns out that, in general, if one tries to reduce one error probability the other goes up, so one cannot reduce both simultaneously. Because the probability of error of the first kind is more important, one first makes it small,

E_{θ₀}(I(X)) ≤ α, (1.9)

where α, conventionally .05, .01, etc., is taken to be a small number. Among all tests satisfying this, one then tries to minimize the probability of committing an error of the second kind or, equivalently, to maximize the power uniformly for all θ as in H₁. You can see the similarity of (1.9) with the restriction to unbiased estimates, and of the optimization problem subject to (1.9) with the problem of


minimizing variance among unbiased estimates. The best test is called uniformly most powerful (UMP), after Neyman and Pearson who developed this theory. It turns out that for exponential families and some special cases like U(0, θ) or U(θ − 1, θ + 1), one can find UMP tests for one-sided alternatives. The basic tool is the following result about a simple alternative H₁: θ = θ₁.

Lemma 1.25. (Neyman-Pearson). Consider H₀: θ = θ₀ versus H₁: θ = θ₁. Fix 0 ≤ α ≤ 1.

A. Suppose there exist a non-negative k and a test given by the indicator function I₀ such that

I₀(x) = 1 if f(x|θ₁) > k f(x|θ₀), and I₀(x) = 0 if f(x|θ₁) < k f(x|θ₀),

with no restriction on I₀ if f(x|θ₁) = k f(x|θ₀), such that E_{θ₀}(I₀(X)) = α. Then E_{θ₁}(I₀(X)) ≥ E_{θ₁}(I₁(X)) for all indicators I₁ satisfying

E_{θ₀}(I₁(X)) ≤ α,

i.e., the test given by I₀ is MP among all tests satisfying the previous inequality.

B. Suppose g is a given integrable function and we want all tests to satisfy E_{θ₀}(I(X)) = α and

∫ ⋯ ∫ I(x) g(x) dx = c (same for all I). (1.10)

Then among all such I, E_{θ₁}(I(X)) is maximum at

I₀(x) = 1 if f(x|θ₁) > k₁ f(x|θ₀) + k₂ g(x), and I₀(x) = 0 if f(x|θ₁) < k₁ f(x|θ₀) + k₂ g(x),

where k₁ and k₂ are two constants such that I₀ satisfies the two constraints given in (1.10).

C. If I₀ exists in A or B and I₁ is an indicator having the same maximizing property as I₀ under the same constraints, then I₀(x) and I₁(x) are the same if f(x|θ₁) − k f(x|θ₀) ≠ 0 in case of A, and if f(x|θ₁) − k₁ f(x|θ₀) − k₂ g(x) ≠ 0 in case of B.

Proof. A. By definition of I₀ and the fact that 0 ≤ I₁(x) ≤ 1 for all I₁, we have

∫ {(I₀(x) − I₁(x)) (f(x|θ₁) − k f(x|θ₀))} dx ≥ 0, (1.11)

which implies

∫ {I₀(x) − I₁(x)} f(x|θ₁) dx ≥ k ∫ I₀(x) f(x|θ₀) dx − k ∫ I₁(x) f(x|θ₀) dx ≥ kα − kα = 0.

B. The proof is similar to that of A except that one starts with

∫ (I₀ − I) {f(x|θ₁) − k₁ f(x|θ₀) − k₂ g(x)} dx ≥ 0.

C. Suppose I₀ is as in A and I₁ maximizes ∫ I f(x|θ₁) dx, i.e.,

∫ I₀ f(x|θ₁) dx = ∫ I₁ f(x|θ₁) dx,

subject to

∫ I₀ f(x|θ₀) dx = α and ∫ I₁ f(x|θ₀) dx = α.

Then,

∫ {I₀ − I₁} {f(x|θ₁) − k f(x|θ₀)} dx = 0.

But the integrand {I₀(x) − I₁(x)}{f(x|θ₁) − k f(x|θ₀)} is non-negative for all x. Hence I₀(x) = I₁(x) if f(x|θ₁) − k f(x|θ₀) ≠ 0. This completes the proof. □
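The lemma can be applied mechanically in a discrete example. The following sketch (not from the book; the alternative p₁ = 0.7 and α = .05 are assumptions of the demo) constructs the non-randomized MP test of H₀: p = 1/2 for X ∼ B(10, p), exploiting the fact that the likelihood ratio is increasing in x:

```python
from math import comb

# Most powerful test of H0: p = 0.5 versus H1: p = 0.7 from X ~ B(10, p).
# The likelihood ratio f(x|p1)/f(x|p0) is increasing in x, so the
# Neyman-Pearson test rejects for large x; without randomization we take
# the smallest cutoff c whose size P(X >= c | p0) does not exceed alpha.
n, p0, p1, alpha = 10, 0.5, 0.7, 0.05

def pmf(x, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

for c in range(n + 1):
    size = sum(pmf(x, p0) for x in range(c, n + 1))
    if size <= alpha:
        break

power = sum(pmf(x, p1) for x in range(c, n + 1))  # E_{p1}(I0)
```

Here the achieved size is 11/1024 ≈ .011, well below the nominal .05; this is the discreteness issue discussed later in the text, where a randomized test at the boundary point brings the size up to exactly α.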

Remark 1.26. Part A is called the sufficiency part of the lemma. Part B is a generalization of A. Part C is a kind of necessary condition for I₁ to be MP, provided I₀ as specified in A or B exists. If the Xᵢ's are i.i.d. N(μ, σ²), then {x : f(x|θ₁) = k f(x|θ₀)} has probability zero. This is usually the case for continuous random variables. Then the MP test, if it exists, is unique. This fails for some continuous random variables, like Xᵢ's that are i.i.d. U(0, θ), and for discrete random variables. In such cases the MP test need not be unique.

Using A of the lemma we show that for N(μ, σ²), σ² known, the UMP test of H₀: μ = μ₀ against a one-sided alternative, say H₁: μ > μ₀, is given by

I₀ = 1 if x̄ > μ₀ + z_α σ/√n; I₀ = 0 if x̄ < μ₀ + z_α σ/√n,

where z_α is such that P{Z > z_α} = α with Z ∼ N(0, 1). Fix μ₁ > μ₀. Note that f(x|μ₁)/f(x|μ₀) is an increasing function of x̄. Hence for any k in A, there is a constant c such that

f(x|μ₁) > k f(x|μ₀) if and only if x̄ > c.


So the MP test is given by the indicator

I₀ = 1 if x̄ > c; I₀ = 0 if x̄ < c,

where c is such that E_{μ₀}(I₀) = α. It is easy to verify that c = μ₀ + z_α σ/√n has this property. Because this test does not depend on the value of μ₁, it is MP for all μ₁ > μ₀, and hence it is UMP for H₁: μ > μ₀. In the same way, one can find the UMP test for H₁: μ < μ₀ and verify that the test now rejects H₀ if x̄ < μ₀ − z_α σ/√n. How about H₀: μ ≤ μ₀ versus H₁: μ > μ₀? Here we consider all tests with the property

P_{μ,σ²}(H₀ is rejected) ≤ α for all μ ≤ μ₀.

Using Problem 6 (or 7), it is easy to verify that the UMP test of H₀: μ = μ₀ versus H₁: μ > μ₀ is also UMP when the null is changed to H₀: μ ≤ μ₀. One consequence of these calculations and the uniqueness of MP tests (Part C) is that there is no UMP test against two-sided alternatives. Each of the two UMP tests does well for its own H₁ but very badly at other θ₁'s; e.g., the UMP test I₀ for H₀: μ = μ₀ versus H₁: μ > μ₀ obtained above has E_{μ₁}(I₀) → 0 as μ₁ → −∞. To avoid such poor behavior at some θ's, one may require that the power be no smaller than α. Then E_{θ₀}(I) ≤ α and E_θ(I) ≥ α for θ ≠ θ₀ imply E_{θ₀}(I) = α, and E_θ(I) has a global and hence a local minimum at θ = θ₀. Tests of this kind were first considered by Neyman and Pearson, who called them unbiased. There is a similarity with unbiased estimates that was later pointed out by Lehmann (1986) (see Chapter 1 there). Because every unbiased I satisfies the conditions of Part B with g = f′(x|θ₀), one can show that the MP test for any θ₁ ≠ θ₀ satisfies the conditions for I₀. With a little more effort, it can be shown that the MP test is in fact

I₀ = 1 if x̄ < c₁ or x̄ > c₂; I₀ = 0 otherwise,

for suitable c₁ and c₂. The given constraints can be satisfied by suitable choices of c₁ and c₂. This is the UMP unbiased test. We have so far discussed how to control α, the probability of error of the first kind, and then, subject to this and other constraints, minimize β(θ), the probability of error of the second kind. But how do we bring β(θ) to a desired level? This is usually done by choosing an appropriate sample size n; see Problem 8. The general theory for exponential families is similar, with T = Σᵢ₌₁ⁿ t(xᵢ) or T/n taking on the role of x̄. However, the distribution of T may be discrete, as in the case of binomial or Poisson. Then it may not be possible to find the


constants c or c₁, c₂. Try, for example, the case of B(5, p), H₀: p = 1/2 versus H₁: p > 1/2, α = .05. In practice, one chooses an α′ < α, as close to α as possible, for which the constants can be found, and applies the lemma with α′ instead of α. A different option, of some theoretical interest only, is to extend the class of tests to what are called randomized tests. A randomized test is given by a function 0 ≤ φ(x) ≤ 1, which we interpret as the probability of rejecting H₀ given x. By setting φ equal to an indicator we get back the non-randomized tests. With this extension, one can find a UMP test for binomial or Poisson of the form φ(x) = 1 if T > c, φ(x) = γ if T = c, and φ(x) = 0 if T < c, where the constant 0 ≤ γ ≤ 1 is chosen so that the size is exactly α.

… λφ₁ + (1 − λ)φ₂, 0 < λ < 1 …

6. … Show that P_μ{X̄ > c} is an increasing function of μ.

7. X₁, X₂, …, Xₙ are said to have a family of densities f(x|θ) with monotone likelihood ratio (MLR) in T(x) if there exists a sufficient statistic T(x) such that f(x|θ₂)/f(x|θ₁) is a non-decreasing function of T(x) if θ₂ > θ₁. (a) Verify that exponential families have this property. (b) If f(x|θ) has MLR in T, show that P_θ{T(X) > c} is non-decreasing in θ.

8. Let X₁, X₂, …, Xₙ be i.i.d. N(μ, 1) and let Δ > 0. (a) For H₀: μ = μ₀ versus H₁: μ > μ₀, show the smallest sample size n for which the UMP test has probability of error of the first kind equal to α and probability of error of the second kind ≤ β for μ ≥ μ₀ + Δ is (approximately) the integer part of ((z_α + z_β)/Δ)² + 1. Evaluate n numerically when Δ = .5, α = .01, β = .05. (b) For H₀: μ = μ₀ versus H₁: μ ≠ μ₀, show the smallest n such that the UMPU test has probability of error of the first kind equal to α and probability of error of the second kind ≤ β for |μ − μ₀| ≥ Δ is (approximately) the solution of Φ(z_{α/2} − √n Δ) + Φ(z_{α/2} + √n Δ) = 1 + β. Evaluate n numerically when Δ = 0.5, α = .01, and β = .05.

9. Let X₁, X₂, …, Xₙ be i.i.d. U(0, θ), θ > 0. Find the smallest n such that the UMP test of H₀: θ = θ₀ against H₁: θ > θ₀ has probability of error of the first kind equal to α and probability of error of the second kind ≤ β for θ ≥ θ₁, with θ₁ > θ₀.

10. (Basu (1988, p.1)) Let X₁, X₂, …, Xₙ be i.i.d. U(θ, 2θ), θ > 0. (a) What is the likelihood function of θ? (b) What is the minimal sufficient statistic in this problem? (c) Find θ̂, the MLE of θ. (d) Let X(1) = min(X₁, …, Xₙ) and T = (4θ̂ + X(1))/5. Show that E((T − θ)²)/E((θ̂ − θ)²) is always less than 1, and further,

E((T − θ)²)/E((θ̂ − θ)²) → 12/25 as n → ∞.
11. Suppose X 1 , X 2, ... , Xn are i.i.d N(p, 1). A statistician has to test H 0 : JL = 0; he selects his alternative depending on data. If X < 0, he tests against H 1 : JL < 0. If X > 0, his alternative is H 1 : JL > 0. (a) If the statistician has taken a= .05, what is his real a? (b) Calculate his power at JL = ±1 when his nominal a= .05, n = 25. Will this power be smaller than the power of the UMPU test with a= .05?


12. Consider n patients who have received a new drug that has reduced their blood pressure by amounts X₁, X₂, …, Xₙ. It may be assumed that X₁, X₂, …, Xₙ are i.i.d. N(μ, σ²), where σ² is assumed known for simplicity. On the other hand, for a standard drug in the market it is known that the average reduction in blood pressure is μ₀. The company producing the new drug claims μ = μ₀, i.e., it does what the old drug does (and probably costs much less). Discuss what should be H₀ and H₁ here. (This is a problem of bio-equivalence.)

13. (P-values) The error probabilities of a test do not provide a measure of the strength of evidence against H₀ in a particular data set. The P-values defined below try to capture that. Suppose H₀: θ = θ₀ and your test is to reject H₀ for large values of a test statistic W(X); say, you reject H₀ if W > w_α. Then, when X = x is observed, the P-value is defined as

P(x) = 1 − F_{θ₀}(W(x)),

where F_{θ₀} is the distribution function of W under θ₀. (a) Show that if F_{θ₀} is continuous then P(X) has a uniform distribution on (0, 1). (b) Suppose you are not given the value of W but you know P. How will you decide whether to accept or reject H₀? (c) Let X₁, X₂, …, Xₙ be i.i.d. N(μ, 1). You are testing H₀: μ = μ₀ versus H₁: μ ≠ μ₀. Define the P-value for the UMPU test. Calculate E_{μ₀}(P) and E_{μ₀}(P | P ≤ α).

14. (a) Let f(x|θ₀), f(x|θ₁), and I₀ be as in Part A of the Neyman-Pearson lemma. The constant k is chosen not from a given α but such that

E_{θ₀}(I₀) = 1 − E_{θ₁}(I₀).

Then show that I₀ is minimax, i.e., I₀ minimizes the maximum error probability: max(E_{θ₀}(I₀), 1 − E_{θ₁}(I₀)) ≤ max(E_{θ₀}(I), 1 − E_{θ₁}(I)). (b) Let X₁, X₂, …, Xₙ be i.i.d. N(μ, 1). Using (a), find the minimax test of H₀: μ = −1 versus H₁: μ = +1.

15. (a) Let X have density f(x|θ) and Θ = {θ₀, θ₁}. The null hypothesis is H₀: θ = θ₀, the alternative is H₁: θ = θ₁. Suppose the error probabilities of each randomized test φ are denoted by (α_φ, β_φ) and S = the collection of all points (α_φ, β_φ). S is called the risk set. Show that S is convex. (b) Let X be B(2, p), p = 1/2 (corresponding with H₀) or (corresponding with H₁). Plot the risk set S as a subset of the unit square. (Hint: Identify the lower boundary of S as a polygon with vertices corresponding with non-randomized most powerful tests. The upper boundary connects vertices corresponding with least powerful tests that are similar to I₀ in the N-P lemma but with reversed inequalities.)
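Problem 13(a) is easy to check by simulation. The sketch below (not from the book; n, the number of replications, and the seed are assumptions) computes one-sided P-values for W = X̄ under H₀: μ = 0 with N(0, 1) data and confirms they look uniform on (0, 1):

```python
import math
import numpy as np

rng = np.random.default_rng(5)
# under H0, W = X-bar ~ N(0, 1/n), so z = sqrt(n)*W ~ N(0, 1)
n, reps = 25, 100_000
xbar = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)
z = xbar * math.sqrt(n)
# upper-tail P-value: P(Z > z) = erfc(z / sqrt(2)) / 2
pvals = np.array([0.5 * math.erfc(v / math.sqrt(2.0)) for v in z])

mean_p = pvals.mean()                # ~0.5 for a Uniform(0,1) variable
frac_below = (pvals <= 0.05).mean()  # ~0.05: rejecting when P <= alpha
```

The fraction of P-values at or below α matching α is exactly the link between P-values and fixed-level tests asked about in part (b).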

1.7 Exercises

16. (a) Suppose δ₀ is a decision rule that has constant risk and is Bayes in the limit (as defined in Section 1.5). Show that δ₀ is minimax. (b) Consider i.i.d. observations X₁, …, Xₙ from N(μ, 1). Using a normal prior distribution for μ, show that X̄ is a minimax estimate for μ under squared error loss.

17. Let X₁, X₂, …, Xₙ be i.i.d. N(μ, σ²). Consider estimating μ. (a) Show that both X̄ and the sample median M are unbiased estimators of μ. (b) Further, show that both of them are consistent and asymptotically normal. (c) Discuss why you would prefer one over the other.

18. Let X₁, X₂, …, Xₙ be i.i.d. N(μ, σ²), let Y₁, Y₂, …, Yₘ be i.i.d. N(η, τ²), and let these two samples be independent. Find the set of minimal sufficient statistics when (a) −∞ < μ, η < ∞, σ² > 0, and τ² > 0. (b) μ = η, −∞ < μ < ∞, σ² > 0, and τ² > 0. (c) −∞ < μ, η < ∞, σ² = τ², and σ² > 0. (d) μ = η, σ² = τ², −∞ < μ < ∞, and σ² > 0.

19. Suppose Xᵢ, i = 1, 2, …, n are i.i.d. from the exponential family with density (1.2) having full rank, i.e., the parameter space contains a p-dimensional open rectangle. Then show that (Tⱼ = Σᵢ₌₁ⁿ tⱼ(Xᵢ), j = 1, …, p) together form a minimal sufficient statistic.

20. Refer to the 'factorization theorem' in Section 1.3. Show that a statistic U is sufficient if and only if for every pair θ₁, θ₂, the ratio f(x|θ₂)/f(x|θ₁) is a function of U(x).

2 Bayesian Inference and Decision Theory

This chapter is an introduction to basic concepts and implementation of Bayesian analysis. We begin with subjective probability as distinct from classical or objective probability of an uncertain event based on the long run relative frequency of its occurrence. Subjective probability, along with utility or loss function, leads to Bayesian inference and decision theory, e.g., estimation, testing, prediction, etc. Elicitation of subjective probability is relatively easy when the observations are exchangeable. We discuss exchangeability, its role in Bayesian analysis, and its importance for science as a whole. In most cases in practice, quantification of subjective belief or judgment is not easily available. It is then common to choose from among conventional priors on the basis of some relatively simple subjective judgments about the problem and the conventional probability model for the data. Such priors are called objective or noninformative. These priors have been criticized for various reasons. For example, they depend on the form of the likelihood function and usually are improper, i.e., the total probability of the parameter space is infinity. Here in Chapter 2, we discuss how they are applied; some answers to the criticisms are given in Chapter 5. In Section 2.3 of this chapter, there is a brief discussion of the many advantages of being a Bayesian.

2.1 Subjective and Frequentist Probability

Probability has various connotations. Historically, it has been connected with both personal evaluation of uncertainty, as in gambling or other decision making under uncertainty, and predictions about the proportion of occurrence of some uncertain event. Thus when a person says the probability is half that this particular coin will turn up a head, it will usually mean that in many tosses it will be a head about half the time (a version of the law of large numbers). But it can also mean that if someone bets on heads - winning a dollar if a head comes up and losing a dollar otherwise - the gamble is fair. The first interpretation is frequentist, the second subjective. Similarly, one can have both interpretations in mind when a weather forecast says there is a 60% probability of rain, but the subjective interpretation matters more: it helps you decide whether to take an umbrella. Finally, one can think of situations, e.g., the election of a particular candidate or the success of a particular student in a particular test, where only the subjective interpretation is valid. Some scientists and philosophers, notably Jeffreys and Carnap, have argued that there may be a third kind of probability that applies to scientific hypotheses. It may be called objective or conventional or non-subjective in the sense that it represents a shared belief or shared convention rather than an expression of one person's subjective uncertainty. Fortunately, the probability calculus remains the same, no matter which kind of probability one uses.

A Bayesian takes the view that all unknown quantities, namely the unknown parameter and the data before observation, have a probability distribution. For the data, the distribution, given θ, comes from a model that arises from past experience in handling similar data as well as subjective judgment. The distribution of θ arises as a quantification of the Bayesian's knowledge and belief. If her knowledge and belief are weak, she may fall back on a common objective distribution in such situations. Excellent expositions of the subjective and objective Bayes approaches are Savage (1954, 1972), Jeffreys (1961), DeGroot (1970), Box and Tiao (1973), and Berger (1985a). Important relatively recent additions to the literature are Bernardo and Smith (1994), O'Hagan (1994), Gelman et al. (1995), Carlin and Louis (1996), Leonard and Hsu (1999), Robert (2001), and Congdon (2001).

2.2 Bayesian Inference

Informally, to make inference about θ is to learn about the unknown θ from data X, i.e., based on the data, explore which values of θ are probable, what might be plausible numbers as estimates of different components of θ, and the extent of uncertainty associated with such estimates. In addition to having a model f(x|θ) and a likelihood function, the Bayesian needs a distribution for θ. This distribution is called a prior distribution, or simply a prior, because it quantifies her uncertainty about θ prior to seeing the data. The prior may represent a blending of her subjective belief and knowledge, in which case it would be a subjective prior. Alternatively, it could be a conventional prior supposed to represent small or no information; such a prior is called an objective prior. We discuss construction of objective priors in Chapter 5 (and to some extent in Section 6.7.3). An example of elicitation of a subjective prior is given in Section 5.4. Given all the above ingredients, the Bayesian calculates the conditional probability density of θ given X = x by the Bayes formula

π(θ|x) = π(θ) f(x|θ) / ∫_Θ π(θ′) f(x|θ′) dθ′,    (2.1)
where π(θ) is the prior density function and f(x|θ) is the density of X, interpreted as the conditional density of X given θ. The numerator is the joint density of θ and X, and the denominator is the marginal density of X. The symbol θ now represents both a random variable and its value. When the parameter θ is discrete, the integral in the denominator of (2.1) is replaced by a sum. The conditional density π(θ|x) of θ given X = x is called the posterior density, a quantification of our uncertainty about θ in the light of the data. The transition from π(θ) to π(θ|x) is what we have learnt from the data. A Bayesian can simply report her posterior distribution, or she could report summary descriptive measures associated with it. For example, for a real-valued parameter θ, she could report the posterior mean

E(θ|x) = ∫_{−∞}^{∞} θ π(θ|x) dθ

and the posterior variance

Var(θ|x) = E{(θ − E(θ|x))² | x} = ∫_{−∞}^{∞} (θ − E(θ|x))² π(θ|x) dθ,

or the posterior standard deviation. Finally, she could use the posterior distribution to answer more structured problems like estimation and testing. In the case of estimation of θ, one would report the above summary measures. In the case of testing, one would report the posterior odds of the relevant hypotheses.
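In conjugate cases these summaries are available in closed form; otherwise one can discretize the Bayes formula (2.1) on a grid. A minimal sketch for a normal model with a normal prior, in which every numerical value (x̄ = 95, σ² = 100, n = 10, η = 100, τ² = 100) is our illustrative choice:

```python
import math

def normal_pdf(z, mean, var):
    return math.exp(-(z - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Illustrative data: n i.i.d. N(mu, sigma2) observations summarized by xbar,
# with a N(eta, tau2) prior on mu.
n, sigma2, eta, tau2, xbar = 10, 100.0, 100.0, 100.0, 95.0

# Exact conjugate posterior: the weights are proportional to the
# precisions n/sigma2 (data) and 1/tau2 (prior).
w = tau2 / (sigma2 / n + tau2)
post_mean = (1 - w) * eta + w * xbar
post_var = (sigma2 / n) * tau2 / (sigma2 / n + tau2)

# Direct discretization of the Bayes formula: prior times likelihood of the
# sufficient statistic xbar ~ N(mu, sigma2/n) on a fine grid, then normalize.
grid = [50.0 + 100.0 * i / 20000 for i in range(20001)]
weights = [normal_pdf(mu, eta, tau2) * normal_pdf(xbar, mu, sigma2 / n)
           for mu in grid]
total = sum(weights)
grid_mean = sum(mu * wt for mu, wt in zip(grid, weights)) / total
# grid_mean agrees with post_mean to grid accuracy.
```

Replacing the two `normal_pdf` factors changes the model or the prior; the same grid recipe applies whenever θ is one-dimensional.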

Example 2.1. We illustrate these ideas with an example of inference about μ for normally distributed data (N(μ, σ²)) with mean μ and variance σ². The data consist of i.i.d. observations X₁, X₂, ..., Xₙ from this distribution. To keep the example simple we assume n = 10 and σ² is known. A mathematically convenient and reasonably flexible prior distribution for μ is a normal distribution with suitable prior mean and variance, which we denote by η and τ². To fix ideas we take η = 100. The prior variance τ² is a measure of the strength of our belief in the prior mean η = 100, in the sense that the larger the value of τ², the less sure we are about our prior guess η. Jeffreys (1961) has suggested we can calibrate τ² by comparing it with σ². For example, setting τ² = σ²/m would amount to saying that the information about η is about as strong as the information in m observations of data. Some support for this interpretation is provided in Chapter 5. By way of illustration, we take m = 1. With a little algebra (vide Problem 2), the posterior distribution can be shown to be normal with posterior mean

E(μ|x̄) = (σ²/n) / (σ²/n + τ²) · η + τ² / (σ²/n + τ²) · x̄    (2.2)

and posterior variance

Var(μ|x̄) = (σ²/n) τ² / (σ²/n + τ²),    (2.3)

i.e., in the light of the data, μ shifts from the prior guess η towards a weighted average of the prior guess and X̄, while the variability reduces from σ² to σ²/11 (here n = 10 and m = 1). If the prior information is weak, implying large τ², or there are lots of data, i.e., n is large, the posterior mean is close to the MLE X̄. We will see later that we can quantify how much we have learnt from the data by comparing π(μ) and π(μ|X̄). The posterior depends on both the prior and the data; as the data increase, the influence of the data tends to wash away the prior.

Our second example goes back in principle to Bayes, Laplace, and Karl Pearson (The Grammar of Science, 1892).

Example 2.2. Consider an urn with Np red and N(1 − p) black balls, where p is unknown but N is a known large number. Balls are drawn at random one by one and with replacement; selection is stopped after n draws. For i = 1, 2, ..., n, let

Xᵢ = 1 if the i-th ball drawn is red, and Xᵢ = 0 otherwise.

Then the Xᵢ's are i.i.d. B(1, p), i.e., Bernoulli with probability of success p. Let p have a prior distribution π(p). We will consider a family of priors for p that simplifies the calculation of the posterior, and then consider some commonly used priors from this family. Let

π(p) = Γ(α + β) / (Γ(α)Γ(β)) · p^{α−1} (1 − p)^{β−1},

0 ≤ p ≤ 1; α > 0, β > 0.    (2.4)

This is called a Beta distribution. (Note that for convenience we take p to assume all values between 0 and 1, rather than only 0, 1/N, 2/N, etc.) The prior mean and variance are α/(α + β) and αβ/{(α + β)²(α + β + 1)}, respectively. By the Bayes formula, the posterior density can be written as

π(p|X = x) = C(x) p^{α+r−1} (1 − p)^{β+(n−r)−1},    (2.5)

where r = Σᵢ₌₁ⁿ xᵢ = number of red balls, and (C(x))⁻¹ is the denominator in the Bayes formula. A comparison with (2.4) shows that the posterior is also a Beta density, with α + r in place of α and β + (n − r) in place of β, and

C(x) = Γ(α + β + n) / {Γ(α + r) Γ(β + n − r)}.

The posterior mean and variance are

E(p|x) = (α + r)/(α + β + n),

Var(p|x) = (α + r)(β + n − r) / {(α + β + n)²(α + β + n + 1)}.    (2.6)
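The update in (2.5)-(2.6) is mechanical enough to write down directly; a sketch in which the Beta(2, 2) prior and the counts n = 100, r = 40 are our illustrative choices, not the text's:

```python
def beta_posterior(alpha, beta, n, r):
    """Posterior mean and variance of p under a Beta(alpha, beta) prior
    after observing r reds in n draws, per equations (2.5)-(2.6)."""
    a_post, b_post = alpha + r, beta + (n - r)
    mean = a_post / (a_post + b_post)
    var = (a_post * b_post) / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
    return mean, var

m, v = beta_posterior(2, 2, 100, 40)
# The posterior mean 42/104 lies between the prior mean 1/2 and the
# MLE r/n = 0.4; the posterior variance shrinks as n grows.
```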


As indicated earlier, a Bayesian analyst may just report the posterior (2.5), and the posterior mean and variance, which provide an idea of the center and dispersion of the posterior distribution. It will not escape one's attention that if n is large, the posterior mean is approximately equal to the MLE p̂ = r/n, and the posterior variance is quite small, so the posterior is concentrated around p̂ for large n. We can interpret this as an illustration of a fact mentioned before: when we have lots of data, the data tend to wash away the influence of the prior. The posterior mean can be rewritten as a weighted average of the prior mean and the MLE:

E(p|x) = (α + β)/(α + β + n) · α/(α + β) + n/(α + β + n) · r/n.

Once again, the importance of both the prior and the data comes out, their relative importance being measured by (α + β) and n. Suppose we want to predict the probability of getting a red ball in a new, (n + 1)-st draw given the above data. This has been called a fundamental problem of science. It would be natural to use E(p|x), the same estimate as above. We list below a number of commonly used priors and the corresponding values of E(p|X₁, X₂, ..., Xₙ).

The uniform prior corresponds with α = β = 1, with posterior mean equal to (Σᵢⁿ Xᵢ + 1)/(n + 2). This was a favorite of Laplace and Bayes but is not so popular anymore. If α = β = 1/2, we have the Jeffreys prior, with posterior mean (Σᵢⁿ Xᵢ + 1/2)/(n + 1). This prior is very popular in the case of a one-dimensional θ, as here. It is also a reference prior due to Bernardo (1979); reference priors are very popular. If we take a Beta density with α = 0, β = 0, it integrates to infinity. Such a prior is called improper. If we still use the Bayes formula to produce a posterior density, the posterior is proper unless r = 0 or n, and the posterior mean is exactly equal to the MLE. Objective priors are usually improper. To be usable they must have proper posteriors. It is argued in Chapter 5 that improper priors are best understood through the posteriors they produce. One might examine whether the posterior seems reasonable.

Suppose we think of the problem as a representation of the production of defective and non-defective items in a factory producing switches; we would take red to mean a defective and black to mean a good switch. In this context, there would be some prior information available from the engineers. They may be able to pinpoint the likely value of p, which may be set equal to the prior mean α/(α + β). If one also has some knowledge of prior variability, one would have two equations from which to determine α and β.
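The posterior means listed above are easy to compare with the MLE at the extreme outcome r = 0; a quick sketch (the sample size n = 10 is our choice):

```python
def posterior_mean(alpha, beta, n, r):
    # Posterior mean (alpha + r) / (alpha + beta + n) under a Beta(alpha, beta) prior.
    return (alpha + r) / (alpha + beta + n)

n, r = 10, 0
mle = r / n                                     # 0, often an absurd estimate
uniform_mean = posterior_mean(1.0, 1.0, n, r)   # (r + 1) / (n + 2)
jeffreys_mean = posterior_mean(0.5, 0.5, n, r)  # (r + 1/2) / (n + 1)
# Both objective-Bayes estimates pull the estimate away from 0 towards 1/2,
# and the pull disappears as n grows.
```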
In this particular context, the Jeffreys prior, with a lot of mass at the two end points, might be adequate if the process maintains a high level of quality (small p) except when it is out of control and has high values of p. The peak of the prior near p = 1 could reflect frequent occurrence of lack of control, or a pessimistic prior belief to cope with disasters. It is worth noting that the uniform, Jeffreys, and reference priors are examples of objective priors, and that all of them produce a posterior mean that is very close to the MLE even for small n. Also, all of them make better sense than the MLE in the extreme case when p̂ = 0. In most contexts the estimate p̂ = 0 is absurd; the objective Bayes estimates move it a little towards p = 1/2, which corresponds with total ignorance in some sense. Such a movement is called shrinkage. Agresti and Caffo (2000) and Brown et al. (2003) have shown that such estimates lead to confidence intervals with closer agreement between nominal and true coverage probability than the usual confidence intervals based on the normal approximation to p̂ or on inversion of tests. In other words, the Bayesian approach seems to lead to a more reasonable point estimate as well as a more reliable confidence interval than the common classical answers based on the MLE.

Table 2.1. An Epidemiological Study

                      Crabmeat                      No Crabmeat
            Potato Salad  No Potato Salad  Potato Salad  No Potato Salad
Ill              120             4               22              0
Not Ill           80            31               24             23

Example 2.3. This example illustrates the advantages of a Bayesian interpretation of the probability of making a wrong inference for given data, as opposed to classical error probabilities over repetitions. In this epidemiological study repetitions don't make sense. The data in Table 2.1 on food poisoning at an outing are taken from Bishop et al. (1975), who provide the original source of the study. Altogether 320 people attended the outing; 304 responded to questionnaires. There was other food also, but only two items, potato salad and crabmeat, attracted suspicion. We focus on the main suspect, namely, the potato salad. A partial Bayesian analysis of this example will be presented later in Chapter 4.

Example 2.4. Let X₁, X₂, ..., Xₙ be i.i.d. N(μ, σ²) and assume for simplicity that σ² is known. As in Chapter 1, μ may be the expected reduction of blood pressure due to a new drug. You want to test H₀: μ ≤ μ₀ versus H₁: μ > μ₀, where μ₀ corresponds with a standard drug already in the market. Let π(μ) be the prior. First calculate the posterior density π(μ|X). Then calculate

∫_{−∞}^{μ₀} π(μ|X) dμ = P{H₀|X}

and

∫_{μ₀}^{∞} π(μ|X) dμ = 1 − P{H₀|X} = P{H₁|X}.

One may simply report these numbers, or choose one of the two hypotheses if one of the two probabilities is substantially bigger. We provide some calculations when the prior for μ is N(η, τ²). We recall from Example 2.1 that the posterior for μ is normal with mean and variance given by equations (2.2) and (2.3). It follows that π(μ ≤ μ₀|X) can be evaluated from the standard normal distribution function.

... α > 0 and β > 0. Show that this is possible iff the prior on p is Beta(α, β).
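The two posterior probabilities in Example 2.4 reduce to one evaluation of the standard normal distribution function once the posterior mean and variance of (2.2) and (2.3) are in hand. A sketch in which every numerical input is our illustrative choice:

```python
import math

def posterior_prob_H0(xbar, n, sigma2, eta, tau2, mu0):
    """P(H0 | X) = P(mu <= mu0 | X) when the posterior for mu is normal
    with the mean and variance of equations (2.2) and (2.3)."""
    w = tau2 / (sigma2 / n + tau2)
    post_mean = (1 - w) * eta + w * xbar
    post_var = (sigma2 / n) * tau2 / (sigma2 / n + tau2)
    z = (mu0 - post_mean) / math.sqrt(post_var)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))   # standard normal cdf

# mu0 = 10 (standard drug) and data favoring the new drug (values ours):
p0 = posterior_prob_H0(xbar=12.0, n=25, sigma2=16.0, eta=10.0, tau2=16.0, mu0=10.0)
# p0 = P(H0 | X) and 1 - p0 = P(H1 | X); one may report both, or pick the
# hypothesis whose posterior probability is substantially larger.
```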

19. Suppose (N₁, ..., N_k) have the multinomial distribution with density

f(n₁, ..., n_k | p) = (n! / (n₁! ⋯ n_k!)) p₁^{n₁} ⋯ p_k^{n_k},  Σⱼ nⱼ = n, Σⱼ pⱼ = 1.

Let p have the Dirichlet prior with density

π(p) = (Γ(α₁ + ⋯ + α_k) / (Γ(α₁) ⋯ Γ(α_k))) p₁^{α₁−1} ⋯ p_k^{α_k−1}.

(a) Find the posterior distribution of p.
(b) Find the posterior mean vector and the dispersion matrix of p.
(c) Construct a 100(1 − α)% HPD credible interval for p₁ and also for p₁ + p₂.

20. Let p denote the probability of success with a particular drug for some disease. Consider two different experiments to estimate p. In the first experiment, n randomly chosen patients are treated with this drug; let X denote the number of successes. In the other experiment, patients are treated with this drug one after the other until r successes are observed; in this experiment, let Y denote the total number of patients treated with this drug.
(a) Construct 100(1 − α)% HPD credible intervals for p under the U(0, 1) and Jeffreys priors when X = x is observed.
(b) Construct 100(1 − α)% HPD credible intervals for p under the U(0, 1) and Jeffreys priors when Y = y is observed.
(c) Suppose n = 16, x = 6, r = 6, and y = 16. Now compare (a) and (b) and comment with reference to the LP.
21. Let X₁, ..., Xₙ be i.i.d. N(μ, σ²), σ² known. Suppose we want to test H₀: μ = μ₀ versus H₁: μ ≠ μ₀. Let π₀ = P(H₀) = 1/2 and, under H₁, let μ ~ N(μ₀, τ²). Show that, unlike in the case of a one-sided alternative, the P-value and the posterior probability of H₀ can be drastically different here.
22. Let X₁, ..., Xₙ be i.i.d. N(μ, σ²), where both μ and σ² are unknown. Take the prior π(μ, σ²) ∝ σ⁻². Consider testing H₀: μ ≤ μ₀ versus H₁: μ > μ₀. Compute the P-value. Compare it with the posterior probability of H₀. Compute the Bayes factor BF₀₁.
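Exercise 21 can be previewed numerically. Under H₁, X̄ has marginal distribution N(μ₀, σ²/n + τ²), so the Bayes factor BF₀₁ is a ratio of two normal densities, and with π₀ = 1/2 the posterior probability of H₀ is BF₀₁/(1 + BF₀₁). A sketch (the numerical inputs are our illustrative choices):

```python
import math

def normal_pdf(z, mean, var):
    return math.exp(-(z - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def std_normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def pvalue_vs_posterior(xbar, n, sigma2, mu0, tau2):
    """Two-sided P-value versus P(H0 | x) for H0: mu = mu0, with
    pi0 = 1/2 and mu ~ N(mu0, tau2) under H1."""
    z = (xbar - mu0) / math.sqrt(sigma2 / n)
    pval = 2 * (1 - std_normal_cdf(abs(z)))
    bf01 = (normal_pdf(xbar, mu0, sigma2 / n)
            / normal_pdf(xbar, mu0, sigma2 / n + tau2))  # Bayes factor BF01
    post_H0 = bf01 / (1 + bf01)
    return pval, post_H0

# A "significant" z of about 1.96 with a very large sample:
pval, post = pvalue_vs_posterior(xbar=0.0196, n=10000, sigma2=1.0, mu0=0.0, tau2=1.0)
# pval is about 0.05 while P(H0 | x) stays large -- the drastic
# discrepancy the exercise asks you to demonstrate.
```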

3 Utility, Prior, and Bayesian Robustness

We begin this chapter with a discussion of rationality axioms for preference and how one may deduce the existence of a utility and prior. Later we explore how robust or sensitive is Bayesian analysis to the choice of prior, utility, and model. In the process, we introduce and examine various quantitative evaluations of robustness.

3.1 Utility, Prior, and Rational Preference

We have introduced in Chapter 2 problems of estimation and testing as Bayesian decision problems. We recall the components of a general Bayesian decision problem. Let X be the sample space, Θ the parameter space, f(x|θ) the density of X, and π(θ) the prior probability density. Moreover, there is a space A of actions "a" and a loss function L(θ, a). The decision maker (DM) chooses "a" to minimize the posterior risk

ψ(a|x) = ∫_Θ L(θ, a) π(θ|x) dθ,    (3.1)

where π(θ|x) is the posterior density of θ given x. Note that given the loss function and the prior, there is a natural preference ordering a₁ ≼ a₂ (i.e., a₂ is at least as good as a₁) iff ψ(a₂|x) ≤ ψ(a₁|x). There is a long tradition of foundational study, dating back to Ramsey (1926), in which one starts with such a preference relation on A × A satisfying certain rationality axioms (i.e., axioms modeling rational behavior) like transitivity. It can then be shown that such a relation can only be induced as above via a loss function and a prior, i.e., there exist L and π such that

a₁ ≼ a₂ iff ∫_Θ L(θ, a₂) π(θ) dθ ≤ ∫_Θ L(θ, a₁) π(θ) dθ.    (3.2)

In other words, from an objectively verifiable rational preference relation, one can recover the subjective loss function and prior. If there is no sample data,


then π would qualify as a subjective prior for the DM. If we have data x, a likelihood function f(x|θ) is given, and we are examining a preference relation given x, then also one can deduce the existence of L and π such that

a₁ ≼ a₂ iff ∫_Θ L(θ, a₂) π(θ|x) dθ ≤ ∫_Θ L(θ, a₁) π(θ|x) dθ    (3.3)

under appropriate axioms. In Section 3.2, we explore the elicitation or construction of a loss function given certain rational preference relations. In the next couple of sections, we discuss a result showing that we must have a (subjective) prior if our preference among actions satisfies certain axioms of rational behavior. Together, they justify (3.2) and throw some light on (3.3). In the remaining sections we examine different aspects of the sensitivity of Bayesian analysis with respect to the prior. Suppose one thinks of the prior as only an approximate quantification of prior belief. In principle, one would then have a whole family of such priors, all approximately quantifying one's prior belief. How much would the Bayesian analysis change as the prior varies over this class? This is a basic question in the study of Bayesian robustness. It turns out that there are some preference relations, weaker than those of Section 3.3, that lead to a situation like the one just mentioned, i.e., one can show the existence of a class of priors such that

a₁ ≼ a₂ iff ∫_Θ L(θ, a₂) π(θ|x) dθ ≤ ∫_Θ L(θ, a₁) π(θ|x) dθ for all π in the given class.    (3.4)

This preference relation is only a partial ordering, i.e., not all pairs a₁, a₂ can be ordered. The Bayes rule a(x) minimizing ψ(a|x) also minimizes the integrated risk of decision rules δ(x),

r(π, δ) = ∫_Θ R(θ, δ) π(θ) dθ,

where R(θ, δ) is the risk of δ under θ, namely ∫_X L(θ, δ(x)) f(x|θ) dx. Given a pair of decision rules, we can define a preference relation

δ₁ ≼ δ₂ iff r(π, δ₂) ≤ r(π, δ₁).    (3.5)

One can examine a converse of (3.5) in the same way as we did with (3.2) through (3.4). One can start with a preference relation that orders decision rules (rather than actions) and look for rationality axioms that would guarantee the existence of L and π. For (3.2), (3.3), and (3.5), a good reference is Schervish (1995) or Ferguson (1967). Classic references are Savage (1954) and DeGroot (1970); other references are given later. For (3.4), a good reference is Kadane et al. (1999). A similar but different approach to subjective probability is via coherence, due to de Finetti (1972). We take this up in Section 3.4.
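The minimization of the posterior risk (3.1) can be sketched on a discrete grid; under squared error loss the minimizer should be the posterior mean, which gives a built-in check. (The grid, posterior weights, and loss below are our illustrative choices.)

```python
def bayes_action(actions, theta_grid, posterior, loss):
    """Pick the action minimizing the posterior risk
    sum_theta loss(theta, a) * posterior(theta),
    a discrete stand-in for the integral in (3.1)."""
    def post_risk(a):
        return sum(loss(t, w_a) * w for t, w in zip(theta_grid, posterior)
                   for w_a in [a])
    return min(actions, key=post_risk)

# Discrete posterior on a small grid of theta values (illustrative numbers).
theta_grid = [0.0, 1.0, 2.0, 3.0]
posterior = [0.1, 0.4, 0.4, 0.1]           # sums to one
actions = [i / 100 for i in range(301)]    # candidate estimates of theta

best = bayes_action(actions, theta_grid, posterior, lambda t, a: (t - a) ** 2)
# Under squared error loss the Bayes action is the posterior mean, 1.5 here.
```

Swapping in the loss |t − a| would instead pick out a posterior median, the standard result for absolute error loss.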


3.2 Utility and Loss

It is tempting to think of the loss function L(θ, a) and a utility function u(θ, a) = −L(θ, a) as conceptually mirror images of each other. French and Ríos Insua (2000) point out that there can be important differences that depend on the context. In most statistical problems the DM (decision maker) is really trying to learn from data rather than implement a decision in the real world that has monetary consequences. For convenience we refer to these as decision problems of Type 1 and Type 2. In Type 1 problems, i.e., problems without monetary consequences (see Examples 2.1-2.3), for each θ there is usually a correct decision a(θ) that depends on θ, and L(θ, a) is a measure of how far "a" is away from a(θ), or a penalty for deviation from a(θ). In a problem of estimating θ, the correct decision a(θ) is θ itself. Common losses are (a − θ)², |a − θ|, etc. In the problem of estimating T(θ), a(θ) equals T(θ) and common losses are (T(θ) − a)², |T(θ) − a|, etc. In testing a null hypothesis H₀ against H₁, the 0-1 loss assigns no penalty for a correct decision and a unit penalty for an incorrect decision. In Type 2 problems, there is a similarity with gambles where one must evaluate the consequence of a risky decision. Historically, in such contexts one talks of utility rather than loss, even though either could be used. We consider below an axiomatic approach to the existence of a utility for Type 2 problems, but we use the notations of a statistical decision problem by way of illustration. We follow Ferguson (1967) here, as well as in the next section. Let P denote the space of all consequences (θ, a). It is customary to regard them as non-numerical pay-offs. Let P* be the space of all probability distributions on P that put mass on a finite number of points. The set P* represents risky decisions, with uncertainty quantified by a known element of P*.
Suppose the DM has a preference relation on P*, namely a total order, i.e., given any pair p₁, p₂ ∈ P*, either p₁ ≼ p₂ (p₂ is preferred) or p₂ ≼ p₁ (p₁ is preferred) or both. Suppose also that the preference relation is transitive, i.e., if p₁ ≼ p₂ and p₂ ≼ p₃, then p₁ ≼ p₃. We refer the reader to French and Ríos Insua (2000) for a discussion of how compelling these conditions are. It is clear that one can embed P as a subset of P* by treating each element of P as a degenerate element of P*. Thus the preference relation is also well-defined on P. Suppose the relation satisfies axioms H1 and H2.

H1 If p₁, p₂, and q ∈ P* and 0 < λ < 1, then p₁ ≼ p₂ iff λp₁ + (1 − λ)q ≼ λp₂ + (1 − λ)q.

H2 If p₁ ≼ p₂ ≼ p₃, then there exists 0 ≤ λ ≤ 1 such that p₂ ∼ λp₁ + (1 − λ)p₃.

Under H1 and H2 there exists a utility u on P* that induces the preference relation; for p* = Σᵢ₌₁ᵐ λᵢ pᵢ with λᵢ ≥ 0 and Σᵢ₌₁ᵐ λᵢ = 1, u(p*) is defined to be the average

u(p*) = Σᵢ₌₁ᵐ λᵢ u(pᵢ).    (3.6)

The main idea of the proof, which may also be used for eliciting the utility, is to start with a pair p₁ ≺ p₂, i.e., p₁ ≼ p₂ but not p₂ ≼ p₁. (Here ∼ denotes the equivalence relation that the DM is indifferent between the two elements.) Consider all p₁ ≼ p* ≼ p₂. Then by the assumptions H1 and H2, one can find 0 ≤ λ* ≤ 1 such that the DM would be indifferent between p* and (1 − λ*)p₁ + λ* p₂. One can write λ* = u(p*) and verify that p₁ ≼ p₃ ≼ p₄ ≼ p₂ iff λ₃ = u(p₃) ≤ u(p₄) = λ₄, as well as the relation (3.6) above. For p₂ ≼ p*, by a similar argument one can find 0 ≤ μ* ≤ 1 such that

p₂ ∼ (1 − μ*)p₁ + μ* p*,

from which one gets

p* ∼ (1 − λ*)p₁ + λ* p₂, where λ* = 1/μ*.

Set λ* = u(p*) as before. In a similar way, one can find a λ* for p* ≼ p₁ and set u(p*) = λ*. In principle, λ* can be elicited for each p*. Incidentally, utility is not unique; it is unique up to a change of origin and scale. Our version is chosen so that u(p₁) = 0 and u(p₂) = 1. French and Ríos Insua (2000) point out that most axiomatic approaches to the existence of a utility first exhibit a utility on P* and then restrict it to P, whereas intuitively one would want to define u(·) first on P and then extend it to P*. They discuss how this can be done.

3.3 Rationality Axioms Leading to the Bayesian Approach¹

Consider a decision problem with all the ingredients discussed in Section 3.1 except the prior. If the sample space and the action space are finite, then the number of decision functions (i.e., functions from X to A) is finite. In this case, the decision maker (DM) may be able to order any pair of given decision functions according to her rational preference of one to the other, taking into account the consequences of actions and all inherent uncertainties. Consider a randomized decision rule defined by

δ* = Σᵢ₌₁ᵏ pᵢ δᵢ,

where δ₁, δ₂, ..., δ_k constitute a listing of all the non-randomized decision functions and (p₁, p₂, ..., p_k) is a probability vector, i.e., pᵢ ≥ 0 and Σᵢ₌₁ᵏ pᵢ = 1.

¹ Section 3.3 may be omitted at first reading.


The representation means that for each x, the probability distribution δ*(x) on the action space is the same as choosing the action δᵢ(x) with probability pᵢ. Suppose the DM can order any pair of randomized decision functions also in a rational way, and that this reduces to her earlier ordering if the randomized decision functions being compared are in fact non-randomized, with one pᵢ equal to 1 and the other pᵢ's equal to zero. Under certain axioms that we explore below, there exists a prior π(θ) such that δ₁* ≼ δ₂* (i.e., the DM finds δ₂* at least as good as δ₁*) if and only if

Σ_{θ,a} L(θ, a) P_θ(a|δ₂*) π(θ) ≤ Σ_{θ,a} L(θ, a) P_θ(a|δ₁*) π(θ),

where P_θ(a|δ*) is the probability of choosing the action "a" when θ is the value of the parameter and δ* is used, i.e., using the representation δ* = Σᵢ pᵢ δᵢ,

P_θ(a|δ*) = Σᵢ pᵢ Σ_x I_i(x, a) f(x|θ),

and I_i is the indicator function

I_i(x, a) = 1 if δᵢ(x) = a, and I_i(x, a) = 0 otherwise.

We need to work a little to move from here to the starting point of Ferguson (1967). As far as the preference is concerned, it is only the risk function of δ that matters. Also, δ appears in the risk function only through P_θ(a|δ), which, for each δ, is a probability distribution on the action space. Somewhat trivially, for each θ₀ ∈ Θ, one can also think of it as a probability distribution q on the space P of all (θ, a), θ ∈ Θ, a ∈ A, such that

q(θ, a) = P_{θ₀}(a|δ) if θ = θ₀, and q(θ, a) = 0 if θ ≠ θ₀.

As in Section 3.2, let P* be the set of probability distributions putting probability on a finite number of points in P. The DM can think of the choice of a δ as a somewhat abstract gamble with pay-off (P_θ(a₁|δ), P_θ(a₂|δ), ···) if θ is true; this pay-off sits on (θ, a₁), (θ, a₂), .... Let 𝒢 be the set of all gambles of this form [p₁, ..., p_m], where pᵢ is a probability distribution on P that is the pay-off corresponding to the i-th point θᵢ in Θ = {θ₁, θ₂, ..., θ_m}. Further, let 𝒢* be the set of all probability distributions putting mass on a finite number of points in 𝒢. The DM can embed her δ in 𝒢, and suppose she can extend her preference relation to 𝒢 and 𝒢*. If axioms H1 and H2 of Section 3.2 hold, then there exists a utility function u_𝒢 on 𝒢* that induces the preference relation ≼_𝒢 on 𝒢*. We assume the preference relations ≼ on P* and ≼_𝒢 on 𝒢* are connected as follows, vide Ferguson (1967).

A1 If pᵢ ≼ pᵢ′, i = 1, ..., m, then [p₁, ..., p_m] ≼_𝒢 [p₁′, ..., p_m′].


A2 If p ≺ ...

The idea is that if q reflects her uncertainty about θ, then this combination of bets is fair and so acceptable to her. However, any rational choice of q should avoid sure loss as defined above. Such a choice is said to be coherent if no finite combination of acceptable bets can lead to sure loss. The basic result of Freedman and Purves (1969) and Heath and Sudderth (1978) is that, in order to be coherent, the DM must act like a Bayesian with a (finitely additive) prior, and q must be the resulting posterior. A similar result is proved by Berti et al. (1991).

3.5 Bayesian Analysis with Subjective Prior

We have already discussed the basics of subjective prior Bayesian inference in Chapter 2. In the following, we shall concentrate on some issues related to the robustness of Bayesian inference. The notations used will be mostly as given in Chapter 2, but some of them will be recalled and a few additional notations introduced here as needed. Let X be the sample space and Θ be the parameter space. As before, suppose X has (model) density f(x|θ) and θ has (prior) probability density π(θ). Then the joint density of (X, θ), for x ∈ X and θ ∈ Θ, is

h(x, θ) = f(x|θ) π(θ).

The marginal density of X corresponding with this joint density is

m_π(x) = m(x|π) = ∫_Θ f(x|θ) dπ(θ).

Note that this can be expressed as

m_π(x) = ∫_Θ f(x|θ) π(θ) dθ if θ is continuous, and m_π(x) = Σ_Θ f(x|θ) π(θ) if θ is discrete.

Often we shall use m(x) for m_π(x), especially if the prior π being used is clear from the context. Recall that the posterior density of θ given x is given by

π(θ|x) = h(x, θ)/m_π(x) = f(x|θ) π(θ)/m_π(x).

The posterior mean and posterior variance with respect to the prior π will be denoted by E^π(θ|x) and V^π(θ|x), respectively. Similarly, the posterior probability of a set A ⊂ Θ given x will be denoted by P^π(A|x).


3.6 Robustness and Sensitivity

Intuitively, robustness means lack of sensitivity of the decision or inference to assumptions in the analysis that involve a certain degree of uncertainty. In an inference problem, the assumptions usually involve the choice of the model and the prior, whereas in a decision problem there is the additional assumption involving the choice of the loss or utility function. An analysis to measure this sensitivity is called a sensitivity analysis. Clearly, robustness with respect to all three of these components is desirable; that is to say, reasonable deviations from the choices of model, prior, and loss function used in the analysis should not lead to unreasonable variations in the conclusions arrived at. We shall not, however, discuss robustness with respect to the model and loss function here in any great detail. Instead, we would like to mention that there is substantial literature on this, and references can be found in sources such as Berger (1984, 1985a, 1990, 1994), Berger et al. (1996), Kadane (1984), Leamer (1978), Ríos Insua and Ruggeri (2000), and Wasserman (1992). Because justification from the viewpoint of rational behavior is usually desired for inferential procedures, we would like to cite the work of Nobel laureate Kahneman on Bayesian robustness here. In his joint work with Tversky (see Tversky et al. (1981) and Kahneman et al. (1982)), it was shown in psychological studies that seemingly inconsequential changes in the formulation of choice problems caused significant shifts of preference. These 'inconsistencies' were traced to all the components of decision making. This probably means that robustness of inference cannot be taken for granted but needs to be earned. The following example illustrates why sensitivity to the choice of prior can be an important consideration.

Example 3.1. Suppose we observe X, which follows a Poisson(θ) distribution.
Further, it is felt a priori that θ has a continuous distribution with median 2 and upper quartile 4, i.e., P^π(θ ≤ 2) = 0.5 = P^π(θ ≥ 2) and P^π(θ ≥ 4) = 0.25. If these are the only prior inputs available, the following three are candidates for such a prior: (i) π₁: θ ~ exponential(a) with a = log(2)/2; (ii) π₂: log(θ) ~ N(log(2), (log(2)/z₀.₂₅)²); and (iii) π₃: log(θ) ~ Cauchy(log(2), log(2)), where z₀.₂₅ = 0.675 is the upper quartile of the standard normal distribution. Then (i) under π₁, θ|x ~ Gamma(x + 1, a + 1) (shape x + 1, rate a + 1), so that the posterior mean is (x + 1)/(a + 1); (ii) under π₂, if we let γ = log(θ) and τ = log(2)/z₀.₂₅ = log(2)/0.675, we obtain

E" 2 (Bix) = E" 2 (exp('Y)Ix) J~= exp( -e"~) exp('Y(x + 1)) exp( -h -log(2)) 2 /(27 2 )) d"f J~= exp( -e"~) exp('Yx) exp( -("(- log(2)) 2 /(27 2 )) d"f

and (iii) under π₃, again letting γ = log(θ), we get

  E^{π₃}(θ|x) = E^{π₃}(exp(γ)|x)
             = ∫ exp(−e^γ) exp(γ(x + 1)) [1 + ((γ − log 2)/log 2)²]⁻¹ dγ
               / ∫ exp(−e^γ) exp(γx) [1 + ((γ − log 2)/log 2)²]⁻¹ dγ.

Table 3.1. Posterior Means under π₁, π₂, and π₃

   x    0     1     2     3     4     5    10     15     20     50
  π₁  .749 1.485 2.228 2.971 3.713 4.456 8.169 11.882 15.595 37.874
  π₂  .950 1.480 2.106 2.806 3.559 4.353 8.660 13.241 17.945 47.017
  π₃  .761 1.562 2.094 2.633 3.250 3.980 8.867 14.067 19.178 49.402

To see if the choice of prior matters, simply examine the posterior means under the three different priors in Table 3.1. For small or moderate x (x ≤ 10), there is robustness: the choice of prior does not seem to matter too much. For large values of x, the choice does matter: the inference that a conjugate prior yields is then quite different from what a heavier-tailed prior would yield. It is now clear that there are situations where it does matter which prior one chooses from a class of priors, each of which is considered reasonable given the available prior information. The above example indicates that there is no escape from investigating prior robustness formally. How does one then reconcile this with the single-prior Bayesian argument? It is certainly true that if one has a utility/loss function and a prior distribution, there are compelling reasons for a Bayesian analysis using these. However, this assumes the existence of these two entities, and so it is of interest to know whether one can justify the Bayesian viewpoint for statistics without this assumption. Various axiomatic systems for statistics can be developed (see Fishburn (1981)) involving a preference ordering for statistical procedures together with a set of axioms that any 'coherent' preference ordering must satisfy. Justification for the Bayesian approach then follows from the fact that any rational or coherent preference ordering corresponds to a Bayesian preference ordering (see Berger (1985a)). This means that there must be a loss function and a prior distribution such that this axiom system is compatible with the Bayesian approach corresponding to these. However, even then there are no compelling reasons to be a die-hard single-prior Bayesian. The reason is that it is impractical to arrive at a total preference ordering. If we stop short of this and are only able to come up with a partial preference ordering (see Seidenfeld et al. (1995) and Kadane et al. (1999)), the result will be a Bayesian analysis (again) using a class of prior distributions (and a class of utilities). This is the philosophical justification for a "robust Bayesian" as noted in Berger (1985a). One could, of course, argue that a second stage of prior on the class Γ of possible priors is the natural solution to arrive at a single prior, but it is not clear how to arrive at this second-stage prior.
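The posterior means in Table 3.1 can be checked by one-dimensional numerical integration on the γ = log(θ) scale. The following Python sketch is not from the book; the integration limits, grid size, and the trapezoidal rule are my own choices. It evaluates the ratio-of-integrals formulas above for π₁ and π₂ and compares the π₁ result with the closed form (x + 1)/(a + 1).

```python
import math

def poisson_post_mean(x, log_prior):
    """E(theta | x) for a Poisson(theta) likelihood, computed on the
    gamma = log(theta) scale as a ratio of integrals (Example 3.1).
    log_prior(g) is the log prior density of gamma, up to a constant."""
    lo, hi, n = -15.0, 12.0, 20000
    h = (hi - lo) / n
    num = den = 0.0
    for i in range(n + 1):
        g = lo + i * h
        w = 0.5 if i in (0, n) else 1.0           # trapezoid weights
        loglik = -math.exp(g) + g * x             # Poisson log-likelihood at theta = e^g
        num += w * math.exp(loglik + g + log_prior(g))  # extra e^g gives E(theta | x)
        den += w * math.exp(loglik + log_prior(g))
    return num / den

a = math.log(2.0) / 2.0
tau = math.log(2.0) / 0.675
# pi_1: theta ~ exponential(a); on the log scale the log density is -a*e^g + g + const
m1 = poisson_post_mean(1, lambda g: -a * math.exp(g) + g)
# pi_2: gamma = log(theta) ~ N(log 2, tau^2)
m2 = poisson_post_mean(1, lambda g: -(g - math.log(2.0)) ** 2 / (2.0 * tau * tau))
```

For x = 1, m1 agrees with the closed form 2/(a + 1) ≈ 1.485, and m2 comes out close to the value 1.480 reported for π₂ in Table 3.1.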

3.7 Classes of Priors

There is a vast literature on how to choose a class Γ of priors to model prior uncertainty appropriately. The goals (see Berger (1994)) are clearly (i) to ensure that as many 'reasonable' priors as possible are included, (ii) to try to eliminate 'unreasonable' priors, (iii) to ensure that Γ does not require prior information that is difficult to elicit, and (iv) to be able to compute measures of robustness without much difficulty. As can be seen, (i) is needed to ensure robustness and (ii) to ensure that one does not erroneously conclude lack of robustness. These are competing goals and hence can only be given weights that are appropriate in the given context. The following example from Berger (1994) is illuminating.

Example 3.2. Suppose θ is a real-valued parameter, prior beliefs about which indicate that it should have a continuous prior distribution, symmetric about 0 and having its third quartile, Q₃, between 1 and 2. Consider, then, Γ₁ = {N(0, τ²), 2.19 < τ² < 8.76} and Γ₂ = {symmetric priors with 1 < Q₃ < 2}. Even though Γ₁ can be appropriate in some cases, it will mostly be considered "rather small" because it contains only sharp-tailed distributions. On the other hand, Γ₂ will typically be "too large," containing priors some of whose shapes will be considered unreasonable. Starting with Γ₂ and imposing reasonable constraints such as unimodality on the priors can lead to sensible classes such as

  Γ₃ = {unimodal symmetric priors with 1 < Q₃ < 2} ⊃ Γ₁.

It will be seen that computing measures of robustness is not very difficult for any of these three classes.

3.7.1 Conjugate Class

The class consisting of conjugate priors (discussed in some detail in Chapter 5) is one of the easiest classes of priors to work with. If X ~ N(θ, σ²) with known σ², the conjugate priors for θ are the normal priors N(μ, τ²). So one could consider

  Γ_C = {N(μ, τ²), μ₁ ≤ μ ≤ μ₂, τ₁² ≤ τ² ≤ τ₂²}

for some specified values of μ₁, μ₂, τ₁², and τ₂². The advantage of the conjugate class is that posterior quantities can be calculated in closed form

(for natural conjugate priors). In the above case, if θ ~ N(μ, τ²), then θ|X = x ~ N(μ*(x), δ²), where μ*(x) = (τ²/(τ² + σ²))x + (σ²/(τ² + σ²))μ and δ² = τ²σ²/(τ² + σ²). Minimizing and maximizing posterior quantities then becomes an easy task (see Leamer (1978), Leamer (1982), and Polasek (1985)). The crucial drawback of the conjugate class is that it is usually "too small" to provide robustness. Further, the tails of these prior densities are similar to those of the likelihood function, and hence prior moments greatly influence posterior inferences. Thus, even when the data are in conflict with the specified prior information, the conjugate priors used can have a very pronounced effect (which may be undesirable if the data are to be trusted more). Details can be found in Berger (1984, 1985a, 1994). It must be added here that mixtures of conjugate priors, on the other hand, can provide robust inferences. In particular, the Student's t prior, which is a scale mixture of normals and has flat tails, can be a good choice in some cases. We discuss some of these details later (see Section 3.9).
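Because μ*(x) is linear in μ and monotone in τ² for fixed μ, the extremes of the posterior mean over Γ_C occur at the four corners of the (μ, τ²) rectangle. A minimal Python sketch (the numerical inputs in the usage note below are illustrative, not from the text):

```python
def post_mean(x, mu, tau2, sigma2):
    """Posterior mean mu*(x) for the N(mu, tau2) prior, N(theta, sigma2) likelihood."""
    w = tau2 / (tau2 + sigma2)
    return w * x + (1.0 - w) * mu

def post_mean_range(x, mu_lo, mu_hi, t2_lo, t2_hi, sigma2):
    """Extremes of the posterior mean over Gamma_C.  mu*(x) is monotone in mu
    and in tau2 separately, so the extremes occur at the rectangle's corners."""
    corners = [post_mean(x, m, t2, sigma2)
               for m in (mu_lo, mu_hi) for t2 in (t2_lo, t2_hi)]
    return min(corners), max(corners)
```

For example, with x = 2, σ² = 1, μ ∈ [0, 1], and τ² ∈ [1, 4], the posterior mean ranges over [1.0, 1.8].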

3.7.2 Neighborhood Class

If π₀ is a single elicited prior, then uncertainty in this elicitation can be modeled using the class

  Γ_N = {π which are in the neighborhood of π₀}.

A natural and well-studied class is the ε-contamination class,

  Γ_ε = {π : π = (1 − ε)π₀ + εq, q ∈ Q},

ε reflecting the uncertainty in π₀ and Q specifying the contaminations. Some choices for Q are: all distributions q; all unimodal distributions with mode θ₀; and all unimodal symmetric distributions with mode θ₀. The ε-contamination class with an appropriate choice of Q can provide good robustness, as we will see later.
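When Q is the class of all distributions, extreme posterior quantities over Γ_ε are attained at point-mass contaminations q = δ_t, by a ratio-linearity argument of the kind formalized in Lemma 3.5 of Section 3.8. The Python sketch below is illustrative only: the N(θ, 1) likelihood, the N(0, τ²) base prior π₀, and the search grid are my assumptions, not the text's.

```python
import math

def norm_pdf(z, m, v):
    """Density of N(m, v) at z."""
    return math.exp(-(z - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

def eps_cont_mean_range(x, eps, tau2, grid):
    """Range of the posterior mean over pi = (1 - eps) pi0 + eps delta_t,
    t in grid, with pi0 = N(0, tau2) and X | theta ~ N(theta, 1)."""
    m0 = norm_pdf(x, 0.0, tau2 + 1.0)        # marginal density of x under pi0
    mean0 = tau2 / (tau2 + 1.0) * x          # posterior mean under pi0 alone
    lo = hi = mean0
    for t in grid:                           # contaminate with a point mass at t
        lik = norm_pdf(x, t, 1.0)
        val = ((1 - eps) * m0 * mean0 + eps * lik * t) / ((1 - eps) * m0 + eps * lik)
        lo, hi = min(lo, val), max(hi, val)
    return lo, hi
```

With ε = 0 the range collapses to the single-prior answer; increasing ε widens it around that value.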

3.7.3 Density Ratio Class

Assuming the existence of densities for all the priors in the class, the density ratio class is defined as

  Γ_DR = {π : L(θ) ≤ απ(θ) ≤ U(θ) for some α > 0}
       = {π : L(θ)/U(θ') ≤ π(θ)/π(θ') ≤ U(θ)/L(θ') for all θ, θ'},        (3.7)

for specified non-negative functions L and U (see DeRobertis and Hartigan (1981)). If we take L ≡ 1 and U ≡ c, then we get

  Γ_DR = {π : c⁻¹ ≤ π(θ)/π(θ') ≤ c for all θ, θ'}.


Some other classes have also been studied. For example, the sub-σ-field class is obtained by defining the prior on a sub-σ-field of sets. See Berger (1990) for details and references. Also, because many distributions are determined by their moments, once the distributional form is specified, bounds are sometimes placed on the moments to arrive at a class of priors (see Berger (1990)).

3.8 Posterior Robustness: Measures and Techniques

Measures of sensitivity are needed to examine the robustness of inference procedures (or decisions) when a class Γ of priors is under consideration. In recent years two types of measures have been studied: global measures of sensitivity, such as the range of posterior quantities, and local measures, such as derivatives (in a sense to be made clear later) of these quantities. Attempts have also been made to derive robust priors and robust procedures using these measures.

3.8.1 Global Measures of Sensitivity

Example 3.3. Suppose X₁, X₂, ..., Xₙ are i.i.d. N(θ, σ²), with σ² known, and let Γ be the class of all N(0, τ²), τ² > 0, priors for θ. Then the variation in the posterior mean is simply (inf_{τ²>0} E(θ|x̄), sup_{τ²>0} E(θ|x̄)). Because, for fixed τ², E(θ|x̄) = (τ²/(τ² + σ²/n))x̄, this range can easily be seen to be (0, x̄) or (x̄, 0) according as x̄ ≥ 0 or x̄ < 0. If x̄ is small in magnitude, this range will be small. Thus the robustness of the procedure of using the posterior mean as the Bayes estimate of θ will depend crucially on the magnitude of the observed value of x̄.

As can be seen from the above example, a natural global measure of sensitivity of a Bayesian quantity to the choice of prior is the range of this quantity as the prior varies in the class of priors of interest. Further, as explained in Berger (1990), there are typically three categories of Bayesian quantities of interest.
(i) Linear functionals of the prior: ρ(π) = ∫_Θ h(θ) π(dθ), where h is a given function. If h is taken to be the likelihood function l, we get an important linear functional, the marginal density of the data, i.e., m(π) = ∫_Θ l(θ) π(dθ).
(ii) Ratios of linear functionals of the prior: ρ(π) = (1/m(π)) ∫_Θ h(θ)l(θ) π(dθ) for some given function h. If we take h(θ) = θ, ρ(π) is the posterior mean. For h(θ) = I_C(θ), the indicator function of the set C, we get the posterior probability of C.
(iii) Ratios of nonlinear functionals: ρ(π) = (1/m(π)) ∫_Θ h(θ, φ(π))l(θ) π(dθ) for some given h. For h(θ, φ(π)) = (θ − μ(π))², where μ(π) is the posterior mean, ρ(π) is the posterior variance.
Note that extreme values of linear functionals of the prior as it varies in a class Γ are easy to compute if the extreme points of Γ can be identified.

Example 3.4. Suppose X ~ N(θ, σ²), with σ² known, and the class of interest is

  Γ_SU = {all symmetric unimodal distributions with mode θ₀}.

Then, with φ denoting the standard normal density, m(π) = ∫_{−∞}^{∞} (1/σ)φ((x − θ)/σ)π(θ) dθ. Note that any unimodal symmetric (about θ₀) density π is a mixture of uniform densities symmetric about θ₀. Thus the extreme points of Γ_SU are the U(θ₀ − r, θ₀ + r) distributions. Therefore,

  inf_{π∈Γ_SU} m(π) = inf_{r>0} (1/(2r)) ∫_{θ₀−r}^{θ₀+r} (1/σ)φ((x − θ)/σ) dθ
                    = inf_{r>0} (1/(2r)) {Φ((θ₀ + r − x)/σ) − Φ((θ₀ − r − x)/σ)},        (3.8)

  sup_{π∈Γ_SU} m(π) = sup_{r>0} (1/(2r)) ∫_{θ₀−r}^{θ₀+r} (1/σ)φ((x − θ)/σ) dθ
                    = sup_{r>0} (1/(2r)) {Φ((θ₀ + r − x)/σ) − Φ((θ₀ − r − x)/σ)}.        (3.9)

In empirical Bayes problems (to be discussed later), for example, maximization of the above kind is needed to select a prior. This is called Type II maximum likelihood (see Good (1965)). To study ratio-linear functionals, the following results from Sivaganesan and Berger (1989) are useful.

Lemma 3.5. Suppose C_Γ is a set of probability measures on the real line given by C_Γ = {ν_t : t ∈ T}, T ⊂ R^d, and let C be the convex hull of C_Γ. Further suppose h₁ and h₂ are real-valued functions defined on R such that ∫|h₁(x)| dF(x) < ∞ for all F ∈ C, and K + h₂(x) > 0 for all x, for some constant K. Then, for any k,

  sup_{F∈C} (k + ∫h₁(x) dF(x)) / (K + ∫h₂(x) dF(x)) = sup_{t∈T} (∫(k + h₁(x)) ν_t(dx)) / (∫(K + h₂(x)) ν_t(dx)),        (3.10)

  inf_{F∈C} (k + ∫h₁(x) dF(x)) / (K + ∫h₂(x) dF(x)) = inf_{t∈T} (∫(k + h₁(x)) ν_t(dx)) / (∫(K + h₂(x)) ν_t(dx)).        (3.11)

Proof. Because any F ∈ C can be written as F(·) = ∫_T ν_t(·) μ(dt) for some probability measure μ on T, Fubini's theorem gives

  k + ∫h₁(x) dF(x) = ∫(k + h₁(x)) ∫_T ν_t(dx) μ(dt)
                   = ∫_T ( ∫(k + h₁(x)) ν_t(dx) ) μ(dt)
                   = ∫_T [ ( ∫(k + h₁(x)) ν_t(dx) / ∫(K + h₂(x)) ν_t(dx) ) ∫(K + h₂(x)) ν_t(dx) ] μ(dt)
                   ≤ ( sup_{t∈T} ∫(k + h₁(x)) ν_t(dx) / ∫(K + h₂(x)) ν_t(dx) ) ( K + ∫h₂(x) dF(x) ).

Therefore,

  sup_{F∈C} (k + ∫h₁(x) dF(x)) / (K + ∫h₂(x) dF(x)) ≤ sup_{t∈T} (∫(k + h₁(x)) ν_t(dx)) / (∫(K + h₂(x)) ν_t(dx)).

However, because C ⊃ C_Γ,

  sup_{F∈C} (k + ∫h₁(x) dF(x)) / (K + ∫h₂(x) dF(x)) ≥ sup_{t∈T} (∫(k + h₁(x)) ν_t(dx)) / (∫(K + h₂(x)) ν_t(dx)).

Hence the result for the supremum; the proof for the infimum is along the same lines. □

Theorem 3.6. Consider the class Γ_SU of all symmetric unimodal prior distributions with mode θ₀. Then

  sup_{π∈Γ_SU} E^π(g(θ)|x) = sup_{r>0} [ (1/(2r)) ∫_{θ₀−r}^{θ₀+r} g(θ)f(x|θ) dθ ] / [ (1/(2r)) ∫_{θ₀−r}^{θ₀+r} f(x|θ) dθ ],        (3.12)

  inf_{π∈Γ_SU} E^π(g(θ)|x) = inf_{r>0} [ (1/(2r)) ∫_{θ₀−r}^{θ₀+r} g(θ)f(x|θ) dθ ] / [ (1/(2r)) ∫_{θ₀−r}^{θ₀+r} f(x|θ) dθ ],        (3.13)

where f(x|θ) is the density of the data x.

Proof. Note that E^π(g(θ)|x) = ∫g(θ)f(x|θ) dπ(θ) / ∫f(x|θ) dπ(θ). Now Lemma 3.5 can be applied by recalling that any unimodal symmetric distribution is a mixture of symmetric uniform distributions. □

Example 3.7. Suppose X|θ ~ N(θ, σ²) and robustness of the posterior mean with respect to Γ_SU is of interest. Then the range of the posterior mean over this class can be easily computed using Theorem 3.6. We thus obtain

  sup_{π∈Γ_SU} E^π(θ|x) = sup_{r>0} [ (1/(2r)) ∫_{θ₀−r}^{θ₀+r} θ (1/σ)φ((x − θ)/σ) dθ ] / [ (1/(2r)) ∫_{θ₀−r}^{θ₀+r} (1/σ)φ((x − θ)/σ) dθ ],

and similarly for the infimum.
Then m^{π₀} is the density of N(0, τ² + 1). Upper bounds for C(q) (denoted by C*) calculated using (3.38) are listed in Table 3.4 for selected values of τ and x. The extremely large values of C* corresponding to τ = 1.1 and x = 3, 4 indicate that these data are not compatible with π₀. However, the same data are compatible with π₀ if τ has a larger value, say 2.0. Some kind of calibration, however, is needed to establish precisely what magnitudes of curvature can be considered extreme.

Table 3.4. Bounds on Curvature for Different Values of τ and x

   τ    x        C*
  1.1   2        909.3
  1.1   3    2.08225 × 10⁸
  1.1   4    1.06395 × 10¹⁶
  1.5   2          1.0918
  1.5   3         13.7237
  1.5   4        454.3244
  2.0   3          1.1186
  2.0   4          7.0946

Before we conclude this discussion, it should be mentioned that there is a large literature on gamma-minimax estimation, which is a frequentist approach to Bayesian robustness. The idea is to look for a minimax estimator, with the class of priors considered (for minimaxity) being the one identified for Bayesian robustness considerations. Let us take a very brief look. Recall that for any decision rule δ, its frequentist risk function is given by R(θ, δ) = E L(θ, δ(X)), where L is the loss function and the expectation is with respect to the distribution of X|θ. If π is any prior distribution on Θ, the Bayes risk of δ with respect to π is r(π, δ) = E^π R(θ, δ). The decision rule δ^π that minimizes the Bayes risk r(π, δ) is the Bayes rule with respect to π. Under the minimax principle, the optimal decision rule δ^M (minimax rule) is the one that minimizes the maximum of the frequentist risk R(θ, δ). Equivalently, δ^M minimizes the maximum of the Bayes risk r(π, δ) over the class of all priors π. Under the gamma-minimax principle, if π is constrained to lie in a class Γ, the optimal rule δ^Γ (gamma-minimax rule) minimizes sup_{π∈Γ} r(π, δ). Even though there are many attractive results on this topic, we will not discuss them. Extensive discussion can be found in Berger (1984, 1985a), and further material in Ickstadt (1992) and Vidakovic (2000).
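A toy illustration of the gamma-minimax principle, with my own simplifying assumptions (none of this is from the text): estimate θ from X ~ N(θ, 1) under squared error loss, restricting attention to linear rules δ_w(x) = wx. Then R(θ, δ_w) = w² + (1 − w)²θ², so r(π, δ_w) = w² + (1 − w)²τ² for π = N(0, τ²); over Γ = {N(0, τ²) : τ₁² ≤ τ² ≤ τ₂²} the sup is attained at τ₂², and minimizing it over w gives w = τ₂²/(1 + τ₂²), the Bayes rule under the least favorable prior in Γ.

```python
def bayes_risk_linear(w, tau2):
    """r(pi, delta_w) for delta_w(x) = w*x, squared error, X | theta ~ N(theta, 1),
    theta ~ N(0, tau2): the risk w^2 + (1-w)^2 theta^2 integrates to w^2 + (1-w)^2 tau2."""
    return w * w + (1.0 - w) ** 2 * tau2

def gamma_minimax_w(t2_lo, t2_hi, grid_n=10000):
    """Gamma-minimax w over Gamma = {N(0, tau2) : t2_lo <= tau2 <= t2_hi}.
    Since r is increasing in tau2, the sup over Gamma sits at t2_hi."""
    best_w, best_val = 0.0, float("inf")
    for i in range(grid_n + 1):
        w = i / grid_n
        val = bayes_risk_linear(w, t2_hi)
        if val < best_val:
            best_w, best_val = w, val
    return best_w
```

For τ² ∈ [1, 2] the grid search recovers w = 2/3, in agreement with the closed form.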

3.9 Inherently Robust Procedures

It is natural to look for priors and resulting Bayesian procedures that are inherently robust. Adopting this approach eliminates the need for checking robustness at the end, by building robustness into the analysis at the beginning. Further, practitioners can demand "default" Bayesian procedures with built-in robustness that do not require specific sensitivity analyses involving sophisticated tools. Accumulated evidence indicates that priors with flatter tails than those of the likelihood tend to be more robust than easier choices such as conjugate priors. The literature here includes Dawid (1973), Box and Tiao (1973), Berger (1984, 1985a), O'Hagan (1988, 1990), Angers and Berger (1991), Fan and Berger (1992), and Geweke (1999). The following example from Berger (1994) illustrates some of these ideas.

Example 3.20. Let X₁, ..., Xₙ be a random sample from a measurement error model, so that Xᵢ = θ + εᵢ, i = 1, ..., n, where the εᵢ are the measurement errors. The εᵢ's can then reasonably be assumed to be i.i.d. with a symmetric unimodal distribution having mean 0 and unknown variance σ². The location parameter θ is of inferential interest, with the prior information that it is symmetric about 0 and has quartiles of ±1, whereas σ² is a nuisance parameter with little prior information. The simple "standard" analysis would assume that Xᵢ|θ, σ² are i.i.d. N(θ, σ²) and π(θ, σ²) ∝ (1/σ²)π₁(θ), where under π₁ the prior distribution of


θ is N(0, 2.19). (This may be contrasted with the Jeffreys analysis discussed in Section 2.7.2.) This conjugate prior analysis suffers from nonrobustness, as mentioned previously. Instead, assume that Xᵢ|θ, σ² are i.i.d. t₄(θ, σ²), and likewise assume that under π₁ the prior distribution of θ is Cauchy(0, 1). This analysis achieves a certain robustness lacking in the previous approach. Any outliers in the data will be adequately handled by the Student's t model, and further, if the prior and the data are in conflict, the prior information will be mostly ignored. There are certain computational issues to be addressed here. The "standard" analysis is very easy, whereas the robust approach is computationally intensive. However, the MCMC techniques that will be discussed later in the context of hierarchical Bayesian analysis can handle these problems. O'Hagan (1990) and Angers (2000) discuss some of these issues formally using concepts that they call credence and p-credence, which compare the tail behavior of the posterior distribution with that of heavy-tailed distributions such as the Student's t and the exponential power density. Further discussion of robust priors and robust procedures is deferred to Chapters 4 and 5, where we consider default and reference priors that are improper.

3.10 Loss Robustness

Given the same decision problem, it is possible that different decision makers assess the consequences of their actions differently and hence may have different loss functions. In such a situation, it may be necessary to evaluate the sensitivity of Bayesian procedures to the choice of loss.

Example 3.21. Suppose X is Poisson(θ) and θ has an exponential prior distribution with mean 1. Suppose x = 0 is observed. Then the posterior distribution of θ is exponential with mean 1/2. Therefore, the Bayes estimate of θ under squared error loss is 1/2, the posterior mean, whereas the Bayes estimate under absolute error loss is 0.3465, the posterior median. These are clearly different, and the difference may have significant impact depending on the use to which the estimate is put. It is possible to provide a Bayesian approach to the study of loss robustness exactly as we have done for the prior distribution. In particular, if a class of loss functions is available, the range of posterior expected losses can be computed and examined, as was done in Dey et al. (1998) and Dey and Micheas (2000). There are also other approaches, such as computing non-dominated alternatives, outlined in Martin et al. (1998).
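The two estimates in Example 3.21 follow from the conjugate update: with an exponential(rate 1) prior and Poisson likelihood, the posterior is Gamma(x + 1, 2), which for x = 0 is exponential with rate 2. A small Python check (the general-x branch is a hypothetical convenience; the book only uses x = 0 here):

```python
import math

def poisson_exp_posterior(x, prior_rate=1.0):
    """Posterior for theta ~ Exp(prior_rate), X | theta ~ Poisson(theta):
    Gamma(shape = x + 1, rate = prior_rate + 1).  Returns (mean, median),
    with the median in closed form only for the exponential case x = 0."""
    shape, rate = x + 1, prior_rate + 1.0
    mean = shape / rate                                  # Bayes estimate, squared error
    median = math.log(2.0) / rate if x == 0 else None    # Bayes estimate, absolute error
    return mean, median
```

For x = 0 this gives mean 0.5 and median log(2)/2 ≈ 0.3465, as in the example.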


3.11 Model Robustness

The model for the observables is the most important component of statistical inference, and hence imprecision in the specification of the model, which can lead to inaccurate inferences, must be viewed with great concern. There has been a lot of work in classical statistics in this regard, but most of it addresses only the influence of outliers with respect to a specified target model. In principle, the Bayesian approach to model robustness need not be any different from that for prior robustness or loss robustness. However, the problem gets complicated because the mapping of the likelihood function to the posterior density is not ratio-linear, and hence different techniques need to be employed to assess the sensitivity. If only a finite set of models needs to be considered, the problem is a simple one: one simply checks the inferences obtained under the different models for the given data. It needs to be kept in mind that, even in this case, different models may be based on different parameters with different interpretations, and hence the specification of prior distributions may be a complicated problem. The following example, which illustrates some of the possibilities, is similar to Example 1 of Shyamalkumar (2000). (See also Pericchi and Perez (1994) and Berger et al. (2000).)

Example 3.22. Suppose the quantity of inferential interest is the median θ of the model. Model uncertainty is represented by considering the set of two models, M = {N(θ, 1), Cauchy(θ, 0.675)}, where 0.675 is the scale parameter of the Cauchy distribution. In other words, X is either N(θ, 1) or Cauchy(θ, 0.675). Since θ is the median of the model in either case, it is not difficult to specify its prior distribution. Suppose the prior π lies in the class Γ of N(0, τ²), 1 ≤ τ² ≤ 10. The ranges of posterior means are as shown in Table 3.5. As can be seen, model robustness also depends on the observed x, just like prior or loss robustness. In many situations, this robustness will be absent, and there is no solution other than providing further input on model refinements.

Model robustness has a long history, even though the literature is not very extensive. Box and Tiao (1962) considered this problem in a simple setup. Lavine (1991) and Fernandez et al. (2001) have used a nonparametric class of models, and Bayarri and Berger (1998b) have studied robustness in selection models. These can be considered global robustness approaches, as compared with the local robustness approach adopted by Cuevas and Sanz (1988), Sivaganesan (1993), and Dey et al. (1996), who study extrema of functional derivatives of posterior quantities. This is similar to the local robustness approach for prior distributions. Some frequentist approaches, such as Huber (1964, 1981), are also somewhat relevant.

Table 3.5. Range of Posterior Means for Different Models

                     x = 2                 x = 4                 x = 6
  Likelihood  inf E(θ|x) sup E(θ|x)  inf E(θ|x) sup E(θ|x)  inf E(θ|x) sup E(θ|x)
  Normal        1.000      1.818       2.000      3.636       3.000      5.455
  Cauchy        0.914      1.689       0.621      3.228       0.362      4.433
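The normal rows of Table 3.5 follow from the closed form E(θ|x) = (τ²/(τ² + 1))x, and the Cauchy rows can be reproduced by numerical integration over θ. The Python sketch below is mine (the θ-grid, the τ²-grid, and the trapezoidal rule are implementation choices, not the book's):

```python
import math

def cauchy_post_mean(x, tau2, scale=0.675):
    """E(theta | x) for a Cauchy(theta, scale) likelihood and N(0, tau2) prior,
    by trapezoidal integration over a wide theta grid."""
    lo, hi, n = -40.0, 40.0, 16000
    h = (hi - lo) / n
    num = den = 0.0
    for i in range(n + 1):
        th = lo + i * h
        w = 0.5 if i in (0, n) else 1.0
        f = math.exp(-th * th / (2.0 * tau2)) / (1.0 + ((x - th) / scale) ** 2)
        num += w * th * f
        den += w * f
    return num / den

def model_mean_ranges(x):
    """Posterior-mean ranges over tau2 in [1, 10] for the two models in
    Example 3.22: N(theta, 1) (closed form) and Cauchy(theta, 0.675)."""
    t2s = [1.0 + 9.0 * j / 30.0 for j in range(31)]
    normal = [t2 / (t2 + 1.0) * x for t2 in t2s]
    cauchy = [cauchy_post_mean(x, t2) for t2 in t2s]
    return (min(normal), max(normal)), (min(cauchy), max(cauchy))
```

For x = 2 the normal range is (1.000, 1.818), matching the table, and the Cauchy range comes out near the tabled (0.914, 1.689).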

3.12 Exercises

1. (St. Petersburg paradox). Suppose you are invited to play the following game. A fair coin is tossed repeatedly until it comes up heads. The reward will be 2ⁿ (in some unit of currency) if it takes n tosses until a head first appears. How much would you be willing to pay to play this game? Show that the expected monetary return is ∞, but few would be willing to pay very much to play the game.
2. Consider a lottery where it costs $1 to buy a ticket. If you win the lottery you get $1000. If the probability of winning the lottery is 0.0001, decide what you should do under each of the following utility functions, u(x), x being the monetary gain: (a) u(x) = x; (b) u(x) = log_e(.3 + x); (c) u(x) = exp(1 + x/100).
3. A mango grower owns three orchards. Orchard I yields 50% of his total produce, II provides 30%, and III provides the rest. Even though they are all of a single variety, 2% of the mangoes from I and 1% each from II and III are excessively sour tasting. (a) What is the probability that a mango randomly selected from the total produce is excessively sour? (b) What is the probability that a randomly selected mango that is found to be excessively sour came from orchard II? (c) Consider a box of 100 mangoes, all of which came from a single orchard, but we don't know which one. A mango is selected randomly from this box and is found to be sour. What is the probability that a second mango randomly selected from the remaining 99 is also sour?
4. Show that the Student's t density can be expressed as a scale mixture of normal densities.
5. Refer to Example 3.1. Suppose that the prior for θ has median 1 and upper quartile 2. Consider the priors, (i) θ ~ exponential, (ii) log(θ) ~ normal, and (iii) log(θ) ~ Cauchy. (a) Determine the hyperparameters of the three priors. (b) Plot the posterior mean E^π(θ|x) for the three priors when x lies in the range 0 ≤ x ≤ 50.
6. Let X₁, X₂, ..., Xₙ be a random sample from Poisson(θ), where estimation of θ is of interest.
   (a) Derive the range of posterior means when the prior lies in the class of Gamma distributions with prior mean θ₀.

   (b) Compute the range of posterior means if the prior density is known only to be a continuous non-increasing function.
7. Let X₁, X₂, ..., Xₙ be i.i.d. N(θ, σ²), σ² known. Consider the following class of conjugate priors for θ: Γ = {N(0, τ²), τ² > 0}.
   (a) Find the range of posterior means.
   (b) Find the range of posterior variances.
   (c) Suppose x̄ > 0. Plot the range of 95% HPD credible intervals.
   (d) Suppose σ² = 10 and n = 10. Further, suppose that an x̄ of large magnitude is observed. If, now, a N(0, 1) prior is assumed (in which case the prior mean is far from the sample mean but the prior variance and sample variance are equal), show that the posterior mean and also the credible interval will show substantial shrinkage. Comment on this phenomenon of the prior not allowing the data to have more influence when the data and prior are in conflict. What would happen if instead a Cauchy(0, 1) prior were used?
8. Let X|θ ~ N(θ, 1) and let Γ_SU denote the class of unimodal priors that are symmetric about 0.
   (a) Plot {inf_{π∈Γ_SU} m(x), sup_{π∈Γ_SU} m(x)} for 0 ≤ x ≤ 10.
   (b) Plot {inf_{π∈Γ_SU} E^π(θ|x), sup_{π∈Γ_SU} E^π(θ|x)} for 0 ≤ x ≤ 10.
9. Let X₁, X₂, ..., Xₙ be i.i.d. with density

      f(x|θ) = exp(−(x − θ)),  x > θ,

   where −∞ < θ < ∞. Consider the class of unimodal prior distributions on θ that are symmetric about 0. Compute the range of posterior means and that of the posterior probability that θ > 0, for n = 5 and X = (0.1828, 0.0288, 0.2355, 1.6038, 0.4584).
10. Suppose X₁, X₂, ..., Xₙ are i.i.d. N(θ, σ²), where θ needs to be estimated, but σ², which is also unknown, is a nuisance parameter. Let x̄ denote the sample mean and s²ₙ₋₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)²/(n − 1) the sample variance.
    (a) Show that under the prior π(θ, σ²) ∝ (σ²)⁻¹, the posterior distribution of θ is given by

        √n(θ − x̄)/sₙ₋₁ ~ tₙ₋₁.

(b) Using (a), justify the standard confidence interval

        x̄ ± tₙ₋₁(α/2) sₙ₋₁/√n

    as an HPD Bayesian credible interval of coefficient 100(1 − α)%, where tₙ₋₁(α/2) is the tₙ₋₁ quantile of order 1 − α/2.
    (c) If instead θ|σ² ~ N(μ, cσ²) and π(σ²) ∝ (σ²)⁻¹, for specified μ and c, what is the HPD Bayesian credible interval of coefficient 100(1 − α)%?
    (d) In (c) above, suppose c = 5 and μ is not specified but is known to lie in the interval 0 ≤ μ ≤ 3, n = 9, x̄ = 0, and sₙ₋₁ = 1. Investigate the


    robustness of the credible interval given in (b) by computing the range of its posterior probability.
    (e) Consider independent priors: θ ~ N(μ, τ²), π(σ²) ∝ (σ²)⁻¹, where 0 ≤ μ ≤ 3 and 5 ≤ τ² ≤ 10. Conduct a robustness study as in (d) now.
11. Let

        p_λ(x) = (x^λ − 1)/λ  if λ > 0;   p₀(x) = lim_{λ→0} (x^λ − 1)/λ = log(x)  if λ = 0,

    and consider the following family of probability densities introduced by Albert et al. (1991):

        π(θ|μ, φ, c, λ) = k(c, λ) √φ exp{ −(c/2) p_λ( 1 + φ(θ − μ)²/c ) },        (3.39)

    where k(c, λ) is the normalizing constant, −∞ < μ < ∞, φ > 0, c > 1, λ ≥ 0.
    (a) Show that π is unimodal symmetric about μ.
    (b) Show that the family of densities defined by (3.39) contains many location-scale families.
    (c) Show that normal densities are included in this family.
    (d) Show that Student's t is a special case of this density when λ = 0.
    (e) Show that (3.39) behaves like the double exponential when λ = 1/2.
    (f) For 0 ≤ λ ≤ 1, show that the density in (3.39) is a scale mixture of normal densities.
12. Suppose X|θ ~ N(θ, σ²), with θ being the parameter of interest. Explain how the family of prior densities given by (3.39) can be used to study the robustness of the posterior inferences in this case. In particular, explain which values of λ are expected to provide robustness over a large range of values of X = x.
13. Refer to the definition of belief function, Equation (3.20). Show that Bel is a probability measure iff Bel(·) = Pl(·).
14. Show that any probability measure is also a belief function.
15. Refer to Example 3.14. Prove (3.25) and (3.26).
16. Refer to Example 3.14 again. Let X|θ ~ N(θ, 1) and let π₀ denote N(0, τ²) with τ² = 2. Take ε = 0.2 and suppose x = 3.5 is observed.
    (a) Construct the 95% HPD credible interval for θ under π₀.
    (b) Compute (3.25) and (3.26) for the interval in (a) now, and check whether robustness is present when the ε-contamination class of priors is considered.
17. (Dey and Birmiwal (1994)) Let X = (X₁, ..., X_k) have a multinomial distribution with probability mass function P(X₁ = x₁, ..., X_k = x_k | p) = (n!/∏ᵢ₌₁ᵏ xᵢ!) ∏ᵢ₌₁ᵏ pᵢ^{xᵢ}, with n = Σᵢ₌₁ᵏ xᵢ, 0 < pᵢ < 1, and Σᵢ₌₁ᵏ pᵢ = 1. Consider the ε-contamination class of priors with Q = {D(sα), s > 1}. Derive the extreme values of the curvature C(q).

4 Large Sample Methods

In order to make Bayesian inference about the parameter θ, given a model f(x|θ), one needs to choose an appropriate prior distribution for θ. Given the data x, the prior distribution is used to find the posterior distribution and various posterior summary measures, depending on the problem. Thus exact or approximate computation of the posterior is a major problem for a Bayesian. Under certain regularity conditions, if the sample size is large, the posterior can be approximated by a normal distribution with the maximum likelihood estimate (MLE) as the mean and the inverse of the observed Fisher information matrix as the dispersion matrix. If more accuracy is needed, one may use the Kass-Kadane-Tierney or Edgeworth type refinements. Alternatively, one may sample from the approximate posterior and resort to importance sampling. Posterior normality has an important philosophical implication, which we discuss below. How the posterior inference is influenced by a particular prior depends on the relative magnitude of the amount of information in the data, which for i.i.d. observations may be measured by the sample size n, or nI(θ), or the observed Fisher information Iₙ (defined in Section 4.1.2), and the amount of information in the prior, which is discussed in Chapter 5. As the sample size grows, the influence of the prior distribution diminishes. Thus for large samples, a precise mathematical specification of the prior distribution is not necessary. In most cases with a low-dimensional parameter space, the situation is like this. A Bayesian would refer to it as the washing away of the prior by the data. There are several mathematical results embodying this phenomenon, of which posterior normality is the most well known. This chapter deals with posterior normality and some of its refinements. We begin with a discussion of the limiting behavior of the posterior distribution in Section 4.1. A sketch of the proof of asymptotic normality of the posterior is given in this section.
A more accurate posterior approximation based on Laplace's asymptotic method and its refinements by Tierney, Kass, and Kadane are the subjects of Section 4.3. A refinement of posterior normality is discussed


in Section 4.2, where an asymptotic expansion of the posterior distribution with a leading normal term is outlined. Throughout this chapter, we consider only the case of a finite-dimensional parameter. Also, θ is assumed to be a "continuous" parameter with a prior density function. We apply these results to the determination of sample size in Section 4.2.1.

4.1 Limit of Posterior Distribution In this section, we discuss the limiting behavior of posterior distributions as the sample size n → ∞. The limiting results can be used as approximations if n is sufficiently large. They may also be used as a form of frequentist validation of Bayesian analysis. We begin with a discussion of posterior consistency in Section 4.1.1. Asymptotic normality of the posterior distribution is the subject of Section 4.1.2.

4.1.1 Consistency of Posterior Distribution Suppose a data sequence is generated as i.i.d. random variables with density f(x|θ_0). Would a Bayesian analyzing these data with his prior π(θ) be able to learn about θ_0? Our prior knowledge about θ is updated into the posterior as we learn more from the data. Ideally, the updated knowledge about θ, represented by its posterior distribution, should become more and more concentrated near θ_0 as the sample size increases. This asymptotic property is known as consistency of the posterior distribution at θ_0. Let X_1, …, X_n be the observations at the nth stage, abbreviated as X_n, having a density f(x_n|θ), θ ∈ Θ ⊂ R^p. Let π(θ) be a prior density, π(θ|X_n) the posterior density as defined in (2.1), and Π(·|X_n) the corresponding posterior distribution.

Definition. The sequence of posterior distributions Π(·|X_n) is said to be consistent at some θ_0 ∈ Θ if, for every neighborhood U of θ_0, Π(U|X_n) → 1 as n → ∞ with probability one with respect to the distribution under θ_0. The idea goes back to Laplace, who had shown the following. If X_1, …, X_n are i.i.d. Bernoulli with P_θ(X_1 = 1) = θ and π(θ) is a prior density that is continuous and positive on (0, 1), then the posterior is consistent at all θ_0 in (0, 1). von Mises (1957) calls this the second fundamental law of large numbers, the first being Bernoulli's weak law of large numbers. The need for posterior consistency has been stressed by Freedman (1963, 1965) and Diaconis and Freedman (1986). From the definition of convergence in distribution, it follows that consistency of Π(·|X_n) at θ_0 is equivalent to the fact that Π(·|X_n) converges to the distribution degenerate at θ_0 with probability one under θ_0. Consistency of the posterior distribution holds in the general case with a finite-dimensional parameter under mild conditions. For general results see, for example, Ghosh and Ramamoorthi (2003). For a real parameter θ, consistency


at θ_0 can be proved by showing E(θ|X_n) → θ_0 and Var(θ|X_n) → 0 with probability one under θ_0. This follows from an application of Chebyshev's inequality.

Example 4.1. Let X_1, X_2, …, X_n be i.i.d. Bernoulli observations with P_θ(X_1 = 1) = θ. Consider a Beta(α, β) prior density for θ. The posterior density of θ given X_1, X_2, …, X_n is then a Beta(Σ_{i=1}^n X_i + α, n − Σ_{i=1}^n X_i + β) density with

E(θ | X_1, …, X_n) = (Σ_{i=1}^n X_i + α) / (n + α + β),

Var(θ | X_1, …, X_n) = E(θ | X_1, …, X_n) [1 − E(θ | X_1, …, X_n)] / (n + α + β + 1).

As (1/n) Σ_{i=1}^n X_i → θ_0 with P_{θ_0}-probability 1 by the law of large numbers, it follows that E(θ|X_1, …, X_n) → θ_0 and Var(θ|X_1, …, X_n) → 0 with probability one under θ_0. Therefore, in view of the result mentioned in the previous paragraph, the posterior distribution of θ is consistent.
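The concentration in Example 4.1 can be illustrated with a small simulation. The sketch below is not from the book; the true value θ_0 = 0.3 and the Beta(2,2) prior are assumed for illustration. It computes the closed-form posterior mean and variance for two sample sizes.

```python
import random

random.seed(42)
theta0, alpha, beta = 0.3, 2.0, 2.0   # assumed true value and assumed Beta(2,2) prior

def posterior_summary(n):
    """Mean and variance of the Beta(alpha + s, beta + n - s) posterior."""
    s = sum(1 for _ in range(n) if random.random() < theta0)  # Bernoulli(theta0) data
    a, b = alpha + s, beta + n - s
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

m_small, v_small = posterior_summary(100)
m_large, v_large = posterior_summary(10000)
print(m_small, v_small, m_large, v_large)
```

With n = 10000 the posterior mean is very close to θ_0 and the posterior variance is orders of magnitude smaller than at n = 100, matching the consistency argument above.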

An important result related to consistency is the robustness of posterior inference with respect to the choice of prior. Let X_1, …, X_n be i.i.d. observations. Let π_1 and π_2 be two prior densities which are positive and continuous at θ_0, an interior point of Θ, such that the corresponding posterior distributions Π_1(·|X_n) and Π_2(·|X_n) are both consistent at θ_0. Then with probability one under θ_0,

∫_Θ |π_1(θ|X_n) − π_2(θ|X_n)| dθ → 0,

or equivalently,

sup_A |Π_1(A|X_n) − Π_2(A|X_n)| → 0.

Thus, two different choices of the prior distribution lead to approximately the same posterior distribution. A proof of this result is available in Ghosh et al. (1994) and Ghosh and Ramamoorthi (2003).
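This robustness can also be seen numerically. The following sketch uses an assumed setup (Bernoulli data with θ_0 = 0.3 and two hypothetical priors, Beta(2,2) and Beta(8,1)) and estimates the L1 distance between the two posterior densities on a grid; the distance shrinks as n grows.

```python
import math
import random

random.seed(1)
theta0 = 0.3   # assumed true value

def beta_logpdf(x, a, b):
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def l1_distance(n):
    """Grid estimate of the L1 distance between posteriors under the two priors."""
    s = sum(1 for _ in range(n) if random.random() < theta0)
    step = 1e-4
    grid = [i * step for i in range(1, int(1 / step))]
    return sum(abs(math.exp(beta_logpdf(x, 2 + s, 2 + n - s))    # Beta(2,2) prior
                   - math.exp(beta_logpdf(x, 8 + s, 1 + n - s))) # Beta(8,1) prior
               * step for x in grid)

d50, d2000 = l1_distance(50), l1_distance(2000)
print(d50, d2000)
```

The two quite different priors produce clearly different posteriors at n = 50, but nearly indistinguishable ones at n = 2000.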

4.1.2 Asymptotic Normality of Posterior Distribution Large sample Bayesian methods are primarily based on normal approximation to the posterior distribution of θ. As the sample size n increases, the posterior distribution approaches normality under certain regularity conditions and hence can be well approximated by an appropriate normal distribution if n is sufficiently large. When n is large, the posterior distribution becomes highly concentrated in a small neighborhood of the posterior mode. Suppose that the notations are as in Section 4.1.1, and θ̃_n denotes the posterior mode. Under suitable regularity conditions, a Taylor expansion of log π(θ|X_n) at θ̃_n gives


log π(θ | X_n) = log π(θ̃_n | X_n) + (θ − θ̃_n)′ (∂/∂θ) log π(θ | X_n)|_{θ=θ̃_n} − (1/2)(θ − θ̃_n)′ Ĩ_n (θ − θ̃_n) + ···

              ≈ log π(θ̃_n | X_n) − (1/2)(θ − θ̃_n)′ Ĩ_n (θ − θ̃_n),   (4.1)

where Ĩ_n is the p × p matrix

Ĩ_n = ( −(∂²/∂θ_i ∂θ_j) log π(θ | X_n)|_{θ=θ̃_n} ),

which may be called the generalized observed Fisher information matrix. The term involving the first derivative is zero because the derivative vanishes at the mode θ̃_n. Also, under suitable conditions the terms involving third and higher order derivatives can be shown to be asymptotically negligible, as θ is essentially close to θ̃_n. Because the first term in (4.1) is free of θ, π(θ|X_n), as a function of θ, is approximately represented as a density proportional to

exp[ −(1/2)(θ − θ̃_n)′ Ĩ_n (θ − θ̃_n) ],

which is a N_p(θ̃_n, Ĩ_n^{−1}) density (with p being the dimension of θ). As the posterior distribution becomes highly concentrated in a small neighborhood of the posterior mode θ̃_n where the prior density π(θ) is nearly constant, the posterior density π(θ|X_n) is essentially the same as the likelihood f(X_n|θ). Therefore, in the above heuristics, we can replace θ̃_n by the maximum likelihood estimate (MLE) θ̂_n and Ĩ_n by the observed Fisher information matrix

Î_n = ( −(∂²/∂θ_i ∂θ_j) log f(X_n | θ)|_{θ=θ̂_n} ),

so that the posterior distribution of θ is approximately N_p(θ̂_n, Î_n^{−1}). The dispersion matrix of the approximating normal distribution may also be taken to be the inverse of the expected Fisher information matrix I(θ) evaluated at θ̂_n, where I(θ) is a matrix defined as

I(θ) = E_θ( −(∂²/∂θ ∂θ′) log f(X_n | θ) ).

Thus we have the following result.

Result. Suppose that X_1, X_2, …, X_n are i.i.d. observations, abbreviated as X_n, having a density f(x_n | θ), θ ∈ Θ ⊂ R^p. Let π(θ) be a prior density and π(θ|X_n) the posterior density as defined in (2.1). Let θ̃_n be the posterior mode, θ̂_n the MLE, and Ĩ_n, Î_n, and I(θ) the different forms of the Fisher information matrix defined above. Then under suitable regularity conditions,


for large n, the posterior distribution of θ can be approximated by any one of the normal distributions N_p(θ̃_n, Ĩ_n^{−1}), N_p(θ̂_n, Î_n^{−1}), or N_p(θ̂_n, I^{−1}(θ̂_n)). In particular, under suitable regularity conditions, the posterior distribution of Î_n^{1/2}(θ − θ̂_n), given X_n, converges to N_p(0, I) with probability one under the true model for the data, where I denotes the identity matrix of order p. This is comparable with the result from classical statistical theory that the repeated sampling distribution of Î_n^{1/2}(θ̂_n − θ) given θ also converges to N_p(0, I). For a comment on the accuracy of the different normal approximations stated in the above result and an example, see Berger (1985a, Sec. 4.7.8).

We formally state a theorem below giving a set of regularity conditions under which asymptotic normality of the posterior distribution holds. Posterior normality, in some form, was first observed by Laplace in 1774 and later by Bernstein (1917) and von Mises (1931). More recent contributors in this area include Le Cam (1953, 1958, 1986), Bickel and Yahav (1969), Walker (1969), Chao (1970), Borwanker et al. (1971), and Chen (1985). Ghosal (1997, 1999, 2000) considered cases where the number of parameters increases. A general approach that also works for nonregular problems is presented in Ghosh et al. (1994) and Ghosal et al. (1995). We present below a version of a theorem that appears in Ghosh and Ramamoorthi (2003). For simplicity, we consider the case of a real parameter θ and i.i.d. observations X_1, …, X_n.

Let X_1, X_2, …, X_n be i.i.d. observations with a common distribution P_θ possessing a density f(x|θ), where θ ∈ Θ, an open subset of R. We fix θ_0 ∈ Θ, which may be regarded as the "true value" of the parameter, as the probability statements are all made under θ_0. Let l(θ, x) = log f(x|θ), L_n(θ) = Σ_{i=1}^n l(θ, X_i), the log-likelihood, and for a function h, let h^{(i)} denote the ith derivative of h. We assume the following regularity conditions on the density f(x|θ).

(A1) The set {x : f(x|θ) > 0} is the same for all θ ∈ Θ.

(A2) l(θ, x) is thrice differentiable with respect to θ in a neighborhood (θ_0 − δ, θ_0 + δ) of θ_0. The expectations E_{θ_0} l^{(1)}(θ_0, X_1) and E_{θ_0} l^{(2)}(θ_0, X_1) are both finite and

sup_{θ ∈ (θ_0 − δ, θ_0 + δ)} |l^{(3)}(θ, x)| ≤ M(x)  with  E_{θ_0} M(X_1) < ∞.   (4.2)

(A3) Interchange of the order of integration with respect to P_{θ_0} and differentiation at θ_0 is justified, so that

E_{θ_0} l^{(1)}(θ_0, X_1) = 0  and  E_{θ_0} l^{(2)}(θ_0, X_1) = −E_{θ_0} (l^{(1)}(θ_0, X_1))².

Also, the Fisher information number per unit observation I(θ_0) = E_{θ_0} (l^{(1)}(θ_0, X_1))² is positive.

(A4) For any δ > 0, with P_{θ_0}-probability one,


sup_{|θ − θ_0| > δ} (1/n)(L_n(θ) − L_n(θ_0)) < −ε

for some ε > 0 and for all sufficiently large n.

Remark: Suppose there exists a strongly consistent sequence of estimators θ_n of θ. This means that for all θ_0 ∈ Θ, θ_n → θ_0 with P_{θ_0}-probability one. Then by the arguments given in Ghosh (1983), a strongly consistent solution θ̂_n of the likelihood equation L_n^{(1)}(θ) = 0 exists, i.e., there exists a sequence of statistics θ̂_n such that with P_{θ_0}-probability one, θ̂_n satisfies the likelihood equation for sufficiently large n and θ̂_n → θ_0.

Theorem 4.2. Suppose assumptions (A1)-(A4) hold and θ̂_n is a strongly consistent solution of the likelihood equation. Then for any prior density π(θ) which is continuous and positive at θ_0,

∫_{−∞}^{∞} | π_n*(t|X_1, …, X_n) − √(I(θ_0)/2π) e^{−(1/2) t² I(θ_0)} | dt → 0   (4.3)

with P_{θ_0}-probability one, where π_n*(t|X_1, …, X_n) is the posterior density of t = √n(θ − θ̂_n) given X_1, …, X_n. Also under the same assumptions, (4.3) holds with I(θ_0) replaced by Î_n := −(1/n) L_n^{(2)}(θ̂_n).

A sketch of proof. We only present a sketch of the proof; interested readers may obtain a detailed complete proof from this sketch. The proof consists essentially of two steps. It is first shown that the tails of the posterior distribution are negligible. Then in the remaining part, the log-likelihood function is expanded by Taylor's theorem up to terms involving the third derivative. The linear term in the expansion vanishes, the quadratic term is proportional to the logarithm of a normal density, and the remainder term is negligible under assumption (4.2) on the third derivative.

Because π_n(θ|X_1, …, X_n) ∝ Π_{i=1}^n f(X_i|θ) π(θ), the posterior density of t = √n(θ − θ̂_n) can be written as

π_n*(t|X_1, …, X_n) = C_n^{−1} π(θ̂_n + t/√n) exp[L_n(θ̂_n + t/√n) − L_n(θ̂_n)],   (4.4)

where

C_n = ∫_{−∞}^{∞} π(θ̂_n + t/√n) exp[L_n(θ̂_n + t/√n) − L_n(θ̂_n)] dt.

Most of the statements made below hold with P_{θ_0}-probability one, but we will omit the phrase "with P_{θ_0}-probability one". Let

g_n(t) = π(θ̂_n + t/√n) exp[L_n(θ̂_n + t/√n) − L_n(θ̂_n)] − π(θ_0) e^{−(1/2) t² I(θ_0)}.

We first note that in order to prove (4.3), it is enough to show that

∫_{−∞}^{∞} |g_n(t)| dt → 0.   (4.5)

If (4.5) holds, C_n → π(θ_0)√(2π/I(θ_0)), and therefore the integral in (4.3), which is dominated by

C_n^{−1} ∫_{−∞}^{∞} |g_n(t)| dt + | C_n^{−1} π(θ_0) − √(I(θ_0)/2π) | ∫_{−∞}^{∞} e^{−(1/2) t² I(θ_0)} dt,

also goes to zero. To show (4.5), we break R into two regions A_1 = {t : |t| > δ_0 √n} and A_2 = {t : |t| ≤ δ_0 √n} for some suitably chosen small positive number δ_0 and show that, for i = 1, 2,

∫_{A_i} |g_n(t)| dt → 0.   (4.6)

To show (4.6) for i = 1, we note that

∫_{A_1} |g_n(t)| dt ≤ ∫_{A_1} π(θ̂_n + t/√n) exp[L_n(θ̂_n + t/√n) − L_n(θ̂_n)] dt + ∫_{A_1} π(θ_0) e^{−(1/2) t² I(θ_0)} dt.

It is easy to see that the second integral goes to zero. For the first integral, we note that by assumption (A4), for t ∈ A_1,

exp[L_n(θ̂_n + t/√n) − L_n(θ̂_n)] ≤ e^{−nε}

for all sufficiently large n. It follows that (4.6) holds for i = 1.

To show (4.6) for i = 2, we use the dominated convergence theorem. Expanding in a Taylor series and noting that L_n^{(1)}(θ̂_n) = 0, we have for large n,

L_n(θ̂_n + t/√n) − L_n(θ̂_n) = −(1/2) t² Î_n + R_n(t),   (4.7)

where R_n(t) = (1/6)(t/√n)³ L_n^{(3)}(θ_n′) and θ_n′ lies between θ̂_n and θ̂_n + t/√n. By assumption (A2), for each t, R_n(t) → 0 and Î_n → I(θ_0), and therefore g_n(t) → 0. For suitably chosen δ_0, for any t ∈ A_2,

|R_n(t)| ≤ (1/4) t² I(θ_0)

for sufficiently large n, so that from (4.7),


for sufficiently large n, |g_n(t)| ≤ c_1 e^{−c_2 t²} for some positive constants c_1, c_2. Therefore, for suitably chosen small δ_0, |g_n(t)| is dominated by an integrable function on the set A_2. Thus (4.6) holds for i = 2. This completes the proof of (4.3). The second part of the theorem follows as Î_n → I(θ_0). □

Remark. We assumed in the proof above that π(θ) is a proper probability density. However, Theorem 4.2 holds even if π is improper, if there is an n_0 such that the posterior distribution of θ given (X_1, X_2, …, X_{n_0}) is proper for almost every (x_1, x_2, …, x_{n_0}).

The following theorem states that in the regular case, with a large sample, a Bayes estimate is approximately the same as the maximum likelihood estimate θ̂_n. If we consider the squared error loss, the Bayes estimate for θ is given by the posterior mean

θ_n* = ∫_Θ θ π(θ|X_1, …, X_n) dθ.

Theorem 4.3. In addition to the assumptions of Theorem 4.2, assume that the prior density π(θ) has a finite expectation. Then √n(θ_n* − θ̂_n) → 0 with P_{θ_0}-probability one.

Proof. Proceeding as in the proof of Theorem 4.2 and using the assumption of finite expectation for π, (4.3) can be strengthened to

∫_{−∞}^{∞} |t| | π_n*(t|X_1, …, X_n) − √(I(θ_0)/2π) e^{−(1/2) t² I(θ_0)} | dt → 0

with P_{θ_0}-probability one. This implies

∫_{−∞}^{∞} t π_n*(t|X_1, …, X_n) dt → 0,

as the normal density above has mean zero. Now θ_n* = E(θ|X_1, …, X_n) = E(θ̂_n + t/√n | X_1, …, X_n), and therefore

√n(θ_n* − θ̂_n) = ∫_{−∞}^{∞} t π_n*(t|X_1, …, X_n) dt → 0.  □

Theorems 4.2 and 4.3 and their variants can be used to make inference about θ for large samples. We have seen in Chapter 2 how our inference can be based on the posterior distribution. If the sample size is sufficiently large, for a wide variety of priors we can replace the posterior distribution by the approximating normal distribution having mean θ̂_n and dispersion Î_n^{−1} (or (n Î_n)^{−1} in the notation of Theorem 4.2), which does not depend on the prior. Theorem 4.3 tells us that in the problem of estimating a real parameter with squared error loss, the Bayes estimate is approximately the same as the MLE θ̂_n. Indeed, Theorem 4.3 can be extended to show that this is also true for a wide variety of loss functions. Also, the moments and quantiles of the posterior distribution can be approximated by the corresponding measures of the approximating normal distribution. We consider an example at the end of Section 4.3 to illustrate the use of asymptotic posterior normality in the problems of interval estimation and testing.
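As a numerical illustration of these approximations (a sketch under assumed values, not an example from the book), take n Bernoulli trials with a uniform prior, so the exact posterior is Beta(s+1, n−s+1), the MLE is θ̂_n = s/n, and I(θ) = [θ(1−θ)]^{−1}. The snippet compares the exact posterior with N(θ̂_n, θ̂_n(1−θ̂_n)/n) in total variation and checks that the posterior mean is close to the MLE.

```python
import math

# Assumed data summary: n Bernoulli trials with s successes.
n, s = 400, 168
theta_hat = s / n                              # MLE
a, b = s + 1, n - s + 1                        # Beta posterior under a uniform prior
var_approx = theta_hat * (1 - theta_hat) / n   # I(theta_hat)^{-1} / n

def beta_pdf(x, a, b):
    logc = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(logc + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def normal_pdf(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

# Total variation distance between the exact posterior and the normal approximation,
# estimated on a fine grid.
step = 1e-4
tv = 0.5 * sum(abs(beta_pdf(x, a, b) - normal_pdf(x, theta_hat, var_approx)) * step
               for x in [i * step for i in range(1, int(1 / step))])

post_mean = a / (a + b)                        # exact posterior mean
print(tv, abs(post_mean - theta_hat))
```

At n = 400 the total variation distance is already small, and the posterior mean differs from the MLE only at order 1/n, in line with Theorems 4.2 and 4.3.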


4.2 Asymptotic Expansion of Posterior Distribution Consider the setup of Theorem 4.2. Let

F_n(u) = Π_n( { √n Î_n^{1/2} (θ − θ̂_n) ≤ u } | X_1, …, X_n )

be the posterior distribution function of √n Î_n^{1/2}(θ − θ̂_n). Then under certain regularity assumptions, F_n(u) is approximately equal to Φ(u), where Φ is the standard normal distribution function. Theorem 4.2 states that under assumptions (A1)-(A4) on the density f(x|θ), for any prior density π(θ) which is continuous and positive at θ_0,

lim_{n→∞} sup_u |F_n(u) − Φ(u)| = 0  a.s. P_{θ_0}.   (4.8)

Recall that this is proved essentially in two steps. It is first shown that the tails of the posterior distribution are negligible. Then in the remaining part, the log-likelihood function is expanded by Taylor's theorem up to terms involving the third derivative. The linear term in the expansion vanishes, the quadratic term is proportional to the logarithm of a normal density, and the remainder term is negligible under assumption (4.2) on the third derivative. Suppose now that l(θ, x) = log f(x|θ) is (k+3) times continuously differentiable and π(θ) is (k+1) times continuously differentiable at θ_0 with π(θ_0) > 0. Then the subsequent higher order terms in the Taylor expansion provide a refinement of the posterior normality result stated in Theorem 4.2 or in (4.8) above. Under conditions similar to (4.2) for the derivatives of l(θ, x) of order 3, 4, …, k+3, and some more conditions on f(x|θ), Johnson (1970) proved the following rigorous and precise version of a refinement due to Lindley (1961).

sup_u | F_n(u) − Φ(u) − φ(u) Σ_{j=1}^{k} ψ_j(u; X_1, …, X_n) n^{−j/2} | ≤ M_k n^{−(k+1)/2}   (4.9)

eventually with P_{θ_0}-probability one for some M_k > 0, depending on k, where φ(u) is the standard normal density and each ψ_j(u; X_1, …, X_n) is a polynomial in u having coefficients bounded in X_1, …, X_n. Under the same assumptions one can obtain a similar result involving the L_1 distance between the posterior density and an approximation. The case k = 0 corresponds to that considered in Section 4.1.2, as (4.9) becomes

sup_u |F_n(u) − Φ(u)| ≤ M_0 n^{−1/2}.   (4.10)

Another (uniform) version of the above result, as stated in Ghosh et al. (1982), is as follows. Let Θ_1 be a bounded open interval whose closure Θ̄_1 is properly contained in Θ, and let the prior π be positive on Θ̄_1. Then, as stated in Ghosh et al. (1982), for r > 0, (4.9) holds with P_θ-probability 1 − O(n^{−r}), uniformly in θ ∈ Θ_1, under certain regularity conditions (depending on r) which are stronger than those of Johnson (1970).


For a formal argument showing how the terms in the asymptotic expansion given in (4.9) are calculated, see Johnson (1970) and Ghosh et al. (1982). For example, if we want to obtain an approximation of the posterior distribution up to an error of order o(n^{−1}), we take k = 2 and proceed as follows. This is taken from Ghosh (1994).

Let t = √n(θ − θ̂_n) and a_i = (1/n) (d^i L_n(θ)/dθ^i)|_{θ̂_n}, i ≥ 1, so that a_2 = −Î_n. The posterior density of t is given by (4.4), and by Taylor expansion

π(θ̂_n + t/√n) = π(θ̂_n) [ 1 + n^{−1/2} t π′(θ̂_n)/π(θ̂_n) + n^{−1} (t²/2) π″(θ̂_n)/π(θ̂_n) ] + o(n^{−1})

and

L_n(θ̂_n + t/√n) − L_n(θ̂_n) = (1/2) t² a_2 + (1/6) n^{−1/2} t³ a_3 + (1/24) n^{−1} t⁴ a_4 + o(n^{−1}).

Therefore,

π(θ̂_n + t/√n) exp[L_n(θ̂_n + t/√n) − L_n(θ̂_n)]
   = π(θ̂_n) exp[a_2 t²/2] × [ 1 + n^{−1/2} α_1(t; X_1, …, X_n) + n^{−1} α_2(t; X_1, …, X_n) ] + o(n^{−1}),

where

α_1(t; X_1, …, X_n) = (1/6) t³ a_3 + t π′(θ̂_n)/π(θ̂_n),

α_2(t; X_1, …, X_n) = (1/24) t⁴ a_4 + (1/72) t⁶ a_3² + (t²/2) π″(θ̂_n)/π(θ̂_n) + (1/6) t⁴ a_3 π′(θ̂_n)/π(θ̂_n).

The normalizer C_n also has a similar expansion that can be obtained by integrating the above. The posterior density of t is then expressed as

π_n*(t|X_1, …, X_n) = (2π)^{−1/2} Î_n^{1/2} e^{−t² Î_n/2} × [ 1 + Σ_{i=1}^{2} n^{−i/2} γ_i(t; X_1, …, X_n) ] + o(n^{−1}),

where

γ_1(t; X_1, …, X_n) = (1/6) t³ a_3 + t π′(θ̂_n)/π(θ̂_n),

and γ_2(t; X_1, …, X_n) is obtained similarly from α_2 after adjusting for the corresponding expansion of C_n.


Transforming to s = Î_n^{1/2} t, we get the expansion for the posterior density of √n Î_n^{1/2}(θ − θ̂_n), and integrating it from −∞ to u, we get the terms in (4.9). The above expansion for the posterior density also gives an expansion for the posterior mean:

E(θ | X_1, …, X_n) = θ̂_n + n^{−1} [ (1/2) a_3 Î_n^{−2} + Î_n^{−1} π′(θ̂_n)/π(θ̂_n) ] + o(n^{−1}).

Similar expansions can also be obtained for other moments and quantiles. For more details and discussion see Johnson (1970), Ghosh et al. (1982), and Ghosh (1994). Ghosh et al. (1982) and Ghosh (1994) also obtain expansions of the Bayes estimate and the Bayes risk. These expansions are rather delicate in the sense that the terms in the expansion can tend to infinity; see, e.g., the discussion in Ghosh et al. (1982). The expansions agree with those obtained by Tierney and Kadane (1986) (see Section 4.3) up to o(n^{−2}). Although the Tierney-Kadane approximation is more convenient for numerical calculations, the expansions obtained in Ghosh et al. (1982) and Ghosh (1994) are more suitable for theoretical applications.

A Bayesian would want to prove an expansion like (4.9) under the marginal distribution of X_1, …, X_n derived from the joint distribution of the X's and θ. There are certain technical difficulties in proving this from (4.9). Such a result will hold if the prior π(θ) is supported on a bounded interval and behaves smoothly at the boundary points in the sense that π(θ) and (d^i/dθ^i)π(θ), i = 1, 2, …, k, are zero at the boundary points. A rather technical proof is given in Ghosh et al. (1982). See also in this context Bickel and Ghosh (1990). For the uniform prior on a bounded interval, there can be no asymptotic expansion of the integrated Bayes risk (with squared error loss) of the form a_0 + a_1/n + a_2/n² + o(n^{−2}) (Ghosh et al. (1982)).
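For the Bernoulli model with a uniform prior the exact posterior mean is available in closed form, so an expansion of this kind can be checked numerically. In the sketch below (assumed data summary n = 200, s = 130, not from the book), the first-order correction n^{−1}[(1/2) a_3 Î_n^{−2} + Î_n^{−1} π′(θ̂_n)/π(θ̂_n)] reduces to (1 − 2θ̂_n)/n because the prior is flat.

```python
# Check the posterior-mean expansion for Bernoulli data with a uniform prior.
# Exact posterior: Beta(s + 1, n - s + 1), so E(theta | data) = (s + 1)/(n + 2).
n, s = 200, 130                  # assumed data summary
theta_hat = s / n                # MLE

# For the Bernoulli log-likelihood, a_3 = L'''(theta_hat)/n and
# I_hat = -L''(theta_hat)/n have closed forms:
I_hat = 1.0 / (theta_hat * (1 - theta_hat))
a3 = 2.0 * (1.0 / theta_hat**2 - 1.0 / (1 - theta_hat)**2)
prior_term = 0.0                 # pi'(theta_hat)/pi(theta_hat) = 0 for the flat prior

correction = (a3 / (2 * I_hat**2) + prior_term / I_hat) / n
approx_mean = theta_hat + correction

exact_mean = (s + 1) / (n + 2)
print(exact_mean, approx_mean)
```

The corrected value matches the exact posterior mean to order n^{−2}, whereas the plain MLE is off at order n^{−1}.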

4.2.1 Determination of Sample Size in Testing In this subsection, we consider certain testing problems and find asymptotic approximations to the corresponding (minimum) Bayes risks. These approximations can be used to determine the sample sizes required to achieve given bounds for the Bayes risks. We first consider the case of a real parameter θ ∈ Θ, an open interval in R, and the problem of testing

H_0: θ ≤ θ_0 versus H_1: θ > θ_0

for some specified value θ_0. Let X_1, …, X_n be i.i.d. observations with a common density f(x|θ) involving the parameter θ. Let π(θ) be a prior density over Θ and π(θ|x) be the corresponding posterior. Set

R_0(x) = P(θ > θ_0 | x) = ∫_{θ > θ_0} π(θ|x) dθ,   R_1(x) = 1 − R_0(x).


As mentioned in Section 2.7.2, the Bayes rule for the usual 0-1 loss (see Section 2.5) is to choose H_0 if R_0(X) ≤ R_1(X), or equivalently R_1(X) ≥ 1/2, and to choose H_1 otherwise. The (minimum) Bayes risk is then given by

r(π) = ∫_{θ > θ_0} P_θ[R_1(X) ≥ 1/2] π(θ) dθ + ∫_{θ ≤ θ_0} P_θ[R_1(X) < 1/2] π(θ) dθ.   (4.11)

By Theorem 2.7 an alternative expression for the Bayes risk is given by

r(π) = E[ min{R_0(X), R_1(X)} ],   (4.12)

where the expectation is with respect to the marginal distribution of X.

where the expectation is with respect to the marginal distribution of X. Suppose JB- Bol > 8 where 8 is chosen suitably. For each such B, Bn is close to B with large probability and hence IBn- Bol > 8. Intuitively, for such Bn it will be relatively easy to choose the correct hypothesis. This suggests most of the contribution to the right hand side of (4.11) comes from B close to B0 , i.e., from JB- B0 J< 8. A formal argument that we skip shows r(n) =

{

} 0o[-ylnJ 112(B)(B- B0)]n(B) dB

{ } fJo(tJ1/2(Bo)) dt + o(n-1/2) Vn -oo (4.14)

'

where C = J0 [1- q>(u)]du;::::; 0.3989423. From (4.14) it follows that if one wants to have Bayes risk at most equal to some specified r 0 then the required sample size n 0 with which one can achieve this (approximately) is given by 00
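A sketch of this sample-size calculation for the Bernoulli case with θ_0 = 1/2 and a uniform prior, so that π(θ_0) = 1 and I(θ_0) = 4; the target Bayes risk r_0 = 0.02 is an assumed value. The constant C is computed by numerical integration as a check against its closed form 1/√(2π).

```python
import math

def normal_sf(u):
    """1 - Phi(u) for the standard normal."""
    return 0.5 * math.erfc(u / math.sqrt(2))

# C = integral from 0 to infinity of (1 - Phi(u)) du, on a truncated grid.
step, upper = 1e-3, 10.0
C = sum(normal_sf(i * step) * step for i in range(int(upper / step)))

# Required sample size for a target Bayes risk r0 (assumed value):
# Bernoulli model, theta0 = 1/2, uniform prior, so pi(theta0) = 1, I(theta0) = 4.
r0 = 0.02
pi0, I0 = 1.0, 4.0
n0 = (2 * C * pi0 / (math.sqrt(I0) * r0)) ** 2
print(C, math.ceil(n0))
```

The numerically computed C agrees with 1/√(2π) ≈ 0.3989, and a few hundred observations suffice for a Bayes risk of 0.02 in this setup.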

In the same way we can handle, say, a two-parameter problem with parameter θ = (θ_1, θ_2). Suppose θ_1 and θ_2 are comparable and the quantity of interest is η = θ_1 − θ_2. The problem is to test

H_0: η ≤ η_0 versus H_1: η > η_0

for some specified η_0. Let π(θ) be the joint prior density of (θ_1, θ_2) and p(η) be the marginal prior density of η. Let Î_n be the observed Fisher information matrix as defined in the first part of Subsection 4.1.2. Then a normal approximation to the posterior distribution of θ is N(θ̂_n, Î_n^{−1}), vide Subsection 4.1.2. This implies that a normal approximation to the posterior of η is given by N(θ̂_{1n} − θ̂_{2n}, v_n) with

v_n = Î_n^{11} + Î_n^{22} − 2 Î_n^{12},

where Î_n^{ij} denotes the (i, j)th element of Î_n^{−1}. Note that

(n v_n)^{−1/2} → b(θ) = [ I^{11}(θ) + I^{22}(θ) − 2 I^{12}(θ) ]^{−1/2} under θ,

where I^{ij}(θ) denotes the (i, j)th element of I^{−1}(θ), I(θ) being the expected Fisher information matrix per unit observation.
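The variance formula for η can be checked by direct computation. In the sketch below the entries of the observed information matrix are assumed values; v_n computed from the elements of Î_n^{−1} is compared with the empirical variance of θ_1 − θ_2 under simulated draws from the approximating normal posterior.

```python
import math
import random

random.seed(7)

# Hypothetical observed Fisher information matrix I_n = [[a, b], [b, c]] (assumed, SPD).
a, b, c = 50.0, 10.0, 40.0
det = a * c - b * b
i11, i22, i12 = c / det, a / det, -b / det   # elements of I_n^{-1}
v_n = i11 + i22 - 2 * i12                    # posterior variance of eta = theta1 - theta2

# Check: simulate (theta1, theta2) from N(0, I_n^{-1}) via a Cholesky factor.
l11 = math.sqrt(i11)
l21 = i12 / l11
l22 = math.sqrt(i22 - l21 * l21)
draws = []
for _ in range(200000):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    t1 = l11 * z1
    t2 = l21 * z1 + l22 * z2
    draws.append(t1 - t2)
m = sum(draws) / len(draws)
emp_var = sum((d - m) ** 2 for d in draws) / (len(draws) - 1)
print(v_n, emp_var)
```

The empirical variance of η under the simulated normal posterior matches v_n closely.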

Let π*(β) and π*(η|β) be respectively the marginal prior density of β = θ_1 + θ_2 and the conditional prior density of η given β, and let a(η, β) be b(θ) expressed in terms of η and β. Then by arguments similar to those used above, an approximation to the Bayes risk for this problem is

r(π) ≈ (2C/√n) ∫ a(η_0, β) π*(η_0|β) π*(β) dβ.

Example. Let X_1, …, X_n be i.i.d. Bernoulli(θ) and consider testing H_0: θ ≤ 1/2 versus H_1: θ > 1/2. If we consider the uniform prior π(θ) = 1 on (0, 1), we have

R_0(X) = R_0(T) = [ Γ(n+2) / (Γ(T+1) Γ(n−T+1)) ] ∫_{1/2}^{1} θ^T (1−θ)^{n−T} dθ,

which is a function of T = Σ_{i=1}^n X_i, and the marginal distribution of T is uniform over {0, 1, …, n}. Then from (4.12) the Bayes risk is given by

r(π) = (1/(n+1)) Σ_{t=0}^{n} min{ R_0(t), 1 − R_0(t) }.

Here I(θ) = [θ(1−θ)]^{−1} and the approximation suggested in (4.14) is

r*(π) = (2C π(1/2)) / (√n I^{1/2}(1/2)) = C/√n.

Suppose now that we have two location-scale families of densities f_i(x|μ, σ) = (1/σ) f_i((x − μ)/σ), i = 0, 1. For example, f_0 may be standard normal, i.e., N(0, 1), and f_1 may be standard Cauchy, i.e.,

f_1(x) = (1/π) · 1/(1 + x²).

The observations X_1, X_2, …, X_n are i.i.d. with density belonging to one of these two families. One has to decide which is true. Consider the Bayes rule which accepts f_1 if

∫_0^∞ ∫_{−∞}^{∞} σ^{−n} Π_{j=1}^n f_1((X_j − μ)/σ) dμ (dσ/σ) / ∫_0^∞ ∫_{−∞}^{∞} σ^{−n} Π_{j=1}^n f_0((X_j − μ)/σ) dμ (dσ/σ) > c.

If c is chosen such that the Type 1 and Type 2 error probabilities are the same, then this is a minimax test, i.e., it minimizes the maximum error probability among all tests, where the maximum is over i = 0, 1 and (μ, σ) ∈ R × R⁺.

Suppose we consider the estimation problem of a location parameter with a squared error loss. Let X_1, X_2, …, X_n be i.i.d. ∼ f(x − θ). Here π_r = π_l = constant. The corresponding Bayes estimate is

T(X) = ∫_{−∞}^{∞} θ Π_{j=1}^n f(X_j − θ) dθ / ∫_{−∞}^{∞} Π_{j=1}^n f(X_j − θ) dθ,

which is both minimax and best among equivariant estimators, i.e., it minimizes R(θ, T(X)) = E_θ(T(X) − θ)² among all T satisfying

T(x_1 + a, …, x_n + a) = T(x_1, …, x_n) + a,  a ∈ R.
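The estimator above (the Pitman estimator) can be evaluated by numerical integration. For f standard normal it reproduces the sample mean, which gives a convenient check; the data below are assumed values.

```python
import math

x = [0.3, -1.2, 2.5, 0.7, -0.4]       # assumed observations

def loglik(theta):
    # log of prod_j f(x_j - theta) for standard normal f (constants cancel in the ratio)
    return -0.5 * sum((xi - theta) ** 2 for xi in x)

# Riemann-sum evaluation of the ratio of integrals defining the estimator.
step = 1e-3
grid = [min(x) - 5 + i * step for i in range(int((max(x) - min(x) + 10) / step))]
weights = [math.exp(loglik(t)) for t in grid]
pitman = sum(t * w for t, w in zip(grid, weights)) / sum(weights)

sample_mean = sum(x) / len(x)
print(pitman, sample_mean)
```

For a non-normal f, such as the Cauchy, the same ratio gives an equivariant estimator that generally differs from the sample mean.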

A similar result for scale families is explored in Problem 4. 5.1.9 General Group Families There are interesting statistical problems that are left invariant by groups of transformations other than the location-scale transformations discussed in the preceding subsections. It is of interest then to consider invariant Haar measures for such general groups also. An example follows.


5 Choice of Priors for Low-dimensional Parameters

Example 5.3. Suppose X ∼ N_p(θ, I). It is desired to test

H_0: θ = 0 versus H_1: θ ≠ 0.

This testing problem is invariant under the group G_O of all orthogonal transformations; i.e., if H is an orthogonal matrix of order p, then g_H X = HX ∼ N_p(Hθ, I), so that ḡ_H θ = Hθ. Also, ḡ_H 0 = 0. Further discussion of this example as well as treatment of invariant tests is taken up in Chapter 6. Discussion on statistical applications involving general groups can be found in sources such as Eaton (1983, 1989), Farrell (1985), and Muirhead (1982).

5.1.10 Reference Priors In statistical problems that are left invariant by a group of nice transformations, the Jeffreys prior turns out to be the left invariant prior, vide Datta and Ghosh (1996). But for reasons outlined in Subsection 5.1.8, one would prefer the right invariant prior. In all the examples that we have seen, an interesting modification of the Jeffreys prior, introduced in Bernardo (1979) and further refined in Berger and Bernardo (1989, 1992a, and 1992b), leads to the right invariant prior. These priors are called reference priors after Bernardo (1979). A reference prior is simply an objective prior constructed in a particular way, but the term reference prior could, in principle, be applied to any objective prior, because any objective prior is taken as some sort of objective or conventional standard, i.e., a reference point with which one may compare subjective priors to calibrate them. As pointed out by Bernardo (1979), suitably chosen reference priors can be appropriate in high-dimensional problems also. We discuss this in a later chapter. Our presentation in this section is based on Ghosh (1994, Chapter 9) and Ghosh and Ramamoorthi (2003, Chapter 1). If we consider all the parameters to be of equal importance, we maximize the entropy of Subsection 5.1.3. This leads to the Jeffreys prior. To avoid this, one assumes the parameters have different importance. We consider first the case of d = 2 parameters, namely (θ_1, θ_2), where we have an ordering of the parameters in order of importance.
Thus θ_1 is supposed to be more important than θ_2. For example, suppose we are considering a random sample from N(μ, σ²). If our primary interest is in μ, we would take θ_1 = μ, θ_2 = σ². If our primary interest is in σ², then θ_1 = σ², θ_2 = μ. If our primary interest is in μ/σ, we take θ_1 = μ/σ and θ_2 = μ or σ or any other function such that (θ_1, θ_2) is a one-to-one sufficiently smooth function of (μ, σ). For fixed θ_1, the conditional density p(θ_2|θ_1) is one dimensional. Bernardo (1979) recommends setting this equal to the conditional Jeffreys prior

c(θ_1)√(I_{22}(θ)). Having fixed this, the marginal p(θ_1) is chosen by maximizing

E( log [ p(θ_1|X_1, …, X_n) / p(θ_1) ] )

in an asymptotic sense as explained below. Fix an increasing sequence of closed and bounded, i.e., compact, rectangles K_{1i} × K_{2i} whose union is Θ. Let p_i(θ_2|θ_1) be the conditional Jeffreys prior, restricted to K_{2i}, and p_i(θ_1) a prior supported on K_{1i}. As mentioned before,

p_i(θ_2|θ_1) = c_i(θ_1) √(I_{22}(θ)) on K_{2i},

where c_i(θ_1) is a normalizing constant such that

∫_{K_{2i}} p_i(θ_2|θ_1) dθ_2 = 1.

Then p_i(θ_1, θ_2) = p_i(θ_1) p_i(θ_2|θ_1) on K_{1i} × K_{2i}, and we consider

J(p_i(θ_1), X) = E[ log { p_i(θ_1|X) / p_i(θ_1) } ]

   = E[ log { p_i(θ|X) / p_i(θ) } ] − E[ log { p_i(θ_2|θ_1, X) / p_i(θ_2|θ_1) } ]

   = J(p_i(θ_1, θ_2), X) − ∫_{K_{1i}} p_i(θ_1) J(p_i(θ_2|θ_1), X) dθ_1,

where, for fixed θ_1, J(p_i(θ_2|θ_1), X) is the Lindley-Bernardo functional

J = E[ log { p_i(θ_2|θ_1, X) / p_i(θ_2|θ_1) } ]

with p_i(θ_2|θ_1) being regarded as a conditional prior for θ_2 for fixed θ_1. Applying the asymptotic normal approximation to the first term on the right-hand side as well as the second term on the right-hand side, as in Subsection 5.1.3,

J(p_i(θ_1), X) = K_n + [ ∫_{K_{1i}×K_{2i}} p_i(θ) log {det(I(θ))}^{1/2} dθ − ∫_{K_{1i}×K_{2i}} p_i(θ) log p_i(θ) dθ ]

   − ∫_{K_{1i}} p_i(θ_1) [ ∫_{K_{2i}} p_i(θ_2|θ_1) log {I_{22}(θ)}^{1/2} dθ_2 − ∫_{K_{2i}} p_i(θ_2|θ_1) log p_i(θ_2|θ_1) dθ_2 ] dθ_1

   = K_n + ∫_{K_{1i}} p_i(θ_1) ∫_{K_{2i}} p_i(θ_2|θ_1) log { det(I(θ)) / I_{22}(θ) }^{1/2} dθ_2 dθ_1 − ∫_{K_{1i}} p_i(θ_1) log p_i(θ_1) dθ_1   (5.27)


where K_n is a constant depending on n. Let ψ_i(θ_1) be the geometric mean of (I^{11}(θ))^{−1/2} with respect to p_i(θ_2|θ_1). Then (5.27) can be written as

K_n + ∫_{K_{1i}} p_i(θ_1) log { ψ_i(θ_1) / p_i(θ_1) } dθ_1,

which is maximized if

p_i(θ_1) = c_i′ ψ_i(θ_1) on K_{1i}
         = 0 outside.

The product p_i(θ_1) p_i(θ_2|θ_1) is the reference prior on K_{1i} × K_{2i}. If we can write this as

p_i(θ) = d_i λ(θ_1, θ_2) on K_{1i} × K_{2i}
       = 0 elsewhere,

then the reference prior on Θ may be taken as proportional to λ(θ_1, θ_2). Clearly, the reference prior depends on the choice of (θ_1, θ_2) and the compact sets K_{1i}, K_{2i}. The normalization on K_{1i} × K_{2i} first appeared in Berger and Bernardo (1989). If an improper objective prior is used for fixed n, one might run into paradoxes of the kind exhibited by Fraser et al. (1985). See in this connection Berger and Bernardo (1992a). Recently there has been some change in the definition of reference prior, but we understand the final results are similar (Berger et al. (2006)). The above procedure is based on Ghosh and Mukerjee (1992) and Ghosh (1994).

Algebraically it is more convenient to work with [I(θ)]^{−1}, as in Berger and Bernardo (1992b). We illustrate their method for d = 3; the general case is handled in the same way. First note the following two facts.

A. Suppose θ_1, θ_2, …, θ_j follow a multivariate normal distribution with dispersion matrix Σ. Then the conditional distribution of θ_j given θ_1, θ_2, …, θ_{j−1} is normal with variance equal to the reciprocal of the (j, j)th element of Σ^{−1}.

B. Following the notations of Berger and Bernardo (1992b), let S = [I(θ)]^{−1}, where I(θ) is the d × d Fisher information matrix, and let S_j be the j × j principal submatrix of S. Let H_j = S_j^{−1} and h_j be the (j, j)th element of H_j. Then by A, the asymptotic variance of θ_j given θ_1, θ_2, …, θ_{j−1} is (h_j)^{−1}/n. To get some feeling for h_j, note that for arbitrary d and j = d, h_j = I_{dd}, and for arbitrary d and j = 1, h_1 = 1/I^{11}.

We now provide a new asymptotic formula for the Lindley-Bernardo information measure, namely,
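Facts A and B together assert a linear-algebra identity: for a positive definite dispersion matrix S, the reciprocal of the (j, j)th element of S_j^{−1} equals the conditional variance of the jth coordinate given the first j − 1 coordinates (a Schur complement). The sketch below verifies this for an assumed 3 × 3 matrix.

```python
def inverse(M):
    """Gauss-Jordan inverse of a small square matrix (lists of lists)."""
    n = len(M)
    A = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [v - f * w for v, w in zip(A[r], A[col])]
    return [row[n:] for row in A]

# Assumed positive definite dispersion matrix S.
S = [[2.0, 0.5, 0.3],
     [0.5, 1.0, 0.2],
     [0.3, 0.2, 1.5]]

def h_reciprocal(j):
    """1/h_j, where h_j is the (j,j)th element of S_j^{-1} (fact B)."""
    Sj = [row[:j] for row in S[:j]]
    return 1.0 / inverse(Sj)[j - 1][j - 1]

def schur_cond_var(j):
    """Conditional variance of coordinate j given coordinates 1..j-1 (fact A)."""
    if j == 1:
        return S[0][0]
    B = [row[:j - 1] for row in S[:j - 1]]
    Binv = inverse(B)
    v = [S[j - 1][k] for k in range(j - 1)]
    quad = sum(v[r] * Binv[r][c] * v[c]
               for r in range(j - 1) for c in range(j - 1))
    return S[j - 1][j - 1] - quad

print([h_reciprocal(j) for j in (1, 2, 3)])
print([schur_cond_var(j) for j in (1, 2, 3)])
```

The two computations agree for every j, which is exactly the identity used when reading off (h_j)^{−1}/n as an asymptotic conditional variance.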

J(p_i(θ_j|θ_1, …, θ_{j−1}), X)

   = E[ log N(θ̂_j(θ_1, …, θ_{j−1}), h_j^{−1}(θ)/n) ] − E[ log p_i(θ_j|θ_1, …, θ_{j−1}) ]

(where θ̂_j(θ_1, …, θ_{j−1}) is the MLE for θ_j given θ_1, …, θ_{j−1}, and p_i is a prior supported on a compact rectangle K_{1i} × K_{2i} × ⋯ × K_{di})

   = K_n + E[ ∫_{K_{ji}} log { ψ_j(θ_1, …, θ_j) / p(θ_j|θ_1, …, θ_{j−1}) } p(θ_j|θ_1, …, θ_{j−1}) dθ_j ] + o_p(1),   (5.28)

where K_n is a constant depending on n and

ψ_j(θ_1, …, θ_j)

is the geometric mean of h_j^{1/2}(θ) with respect to p(θ_{j+1}, …, θ_d|θ_1, …, θ_j). The proof of (5.28) parallels the proof of (5.4). It follows that asymptotically (5.28) is maximized if we set

p(BjiB1, ... ,Bj_l)

= c~(B1, =

... ,Bj_l)'lj;j(B1, ... ,Bj) on Kji

(5.29)

elsewhere .

If the dimension exceeds 2, say d = 3, we merely start with a compact rectangle K1i x K2i x K3i and set Pi(B3IB1, B2) = ci(B1, B2)VT:;;(O) on K3i· Then we first determine Pi(B2IB1) and then Pi(Bl) by applying (5.29) twice. Thus, Pi(B2IB1)

Pi(Bl)

=

c~(Bl)'lj;2(B1, B2) on K2i

= c~'lj;1(Bl)

on K1i·

One can also verify that the formulas obtained for d = 2 can be rederived in this way. A couple of examples follow.

Example 5.4. Let X 1, X2, ... , Xn be i.i.d. normal with mean B2 and variance B1, with B1 being the parameter of interest. Here I(O) =

( ~0 e\0) , ! 22 (0) B1 I 1

=

,

1

11

(0) = 2B 21 .

Thus Pi(B 21Bl) = ci on K 2i where c;- 1 =volume of K 2i, and therefore,

We then have for some constant di and the reference prior is taken as

144

5 Choice of Priors for Low-dimensional Parameters

This is the right invariant Haar measure unlike the Jeffreys prior which is left invariant (see Subsection 5.1.7). Example 5.5. (Berger and Bernardo, 1992b) Let (X1,X2,X3) follow a multinomial distribution with parameters (n;£h,02,{h), i.e., (XI,X2,X3) has density p(xi, x2, x3jOI, 82, 83) nl

-,--....,--,-,----·------:-:-O~I0~20~3(1-

x1! x2! x3!(n- XI - x2- x3)!

OI _ 02

3

Xi 2: O,i

=

1,2,3, l:xi :S n,

_

03 )n-x 1 -x2-x 3 ,

3

oi > o,i = 1,2,3, .z=oi < 1.

I

I

Here the information matrix is

where Diag {a I, a2 , a3} denotes the diagonal matrix with diagonal elements a I, a 2, a3 and 1 3 denotes the 3 x 3 matrix with all elements equal to one. Hence

and for j

=

With 8 [j]

= (0 I , ... , 0j) 1 ,

1,2,3

Hj(8) = n Diag {O!I, ... , Oji}

+ n(1- OI

- · · ·- Oj)-Ilj

and hj(8)

= nOji(1- OI- · · ·- Oj-I)(1- OI- · · ·- Oj)-I.

Note that hj ( 8) depends only on OI, 02 , ... , 0j so that

Here we need not restrict to compact rectangles as all the distributions involved have finite mass. As suggested above for the general case, the reference prior can now be obtained as

= 1T-Io3I 12 (1- Or- 82- 83)-I/ 2,

0 < 83 < 1- OI- 82, 2 2 p(02\0I) = Jr-Io;-I/ (1- OI- 82)-I/ , 0 < 82 < 1- OI, p(0 1 ) = Jr-Io~I/ 2 (1- Or)-I/ 2, 0 < OI < 1,

p(03!0I,02)

5.1 Different Methods of Construction of Objective Priors

145

i.e.,

p( 0) = Ir- 3e;:- 112 (1- 81) -r;ze;- 112 (1- 81 - Bz) - 1/ 2e; 112 (1- Br - Bz- 83) - 112,

ei > o, i = 1, 2, 3, I::i ei < 1.

As remarked by Berger and Bernardo (1992b), inferences about 81 based on the above prior depend only on x1 and not on the frequencies of other cells. This is not the case with standard noninformative priors such as Jeffreys prior. See in this context Berger and Bernardo (1992b, Section 3.4). 5.1.11 Reference Priors Without Entropy Maximization

Construction of reference priors involves two interesting ideas. The first is the new measure of information in a prior obtained by comparing it with the posterior. The second idea is the step by step algorithm based on arranging the parameters in ascending order of importance. The first throws light on why an objective prior would depend on the model for the likelihood. But it is the step by step algorithm that seems to help more in removing the problems associated with the Jeffreys prior. We explore below what kind of priors would emerge if we follow only part of the Berger-Bernardo algorithm (of Berger and Bernardo (1992a)). We illustrate with two (one-dimensional) parameters 81 and 82 of which 81 is supposed to be more important. This would be interpreted as meaning that the marginal prior for 81 is more important than the marginal prior for 82. Then the prior is to be written as p(B1)p(B2IB1), with

and p(B1) is to be determined suitably. Suppose, we determine p(B 1 ) from the probability matching conditions

J

P{B1::; B1,a(X)!X}

=

1-

a+ Ov(n- 1),

P{B1 ::; B1,a(X)IB1, Bz}p(Bz!B1)d82 = 1- a+ O(n- 1).

(5.30) (5.31)

Here (5.30) defines the Bayesian quantile B1,a of 81, which depends on data X and (5.31) requires that the posterior probability on the left-hand side of (5.30) matches the frequentist probability averaged out with respect to p(Bz!B1). Under the assumption that 81 and Bz are orthogonal, i.e., fr 2 (0) = Iz 1 (0) = 0, one gets (Ghosh (1994))

Pi(B1) = constant ( [

,

J~112 (0)p(B2 IB 1 )d82 ) -

1 (5.32)

2

where K1i x Kzi is a sequence of increasing bounded rectangles whose union is 8 1 x 8z. This equation shows the marginal of 81 is a (normalized) harmonic mean of ..JTl1 which equals the harmonic mean of 1/VfiT.

146

5 Choice of Priors for Low-dimensional Parameters

What about a choice of Pi(Ol) equal to the geometric or arithmetic mean? The Berger-Bernardo reference prior is the geometric mean. The marginal prior is the arithmetic mean if we follow the approach of taking weak limits of suitable discrete uniforms, vide Ghosal et al. (1997). In many interesting cases involving invariance, for example, in the case of location-scale families, all three approaches lead to the right invariant Haar measures as the joint prior.

5.1.12 Objective Priors with Partial Information Suppose we have chosen our favorite so-called noninformative prior, say Po· How can we utilize available prior information on a few moments of 0? Let p be an arbitrary prior satisfying the following constraints based on available information (5.33) If gj ( 0) = e{' ... e~p, we have the usual moments of (). We fix increasing compact rectangles Ki with union equal to 8 and among priors Pi supported on Ki and satisfying (5.33), minimize the Kullback-Leibler number

The minimizing prior is

where >.j's are hyperparameters to be chosen so as to satisfy (5.33). This can be proved by noting that for all priors Pi satisfying (5.33),

K(pi,Po) = { Pi(O) log Pi((:)) dO+ K(pi,p;) }K, Po

= constant +

L AjAj + K(pi,p;)

which is minimized at Pi = Pi. If instead of moments we know values of some quantiles for (a onedimensional) e or more generally the prior probabilities aj of some disjoint subsets Bj of 8, then it may be assumed UBj = 8 and one would use the prior given by

*(A)=""' Po(AnBj) ~a3 (B) . j Po J

P,

Sun and Berger (1998) have shown how reference priors can be constructed when partial information is available.

5.2 Discussion of Objective Priors

147

5.2 Discussion of Objective Priors This section is based on Ghosh and Samanta (2002b ). We begin by listing some of the common criticisms of objective priors. We refer to them below as "noninformative priors". 1. Noninformative priors do not exist. How can one define them? 2. Objective Bayesian analysis is ad hoc and hence no better than the ad hoc paradigms subjective Bayesian analysis tries to replace. 3. One should try to use prior information rather than waste time trying to find noninformative priors. 4. There are too many noninformative priors for a problem. Which one is to be used? 5. Noninformative priors are typically improper. Improper priors do not make sense as quantification of belief. For example, consider the uniform distribution on the real line. Let L be any large positive number. Then P{-L-:; B-:; L}/P{B rf'. (-L,L)} = 0 for all L but for a sufficiently large L, depending on the problem, we would be pretty sure that - L -:; B -:; L. 6. If has uniform distribution because of lack of information, then this should also be true for any smooth one-to-one function 7] = g(B). 7. Why should a noninformative prior depend on the model of the data? 8. What are the impacts of 7 on coherence and the likelihood principle?

e

We make a couple of general comments first before replying to each of these criticisms. The purpose of introducing an objective prior is to produce a posterior that depends more on the data than the objective prior. One way of checking this would be to compare the posteriors for different objective priors as in Example 2.2 of Chapter 2. The objective prior is only the means for producing the posterior. Moreover, objective Bayesian analysis agrees that it is impossible to define a noninformative prior on an unbounded parameter space because maximum entropy need not be finite. This is the reason that increasing bounded sets were used in the construction. One thinks of the objective priors as consensus priors with low information~ at least in those cases where no prior information is available. In all other cases, the choice of an objective prior should depend on available prior information (Subsection 5.1.12). We now turn to the criticisms individually. Points 1 and 2 are taken care of in the general comments. Point 3 is well taken, we do believe that elicitation of prior information is very important and any chosen prior should be consistent with what we know. A modest attempt towards this is made in Subsection 5.1.12. However, we feel it would rarely be the case that a prior would be fully elicited, only a few salient points or aspects with visible practical consequences can be ascertained, but subjected to this know ledge the construction of the prior would still be along the lines of Subsection 5.1.12 even though in general no explicit solution will exist. As to point 4, we have already addressed this issue in the general comments. Even though there is no unique objective prior, the posteriors will

148

5 Choice of Priors for Low-dimensional Parameters

usually be very similar even with a modest amount of data. Where this is not the case, one would have to undertake a robustness analysis restricted to the class of chosen objective priors. This seems eminently doable. Even though usually objective priors are improper, we only work with them when the posterior is proper. Once again we urge the reader to go over the general comments. We would only add that many improper objective priors lead to same posteriors as coherent, proper, finitely additive priors. This is somewhat technical, but the interested reader can consult Heath and Sudderth (1978, 1989). Point 6 is well taken care of by Jeffreys prior. Also in a given problem not all one-to-one transformations are allowed. For example, if the coordinates of (J are in a decreasing order of importance, then we need only consider "1 = (r11, ... , 'r/d) such that 'r/j is a one-to-one continuously differentiable function of ()j· There are invariance theorems for reference and probability matching priors in such cases, Datta and Ghosh (1996). We have discussed Point 7 earlier in the context of the entropy of Bernardo and Lindley. This measure depends on the experiment through the model of likelihood. Generally information in a prior cannot be defined except in the context of an experiment. Hence it is natural that a low-information prior will not be the same for all experiments. Because a model is a mathematical description of an experiment, a low-information prior will depend on the model. We now turn to the last point. Coherence in the sense of Heath and Sudderth (1978) is defined in the context of a model. Hence the fact that an objective prior depends on a model will not automatically lead to incoherence. However, care will be needed. As we have noted earlier, a right Haar prior for location-scale families ensures coherent inference but in general a left Haar prior will not. The impact on likelihood principle is more tricky. 
The likelihood principle in its strict sense is violated because the prior and hence the posterior depends on the experiment through the form of the likelihood function. However, for a fixed experiment, decision based on the posterior and the corresponding posterior risk depend only on the likelihood function. We pursue this a bit more below. Inference based on objective priors does violate the stopping rule principle, which is closely related to the likelihood principle. In particular, in Example 1.2 of Carlin and Louis (1996), originally suggested by Lindley and Phillips (1976), one would get different answers according to a binomial or a negative binomial model. This example is discussed in Chapter 6. To sum up we do seem to have good answers to most of the criticisms but have to live with some violations of the likelihood and stopping rule principles.

5.4 Elicitation of Hyperparameters for Prior

149

5.3 Exchangeability A sequence of real valued random variables {Xi} is exchangeable if for all n, all distinct suffixes {i1, i2, ... , in} and all B1, B2, ... , Bn C R,

In many cases, a statistician will be ready to assume exchangeability as a matter of subjective judgment. Consider now the special case where each Xi assumes only the values 0 and 1. A famous theorem of de Finetti then shows that the subjective judgment of exchangeability leads to both a model for the likelihood and a prior.

Theorem 5.6. (de Finetti) If Xi's are exchangeable and assume only values 0 and 1, then there exists a distribution II on (0, 1) such that the joint distribution of X 1 , ... , Xn can be represented as

1fr 1

P(Xl

= xl,.

,Xn

= Xn) =

exi(1- 8) 1 -Xi dii(B).

i=l

e,

e

This means Xi's can be thought of as i.i.d. B(1, B) variables, given where has the distribution II. For a proof of this theorem and other results of this kind, see, for example, Bernardo and Smith (1994, Chapter 4). The prior distribution II can be determined in principle from the joint distribution of all the Xi's, but one would not know the joint distribution of all the Xi's. If one wants to actually elicit II, one could ask oneself what is one's subjective predictive probability P{Xi+l = 1IX1 , X 2 , ... , Xi}. Suppose the subjective predictive probability is (o: + 'EXi)/(o: + f3 + i) where o: > 0, /3 > 0. Then the prior for is the Beta distribution with hyperparameters o: and /3. Nonparametric elicitations of this kind are considered in Fortini et al. (2000).

e

5.4 Elicitation of Hyperparameters for Prior Although a full prior is not easy to elicit, one may be able to elicit hyperparameters in an assumed model for a prior. We discuss this problem somewhat informally in the context of two examples, a univariate normal and a bivariate normal likelihood. Suppose X 1 , X 2 , ... , Xn are i.i.d. N(p,, 0' 2 ) and we assume a normal prior for p, given 0' 2 and an inverse gamma prior for 0' 2 • How do we choose the hyperparameters? We think of a scenario where a statistician is helping a subject matter expert to articulate his judgment. Let p(p,I0' 2) be normal with meanT) and variance c20' 2 where cis a constant. The hyperparameter T) is a prior guess for the mean of X. The statistician has to make it clear that what is involved is not a guess about p, but a guess about

150

5 Choice of Priors for Low-dimensional Parameters

the mean of f.L· So the expert has to think of the mean JL itself as uncertain and subject to change. Assuming that the expert can come up with a number for ry, one may try to elicit a value for c in two different ways. If the expert can assign a range of variation for JL (given r>) and is pretty sure ry will be in this range, one may equate this to ry ± 3cr>, ry would be at the center of the range and distance of upper or lower limit from the center would be 3cr>. To check consistency, one would like to elicit the range for several values of r> and see if one gets nearly the same c. In the second method for eliciting c, one notes that c2 determines the amount of shrinking of X to the prior guess ry in the posterior mean

E( IX) f.1

=

nx+ CJ(i'I'TJ 1

(;2

..!!o._ a-2

+-

1c2 a-2

=

x + 'rJ ;2 c

n n

+ 1/c2

(vide Example 2.1 of Section 2.2). Thus if c 2 = 1/n, X and ry have equal weights. If c2 = 5/n, the weight of ry diminishes to one fifth of the weight for X. In most problems one would not have stronger prior belief. We now discuss elicitation of hyperparameters for the inverse Gamma prior for r> 2 given by 2 1 1 -1/(,8a2) p(r> ) - r(a)f3a (r>2)a+1 e .

The prior guess for r> 2 is [/3( a - 1) J- 1 . This is likely to be more difficult to elicit than the prior guess ry about JL. The shape parameter a can be elicited by deciding how much to shrink the Bayes estimate of r> 2 towards the prior guess [/3( a - 1) J- 1 . Note that the Bayes estimate has the representation 2X E( (J I ) =

a- 1

1 a- 1 + n/2 f3(a- 1)

+

(n- 1)/2 a- 1 + n/2

s2 +

2

n(X- ry) -,------'------,.-,-------;:-:(2a + n- 2)(1 + nc2 )

where (n- 1)s 2 = 2:(Xi- X) 2 . In order to avoid dependence on X one may want to do the elicitation based on

E( 2l 8 2)

rY

=

a- 1 1 (n- 1)/2 2 8 a- 1 + (n- 1)/2 f3(a- 1) +a- 1 + (n- 1)/2 ·

The elicitation of prior for JL and r>, specially the means and variances of priors, may be based on examining related similar past data. We turn now to i.i.d. bivariate normal data (Xi, Yi), i = 1, 2, ... , n. There are now five parameters, (JLx, rYk), (JLY, r>~) and the correlation coefficient p. Also E(YIX = x) = (30 + (3 1 x, Var(YIX = x) = r> 2, where r> 2 = r>~(1- p2 ),

fJ1 = pr>y/r>x,

f3o

= f.LY-

(wy/r>x)JLx.

One could reparameterize in various ways. We adopt a parameterization that is appropriate when prediction of Y given X is a primary concern. We consider (JLx, rYk) as parameters for the marginal distribution of X, which may

5.4 Elicitation of Hyperparameters for Prior

151

be handled as in the univariate case. We then consider the three parameters (17 2, (3 0 , (31) of the conditional distribution of Y given X= x. The joint density may be written as a product of the marginal density of X, namely N(J.1x, O"i) and the conditional density of Y given X= x, namely N((30 + (3 1 x, 0" 2). The full likelihood is

It is convenient to rewrite f3o + f31xi as lo + 11 (xi - x), with 11 = fJ1, lo = (30+ rl x. Suppose we think of xi's to be fixed. We concentrate on the second factor to find its conjugate prior. The conditional likelihood given the Xi's is

1 1 - - ] n exp{ - -- "(yi-'Yo-'YI (xi -x)) 2 - ~('Yo-lo) 2 - Sxx (1'1 -11) 2} [ V21i(J" 20"2 ~ 20"2 20"2 where 'Yo= fj, 1'1 = ~(Yi- y)(xi- x)/sxx, and Sxx =~(xi- x) 2. Clearly, the conjugate prior for 0" 2 is an inverse Gamma and the conjugate priors for lo and 11 are independent normals whose parameters may be elicited along the same lines as those for the univariate normal except that more care is needed. The statistician could fix several values of x and invite the expert to guess the corresponding values of y. A straight line through the scatter plot will yield a prior guess on the linear relation between x and y. The slope of this line may be taken as the prior mean for (31 and the intercept as the prior mean for (3 0 . These would provide prior means for /o, /l (for the given values of Xi's in the present data). The prior variances can be determined by fixing how much shrinkage towards a prior mean is desired. Suppose that the prior distribution of 0" 2 has the density

Given 0" 2, the prior distributions for lo and 1 1 are taken to be independent normals N(J.1o, c60" 2) and N(J.11, ci0" 2) respectively. The marginal posterior distributions of lo and 1 1 with these priors are Student's t with posterior means given by

(5.34) (5.35) As indicated above, these expressions may be used to elicit the values of c0 and c 1 . For elicitation of the shape parameter a of the prior distribution of 0" 2 we may use similar representation of E(0" 2 Ix,y). Note that the statistics 8 2 = ~ (Yi - ro - ''h (Xi - x)) 2, 1o and "11 are jointly sufficient for (17 2, ro, 11).

152

5 Choice of Priors for Low-dimensional Parameters

Table 5.1. Data on Water Flow (in 100 Cubic Feet per Second) at Two Points (Libby and Newgate) in January During 1931-43 in Kootenai River (Ezekiel and Fox, 1959) Year 1931 32 33 34 35 36 37 38 39 40 41 42 43 Newgate(y) 19.7 18.0 26.1 44.9 26.1 19.9 15.7 27.6 24.9 23.4 23.1 31.3 23.8 Libby(x) 27.1 20.9 33.4 77.6 37.0 21.6 17.6 35.1 32.6 26.0 27.6 38.7 27.8

As in the univariate normal case considered above we do the elicitation based on E(a2[S2) = a- 1 1 + (n/2)- 1 o- 2 (5.36) (n/2)+a-2b(a-1) (n/2)+a-2 where a2 example.

= 8 2 /(n-

2) is the classical estimate of a 2 . We illustrate with an

Example 5. 7. Consider the bivariate data of Table 5.1 (Ezekiel and Fox, 1959). This relates to water flow in Kootenai River at two points, Newgate (British Columbia, Canada) and Libby (Montana, USA). A dam was being planned on the river at Newgate, B.C., where it crossed the Canadian border. The question was how the flow at Newgate could be estimated from that at Libby. Consider the above setup for this set of bivariate data. Calculations yield x = 32.5385, Bxx = 2693.1510, i'o = 24.9615, i'1 = 0.4748 and a2 = 3.186 so that the classical (least squares) regression line is y = 24.9615 + 0.4748(x-

x),

i.e., y = 9.5122 + 0.4748x. Suppose now that we have similar past data D for a number of years, say, the previous decade, for which the fitted regression line is given by y = 10.3254 + 0.4809x

with an estimate of error variance (a 2 ) 3.9363. We don't present this set of "historical data" here, but a scatter plot is shown in Figure 5.1. As suggested above, we may take the prior means for {30 , {31 and a 2 to be 10.3254, 0.4809, and 3.9363, respectively. We, however, take these to be 10.0, 0.5, and 4.0 as these are considered only as prior guesses. This gives flo = 10 + 0.5x = 26.2693 and fll = 0.5. Given that the historical data set D was obtained in the immediate past (before 1931), we have considerable faith in our prior guess, and as indicated above, we set the weights for the prior means flo, fl 1, and 1/(b(a-1)) of J'o, ')' 1 , and a 2 in (5.34), (5.35), and (5.36) equal to 1/6 so that the ratio of the weights for the prior estimate and the classical estimate is 1 : 5 in each case. Thus we set

S XX+

-2

Cl

a -1 (n/2)+a-2

1 6

5.4 Elicitation of Hyperparameters for Prior

153

which yields, with n = 13 and Bxx = 2693.151 for the current data, c0 2 = 2.6, c1 2 = 538.63, and a = 2.1. If, however, the data set D was older, we would attach less weight (less than 1/6) to the prior means. Our prior guess for IJ 2 is 1/(b(a- 1)). Equating this to 4.0 we get b = 0.2273. Now from (5.34)-(5.36) we obtain the Bayes estimates of /o, 1 1 and IJ 2 as 25.1795, 0.4790 and 3.3219 respectively. The estimated regression line is y

i.e., y

= 25.1795 + 0.4790(x- x), = 9.5936 + 0.4790x.

The scatter plots for the current Kootenai data of Table 5.1 and the historical data D as well as the classical and Bayesian regression line estimates derived above are shown in Figure 5.1. The symbols "o" for current and "·" for historical data are used here. The continuous line stands for the Bayesian regression line based on the current data and the prior, and the broken line represents the classical regression line based on the current data. The Bayesian line seems somewhat more representative of the whole data set than the classical regression line, which is based on the current data only. If one fits a classical regression line to the whole data set, it will attach equal importance to the current and the historical data; it is a choice between all or nothing. The Bayesian method has the power to handle both current data and other available information in a flexible way. The 95% HPD credible intervals for ro and 1 1 based on the posterior t-distributions are respectively (21.4881, 28.8708) and (0.2225, 0.7355), which are comparable with the classical 95% confidence intervals (23.8719, 26.0511) for ro and (0.3991, 0.5505) for 1 1 . Note that, as expected, the Bayesian intervals are more conservative than the classical ones, the Bayesian providing for the possible additional variation in the parameters. If one uses the objective prior P(/o,/ 1 ,1J 2) ex 1/IJ 2, the objective Bayes solutions would agree with the classical estimates and confidence intervals. All of the above would be inapplicable if x and y have the same footing and the object is estimation of the parameters in the model rather than prediction of unobserved y's for given x's. In this case, one would write the bivariate normal likelihood

J1 n

1 _2_7r_IJ_x_IJ_y_v--:=1=-=P=2

1 ((xi-J-Lx) x exp { - 2( 2) 2 1- p !Jx

2

+ (yi-J-LY) 2 !Jy

2 - 2p

(xi-J-Lx)(yi-J-LY))} tJx

!Jy

The conjugate prior is a product of a bivariate normal and an inverse-Wishart distribution. Notice that we have discussed elicitation of hyperparameters for a conjugate prior for several parameters. Instead of substituting these elicited values in the conjugate prior, we could treat the hyperparameters as having

.

154

5 Choice of Priors for Low-dimensional Parameters

/

Bayesian regression line

/

/ /

Classical regression line

/ /

Current data

/

/ /

Historical data / / /

y

0 y

y

y

y

y

y

y

y

y

y

/

/

//

/

/

/

y

.0 ,"" ""

o "o L()

"'

"'

0 ¥' ¥'

L()

20

30

40

50

60

70

80

X

Fig. 5.1. Scatter plots and regression lines for the Kootenai River data.

a prior distribution over a set of values around the elicited numbers. The prior distribution for the hyperparameters could be a uniform distribution on the set of values around the elicited numbers. This would be a hierarchical prior. An alternative would be to use several conjugate priors with different hyperparameter values from the set around the elicited numbers and check for robustness. Elicitation of hyperparameters of a conjugate prior for a linear model is treated in Kadane et al. (1980), Garthwaite and Dickey (1988, 1992), etc. A recent review is Garthwaite et al. (2005). Garthwaite and Dickey (1988) observe that the prior variance-covariance matrix of the regression coefficients, specially the off-diagonal elements of the matrix, are the most difficult to elicit. We have assumed the off-diagonal elements are zero, a common simplifying assumption, and determined the diagonal elements by eliciting how much

5.5 A New Objective Bayes Methodology Using Correlation

155

shrinkage towards the prior is sought in the Bayes estimates of the means of the regression coefficients. Garthwaite and Dickey (1988) indicate an indirect way of eliciting the variance-covariance matrix.

5.5 A New Objective Bayes Methodology Using Correlation As we have already seen, there are many approaches for deriving objective, reference priors and also for conducting default Bayesian analysis. One such approach that relies on some new developments is discussed here. Using the Pearson correlation coefficient in a rather different way, DasGupta et al. (2000) and Delampady et al. (2001) show that some of its properties can lead to substantial developments in Bayesian statistics. Suppose X is distributed with density f(xiB) and has a prior 1r. Let the joint probability distribution of X and under the prior 1r be denoted by P. We can then consider the Pearson correlation coefficient pP between two functions g 1 (X, B) and g 2 (X, B) under this probability distribution P. An objective prior in the spirit of reference prior can then be derived by maximizing the correlation between two post-data summaries about the parameter namely the posterior density and the likelihood function. Given a class of priors, Delampady et al. (2001) show that the prior 7r that maximizes pP{f(xiB), n(Bix)} is the one with the least Fisher information J( 1r) = En { ddli log n( 8)} 2 in the class Actually, Delampady et al. (2001) note that it is very difficult to work with the exact correlation coefficient pP{f(xiB), n(Bix)} and hence they maximize an appropriate large sample approximation by assuming that the likelihood function and the prior density are sufficiently smooth. The following example is from Delampady et al. (2001).

e

e

e,

r

r.

e

with IBI ::; 1. Assume that and 1r are sufficiently smooth. Then the prior density which achieves the minimum Fisher information in the class of priors compactly supported on [-1, 1] is what is desired. Bickel (1981) and Huber (1974) show that this prior

Example 5.8. Consider a location parameter

f

lS

n(B)

= {

2

cos (n8/2), if IBI ::;_1; 0, otherwise.

Thus, the Bickel prior is the default prior under the correlation criterion. The variational result that obtains this prior as the one achieving the minimum Fisher information was rediscovered by Ghosh and Bhattacharya in 1983 (see Ghosh (1994); a proof of this interesting result is also given there). In addition to the above problem, it is also possible to identify a robust estimate of using this approach. Suppose n is a reference prior, and is Let 5v be the a class of plausible priors. n may or may not belong to Bayes estimate of with respect to prior v E F. To choose an optimal Bayes

e

e

r.

r

156

5 Choice of Priors for Low-dimensional Parameters Table 5.2. Values of 8,.(X) and 8,(X) for Various X= x X 0 .5 1 1.5 2 3 5 8 10 15 8,.(x) 0 .349 .687 1.002 1.284 1.735 2.197 2.216 2.065 1.645 8,(x) 0 .348 .683 .993 1.267 1.685 1.976 1.60 1.15 .481

estimate, maximize (over v E r) the correlation coefficient p P ( (}, Jv) between (} and Jv. Note that pP is calculated under P, and in this joint distribution (} follows the reference prior 1r. Again, by maximizing an appropriate large sample approximation pP(B,Jv) (assuming that the likelihood function and the prior density are sufficiently smooth), Delampady et al. (2001) obtain the following theorem. Theorem 5.9. The estimate Jv(X) maximizing pP(B, Jv) is Bayes with respect to the prior density (5.37)

where /1, T 2 are arbitrary and c is a normalizing constant. The interesting aspect of this reference prior v is evident from the fact that it is simply a product of the initial reference prior 1r and a Gaussian factor. This may be interpreted as v being the posterior density when one begins with a flat prior 1r and it is revised with an observation (} from the Gaussian distribution to pull in its tails. Consider the following example again from Delampady et al. (2001).

Example 5.10. Consider the reference prior density 1r(B) ex (1 + (} 2 /3)- 2 , density of the Student's t 3 prior, a flat prior. Suppose that the family r contains only symmetric priors and so v(B) is of the form v(B) = c(1 + B2 j3)- 2 exp{-B 2 /(2T 2 )}. Let X'"" Cauchy(B,a), with known a and having density

Some selected values are reported in Table 5.2 for a = 0.2. For small and moderate values of x, Jrr and Jv behave similarly, whereas for large values, Jv results in much more shrinkage than Jrr. This is only expected because the penultimate v has normal tails, whereas the reference 1r has flat tails.

5.6 Exercises 1. Let X'"" B(n,p). Choose a prior on p such that the marginal distribution of X is uniform on {0, 1, ... , n }.

5.6 Exercises

157

2. (Schervish (1995, p. 121)) Let h be a function of (x, θ) that is differentiable in θ. Define a prior

p*(θ) = [Var_θ((∂/∂θ) h(X, θ))]^{1/2}.

(a) Show that p*(θ) satisfies the invariance condition (5.1).
(b) Choose a suitable h such that p*(θ) is the Jeffreys prior.

3. Prove (5.16) and generalize it to the multiparameter case.

4. (Lehmann and Casella (1998)) For a scale family, show that there is an equivariant estimate of σ^k that minimizes E(T − σ^k)²/σ^{2k}. Display the estimate as the ratio of two integrals and interpret it as a Bayes estimate.

5. Consider a multinomial distribution.
(a) Show that the Dirichlet distribution is a conjugate prior.
(b) Identify the precision parameter for the Dirichlet prior distribution.
(c) Let the precision parameter go to zero and identify the limiting prior and posterior. Suggest why the limiting posterior, but not the limiting prior, is used in objective Bayesian analysis.

6. Find the Jeffreys prior for the multinomial model.

7. Find the Jeffreys prior for the multivariate normal model with unknown mean vector and dispersion matrix.

8. (a) Examine why the Jeffreys prior may not be appropriate if the parameter is not identifiable over the full parameter space.
(b) Show with an example that the Jeffreys prior may not have a proper posterior. (Hint: Try the following mixture: X = 0 with probability 1/2 and is N(μ, 1) with probability 1/2.)
(c) Suggest a heuristic reason as to why the posterior is often proper if we use a Jeffreys prior.

9. Bernardo has recently proposed the use of min{K(f₀, h), K(h, f₀)}, instead of K(f₀, h), as the criterion to maximize at each stage of a reference prior. Examine the consequence of this change for the examples of reference priors discussed in this chapter.

10. Given (μ, σ²), let X₁, ..., Xₙ be i.i.d. N(μ, σ²) and consider the prior π(μ, σ²) ∝ 1/σ². Verify that

P{X̄ − t_{α/2,n−1} s/√n ≤ μ ≤ X̄ + t_{α/2,n−1} s/√n | μ, σ²}
= P{X̄ − t_{α/2,n−1} s/√n ≤ μ ≤ X̄ + t_{α/2,n−1} s/√n | X₁, ..., Xₙ}
= 1 − α,

where X̄ is the sample mean, s² = Σ(Xᵢ − X̄)²/(n − 1), and t_{α/2,n−1} is the upper α/2 point of t_{n−1}.

A similar approximation is also commonly used, where instead of the posterior mode θ̂ᵢ, the maximum likelihood estimate θ̃ᵢ is employed. Then, instead of (6.8), one obtains (6.9).

Here H_{θ̃} is the observed Fisher information matrix evaluated at the maximum likelihood estimator. If the observations are i.i.d., we have that H_{θ̃} = n H_{1,θ̃}, where H_{1,θ̃} is the observed Fisher information matrix obtained from a single observation. In this case,

and hence

2 log(B₀₁) ≈ 2 log{ f(x|θ̃₀) / f(x|θ̃₁) } + 2 log{ g₀(θ̃₀) / g₁(θ̃₁) } − (p₀ − p₁) log(n/(2π)) + log{ |H⁻¹_{1,θ̃₀}| / |H⁻¹_{1,θ̃₁}| }.   (6.10)

An approximation to (6.10) correct to O(1) is

2 log(B₀₁) ≈ 2 log{ f(x|θ̃₀) / f(x|θ̃₁) } − (p₀ − p₁) log n.   (6.11)

This is the approximate Bayes factor based on the Bayesian information criterion (BIC) due to Schwarz (1978). The term (p₀ − p₁) log n can be considered a penalty for using a more complex model.


A related criterion is

2 log{ f(x|θ̃₀) / f(x|θ̃₁) } − 2(p₀ − p₁),   (6.12)

which is based on the Akaike information criterion (AIC), namely,

AIC = 2 log f(x|θ̃) − 2p

for a model f(x|θ). The penalty for using a complex model is not as drastic as that in BIC. A Bayesian interpretation of AIC for high-dimensional prediction problems is presented in Chapter 9. Problem 16 of Chapter 9 invites you to explore if AIC is suitable for low-dimensional testing problems.
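The BIC and AIC penalties in (6.11) and (6.12) are easy to compare numerically. The sketch below is our own illustration (the data, and the normal-mean testing problem H₀: θ = 0 versus θ free, are hypothetical choices, not from the text); it computes both criteria for n i.i.d. N(θ, 1) observations.

```python
import math

def log_bf_bic_aic(loglik0, loglik1, p0, p1, n):
    """Approximations (6.11)-(6.12) to 2*log(B01) for nested models."""
    lr = 2.0 * (loglik0 - loglik1)          # 2 log{f(x|theta0_mle)/f(x|theta1_mle)}
    bic = lr - (p0 - p1) * math.log(n)      # BIC-based: penalty (p0 - p1) log n
    aic = lr - 2.0 * (p0 - p1)              # AIC-based: milder penalty 2 (p0 - p1)
    return bic, aic

# Hypothetical data: n i.i.d. N(theta, 1); H0: theta = 0 (p0 = 0) vs H1: theta free (p1 = 1).
xs = [0.8, -0.2, 0.5, 1.1, 0.3, 0.9, -0.1, 0.6]
n = len(xs)
xbar = sum(xs) / n
loglik0 = -0.5 * sum(x * x for x in xs)             # log-likelihood at theta = 0 (constants dropped,
loglik1 = -0.5 * sum((x - xbar) ** 2 for x in xs)   # since they cancel in the ratio)
bic, aic = log_bf_bic_aic(loglik0, loglik1, 0, 1, n)
```

Since here p₀ − p₁ = −1, the BIC criterion exceeds the AIC one by log n − 2 whenever n > e², reflecting the heavier BIC penalty on the complex model.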

6.2 P-value and Posterior Probability of H₀ as Measures of Evidence Against the Null

One particular tool from classical statistics that is very widely used in applied sciences for model checking or hypothesis testing is the P-value. It also happens to be one of the concepts that is highly misunderstood and misused. The basic idea behind R.A. Fisher's (see Fisher (1973)) original (1925) definition of P-value given below did have a great deal of appeal: it is the probability, under a (simple) null hypothesis, of obtaining a value of a test statistic that is at least as extreme as that observed in the sample data. Suppose that it is desired to test

H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀,   (6.13)

and that a classical significance test is available and is based on a test statistic T(X), large values of which are deemed to provide evidence against the null hypothesis. If data X = x is observed, with corresponding t = T(x), the P-value then is

α = P_{θ₀}(T(X) ≥ T(x)).

Example 6.2. Suppose we observe X₁, ..., Xₙ i.i.d. from N(θ, σ²), where σ² is known. Then X̄ is sufficient for θ, and it has the N(θ, σ²/n) distribution. Noting that T = T(X) = |√n(X̄ − θ₀)/σ| is a natural test statistic to test (6.13), one obtains the usual P-value as α = 2[1 − Φ(t)].

Here Q > 0 denotes that Q is positive definite.

Proof. Note that

sup_{g ∈ G_UES}

m_g(x) = sup_{g ∈ G_UES} ∫ f(x|θ) g(θ) dθ
       = sup_{h,Q} ∫ f(x|θ) h(θ′Qθ) |Q|^{1/2} dθ
       = sup_{Q>0} { sup_h ∫ f(x|Q^{−1/2}u) h(u′u) du }   (6.24)
       = sup_{Q>0} { sup_{k>0} (1/V(k)) ∫_{‖u‖≤k} f(x|Q^{−1/2}u) du }  (V(k) denoting the volume of {u : ‖u‖ ≤ k}),

because the maximization of the inside integral over non-increasing h in (6.24) is the same as maximization of that integral over the class of unimodal spherically symmetric densities, and hence Lemma 6.4 applies. □

Consider the above result in the context of Example 6.8. The lower bounds on the Bayes factor as well as the posterior probability of the null hypothesis will be substantially lower if we use the class G_UES rather than G_USP. This is immediate from (6.23), because the lower bounds over G_USP correspond with the maximum in (6.23) with Q = I. The result also questions the suitability of G_UES for these lower bounds, in view of the fact that the lower bounds will correspond with prior densities that are extremely biased towards H₁.

Many other esoteric classes of prior densities have also been considered by some for deriving lower bounds. In particular, generalization of symmetry from the single-parameter case to the multiparameter case has been examined. DasGupta and Delampady (1990) consider several subclasses of the symmetric star-unimodal densities. Some of these are mixtures of uniform distributions on ℓ_p balls (for p = 1, 2, ∞), the class of distributions with components that are independent symmetric unimodal distributions, and a certain subclass of one-symmetric distributions. Note that mixtures of uniform distributions on ℓ₂ balls are simply unimodal spherically symmetric distributions, whereas mixtures of uniform distributions on ℓ₁ balls contain distributions whose components are i.i.d. exponential. Uniform distributions on hypercubes form a subclass of mixtures of uniforms on ℓ_∞ balls. Also considered there is the larger subclass consisting of distributions whose components are identical symmetric unimodal distributions. Another class of one-symmetric distributions considered there is of interest because it contains distributions whose components are i.i.d. Cauchy.
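The reduction in (6.24) can be illustrated numerically: for a fixed Q (here Q = I, i.e., the class G_USP), the supremum of m_g(x) over unimodal spherically symmetric g is attained among uniform priors on balls {‖u‖ ≤ k}. The following Monte Carlo sketch, assuming p = 2 and a hypothetical observation (our own illustration; the grid and simulation sizes are arbitrary choices), approximates the resulting lower bound on the Bayes factor f(x|0)/sup_g m_g(x).

```python
import math
import random

def normal_pdf_2d(x, mu):
    """N(mu, I) density at x in dimension p = 2."""
    d2 = (x[0] - mu[0]) ** 2 + (x[1] - mu[1]) ** 2
    return math.exp(-d2 / 2.0) / (2.0 * math.pi)

def m_uniform_ball(x, k, n_sim=20000, seed=7):
    """m_g(x) for g uniform on the disc {||u|| <= k}: equals the average of
    f(x|u) over uniform draws u, since (1/V(k)) * integral = E_unif[f(x|u)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        r = k * math.sqrt(rng.random())      # radius of a uniform point in the disc
        phi = 2.0 * math.pi * rng.random()
        total += normal_pdf_2d(x, (r * math.cos(phi), r * math.sin(phi)))
    return total / n_sim

x = (2.0, 1.0)                               # hypothetical observation, p = 2
m_sup = max(m_uniform_ball(x, 0.2 * j) for j in range(1, 41))  # crude sup over k
B_lower = normal_pdf_2d(x, (0.0, 0.0)) / m_sup                 # lower bound over G_USP
```

The same seed is reused for every k (common random numbers), so the maximization over the grid of radii is smooth rather than noise-driven.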
Even though studies such as these are important from robustness considerations, we feel that they do not necessarily add to our understanding of possible interpretation of P-values from a robust Bayesian point of view. However, interested readers will find that Dharmadhikari and Joag-Dev (1988) is a good source for multivariate unimodality, and Fang et al. (1990) is a good reference for multivariate symmetry for material related to the classes mentioned above.

We have noted earlier that computation of the Bayes factor and posterior probability is difficult when parametric alternatives are not available. Many frequentist statisticians claim that P-values are valuable when there are no alternatives explicitly specified, as is common with tests of fit. We consider this issue here for a particularly common test of fit, the chi-squared test of goodness of fit. It will be observed that alternatives do exist implicitly, and hence Bayes factors and posterior probabilities can indeed be computed. The following results from Delampady and Berger (1990) once again point out the discrepancies between P-values and Bayesian measures of evidence.

Example 6.10. Let n = (n₁, n₂, ..., n_k) be a sample of fixed size N = Σᵢ₌₁ᵏ nᵢ from a k-cell multinomial distribution with unknown cell probabilities p = (p₁, p₂, ..., p_k) and density (mass) function

f(n|p) = (N! / (n₁! n₂! ⋯ n_k!)) ∏ᵢ₌₁ᵏ pᵢ^{nᵢ}.

Consider testing

H₀ : p = p⁰ versus H₁ : p ≠ p⁰,

where p⁰ = (p₁⁰, p₂⁰, ..., p_k⁰) is a specified interior point of the k-dimensional simplex. Instead of focusing on the exact multinomial setup, the most popular approach is to use the chi-squared approximation. Here the test statistic of interest is

χ² = Σᵢ₌₁ᵏ (nᵢ − Npᵢ⁰)² / (Npᵢ⁰),

which has the asymptotic distribution (as N → ∞) of χ²_{k−1} under H₀. To compare P-values so obtained with the corresponding robust Bayesian measures of evidence, the following are two natural classes of prior distributions to consider.

(i) The conjugate class G_C of Dirichlet priors with density

g_α(p) ∝ ∏ᵢ₌₁ᵏ pᵢ^{αᵢ − 1},

where the αᵢ > 0 satisfy αᵢ / Σⱼ₌₁ᵏ αⱼ = pᵢ⁰, so that the prior mean is the null value p⁰.

(ii) Consider the following transform of (p₁, p₂, ..., p_{k−1})′:

u = u(p) = ( (p₁ − p₁⁰)/p₁, (p₂ − p₂⁰)/p₂, ..., (p_{k−1} − p⁰_{k−1})/p_{k−1} )′ + ( (p_k⁰ − p_k)/(√p_k + p_k⁰) ) ( √p₁, √p₂, ..., √p_{k−1} )′.

The justification (see Delampady and Berger (1990)) for using such a transform is that its range is ℝ^{k−1}, unlike that of p, and its likelihood function is more symmetric and closer to a multivariate normal. Now let

G_USP = {unimodal g*(u) that are spherically symmetric about 0},


and consider the class of prior densities g obtained by transforming back to the original parameter:

G_TUSP = { g(p) = g*(u(p)) |∂u(p)/∂p| }.

Delampady and Berger (1990) show that as N → ∞, the lower bounds on Bayes factors over G_C and G_TUSP converge to those corresponding with the multivariate normal testing problem (chi-squared test) in Example 6.8, thus proving that irreconcilability of P-values and Bayesian measures of evidence is present in goodness of fit problems as well.

Additional discussion of the multinomial testing problem with mixtures of conjugate priors can be found in Good (1965, 1967, 1975). Edwards et al. (1963) discuss the possibility of finding lower bounds on Bayes factors over the conjugate class of priors for the binomial problem. Extensive discussion of the binomial problem and further references can be found in Berger and Delampady (1987).
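For reference, the classical side of Example 6.10 is straightforward to compute. A small sketch with hypothetical counts (k = 4 cells, uniform null, our own illustration) evaluates the chi-squared statistic and its asymptotic P-value, using the closed-form χ² survival function available for 3 degrees of freedom.

```python
import math

def chisq_stat(counts, p0):
    """Pearson chi-squared statistic of Example 6.10."""
    N = sum(counts)
    return sum((n_i - N * q) ** 2 / (N * q) for n_i, q in zip(counts, p0))

def chisq_sf_3df(x):
    """P(chi-squared_3 > x): closed form valid only for 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

# Hypothetical counts: N = 80 observations over k = 4 cells; uniform null p0.
counts = [14, 22, 25, 19]
p0 = [0.25, 0.25, 0.25, 0.25]
x2 = chisq_stat(counts, p0)
p_value = chisq_sf_3df(x2)      # asymptotic P-value on k - 1 = 3 d.f.
```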

6.3.4 Invariant Tests¹

A natural generalization of the symmetry assumption (on the prior distribution) is invariance under a group of transformations. Such a generalization and many examples can be found in Delampady (1989a). A couple of those examples will be discussed below to show the flavor of the results. The general results that utilize sophisticated mathematical arguments will be skipped, and instead interested readers are referred to the source indicated above. For a good discussion on invariance of statistical decision rules, see Berger (1985a).

Recall that the random observable X takes values in a space X and has density (mass) function f(x|θ). The unknown parameter is θ ∈ Θ ⊆ ℝⁿ, for some positive integer n. It is desired to test H₀ : θ ∈ Θ₀ versus H₁ : θ ∈ Θ₁. We assume the following in addition.

(i) There is a group G (of transformations) acting on X that induces a group Ḡ (of transformations acting) on Θ. These two groups are isomorphic (see Section 5.1.7), and elements of G will be denoted by g, those of Ḡ by ḡ.
(ii) f(gx|ḡθ) = f(x|θ)k(g) for a suitable continuous map k (from G to (0, ∞)).
(iii) ḡΘ₀ = Θ₀, ḡΘ₁ = Θ₁, ḡΘ = Θ.

In this context, the following concept of a maximal invariant is needed.

Definition. When a group G of transformations acts on a space X, a function T(x) on X is said to be invariant (with respect to G) if T(g(x)) = T(x) for all x ∈ X and g ∈ G. A function T(x) is maximal invariant (with respect to G) if it is invariant and further T(x₁) = T(x₂) implies x₁ = g(x₂) for some g ∈ G.

¹ Section 6.3.4 may be omitted at first reading.


This means that G divides X into orbits on which invariant functions are constant. A maximal invariant assigns different values to different orbits. Now from (i), we have that the actions of G and Ḡ induce maximal invariants t(X) on X and η(θ) on Θ, respectively.

Remark 6.11. The family of densities f(x|θ) is said to be invariant under G if (ii) is satisfied. The testing problem H₀ : θ ∈ Θ₀ versus H₁ : θ ∈ Θ₁ is said to be invariant under G if in addition (iii) is also satisfied.

Example 6.12. Consider Example 6.8 again and suppose X ~ N_p(θ, I). It is desired to test H₀ : θ = 0 versus H₁ : θ ≠ 0. This testing problem is invariant under the group G_O of all orthogonal transformations; i.e., if H is an orthogonal matrix of order p, then g_H X = HX ~ N_p(Hθ, I), so that ḡ_H θ = Hθ. Further, f(x|θ) = (2π)^{−p/2} exp(−(1/2)(x − θ)′(x − θ)), and

f(g_H x | ḡ_H θ) = (2π)^{−p/2} exp(−(1/2)(Hx − Hθ)′(Hx − Hθ))
               = (2π)^{−p/2} exp(−(1/2)(x − θ)′(x − θ))
               = f(x|θ),

so that (ii) is satisfied. Also, ḡ_H 0 = 0, and (iii) too is satisfied.

Example 6.13. Let X₁, X₂, ..., Xₙ be a random sample from N(θ, σ²) with both θ and σ unknown. The problem is to test the hypothesis H₀ : θ = 0 against H₁ : θ ≠ 0. A sufficient statistic for (θ, σ) is X = (X̄, S), with X̄ = Σ₁ⁿ Xᵢ/n and S = [Σ₁ⁿ (Xᵢ − X̄)²/n]^{1/2}. Then

f(x|θ, σ) = K σ^{−n} s^{n−2} exp(−n[(x̄ − θ)² + s²]/(2σ²)),

where K is a constant. Also,

X = {(x̄, s) : x̄ ∈ ℝ, s > 0}, and Θ = {(θ, σ) : θ ∈ ℝ, σ > 0}.

The problem is invariant under the group G = {g_c : c > 0}, where the action of g_c is given by g_c(x) = c(x̄, s) = (cx̄, cs). Note that f(g_c x | cθ, cσ) = c^{−2} f(x|θ, σ).

A number of technical conditions in addition to the assumptions (i)-(iii) yield a very useful representation for the density of the maximal invariant statistic t(X). Note that this density, q(t(x)|η(θ)), depends on the parameter θ only through the maximal invariant in the parameter space, η(θ).


The technique involved in the derivation of these results uses an averaging over a relevant group. The general method of this kind of averaging is due to Stein (1956), but because there are a number of mathematical problems to overcome, various different approaches were discovered, as can be seen in Wijsman (1967, 1985, 1986), Andersson (1982), Andersson et al. (1983), and Farrell (1985). For further details, see Eaton (1989), Kariya and Sinha (1989), and Wijsman (1990). The specific conditions and proofs of these results can be found in the above references. In particular, the groups considered here are amenable groups as presented in detail in Bondar and Milnes (1981). See also Delampady (1989a). The orthogonal group and the group of location-scale transformations are amenable; the multiplicative group of non-singular p × p linear transformations is not.

Let us return to the issue of comparison of P-values and lower bounds on Bayes factors and posterior probabilities (of hypotheses) in this setup. We note that it is necessary to reduce the problem by using invariance for any meaningful comparison, because the classical test statistic, and hence the computation of the P-value, are already based on this. Therefore, the natural class of priors to be used for this comparison is the class G_I of G-invariant priors; i.e., those priors π that satisfy

(iv) π(A) = π(Ag).

Theorem 6.14. If G is a group of transformations satisfying certain regularity conditions (see Delampady (1989a)),

B(G_I, x) = inf_{η₁ ∈ Θ₀/Ḡ} q(t(x)|η₁) / sup_{η₂ ∈ Θ₁/Ḡ} q(t(x)|η₂),   (6.25)

where Θ/Ḡ denotes the space of maximal invariants on the parameter space.

Corollary 6.15. If Θ₀/Ḡ = {0}, then under the same conditions as in Theorem 6.14,

B(G_I, x) = q(t(x)|0) / sup_{η ∈ Θ/Ḡ} q(t(x)|η).

Example 6.16. (Example 6.12, continued.) Consider the class of all priors that are invariant under orthogonal transformations, and note that this class is simply the class of all spherically symmetric distributions. Now, application of Corollary 6.15 yields

B(G_I, x) = q(t(x)|0) / q(t(x)|η̂),

where q(t|η) is the density of a noncentral χ² random variable with p degrees of freedom and non-centrality parameter η, and η̂ is the maximum likelihood estimate of η from data t(x). For selected values of p, the lower bounds B and P (for π₀ = 0.5) are tabulated against their P-values in Table 6.4.


Table 6.4. Invariant Test for Normal Means

        α = 0.01          α = 0.05
 p      B       P         B       P
 3    .0749   .0697     .2913   .2256
 4    .0745   .0693     .2903   .2250
15    .0737   .0686     .2842   .2213
20    .0734   .0684     .2821   .2200
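The entries of Table 6.4 can be checked numerically from Corollary 6.15: B is the ratio of the central χ²_p density to the maximized noncentral χ²_p density at the observed critical value. The sketch below is our own pure-Python illustration, assuming the usual critical value t = 7.815 for p = 3, α = 0.05 and using a grid search as a crude stand-in for the MLE of η.

```python
import math

def chi2_pdf(x, df):
    """Central chi-squared density."""
    return x ** (df / 2.0 - 1.0) * math.exp(-x / 2.0) / (2.0 ** (df / 2.0) * math.gamma(df / 2.0))

def ncx2_pdf(x, df, nc, terms=150):
    """Noncentral chi-squared density via its Poisson-mixture series."""
    if nc == 0.0:
        return chi2_pdf(x, df)
    return sum(
        math.exp(-nc / 2.0 + j * math.log(nc / 2.0) - math.lgamma(j + 1.0)) * chi2_pdf(x, df + 2 * j)
        for j in range(terms)
    )

p = 3
t = 7.815                               # usual upper 5% point of chi-squared, 3 d.f.
etas = [0.05 * i for i in range(401)]   # grid search stands in for the MLE of eta
q_max = max(ncx2_pdf(t, p, eta) for eta in etas)
B_lower = ncx2_pdf(t, p, 0.0) / q_max   # B(G_I, x), Corollary 6.15
P_lower = B_lower / (1.0 + B_lower)     # lower bound on P(H0|x) when pi0 = 1/2
```

The computed pair (B_lower, P_lower) lands near the tabulated (.2913, .2256) for p = 3, α = 0.05; note that P = B/(1 + B) when π₀ = 1/2, which also ties the two columns of the table together.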

Notice that the lower bounds on the posterior probabilities of the null hypothesis are anywhere from 4 to 7 times as large as the corresponding P-values, indicating that there is a vast discrepancy between P-values and posterior probabilities. This is the same phenomenon as what was seen in Table 6.3. What is, however, interesting is that the class of priors considered here is larger and contains the one considered there, but the magnitude of the discrepancy is about the same.

Example 6.17. (Example 6.13, continued.) In the normal example with unknown variance, we have the maximal invariants t(x) = x̄/s and η(θ, σ) = θ/σ. If we define

G_I = {π : dπ(θ, σ) = h₁(η) dη dσ/σ, h₁ is any density for η},

we obtain

B(G_I, x) = q(t(x)|0) / q(t(x)|η̂),

where q(t|η) is the density of a noncentral Student's t random variable with n − 1 degrees of freedom and non-centrality parameter η, and η̂ is the maximum likelihood estimate of η. The fact that all the necessary conditions (which are needed to apply the relevant results) are satisfied is shown in Andersson (1982) and Wijsman (1967). For selected values of n, the lower bounds are tabulated along with the P-values in Table 6.5. For small values of n, the lower bounds in Table 6.5 are comparable with the corresponding P-values, whereas as n gets large the differences between these lower bounds and the P-values get larger. See also in this connection Section 6.4.

There is substantial literature on Bayesian testing of a point null. Among these are Jeffreys (1957, 1961), Good (1950, 1958, 1965, 1967, 1983, 1985, 1986), Lindley (1957, 1961, 1965, 1977), Raiffa and Schlaifer (1961), Edwards et al. (1963), Hildreth (1963), Smith (1965), Zellner (1971, 1984), Dickey (1971, 1973, 1974, 1980), Lempers (1971), Rubin (1971), Leamer (1978), Smith and Spiegelhalter (1980), Zellner and Siow (1980), and Diamond and Forrester (1983). Related work can also be found in Pratt (1965), DeGroot (1973), Dempster (1973), Dickey (1977), Bernardo (1980), Hill (1982), Shafer (1982), and Berger (1986).

Table 6.5. Test for Normal Mean, Variance Unknown

        α = 0.01          α = 0.05          α = 0.10
 n      B       P         B       P         B       P
 2    .0117   .0116     .0506   .0482     .0939   .0858
 8    .0137   .0135     .0941   .0860     .2114   .1745
12    .0212   .0208     .1245   .1107     .2309   .1876
16    .0290   .0282     .1301   .1151     .2375   .1919
32    .0327   .0317     .1380   .1213     .2478   .1986

Invariance and Minimaxity

Our focus has been on deriving bounds on Bayes factors for invariant testing problems. There is, however, a large literature on other aspects of invariant tests. For example, if the group under consideration satisfies the technical condition of amenability, and hence the Hunt-Stein theorem is valid, then the minimax invariant test is minimax among all tests. We do not discuss these results here. For details on this and other related results we would like to refer the interested readers to Berger (1985a), Kiefer (1957, 1966), and Lehmann (1986).

6.3.5 Interval Null Hypotheses and One-sided Tests

Closely related to a sharp null hypothesis H₀ : θ = θ₀ is an interval null hypothesis H₀ : |θ − θ₀| ≤ ε. Dickey (1976) and Berger and Delampady (1987) show that the conflict between P-values and posterior probabilities remains if ε is sufficiently small. The precise order of magnitude of small ε depends on the sample size n.

One may also ask similar questions of possible conflict between P-values and posterior probabilities for a one-sided null, say, H₀ : θ ≤ θ₀ versus H₁ : θ > θ₀. In the case of θ = mean of a normal, and the usual uniform prior, direct calculation shows the P-value equals the posterior probability. On the other hand, Casella and Berger (1987) show in general the two are not the same, and the P-value may be smaller or greater depending on the family of densities in the model. Incidentally, the ambiguity of an improper prior discussed in Section 6.7 does not apply to one-sided nulls. In this case the Bayes factor remains invariant if the improper prior is multiplied by an arbitrary constant.
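The one-sided agreement for the normal mean is immediate to verify: with a flat prior, the posterior probability of H₀ : θ ≤ θ₀ equals the one-sided P-value. A minimal sketch with hypothetical numbers (n, σ, θ₀, and the observed mean are our own choices):

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical numbers: n observations from N(theta, sigma^2), sigma known.
n, sigma, theta0, xbar = 25, 2.0, 0.0, 0.9
z = math.sqrt(n) * (xbar - theta0) / sigma
p_value = 1.0 - Phi(z)      # one-sided P-value, attained at the boundary theta = theta0
post_prob = Phi(-z)         # P(theta <= theta0 | data): posterior is N(xbar, sigma^2/n)
```

Under the flat prior the posterior of θ is N(x̄, σ²/n), so both quantities reduce to Φ(−z), matching the direct-calculation claim above.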

6.4 Role of the Choice of an Asymptotic Framework²

² Section 6.4 may be omitted at first reading.

This section is based on Ghosh et al. (2005). Suppose X₁, ..., Xₙ are i.i.d. N(θ, σ²), σ² known, and consider the problem of testing H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀. If instead of taking a lower bound as in the previous sections, we take a fixed prior density g₁(θ) under H₁ but let n go to ∞, then the conflict between P-values and posterior probabilities is further enhanced. Historically this phenomenon was noted earlier than the conflict with the lower bound, vide Jeffreys (1961) and Lindley (1957). Let g₁ be a uniform prior density over some interval (θ₀ − a, θ₀ + a) containing θ₀. The posterior probability of H₀ given X = (X₁, ..., Xₙ) is

P(H₀|X) = π₀ exp[−n(X̄ − θ₀)²/(2σ²)] / K,

where π₀ is the specified prior probability of H₀ and

K = π₀ exp[−n(X̄ − θ₀)²/(2σ²)] + ((1 − π₀)/(2a)) ∫_{θ₀−a}^{θ₀+a} exp[−n(X̄ − θ)²/(2σ²)] dθ.

Suppose X̄ is such that X̄ = θ₀ + z_α σ/√n, where z_α is the 100(1 − α)% quantile of N(0, 1). Then X̄ is significant at level α. Also, for sufficiently large n, X̄ is well within (θ₀ − a, θ₀ + a) because X̄ − θ₀ tends to zero as n increases. This leads to

K ≈ π₀ exp(−z_α²/2) + ((1 − π₀)/(2a)) σ√(2π/n),

and hence

P(H₀|X̄) ≈ π₀ exp(−z_α²/2) / [π₀ exp(−z_α²/2) + ((1 − π₀)/(2a)) σ√(2π/n)].
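The limiting behavior of this expression can be checked by direct numerical evaluation of the exact posterior probability; the sketch below is our own illustration, assuming π₀ = 1/2, a = 1, σ = 1, and a fixed z = 1.96 (so the P-value stays the same as n grows).

```python
import math

def post_prob_H0(n, z, pi0=0.5, a=1.0, sigma=1.0, theta0=0.0):
    """Exact P(H0 | xbar) with the uniform(theta0 - a, theta0 + a) prior under H1,
    at the just-significant point xbar = theta0 + z*sigma/sqrt(n); the integral in
    K is done by the midpoint rule."""
    xbar = theta0 + z * sigma / math.sqrt(n)
    like0 = math.exp(-n * (xbar - theta0) ** 2 / (2.0 * sigma ** 2))
    m = 20000
    h = 2.0 * a / m
    integral = h * sum(
        math.exp(-n * (xbar - (theta0 - a + (i + 0.5) * h)) ** 2 / (2.0 * sigma ** 2))
        for i in range(m)
    )
    K = pi0 * like0 + (1.0 - pi0) * integral / (2.0 * a)
    return pi0 * like0 / K

probs = [post_prob_H0(n, 1.96) for n in (10, 100, 10000, 1000000)]
```

The sequence of posterior probabilities increases toward 1 as n grows, even though every one of these data sets is "just significant" at the same level.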

Thus P(H₀|X̄) → 1 as n → ∞, whereas the P-value is equal to α for all n. This is known as the Jeffreys-Lindley paradox. It may be noted that the same phenomenon would arise with any flat enough prior in place of the uniform. Indeed, P-values cannot be compared across sample sizes or across experiments; see Lindley (1957), Ghosh et al. (2005). Even a frequentist tends to agree that the conventional values of the significance level α like α = 0.05 or 0.01 are too large for large sample sizes. This point is further discussed below.

The Jeffreys-Lindley paradox shows that for inference about θ, P-values and Bayes factors may provide contradictory evidence and hence can lead to opposite decisions. Once again, as mentioned in Section 6.3, the evidence against H₀ contained in P-values seems unrealistically high. We argue in this section that part of this conflict arises from the fact that different types of asymptotics are being used for the Bayes factors and the P-values.

We begin with a quick review of the two relevant asymptotic frameworks in classical statistics for testing a sharp null hypothesis. The standard asymptotics of classical statistics is based on what are called Pitman alternatives, namely, θₙ = θ₀ + d/√n, at a distance of O(1/√n) from the null. The Pitman alternatives are also called contiguous in the very general asymptotic theory developed by Le Cam (vide Roussas (1972), Le Cam and Yang (2000), Hajek and Sidak (1967)). The log-likelihood ratio of a contiguous alternative with respect to the null is stochastically bounded as n → ∞. On the other hand, for a fixed alternative, the log-likelihood ratio tends to −∞ (under the null) or ∞ (under the fixed alternative). If the probability of Type 1 error is 0 < α < 1, then the behavior of the likelihood ratio has the following implication. The probability of Type 2 error will converge to 0 < β < 1 under a contiguous alternative θₙ, and to zero if θ is a fixed alternative. This means the fixed alternatives are relatively easy to detect. So in this framework it is assumed that the alternatives of importance are the contiguous alternatives. Let us call this theory Pitman type asymptotics.

There are several other frameworks in classical statistics, of which Bahadur's (Bahadur, 1971; Serfling, 1980, pp. 332-341) has been studied most. We focus on Bahadur's approach. In Bahadur's theory, the alternatives of importance are fixed and do not depend on n. Given a test statistic, Bahadur evaluates its performance at a fixed alternative by the limit (in probability or a.s.) of (1/n)(log P-value) when the alternative is true. Which of these two asymptotics is appropriate in a given situation should depend on which alternatives are important, fixed alternatives or Pitman alternatives θ₀ + d/√n that approach the null hypothesis at a certain rate. This in turn depends on how the sample size n is chosen. If n is chosen to ensure a Type 2 error bounded away from 0 and 1 (like α), then Pitman alternatives seem appropriate. If n is chosen to be quite large, depending on available resources but not on alternatives, then Bahadur's approach would be reasonable.

6.4.1 Comparison of Decisions via P-values and Bayes Factors in Bahadur's Asymptotics

In this subsection, we essentially follow Bahadur's approach for both P-values and Bayes factors. A Pitman type asymptotics is used for both in the next subsection. We first show that if the P-value is sufficiently small, as small as it typically is in Bahadur's theory, B₀₁ will tend to zero, calling for rejection of H₀; i.e., the evidence in the P-value points in the same direction as that in the Bayes factor or posterior probability, removing the sense of paradox in the result of Jeffreys and Lindley. One could, therefore, argue that the P-values or the significance level α assumed in the Jeffreys-Lindley example are not small enough. The asymptotic framework chosen is not appropriate when contiguous alternatives are not singled out as alternatives of importance.

We now verify the claim about the limit of B₀₁. Without loss of generality, take θ₀ = 0, σ² = 1. First note that

log B₀₁ = −(n/2) X̄² + (1/2) log n + Rₙ,   (6.26)

where

Rₙ = −log π(X̄|H₁) − (1/2) log(2π) + o(1),


provided the prior g₁(θ) is a continuous function of θ and is positive at all θ. If we omit Rₙ from the right-hand side of (6.26), we have Schwarz's (1978) approximation to the Bayes factor via BIC (Section 4.3). The logarithm of the P-value (p) corresponding to the observed X̄ is

log p = log 2[1 − Φ(√n |X̄|)] = −(n/2) X̄² (1 + o(1))

by standard approximation to a normal tail (vide Feller (1973, p. 175) or Bahadur (1971, p. 1)). Thus (1/n) log p → −θ²/2, and by (6.26), log B₀₁ → −∞. This result is true as long as |X̄| > c(log n/n)^{1/2}, c > √2. Such deviations are called moderate deviations, vide Rubin and Sethuraman (1965). Of course, even for such P-values, p is of the order of B₀₁/√n, so that P-values are smaller by an order of magnitude. The conflict in measuring evidence remains, but the decisions are the same. Ghosh et al. (2005) also pursue the comparison of the three measures of evidence based on the likelihood ratio, the P-value based on the likelihood ratio test, and the Bayes factor B₀₁ under general regularity conditions.

6.4.2 Pitman Alternative and Rescaled Priors

We consider once again the problem of testing H₀ : θ = 0 versus H₁ : θ ≠ 0 on the basis of a random sample from N(θ, 1). Suppose that the Pitman alternatives are the most important ones and the prior g₁(θ) under H₁ puts most of its mass on Pitman alternatives. One such prior is N(0, δ/n). Then

B₀₁ = √(δ + 1) exp[ −(n X̄²/2) · δ/(δ + 1) ].

If the P-value is close to zero, √n |X̄| is large and therefore B₀₁ is also close to zero, i.e., for these priors there is no paradox. The two measures are of the same order, but the result of Berger and Sellke (1987) for symmetric unimodal priors still implies that the P-value is smaller than the Bayes factor.
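Concretely, at a just-significant observation X̄ = z/√n the exponent n X̄² equals z², so B₀₁ under the rescaled prior N(0, δ/n) does not depend on n at all, and no Jeffreys-Lindley effect can arise. A minimal sketch (δ = 4 and z = 1.96 are hypothetical choices):

```python
import math

def B01_rescaled(n, xbar, delta):
    """Bayes factor for H0: theta = 0 vs the rescaled prior N(0, delta/n) under H1."""
    return math.sqrt(delta + 1.0) * math.exp(-0.5 * (delta / (delta + 1.0)) * n * xbar ** 2)

delta = 4.0   # hypothetical choice of the Pitman-scale prior variance parameter
values = [B01_rescaled(n, 1.96 / math.sqrt(n), delta) for n in (10, 1000, 100000)]
```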

6.5 Bayesian P-value³

Even though valid Bayesian quantities such as the Bayes factor and posterior probability of hypotheses are in principle the correct tools to measure the evidence for or against hypotheses, they are quite often, and especially in many practical situations, very difficult to compute. This is because either the alternatives are only very vaguely specified, vide (6.6), or very complicated. Also, in some cases one may not wish to compare two or more models but check how a model fits the data. Bayesian P-values have been proposed to deal with such problems.

³ Section 6.5 may be omitted at first reading.


Let M₀ be a target model, and suppose departure from this model is of interest. If, under this model, X has density f(x|η), η ∈ ℰ, then for a Bayesian with prior π on η, the prior predictive distribution m_π(x) = ∫_ℰ f(x|η)π(η) dη is the actual predictive distribution of X. Therefore, if a model departure statistic T(X) is available, then one can define the prior predictive P-value (or tail area under the predictive distribution) as

p = P^{m_π}(T(X) ≥ T(x_obs) | M₀),

where x_obs is the observed value of X.
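The prior predictive P-value can be approximated by simulating from m_π. The following sketch is our own toy illustration, assuming a normal model with a standard normal prior on η and the sample maximum as the departure statistic T (all of these are hypothetical choices, not from the text).

```python
import random

def prior_predictive_pvalue(x_obs, T, n_sim=20000, seed=1):
    """Monte Carlo prior predictive P-value under M0: draw eta ~ N(0, 1) (the prior),
    then X | eta as i.i.d. N(eta, 1); estimate P^{m_pi}(T(X) >= T(x_obs))."""
    rng = random.Random(seed)
    n = len(x_obs)
    t_obs = T(x_obs)
    hits = 0
    for _ in range(n_sim):
        eta = rng.gauss(0.0, 1.0)                 # draw from the prior
        x = [rng.gauss(eta, 1.0) for _ in range(n)]  # draw data given eta
        if T(x) >= t_obs:
            hits += 1
    return hits / n_sim

# Hypothetical observed data; departure statistic T = sample maximum.
x_obs = [0.3, -0.5, 2.9, 0.1, 0.4, -0.2, 0.8, 0.0]
p = prior_predictive_pvalue(x_obs, T=max)
```

Because the tail area is computed under m_π rather than under a fixed null parameter value, the resulting p reflects both sampling variability and the prior's spread.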
