\documentclass[11pt]{article}
\usepackage{Sweave}
\usepackage{amsmath}
\addtolength{\textwidth}{1in}
\addtolength{\oddsidemargin}{-.5in}
\setlength{\evensidemargin}{\oddsidemargin}
%\VignetteIndexEntry{Population contrasts}

\SweaveOpts{prefix.string=tests,width=6,height=4, keep.source=TRUE, fig=FALSE}
% Ross Ihaka suggestions
\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}

\SweaveOpts{width=6,height=4}
\setkeys{Gin}{width=\textwidth}
<<echo=FALSE>>=
options(continue=" ", width=70)
options(SweaveHooks=list(fig=function() par(mar=c(4.1, 4.1, .3, 1.1))))
pdf.options(pointsize=8) #text in graph about the same as regular text
options(contrasts=c("contr.treatment", "contr.poly")) #reset default
library(survival)
library(splines)
@

\title{Population contrasts}
\author{Terry M Therneau \\ \emph{Mayo Clinic}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\myfig}[1]{\includegraphics[height=!, width=\textwidth]
  {tests-#1.pdf}}

\newcommand{\ybar}{\overline{y}}

\begin{document}
\maketitle
\tableofcontents

\section{Introduction}
Statisticians and their clients have always been fond of single number
summaries for a data set, perhaps too much so.
Consider the hypothetical data shown in figure \ref{fig1},
which compares treatments A and B with age as a confounder.
What is a succinct but useful summary of the difference between
treatment arms A and B?
One approach is to select a fixed \emph{population} for the
age distribution, and then compute the
mean treatment effect over that population.

\begin{figure}
<<fig1, echo=FALSE, fig=TRUE>>=
plot(c(50,85), c(2,4.5), type='n', xlab="Age", ylab="Effect")
abline(.35, .045, lty=1, col=1, lwd=2)
abline(1.1, .026, lty=1, col=2, lwd=2)
legend(50, 4.2, c("Treatment A", "Treatment B"),
       col=c(1,2), lty=1, lwd=2, cex=1.3, bty='n')
@
\caption{Treatment effects for a hypothetical study.}
\label{fig1}
\end{figure}

More formally, assume we have a fitted model.
We want to compute the conditional expectation
\begin{equation*}
  m_A = E_F\left( \hat y \mid {\rm trt}=A \right)
\end{equation*}
where $F$ is some chosen population for the covariates other than treatment.
Important follow-up questions are: what population should be used,
what statistic should be averaged,
what computational algorithm should be used,
and what are the statistical properties of the resulting
estimate?
Neither the statistic nor the population question should be taken lightly,
and both need to be closely linked to the scientific question.
If, for instance, the model of figure \ref{fig1} were used to inform a
nursing home formulary, then the distribution $F$ might be focused on
higher ages.

Four common populations are

\begin{itemize}
  \item Empirical: The data set itself. For the simple example above, this
    would be the distribution of all $n$ ages in the data set,
    irrespective of treatment.
  \item Factorial or Yates: This is only applicable if the adjusting variables
    are all categorical, and consists of all unique combinations of them;
    that is, the data set one would envision for a balanced factorial
    experiment.
  \item External: An external reference such as the age/sex distribution
    of the US census. This is common in epidemiology.
  \item SAS type 3: A factorial distribution for the categorical predictors
    and the data distribution for the others. More will be said about
    this in section \ref{sect:SAS}.
\end{itemize}

The \code{yates} function is designed to compute such population averages
from a fitted model, along with desired contrasts on the resulting
estimates, e.g., whether the population average effects for treatment A and
treatment B are equal.
The function has been tested with the results of \code{lm}, \code{glm},
and \code{coxph} fits,
and can easily be extended to any R model that includes a standard set of
objects in the result, i.e.,
\code{terms}, \code{contrasts}, \code{xlevels}, and \code{assign}.

The routine's name is a nod to the 1934 paper by Yates \cite{Yates34},
\emph{The analysis of multiple classifications with unequal numbers in
different classes}.
In dealing with an unbalanced 2-way layout he states that
\begin{quote}\ldots in the absence of any further
assumptions the efficient estimates of the average A effects
are obtained by taking the means
of the observed sub-class means for like A over all B sub-classes.
\end{quote}
In his example these sub-class means are the predicted values for each
A*B combination, thus his estimate for each level of A is the
mean predicted value over B, i.e., the mean prediction for A over
a factorial population for B.
Yates then develops
formulas for calculating these quantities and testing their
equality that are practical for the manual methods of the time;
these are now easily accomplished directly using matrix computations.
(It is interesting that Yates' paper focuses far more on estimands than
tests, while later references to his work focus almost exclusively
on the latter, e.g., the ``Yates sum of squares'', reflecting once
again our profession's singular focus on p values.)

This concept of population averages is actually a common one in statistics.
Taking an average is, after all, nearly the first thing a statistician
will do.
Yates' weighted means analysis, the g-estimates of causal models,
direct adjusted survival curves, and least squares means are but a
small sample of the idea's continual rediscovery.
Searle et al.\ \cite{Searle80} use the term population marginal mean (PMM),
an acronym we will adopt,
though they deal only with linear models, assume a factorial population,
and spend most of their energy writing out explicit formulas for particular
cases. They also use a separate acronym to distinguish the estimated PMM
based on a model fit from the ideal, which we will not do.

\section{Solder Example}
\subsection{Data}
In 1988 an experiment was designed and implemented at one of AT\&T's
factories to investigate alternatives in the wave soldering procedure
for mounting electronic components to printed circuit boards.
The experiment varied a number of factors relevant to the process.
The response, measured by eye, is the number of visible solder skips.
The data set was used in the
book Statistical Models in S \cite{Chambers93} and is included in the
\code{survival} package.

<<solder1a>>=
summary(solder)
length(unique(solder$PadType))
@

A perfectly balanced experiment would have 3*2*10*3 = 180 observations for
each Mask, corresponding to all combinations of Opening, Solder Thickness,
PadType and Panel.
The A3 Mask has extra replicates for a subset of the Opening*Thickness
combinations, however, while Mask A6 lacks observations for these sets.
Essentially, one extra run of 180 was done with a mixture of Masks.
Figure \ref{fig:solder} gives an overview of univariate results for
each factor.

\begin{figure}
<<solder1b, fig=TRUE, echo=FALSE>>=
# reproduce figure 1.1 of Statistical Models in S
temp <- lapply(1:5, function(x) tapply(solder$skips, solder[[x]], mean))
plot(c(0.5, 5.5), range(unlist(temp)), type='n', xaxt='n',
     xlab="Factors", ylab="mean number of skips")
axis(1, 1:5, names(solder)[1:5])
for (i in 1:5) {
    y <- temp[[i]]
    x <- rep(i, length(y))
    text(x-.1, y, names(y), adj=1)
    segments(i, min(y), i, max(y))
    segments(x-.05, y, x+.05, y)
}
@
\caption{Overview of the solder data.}
\label{fig:solder}
\end{figure}

\subsection{Linear model}
A subset of the solder data that excludes Mask A6 is exactly the type
of data set considered in Yates' paper: a factorial design whose data set
is not quite balanced.
Start with a simple fit and then obtain the Yates predictions.

<<solder2>>=
with(solder[solder$Mask!='A6',], ftable(Opening, Solder, PadType))

fit1 <- lm(skips ~ Opening + Solder + Mask + PadType + Panel,
           data=solder, subset= (Mask != 'A6'))
y1 <- yates(fit1, ~Opening, population = "factorial")
print(y1, digits=2)  # fewer digits, to fit the vignette page width
@

The printout has two parts: the left hand columns are population marginal mean
(PMM) values and the right hand columns are tests on those predicted values.
The default is a single global test that the PMM values are all equal.
Under a factorial population these are the Yates' weighted
means \cite{Yates34} and the corresponding test is the Yates' sum of
squares for that term. These would be labeled as
``least squares means'' and ``type III SS'', respectively,
by the glm procedure of SAS.
More on this correspondence appears in the section on the SGTT algorithm.
Now we repeat this using the default population, which is the set of all 810
combinations of Solder, Mask, PadType and Panel found in the non-A6 data.
The \code{pairwise} option requests tests on all pairs of openings.
<<solder2b>>=
y2 <- yates(fit1, "Opening", population = "data", test="pairwise")
print(y2, digits=2)
#
# compare the two results
temp <- rbind(diff(y1$estimate[,"pmm"]), diff(y2$estimate[,"pmm"]))
dimnames(temp) <- list(c("factorial", "empirical"), c("2 vs 1", "3 vs 2"))
round(temp,5)
@

Although the PMM values shift with the new population, the difference in
PMM values between any two pairs is unchanged.
This is because we have fit a model with no interactions.
Referring to figure \ref{fig1}, this is a model where all of the predictions
are parallel lines; shifting the population left or
right will change the PMM, but has no effect on the difference between two
lines.
For a linear model with no interactions, the test statistics created by the
\code{yates} function are thus not very interesting, since they will be no
different from simple comparisons of the model coefficients.

Here are results from a more interesting fit that includes interactions.

<<solder3>>=
fit2 <- lm(skips ~ Opening + Mask*PadType + Panel, solder,
           subset= (Mask != "A6"))
temp <- yates(fit2, ~Opening, population="factorial")
print(temp, digits=2)
@
This solder data set is close to being balanced,
and the means change only a small amount.
(One hallmark of a completely balanced experiment is that the
PMM values are unaffected by the addition of interaction terms.)

\subsection{Missing cells}
Models that involve factors and interactions can have an issue with
missing cells, as shown by the example below using the full
version of the solder data.
<<solder4>>=
fit3 <- lm(skips ~ Opening * Mask + Solder + PadType + Panel, solder)
temp <- yates(fit3, ~Mask, test="pairwise")
print(temp, digits=2)
@

The population predictions for each Mask include all combinations of
Opening, Solder, PadType, and Panel that are found in the data.
The above call implicitly uses the default value of \code{population=`data'},
which is the option that users will normally select.
The underlying algorithm amounts to:
\begin{enumerate}
  \item Make a copy of the data set (900 obs), and set Mask to A1.5 for
    all observations.
  \item Get the 900 resulting predicted values from the model, and take their
    average.
  \item Repeat steps 1 and 2 for each mask type; a by-hand sketch follows
    the list.
\end{enumerate}
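
A minimal sketch of these three steps (not evaluated, and the name
\code{pmm} is ours; the \code{yates} function automates this computation,
including the estimability checks discussed below):
<<byhand, eval=FALSE>>=
# one population average per Mask level: copy the data, overwrite Mask,
# predict for every row, and average the predictions
pmm <- sapply(levels(solder$Mask), function(lev) {
    tdata <- solder          # copy of the data set
    tdata$Mask <- lev        # set Mask to this level for all rows
    mean(predict(fit3, newdata=tdata))
})
@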

However, there were no observations in the data
set with Mask = A6 and Opening = Large.
Formally, predictions for the A6/Large combination are \emph{not estimable},
and as a consequence
neither are any population averages that include those predicted values,
nor any tests that involve those population averages.
This lack of estimability is entirely due to the inclusion of a Mask by Opening
interaction term in the model, which states that each Mask/Opening combination
has a unique effect, which in turn implies that we need an estimate for all
Mask*Opening pairs to compute population predictions for all levels of
the Mask variable.

If you do the above steps `by hand' using the R \code{predict} function,
it will return a value for all 900 observations,
along with a warning message that the results may not be reliable;
the warning is correct in this case.
The result of \code{coef(fit3)} reveals that the fit generated an NA
as one of the coefficients.
The presence of a missing value shows that some predictions
will not be estimable, but it is not possible to determine \emph{which}
ones are estimable from the coefficients alone.
The predict function knows that some predictions will be wrong, but not
which ones.
A formal definition of estimability for a given prediction is that it
can be written
as a linear combination of the rows of $X$, the design matrix for the fit.
The \code{yates} function performs the necessary calculations to verify formal
estimability of each predicted value, and thus is able to correctly identify
the deficient terms.
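
For the curious, the check can be done numerically; the sketch below is
our own (the function name \code{estimable} and the tolerance are
assumptions), not the routine's actual code:
<<estimable, eval=FALSE>>=
# a prediction z'beta is estimable iff z lies in the row space of X,
# i.e., the column space of t(X); test the projection residual
X   <- model.matrix(fit3)
qtx <- qr(t(X))
estimable <- function(z, tol=1e-8) {
    resid <- qr.resid(qtx, z)   # component of z outside the row space
    all(abs(resid) < tol * max(1, abs(z)))
}
@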

\section{Generalized linear models}
\label{sect:glm}
Since the solder response is a count of the number of skips, Poisson
regression is a more natural modeling approach than linear regression.
In a glm model we need to consider more carefully both the population
and the statistic to be averaged.

<<glm>>=
gfit2 <- glm(skips ~ Opening * Mask + PadType + Solder, data=solder,
             family=poisson)
y1 <- yates(gfit2, ~ Mask, predict = "link")
print(y1, digits=2)
print( yates(gfit2, ~ Mask, predict = "response"), digits=2)
@

Mean predicted values for the number of skips, i.e.,
\code{type=`response'} in the language of \code{predict.glm}, are similar to
the estimates from the linear regression model.
Prediction of type `link' yields a population average of the
linear predictor $X\beta$.
Though perfectly legal (you can, after all, take the mean of anything you
want), the linear predictor PMM values can be more difficult to interpret.
Since the default link function for Poisson regression is log(),
the PMM of the linear predictor equals the mean of the
log(predicted values).
Since $\exp({\rm mean}(\log(x)))$ is the geometric mean of a random
variable $x$,
one can view the exponentiated version of the link estimate as
a geometric mean of predicted values over the population.
Given the skewness of Poisson counts, this may actually be an advantageous
summary.
Arguments for other links are much less clear, however: the authors have,
for instance,
never been able to come to a working understanding of what
an ``average log odds'',
i.e., the link from logistic regression, would represent.
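
A two-line illustration of the distinction (a sketch, not part of the
vignette's computations):
<<geomean, eval=FALSE>>=
p <- predict(gfit2, type="response")   # per-observation predicted counts
c(arithmetic = mean(p), geometric = exp(mean(log(p))))
@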

One computational advantage of the linear predictor lies in creating the
variance.
Since \code{mean(Z \%*\% b) = colMeans(Z) \%*\% b}, the computation can be
done in 3 steps: first create $Z$ with one row for each observation in the
population, then obtain the vector of column means $d= Z'1/m$; the PMM
estimate is $d'\hat\beta$ with variance $d'Vd$, where $V$ is the
variance matrix of $\hat\beta$.
For other PMM quantities the routine samples \code{nsim} vectors
from $b \sim N(\hat\beta, V)$, then computes PMM estimates separately for
each vector $b$ and forms an empirical variance of the PMM estimates
from the results.
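
In code, for a single population copy with model matrix \code{Z} (a sketch;
the names \code{Z}, \code{fit}, \code{d} follow the notation of the text and
are not the routine's internals):
<<pmmlin, eval=FALSE>>=
d   <- colMeans(Z)                  # d = Z'1/m
pmm <- sum(d * coef(fit))           # d' beta-hat
se  <- sqrt(d %*% vcov(fit) %*% d)  # sqrt(d'Vd)
@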

For nonlinear predictors such as the response, the population choice
matters even for an additive model. The two results
below have different estimates for the ``between PMM differences''
and tests.

<<glm3>>=
gfit1 <- glm(skips ~ Opening + Mask + PadType + Solder, data=solder,
             family=poisson)
yates(gfit1, ~ Opening, test="pairwise", predict = "response",
      population="data")
yates(gfit1, ~ Opening, test="pairwise", predict = "response",
      population="yates")
@

\section{Free Light Chain}
As an example of the more usual case, a data set which does \emph{not} arise
from a balanced factorial experiment, we will look at the free light chain
data set.
In 2012, Dispenzieri and colleagues examined the
distribution and consequences of the free light chain value,
a laboratory test, on a large fraction of the 1995
population of Olmsted County, Minnesota aged 50 or older
\cite{Kyle06, Dispenzieri12}.
The R data set \code{flchain} contains a 50\% random sample of this larger
study and is included as a part of the \code{survival} package.
The primary purpose of the study was to measure the amount of
plasma immunoglobulin and its components.
Intact immunoglobulins are composed of a heavy chain and a light chain
portion. In normal subjects there is overproduction of the light chain
component by the immune cells, leading to a small amount of
\emph{free light chain} in the circulation.
Excessive amounts of free light chain (FLC) are thought to be a marker of
dysregulation in the immune system.
An important medical question is whether high levels of FLC have an
impact on survival, which will be explored using a Cox model.
Free light chains have two major forms denoted as kappa and lambda;
we will use the sum of the two.

A confounding factor is that FLC values rise with age, in
part because the molecule is eliminated by the kidneys and renal function
declines with age.
The age distribution of males and females differs, so we
will adjust any comparisons for both age and sex.
The impact of age on mortality is overwhelmingly large,
so correction for the age imbalance is critical when exploring
the impact of FLC on survival.

Figure \ref{fig:flc} shows the trend in FLC values
as a function of age.
For illustration of linear models using factors, we have also
created a categorical age variable using decades of age.

\begin{figure}
<<data, fig=TRUE, echo=FALSE>>=
male <- (flchain$sex=='M')
flchain$flc <- flchain$kappa + flchain$lambda
mlow <- with(flchain[male,], smooth.spline(age, flc))
flow <- with(flchain[!male,], smooth.spline(age, flc))
plot(flow, type='l', ylim=range(flow$y, mlow$y),
     xlab="Age", ylab="FLC")
lines(mlow, col=2, lwd=2)
legend(60, 6, c("Female", "Male"), lty=1, col=1:2, lwd=2, bty='n')
@
\caption{Free light chain values as a function of age.}
\label{fig:flc}
\end{figure}

The table of counts shows that the sex distribution becomes increasingly
unbalanced at the older ages, from about 1/2 females in the youngest
group to a 4:1 ratio in the oldest.
<<counts>>=
flchain$flc <- flchain$kappa + flchain$lambda
age2 <- cut(flchain$age, c(49, 59, 69, 79, 89, 120),
            labels=c("50-59", "60-69", "70-79", "80-89", "90+"))
fgroup <- cut(flchain$flc, quantile(flchain$flc, c(0, .5, .75, .9, 1)),
              include.lowest=TRUE, labels=c("<50", "50-75", "75-90", ">90"))
counts <- with(flchain, table(sex, age2))
counts
#
# Mean FLC in each age/sex group
cellmean <- with(flchain, tapply(flc, list(sex, age2), mean))
round(cellmean,1)
@

Notice that the male/female difference in FLC varies with age:
\Sexpr{round(cellmean[1,1],1)} versus \Sexpr{round(cellmean[2,1],1)}
at age 50--59 years and \Sexpr{round(cellmean[1,5],1)} versus
\Sexpr{round(cellmean[2,5],1)} at age 90+ years,
as also shown in figure \ref{fig:flc}.
The data do not fit a simple additive model; there are ``interactions'',
to use statistical parlance.
Men and women simply do not age in quite the same way.

\subsection{Linear models}
Compare the mean FLC for males to females, with and without adjusting for
age.
<<flc1>>=
library(splines)
flc1 <- lm(flc ~ sex, flchain)
flc2a <- lm(flc ~ sex + ns(age, 3), flchain)
flc2b <- lm(flc ~ sex + age2, flchain)
flc3a <- lm(flc ~ sex * ns(age, 3), flchain)
flc3b <- lm(flc ~ sex * age2, flchain)
#
# prediction at age 65 (which is near the mean)
tdata <- data.frame(sex=c("F", "M"), age=65, age2="60-69")
temp <- rbind("unadjusted"               = predict(flc1, tdata),
              "additive, continuous age" = predict(flc2a, tdata),
              "additive, discrete age"   = predict(flc2b, tdata),
              "interaction, cont age"    = predict(flc3a, tdata),
              "interaction, discrete"    = predict(flc3b, tdata))
temp <- cbind(temp, temp[,2] - temp[,1])
colnames(temp) <- c("Female", "Male", "M - F")
round(temp,2)
@

The between-sex difference is underestimated without adjustment for
age: females are over-represented at the high ages, which inflates
the unadjusted female estimate.
For this particular data set,
both continuous and categorical age adjustment are able
to recover the true size of the increment for males.
Now look at population adjustment.

<<flc2>>=
yates(flc3a, ~sex)  # interaction, continuous age
#
yates(flc3b, ~sex)  # interaction, categorical age
#
yates(flc3b, ~sex, population="factorial")
@

The population average values are just a
bit higher than the prediction at the mean, due to the upward curvature
of the age vs FLC curve.
The average for a factorial population is larger yet.
This is because it is the average over an unusual population, one which has
as many 90+ year old subjects as 50--59 year old ones; i.e.,
it is the correct answer to a rather odd question, since this is a
population that will never be encountered in real life.

We can also reverse the question and examine age effects
after adjusting for sex. For the
continuous model the age values for the PMM need to be specified using
the \code{levels} argument; otherwise the routine will not know which
ages are ``of interest'' to the reader.
(With a factor the routine will assume that you want all the levels,
but a subset can be chosen using the \code{levels} argument.)
<<flc3>>=
yates(flc3a, ~ age, levels=c(65, 75, 85))
yates(flc3b, ~ age2)
@
Not surprisingly, the prediction at age 65 years for the continuous model is
quite close to that for the 60--69 year age group from the discrete model.

\section{Cox Models}
Finally we come to Cox models, which are, after all, the point of this
vignette and the question that prompted creation of the \code{yates} function.
Here the question of what to predict is more serious.
To get a feel for the data, look at three simple models.

<<cox1>>=
options(show.signif.stars=FALSE)  # show statistical intelligence
coxfit1 <- coxph(Surv(futime, death) ~ sex, flchain)
coxfit2 <- coxph(Surv(futime, death) ~ sex + age, flchain)
coxfit3 <- coxph(Surv(futime, death) ~ sex * age, flchain)
anova(coxfit1, coxfit2, coxfit3)
#
exp(c(coef(coxfit1), coef(coxfit2)[1]))  # sex estimate without and with age
@

The model with an age*sex interaction does not fit substantially better
than the additive model.
This is actually not a surprise: in a plot of log(US death rate)
the curves for males and females are essentially parallel after age 50 years.
(See the \code{survexp.us} data set, for instance.)
The sex coefficients for models 1 and 2 differ substantially.
Males in this data set have almost 1.5 times the death rate of females at any
given age, but when age is ignored the fact that females dominate the
oldest ages almost completely cancels this out, and males appear to have the
same `overall' mortality.
Adjustment for both age and sex is critical to understanding the potential
effect of FLC on survival.

Dispenzieri \cite{Dispenzieri12} looked at the impact of FLC by dividing the
sample into those above and below the 90th percentile of FLC;
for illustration we will use 4 groups consisting of the lowest 50\%,
the 50th to 75th percentile, the 75th to 90th percentile, and above the 90th
percentile.

<<coxfit2>>=
coxfit4 <- coxph(Surv(futime, death) ~ fgroup*age + sex, flchain)
yates(coxfit4, ~ fgroup, predict="linear")
yates(coxfit4, ~ fgroup, predict="risk")
@
We see that after adjustment for age and sex, FLC is a strong
predictor of survival.
Since the Cox model is a model of relative risk, any constant term is
arbitrary: one could add 100 to all of the log rates (type `linear' above)
and have as valid an answer. To keep the coefficients on a sensible scale,
the \code{yates} function centers the mean linear predictor of the original
data at zero. Due to Jensen's inequality this
does not precisely center the risks at $\exp(0) = 1$,
but it suffices to keep values in a sensible range
for display.

A similar argument to that found in section \ref{sect:glm} about the
arithmetic versus geometric mean can be made here, but a more fundamental
issue is that the overall hazard function for a population is not the
average of the hazards of its members, and in fact will change over time as
the higher risk members of the population die.
Though computable, the mean hazard ratio only applies to the study at time 0,
before selective death begins to change the structure of the remaining
population, and likewise for the average linear predictor, i.e., the
mean log hazard ratio.
A PMM based on either of these estimates is hard to interpret.

Survival curves, however, do lead to a proper average:
the survival curve of a population is the
mean of the individual survival curves of its members.
Functions computed from the survival curve, such as the mean time
until event, will also be proper and interpretable.
The longest death time for the FLC data set is at
\Sexpr{round(with(flchain, max(futime[death==1])/365.25),1)} years; as a
one number summary of each PMM curve we
will use a restricted mean survival with a threshold of 13 years.
<<surv>>=
# longest time to death
round(max(flchain$futime[flchain$death==1]) / 365.25, 1)
# compute naive survival curves
flkm <- survfit(Surv(futime, death) ~ fgroup, data=flchain)
print(flkm, rmean= 13*365.25, scale=365.25)
@

Straightforward survival prediction takes longer than recommended for a CRAN
vignette: there are \Sexpr{nrow(flchain)} subjects in the study and
4 FLC groups, which leads to just over 30 thousand predicted survival curves
when using the default \code{population=`data'}, and each curve
has over 2000 time points (the number of unique death times).
To compute a variance,
this is then repeated the default \code{nsim = 200} times.
We can use this as an opportunity to demonstrate a user supplied population,
i.e., a data set containing a population of
values for the control variables.
We'll use every 20th observation in the \code{flchain} data as the population
and also reduce the number of simulations to limit the run time.
<<surv2>>=
mypop <- flchain[seq(1, nrow(flchain), by=20), c("age", "sex")]
ysurv <- yates(coxfit4, ~fgroup, predict="survival", nsim=50,
               population = mypop,
               options=list(rmean=365.25*13))
ysurv

# display side by side
temp <- rbind("simple KM" = summary(flkm, rmean=13*365.25)$table[, "rmean"],
              "population adjusted" = ysurv$estimate[,"pmm"])
round(temp/365.25, 2)
@
The spread in restricted mean values between the different FLC groups
is considerably less in the marginal
(i.e., PMM) survival curves than in the unadjusted Kaplan-Meier curves,
with the biggest change for those above the 90th percentile of FLC.
Males dominate the highest FLC values; without adjustment for sex, the
survival for the high FLC group is biased downward by the male predominance.

The \code{ysurv} object also contains an optional \code{summary} component,
which in this case is the set of 4 PMM survival curves.
Plot these along with the unadjusted curves, with solid lines for the
PMM estimates and dashed for the unadjusted curves.
This shows the difference between adjusted and unadjusted even more clearly.
Adjustment for age and sex has pulled the curves together, though the $>90$th
percentile group still stands apart from the rest.
(It is not a 1 number summary with a simple p value, however, and so is
spurned by users and journals ;-).
<<surv3, fig=TRUE>>=
plot(flkm, xscale=365.25, fun="event", col=1:4, lty=2,
     xlab="Years", ylab="Death")
lines(ysurv$summary, fun="event", col=1:4, lty=1, lwd=2)
legend(0, .65, levels(fgroup), lty=1, lwd=2, col=1:4, bty='n')
@

\section{Mathematics}
The underlying code uses a simple brute force algorithm. It first builds a
population data set for the control variables that includes a placeholder
for the variable of interest.
Then, one at a time, it places each level of the variable(s) of interest in
all rows
of the data (e.g., sex=F for all rows), computes the model's prediction
for all rows, and then computes the mean prediction.
Since predicted values are
independent of how the variables in a model are coded,
the result is also independent of coding.

When the prediction is the simple linear predictor $X\beta$, we take advantage
of the fact that
$\mbox{\rm mean}(X \beta) = [\mbox{\rm column means}(X)]\, \beta = c\beta$.
If $C$ is the matrix of said column means, one row for each of the groups
of interest, then $C\beta$ is the vector of PMM values and $CVC'$ is the
variance-covariance matrix of those values, where $V$ is the variance matrix
of $\hat \beta$.
The lion's share of the work is building the individual $X$ matrices, and that
is unchanged.

For other than the linear case, the variance is obtained by simulation.
Assume that $\hat\beta \sim N(\beta, V)$, and draw
\code{nsim} independent samples $b$ from $N(\hat\beta, V)$.
PMM values are computed for each instance of $b$, and an empirical
variance matrix for the PMM values is then computed.
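
A compact sketch of that simulation, assuming a function \code{pmmfun}
(our name, not the routine's) that maps a coefficient vector to the vector
of population means:
<<simvar, eval=FALSE>>=
# draw nsim coefficient vectors from N(beta-hat, V)
bmat <- MASS::mvrnorm(nsim, mu=coef(fit), Sigma=vcov(fit))
sims <- apply(bmat, 1, pmmfun)  # one column of PMM values per draw
vhat <- var(t(sims))            # empirical variance matrix of the PMMs
@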

\section{SAS glim type III (SGTT) algorithm}
\label{sect:SAS}
Earlier in this document reference was made to the SAS ``type 3'' estimates,
and we now delve into that topic.
It is placed at the end because it is a side issue with
respect to population averages.
However, whatever one's opinion on the wisdom
or folly of the SAS estimator, one cannot ignore its ubiquity, and showing how
it fits into this framework is an important part of the picture.

As groundwork we start with some computational facts about ANOVA. Assume
an $X$ matrix for the model in standard order: the intercept, then main
effects, then first order interactions, second order interactions, etc.,
as one proceeds from the leftmost column.
\begin{itemize}
  \item Let $LDL' = (X'X)$ be the generalized Cholesky decomposition of the
    $X'X$ matrix, where $L$ is lower triangular with $L_{ii}=1$ and $D$ is
    diagonal. $D_{ii}$ will be zero if column $i$ of $X$ can be written as
    a linear combination of prior columns, and will be positive otherwise.
  \item Let $L'_k$ refer to the rows of $L'$ that correspond to the $k$th term
    in the model, e.g., a main effect. Then the test statistics for
    $L'_1\beta= 0$, $L'_2 \beta=0$, \ldots, are the sums of squares of the
    standard ANOVA table, often referred to as type I or sequential
    sums of squares.
  \item If $X$ corresponds to a balanced factorial design, then
    $L_{ij} \ne 0$ only if $i<j$ and term $j$ contains term $i$;
    the interaction \code{x1:x2} contains \code{x1}, for instance.
\end{itemize}
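
For readers who want to experiment, a bare-bones generalized $LDL'$
decomposition can be written in a few lines of R (our code, not the
package's; it assumes a positive semi-definite matrix and a
scale-appropriate tolerance):
<<ldl, eval=FALSE>>=
ldl <- function(A, tol=1e-8) {
    p <- ncol(A); L <- diag(p); d <- numeric(p)
    for (i in 1:p) {
        k <- seq_len(i-1)
        d[i] <- A[i,i] - sum(L[i,k]^2 * d[k])
        if (d[i] < tol) {d[i] <- 0; next}   # redundant column
        if (i < p) for (j in (i+1):p)
            L[j,i] <- (A[j,i] - sum(L[j,k] * L[i,k] * d[k])) / d[i]
    }
    list(L=L, d=d)   # A = L diag(d) L'
}
@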

The blocks of zeros implied by the last of these leads to one of the most
well known identities in ANOVA: for a
balanced factorial design, the type I SS for a term does not depend on its
order in the model. (Swapping the order of two main effects, for instance,
simply swaps those rows and columns of $L$.)
In SAS parlance, the type I and type II sums of squares are identical.
Some statisticians see this independence between the test and the order of the
terms within the model as an interesting aside,
while others view it as a central aspect of estimation,
one which should be emulated in other models whenever possible.
Goodnight writes
\begin{quote} For most unbalanced designs it is usually possible to test the
same set of hypotheses (estimable functions) that would have been tested if
the design had been balanced. For those designs which started out balanced,
but for which observations were lost due to external forces, there is no
reason to alter the hypothesis \cite{Goodnight78}.
\end{quote}
This is actually quite different from the rationale found in Yates
\cite{Yates34}:
Yates focused on an estimate of interest (a marginal mean) for which he
derived a test statistic, while the above is focused on the test statistic
itself.
Type I and II never coincide for continuous variables, however, and so none
of this argument applies to them.

The $L'$ matrix for the sequential sums of squares is orthogonal with respect
to $(X'X)^-$, the variance matrix of $\hat\beta$, i.e., $L'_i (X'X)^- L_j=0$
for any terms $i$ and $j$ with $i \ne j$.
One way to construct the Yates' SS, i.e., the global test that the PMM values
for all levels of a particular factor are equal, is to find an $L'$
matrix that is upper triangular, of maximal rank, has zeros off the diagonal
for unrelated terms, and is orthogonal with respect to $(Z'Z)^-$, where $Z$
is the design matrix for a balanced subset.
If there are no missing cells in the design, i.e., no combination of factors
which is present in the model statement but not in the data, then
\code{Z = unique(X)} is a direct way to create a balanced subset, with $L$ the
Cholesky decomposition of $Z'Z$.

The SAS type III definition simply carries this idea forward. The formal
definition is found in \cite{Goodnight78}:
``A set of $L$s, one for each effect in the model, is type III if each $L$
is a maximum rank hypothesis involving only the parameters of the effect
in question and parameters of effects that contain it; and each $L$ is
orthogonal to all $L$s of effects that contain the effect in question.''
The outline of an algorithm to create such an $L$ for a design that
may or may not have missing cells can be inferred from the examples of a
second SAS technical report \cite{SAS2013}.
This is the algorithm used by the \code{yates} function.
\begin{enumerate}
  \item Create a design matrix $X$ in standard form, i.e., intercept, then
    main effects, then first order interactions, etc., from left to right,
    with categorical variables coded in the same way as the SAS GLM
    procedure.
    \begin{itemize}
      \item A categorical variable with $k$ levels is
        represented by $k$ 0/1 dummy variables,
        which represent the first, second, etc.\ levels of the variable.
      \item The interaction between two categorical variables that have $k$
        and $m$ levels will be represented by $km$ 0/1 columns, and
        likewise for higher level interactions.
    \end{itemize}
  \item Create the $p$ by $p$ dependency matrix $D=(X'X)^-(X'X)$ from the
    $n$ by $p$ matrix $X$.
    \begin{itemize}
      \item $D$ will be upper triangular. If column $i$ of $X$ can be written
        as a linear combination of prior columns, the $i$th column of $D$
        will contain those coefficients; otherwise the column will be a copy
        of the identity.
      \item Sample R code is \code{D <- coef(lm(X \textasciitilde\ I(X) -1))}.
    \end{itemize}
  \item Initialize $L' = D$, and then partially orthogonalize $L$,
    as sketched below.
    \begin{itemize}
      \item If terms $i$ and $j$, $i<j$, are such that term $j$ includes
        term $i$, then make $L_i$ orthogonal to $L_j$. That is, replace
        $L_i$ with the residuals from a regression of $L_i$ on $L_j$.
        (Orthogonalize rows of $L'$ = columns of $L$.)
      \item Continuous variables or interactions involving a continuous
        variable are ignored in this step.
    \end{itemize}
\end{enumerate}
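The orthogonalization of step 3 can be expressed compactly; here is a sketch
for a single pair of terms, where \code{Li} and \code{Lj} (our names) hold
the rows of $L'$ for terms $i$ and $j$:
<<sgtt3, eval=FALSE>>=
# replace L_i by its residuals after regression on L_j
# (lm accepts a matrix response, one fit per column of t(Li))
Li <- t(resid(lm(t(Li) ~ t(Lj) + 0)))
@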
The resulting $L'$ matrix is then used to compute test statistics for all of
the terms that are categorical or interactions of categorical variables.
Terms involving continuous variables use a type II test and a completely
separate computation, but are then labeled as type III.

The SGTT code in the \code{yates} function only handles models whose
categorical variables have been coded using either the
\code{contr.treatment} or
\code{contr.SAS} options, and that admit of full rank tests.
(If there are too many terms and too little data, a factor with $k$ levels,
say, might not have the expected $k-1$ degrees of freedom test possible.)
The point of the \code{method='sgtt'} option is not to compute all cases but
to illustrate what a type III test is and is not, and to serve as an answer
to the widespread and uncritical use of such tests.

If the model/data combination has no missing cells, then the $L'$ construction
of type III tests gives exactly the same result as PMM estimates using
a factorial population followed by a test for equality of the PMM estimates,
which is in turn exactly equal to the proposal of Yates. The SAS LSM values
agree with the PMM estimates in this case as well, so it is fair to say that
the type III test is a global test for equality of a factor, adjusting for all
others.
If in addition the model has continuous variables, then the SAS LSM estimates
correspond to the PMM for the mixed \code{population='SAS'} case, and again
the type III test agrees with a test of equality for the PMM values.

If there are missing cells, however, one or more of the PMM values will not
be estimable and there is no global population based test.
A set of valid type III estimators can still be constructed,
ones which have an upper triangular matrix form, with appropriate blocks of
zeros and a particular orthogonality.
It is, however, no longer clear what connection, if any, these tests have
to a factorial population.
The upper triangular part assures that the set of tests ``covers the space''
in the way that type I tests do, and the blocks of zeros ensure that they
will be unchanged by a reordering of the terms. But what exactly do they
test?
A few more ideas are presented in an appendix.

Note that the SGTT algorithm is that of the SAS \emph{glm} type 3 procedure.
Several other SAS procedures also create output labeled as ``type 3'', which
is not necessarily the same.
The SAS phreg procedure appears to use the NSTT computation, for instance.

\subsection{NSTT}
A major problem with SAS type III computations is that almost no one knows
exactly what is being computed, nor quite how to do it.
The documentation is deficient for the glm procedure and is largely
non-existent for others.
This has led to the propagation of a false ``type 3'' algorithm in multiple
packages, which I call the not-safe-type-three (NSTT).
\begin{enumerate}
  \item Build an $X$ matrix from left to right in the
    standard order of intercept, then main effects, then 2 way interactions,
    etc.
  \item From left to right, remove any columns of $X$ that are redundant,
    i.e., can be written as a linear combination of prior columns.
    They will not be used again.
  \item Fit the model using the revised $X$ matrix.
  \item For any given term $k$, do a simple test that the
    coefficients corresponding to term $k$ are zero.
    In a model \code{y \textasciitilde\ a * b} the test for \code{b}
    compares a model
    with all the columns to one that has only those for \code{a} and
    \code{a:b} (the restricted subset of a:b remaining after step 2).
\end{enumerate}

Here is an example using the solder data:
<<nstt>>=
options(contrasts = c("contr.treatment", "contr.poly")) # default
nfit1 <- lm(skips ~ Solder*Opening + PadType, solder)
drop1(nfit1, ~Solder)
@

This shows a type III sum of squares of 389.88.
However, if a different coding is used then we get a very different SS:
it increases by over 25 fold.

<<nstt2>>=
options(contrasts = c("contr.SAS", "contr.poly"))
nfit2 <- lm(skips ~ Solder*Opening + PadType, solder)
drop1(nfit2, ~Solder)
@

The example shows a primary problem with the NSTT: the answer you get
depends on how the contrasts were coded.
For a simple 2 way interaction like the above, it turns out that the
NSTT actually tests the effect of Solder within the reference cell
for Opening; the NSTT is not a global test at all.
<<means>>=
with(solder, tapply(skips, list(Solder, Opening), mean))
@

Looking at the simple cell means shown above, it is no surprise that the
\code{contr.SAS}
fit, which uses Opening=S as the reference, will yield a large NSTT
SS, since it is a comparison of 17.4 and 5.5, while the
\code{contr.treatment} version using Opening=L as the reference has a much
smaller NSTT.
In fact, re-running a particular analysis with different reference levels for
one or more of the adjusting variables is a quick way to diagnose
probable use of the NSTT algorithm by a program.
Several R libraries that advertise type 3 computations actually use the NSTT.

The biggest problem with the NSTT is that it sometimes gives
the correct answer.
If one uses summation constraints, a form of the model
that most of us have not seen since graduate school:
\begin{align*}
  y_{ijk} &= \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon \\
  \sum_i \alpha_i & = 0 \\
  \sum_j \beta_j & = 0 \\
  \sum_{ij} \gamma_{ij} &=0
\end{align*}
then the `reference cell' for Opening is the mean Opening effect,
and the NSTT for Solder will correspond to a PMM using the factorial
population; the PMM itself is invariant to the chosen coding.

<<nstt3>>=
options(contrasts = c("contr.sum", "contr.poly"))
nfit3 <- lm(skips ~ Solder*Opening + PadType, solder)

drop1(nfit3, ~Solder)
yates(nfit1, ~Solder, population='factorial')
@

Thus our acronym of not-safe-type-three (NSTT) for this method, since
it does work if one is particularly careful.
Given the number of incorrect analyses that have arisen from
this approach,
`nonsense type 3' would also be a valid interpretation, however.

<<nstt4, echo=FALSE>>=
options(contrasts = c("contr.treatment", "contr.poly")) # restore
@

\section{Conclusion}
The population average predicted value or marginal estimate is a
very useful statistical concept.
The \code{yates} function computes these using a direct approach: create
the relevant population, get \emph{all} the predicted values, and average
them.
An advantage of this simple approach is that because predicted values do not
depend on how a model is parameterized, the results are naturally invariant
to how categorical variables are represented.
But like many good ideas in statistics, proper application requires
some thought with respect to the choice of the statistic to be averaged and
the population over which to take the expectation.
For linear models the simple linear predictor
$X\beta$ is an obvious choice of statistic,
but for other models the choice is more nuanced.
For a Cox model we prefer the restricted mean, but this is an area that
needs more research before a firm recommendation can be made.

In terms of the population choice, the population should match the
research question at hand.
The factorial population in particular has been badly overused.
This population is appropriate for the use cases
described in Yates \cite{Yates34} and Goodnight \cite{Goodnight78} that
relate to designed experiments, and the solder data is a good example.
But in observational or clinical data the factorial population result
will too often be the ``answer to a question that nobody asked''.
In the FLC data set, for instance, the factorial population has equal numbers
of subjects in every age group:
a population that no one will ever observe.
Using the result of such computations to inform medical decisions
begins to resemble
medieval debates about angels dancing on the head of a pin.
This suboptimal population choice is the primary damning criticism of
``type III'' estimates.

Because the detailed algorithm and criteria for type III tests are not well
documented, many pseudo type 3 estimates have arisen, of which the NSTT is
perhaps the most common.
Statistical software that claims to produce a ``type III'' estimate
while doing something else is indisputably wrong, even dangerous,
and should not be tolerated.

\appendix
\section{Miscellaneous notes}
None of the material below is presented with proof.
Comments to the authors that fill in any of these gaps are welcome.

\subsection{Subsampling}
One reaction that I have seen to the solder data is to instead analyse a
balanced subset of the data, throwing away extra observations, because
then ``the result is simple to understand''.
Let's take this desire at face value, but then add statistics to it: rerun
the two steps of ``select a subset'' and ``compute the balanced 2-way
solution'' multiple times, with different random subsets each time.
In fact, we could be more compulsive and tabulate the solution for \emph{all}
the possible balanced subsets, and then take the average.
This will give the Yates estimate.
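
A sketch of that resampling idea for a two-way layout (the names
\code{cells}, \code{nmin}, and \code{onerun} are ours, and the cells are
assumed to each contain more than one observation):
<<subsample, eval=FALSE>>=
cells <- split(seq_len(nrow(solder)), solder[c("Opening", "Solder")])
nmin  <- min(sapply(cells, length))   # common size for every cell
onerun <- function() {
    sub <- unlist(lapply(cells, sample, size=nmin))
    coef(lm(skips ~ Opening + Solder, data=solder, subset=sub))
}
rowMeans(replicate(100, onerun()))    # average over many reruns
@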

\subsection{Type 3}
Based on the $L$ matrix argument above, one natural definition of
type III contrasts would be an upper triangular matrix, with appropriate
zeros, whose terms are orthogonal with respect to $(Z'Z)^-$,
where $Z$ is the design matrix for a balanced data set.
The SAS algorithm creates contrasts in an expanded basis,
and these contrasts are instead made
orthogonal with respect to the identity matrix $I$.
Why does this work?
Multiple example cases have shown that in data with no missing cells the
resulting sum of squares agrees with the Yates' SS, but a formal proof of
equivalence has been elusive.

It would be preferable to use an algorithm that does not require rebuilding
the $X$ matrix in the extended basis set used by the SAS approach, but
instead directly uses the $X$ matrix of the user, as that would be
simpler and more portable code.
For the case of no missing
cells the Cholesky decomposition of $Z'Z$, for instance, satisfies this
requirement. The original fit
may use treatment, SAS, Helmert, or any other dummy variable coding
for the factors without affecting the computation or result.
We have not been able to discover an approach that replicates the SAS
algorithm when there are missing cells, however.

Is there a unique set of contrasts that fulfills the type III requirements
for any given model and data set pair, or many? How might they be
constructed?
Since tests of $L'\beta =0$ and $cL'\beta=0$ are equivalent for any
constant $c$, without loss of generality we can assume that the diagonal of
$L$ is 0 or 1.

\subsection{Additive models}
Consider the free light chain data and a linear model containing age group
and sex as predictors.
We have spoken of factorial, data, and external populations, each of which
leads to a global test comparing the PMM estimates for males versus females.
What population would give the smallest variance for that comparison?
(A question that perhaps only an academic statistician could love.)

Here are the sample sizes of each cell.
<<sizes>>=
with(flchain, table(sex, age2))
@
The optimal population for comparing male to female will have population
sizes in each age group
that are proportional to the harmonic means of the column counts:
$(1/n_{11} + 1/n_{21})^{-1}$ for age group 1, and so on. These are the
familiar denominators found in the t-test.
A PMM built using these population weights will yield the
same sex difference and test as a simple additive model with sex and age
group.
Linear regression is, after all, the minimum variance unbiased estimate.
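
A quick computation of those weights (a sketch; \code{hmean} is our name):
<<hweights, eval=FALSE>>=
n <- with(flchain, table(sex, age2))
hmean <- 1 / (1/n[1,] + 1/n[2,])   # harmonic-mean weight per age group
round(hmean / sum(hmean), 3)       # as population proportions
@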

\bibliographystyle{plain}
\bibliography{refer}
\end{document}