\documentclass[11pt]{article}
\usepackage{Sweave}
\usepackage{amsmath}
\addtolength{\textwidth}{1in}
\addtolength{\oddsidemargin}{-.5in}
\setlength{\evensidemargin}{\oddsidemargin}
%\VignetteIndexEntry{Concordance}
\SweaveOpts{keep.source=TRUE, fig=FALSE}
% Ross Ihaka suggestions
\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
% I had been putting figures in the figures/ directory, but the standard
% R build script does not copy it and then R CMD check fails
\SweaveOpts{prefix.string=compete,width=6,height=4}
\newcommand{\myfig}[1]{\includegraphics[height=!, width=\textwidth]
{compete-#1.pdf}}
\setkeys{Gin}{width=\textwidth}
<<echo=FALSE>>=
options(continue=" ", width=60)
options(SweaveHooks=list(fig=function() par(mar=c(4.1, 4.1, .3, 1.1))))
pdf.options(pointsize=10) #text in graph about the same as regular text
options(contrasts=c("contr.treatment", "contr.poly")) #ensure default
#require("survival")
#library(survival)
library(survival)
library(splines)
@
\title{Concordance}
\author{Terry Therneau, Elizabeth Atkinson}
\newcommand{\code}[1]{\texttt{#1}}
\begin{document}
\maketitle
\section{The concordance statistic}
\subsection{Overview}
Let $(x_i, y_i)$ be paired data values, say two measurements on a set of
subjects, or observed data $y$ and predictions $x$ for those values from a
statistical model.
A pair of observations ($i$, $j$) is considered concordant if the prediction
and the data go in the same direction, i.e.,
$(x_i > x_j, y_i > y_j)$ or $(x_i < x_j, y_i < y_j)$.
The concordance statistic $C$ is defined as the fraction of concordant pairs,
and is an estimate of $P(x_i > x_j | y_i > y_j)$, or equivalently
$P(y_i>y_j | x_i > x_j)$.
The first form might make more sense if, for instance, one took a list of
elected officials, say, and looked back at local newspapers to see how many of
the election contests had been predicted correctly
(probability the prediction is correct, given the outcome).
The second matches our thought process if we start with the predictions,
before election day, and then count up the fraction correct after the
contests have been decided. When looking at a statistical model's predictions
either form can be argued; they give the same result.
One wrinkle is what to do with ties in either $x$ or $y$. Such pairs
can be ignored in the count (treated as incomparable), treated as discordant,
or given a score of 1/2.
Let $c, d, t_x, t_y$ and $t_{xy}$ be a count of the pairs that are concordant,
discordant, tied on the predictor $x$ (but not $y$),
tied on $y$ (but not $x$), and tied on both. Then
\begin{align}
\tau_a &= \frac{c-d}{c+d + t_x + t_y + t_{xy}} \label{tau.a} \\
\tau_b &= \frac{c-d}{\sqrt{(c+d+t_x)(c + d + t_y)}} \label{tau.b} \\
\gamma &= \frac{c-d}{c+d} \label{gamma} \\
D &= \frac{c-d}{c + d + t_x} \label{somer} \\
C &= (D+1)/2 = \frac{c+ t_x/2}{c + d + t_x} \label{dC}
\end{align}
\begin{itemize}
\item Kendall's tau-a \eqref{tau.a} is the most conservative: ties shrink
the value towards zero.
\item The Goodman-Kruskal $\gamma$ statistic \eqref{gamma} ignores ties in
either $y$ or $x$.
\item Somers' $D$ \eqref{somer} treats ties in $y$ as incomparable;
pairs that are tied in $x$ (but not $y$) score as 1/2, as we can see from
equation \eqref{dC} which is the concordance statistic.
\item Kendall's tau-b \eqref{tau.b} can be viewed as a version of Somers' $D$
that is symmetric in $x$ and $y$.
\end{itemize}
The first four statistics range from $-1$ to 1, similar to the correlation coefficient
$r$.
The concordance \eqref{dC} ranges from 0 to 1, which matches the scale of a probability.
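As a concrete illustration of definitions \eqref{tau.a}--\eqref{dC}, the short
sketch below (not part of the package) counts pairs by brute force for the
built-in \code{anscombe} data, which reappears in the examples that follow,
and then evaluates each measure from the five counts.
<<paircount, eval=FALSE>>=
# Brute-force pair counts for two continuous vectors (illustration only)
paircount <- function(x, y) {
    cnt <- c(concordant=0, discordant=0, tied.x=0, tied.y=0, tied.xy=0)
    n <- length(x)
    for (i in 1:(n-1)) {
        for (j in (i+1):n) {
            if (x[i] == x[j] && y[i] == y[j]) cnt[5] <- cnt[5] + 1
            else if (x[i] == x[j])            cnt[3] <- cnt[3] + 1
            else if (y[i] == y[j])            cnt[4] <- cnt[4] + 1
            else if (sign(x[i]-x[j]) == sign(y[i]-y[j])) cnt[1] <- cnt[1] + 1
            else                              cnt[2] <- cnt[2] + 1
        }
    }
    cnt
}
cnt <- paircount(anscombe$x1, anscombe$y2)
with(as.list(cnt), c(
    tau.a    = (concordant - discordant)/sum(cnt),
    tau.b    = (concordant - discordant)/
               sqrt((concordant + discordant + tied.x) *
                    (concordant + discordant + tied.y)),
    gamma    = (concordant - discordant)/(concordant + discordant),
    somers.d = (concordant - discordant)/(concordant + discordant + tied.x),
    C        = (concordant + tied.x/2)/(concordant + discordant + tied.x)))
@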
Why is $C$ defined using Somers' $D$ rather than one of the other three?
\begin{itemize}
\item If $y$ is a 0/1 variable, then $C$ = AUROC, the area under the
receiver operating curve, which is well established for binary outcomes.
(Proving this simple theorem is harder than it looks, but the result is
well known.)
\item For survival data, this choice will agree with Harrell's $C$.
More importantly, as we will see below, it has strong connections
to standard tests for equality of survival curves.
\end{itemize}
The concordance has a natural interpretation as an experiment: present pairs
of subjects one at a time to the physician, statistical model, or some other
oracle, and count the number of correct
predictions. Pairs that have the same outcome $y_i=y_j$ are not put forward
for scoring, since they do not help discriminate a good oracle from a bad one.
If the oracle cannot decide then a random choice is made.
This leads to $c + t_x/2$ correct selections out of $c + d + t_x$ choices.
This hypothetical experiment gives a baseline insight into the concordance.
A value of 1/2 corresponds to using a random guess for each subject.
Values of .5--.55 are not very impressive, since
the ordering for some pairs of subjects will be obvious, and someone with
almost no medical knowledge could do
that well by marking these easy pairs and using a coin flip for the rest.
Values of less than 1/2 are possible --- some stock market analysts come to
mind.
\subsection{Simple examples}
The \code{concordance} function accepts simple data or models as input.
For the latter, it assesses the concordance between $y$ and the model
prediction $\hat y$. Here is a set of simple examples.
\subsubsection*{Direct}
<<examples1a>>=
# x1 and y2 are both continuous variables
concordance(y2 ~ x1, data= anscombe)
@
\subsubsection*{Logistic regression}
<<examples1b>>=
# Fisher's iris data
fit1 <- glm(Species=="versicolor" ~ ., family=binomial, data=iris)
concordance(fit1) # equivalent to an AUC
@
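The AUROC connection mentioned above can be checked by hand via the
Mann--Whitney identity ${\rm AUC} = U/(n_1 n_0)$, computed from the ranks of
the predicted probabilities; this sketch is not part of the package output and
should simply reproduce the concordance value printed above.
<<aucheck, eval=FALSE>>=
# Hand computation of the AUC via the Mann-Whitney rank-sum identity
y <- as.numeric(iris$Species == "versicolor")
phat <- predict(fit1, type = "response")
n1 <- sum(y == 1); n0 <- sum(y == 0)
(sum(rank(phat)[y == 1]) - n1*(n1 + 1)/2) / (n1 * n0)
@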
\subsubsection*{Linear regression}
<<examples1c>>=
# Anscombe data (all variables are continuous)
fit2 <- lm(y2 ~ x1 + x4, data= anscombe)
concordance(fit2) # C
sqrt(summary(fit2)$r.squared) # R
@
\subsubsection*{Parametric survival}
<<examples1d>>=
# parametric survival
fit3 <- survreg(Surv(time, status) ~ karno + age + trt, data=veteran)
concordance(fit3)
@
\subsubsection*{Cox regression}
<<examples1e>>=
# 3 Cox models
fit4 <- coxph(Surv(time, status) ~ karno + age + trt, data=veteran)
fit5 <- update(fit4, . ~ . + celltype)
fit6 <- update(fit5, . ~ . + prior)
ctest <- concordance(fit4, fit5, fit6)
ctest
@
As shown in the last example, the concordance for multiple fits can
be obtained from a single call. The variance-covariance
matrix for all three concordance values is available using
\code{vcov(ctest)};
this is used in Section \ref{sec:multiple} to formally test the equality of
two concordance values.
The above also shows that addition of another variable to a fitted
model can decrease the concordance.
The larger model will have higher correlation between the linear predictor
$X \beta$ and the response $y$, by definition, but this does not guarantee
a greater association between ${\rm rank}(X \beta)$ and ${\rm rank}(y)$.
\subsection{Efficient computation}
For continuous data without ties, the concordance involves a comparison
between all $n(n-1)/2$ pairs.
This $O(n^2)$ computation will become painfully slow for large datasets.
To improve this, first order the data by increasing $y$ values, leading to
\begin{align}
c-d &= \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n {\rm sign}(y_i-y_j)
{\rm sign}(x_i-x_j) \label{cd1} \\
&= \sum_{i=n}^1 \left[ \sum_{y_j > y_i} {\rm sign}(x_i - x_j) \right]
\label{cd2}
\end{align}
\begin{figure}
<<balance, fig=TRUE, echo=FALSE>>=
# The balance figure for the concordance document
btree <- function(n) {
    tfun <- function(n, id, power) {
        if (n==1) id
        else if (n==2) c(2*id, id)
        else if (n==3) c(2*id, id, 2*id+1)
        else {
            nleft <- if (n== power*2) power else min(power-1, n-power/2)
            c(tfun(nleft, 2*id, power/2), id,
              tfun(n-(nleft+1), 2*id +1, power/2))
        }
    }
    tfun(n, 1, 2^(floor(logb(n-1,2))))
}
temp <- c(1,2,6,8, 9,12,14, 18, 19, 21, 23, 24, 27)
indx <- btree(13)
xpos <- 1:15
xpos[4:7] <- tapply(xpos[8:15], rep(1:4, each=2), mean)
xpos[2:3] <- tapply(xpos[4:7], rep(1:2, each=2), mean)
xpos[1] <- mean(xpos[2:3])
ypos <- rep(4:1, c(1,2,4,8))
oldpar <- par(mar=c(1,1,1,1))
plot(xpos, ypos, type='n', xaxt='n', yaxt='n', bty='n',
     xlab="", ylab="")
temp2 <- c(13,7,5,3,3,3,1,1,1,1,1,1,1)
#text(xpos[indx], ypos[indx], paste(temp, " (", temp2[indx], ")", sep=''))
text(xpos[indx], ypos[indx], as.character(temp))
delta <- .1
for (i in 1:6) {
    segments(xpos[i]-delta, ypos[i]-delta,
             xpos[2*i]+delta, ypos[2*i]+delta)
    segments(xpos[i]+delta, ypos[i]-delta,
             xpos[2*i+1]-delta, ypos[2*i+1] +delta)
}
par(oldpar)
@
\caption{A balanced binary tree with 13 elements.}
\label{btree}
\end{figure}
The first equation is the simple definition of concordance as a sum over all
$n^2$ possible pairs, where ${\rm sign}$ is the R \code{sign} function.
Equation \eqref{cd2} makes the obvious simplification of counting each pair
only once, by taking advantage of the fact that $y$ is sorted.
The key coding insight is to store the $x$ values as part of a balanced
binary tree; an example of such a tree is shown in Figure \ref{btree}.
The basic algorithm is to:
\begin{enumerate}
\item Create a balanced binary tree for all $n$ $x_i$ values. This can be
done in $O(n \log_2(n))$ steps. The final tree will have a node for each
unique $x$ value. Each node contains the value, along with counts for the
number of observations at that value, for left hand children, and
for right hand children. Initialize all the counts to 0.
\item Go through the data from largest $y$ to smallest $y$.
\begin{enumerate}
\item For this new $(x_i, y_i)$ pair,
count the number of $x_j$ values in the
tree that are smaller or larger than this element, and increment
the overall counts of concordant, discordant, and tied. This can be
done in $\log_2(n)$ steps (proceed down from the top).
\item Add this observation to the tree.
Each addition will update the count for its node, then
``walk up'' the tree updating child counts of the parent, grandparent,
etc.
\end{enumerate}
\end{enumerate}
If there are tied $y$ values, do all the counts for a set of ties first, and
then add their $x$ values to the tree.
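The sketch below illustrates the $O(n \log_2 n)$ counting idea, using a
Fenwick (binary indexed) tree over the ranks of $x$ rather than the balanced
tree described above; it handles only untied data and is meant purely as an
illustration, not as the package's implementation.
<<fenwick, eval=FALSE>>=
# O(n log n) count of concordant/discordant pairs for untied data,
# using a Fenwick tree over the ranks of x (illustration only)
count_pairs <- function(x, y) {
    stopifnot(!anyDuplicated(x), !anyDuplicated(y))
    n <- length(x)
    xr <- rank(x)
    tree <- numeric(n)                       # Fenwick tree of counts
    add <- function(i) {
        while (i <= n) { tree[i] <<- tree[i] + 1; i <- i + bitwAnd(i, -i) }
    }
    cum <- function(i) {                     # number of stored ranks <= i
        s <- 0
        while (i > 0) { s <- s + tree[i]; i <- i - bitwAnd(i, -i) }
        s
    }
    conc <- disc <- stored <- 0
    for (i in order(y, decreasing = TRUE)) { # tree holds x_j for all y_j > y_i
        below <- cum(xr[i] - 1)              # x_j < x_i: discordant pairs
        disc  <- disc + below
        conc  <- conc + (stored - below)     # x_j > x_i: concordant pairs
        add(xr[i]); stored <- stored + 1
    }
    c(concordant = conc, discordant = disc)
}
count_pairs(anscombe$x1, anscombe$y2)  # compare with the brute-force counts
@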
\section{Concordance and survival data}
As stated above, for continuous data without ties, the concordance involves a comparison
between all $n(n-1)/2$ pairs.
If there are tied $y$ values, however, those pairs are ignored.
For example, in the iris data above $C= 4129/(4129 + 871)$, the
6175 pairs with tied response values play no role.
For survival data, this set of incomparables is extended to include those
pairs for which the time ordering is ambiguous.
For instance, assume that $y_i$ is censored at time 10 and $y_j$ is an
event (or censor) at time 20.
Subject $i$ may or may not survive longer than subject $j$,
and so it is not possible to tell if a rule has ranked them correctly
or not.
Note that if $y_i$ is censored at time
10 and $y_j$ is an event at time 10 then $y_i > y_j$.
This same convention is followed for all the survival models, and
agrees with common clinical data, i.e., a patient censored on day 100
was observed to be still alive on day 100, so their death time must be
strictly greater than 100.
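A toy example with three hypothetical observations makes the convention
concrete: only two of the three possible pairs are comparable. (The sketch
assumes the pair counts are returned in the \code{count} component, as
printed.)
<<comparable, eval=FALSE>>=
# A is censored at 10, B has an event at 20, C has an event at 10.
# Pair (A,B): not comparable -- A may or may not outlive B.
# Pair (A,C): comparable, with A's survival counted as greater than C's.
# Pair (B,C): comparable.
tdat <- data.frame(time   = c(10, 20, 10),
                   status = c( 0,  1,  1),
                   x      = c( 1,  3,  2))
concordance(Surv(time, status) ~ x, data = tdat)$count
@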
A second, smaller issue for survival data is to recognize that we
desire to assess the concordance between an observed survival time $y_i$
and a predicted survival time $\hat y_i$.
This can be done without creating an explicit survival curve from the
model: for a \code{survreg} fit, for instance,
$\eta_1 > \eta_2$ implies $S(t; x_1) > S(t; x_2)$ for all $t$, where
$\eta_1$ and $\eta_2$ are the respective linear predictors $x_1\beta$ and
$x_2\beta$. Not creating the survival curves saves considerable computation
time.
For a Cox model $\eta_1 > \eta_2$ implies $S(t; x_1) < S(t; x_2)$; the order is
reversed. This is because the Cox model is a hazard model, and higher hazard
implies a shorter survival.
When a \code{coxph} object is used as the argument to
\code{concordance} the reversal is handled automatically.
However, when a concordance is computed ``by hand'' the user needs to be aware
of this, as seen in the example below.
In that case the concordance routine does not
know that the prediction came from a Cox model; without intervention the
counts of concordant and discordant pairs would be swapped, giving $C < .5$.
The user is responsible for adding the \code{reverse = TRUE} argument, as shown.
<<>>=
# Concordance for a coxph object
concordance(fit4)
# Concordance using predictions from a Cox model
concordance(Surv(time, status) ~ predict(fit4), data = veteran, reverse = TRUE)
@
\subsubsection*{Stratified models}
Stratified models present a further variation: if observations $i$ and
$j$ are in different strata, the survival curves for those strata might
cross; $S(t; x_i)$ and $S(t; x_j)$ no longer have a simple ordering.
A solution is to use a stratified concordance, which compares all pairs
\emph{within} each stratum, and then adds up the result.
In the example below there is a separate count for each stratum; the
final concordance is based on the column sums.
The same issue, and solution, applies to stratified \code{survreg} models.
(The fact that strata names are not retained as labels for the counts
matrix is a deficiency in the routine.)
<<veteran2>>=
fit4b <- coxph(formula = Surv(time, status) ~ karno + age + trt +
strata(celltype), data = veteran)
concordance(fit4b)
table(veteran$celltype)
@
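As a check of the column-sum statement, the overall concordance can be rebuilt
by hand from the per-stratum counts; this sketch assumes the counts are
returned in the \code{count} component with columns ordered as in the printed
output (concordant, discordant, tied.x, tied.y, tied.xy).
<<strsum, eval=FALSE>>=
# Rebuild the overall concordance from the per-stratum counts
tot <- colSums(concordance(fit4b)$count)
(tot[1] + tot[3]/2) / sum(tot[1:3])  # should match coef(concordance(fit4b))
@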
\subsection{Time-weighted concordance}
Look again at equation \eqref{cd2}, rewriting it for survival with $t_i$ as the
response in order to more closely match standard notation for survival.
Watson and Therneau \cite{Watson15} show that this can be further rewritten
as
\begin{align}
c-d &= \sum_{i=1}^n \delta_i \left[ \sum_{t_j > t_i} {\rm sign}(x_i- x_j) \right]
\nonumber \\
&= \sum_{i=1}^n \delta_i \left[\sum_{t_j \ge t_i} {\rm sign}(x_i - x_j) \right]
\label{cd4} \\
&= 2 \sum_i \delta_i n(t_i) \left[ r_i(t_i) - \overline r \right] \label{cscore}
\end{align}
Here $\delta$ is 0 = censored, 1 = uncensored; if $t_i$ is censored then all
other observations with $t_j \ge t_i$ are not comparable.
Equation \eqref{cd4} rewrites the inner term as a sum over all subjects in the
risk set at time $t_i$, a familiar concept in survival models.
In equation \eqref{cscore} the inner term has been rewritten with
$n(t)$ as the number of subjects still at risk at time $t$, and
$r_i(t)$ the rank of $x_i$ among all those still at risk at time $t$,
where ranks are defined such that $0 \le r \le 1$, and $\overline r$ the
mean of those ranks.
(Proofs for \eqref{cd4} and \eqref{cscore} have been omitted.)
It turns out that
equation \eqref{cscore} is exactly the score statistic for a Cox model
with a single time-dependent covariate $n(t) r(t)$.
One immediate consequence of this connection is a straightforward definition
of concordance for a risk score containing time dependent covariates. Since
the Cox model score statistic is well defined for time dependent covariates,
the concordance is also well defined for a time-dependent risk score: at each
event time the current risk score of the subject who failed is compared to the
current (time dependent) scores of all those still at risk.
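As an illustration, the sketch below uses the \code{heart} data, which contains
the time-dependent transplant indicator in counting process form; assuming that
the \code{concordance} method accepts (start, stop] data, as the argument above
implies, the risk score used at each event time is the current one.
<<tdconc, eval=FALSE>>=
# Concordance for a model with a time-dependent covariate (counting
# process data); the current risk score is used at each event time
hfit <- coxph(Surv(start, stop, event) ~ transplant + age + year, data = heart)
concordance(hfit)
@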
A deeper consequence of the equivalence between the concordance and
the Cox model is a link to alternate weightings of the
risk scores.
If the original Cox model has a single 0/1 treatment covariate then
equation \eqref{cscore} exactly matches the numerator of the
Gehan-Wilcoxon statistic; replacing $n(t)$ with weights of 1
will yield the log-rank statistic.
There is a deep literature with respect to the ``best'' weight for survival
tests,
and we can apply the same historical arguments
to the concordance as well. We will point out four of interest:
\begin{itemize}
\item Peto and Peto \cite{Peto72} point out that
$n(t) \approx n(0)S(t-)G(t-)$, where $S$
is the survival distribution and $G$ the censoring distribution.
They argue that $S(t-)$ would be a better weight since $G$ may have
features that are irrelevant to the question being tested.
For a particular dataset, Prentice \cite{Prentice79} later showed
that these concerns were indeed justified, and
most software now uses the Peto-Wilcoxon variant.
\item Tarone and Ware point out that weights of $n(t)$ and 1 give the
Gehan-Wilcoxon and log-rank tests, respectively, and suggest $\sqrt{n(t)}$
as an intermediate value.
\item Schemper et al
\cite{Schemper09} argue for
a weight of $S(t)/G(t)$ in the Cox model.
When proportional hazards does not hold the coefficient from the Cox model
is an ``average'' hazard ratio, and they show that using $S/G$ leads to
a value that remains interpretable in terms of an underlying population
model.
The same argument would also apply to the concordance, since our
goal is an ``assumption free'' assessment of association.
\item Uno et al \cite{Uno11} recommend the use of $n/G^2$ as a weight based
on a consistency argument. If we assume that the concordance value that
would be obtained after full follow-up of all subjects (no censoring) is
the ``right'' one, and proportional hazards does not hold, then the
standard concordance will not consistently estimate this target quantity
when there is censoring.
\end{itemize}
In practice, weights need to be based on left continuous versions of the
survival curves $S(t-)$ and $G(t-)$, and extra care needs to be exercised
in computation of $G$. Consider the \code{aml} dataset as an
example; the first few lines of the relevant survival curve are shown below.
<<amlexample>>=
afit <- survfit(Surv(time, status) ~1, aml, se = FALSE)
summary(afit, times = afit$time[1:6], censor = TRUE)
@
The first censoring is at time 13, a time which also has a death.
For the event at time 13, however, no one has yet been censored, and no
``censoring correction'' should be applied at this point.
Weights at any time $t$, like the number at risk at time $t$, must be the values
in effect just \emph{before} the event. For a weight based
on $S$, the proper value at time 13 is .739, dropping to .696 just
after time 13.
The value of $G(t-)$ at time 13 will be 1.
Just after time 13 it will be 15/16, not the value of 16/17
that would be obtained from \code{survfit(Surv(time, 1-status) ~1)}.
Censorings happen after deaths, so there are only 16 at risk for the censoring
event.
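The naive computation referred to above is shown below; just after time 13 it
reports $G = 16/17 \approx .94$ rather than the 15/16 obtained when censorings
are placed after the death.
<<naiveG, eval=FALSE>>=
# The naive reversed-status fit processes the censoring at time 13 before
# the death at time 13, giving 16/17 instead of 15/16
gnaive <- survfit(Surv(time, 1 - status) ~ 1, data = aml)
summary(gnaive, times = c(12, 13, 16))
@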
Based on the Peto and Peto argument that $n(t) \approx n(0)S(t-)G(t-)$,
we might expect the Schemper and Uno weights to be similar.
In testing, we discovered to our surprise that if computations are
done carefully as per above, then $n(t) = n(0)S(t-)G(t-)$ exactly.
As a consequence, the Schemper and Uno weights are also identical.
The \code{timewt} option in the \code{concordance} function allows
you to modify weights for the concordance.
The options are \code{n}, \code{S}, \code{S/G}, \code{n/G2}, and \code{I},
the last of which gives equal weight to each event time; this corresponds
to a log-rank test.
(We do not recommend this choice, nor are we aware of any literature that does,
but include it for completeness.)
For non-survival data all the weights become identical.
The figure below shows the first three weights for the colon cancer
dataset.
The data are from a clinical trial of 929 subjects, with 3 years of enrollment
followed by 5 more years of follow-up.
Since almost no one is lost to follow-up in the first 5 years, the choices
are nearly identical over that time.
From 5 to 8 years, $S(t)$ continues its steady decline, $n(t)$ plummets due
to administrative censoring, and $S/G$ explodes.
Even with these changes in weights, the concordance values are all very similar.
<<tmwt, fig=TRUE>>=
colonfit <- coxph(Surv(time, status) ~ rx + nodes + extent, data = colon,
                  subset = (etype == 2)) # death only
cord1 <- concordance(colonfit, timewt="n", ranks=TRUE)
cord2 <- concordance(colonfit, timewt="S", ranks=TRUE)
cord3 <- concordance(colonfit, timewt="S/G", ranks=TRUE)
cord4 <- concordance(colonfit, timewt="n/G2", ranks=TRUE)
temp <- c("n(t)"= coef(cord1), S=coef(cord2), "S/G"= coef(cord3),
          "n/G2"= coef(cord4))
round(temp,5) # 4 different concordance estimates

# Plot the weights over time using the first 3 approaches
matplot(cord1$ranks$time/365.25, cbind(cord1$ranks$timewt,
                                       cord2$ranks$timewt,
                                       cord3$ranks$timewt),
        type= "l", lwd=2, col=c(1,2,4),
        xlab="Years since enrollment", ylab="Weight")
legend(1, 3000, c("n(t)", "nS(t-)", "nS(t-)/G(t-)"), lwd=2,
       col=c(1,2,4), lty=1:3, bty="n")

# Note that n/G2 and S/G are identical
all.equal(cord3$ranks$timewt, cord4$ranks$timewt)
@
\subsubsection*{When do weights matter?}
When might the concordance differ between weight functions?
In order for $S/G$ weights to show an important difference we need to
have two conditions.
\begin{itemize}
\item First, sufficient censoring that the two weights differ for a reasonable
fraction of the data, that is, $G(t)$ is low and $S(t)$ has not flattened
(deaths are still occurring).
\item Second, per the arguments in Schemper and in Uno, the presence of
non-proportional hazards in the fit.
\end{itemize}
\begin{figure}
<<manycurve, fig=TRUE, echo=FALSE>>=
duo <- function(time, status, name, conf.int=FALSE) {
    sfit <- survfit(Surv(time, status) ~1)
    gfit <- survfit(Surv(time, max(status)-status) ~1)
    plot(sfit, conf.int=conf.int, xlab=name, lwd=2)
    lines(gfit, col=2, lwd=2, conf.int = conf.int)
}
oldpar <- par(mfrow=c(3,3), mar=c(5,5,1,1))
with(subset(colon, etype==1), duo(time/365.25, status, "NCCTG colon cancer"))
duo(flchain$futime/365.25, flchain$death, "Free light chain")
duo(kidney$time/12, kidney$status, "McGilchrist kidney")
duo(lung$time/365.25, lung$status, "advanced lung cancer")
duo(mgus2$futime/12, mgus2$death, "MGUS")
duo(nafld1$futime/365.25, nafld1$status, "NAFLD")
duo(pbc$time/365.25, pmin(pbc$status,1), "PBC")
with(rotterdam, duo(pmin(rtime, dtime)/365.25, pmax(recur, death), "Rotterdam"))

nfit <- coxph(Surv(futime/365.25, status) ~ age + male, data = nafld1)
znfit <- cox.zph(nfit, transform = 'identity')
plot(znfit[1], resid = FALSE)
par(oldpar)
@
\caption{Survival (black) and censoring (red) curves for 8 datasets found
in the survival package. The final panel shows a proportional hazards
evaluation for the age variable, in a fit of age + male to the NAFLD
data.}
\label{surv8}
\end{figure}
Figure \ref{surv8} shows survival and censoring curves for 8 different
datasets found in the survival package.
Based on these, the dataset with the greatest potential for an $S/G$ difference
is the NAFLD data; that dataset also has some early non-proportionality for age.
The code below computes the concordance under each of the four weightings.
Surprisingly, the values are still very similar; the Harrell ($n$) and
Uno ($n/G^2$) weightings differ only in the second decimal place.
<<nafld1>>=
nfit <- coxph(Surv(futime/365.25, status) ~ age + male, data = nafld1)
ncord1 <- concordance(nfit, timewt = "n")
ncord2 <- concordance(nfit, timewt = "S")
ncord3 <- concordance(nfit, timewt = "S/G")
ncord4 <- concordance(nfit, timewt = "n/G2")
temp <- c(n = coef(ncord1), S = coef(ncord2),
"S/G" = coef(ncord3), "n/G2" = coef(ncord4))
round(temp,6)
@
The \code{concordance} function provides a \code{ranks=TRUE} argument which
can be used for further exploration of the weights.
If set, the output will include a
data frame with one row for each event, containing the time
point, the relative rank in the risk set ($r_i - \overline r$), the case weight
for the observation, and the time weight at that time point.
The relative ranks are comparable to Schoenfeld residuals,
and their weighted sum will equal $c -d$.
We can plot these ranks over time and add a smooth curve.
Figure \ref{vetfit}, based on the veteran dataset, shows a
precipitous drop to 0.
The veterans' lung cancer data is perhaps an extreme of this pattern: due
to the very rapid progression of disease, baseline measurements soon lose
their meaning.
<<rankresid, eval=FALSE>>=
# code for Figure 3
# pick a dataset with a smaller number of points, and non PH
vfit <- coxph(Surv(time/365.25, status) ~ age + karno, data = veteran)
temp <- concordance(vfit, ranks=TRUE)$rank
# Two outliers at 999 days = 2.7 years stretch the axis too far
plot(rank ~ time, data=temp, xlim=c(0,1.6),
xlab="Years", ylab="Rank residual")
lines(lowess(temp$time, temp$rank, iter=1), lwd=2, col=2)
abline(0, 0, lty=3)
@
\begin{figure}
<<rankresid2, echo=FALSE, fig=TRUE>>=
<<rankresid>>
@
\caption{Schoenfeld residuals for the scaled ranks, from a fit to
the veteran dataset.}
\label{vetfit}
\end{figure}
\subsection{Restricted time range}
A first consideration in any model assessment is to stop and
ponder what a ``good fit'' means.
This is far too often ignored in the rush to computation.
Two wonderful papers in this regard are Korn and Simon \cite{Korn90} and
Altman and Royston \cite{Altman00}.
The title of the second makes its purpose clear: ``What do we mean by validating
a prognostic model?''
Korn and Simon, for instance,
point out the importance of restricting the range of comparison, a point
echoed by Altman and Royston.
Though a risk model can be used for long-range prediction, in actual patient
practice that is very often not what is needed; the model should be
evaluated over the time range in which it will actually be used.
For example, for a patient who returns every 4 years, a 10 year prediction will
never actually be put to the test; predictions will be updated every 4 years
based on the most current information.
If the model is to be used for ongoing patient management we will want to
restrict attention to a 4 year horizon, with respect to evaluating its utility.
(A 10 year Framingham heart risk might still be useful in \emph{convincing} a
subject to stop smoking, however.)
We sometimes receive push-back on this, with the argument that one should use
all the data.
We disagree, and think that the target of the validation is critical.
Predictions become less
accurate the further out we reach in time; this is true for everything from
weather forecasts to the stock market, and survival models are not immune.
Reaching too far into the future may return an overly pessimistic value of
$C$.
Another reason for using an upper limit is that $1/G$ can become
unstable as the sample size becomes small (large jumps in the KM), or
unreasonably large as $G$ approaches 0. Most authors suggest an upper limit
for this purely technical reason.
<<rotterdam, fig=TRUE, eval=FALSE>>=
# recurrence free survival = earlier of recurrence and death
rdata <- rotterdam
rdata$rfs <- with(rdata, ifelse(recur==1, 1, death))
rdata$rfstime <- with(rdata, ifelse(recur==1, rtime, dtime))/ 365.25
rfit <- coxph(Surv(rfstime, rfs) ~ age + meno + grade + pspline(nodes), rdata)
ctemp <- matrix(0, 100, 2) # concordance and std err
ctime <- seq(.1, 10, length=100)
for (i in 1:100) {
    temp <- concordance(rfit, ymax=ctime[i])
    ctemp[i,] <- c(temp$concordance, sqrt(temp$var))
}
yhat <- ctemp[,1] + outer(ctemp[,2], c(0, -1.96, 1.96), '*')
matplot(ctime, yhat, type='l', lty=c(1,2,2), lwd=c(2,1,1), col=1,
        xlab="Upper cutoff", ylab="C", ylim=c(0.5,1))
@
Argument could also be made for a lower limit, though this would be uncommon
for censored data. Many laboratory values, for instance, treat all results
less than some threshold as identical.
However, the ability to implement a lower limit correctly is constrained by
censoring. Say for instance that there were values of 5, 8+, and 9, and a
lower limit of 10 were chosen. The approach used for non-censored data
is to treat 5 and 9 as tied values, but this logic does not
correctly extend to the censored
value. This issue and possible solutions will be discussed more fully in
the external validation vignette.
\subsection{Synthetic $C$}
As another method of addressing censoring,
G\"{o}nen and Heller \cite{Goen05} show that if the statistical model is
correct, and if proportional hazards holds, then for any pair of
covariate vectors
$$
P(y_i > y_j) = \frac{1}{1 + e^{\eta_j - \eta_i}}
$$
They then order the $\eta$ values from a fitted model, and take an average
over all $n(n-1)/2$ ordered pairs.
The authors argue that this is an estimate that is independent of censoring, and
therefore preferable to Harrell's C. (The estimate can be obtained by using the
\code{royston} function.)
The biggest problem with this approach is that it gives an estimate of
concordance under an assumption that the model is \emph{exactly correct}.
Our goal, rather, is to assess how \emph{well} the model performs,
for our needs, knowing that it will be imperfect.
The G and H formula answers a question that we did not ask, over a time range
$(0, \infty)$ which is not of interest.
The calculation is also $O(n^2)$ so will be slow for large sample sizes.
\subsection{Picking a weight}
So, which weight should we use? As shown in the examples above, it may not
matter that much.
(The fact that we have never found an example where the effect is large does not
mean there are no such datasets, however.) Time weights play no role for
uncensored data.
Further issues to consider:
\begin{enumerate}
\item An important issue
that has not been sorted out is how to extend $1/G$ weighting
arguments to datasets that are subject to delayed entry, e.g., when
using age as the time scale instead of time since enrollment.
There is, in this case, no natural estimate available for $G$.
It is also not possible
for the \code{coxph} routine to reliably tell the difference between such
left truncation and simple time-dependent covariates or strata.
The default action of the routine is to use the safe choice of $n(t)$.
\item Consider setting a time ($y$) restriction using the \code{ymax}
option, based on careful thought about the proper
range of interest. This often has a larger practical effect than the
choice of time weight.
\item Safety. If using a Gehan-Wilcoxon type of weighting, the Peto-Wilcoxon
variant $S(t)$ would appear advantageous over the usual $n(t)$, particularly
if there is differential censoring for some subjects.
\item Equality vs. efficiency. On one hand we would like to treat each
data pair equally, but in our quest for ever sharper p-values we want to
be efficient. The first argues for $n(t)$ as the weight and the
second for using equal weights, since the variances of each ranking term are
nearly identical.
This is exactly the argument between the Gehan-Wilcoxon and
the log-rank tests.
\item For uncensored data $n$, $S$ and $S/G$ weights are all identical.
\end{enumerate}
Our current opinion is that the point of the concordance is to evaluate
the model in a more non-parametric way, so a log-rank type of focus on
ideal p-values is misplaced. This suggests using either $S$ or $S/G$ as
weights. Both give more prominence to the later time points as compared to
the default $n(t)$ choice, but if time limits have been thought through
carefully the difference between these three will almost always be ignorable.
We most definitely disagree with Uno's unstated assumption that
the $C$ statistic one would obtain with infinite follow-up and no
censoring is the proper target of estimation, and the ordinary concordance
is therefore biased. That target will never be attainable, and we would argue
that it is largely irrelevant even if it were.
Proportional hazards is never true over the long term, simply because it is
almost impossible to predict events that are a decade or more away; thus
the rank residuals shown above will eventually tend to 0.
The starting point should always be to think through exactly \emph{what} one
wants to estimate.
As stated by Yogi Berra: ``If you don't know where you are going, you'll end
up someplace else.''
\section{Variance}
The variance of the statistic is estimated in two ways. The first is to use
the variance of the equivalent Cox model score statistic.
As pointed out
by Watson, this estimate is both correct and efficient under $H_0: C=.5$,
and so it forms a valid test of $H_0$.
However, when the concordance is over .7 or so, this estimator systematically
overestimates the true variance.
An alternative that remains unbiased is the infinitesimal jackknife (IJ) variance
\begin{align*}
V &= \sum_{i=1}^n w_iU_i^2 \\
U_i &= \frac{\partial C}{\partial w_i}
\end{align*}
The concordance routine calculates an influence matrix $U$ with one row per
subject and columns that contain derivatives for the 5 individual counts:
concordant, discordant, tied on x, tied on y, and tied on xy pairs. From
this it is straightforward to derive the influence of each subject on
the concordance, or on any of the other possible
association measures such as the $\tau$-a mentioned earlier.
The IJ variance is printed by default but the PH variance is also returned;
an earlier \code{survConcordance} function only computed the PH variance.
The \code{concordance} function does not compute Kendall's $\tau$-a or
$\tau$-b, nor Goodman's gamma.
However, since all of the necessary components for those values are returned,
along with IJ influence for each, it can be used as the computational engine
for those measures and their variance, should someone wish to do so.
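For example, point estimates of $\gamma$ \eqref{gamma} and $\tau_b$
\eqref{tau.b} can be assembled directly from the returned counts (their
variances would in addition use the returned influence components); the
sketch below assumes the count vector is ordered as in the printed output.
<<othermeasures, eval=FALSE>>=
# Gamma and tau-b from the counts of a concordance fit; assumes the order
# concordant, discordant, tied.x, tied.y, tied.xy as shown by print()
cnt <- as.numeric(concordance(fit4)$count)
c(gamma = (cnt[1] - cnt[2]) / (cnt[1] + cnt[2]),
  tau.b = (cnt[1] - cnt[2]) / sqrt((cnt[1] + cnt[2] + cnt[3]) *
                                   (cnt[1] + cnt[2] + cnt[4])))
@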
The variance computation accounts for the influence of each subject on both
the numerator and denominator of $C$; this agrees with the parallel development
used by Newson \cite{Newson06} and implemented in Stata.
An alternate approach is to treat the total number of
comparable pairs as an ancillary statistic, in which case
${\rm var}(C) = {\rm var}(c-d)/(2m)^2$ where $m$ is the number of comparable
pairs ($n(n-1)/2$ for uncensored data).
Arguments about what aspects of a dataset
can or should be treated as ancillary are as old as statistics, e.g., treating
the margins of a 2x2 table as ancillary leads to Fisher's exact test.
In this case we anticipate that the differences will be quite small, but have
done no formal exploration.
\subsection{Multiple concordances}
\label{sec:multiple}
One useful property of using a jackknife variance estimate is that the
variance of the
difference in concordance between two separately fitted models is also easily
obtained. If $c_a$ and $c_b$ are the two concordance statistics and $U_{ia}$
and $U_{ib}$ the corresponding influence values,
the influence vector for $c_a-c_b$ is $U_a - U_b$.
(If subject $i$ increases $c_a$ by .03 and $c_b$ by .01, then he/she
raises the difference between them by .02.)
It is not necessary that the models be nested.
However, it is crucial that they be computed on the exact same set of
observations.
Here is a comparison of concordance values from previous models.
<<test>>=
ctest <- concordance(fit4, fit5, fit6)
ctest
# compare concordance values of fit4 and fit5
contr <- c(-1, 1, 0)
dtest <- contr %*% coef(ctest)
dvar <- contr %*% vcov(ctest) %*% contr
c(contrast=dtest, sd=sqrt(dvar), z=dtest/sqrt(dvar))
@
To do a similar comparison for models which do not have a \code{concordance}
method, use a set of dummy linear model fits as a container.
For instance, assume that \code{y} is continuous, and that we have 3 different
predicted values \code{phat1}, \code{phat2}, \code{phat3} from three different
machine learning models, and we want to compute and compare their concordance.
Then one could use the following code:
\begin{verbatim}
dummy1 <- lm(y ~ phat1)
dummy2 <- lm(y ~ phat2)
dummy3 <- lm(y ~ phat3)
cfit <- concordance(dummy1, dummy2, dummy3)
print(cfit)
etc.
\end{verbatim}
In order to produce correct answers, it is necessary that y, phat1, phat2,
and phat3 be results for exactly the same observations, in exactly the same
order. If one model has some subjects removed for missing values, say, then
those subjects cannot be present in any of the ML fits.
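A runnable toy version of the sketch above, using simulated data and three
hypothetical prediction vectors:
<<dummysim, eval=FALSE>>=
# Simulated outcome and three hypothetical prediction vectors
set.seed(1998)
n <- 200
y <- rnorm(n)
phat1 <- y + rnorm(n, sd = 1)     # a good predictor
phat2 <- y + rnorm(n, sd = 2)     # a noisier one
phat3 <- rnorm(n)                 # pure noise
cfit <- concordance(lm(y ~ phat1), lm(y ~ phat2), lm(y ~ phat3))
cfit
# test equality of the first two concordance values
contr <- c(1, -1, 0)
dtest <- contr %*% coef(cfit)
dvar  <- contr %*% vcov(cfit) %*% contr
c(contrast = dtest, sd = sqrt(dvar), z = dtest/sqrt(dvar))
@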
\subsection{Asymmetric confidence intervals}
The infinitesimal jackknife (IJ) has provided us with an honest estimate of
the standard deviation of $C$.
A natural confidence interval for the concordance is then
$C \pm z_{\alpha} sd(C)$.
As with confidence intervals for an ordinary proportion $\hat p$, however,
this simple interval can sometimes be inconsistent, giving CI endpoints that
lie outside of the legal range of $[0,1]$.
In the case of $\hat p$ there is a long history of methods to address this
issue, going back at least as far as the 1956 paper by
Anscombe \cite{Anscombe56}, but there is less
literature for the concordance
or AUC.
Newcombe \cite{Newcombe06} provides corrected methods, but with
the caveat that they ``have the drawback that on account of the large number
of outcomes they are computationally practicable only for very small sample
sizes.''
The tree based computations used in the \code{concordance} function might well
address the speed issue, but have not been implemented.
Here we pursue another avenue, which is to consider a transformation based
confidence interval, in much the same way as is done for confidence intervals
of a survival curve. That is, we use
$$
g^{-1}\left[g(C) \pm z \sigma(g(C)) \right]
$$
for some transformation function $g$. For survival curves, the $g$ functions
$\log(p)$, $\log(p/(1-p))$, $\log(-\log(1-p))$ and $\arcsin(p)$ have all been
found to be superior to the simple interval.
For the concordance,
consider the Fisher z-transform, widely used for the correlation coefficient
$r$
\begin{equation}
z = \frac{1}{2} \log\left(\frac{1+r}{1-r} \right) \label{Fisherz}
\end{equation}
Since Somers' $D$ and $r$ are targeted at similar concepts, we might hazard that
a similar transformation of Somers' $D$, which also ranges from $-1$ to 1,
would also be nearly variance stabilizing.
Since $D = 2C -1$ we have
\begin{align*}
z_c &= \frac{1}{2} \log\left(\frac{1+ (2C-1)}{1- (2C-1)} \right) \\
&= \frac{1}{2} \log\left(\frac{C}{1-C} \right) \label{FisherC}
\end{align*}
which we recognize, up to a factor of 1/2, as the inverse of the logistic
function used in \code{glm} models.
We can get the standard error of $z_c$ by retrieving the individual
dfbeta values and performing a transformation.
The dfbeta value is defined as $d_i =C - C_{-i}$ where the latter is the $C$
statistic omitting the $i$th observation.
<<Cztrans>>=
zci <- function(fit, p=.95) {
    ilogist  <- function(p) log(p/(1-p))  # inverse logistic
    logistic <- function(x) exp(x)/(1 + exp(x))
    temp <- concordance(fit, influence =1)
    cminus <- temp$concordance - temp$dfbeta # values of concordance, without i
    newd <- ilogist(temp$concordance) - ilogist(cminus) # dfbeta on new scale
    new.sd <- sqrt(sum(newd^2))
    old.sd <- sqrt(sum(temp$dfbeta^2))       # same as sqrt(temp$var)
    z <- qnorm((1-p)/2)
    old.ci <- temp$concordance + c(z, -z)*old.sd
    new.ci <- logistic(ilogist(temp$concordance) + c(z, -z)*new.sd)
    rbind(old = old.ci, new = new.ci)
}
round(zci(colonfit), 4)
@
The two intervals hardly differ, which is what we would expect for a value
far from 1.
As a second example, create a small dataset with a concordance that is
close to 1. As shown below, the z-transform shifts the CI towards zero,
as it should, but also avoids the out-of-bounds endpoint.
<<close>>=
set.seed(1953)
ytest <- matrix(rexp(20), ncol=2) %*% chol(matrix(c(1, .98, .98, 1), 2))
cor(ytest)
lfit <- lm(ytest[,1] ~ ytest[,2])
zci(lfit)
@
\section{Details}
This section documents a few details; most readers can skip it.
The usual convention for survival data is to assume that censored values
come after deaths, even if they are recorded on the same day.
This corresponds to the common case that a subject who is censored on day
200, say, was actually seen on that day.
That is, their survival is strictly greater than 200.
As a consequence, censoring weights $G$ actually use $G(t-)$ in the
code: if 10 subjects are censored at day 100, and these are the first
censorings in the study, then an event on day 100 should not be given a
larger weight.
(Both the Uno and Schemper papers ignore this detail.)
When using weights of $S(t)$ the program actually uses a weight of $nS(t-)$
where $n$ is the number of observations in the dataset.
The reason is that for a stratified model the weighted number of concordant,
discordant and tied pairs is calculated separately for each stratum, and
then added together.
If one stratum were much smaller or larger than the others we want to
preserve this fact in the sum.
\bibliography{refer}
\bibliographystyle{plain}
\end{document}