\documentclass[11pt]{article}
\usepackage{Sweave}
\usepackage{amsmath}
\addtolength{\textwidth}{1in}
\addtolength{\oddsidemargin}{-.5in}
\setlength{\evensidemargin}{\oddsidemargin}
%\VignetteIndexEntry{Concordance}

\SweaveOpts{keep.source=TRUE, fig=FALSE}
% Ross Ihaka suggestions
\DefineVerbatimEnvironment{Sinput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}

% I had been putting figures in the figures/ directory, but the standard
% R build script does not copy it and then R CMD check fails
\SweaveOpts{prefix.string=compete,width=6,height=4}
\newcommand{\myfig}[1]{\includegraphics[height=!, width=\textwidth]
  {compete-#1.pdf}}
\setkeys{Gin}{width=\textwidth}
<<echo=FALSE>>=
options(continue=" ", width=60)
options(SweaveHooks=list(fig=function() par(mar=c(4.1, 4.1, .3, 1.1))))
pdf.options(pointsize=10) # text in graphs about the same size as regular text
options(contrasts=c("contr.treatment", "contr.poly")) # ensure the defaults

library(survival)
library(splines)
@

\title{Concordance}
\author{Terry Therneau, Elizabeth Atkinson}
\newcommand{\code}[1]{\texttt{#1}}

\begin{document}
\maketitle

\section{The concordance statistic}
\subsection{Overview}
Let $(x_i, y_i)$ be paired data values, say two measurements on a set of
subjects, or observed data $y$ and predictions $x$ for those values from a
statistical model.
A pair of observations ($i$, $j$) is considered concordant if the prediction
and the data go in the same direction, i.e.,
$(x_i > x_j, y_i > y_j)$ or $(x_i < x_j, y_i < y_j)$.
The concordance statistic $C$ is defined as the fraction of concordant pairs,
and is an estimate of $P(x_i > x_j | y_i > y_j)$, or equivalently
$P(y_i > y_j | x_i > x_j)$.
The first form might make more sense if, for instance, one took a list of
elected officials and looked back at local newspapers to see how many of
the election contests had been predicted correctly
(the probability that the prediction is correct, given the outcome).
The second conforms to our thought process if we started with the predictions,
before election day, and then counted up the fraction correct after the
contest. When looking at a statistical model's predictions, either form can
be argued; they give the same result.

One wrinkle is what to do with ties in either $x$ or $y$. Such pairs
can be ignored in the count (treated as incomparable), treated as discordant,
or given a score of 1/2.
Let $c, d, t_x, t_y$ and $t_{xy}$ be counts of the pairs that are concordant,
discordant, tied on the predictor $x$ (but not $y$),
tied on $y$ (but not $x$), and tied on both. Then
\begin{align}
  \tau_a &= \frac{c-d}{c+d + t_x + t_y + t_{xy}} \label{tau.a} \\
  \tau_b &= \frac{c-d}{\sqrt{(c+d+t_x)(c + d + t_y)}} \label{tau.b} \\
  \gamma &= \frac{c-d}{c+d} \label{gamma} \\
  D &= \frac{c-d}{c + d + t_x} \label{somer} \\
  C &= (D+1)/2 = \frac{c+ t_x/2}{c + d + t_x} \label{dC}
\end{align}

\begin{itemize}
\item Kendall's tau-a \eqref{tau.a} is the most conservative; ties shrink
the value towards zero.
\item The Goodman-Kruskal $\gamma$ statistic \eqref{gamma} ignores ties in
either $y$ or $x$.
\item Somers' $D$ \eqref{somer} treats ties in $y$ as incomparable;
pairs that are tied in $x$ (but not $y$) score as 1/2, as we can see from
equation \eqref{dC}, which is the concordance statistic.
\item Kendall's tau-b \eqref{tau.b} can be viewed as a version of Somers' $D$
that is symmetric in $x$ and $y$.
\end{itemize}

The first four statistics range from $-1$ to $1$, similar to the correlation
coefficient $r$.
The concordance \eqref{dC} ranges from 0 to 1, which matches the scale of
a probability.

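All five measures are simple functions of the pair counts, so each can be
computed by hand from a \code{concordance} fit. The sketch below assumes
that the \code{count} component of the result is ordered as concordant,
discordant, tied on $x$, tied on $y$, and tied on both, as in current
versions of the function.

<<assocmeasures, eval=FALSE>>=
# All five association measures from the pair counts of a concordance fit;
# assumes count is ordered as concordant, discordant, tied.x, tied.y, tied.xy
ct <- as.numeric(concordance(y2 ~ x1, data = anscombe)$count)
c(tau.a = (ct[1] - ct[2]) / sum(ct),
  tau.b = (ct[1] - ct[2]) / sqrt(sum(ct[1:3]) * sum(ct[c(1,2,4)])),
  gamma = (ct[1] - ct[2]) / (ct[1] + ct[2]),
  D     = (ct[1] - ct[2]) / sum(ct[1:3]),
  C     = (ct[1] + ct[3]/2) / sum(ct[1:3]))
@
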
Why is $C$ defined using Somers' $D$ rather than one of the other three?

\begin{itemize}
\item If $y$ is a 0/1 variable, then $C$ = AUROC, the area under the
receiver operating characteristic curve, which is well established for
binary outcomes.
(Proving this simple theorem is harder than it looks, but the result is
well known.)
\item For survival data, this choice will agree with Harrell's $C$.
More importantly, as we will see below, it has strong connections
to standard tests for equality of survival curves.
\end{itemize}

The concordance has a natural interpretation as an experiment: present pairs
of subjects one at a time to the physician, statistical model, or some other
oracle, and count the number of correct
predictions. Pairs that have the same outcome $y_i=y_j$ are not put forward
for scoring, since they do not help discriminate a good oracle from a bad one.
If the oracle cannot decide then a random choice is made.
This leads to $c + t_x/2$ correct selections out of $c + d + t_x$ choices.

This hypothetical experiment gives a baseline insight into the concordance.
A value of 1/2 corresponds to using a random guess for each subject.
Values of .5--.55 are not very impressive, since
the ordering for some pairs of subjects will be obvious, and someone with
almost no medical knowledge could do
that well by marking these easy pairs and using a coin flip for the rest.
Values of less than 1/2 are possible --- some stock market analysts come to
mind.

\subsection{Simple examples}
The \code{concordance} function accepts simple data or models as input.
For the latter, it assesses the concordance between $y$ and the model
prediction $\hat y$. Here is a set of simple examples.

\subsubsection*{Direct}

<<examples1a>>=
# x1 and y2 are both continuous variables
concordance(y2 ~ x1, data= anscombe)
@

\subsubsection*{Logistic regression}

<<examples1b>>=
# Fisher's iris data
fit1 <- glm(Species=="versicolor" ~ ., family=binomial, data=iris)
concordance(fit1)  # equivalent to an AUC
@

\subsubsection*{Linear regression}

<<examples1c>>=
# Anscombe data (all variables are continuous)
fit2 <- lm(y2 ~ x1 + x4, data= anscombe)
concordance(fit2)  # C
sqrt(summary(fit2)$r.squared)  # R
@

\subsubsection*{Parametric survival}
<<examples1d>>=
# parametric survival
fit3 <- survreg(Surv(time, status) ~ karno + age + trt, data=veteran)
concordance(fit3)
@

\subsubsection*{Cox regression}

<<examples1e>>=
# 3 Cox models
fit4 <- coxph(Surv(time, status) ~ karno + age + trt, data=veteran)
fit5 <- update(fit4, . ~ . + celltype)
fit6 <- update(fit5, . ~ . + prior)
ctest <- concordance(fit4, fit5, fit6)
ctest
@

As shown in the last example, the concordance for multiple fits can
be obtained from a single call. The variance-covariance
matrix for all three concordance values is available using
\code{vcov(ctest)};
this is used in Section \ref{sec:multiple} to formally test the equality of
two concordance values.
The above also shows that the addition of another variable to a fitted
model can decrease the concordance.
The larger model will have higher correlation between the linear predictor
$X \beta$ and the response $y$, by definition, but this does not guarantee
a greater association between ${\rm rank}(X \beta)$ and ${\rm rank}(y)$.

\subsection{Efficient computation}
For continuous data without ties, the concordance involves a comparison
between all $n(n-1)/2$ pairs.
This $O(n^2)$ computation will become painfully slow for large datasets.
To improve this, first order the data by increasing $y$ values, leading to
\begin{align}
  c-d &= \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n {\rm sign}(y_i-y_j)
       {\rm sign}(x_i-x_j) \label{cd1} \\
      &= \sum_{i=n}^1 \left[ \sum_{y_j > y_i} {\rm sign}(x_i - x_j) \right]
       \label{cd2}
\end{align}

\begin{figure}
<<balance, fig=TRUE, echo=FALSE>>=
# The balance figure for the concordance document
btree <- function(n) {
    # return the order in which elements 1:n fill a balanced binary tree
    tfun <- function(n, id, power) {
        if (n==1) id
        else if (n==2) c(2*id, id)
        else if (n==3) c(2*id, id, 2*id+1)
        else {
            nleft <- if (n== power*2) power else min(power-1, n-power/2)
            c(tfun(nleft, 2*id, power/2), id,
              tfun(n-(nleft+1), 2*id +1, power/2))
        }
    }
    tfun(n, 1, 2^(floor(logb(n-1,2))))
}

temp <- c(1,2,6,8, 9,12,14, 18, 19, 21, 23, 24, 27)
indx <- btree(13)

# x and y coordinates for the 15 possible nodes of a 4-level tree
xpos <- 1:15
xpos[4:7] <- tapply(xpos[8:15], rep(1:4, each=2), mean)
xpos[2:3] <- tapply(xpos[4:7], rep(1:2, each=2), mean)
xpos[1] <- mean(xpos[2:3])
ypos <- rep(4:1, c(1,2,4,8))

oldpar <- par(mar=c(1,1,1,1))

plot(xpos, ypos, type='n', xaxt='n', yaxt='n', bty='n',
     xlab="", ylab="")
temp2 <- c(13,7,5,3,3,3,1,1,1,1,1,1,1)
#text(xpos[indx], ypos[indx], paste(temp, " (", temp2[indx], ")", sep=''))
text(xpos[indx], ypos[indx], as.character(temp))

# draw the segments that connect each parent to its children
delta <- .1
for (i in 1:6) {
    segments(xpos[i]-delta, ypos[i]-delta,
             xpos[2*i]+delta, ypos[2*i]+delta)
    segments(xpos[i]+delta, ypos[i]-delta,
             xpos[2*i+1]-delta, ypos[2*i+1] +delta)
}
par(oldpar)
@
\caption{A balanced binary tree with 13 elements.}
\label{btree}
\end{figure}

The first equation is the simple definition of concordance as a sum over all
$n^2$ possible pairs, where ${\rm sign}$ is the R \code{sign} function.
Equation \eqref{cd2} makes the obvious simplification of counting each pair
only once, by taking advantage of the fact that $y$ is sorted.
The key coding insight is to store the $x$ values as part of a balanced
binary tree; an example of such a tree is shown in Figure \ref{btree}.
The basic algorithm is to:
\begin{enumerate}
\item Create a balanced binary tree for all $n$ $x_i$ values. This can be
done in $O(n \log_2(n))$ steps. The final tree will have a node for each
unique $x$ value. Each node contains the value, along with counts for the
number of observations at that value, for left hand children, and
for right hand children. Initialize all the counts to 0.
\item Go through the data from largest $y$ to smallest $y$.
\begin{enumerate}
\item For this new $(x_i, y_i)$ pair,
count the number of $x_j$ values in the
tree that are smaller or larger than this element, and increment
the overall counts of concordant, discordant, and tied. This can be
done in $\log_2(n)$ steps (proceed down from the top).
\item Add this observation to the tree.
Each addition will update the count for its node, then
``walk up'' the tree updating child counts of the parent, grandparent,
etc.
\end{enumerate}
\end{enumerate}
If there are tied $y$ values, do all the counts for a set of ties first, and
then add their $x$ values to the tree.

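For reference, equation \eqref{cd1} can also be evaluated directly. The
naive $O(n^2)$ sketch below is useful only as a cross-check of the counts
returned by \code{concordance}; it is not the tree algorithm the package
actually uses.

<<naivecd, eval=FALSE>>=
# Naive O(n^2) evaluation of the double-sum definition of c - d;
# each pair is visited twice, hence the division by 2
naivecd <- function(x, y) {
    total <- 0
    for (i in seq_along(x))
        for (j in seq_along(x))
            total <- total + sign(y[i] - y[j]) * sign(x[i] - x[j])
    total/2
}
with(anscombe, naivecd(x1, y2))  # compare with c - d from concordance()
@
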
\section{Concordance and survival data}

As stated above, for continuous data without ties, the concordance involves
a comparison between all $n(n-1)/2$ pairs.
If there are tied $y$ values, however, those pairs are ignored.
For example, in the iris data above $C= 4129/(4129 + 871)$; the
6175 pairs with tied response values play no role.
For survival data, this set of incomparables is extended to include those
pairs for which the time ordering is ambiguous.
For instance, assume that $y_i$ is censored at time 10 and $y_j$ is an
event (or censor) at time 20.
Subject $i$ may or may not survive longer than subject $j$,
and so it is not possible to tell if a rule has ranked them correctly
or not.
Note that if $y_i$ is censored at time
10 and $y_j$ is an event at time 10 then $y_i > y_j$.
This same convention is followed for all the survival models, and
agrees with common clinical data: a patient censored on day 100
was observed to be still alive on day 100, so their death time must be
strictly greater than 100.

A second, smaller issue for survival data is to recognize that we
desire to assess the concordance between an observed survival time $y_i$
and a predicted survival time $\hat y_i$.
This can be done without creating an explicit survival curve from the
model: for a \code{survreg} fit, for instance,
$\eta_1 > \eta_2$ implies $S(t; x_1) > S(t; x_2)$ for all $t$, where
$\eta_1$ and $\eta_2$ are the respective linear predictors $x_1\beta$ and
$x_2\beta$. Not creating the survival curves saves considerable computation
time.
For a Cox model $\eta_1 > \eta_2$ implies $S(t; x_1) < S(t; x_2)$; the order is
reversed. This is because the Cox model is a hazard model, and a higher hazard
implies shorter survival.
When a \code{coxph} object is used as the argument to
\code{concordance} the reversal is handled automatically.
However, when a concordance is done ``by hand'' the user needs to be aware
of this, as seen in the example below.
In this case the concordance routine does not
know that the prediction came from a Cox model, resulting in a swap of the
counts for concordant and discordant pairs and $C < .5$.
The user is responsible for adding the \code{reverse = TRUE} argument in
this case.

<<>>=
# Concordance for a coxph object
concordance(fit4)

# Concordance using predictions from a Cox model
concordance(Surv(time, status) ~ predict(fit4), data = veteran, reverse = TRUE)
@

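For contrast, here is the same call with the \code{reverse} argument
omitted: the concordant and discordant counts trade places, and the
reported value is $1-C$.

<<noreverse>>=
# Forgetting reverse = TRUE swaps the counts: the result is 1 - C
concordance(Surv(time, status) ~ predict(fit4), data = veteran)
@
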
\subsubsection*{Stratified models}

Stratified models present a further variation: if observations $i$ and
$j$ are in different strata, the survival curves for those strata might
cross; $S(t; x_i)$ and $S(t; x_j)$ no longer have a simple ordering.
A solution is to use a stratified concordance, which compares all pairs
\emph{within} each stratum, and then adds up the results.
In the example below there is a separate count for each stratum; the
final concordance is based on the column sums.
The same issue, and solution, applies to stratified \code{survreg} models.
(The fact that strata names are not retained as labels for the counts
matrix is a deficiency in the routine.)

<<veteran2>>=
fit4b <- coxph(formula = Surv(time, status) ~ karno + age + trt +
               strata(celltype), data = veteran)
concordance(fit4b)
table(veteran$celltype)
@

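The overall value can be verified by hand from the column sums. A short
sketch, which assumes that the \code{count} component of the fit is a matrix
with one row per stratum and columns ordered as in the printout above:

<<stratsum, eval=FALSE>>=
# Reconstruct C from the pooled pair counts: C = (c + tx/2)/(c + d + tx)
csum <- colSums(concordance(fit4b)$count)
(csum[1] + csum[3]/2) / sum(csum[1:3])
@
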
\subsection{Time-weighted concordance}
Look again at equation \eqref{cd2}, rewriting it for survival with $t_i$ as the
response in order to more closely match standard notation for survival.
Watson and Therneau \cite{Watson15} show that this can be further rewritten
as
\begin{align}
  c-d &= \sum_{i=1}^n \delta_i \left[ \sum_{t_j > t_i} {\rm sign}(x_i- x_j) \right]
  \nonumber \\
      &= \sum_{i=1}^n \delta_i \left[\sum_{t_j \ge t_i} {\rm sign}(x_i - x_j) \right]
  \label{cd4} \\
      &= 2 \sum_i \delta_i n(t_i) \left[ r_i(t_i) - \overline r \right] \label{cscore}
\end{align}
Here $\delta$ is 0 = censored, 1 = uncensored; if $t_i$ is censored then all
other observations with $t_j \ge t_i$ are not comparable.
Equation \eqref{cd4} rewrites the inner term as a sum over all subjects in the
risk set at time $t_i$, a familiar concept in survival models.
In equation \eqref{cscore} the inner term has been rewritten with
$n(t)$ as the number of subjects still at risk at time $t$, and
$r_i(t)$ the rank of $x_i$ among all those still at risk at time $t$,
where ranks are defined such that $0 \le r \le 1$, and $\overline r$ the
mean of those ranks.
(Proofs for \eqref{cd4} and \eqref{cscore} have been omitted.)
It turns out that
equation \eqref{cscore} is exactly the score statistic for a Cox model
with a single time-dependent covariate $n(t) r(t)$.
One immediate consequence of this connection is a straightforward definition
of concordance for a risk score containing time dependent covariates. Since
the Cox model score statistic is well defined for time dependent covariates,
the concordance is also well defined for a time-dependent risk score: at each
event time the current risk score of the subject who failed is compared to the
current (time dependent) scores of all those still at risk.
A deeper consequence of the equivalence between the concordance and
the Cox model is a link to alternate weightings of the
risk scores.
If the original Cox model has a single 0/1 treatment covariate then
equation \eqref{cscore} exactly matches the numerator of the
Gehan-Wilcoxon statistic; replacing $n(t)$ with weights of 1
will yield the log-rank statistic.

There is a deep literature with respect to the ``best'' weight for survival
tests,
and we can apply the same historical arguments
to the concordance as well. We will point out four of interest:

\begin{itemize}
\item Peto and Peto \cite{Peto72} point out that
$n(t) \approx n(0)S(t-)G(t-)$, where $S$
is the survival distribution and $G$ the censoring distribution.
They argue that $S(t-)$ would be a better weight since $G$ may have
features that are irrelevant to the question being tested.
For a particular dataset, Prentice \cite{Prentice79} later showed
that these concerns were indeed justified, and
most software now uses the Peto-Wilcoxon variant.
\item Tarone and Ware point out that weights of $n(t)$ and 1 give the
Gehan-Wilcoxon and log-rank tests, respectively, and suggest $\sqrt{n(t)}$
as an intermediate value.
\item Schemper et al.\ \cite{Schemper09} argue for
a weight of $S(t)/G(t)$ in the Cox model.
When proportional hazards does not hold the coefficient from the Cox model
is an ``average'' hazard ratio, and they show that using $S/G$ leads to
a value that remains interpretable in terms of an underlying population
model.
The same argument would also apply to the concordance, since our
goal is an ``assumption free'' assessment of association.
\item Uno et al.\ \cite{Uno11} recommend the use of $n/G^2$ as a weight based
on a consistency argument. If we assume that the concordance value that
would be obtained after full follow-up of all subjects (no censoring) is
the ``right'' one, and proportional hazards does not hold, then the
standard concordance will not consistently estimate this target quantity
when there is censoring.
\end{itemize}

In practice, weights need to be based on left continuous versions of the
survival curves $S(t-)$ and $G(t-)$, and extra care needs to be exercised
in the computation of $G$. Consider the \code{aml} dataset as an
example; the first few lines of the relevant survival curve are shown below.

<<amlexample>>=
afit <- survfit(Surv(time, status) ~1, aml, se = FALSE)
summary(afit, times = afit$time[1:6], censor = TRUE)
@

The first censoring is at time 13, when there is both a death and a censor.
For the event at time 13, however, no one has yet been censored, and no
``censoring correction'' should be done at this point.
Weights at any time $t$, like the number at risk at time $t$, must be the
values in effect just \emph{before} the event. For a weight based
on $S$, the proper value at time 13 would be .739, dropping to .696 just
after time 13.
The value of $G(t-)$ at 13 will be 1.
Just after time 13 it will be 15/16, not the value of 16/17
that would be obtained from \code{survfit(Surv(time, 1-status) ~1)}.
Censors happen after deaths, so there are only 16 at risk for the censoring
event.

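One way to obtain a censoring curve that respects this ordering is to nudge
the censoring times to just after the deaths before reversing the status
variable. This is a sketch only; the \code{concordance} function makes the
equivalent adjustment internally.

<<Gminus, eval=FALSE>>=
# Treat censors as occurring just after deaths at the same time point:
# shift the censoring times by half the smallest gap between event times
eps <- min(diff(sort(unique(aml$time))))/2
gfit <- survfit(Surv(time + ifelse(status==0, eps, 0), 1-status) ~ 1, aml)
summary(gfit, times = c(13, 13 + eps))  # G = 1 at 13, 15/16 just after
@
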
Based on the Peto and Peto argument that $n(t) \approx n(0)S(t-)G(t-)$,
we might expect the Schemper and Uno weights to be similar.
In testing, we discovered to our surprise that if the computations are
done carefully as per the above, then $n(t) = n(0)S(t-)G(t-)$ exactly.
As a consequence, the Schemper and Uno weights are also identical.

The \code{timewt} option in the \code{concordance} function allows
you to modify the weights for the concordance.
The options are \code{n, S, S/G, n/G2} and \code{I},
the last of which gives equal weight to each event time; this would
correspond to a log-rank test.
(We do not recommend this, nor are we aware of any literature which does so,
but include it for completeness.)
For non-survival data all the weights become identical.

The figure below shows the first three weights for the colon cancer
dataset.
The data are from a clinical trial of 929 subjects, with 3 years of
enrollment followed by 5 further years of follow-up.
Since almost no one is lost to follow-up in the first 5 years, the choices
are nearly identical over that time.
From 5 to 8 years, $S(t)$ continues its steady decline, $n(t)$ plummets due
to administrative censoring, and $S/G$ explodes.
Even with these changes in weights, the concordance values are all very
similar.

<<tmwt, fig=TRUE>>=
colonfit <- coxph(Surv(time, status) ~ rx + nodes + extent, data = colon,
                  subset = (etype == 2)) # death only
cord1 <- concordance(colonfit, timewt="n", ranks=TRUE)
cord2 <- concordance(colonfit, timewt="S", ranks=TRUE)
cord3 <- concordance(colonfit, timewt="S/G", ranks=TRUE)
cord4 <- concordance(colonfit, timewt="n/G2", ranks=TRUE)
temp <- c("n(t)"= coef(cord1), S=coef(cord2), "S/G"= coef(cord3),
          "n/G2"= coef(cord4))
round(temp,5) # 4 different concordance estimates

# Plot the weights over time using the first 3 approaches
matplot(cord1$ranks$time/365.25, cbind(cord1$ranks$timewt,
                                       cord2$ranks$timewt,
                                       cord3$ranks$timewt),
        type= "l", lwd=2, col=c(1,2,4),
        xlab="Years since enrollment", ylab="Weight")
legend(1, 3000, c("n(t)", "nS(t-)", "nS(t-)/G(t-)"), lwd=2,
       col=c(1,2,4), lty=1:3, bty="n")

# Note that n/G2 and S/G are identical
all.equal(cord3$ranks$timewt, cord4$ranks$timewt)
@

\subsubsection*{When do weights matter?}

When might the concordance differ between weight functions?
In order for $S/G$ weights to show an important difference,
two conditions must hold.

\begin{itemize}
\item First, sufficient censoring that the two weights differ for a
reasonable fraction of the data, that is, $G(t)$ is low and $S(t)$ has not
flattened (deaths are still occurring).
\item Second, per the arguments in Schemper and in Uno, the potential
presence of non-proportional hazards in the fit.
\end{itemize}

\begin{figure}
<<manycurve, fig=TRUE, echo=FALSE>>=
duo <- function(time, status, name, conf.int=FALSE) {
    sfit <- survfit(Surv(time, status) ~1)
    gfit <- survfit(Surv(time, max(status)-status) ~1)
    plot(sfit, conf.int=conf.int, xlab=name, lwd=2)
    lines(gfit, col=2, lwd=2, conf.int = conf.int)
}
oldpar <- par(mfrow=c(3,3), mar=c(5,5,1,1))
with(subset(colon, etype==1), duo(time/365.25, status, "NCCTG colon cancer"))
duo(flchain$futime/365.25, flchain$death, "Free light chain")
duo(kidney$time/12, kidney$status, "McGilchrist kidney")
duo(lung$time/365.25, lung$status, "advanced lung cancer")
duo(mgus2$futime/12, mgus2$death, "MGUS")
duo(nafld1$futime/365.25, nafld1$status, "NAFLD")

duo(pbc$time/365.25, pmin(pbc$status,1), "PBC")
with(rotterdam, duo(pmin(rtime, dtime)/365.25, pmax(recur, death), "Rotterdam"))

nfit <- coxph(Surv(futime/365.25, status) ~ age + male, data = nafld1)
znfit <- cox.zph(nfit, transform = 'identity')
plot(znfit[1], resid = FALSE)
par(oldpar)
@
\caption{Survival (black) and censoring (red) curves for 8 datasets found
in the survival package. The final panel shows a proportional hazards
evaluation for the age variable, in a fit of age + male to the NAFLD
data.}
\label{surv8}
\end{figure}

Figure \ref{surv8} shows survival and censoring curves for 8 different
datasets found in the survival package.
Based on these, the dataset with the greatest potential for an $S/G$
difference is the NAFLD data; the dataset also has some early
non-proportionality for age.
The code below shows the calculation. Surprisingly, the four weightings
still yield very similar concordance values; the Harrell ($n$) and
Uno ($n/G^2$) weightings differ only in the second decimal place.

<<nafld1>>=
nfit <- coxph(Surv(futime/365.25, status) ~ age + male, data = nafld1)
ncord1 <- concordance(nfit, timewt = "n")
ncord2 <- concordance(nfit, timewt = "S")
ncord3 <- concordance(nfit, timewt = "S/G")
ncord4 <- concordance(nfit, timewt = "n/G2")
temp <- c(n = coef(ncord1), S = coef(ncord2),
          "S/G" = coef(ncord3), "n/G2" = coef(ncord4))
round(temp,6)
@

The \code{concordance} function provides a \code{ranks=TRUE} argument which
can be used for further exploration of the weights.
If set, the output will include a
data frame that contains one row for each event: the time
point, the relative rank in the risk set ($r_i - \overline r$), the case
weight for the observation, and the time weight at that time point.
The relative ranks are comparable to Schoenfeld residuals,
and their weighted sum will equal $c - d$.
We can plot them over time and apply a smooth.
Figure \ref{vetfit}, for the veteran dataset, shows a
precipitous drop to 0.
The Veteran's cancer data is perhaps an extreme of this pattern: due to the
very rapid progression of disease, baseline measurements soon lose their
meaning.

<<rankresid, eval=FALSE>>=
# code for Figure 3
# pick a dataset with a smaller number of points, and non PH
vfit <- coxph(Surv(time/365.25, status) ~ age + karno, data = veteran)
temp <- concordance(vfit, ranks=TRUE)$ranks
# Two outliers at 999 days = 2.7 years stretch the axis too far
plot(rank ~ time, data=temp, xlim=c(0,1.6),
     xlab="Years", ylab="Rank residual")
lines(lowess(temp$time, temp$rank, iter=1), lwd=2, col=2)
abline(0, 0, lty=3)
@

\begin{figure}
<<rankresid2, echo=FALSE, fig=TRUE>>=
<<rankresid>>
@
\caption{Schoenfeld residuals for the scaled ranks, from a fit to
the veteran dataset.}
\label{vetfit}
\end{figure}

\subsection{Restricted time range}
A first consideration in any model assessment is to stop and
ponder what a ``good fit'' means.
This is far too often ignored in the rush to computation.
Two wonderful papers in this regard are Korn and Simon \cite{Korn90} and
Altman and Royston \cite{Altman00}.
The title of the second makes its purpose clear: ``What do we mean by
validating a prognostic model?''

Korn and Simon, for instance,
point out the importance of restricting the range of comparison, a point
echoed by Altman and Royston.
Though a risk model can be used for long-range prediction, in actual patient
practice this is very often not the need; the model should be
evaluated over the time range in which it will actually be used.
For example, for a patient who returns every 4 years, a 10 year prediction
will never actually be put to the test; predictions will be updated every 4
years based on the most current information.
If the model is to be used for ongoing patient management, we will want to
restrict attention to a 4 year horizon when evaluating its utility.
(A 10 year Framingham heart risk might still be useful in \emph{convincing} a
subject to stop smoking, however.)

We sometimes receive push-back on this, with the argument that one should use
all the data.
We disagree, and think that the target of the validation is critical.
Predictions become less
accurate the further out we reach in time; this is true for everything from
weather forecasts to the stock market, and survival models are not immune.
Reaching too far into the future may return an overly pessimistic value of
$C$.

Another reason for using an upper limit is that $1/G$ can become
unstable as the sample size becomes small (large jumps in the KM), or
unreasonably large as $G$ approaches 0. Most authors suggest an upper limit
for this purely technical reason.

<<rotterdam, fig=TRUE, eval=FALSE>>=
# recurrence free survival = earlier of recurrence and death
rdata <- rotterdam
rdata$rfs <- with(rdata, ifelse(recur==1, 1, death))
rdata$rfstime <- with(rdata, ifelse(recur==1, rtime, dtime))/ 365.25

rfit <- coxph(Surv(rfstime, rfs) ~ age + meno + grade + pspline(nodes), rdata)
ctemp <- matrix(0, 100, 2) # concordance and std err
ctime <- seq(.1, 10, length=100)
for (i in 1:100) {
    temp <- concordance(rfit, ymax=ctime[i])
    ctemp[i,] <- c(temp$concordance, sqrt(temp$var))
}
yhat <- ctemp[,1] + outer(ctemp[,2], c(0, -1.96, 1.96), '*')
matplot(ctime, yhat, type='l', lty=c(1,2,2), lwd=c(2,1,1), col=1,
        xlab="Upper cutoff", ylab="C", ylim=c(0.5,1))
@

An argument could also be made for a lower limit, though this would be
uncommon for censored data. Many laboratory values, for instance, treat all
results less than some threshold as identical.
However, the ability to implement a lower limit correctly is constrained by
censoring. Say, for instance, that there were values of 5, 8+, and 9, and a
lower limit of 10 were chosen. The approach used for non-censored data
is to treat 5 and 9 as tied values, but this logic does not
correctly extend to the censored
value. This issue and possible solutions will be discussed more fully in
the external validation vignette.

\subsection{Synthetic $C$}
As another method of addressing censoring,
G\"{o}nen and Heller \cite{Goen05} show that if the statistical model is
correct, and if proportional hazards holds, then for any pair of
covariate vectors
$$
P(y_i > y_j) = \frac{1}{1 + e^{\eta_j - \eta_i}}.
$$
They then order the $\eta$ values from a fitted model, and take an average
over all $n(n-1)/2$ ordered pairs.
The authors argue that this is an estimate that is independent of censoring,
and therefore preferable to Harrell's $C$. (The estimate can be obtained by
using the \code{royston} function.)

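The averaging step is easy to write down directly. Below is a minimal sketch
of the estimate as described above, taking $\eta_i \ge \eta_j$ within each
ordered pair; the \code{royston} function is the supported route.

<<synthC, eval=FALSE>>=
# Average the modeled P(y_i > y_j) over all ordered pairs with eta_i >= eta_j
eta <- sort(predict(fit4), decreasing = TRUE)
ij <- which(upper.tri(diag(length(eta))), arr.ind = TRUE)
mean(plogis(eta[ij[,1]] - eta[ij[,2]]))  # = 1/(1 + exp(eta_j - eta_i))
@
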
The biggest problem with this approach is that it gives an estimate of
the concordance under the assumption that the model is \emph{exactly
correct}.
Our goal, rather, is to assess how \emph{well} the model performs,
for our needs, knowing that it will be imperfect.
The G\"{o}nen and Heller formula answers a question that we did not ask,
over a time range $(0, \infty)$ which is not of interest.
The calculation is also $O(n^2)$, so it will be slow for large sample sizes.

\subsection{Picking a weight}

So, which weight should we use? As shown in the examples above, it may not
matter that much.
(The fact that we have never found an example where the effect is large does
not mean there are no such datasets, however.) Time weights play no role for
uncensored data.

Further issues to consider:

\begin{enumerate}
\item An important issue
that has not been sorted out is how to extend $1/G$ weighting
arguments to datasets that are subject to delayed entry, e.g., when
using age as the time scale instead of time since enrollment.
There is, in this case, no natural estimate available for $G$.
It is also not possible
for the \code{coxph} routine to reliably tell the difference between such
left truncation and simple time-dependent covariates or strata.
The default action of the routine is to use the safe choice of $n(t)$.
\item Consider setting a time ($y$) restriction using the \code{ymax}
option, based on careful thought about the proper
range of interest; see the short example after this list. This often has a
larger practical effect than the choice of time weight.
\item Safety. If using the usual Gehan-Wilcoxon weights of $n(t)$,
the Peto-Wilcoxon variant $S(t)$ would appear advantageous, particularly
if there is differential censoring for some subjects.
\item Equality vs.\ efficiency. On one hand we would like to treat each
data pair equally, but in our quest for ever sharper p-values we want to
be efficient. The first argues for $n(t)$ as the weight and the
second for using equal weights, since the variances of each ranking term are
nearly identical.
This is exactly the argument between the Gehan-Wilcoxon and
the log-rank tests.
\item For uncensored data the $n$, $S$ and $S/G$ weights are all identical.
\end{enumerate}

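As a small illustration of the \code{ymax} option mentioned in item 2, here
is the colon cancer fit restricted to the first five years; pairs whose
ordering is resolved only after that point are treated as incomparable.

<<ymaxdemo>>=
# Restrict comparisons to the first 5 years of follow-up (time is in days)
c(full = coef(concordance(colonfit)),
  first5 = coef(concordance(colonfit, ymax = 5*365.25)))
@
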
Our current opinion is that the point of the concordance is to evaluate
the model in a more non-parametric way, so a log-rank type of focus on
ideal p-values is misplaced. This suggests using either $S$ or $S/G$ as
weights. Both give more prominence to the later time points as compared to
the default $n(t)$ choice, but if time limits have been thought through
carefully the difference between these three will almost always be ignorable.

We most definitely disagree with Uno's unstated assumption that
the $C$ statistic one would obtain with infinite follow-up and no
censoring is the proper target of estimation, and that the ordinary
concordance is therefore biased. That target will never be attainable, and
we would argue that it is largely irrelevant even if it were.
Proportional hazards is never true over the long term, simply because it is
almost impossible to predict events that are a decade or more away, and thus
the rank residuals shown above will eventually tend to 0.
The starting point should always be to think through exactly \emph{what} one
wants to estimate.
As stated by Yogi Berra, ``If you don't know where you are going, you'll end
up someplace else.''

\section{Variance}
The variance of the statistic is estimated in two ways. The first is to use
the variance of the equivalent Cox model score statistic.
As pointed out
by Watson, this estimate is both correct and efficient under $H_0: C=.5$,
and so it forms a valid test of $H_0$.
However, when the concordance is over .7 or so, this estimator
systematically overestimates the true variance.
An alternative that remains unbiased is the infinitesimal jackknife (IJ)
variance
\begin{align*}
  V &= \sum_{i=1}^n w_i U_i^2 \\
  U_i &= \frac{\partial C}{\partial w_i}
\end{align*}
The concordance routine calculates an influence matrix $U$ with one row per
subject and columns that contain derivatives for the 5 individual counts:
concordant, discordant, tied on $x$, tied on $y$, and tied on $xy$ pairs.
From this it is straightforward to derive the influence of each subject on
the concordance, or on any of the other
possible association measures such as the $\tau$-a mentioned earlier.
The IJ variance is printed by default but the PH variance is also returned;
the earlier \code{survConcordance} function computed only the PH variance.

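Both estimates can be retrieved from the fitted object. A short sketch,
assuming the components are named \code{var} (IJ) and \code{cvar} (PH score),
as in current versions of the package:

<<twovariances, eval=FALSE>>=
ctemp <- concordance(fit4)
c(IJ = sqrt(ctemp$var), PH = sqrt(ctemp$cvar))  # two standard errors for C
@
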
The \code{concordance} function does not compute Kendall's $\tau$-a or
$\tau$-b, nor Goodman's $\gamma$.
However, since all of the necessary components for those values are returned,
along with the IJ influence for each, it can be used as the computational
engine for those measures and their variances, should someone wish to do so.

The variance computation accounts for the influence of each subject on both
the numerator and denominator of $C$; this agrees with the parallel
development used by Newson \cite{Newson06} and implemented in Stata.
An alternate approach is to treat the total number of
comparable pairs as an ancillary statistic, giving
${\rm var}(C) = {\rm var}(c-d)/(4 m)$ where $m$ is the number of comparable
pairs ($n(n-1)/2$ for uncensored data).
Arguments about what aspects of a dataset
can or should be treated as ancillary are as old as statistics, e.g.,
treating the margins of a 2x2 table as ancillary leads to Fisher's exact
test.
In this case we anticipate that the differences will be quite small, but have
done no formal exploration.

\subsection{Multiple concordances}
\label{sec:multiple}

One useful property of using a jackknife variance estimate is that the
variance of the
difference in concordance between two separately fitted models is also
easily obtained. If $c_a$ and $c_b$ are the two concordance statistics and
$U_{ia}$ and $U_{ib}$ the corresponding influence values,
the influence vector for $c_a-c_b$ is $U_a - U_b$.
(If subject $i$ increases $c_a$ by .03 and $c_b$ by .01, then he/she
raises the difference between them by .02.)
It is not necessary that the models be nested.
However, it is crucial that they be computed on exactly the same set of
observations.
Here is a comparison of concordance values from the previous models.

<<test>>=
ctest <- concordance(fit4, fit5, fit6)
ctest

# compare concordance values of fit4 and fit5
contr <- c(-1, 1, 0)
dtest <- contr %*% coef(ctest)
dvar <- contr %*% vcov(ctest) %*% contr
c(contrast=dtest, sd=sqrt(dvar), z=dtest/sqrt(dvar))
@

To do a similar comparison for models which do not have a \code{concordance}
method, use a set of dummy linear model fits as a container.
For instance, assume that \code{y} was continuous and that we have three
sets of predicted values \code{phat1}, \code{phat2}, and \code{phat3},
from three different
machine learning models, say, and we want to compute and compare their
concordances. Then one could use the following code:
\begin{verbatim}
dummy1 <- lm(y ~ phat1)
dummy2 <- lm(y ~ phat2)
dummy3 <- lm(y ~ phat3)
cfit <- concordance(dummy1, dummy2, dummy3)
print(cfit)
etc.
\end{verbatim}

In order to produce correct answers, it is necessary that \code{y},
\code{phat1}, \code{phat2},
and \code{phat3} be results for exactly the same observations, in exactly
the same order. If one model has some subjects removed for missing values,
say, then those subjects cannot be present in any of the ML fits.

\subsection{Asymmetric confidence intervals}
The infinitesimal jackknife (IJ) has provided us with an honest estimate of
the standard deviation of $C$.
A natural confidence interval for the concordance is then
$C \pm z_{\alpha}\, {\rm sd}(C)$.
As with confidence intervals for an ordinary proportion $\hat p$, however,
this simple interval can sometimes be inconsistent, giving CI endpoints that
lie outside of the legal range of $[0,1]$.
In the case of $\hat p$ there is a long history of methods to address this
issue, going back at least as far as the 1956 paper by
Anscombe \cite{Anscombe56}, but there is less
literature for the concordance
or AUC.
Newcombe \cite{Newcombe06} provides corrected methods, but with
the caveat that they ``have the drawback that on account of the large number
of outcomes they are computationally practicable only for very small sample
sizes.''
The tree based computations used in the \code{concordance} function might
well address the speed issue, but have not been implemented.

Here we pursue another avenue, which is to consider a transformation based
confidence interval, in much the same way as is done for confidence
intervals of a survival curve. That is, we use
$$
g^{-1}\left[g(C) \pm z\, \sigma(g(C)) \right]
$$
for some transformation function $g$. For survival curves, the $g$ functions
$\log(p)$, $\log(p/(1-p))$, $\log(-\log(1-p))$ and $\arcsin(p)$ have all been
found to be superior to the simple interval.

For the concordance,
consider the Fisher z-transform, widely used for the correlation coefficient
$r$:
\begin{equation}
  z = \frac{1}{2} \log\left(\frac{1+r}{1-r} \right) \label{Fisherz}
\end{equation}
Since Somers' $D$ and $r$ are targeted at similar concepts, we might hazard
that a similar transformation of Somers' $D$, which also ranges from $-1$
to $1$, would also be close to equivariant.
Since $D = 2C -1$ we have
\begin{align*}
  z_c &= \frac{1}{2} \log\left(\frac{1+ (2C-1)}{1- (2C-1)} \right) \\
      &= \frac{1}{2} \log\left(\frac{C}{1-C} \right)
\end{align*}
which we recognize as the inverse of the logistic link used in \code{glm}
models.

We can get the standard error of $z_c$ by retrieving the individual
dfbeta values and performing a transformation.
The dfbeta value is defined as $d_i = C - C_{-i}$, where the latter is the
$C$ statistic omitting the $i$th observation.
<<Cztrans>>=
zci <- function(fit, p=.95) {
    ilogist <- function(p) log(p/(1-p)) # inverse logistic
    logistic <- function(x) exp(x)/(1 + exp(x))
    temp <- concordance(fit, influence =1)
    cminus <- temp$concordance - temp$dfbeta # concordance without subject i

    newd <- ilogist(temp$concordance) - ilogist(cminus) # dfbeta, new scale
    new.sd <- sqrt(sum(newd^2))
    old.sd <- sqrt(sum(temp$dfbeta^2)) # same as sqrt(temp$var)

    z <- qnorm((1-p)/2)
    old.ci <- temp$concordance + c(z, -z)*old.sd
    new.ci <- logistic(ilogist(temp$concordance) + c(z, -z)* new.sd)
    rbind(old = old.ci, new= new.ci)
}

round(zci(colonfit), 4)
@

The two intervals hardly differ, which is what we would expect for a value
far from 1.
As a second example, create a small dataset with a concordance that is
close to 1. As shown below, the z-transform shifts the CI towards zero,
as it should, but also avoids the out-of-bounds endpoint.

<<close>>=
set.seed(1953)
ytest <- matrix(rexp(20), ncol=2) %*% chol(matrix(c(1, .98, .98, 1), 2))
cor(ytest)
lfit <- lm(ytest[,1] ~ ytest[,2])
zci(lfit)
@

\section{Details}
This section documents a few details; most readers can skip it.

The usual convention for survival data is to assume that censored values
come after deaths, even if they are recorded on the same day.
This corresponds to the common case that a subject who is censored on day
200, say, was actually seen on that day.
That is, their survival is strictly greater than 200.
As a consequence, the censoring weights $G$ actually use $G(t-)$ in the
code: if 10 subjects are censored on day 100, and these are the first
censorings in the study, then an event on day 100 should not be given a
larger weight.
(Both the Uno and Schemper papers ignore this detail.)

When using weights of $S(t)$ the program actually uses a weight of $nS(t-)$,
where $n$ is the number of observations in the dataset.
The reason is that for a stratified model the weighted number of concordant,
discordant and tied pairs is calculated separately for each stratum, and
then added together.
If one stratum were much smaller or larger than the others, we want to
preserve this fact in the sum.

\bibliography{refer}
\bibliographystyle{plain}

\end{document}