1373 lines
66 KiB
Plaintext
1373 lines
66 KiB
Plaintext
\documentclass{report}[notitlepage,11pt]
|
|
\usepackage{Sweave}
|
|
\usepackage{amsmath}
|
|
\usepackage{makeidx}
|
|
|
|
\addtolength{\textwidth}{1in}
|
|
\addtolength{\oddsidemargin}{-.5in}
|
|
\setlength{\evensidemargin}{\oddsidemargin}
|
|
%\VignetteIndexEntry{Survival package methods}
|
|
|
|
\SweaveOpts{keep.source=TRUE, fig=FALSE}
|
|
% Ross Ihaka suggestions
|
|
\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em}
|
|
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
|
|
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
|
|
\fvset{listparameters={\setlength{\topsep}{0pt}}}
|
|
%\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topep}}
|
|
|
|
\newcommand{\code}[1]{\texttt{#1}}
|
|
\newcommand{\Lhat}{\hat\Lambda}
|
|
\newcommand{\lhat}{\hat\lambda}
|
|
\newcommand{\KM}{{\rm KM}}
|
|
\newcommand{\NA}{{\rm NA}}
|
|
\newcommand{\FH}{{\rm FH}}
|
|
\newcommand{\Ybar}{\overline Y}
|
|
\newcommand{\Nbar}{\overline N}
|
|
|
|
|
|
\title{Further documentation on code and methods in the survival package}
|
|
\author{Terry Therneau}
|
|
\date{March 2024}
|
|
|
|
<<setup, echo=FALSE>>=
|
|
library(survival)
|
|
@
|
|
\makeindex
|
|
\begin{document}
|
|
\maketitle
|
|
|
|
\chapter{Introduction}
|
|
For many years I have used the noweb package to intersperse technical
|
|
documentation with the source code for the survival package.
|
|
However, despite its advantages, the uptake of noweb by the R community
|
|
in general has been nearly nil.
|
|
It is increasingly clear that no future maintainer will continue the
|
|
work ---
|
|
on the github page I have yet to recieve a suggested update that actually
|
|
fixed the .Rmd source file instead of the .R file derived from it. This means
|
|
that I can't merge a suggested change automatically, but have to replicate it
|
|
myself.
|
|
|
|
This document is a start at addressing this. As routines undergo maintainance,
|
|
I will remove the relevant .Rnw file in the noweb directory and work directly
|
|
on the C and R code, migrating the ``extra'' material into this document.
|
|
In this
|
|
vignette are discussions of design issues, algorithms, and detailed
|
|
formulas. In the .R and .c code I will place comments of the form
|
|
``See methods document, abc:def'', adding an abc:def entry to the index
|
|
of this document.
|
|
I am essentially splitting each noweb document into two parts.
|
|
The three advantages are
|
|
\begin{itemize}
|
|
\item Ease the transition to community involvement and maintainance.
|
|
\item The methods document will be a more ready source of documentation for
|
|
those who want to know technical details, but are not currently modifying
|
|
the code.
|
|
\item (minor) I expect it will be useful to have the methods document and
|
|
R or C code open simultaneously in 2 windows, when editing.
|
|
\end{itemize}
|
|
However, this conversion will be a long process.
|
|
|
|
\paragraph{Notation}
|
|
Throughout this document I will use formal counting process notation,
|
|
so you may as well get used to it. The primary advantage is that it
|
|
is completely precise, and part of the goal for this document is to give
|
|
a fully accurate description of what is computed.
|
|
In this notation we have
|
|
\begin{itemize}
|
|
\item $Y_i(t)$ = 1 if observation $i$ is at risk at time $t$, 0 otherwise.
|
|
\item $N_i(t)$ = total number of events, up to time $t$ for subject $i$.
|
|
\item $dN_i(t)$ = 1 if there was an event at exactly time $t$
|
|
\item $X$ = the matrix of covariates, with $n$ rows, one per observation, and
|
|
a column per covariate.
|
|
\item $X(s)$ = the time-dependent covariate matrix, if there are time-dependent
|
|
covariates
|
|
\end{itemize}
|
|
|
|
For multistate models the extends to $Y_{ij}(t)$, which is 1 if subject $i$ is
|
|
at risk and in state $j$ at time $t$, and $N_{ijk}(t)$ which is the number of
|
|
$j$ to $k$ transitions that have occured, for subject $i$, up to time $t$.
|
|
The number at risk and number of events at time $t$ can be written as
|
|
\begin{align*}
|
|
n(t) &= \sum_{i=1}^n w_iY_i(t) \\
|
|
&= \Ybar(t) \\
|
|
d(t) &= \sum_{i=1}^n w_i dN_i(t) \\
|
|
&= d\Nbar(t)
|
|
\end{align*}
|
|
where $w_i$ are optional weights for each observation.
|
|
$\Nbar(t)$ is the cumulative number of events up to time $t$.
|
|
I will often also use $r_i(t) = \exp(X_i(t)\beta)$ as the per observation risk
|
|
score in a Cox model.
|
|
|
|
|
|
\chapter{Survival Curves}
|
|
\index{survfit}
|
|
The survfit function was set up as a method so that we could apply the
|
|
function to both formulas (to compute the Kaplan-Meier) and to coxph
|
|
objects.
|
|
The downside to this is that the manual pages get a little odd:
|
|
\code{survfit} is generic and \code{survfit.formula} and \code{survfit.coxph}
|
|
are the ones a user will want. But from
|
|
a programming perspective it has mostly been a good idea.
|
|
At one time, long long ago, we allowed the function to be called with
|
|
``Surv(time, status)'' as the formula, i.e., without a right hand side
|
|
of \textasciitilde 1. That was
|
|
a bad idea, now abandoned: \code{survfit.Surv} is now a simple stub that
|
|
prints an error message.
|
|
|
|
\section{Roundoff and tied times}
|
|
\index{tied times}
|
|
One of the things that drove me nuts was the problem of
|
|
``tied but not quite tied'' times. It is a particular issue for both
|
|
survival curves and the Cox model, as both treat a tied time differently than
|
|
two close but untied values.
|
|
As an example consider two values of 24173 = 23805 + 368. These are values from
|
|
an actual study with times in days: enrollment at age 23805 days and then 368
|
|
days of follow-up.
|
|
However, the user chose to use age in years, and saved those values out
|
|
in a CSV file, the left hand side of the above equation becomes
|
|
66.18206708000000 and the right hand side addition yeilds 66.18206708000001.
|
|
The R phrase \code{unique(x)} sees these two values as distinct but
|
|
\code{table(x)} and \code{tapply} see it as a single value since they
|
|
first apply \code{factor} to the values, and that in turn uses
|
|
\code{as.character}.
|
|
A transition through CSV is not necessary to create the problem. Consider the
|
|
small code chunk below. For someone born on 1960-03-10, it caclulates the
|
|
time interval between a study enrollment on each date from 2010-01-01 to
|
|
2010-07-29 (200 unique dates) and a follow up that is exactly 29 days later,
|
|
but doing so on age scale.
|
|
<<test, echo=TRUE>>=
|
|
tfun <- function(start, gap, birth= as.Date("1960-01-01")) {
|
|
as.numeric(start-birth)/365.25 - as.numeric((start + gap)-birth)/365.25
|
|
}
|
|
|
|
test <- logical(200)
|
|
for (i in 1:200) {
|
|
test[i] <- tfun(as.Date("2010/01/01"), 29) ==
|
|
tfun(as.Date("2010/01/01") + i, 29)
|
|
}
|
|
table(test)
|
|
@
|
|
The number of FALSE entries in the table depends on machine, compiler,
|
|
and possibly several other issues.
|
|
There is discussion of this general issue in the R FAQ: ``why doesn't R
|
|
think these numbers are equal''.
|
|
The Kaplan-Meier and Cox model both pay careful attention to ties, and
|
|
so both now use the \code{aeqSurv} routine to first preprocess
|
|
the time data. It uses the same rules as \code{all.equal} to
|
|
adjudicate ties and near ties. See the vignette on tied times for more
|
|
detail.
|
|
|
|
|
|
The survfit routine has been rewritten more times than any other in the package,
|
|
as we trade off simplicty of the code with execution speed.
|
|
The current version does all of the oranizational work in S and calls a C
|
|
routine for each separate curve.
|
|
The first code did everything in C but was too hard to maintain and the most
|
|
recent prior function did nearly everything in S.
|
|
Introduction of robust variance
|
|
prompted a movement of more of the code into C since that calculation
|
|
is computationally intensive.
|
|
|
|
The \code{survfit.formula} routine does a number of data checks, then
|
|
hands the actual work off to one of three computational routes: simple survival
|
|
curves using the \code{survfitKM.R} and \code{survfitkm.c} functions,
|
|
interval censored data to \code{survfitTurnbull.R},
|
|
and multi-state curves using the \code{survfitAJ.R} and \code{survfitaj.c} pair.
|
|
|
|
The addition of \code{+ cluster(id)} to the formula was the suggested form
|
|
at one time, we now prefer \code{id} as a separate argument. Due to the
|
|
long comet's tail of usage we are unlikely to formally depreciate the older
|
|
form any time soon.
|
|
The istate argument applies to multi-state models (Aalen-Johansen), or any data
|
|
set with multiple rows per subject; but the code refrains from complaint if
|
|
it is present when not needed.
|
|
|
|
\section{Single outcome}
|
|
\index{Kaplan-Meir}
|
|
We bein with classic survival, where there is a single outcome. Multistate
|
|
models will follow in a separate section.
|
|
|
|
At each event time we have
|
|
\begin{itemize}
|
|
\item n(t) = weighted number at risk
|
|
\item d(t) = weighted number of events
|
|
\item e(t) = unweighted number of events
|
|
\end{itemize}
|
|
leading to the Nelson-Aalen estimate of cumulative hazard and the
|
|
Kaplan-Meir estimate of survival, and the estimate of hazard at each
|
|
time point.
|
|
\begin{align*}
|
|
KM(t) &= KM(t-) (1- d(t)/n(t) \\
|
|
NA(t) &= NA(t-) + d(t)/n(t)
|
|
h(t) &= d(t)/n(t) \\
|
|
\end{align*}
|
|
|
|
An alternative estimate in the case of tied times is the Fleming-Harrington.
|
|
When there are no case weights the FH idea is quite simple.
|
|
The assumption is that the real data is not tied, but we saw a coarsened version.
|
|
If we see 3 events out of 10 subjects at risk the NA increment is 3/10, but the
|
|
FH is 1/10 + 1/9 + 1/8, it is what we would have seen with the
|
|
uncoarsened data.
|
|
If there are case weights we give each of the 3 terms a 1/3 chance of being
|
|
the first, second, or third event
|
|
\begin{align*}
|
|
KM(t) &= KM(t-) (1- d(t)/n(t) \\
|
|
NA(t) &= NA(t-) + d(t)/n(t) \\
|
|
FH(t) &= FH(t-) + \sum_{i=1}^{3} \frac{(d(t)/3}{n(t)- d(t)(i-1)/3}
|
|
\end{align*}
|
|
If we think of the size of the denominator as a random variable $Z$, an
|
|
exact solution would use $E(1/Z)$, the FH uses $1/E(Z)$ and the NA uses
|
|
$1/\max(Z)$ as the denominator for each of the 3 deaths.
|
|
Although Fleming and Harrington were able to show that the FH has a lower
|
|
MSE than the KM, it little used. The primary reasons for its inclusion
|
|
in the package are first, that I was collaborating with them at the time so
|
|
knew of these results, but second and more importantly, this is identical to
|
|
the argument for the Efron approximation in a Cox model. (The NA hazard
|
|
corresponds to the Breslow approximation.)
|
|
The Efron approx is the default for the coxph function, so post coxph survival
|
|
curves need to deal with it.
|
|
|
|
When one of these 3 subjects has an event but continues to be at risk,
|
|
which can happen with start/stop data, then the argument gets trickier.
|
|
Say that the first of the 3 continues, the others do not. We can argue that
|
|
subject 1 remains at risk for all 3 denominators, or not.
|
|
The first is more a mathematical viewpoint, the second more medical, e.g.,
|
|
in a study with repeated infections you will never have a second one
|
|
recorded on the same day.
|
|
For multi-state, we finally ``tossed in the towel'' and now use the Breslow
|
|
approx as default for multi-state hazard (coxph) models.
|
|
For single state, the FH estimate is rarely reqested but we still do our best
|
|
to handle all aspects of it (pride and history), but there would be few tears
|
|
if it were dropped.
|
|
|
|
When there is (time1, time2) data the code uses a \code{position} vector,
|
|
explained further below which is 1 if this is obs is the leftmost for a subjet
|
|
and 2 if it is the rightmost, 3 if it is both. A primary purpose was to not
|
|
count a subject as an extra entry and censor in the middle of a string of
|
|
times such as (0, 10), (20, 35), (35, 50), but we also use it to moderate the
|
|
FH estimate: only those with an event at the end of their intervals participate
|
|
in the special computation for ties.
|
|
|
|
The survfitKM call has arguments of
|
|
\begin{itemize}
|
|
\item y: a matrix y containing survival times, either 2 or 3 columns
|
|
\item weight: vector of case weights
|
|
\item ctype: compuation for the cumulative hazard: either the Nelson-Aalen (1)
|
|
or Efron approximation for ties (2) approach. Number 2 is very rarely used.
|
|
\item stype: computation for the survival curve: either product-limit (1),
|
|
also known as the Kaplan-Meier or exp(-cumulative hazard) (2).
|
|
Use of ctype=2, stype=2 matches a Cox model using the Efron approximation
|
|
for ties, ctype=1, stype=2 is the Fleming-Harrington estimate of survival
|
|
or a Cox model with the Breslow approximation for ties.
|
|
\item type: older form of the ctype/stype argument, retained for backwards
|
|
compatability. Type 1= (ctype 1/ stype 1), 2= (2, 1), 3 = (1,2), 4= (2,2).
|
|
\item id: subject id, used for data checks when there are multiple rows per
|
|
subject
|
|
\item cluster: clustering used for the robust variance. Clustering is based
|
|
on id if cluster is missing (the usual case)
|
|
\item influence: return the influence matrix for the survival
|
|
\item start.time: optional starting time for the curve
|
|
\item entry: return entry times for (time1, time2) data
|
|
\item time0: include time 0 in the output
|
|
\item se.fit: compute the standard errors
|
|
\item conf.type, conf.int, conf.lower: options for confidence intervals
|
|
\end{itemize}
|
|
|
|
|
|
\subsection{Confidence intervals}
|
|
If $p$ is the survival probability and $s(p)$ its standard error,
|
|
we can do confidence intervals on the simple scale of
|
|
$ p \pm 1.96 s(p)$, but that does not have very good properties.
|
|
Instead use a transformation $y = f(p)$ for which the standard error is
|
|
$s(p) f'(p)$, leading to the confidence interval
|
|
\begin{equation*}
|
|
f^{-1}\left(f(p) +- 1.96 s(p)f'(p) \right)
|
|
\end{equation*}
|
|
Here are the supported transformations.
|
|
\begin{center}
|
|
\begin{tabular}{rccc}
|
|
&$f$& $f'$ & $f^{-1}$ \\ \hline
|
|
log & $\log(p)$ & $1/p$ & $ \exp(y)$ \\
|
|
log-log & $\log(-\log(p))$ & $1/\left[ p \log(p) \right]$ &
|
|
$\exp(-\exp(y)) $ \\
|
|
logit & $\log(p/1-p)$ & $1/[p (1-p)]$ & $1- 1/\left[1+ \exp(y)\right]$ \\
|
|
arcsin & $\arcsin(\sqrt{p})$ & $1/(2 \sqrt{p(1-p)})$ &$\sin^2(y)$ \\
|
|
|
|
\end{tabular} \end{center}
|
|
Plain intervals can give limits outside of (0,1), we truncate them when this
|
|
happens. The log intervals can give an upper limit greater than 1 (rare),
|
|
again truncated to be $\le 1$; the lower limit is always valid.
|
|
The log-log and logit are always valid. The arcsin requires some fiddling at
|
|
0 and $\pi/2$ due to how the R function is defined, but that is only a
|
|
nuisance and not a real flaw in the math. In practice, all the intervals
|
|
except plain appear to work well.
|
|
In all cases we return NA as the CI for survival=0: it makes the graphs look
|
|
better.
|
|
|
|
Some of the underlying routines compute the standard error of $p$ and some
|
|
the standard error of $\log(p)$. The \code{selow} argument is used for the
|
|
modified lower limits of Dory and Korn. When this is used for cumulative
|
|
hazards the ulimit arg will be FALSE: we don't want to impose an upper limit
|
|
of 1.
|
|
|
|
\subsection{Robust variance}
|
|
\index{survfitkm!robust variance}
|
|
For an ordinary Kapan-Meier curve it can be shown that the infinitesimal
|
|
jackknife (IJ) variance is identical to the Greenwood estimate, so the extra
|
|
compuational burden of a robust estimate is unnecessary.
|
|
The proof does not carry over to curves with (time1, time2) data.
|
|
The use of (time1, time2) data for a single outcomes arises in 3 cases, with
|
|
illustrations shown below.
|
|
\begin{itemize}
|
|
\item Delayed entry for a subset of subjects. The robust variance in
|
|
this case is not identical to the Greenwood formula, but is very close in
|
|
all the situations I have seen.
|
|
\item Multiple events of the same type, as arises in reliability data. In this
|
|
case the cumulative hazard rather than P(state) is the estimate of most
|
|
interest; it is known as the mean cumulative function (MCF)
|
|
in the reliability literature. If there are multiple events per subject
|
|
(the usual) an id variable is required.
|
|
A study with repeated glycemic events as the endpoint (not shown) had a
|
|
mean of 48 events per subject over 5.5 months. In this case
|
|
accounting for within-subject correlation was crucial;
|
|
the ratio of robust/asymptotic standard error was 4.4. For the valveSeat
|
|
data shown below the difference is minor.
|
|
\item Survival curves for a time-dependent covariate, the so called ``extended
|
|
Kaplan-Meier'' \cite{Snapinn05}.
|
|
I have strong statistical reservations about this method.
|
|
\end{itemize}
|
|
Here are examples of each. In the myeloma data set the desired time scale was
|
|
time since diagnosis, and some of the patients in the study had been diagnosed
|
|
at another institution before referral to the study institution.
|
|
|
|
<<survrepeat, fig=TRUE, echo=TRUE>>=
|
|
fit1a <- survfit(Surv(entry, futime, death) ~ 1, myeloma)
|
|
fit1b <- survfit(Surv(entry, futime, death) ~ 1, myeloma, id=id, robust=TRUE)
|
|
matplot(fit1a$time/365.25, cbind(fit1a$std.err, fit1b$std.err/fit1b$surv),
|
|
type='s',lwd=2, lty=1, col=2:3, #ylim=c(0, .6),
|
|
xlab="Years post diagnosis", ylab="Estimated sd of log(surv)")
|
|
#
|
|
# when two valve seats failed at the same inspection, we need to jitter one
|
|
# of the times, to avoid a (time1, time2) interval of length 0
|
|
ties <- which(with(valveSeat, diff(id)==0 & diff(time)==0)) #first of a tie
|
|
temp <- valveSeat$time
|
|
temp[ties] <- temp[ties] - .1
|
|
vdata <- valveSeat
|
|
vdata$time1 <- ifelse(!duplicated(vdata$id), 0, c(0, temp[-length(temp)]))
|
|
vdata$time2 <- temp
|
|
fit2a <- survfit(Surv(time1, time2, status) ~1, vdata)
|
|
fit2b <- survfit(Surv(time1, time2, status) ~1, vdata, id=id)
|
|
plot(fit2a, cumhaz=TRUE, xscale=365.25, xlab="Years in service",
|
|
ylab="Estimated number of repairs")
|
|
lines(fit2b, cumhaz=TRUE, lty=c(1,3,3))
|
|
legend(150, 1.5, c("Estimate", "asymptotic se", "robust se"), lty=1:3, bty='n')
|
|
|
|
#
|
|
# PBC data, categorized by most recent bilirubin
|
|
# as an example of the EKM
|
|
pdata <- tmerge(subset(pbcseq, !duplicated(id), c(id, trt, age, sex, stage)),
|
|
subset(pbcseq, !duplicated(id, fromLast=TRUE)), id,
|
|
death= event(futime, status==2))
|
|
bcut <- cut(pbcseq$bili, c(0, 1.1, 5, 100), c('normal', 'moderate', 'high'))
|
|
pdata <- tmerge(pdata, pbcseq, id, cbili = tdc(day, bcut))
|
|
pdata$ibili <- pdata$cbili[match(pdata$id, pdata$id)] # initial bilirubin
|
|
|
|
ekm <- survfit(Surv(tstart, tstop, death) ~ cbili, pdata, id=id)
|
|
km <- survfit(Surv(tstart, tstop, death) ~ ibili, pdata, id=id)
|
|
plot(ekm, fun='event', xscale=365.25, lwd=2, col=1:3, conf.int=TRUE,
|
|
lty=2, conf.time=c(4,8,12)*365.25,
|
|
xlab="Years post enrollment", ylab="Death")
|
|
lines(km, fun='event', lwd=1, col=1:3, lty=1)
|
|
# conf.time= c(4.1, 8.1, 12.1)*365.25)
|
|
text(c(4600, 4300, 2600), c(.23, .56, .78), c("Normal", "Moderate", "High"),
|
|
col=1:3, adj=0)
|
|
legend("topleft", c("KM", "EKM"), lty=1:2, col=1, lwd=2, bty='n')
|
|
@
|
|
|
|
The EKM plot is complex: dashed are the EKM along with
|
|
confidence itervals, solid lines are the KM stratified by enrollment bilirubin.
|
|
The EKM bias for normal and moderate is large, but confidence intervals differ
|
|
little between the robust and asymptotic se (latter not shown).
|
|
|
|
As shown above, more often than not the robust IJ variance is not needed for
|
|
the KM curves themselves.
|
|
However, they are the central computation for psuedovalues, which are steadily
|
|
increasing in popularity.
|
|
Let $U_k(t)$ be the IJ for observation $k$ at time $t$.
|
|
This is a vector, but in section \ref{sect:IJ2} for multi-state data it will
|
|
be a matrix with 1 column per state.
|
|
Likewise let $C(t)$ and $A(t)$ be influence vectors for the cumulative hazard
|
|
and the area under the curve at time $t$.
|
|
Let $h(t)$ be the hazard increment at time $t$.
|
|
Then
|
|
|
|
\begin{align}
|
|
h(t) &= \frac{\sum_i w_idN_i(t)}{\sum_i w_i Y_i(t)} \label{haz1}\\
|
|
\frac{\partial h(t)}{\partial w_k} &= \frac{dN_i(t) - Y_k(t)h(t)}
|
|
{\sum_i w_i Y_i(t)} \label{dhaz1}\\
|
|
&= C_k(t) - C_k(t-) \nonumber\\
|
|
S(t) &= \prod_{s\le t} 1-h(s) \label {KM1}\\
|
|
&= S(t-) [1-h(t)] \nonumber \\
|
|
U_k(t) &= \frac{\partial S(t)}{\partial w_k} \\
|
|
&= U_k(t-)[1-h(t)] - S(t-)\frac{dN_i(t) - Y_k(t)h(t)}{\sum_i w_i Y_i(t)}
|
|
\label{dsurv1} \\
|
|
U_k(t) &= \frac{\partial \prod_{s\le t} 1-h(s)}{\partial w_k} \nonumber \\
|
|
&= \sum_{s\le t} \frac{S(t)}{1-h(s)}
|
|
\frac{dN_i(s) - Y_k(s)h(s)}{\sum_i w_i Y_i(s)} \nonumber \\
|
|
&= S(t) \sum_{s\le t} \frac{dN_i(s) - Y_k(s)h(s)}
|
|
{[1-h(s)] \sum_i w_i Y_i(s)} \label{dsurv2} \\
|
|
&= S(t) \int_0^t \frac{\partial \log[1-h_k(s)]}{w_k} d\Nbar(s)
|
|
\label{dsurv3}
|
|
\end{align}
|
|
|
|
The simplest case is $C_k(\tau)$ at a single user requested reporting time
|
|
$\tau$, i.e., a simple use of the pseudo routine.
|
|
The obvious code will update all $n$ elements of $C$ at each event time,
|
|
an $O(nd)$ computation where $d$ is the number of unique event times $\le \tau$.
|
|
For most data $d$ and $n$ grow together, and $O(n^2)$ computations are something
|
|
we want to avoid at all costs, a large data set will essentially freeze the
|
|
computer.
|
|
|
|
The code solution is to create a running total $z$ of the right hand term of
|
|
\eqref{dhaz1}, which applies to all subjects at risk. Then the leverage
|
|
for an observation at risk over the interval $(a,b)$ is
|
|
\begin{align*}
|
|
z(t) &= \sum_{s\le t} \frac{h(s)}{\sum_i w_i Y_i(s)} \\
|
|
C_k(\tau) &= frac{N_k(\tau)}{\sum_i w_i Y_i(b)} +z(\min(a,\tau)) -
|
|
z(\min(b,\tau))
|
|
\end{align*}
|
|
In R code this becomes an indexing problem. We can create 3 $n$ by $m$ =
|
|
number of
|
|
reporting times matrices that point to the correct elements of the vectors
|
|
\code{c(0, 1/n.risk)} and \code{c(0, n.event/n.risk\^2)} obtained from the
|
|
survfit object.
|
|
|
|
The computation for survival starts out exactly the same if using the exponential
|
|
estimate $S(t) = \exp(-\Lambda(t))$, otherwise do the same computation but
|
|
using the rightmost term of equation \eqref{dsurv2}.
|
|
At the end multiply the column for reporting time $\tau$ by $-S(\tau)$, for
|
|
each $\tau$.
|
|
|
|
There are two ways to look at the influence on the sojuourn time, which is
|
|
calculated as the area under the curve.
|
|
\index{survfit!AUC}
|
|
They are shown in the figure \ref{AUCfig} for an AUC at time 55.
|
|
|
|
\begin{figure}
|
|
<<auc, echo=FALSE, fig=TRUE>>=
|
|
test <- survfit(Surv(time, status) ~1, aml, subset=(x=="Maintained"))
|
|
ntime <- length(test$time)
|
|
oldpar <- par(mfrow=c(1,2), mar=c(5,5,1,1))
|
|
plot(test, conf.int=FALSE, xmax=60)
|
|
jj <- (test$n.event > 0)
|
|
segments(test$time[jj], test$surv[jj], test$time[jj], 0, lty=2)
|
|
segments(55, test$surv[9], 55, 0, lty=2)
|
|
points(c(0, test$time[jj]), c(1, test$surv[jj]))
|
|
segments(0,0,55,0)
|
|
segments(0,0,0,1)
|
|
|
|
plot(test, conf.int=FALSE, xmax=60)
|
|
segments(pmin(test$time,55), test$surv, 55, test$surv, lty=2)
|
|
segments(55,test$surv[ntime],55,1)
|
|
segments(test$time[1], 1, 55, 1, lty=2)
|
|
points(c(test$time[jj],100), .5*(c(1, test$surv[jj]) + c(test$surv[jj], 0)))
|
|
par(oldpar)
|
|
@
|
|
\caption{Two ways of looking at the AUC}
|
|
\label{AUCfig}
|
|
\end{figure}
|
|
|
|
The first uses the obvious rectangle rule. The leverage of observation
|
|
$i$ on AUC(55) is the sum of the leverage of observation $i$ on the height
|
|
of each of the circles times the width of the associated rectangle.
|
|
The leverage on the height of each circle are the elements of $U$.
|
|
The second figure depends on the fact that the influence on the AUC must be
|
|
the same as the influence on 55-AUC. It will be the influence of each
|
|
observation on the length
|
|
of each circled segment, times the distance to 55.
|
|
This is closely related to the asymptotic method used for the ordinary KM,
|
|
which assumes that the increments are uncorrelated.
|
|
|
|
Using the first approach, let
|
|
Let $w_0, w_1, \ldots, w_m$ be the widths of the $m+1$ rectangles under the
|
|
survival curve $S(t)$ from 0 to $\tau$, and $d_1, \ldots, d_m$ the set of
|
|
event times that are $\le \tau$.
|
|
Further, let $u_{ij}$ be the additive elements that make up $U$ as found in
|
|
the prior formulas, i.e.,
|
|
\begin{align*}
|
|
U_{ik} &= S(d_k) \sum_{j=1}^k u_{ij} \\
|
|
u_{ij} &= \frac{\partial \log(1 - h(d_j))}{\partial w_i} \; \mbox{KM}\\
|
|
&= -\frac{dN_i(d_j) - Y_i(d_j) h(d_j)}{[1-h(d_j)] \Ybar(d_j)} \\
|
|
u_{ij}^*&= \frac{\partial - h(d_j)}{\partial w_i} \; \mbox{exp(chaz)}\\
|
|
&= -\frac{dN_i(d_j) - Y_i(d_j) h(d_j)}{\Ybar(d_j)} \\
|
|
\end{align*}
|
|
The second varianct holsd when using the exponential form of $S$.
|
|
|
|
Finally, let $A(a,b)$ be the integral under $S$ from $a$ to $b$.
|
|
\begin{align}
|
|
A(0,\tau) &= w_0 1 +
|
|
\sum_{j=1}^m w_j S(d_j) \\nonumber \\
|
|
\frac{\partial A(0,\tau}{\partial w_k} &=
|
|
\frac{\partial A(d_1,\tau}{\partial w_k} \nonumber \\
|
|
\frac{\partial A(d_1,\tau}{\partial w_k} &=
|
|
\sum_{j=1}^m w_j U_{kj} \nonumber \\
|
|
&= \sum_{j=1}^m w_j S(d_j) \sum_{i=1}^j u_{ki} \nonumber \\
|
|
&= \sum_{i=1}^m u_{ki} \left[\sum_{j=i}^m w_j S(d_j) \right] \nonumber \\
|
|
&= \sum_{i=1}^m A(d_i, \tau) u_{ki} \label{eq:auctrick}
|
|
\end{align}
|
|
|
|
This is a similar sum to before, with a new set of weights. Any given
|
|
(time1, time2) observation involves $u$ terms over that (time1,time2) range,
|
|
we can form a single cumulative sum and use the value at time2 minus the value
|
|
at time1.
|
|
A single running total will not work for multiple $\tau$ values, however,
|
|
since the weights depend on $\tau$.
|
|
Write $A(d_i, \tau) = A(d_1, \tau) - A(d_1,d_i)$ and separate the two sums.
|
|
For the first $A(d_1,\tau)$ can be moved outside the sum, and so will play
|
|
the same role as $S(d)$ in $U$, the second becomes a weighted cumulative
|
|
sum with weights that are independent of tau.
|
|
|
|
Before declaring victory with this last modification take a moment assess
|
|
the computational impact of replacing $A(d_i,\tau)$ with two terms.
|
|
Assume $m$ time points, $d$ deaths, and $n$ observations.
|
|
Method 1 will need to $O(d)$ to compute the weights, $O(d)$ for the weighted
|
|
sum, and $O(2n)$ to create each column of the output for (time1, time2) data,
|
|
for a total of $O(2m (n+d))$. The second needs to create two running sums, each
|
|
$O(d)$ and $O(4n)$ for each column of output for $O(2d + 4mn)$.
|
|
The more 'clever' appoach is always slightly slower!
|
|
(The naive method of adding up rectangles is much slower than either, however,
|
|
because it needs to compute $U$ at every death time.)
|
|
|
|
\paragraph{Variance}
|
|
|
|
Efficient computation of a the variance of the cumulative hazard within
|
|
survfit is more complex,
|
|
since this is computed at every event time. We consider three cases below.
|
|
|
|
Case 1: The simplest case is unclustered data without delayed entry;
|
|
not (time1,time2) data.
|
|
\begin{itemize}
|
|
\item Let $W= \sum_i w_i^2$ be the sum of case weights for all those at
|
|
risk. At the start everyone is in the risk set. Let $v(t)=0$ and
|
|
and $z(t)=0$ be running sums.
|
|
\item At each event time $t$
|
|
\begin{enumerate}
|
|
\item Update $z(t) = z(t-) - h(t)/\Ybar(t)$, the running sum of the
|
|
right hand term in equation \eqref{dhaz1}.
|
|
\item For all $i$= (event or censored at $t$), fill in their element
|
|
in $C_i$, add $(w_i C_i)^2$ to $v(t)$, and subtract $w_i^2$ from $W$.
|
|
\item The variance at this time point is $v(t)+ Wz^2(t)$; all those still
|
|
at risk have the same value of $C_k$ at this time point.
|
|
\end{enumerate}
|
|
\end{itemize}
|
|
|
|
Case 2: Unclustered with delayed entry.
|
|
Let the time interval for each observation $i$ be $(s_i, t_i)$. In this case
|
|
the contribution at each event time, in step 3 above, for those currently at
|
|
risk is
|
|
\begin{equation*}
|
|
\sum_j w_j^2 [z(t)- z(s_j)]^2 = z^2(t)\sum_jw_j^2 + \sum_jw_j^2z^2(s_j)
|
|
- 2 z(t) \sum(w_j z(s_j)
|
|
\end{equation*}
|
|
This requires two more running sums. When an observation leaves the risk set
|
|
all three of them are updated.
|
|
In both case 1 and 2 our algorithm is $O(d +n)$.
|
|
|
|
Case 3: Clustered variance. The variance at each time point is now
|
|
\begin{equation}
|
|
\rm{var}\NA(t) = \sum_g\left( \sum_{i \in g} w_i \frac{\partial \NA(t)}
|
|
{\partial w_i} \right)^2 \label{groupedC}
|
|
\end{equation}
|
|
The observations within a cluster do not necessarily have the same case
|
|
weight, so this does not collapse to one of the prior cases.
|
|
It is also more difficult to identify when a group is no longer at risk
|
|
and could be moved over to the $v(t)$ sum.
|
|
In the common case of (time1, time2) data the cluster is usually a single
|
|
subject and the weight will stay constant, but we can not count on that.
|
|
In a marginal structural model (Robbins) for instance the inverse probability
|
|
of censoring weights (IPCW) will change over time.
|
|
|
|
The above, plus the fact that the number of groups may be far less than the
|
|
number of observations, suggests a different approach.
|
|
Keep the elements of the weighted grouped $C$ vector, one row per group rather
|
|
than one row per observation.
|
|
This corresponds to the inner term of equation \eqref{groupedC}.
|
|
Now update each element at each event time.
|
|
This leads to an $O(gd)$ algorithm.
|
|
A slight improvement comes from realizing that the increment for group $g$
|
|
at a given time involves the sum of $Y_i(t)w_i$, and only updating those
|
|
groups with a non-zero weight.
|
|
|
|
For the Fleming-Harrington update, the key realization is that if there are $d$
|
|
death at a given timepoint, the increment in the cumulative hazard NA(t) is
|
|
a sum of $d$ 'normal' updates, but with perturbed weights for the tied deaths.
|
|
Everyone else at risk gets a common update to the hazard.
|
|
The basic computation and equations for $C$ involve an small loop over $d$ to
|
|
create the increments, sort of like an aside in a stage play, but then
|
|
adding them into $C$ is the same as before. The basic steps for case 1--3
|
|
do not change.
|
|
|
|
The C code for survfit has adapted case 3 above, for all three of the hazard,
|
|
survival, and AUC.
|
|
A rationale is that if the data set is very large the user can choose
|
|
\code{std.err =FALSE} as an option, then later use the residuals function to
|
|
get values at selected times. When the number of deaths gets over 1000, which
|
|
is where we begin to really notice the $O(nd)$ slowdown, a plot only needs
|
|
100 points to look smooth. Most use cases will only need variance at a few
|
|
points. Another reason is that for ordinary survival, robust variance is almost
|
|
never requested unless there is clustering.
|
|
|
|
|
|
|
|
\section{Aalen-Johansen}
|
|
In a multistate model let the hazard for the $jk$ transition be
|
|
\begin{align*}
|
|
\hat\lambda_{jk}(t) &= \frac{\sum_i dN_{ijk}(t)}{\sum_i Y_{ij}(t)}\\
|
|
&= \frac{{\overline N}_{jk}(t)}{\Ybar(t)}
|
|
\end{align*}
|
|
and gather terms together into the nstate by nstate matrix $A(t)$
|
|
with $A_{jk}(t) = \hat\lambda_{jk}(t)$, with $A_{jj}(t) = -\sum_{k \ne j} A_{jk}(t)$.
|
|
That is, the row sums of $A$ are constrainted to be 0.
|
|
|
|
The cumulative hazard estimate for the $j:k$ transition is the cumulative sum of
|
|
$\hat\lambda_{jk}(t)$. Each is treated separately, there is no change in
|
|
methods or formula from the single state model.
|
|
|
|
The estimated probability in state $p(t)$ is a vector with one element per
|
|
state and the natural constraint that the elements sum to 1; it is a probability
|
|
distribution over the states.
|
|
Two estimates are
|
|
\begin{align}
|
|
p(t) &= p(0) \prod_{s \le t} e^{A(s)} \label{AJexp}\\
|
|
p(t) &= p(0) \prod_{s\le t} (I + A(s)) \nonumber \\
|
|
&= p(0) \prod_{s\le t} H(s) \label{AJmat}
|
|
\end{align}
|
|
Here $p(0)$ is the estimate at the starting time point.
|
|
Both $\exp(A(s))$ and $I + A(s)$ are transition matrices, i.e., all elements are
|
|
positive and rows sum to 1. They encode the state transition probabilities
|
|
at time $s$, the diagonal is the probability of no transition for observations
|
|
in the given state. At any time point with no observed transitions $A(s)=0$
|
|
and $H(s)= I$; wlog the product is only over the discrete times $s$ at which
|
|
an event (transition) actually occured.
|
|
For matrices $\exp(C)\exp(D) \ne \exp(C+D)$ unless $DC= CD$, i.e.,
|
|
equation \eqref{AJexp} does not simplify.
|
|
|
|
Equation \eqref{AJmat} is known as the Aalen-Johansen estimate. In the case
|
|
of a single state it reduces to the Kaplan-Meier, for competing risks it
|
|
reduces to the cumulative incidence (CI) estimator.
|
|
Interestingly, the exponential form \eqref{AJexp} is
|
|
almost never used for survfit.formula, but is always used for the curves after
|
|
a Cox model.
|
|
For survfit.formula, but of the forms obey the basic probability rules that
|
|
$0 \le p_k(t) \le 1$ and $\sum_k p_k(t) =1$; all probabilities are between 0 and
|
|
1 and the sum is 1.
|
|
The analog of \eqref{AJmat} has been proposed by some authors for Cox model
|
|
curves, but it can lead to negative elements of $p(t)$ in certain cases, so the
|
|
survival package does not support it.
|
|
|
|
The compuations are straightforward. The code makes two passes through the
|
|
data, the first to create all the counts: number at risk, number of events of
|
|
each type, number censored, etc. Since for most data sets the total number at
|
|
risk decreses over time this pass is done in reverse time order since it leads
|
|
to more additions than subtractions in the running number at risk and so less
|
|
prone to round off error.
|
|
(Unless there are extreme weights this is gilding the lily - there will not be
|
|
round off error for either case.)
|
|
The rule for tied times is that events happen first, censors second, and
|
|
entries third; imagine the censors at $t+\epsilon$ and entries at $t+ 2\epsilon$.
|
|
|
|
When a particular subject has multiple observations, say times of (0,10), (10,15)
|
|
and (15,24), we don't want the output to count this as a ``censoring''
|
|
and/or entry at time 10 or 15.
|
|
The \code{position} vector is 1= first obs for a
|
|
subject, 2= last obs, 3= both (someone with only one row of data), 0=neither.
|
|
This subject would be 1,0,0,2.
|
|
The position vector is created by the \code{survflag} routine.
|
|
If the same subject id were used in two curves the counts are separate for each
|
|
curve; for example curves by enrolling institution in a multi-center trial,
|
|
where two centers happened to have an overlapping id value.
|
|
|
|
A variant of this is if a given subject changed curves at time 15, which occurs
|
|
when users are estimating the ``extended Kaplan-Meier'' curves proposed by
|
|
Snapinn \cite{Snapinn05}, in which case the subject will be counted as censored
|
|
at time 15 wrt the first curve, and an entry at time 15 for the second.
|
|
(I myself consider this approach to be statistically unsound.)
|
|
|
|
If a subject changed case weights between the (0,10) and (10,15) interval we
|
|
do not count them as separate entry and exits for the weighted n.enter
|
|
and n.censor counts. This means that the running total of weighted n.enter,
|
|
n.event, and n.censor will not recreate the weighted n.risk value; the latter
|
|
\emph{does} change when weights change in this way. The rationale for this
|
|
is that the entry and censoring counts are mostly used for display, they
|
|
do not participate in further computations.
|
|
|
|
|
|
\subsection{Data}
|
|
The survfit routine uses the \code{survcheck} routine, internally, to verify
|
|
that the data set follows an overall rule that each
|
|
subject in the data set follows a path which is physically possible:
|
|
\begin{enumerate}
|
|
\item Cannot be in two places at once (no overlaps)
|
|
\item Over the interval from first entry to last follow-up, they have to be
|
|
somewhere (no gaps).
|
|
\item Any given state, if entered, must be occupied for a finite amount of
|
|
time (no zero length intervals).
|
|
\item States must be consistent. For an interval (t1, t2, B) = entered state B
|
|
at time t2, the current state for the next interval must be B
|
|
(no teleporting).
|
|
\end{enumerate}
|
|
|
|
I spent some time thinking about whether I should allow the survfit routine to
|
|
bend the rules, and allow some of these.
|
|
First off, number 1 and 3 above are not negotiable; otherwise it becomes
|
|
impossible to clearly define the number at risk.
|
|
But perhaps there are use cases for 2 and/or 4?
|
|
|
|
One data example I have used in the past for the Cox
|
|
model is a subject on a research protocol, over 20 years ago now,
|
|
who was completely lost to follow-up for nearly 2 years and then reappeared.
|
|
Should they be counted as ``at risk'' in that interim?
|
|
I argue no, on the grounds that the were not at risk for an ``event that would
|
|
have been captured'' in our data set --- we would never have learned of a
|
|
disease progression.
|
|
Someone should not be in the denominator of the Cox model (or KM) at a
|
|
given time point if they cannot be in the numerator.
|
|
We decided to put them into the data set with a gap in their follow-up time.
|
|
|
|
But in the end I've decided no to bending the rules. The largest reason
|
|
is that in my own experience a data set that breaks any of 1--4 above
|
|
is almost universally the outcome of a programming mistake. The above example
|
|
is one of only 2 non-error cases I can think of, in nearly 40 years of
|
|
clinical research:
|
|
I \emph{want} the routine to complain. (Remember, I'd like the package to be
|
|
helpful to others, but I wrote the code for me.)
|
|
Second is that a gap can easily be accomodated by creating extra state which
|
|
plays the role of a waiting room.
|
|
The P(state) estimate for that new state is not interesting, but the
|
|
others will be identical to what we would have obtained by allowing a gap.
|
|
Instances of case 4 can often be solved by relabeling the states.
|
|
I cannot think of an actual use case.
|
|
|
|
\subsection{Counting the number at risk}
|
|
\index{surfit!number at risk}
|
|
For a model with $k$ states, counting the number at risk is fairly
|
|
straightforward, and results in a matrix with $k$ columns and
|
|
one row per time point.
|
|
Each element contains the number who are currently in that state and
|
|
not censored. Consider the simple data set shown in figure \ref{entrydata}.
|
|
|
|
\begin{figure}
|
|
<<entrydata, fig=TRUE, echo=FALSE>>=
|
|
mtest <- data.frame(id= c(1, 1, 1, 2, 3, 4, 4, 4, 5, 5, 5, 5),
|
|
t1= c(0, 4, 9, 0, 2, 0, 2, 8, 1, 3, 6, 8),
|
|
t2= c(4, 9, 10, 5, 9, 2, 8, 9, 3, 6, 8, 11),
|
|
st= c(1, 2, 1, 2, 3, 1, 3, 0, 2, 0,2, 0))
|
|
|
|
mtest$state <- factor(mtest$st, 0:3, c("censor", "a", "b", "c"))
|
|
|
|
temp <- survcheck(Surv(t1, t2, state) ~1, mtest, id=id)
|
|
plot(c(0,11), c(1,5.1), type='n', xlab="Time", ylab= "Subject")
|
|
with(mtest, segments(t1+.1, id, t2, id, lwd=2, col=as.numeric(temp$istate)))
|
|
event <- subset(mtest, state!='censor')
|
|
text(event$t2, event$id+.2, as.character(event$state))
|
|
@
|
|
\caption{A simple multi-state data set. Colors show the current state of
|
|
entry (black), a (red), b(green) or c (blue), letters show the occurrence
|
|
of an event.}
|
|
\label{entrydata}
|
|
\end{figure}
|
|
|
|
The obvious algorithm is simple: at any given time point draw a vertical line
|
|
and count the number in each state who intersect it.
|
|
At a given time $t$ the rule is that events happen first, then censor,
|
|
then entry. At time 2 subject 4 has an event and subject 3 enters:
|
|
the number at risk for the four states at time point 2 is (4, 0, 0, 0).
|
|
At time point $2+\epsilon$ it is (4, 1, 0, 1).
|
|
At time 9 subject 3 is still at risk.
|
|
|
|
The default is to report a line of output at each unique censoring and
|
|
event time, adding rows for the unique entry time is an optional argument.
|
|
Subject 5 has 4 intervals of (1,3b), (3,6+), (6,8b) and (8,11+):
|
|
time 6 is not reported as a censoring time, nor as an entry time.
|
|
Sometimes a data set will have been preprocessed by the user
|
|
with a tool like survSplit, resulting in hundreds of rows
|
|
for some subjects, most of which are `no event here', and there is no reason
|
|
to include all these intermediate times in the output.
|
|
We make use of an ancillary vector \code{position} which is 1 if this interval is
|
|
the leftmost of a subject sequence, 2 if rightmost, 3 if both.
|
|
If someone had a gap in followup we would treat their re-entry to the risk
|
|
sets as an entry, however; the survcheck routine will have already generated
|
|
an error in that case.
|
|
|
|
The code does its summation starting at the largest time and moving left,
|
|
adding and subtracting from the number at risk (by state) as it goes.
|
|
In most data sets the number at risk decreases over time, going from right
|
|
to left results in more additions than subtractions and thus a smidge less
|
|
potential roundoff error.
|
|
Since it is possible for a subject's intervals to have different case
|
|
weights, the number at risk \emph{does} need to be updated at time 6 for
|
|
subject 5, though that time point is not in the output.
|
|
|
|
In the example outcome b can be a repeated event, e.g., something like repeated
|
|
infections. At time 8 the cumulative hazard for event b will change, but
|
|
the proability in state will not.
|
|
It would be quite unusual to have a data set with some repeated states and
|
|
others not, but the code allows it.
|
|
|
|
|
|
Consider the simple illness-death model in figure \ref{illdeath}.
|
|
There are 3 states and 4 transtions in the model. The number at risk (n.risk),
|
|
number censored (n.censor), probability in state (pstate) and number of
|
|
entries (n.enter) comonents of the survfit call will have 3 columns, one
|
|
per state.
|
|
The number of events (n.event) and cumulative hazard (cumhaz) components
|
|
will have 4 columns, one per transition, labeled as 1.2, 1.3, 2.1, and 2.3.
|
|
In a matrix of counts with row= starting state and column = ending state,
|
|
this is the order in which the non-zero elements appear.
|
|
The order of the transitions is yet another consequence of the
|
|
fact that R stores matrices in column major order.
|
|
|
|
There are two tasks for which the standard output is insufficient: drawing the
|
|
curves and computing the area under the curve.
|
|
For both these we need to know $t0$, the starting point for the curves.
|
|
There are 4 cases:
|
|
\begin{enumerate}
|
|
\item This was specified by the user using \code{start.time}
|
|
\item All subjects start at the same time
|
|
\item All subjects start in the same state
|
|
\item Staggered entry in diverse states
|
|
\end{enumerate}
|
|
For case 1 use what the user specified, of course.
|
|
For (time, status) data, i.e. competing risks use the same rule as for simple
|
|
survival, which is min(time, 0). Curves are assumed to start at 0 unless there
|
|
are negative time values.
|
|
Otherwise for case 2 and 3 use min(time1), the first time point found; in many
|
|
cases this will be 0 as well.
|
|
|
|
The hardest case most often arises for data that is on age scale where there
|
|
may be a smattering of subjects at early ages.
|
|
Imagine that we had subjects at ages 19, 23, 28, another dozen from 30-40, and
|
|
the first transition at age 41, 3 states A, B, C, and the user did not specify
|
|
a starting time.
|
|
Suppose we start at age=19 and that subject is in state B.
|
|
Then p(19)= c(0,1,0), and AJ p(t) estimate will remain (0,1,0) until there
|
|
is a B:A or B:C transition, no matter what the overall prevalence of inital
|
|
states as enrollment progresses. Such an outcome is not useful.
|
|
The default for case 4 is to use the smallest event time at time 0, with intial
|
|
prevalence the distribution of state for those observations which are at risk
|
|
at that time. The prevalence at the starting time is saved in the fit object
|
|
as p0.
|
|
The best choice for case 4 is a user specified time, however.
|
|
|
|
Since we do not have an estimate of $p$= probability in state before time 0,
|
|
the output will not contain any rows before that time point.
|
|
Note that if there is an event exactly at the starting time: a death at time
|
|
0 for competing risks, a transition at start.time, or case 4 above,
|
|
that the first row of the pstate matrix will not be the same as p0.
|
|
The output never has repeats of the same time value (for any given curve).
|
|
One side effect of this is that a plotted curve never begins with a vertical
|
|
drop.
|
|
|
|
\section{Influence}
|
|
\index{Andersen-Gill!influence}
|
|
\index{multi-state!influence}
|
|
Let $C(t)$ be the influence matrix for the cumulative hazard $\Lambda(t)$ and
|
|
$U(t)$ that for the probability in state $p(t)$.
|
|
In a multistate model with $s$ states there are $s^2$ potential transitions,
|
|
but only a few are observed in any given data set.
|
|
The code returns the estimated cumulative hazard $\Lhat$ as a matrix with
|
|
one column for each $j:k$ transition that is actually observed, the same
|
|
will be true for $C$.
|
|
We will slightly abuse the usual subscript notation; read
|
|
$C_{ijk}$ as the $i$th row and the $j:k$ transition (column).
|
|
Remember that
|
|
a transition from state $j$ to the same state can occur with multiple events of
|
|
the same type.
|
|
Let
|
|
\begin{equation*}
|
|
\Ybar_j(t) = \sum_i Y_{ij}(t)
|
|
\end{equation*}
|
|
be the number at risk in state $j$ at time $t$. Then
|
|
|
|
\begin{align}
|
|
C_i(t) &= \frac{\partial \Lambda(t)}{\partial w_i} \nonumber \\
|
|
&= C_i(t-) + \frac{\partial \lambda(t)}{\partial w_i} \nonumber\\
|
|
\lambda_{jk}(t) & = \frac{\sum w_i dN_{ijk}}{\Ybar_j(t)} \label{ihaz1}\\
|
|
\frac{\partial \lambda_{jk}(t)}{\partial w_i} &=
|
|
\frac{dN_{ijk}(t)- Y_{ij}(t)\lambda_{jk}(t)}{\Ybar_j(t)}
|
|
\label{hazlev} \\
|
|
V_c(t) &= \sum_i [w_i C_i(t)]^2 \label{varhaz}
|
|
\end{align}
|
|
At any time $t$, formula \eqref{hazlev} captures the fact that an observation has
|
|
influence only on potential transitions from its current state at time $t$,
|
|
and only over the (time1, time2) interval it spans.
|
|
At any given event time, each non-zero $h_{jk}$ at that time will add an
|
|
increment to only those rows of $C$ representing observations currently
|
|
at risk and in state $j$. The IJ variance is the weighted sum of
|
|
squares. (For replication weights, where w=2 means two actual observations,
|
|
the weight is outside the brackets. We will only deal with sampling weights.)
|
|
|
|
The (weighted) grouped estimates are
|
|
\begin{align}
|
|
C_{gjk}(t) &= \sum_{i \in g} w_i C_{ijk}(t) \nonumber \\
|
|
&= C_{gjk}(t-) + \sum_{i \in g} w_i
|
|
\frac{dN_{ijk}(t) - Y_{ij}(t)lambda_{jk}(t)}{\Ybar_j(t)} \label{hazlev2} \\
|
|
&=C_{gjk}(t-) +\sum_{i \in g} w_i\frac{dN_{ijk}(t)}{n_j(t)} -
|
|
\left(\sum_{i \in g}Y_{ij}(t)w_i \right) \frac{\lambda_{jk}(t)}{\Ybar_j(t)}
|
|
\label{hazlev2b} \\
|
|
\end{align}
|
|
A $j:k$ event at time $t$ adds only to the $jk$ column of $C$.
|
|
The last line above \eqref{hazlev2b} spits out the $dN$ portion, which will
|
|
normally be 1 or at most a few events at this time point, from the second,
|
|
which is identical for subjects at risk in group $g$.
|
|
This allows us to split out the sum of weights.
|
|
Let $w_{gj}(t)$ be the sum of weights by group and state, which is kept in
|
|
a matrix of the same form as $U$ and can be efficiently updated.
|
|
Each row of the grouped $C$ is computed with the same effort as a row of
|
|
the ungrouped matrix.
|
|
|
|
Let $C$ be the per observation influence matrix, $D$ a diagonal matrix
|
|
of per-observation weights, and $B$ the $n$ by $g$ 01/ design matrix that
|
|
collapses observations by groups.
|
|
For instance, \code{B = model.matrix(~ factor(grp) -1)}.
|
|
The the weighted and collapsed influence is $V= B'DC$, and the infinitesimal
|
|
jackknife variances are the column sums of the squared elements of V,
|
|
\code{colSums(V*V)}. A feature of the IJ is that the column sums of $DC$ and of
|
|
$B'DC$ are zero.
|
|
In the special case where each subject is a group and the weight for a
|
|
subject is constant over time, then $B'D = WB'$ where $W$ is a diagonal matrix
|
|
of per-subject weights, i.e, one can collapse and then weight rather than
|
|
weight and then collapse.
|
|
The survfitci.c routine (now replaced by survfitaj.c) made this assumption,
|
|
but unfortunately the code had no test cases with disparate weights within
|
|
a group and so the error was not caught for several years.
|
|
There is now a test case.
|
|
|
|
For $C$ above, where each row is a simple sum over event times,
|
|
formula \eqref{hazlev2b} essentially does the scale + sum step at each event
|
|
time and so only needs to keep one row per group.
|
|
The parent routine will warn if any subject id is
|
|
in two different groups.
|
|
Doing otherwise makes no statistical sense, IMHO, so any instance if this
|
|
is almost certainly an error in the data setup.
|
|
However, someone may one day find a counterexample, hence a warning rather than
|
|
an error.
|
|
|
|
For the probability in state $p(t)$,
|
|
if the starting estimate $p(0)$ is provided by the user, then $U(0) =0$. If
|
|
not, $p(0)$ is defined as the distribution of states among those at risk at
|
|
the first event time $\tau$.
|
|
\begin{align*}
|
|
p_j &= \frac{\sum w_i Y_{ij}(\tau)}{\sum w_i Y_i(\tau)} \\
|
|
\frac{\partial p_j}{\partial w_k} &=
|
|
\frac{Y_{ij}(\tau) - Y_i(\tau)p_j}{\sum w_i Y_i(\tau)}
|
|
\end{align*}
|
|
Assume 4 observations at risk at the starting point, with weights of
|
|
(1, 4, 6, 9) in states (a, b, a, c), respectively.
|
|
Then $p(0) = (7/20, 4/20, 9/20)$ and the unweighted $U$ matrix for those
|
|
observations is
|
|
\begin{equation*}
|
|
U(0) = \left( \begin{array}{ccc}
|
|
(1- 7/20)/20 & (0 - 4/20)/20 & (0- 9/20)/20 \\
|
|
(0- 7/20)/20 & (1 - 4/20)/20 & (0- 9/20)/20 \\
|
|
(1- 7/20)/20 & (0 - 4/20)/20 & (0- 9/20)/20 \\
|
|
(0- 7/20)/20 & (0- 4/20)/20 & (1- 9/20)/20
|
|
\end{array} \right)
|
|
\end{equation*}
|
|
with rows of 0 for all observations not at risk at the starting time.
|
|
Weighted column sums are $wU =0$.
|
|
|
|
The AJ estimate of the probablity in state vector $p(t)$ is
|
|
defined by the recursive formula $p(t)= p(t-)H(t)$.
|
|
Remember that the derivative of a matrix product $AB$ is $d(A)B + Ad(B)$ where
|
|
$d(A)$ is the elementwise derivative of $A$ and similarly for $B$.
|
|
(Write out each element of the matrix product.)
|
|
Then $i$th row of U satisfies
|
|
\begin{align}
|
|
U_i(t) &= \frac{\partial p(t)}{\partial w_i} \nonumber \\
|
|
&= \frac{\partial p(t-)}{\partial w_i} H(t) +
|
|
p(t-) \frac{\partial H(t)}{\partial w_i} \nonumber \\
|
|
&= U_i(t-) H(t) + p(t-) \frac{\partial H(t)}{\partial w_i}
|
|
\label{ajci}
|
|
\end{align}
|
|
The first term of \ref{ajci} collapses to ordinary matrix multiplication,
|
|
the second to a sparse multiplication.
|
|
Consider the second term for any chosen observation $i$, which is in state $j$.
|
|
This observation appears only in row $j$ of $H(t)$, and thus $dH$ is zero
|
|
for all other rows of $H(t)$ by definition. If observation $i$ is not at risk
|
|
at time $t$ then $dH$ is zero.
|
|
The derviative vector thus collapses an elementwise multiplication of
|
|
$p(t-)$ and the appropriate row of $dH$, or 0 for those not at risk.
|
|
|
|
Let $Q(t)$ be the matrix containing $p(t-) Y_i(t)dH_{j(i).}(t)/dw_i$ as its
|
|
$i$th row, $j(i)$ the state for row $j$.
|
|
Then
|
|
\begin{align*}
|
|
U(t_2) &= U(t_1)H(t_2) + Q(t_2) \\
|
|
U(t_3) &= \left(U(t_1)H(t_2) + Q(t_2) \right)H(t_3) + Q(t_3) \\
|
|
\vdots &= \vdots
|
|
\end{align*}
|
|
Can we collapse this in the same way as $C$, retaining only one row of $U$
|
|
per group containing the weighted sum? The equations above are more complex
|
|
due to matrix multiplication. The answer is yes,
|
|
because the weight+collapse operation can be viewed as multiplication by a
|
|
design matrix $B$ on the
|
|
left, and matrix multiplication is associative and additive.
|
|
That is $B(U + Q) = BU + BQ$ and $B(UH) = (BU)H$.
|
|
However, we cannot do the rearrangment that was used in the single endpoint
|
|
case to turn this into a sum, equation \eqref{dsurv2}, as matrix multiplication
|
|
is not commutative.
|
|
|
|
For the grouped IJ, we have
|
|
\begin{align*}
|
|
U_{g}(t) &= \sum_{i \in g} w_i U_{i}(t) \\
|
|
&= U_{g}(t-)H(t) + \sum_{i \in g}w_i Y_{ij}(t)
|
|
\frac{\partial H(t)}{\partial w_i}
|
|
\end{align*}
|
|
|
|
The above has the potential to be a very slow computation. If $U$ has
|
|
$g$ rows and $p$ columns, at each event time there is a matrix multiplication
|
|
$UH$, creation of the $H$ and $H'$ matrices,
|
|
addition of the correct row of $H'$ to each row of $U$,
|
|
and finally adding up the sqared elements of $U$ to get a variance.
|
|
This gives $O(d(gp^2 + 2p^2 + gp + gp))$.
|
|
Due to the matrix multiplication, every element of $U$ is modified at
|
|
every event time, so there is no escaping the final $O(gp)$ sum of
|
|
squares for each column. The primary gain is to make $g$ small.
|
|
|
|
At each event time $t$ the leverage for the AUC must be updated by
|
|
$\delta U(t-)$, where $\delta$ is the amount of time between this event and
|
|
the last. At each reporting which does not coincide with an event time,
|
|
there is a further update with $\delta$ the time between the last event and
|
|
the reporting time, before the sum of squares is computed.
|
|
The AUC at the left endpoint of the curve is 0 and likewise the leverage.
|
|
If a reporting time and event time coincide, do the AUC first.
|
|
|
|
At each event time
|
|
\begin{enumerate}
|
|
\item Update $w_{gj}(\tau) = \sum_{i \in g} Y_{ij}(\tau) w_i$, the number
|
|
currently at risk in each group.
|
|
\item Loop across the events ($dN_{ijk}(\tau)=1$). Update $C$ using equation
|
|
\eqref{e1} and create the transformation matrix $H$.
|
|
\item Compute $U(\tau) = U(\tau-)H$ using sparse methods.
|
|
\item Loop a second time over the $dN_{ijk}(\tau)$ terms, and for all those
|
|
with $j\ne k$ apply \eqref{e2} and\eqref{e3}.
|
|
\item For each type of transition $jk$ at this time, apply equation
|
|
\eqref{e4}, and if $j \ne k$, \eqref{e5} and \eqref{e6}.
|
|
\item Update the variance estimates, which are column sums of squared elements
|
|
of C, U, and AUC leverage.
|
|
\end{enumerate}
|
|
|
|
\begin{align}
|
|
\lambda_{jk}(\tau) &= \frac{\sum_i dN_{ijk}(\tau)}{\sum_i w_i Y_{ij}(\tau)}
|
|
\nonumber \\
|
|
C_{g(i)jk}(\tau) &= C_{gjk}(\tau-) + w_{g(i)}\frac{dN_{ijk}(\tau)}
|
|
{\sum_i w_i Y_{ij}(\tau)} \label{e1} \\
|
|
U_{g(i)j}(\tau) &=U_{gj}(\tau-) - w_{g(i)}\frac{dN_{ijk}(\tau) p_j(\tau-)}
|
|
{\sum_i w_i Y_{ij}(\tau)} \label{e2} \\
|
|
U_{g(i)k}(\tau) &=U_{gk}(\tau-) + w_{g(i)}\frac{dN_{ijk}(\tau) p_j(\tau-)}
|
|
{\sum_i w_i Y_{ij}(\tau)} \label{e3} \\
|
|
C_{ghk}(\tau) &= C_{gjk}(\tau-) - w_{gj}(\tau)\frac{\lambda_{jk}(\tau)}
|
|
{\sum_i w_i Y_{ij}(\tau)} \label{e4} \\
|
|
U_{gj}(\tau) &=U_{gj}(\tau-) + w_{gj}(\tau)\frac{\lambda_{jk}(\tau) p_j(\tau-)}
|
|
{\sum_i w_i Y_{ij}(\tau)} \label{e5} \\
|
|
U_{gk}(\tau) &=U_{gk}(\tau-) - w_{gj}(\tau)\frac{\lambda_{jk}(\tau) p_j(\tau-)}
|
|
{\sum_i w_i Y_{ij}(\tau)} \label{e6}
|
|
\end{align}
|
|
|
|
In equations \eqref{e1}--\eqref{e3} above $g(i)$ is the group to which
|
|
observed event $dN_i(\tau)$ belongs.
|
|
At any reporting time there is often 1 or at most a handful of events.
|
|
In equations \eqref{e4}--\eqref{e6} there is an update for each row of
|
|
$C$ and $U$, each row a group.
|
|
Most often, the routine will be called with the default where each unique
|
|
subject is their own group:
|
|
if only 10\% of the subjects are in group $j$ at time $\tau$ then
|
|
90\% of the $w_{gj}$ weights will be 0.
|
|
It may be slightly faster to test for $w$ positive before multiplying by 0?
|
|
|
|
For the `sparse multiplication' above we employ a simple approach: if only a
single transition type $j:k$ occurs at this event time, so that $H-I$ has
non-zero elements in only a single row, then $U$ can be updated in place.
For the NAFLD example in the survival vignette, for instance, this is true for
all but 161 of the 6108 unique event times (about 3\%). For the remainder of
the event times it suffices to create a temporary copy of $U(t-)$.

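A sketch of the in-place update, with illustrative variable names: when row
$j$ is the only non-zero row of $H-I$, the product $U(H-I)$ is the outer
product of column $j$ of $U$ with that row, so only one column of $U(t-)$
needs to be saved.
<<sparseH, eval=FALSE>>=
hrow <- H[j, ]; hrow[j] <- hrow[j] - 1    # row j of H - I
uj <- U[, j]                              # save column j of U(t-)
U  <- U + outer(uj, hrow)                 # same result as U %*% H
@
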
\subsection{Residuals}
\index{Aalen-Johansen!residuals}
The \code{residuals.survfit} function will return the IJ values at selected
times, and is the more useful interface for a user.
One reason is that since the user specifies the times, the result can be
returned as an array with (observation, state, time) as the dimensions, even
for a fit that has multiple curves.
The full influence matrices from the \code{survfit} routine, on the other
hand, are returned as a list with one element per curve. The reason is that
each curve might have a different set of event times.

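A sketch of a typical call for a multi-state fit \code{mfit}; the argument
names here are my shorthand, see the \code{residuals.survfit} help page for
the definitive list.
<<residcall, eval=FALSE>>=
# IJ residuals for the probability in state, at three chosen times;
# the result is an (observation, state, time) array
rr <- residuals(mfit, times = c(120, 240, 360), type = "pstate")
dim(rr)
@
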
Because only a few times are reported, an important question is whether a
more efficient algorithm can be created.
For the cumulative hazard, each transition is a separate computation, so we
can adapt the prior fast computation.
We now have separate hazards for each observed transition $jk$, with numerator
in the n.transitions matrix and denominator in the n.risk matrix.
An observation in state $j$ is at risk for all $j:k$ transitions for the entire
(time1, time2) interval; the code can use a single set of yindex, tindex,
sindex and dindex indices.
The largest change is to iterate over the transitions; a 0/1 indicator for
each $j:k$ transition replaces the simple status variable.

The code for this section is a bit opaque, so here is some further explanation.
Any given observation is in a single state over the (time1, time2) interval
for that transition and accumulates the $-\lambda_{jk}(t)/n_j(t)$ portion of
the IJ for all $j:k$ transitions that it overlaps, but only those that occur
before the chosen reporting time $\tau$.
Create a matrix hsum with one column for each observed $jk$ transition and one
row for each event time in the survfit object; each column contains the
cumulative sum of $-\lambda_{jk}(t)/n_j(t)$ for that transition.

To index the hsum matrix create a matrix ymin which has a row per observation
and a column per reporting time (so it is the same size as the desired output),
whose $(i,j)$ element contains the index of the largest event time which is
$\le$ min(time2[i], $\tau_j$), that is, the last row of hsum that would apply
to this (observation time, reporting time) pair. Likewise let smin be a
matrix which points to the largest event time which does \emph{not} apply.
The ymin and smin matrices apply to all the transitions; they are created
using the outer and pmin functions.
The desired $d\lambda$ portion of the residual for the $jk$ transition will be
\code{hsum[smin, jk] - hsum[ymin, jk]}, where $jk$ is a shorthand for the
column of hsum that encodes the $j:k$ transition. A similar sort of trickery
is used for the $dN_{jk}$ part of the residual.
Matrix subscripts could be used to make it slightly faster, at the price of
further impenetrability.

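A minimal sketch of how the ymin matrix might be built, with invented names:
etime for the sorted event times of the survfit object, rtime for the vector
of reporting times, and time2 for the ending times of the observations; the
boundary case where no event time precedes the minimum is ignored.
<<yminsketch, eval=FALSE>>=
temp <- outer(time2, rtime, pmin)            # min(time2[i], tau_j)
ymin <- matrix(findInterval(temp, etime),    # index of largest etime <= temp
               nrow = length(time2))
@
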
Efficient computation of residuals for the probability in state is more
difficult. The key issue is that at
every event time there is a matrix multiplication $U(t-)H(t)$ which visits
all $n$ by $p$ elements of $U$.
For illustration assume 3 event times at 1, 2, 3, and a user who only wants to
report the leverage at time 3. Let $B$ be the additions at each time point.
Then
\begin{align*}
U(1) &= U(0)H(1) + B(1) \\
U(2) &= U(1)H(2) + B(2) \\
U(3) &= U(2)H(3) + B(3) \\
     &= \left([U(0)H(1) + B(1)] H(2) + B(2)\right) H(3) + B(3)\\
     &= U(0)[H(1)H(2)H(3)] + B(1)[H(2)H(3)] + B(2)H(3) + B(3)
\end{align*}

The first three rows are the standard iteration. The $U$ matrix will quickly
become dense, since any $j:k$ transition adds to every at-risk row in state
$j$.
Reporting back at only a few time points does not change the computational
burden of the matrix multiplications.
We cannot reuse the logarithm trick from the KM residuals, as
$\log(AB) \ne \log(A) + \log(B)$ when $A$ and $B$ are matrices.

The expansion in the last row above shows another way to arrange the
computation.
For any time $t$ with only a $j:k$ event, $B(t)$ is non-zero only in columns
$j$ and $k$, and only for rows with $Y_j(t) =1$,
those at risk and in state $j$ at time $t$.
For example, assume that $d_{50}$ is the largest death time less than or equal
to the first reporting time.
Accumulate the product in reverse order as $B(50) + B(49)H(50) +
B(48)[H(49)H(50)], \ldots$, updating the product of $H$ terms at each step.
Most updates to $H$ involve a sparse matrix (only one event) so are $O(s^2)$
where $s$ is the number of states. Each $B$ multiplication is also sparse.
We can then step ahead to the next reporting time using the same algorithm,
along with a final matrix update to carry forward $U(50)$.
At the end there will have been $d$ (the number of event times) matrix
multiplications of a sparse update $H(t)$ times the dense product of later $H$
matrices, and each row $i$ of $U$ will have been ``touched''
once for every death time $k$ such that $Y_i(d_k) =1$ (i.e., each $B$ term
that it is part of), and once more for each prior reporting time.

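A schematic version of the reverse accumulation just described, with invented
names: Hlist and Blist hold the per event time $H$ and $B$ matrices, U0 is
$U(0)$, nstate the number of states, and nd the index of the last event time
at or before the reporting time.
<<revsketch, eval=FALSE>>=
P <- diag(nstate)                    # running product of the later H matrices
U <- Blist[[nd]]                     # B(nd)
for (i in rev(seq_len(nd - 1))) {
    P <- Hlist[[i + 1]] %*% P        # now P = H(i+1) H(i+2) ... H(nd)
    U <- U + Blist[[i]] %*% P        # add B(i)[H(i+1) ... H(nd)]
}
U <- U + U0 %*% (Hlist[[1]] %*% P)   # the U(0)[H(1) ... H(nd)] term
@
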
I have not explored whether this idea will work in practice, nor do I see how
to extend it to the AUC computation.

<<competecheck, echo=FALSE>>=
m2 <- mgus2
m2$etime <- with(m2, ifelse(pstat==0, futime, ptime))
m2$event <- with(m2, ifelse(pstat==0, 2*death, 1))
m2$event <- factor(m2$event, 0:2, c('censor', 'pcm', 'death'))
m2$id <- 1:nrow(m2)

# 30 year reporting horizon (360 months)
mfit <- survfit(Surv(etime, event) ~1, m2, id=id)

y3 <- (mfit$time <= 360 & rowSums(mfit$n.event) >0) # rows of mfit, of interest
etot <- sum(mfit$n.event[y3,])
nrisk <- mean(mfit$n.risk[y3,1])
@

As a first example consider the mgus2 data set,
a case with two competing risks, death and plasma cell malignancy (PCM),
and use reporting times of 10, 20, and 30 years.
There are $n=$ \Sexpr{nrow(m2)} observations, \Sexpr{etot} events and
\Sexpr{sum(y3)} unique event times before 30 years, with an average of
$r=$ \Sexpr{round(nrisk,1)} subjects at risk at each of these event times.
As part of data blinding, follow-up times in the MGUS data set were rounded to
months, and as a consequence there are very few singleton event times.

Both the $H$ matrix and products of $H$ remain sparse for any row corresponding
to an absorbing state (a single 1 on the diagonal), so for a competing risks
problem the $UH$ multiplication requires $O(3n)$ operations, while those for
the nested algorithm require $O(3r)$, where $r$ is the average number at risk.
In this case the improvement for the nested algorithm is modest, and comes at
an additional price of $9d$ multiplications to accumulate $H$.

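For concreteness, with the entry state as state 1 and pcm and death as
absorbing states 2 and 3, say, the update matrix at an event time has the form
\[
H(t) = \begin{pmatrix}
   1 - \lambda_{12}(t) - \lambda_{13}(t) & \lambda_{12}(t) & \lambda_{13}(t) \\
   0 & 1 & 0 \\
   0 & 0 & 1 \end{pmatrix},
\]
so only the first row differs from the identity.
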
<<checknafld>>=
ndata <- tmerge(nafld1[,1:8], nafld1, id=id, death= event(futime, status))
ndata <- tmerge(ndata, subset(nafld3, event=="nafld"), id,
                nafld= tdc(days))
ndata <- tmerge(ndata, subset(nafld3, event=="diabetes"), id = id,
                diabetes = tdc(days), e1= cumevent(days))
ndata <- tmerge(ndata, subset(nafld3, event=="htn"), id = id,
                htn = tdc(days), e2 = cumevent(days))
ndata <- tmerge(ndata, subset(nafld3, event=="dyslipidemia"), id=id,
                lipid = tdc(days), e3= cumevent(days))
ndata <- tmerge(ndata, subset(nafld3, event %in% c("diabetes", "htn",
                                                   "dyslipidemia")),
                id=id, comorbid= cumevent(days))
ndata$cstate <- with(ndata, factor(diabetes + htn + lipid, 0:3,
                                   c("0mc", "1mc", "2mc", "3mc")))
temp <- with(ndata, ifelse(death, 4, comorbid))
ndata$event <- factor(temp, 0:4,
                      c("censored", "1mc", "2mc", "3mc", "death"))
ndata$age1 <- ndata$age + ndata$tstart/365.25 # analysis on age scale
ndata$age2 <- ndata$age + ndata$tstop/365.25

ndata2 <- subset(ndata, age2 > 50 & age1 < 90)
nfit <- survfit(Surv(age1, age2, event) ~1, ndata, id=id, start.time=50,
                p0=c(1,0,0,0,0), istate=cstate, se.fit = FALSE)
netime <- (nfit$time <=90 & rowSums(nfit$n.event) > 0)
# the number at risk at any time is all those in the initial state of a
#  transition at that time
from <- as.numeric(sub("\\.[0-9]*", "", colnames(nfit$n.transition)))
fmat <- model.matrix(~ factor(from, 1:5) -1)
temp <- (nfit$n.transition %*% fmat) >0 # TRUE for any transition 'from' state
frisk <- (nfit$n.risk * ifelse(temp,1, 0))
nrisk <- rowSums(frisk[netime,])
maxrisk <- apply(frisk[netime,],2,max)
@

As a second case consider the NAFLD data used as an example in the survival
vignette, a 5 state model with death and 0, 1, 2, or 3 metabolic comorbidities
as the states.
For a survival curve on the age scale, using age 50 as a starting point and
computing the influence at age 90, the data set has \Sexpr{nrow(ndata2)} rows
that overlap the age 50--90 interval, while the
average risk set size is $r=$ \Sexpr{round(mean(nrisk))}, less than 12\% of
the data rows.
There are \Sexpr{sum(netime)} unique event times between 50 and 90, of
which \Sexpr{sum(rowSums(nfit$n.transition[netime,]) ==1)} have only a single
event type; the remainder have two.

In this case the first algorithm can take advantage of sparse $H$ matrices and
the second of sparse $B$ matrices.
The big gain comes from the rows of $B$ that are 0.
To take maximal advantage of this we can keep an updated vector of
the indices of the rows currently at risk; see the section on skiplists below.
Note: this has not yet been implemented.

For the AUC we can use a rearrangement. Below we write out the leverage on
the AUC at time $d_4$, the fourth event, with $w_i= d_{i+1}-d_i$ the width
of the adjacent rectangles that make up the area under the curve.

\begin{align*}
A(d_4) &= U(0)\left[w_0 + H(d_1) w_1 + H(d_1)H(d_2)w_2 +
          H(d_1)H(d_2)H(d_3)w_3\right]+ \\
&\phantom{=} B(d_1)\left[w_1 + H(d_2)w_2 + H(d_2)H(d_3)w_3\right] + \\
&\phantom{=} B(d_2)\left[w_2 + H(d_3) w_3 \right] + B(d_3)w_3
\end{align*}

For $U$ the matrix sequence was $I$, $H_3I$, $H_2H_3I$, \ldots; it now starts
with $w_3I$, at the next step we multiply the prior matrix by $H_3$ and add
$w_2I$, at the third we multiply the prior matrix by $H_2$ and add $w_1I$, etc.

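A short sketch of this recursion for the $d_4$ example above, with invented
names: wd[i] holds $w_i$, Hlist[[i]] holds $H(d_i)$, and nstate is the number
of states.  After the step with index $i$, M is the multiplier of $B(d_i)$.
<<aucrev, eval=FALSE>>=
M <- wd[3] * diag(nstate)                          # w_3 I
for (i in 2:1)
    M <- Hlist[[i + 1]] %*% M + wd[i] * diag(nstate)
@
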
\section{Skiplists}
\index{skiplist}
An efficient structure for a continually updated index to the rows currently
at risk appears to be a skiplist; see table II of Pugh
\cite{Pugh90}. Each observation is added (at time2) and deleted (at
time1) just once from the risk set, so we have $n$ additions and $n$ deletions,
which are faster for skip lists than binary trees. Reading out the entire list,
which is done at each event time, has the same cost for each structure.
Searching is faster for trees, but we don't need that operation.
If there are $k$ states we keep a separate skiplist containing those at
risk for each state.

When used in the residuals function,
we can modify the usual skiplist algorithm to take advantage of two things:
we know the maximum size of the list from the n.risk component,
and the additions are nearly in random order.
The latter follows from the fact that most data sets are in subject id order,
so that ordering by an observation's ending time skips wildly across data rows.
For the nafld data example above, the next figure shows the row number of the
first 200 additions to the risk set, when going from largest time to smallest.
For the example given above, the maximum number at risk in each of the 5
states is (\Sexpr{paste(apply(nfit$n.risk, 2, max), collapse=', ')}).

<<skiplist1, fig=TRUE>>=
sort2 <- order(ndata$age2)
plot(1:200, rev(sort2)[1:200], xlab="Addition to the list",
     ylab="Row index of the addition")
@

Using $p=4$ as recommended by Pugh \cite{Pugh90} leads to
$\log_4(1420) \approx 5$
linked lists; 1420 is the maximum number of the $n=$ \Sexpr{nrow(ndata)} rows
at risk in any one state at any one time.
The first is a forward linked list containing all those currently at risk,
the second has 1/4 of the elements, the third 1/16, the fourth 1/64, and the
fifth 1/256.
More precisely, 3/4 of the nodes have a single forward pointer, 3/16 have 2
forward pointers corresponding to levels 1 and 2, 3/64 have 3 forward pointers,
etc. All told there are 4/3 as many pointers as a singly linked list, a modest
memory cost.
To add a new element to the list, first find the point of insertion in the top
list, then in each lower list in turn down to list 1; each of these requires
on average 2 forward steps. Randomly choose the number of levels in which the
new entry will participate, and insert it.

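The random choice of levels can be made with a single geometric draw; a one
line sketch, where maxlevel (an invented name) is the number of lists, 5 here.
<<skiplevel, eval=FALSE>>=
# P(height = h) = (3/4)(1/4)^(h-1), truncated at the number of levels
height <- min(1 + rgeom(1, prob = 3/4), maxlevel)
@
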
\begin{figure}
<<skiplist2, fig=TRUE, echo=FALSE>>=
# simulate a skiplist with period 3
set.seed(1953)
n <- 30
yval <- sort(sample(1:200, n))
phat <- c(2/3, 2/9, 2/27)
y.ht <- rep(c(1,1,2,1,1,1,1,3,1,1,2,1,1,1,2,1), length=n)
plot(yval, rep(1,n), ylim=c(1,3), xlab="Data", ylab="")
indx <- which(y.ht > 1)
segments(yval[indx], rep(1, length(indx)), yval[indx], y.ht[indx])

y1 <- yval[indx]
y2 <- yval[y.ht==3]
lines(c(0, y2[1], y2[1], y1[5], y1[5], max(yval[yval < 100])),
      c(3,3, 2,2,1,1), lwd=2, col=2, lty=3)
@
\caption{A simplified skiplist with 30 elements and 3 levels, showing the
  path used to insert a new element with value 100.}
\label{skiplist2}
\end{figure}

Figure \ref{skiplist2} shows a simplified list with 30 elements and 3 levels,
along with the search path for adding a new entry with value 100.
The search starts at the head (shown here at x=0), takes one step at level 3
to find the largest element $< 100$ at that level, then 3 steps at level 2,
then one more at level 1 to find the insertion point, as compared to
20 steps using only level 1.
Since the height of an inserted node is chosen randomly there is always the
possibility of a long final search on level 1, but even a string of 10 is
unlikely: $(3/4)^{10} < .06$.

\printindex
\bibliographystyle{plain}
\bibliography{refer}
\end{document}