\documentclass[11pt]{article}
|
|
\usepackage[pdftex]{graphicx}
|
|
\usepackage{Sweave}
|
|
\usepackage{amsmath}
|
|
\addtolength{\textwidth}{1in}
|
|
\addtolength{\oddsidemargin}{-.5in}
|
|
\setlength{\evensidemargin}{\oddsidemargin}
|
|
|
|
\SweaveOpts{keep.source=TRUE, fig=FALSE}
|
|
%\VignetteIndexEntry{Introduction to Rpart}
|
|
%\VignetteDepends{rpart}
|
|
%\VignetteDepends{survival}
|
|
% Ross Ihaka suggestions
|
|
\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em}
|
|
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
|
|
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
|
|
\fvset{listparameters={\setlength{\topsep}{0pt}}}
|
|
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
|
|
|
|
\SweaveOpts{engine=R,eps=FALSE,pdf=TRUE, width=7, height=4.5}
|
|
\newcommand{\myfig}[1]{\resizebox{\textwidth}{!}
|
|
{\includegraphics{#1.pdf}}}
|
|
\def\tree{\texttt{tree}}
|
|
\def\rpart{\texttt{rpart}}
|
|
\def\splus{S-Plus}
|
|
\newcommand{\Co}[1]{\texttt{#1}}
|
|
|
|
\title {An Introduction to Recursive Partitioning\\
|
|
Using the RPART Routines}
|
|
\author{Terry M. Therneau \\
|
|
Elizabeth J. Atkinson\\
|
|
Mayo Foundation }
|
|
\date{\today}
|
|
|
|
\begin{document}
|
|
\maketitle
|
|
\tableofcontents
|
|
<<echo=FALSE>>=
|
|
options(continue = " ", width = 60)
|
|
options(SweaveHooks=list(fig=function() par(mar = c(4.1, 4.1, 0.1, 1.1))))
|
|
pdf.options(pointsize = 10)
|
|
par(xpd = NA) #stop clipping
|
|
library(rpart)
|
|
@
|
|
|
|
\section{Introduction}
|
|
This document is a modification of a technical report from the
|
|
Mayo Clinic Division of Biostatistics \cite{Therneau97},
|
|
which was itself an expansion of an earlier Stanford report
|
|
\cite{Therneau83}.
|
|
It is intended to give a short overview of
|
|
the methods found in the {\rpart} routines, which implement many of the
|
|
ideas found in the CART (Classification and Regression Trees) book and
|
|
programs of Breiman, Friedman, Olshen and Stone \cite{Breiman83}.
|
|
Because CART is the trademarked name of a particular software
|
|
implementation of these ideas, and \emph{tree} has been used
|
|
for the {\splus} routines of Clark and Pregibon~\cite{Clark92},
|
|
a different acronym
|
|
--- Recursive PARTitioning or rpart --- was chosen.
|
|
It is somewhat humorous that this label ``rpart'' has now become more
|
|
common than the original and more descriptive ``cart'',
|
|
a testament to the
|
|
influence of freely available software.
|
|
|
|
The {\rpart} programs build classification or regression models of a very
|
|
general structure using a two-stage procedure; the resulting models can
|
|
be represented as binary trees. An example is some preliminary data
|
|
gathered at Stanford on revival of cardiac arrest patients by
|
|
paramedics. The goal is to predict which patients can be successfully
|
|
revived in the
|
|
field on the basis of fourteen variables known at or near the time of
|
|
paramedic arrival, e.g., sex, age, time from attack to first care, etc.
|
|
Since some patients who are not revived on site are later successfully
|
|
resuscitated at the hospital, early identification of these ``recalcitrant''
|
|
cases is of considerable clinical interest.
|
|
|
|
|
|
\begin{figure}
|
|
\begin{picture}(350, 160)(-40,0)
|
|
\thicklines
|
|
\put(120,160){\phantom{2}24 revived}
|
|
\put(120,150){144 not revived}
|
|
\put(150,130){\line( 2,-1){89}}
|
|
\put(150,130){\line(-2,-1){89}}
|
|
\put(50, 70){$X_1=1$}
|
|
\put(50, 60){22 / 13}
|
|
\put(210, 70){$X_1=$2, 3 or 4}
|
|
\put(220, 60){2 / 131}
|
|
\put(60, 54) {\line(-2,-1){45}}
|
|
\put(60, 54) {\line( 2,-1){45}}
|
|
\put(240, 54){\line( 2,-1){45}}
|
|
\put(240, 54){\line(-2,-1){45}}
|
|
\put( 5,10){$X_2=1$}
|
|
\put( 5, 0){20 / 5}
|
|
\put( 90,10){$X_2=$2 or 3}
|
|
\put( 95, 0){2 / 8}
|
|
\put(185,10){$X_3=$1}
|
|
\put(185, 0){2 / 31}
|
|
\put(270,10){$X_3=$2 or 3}
|
|
\put(275, 0){0 / 100}
|
|
\end{picture}
|
|
\caption{Revival data}
|
|
\label{revival}
|
|
\end{figure}
|
|
|
|
The resultant model separated the patients into four groups as shown in
|
|
figure \ref{revival}, where
|
|
\begin{tabbing}
|
|
\qquad \qquad \= $X_1 $ = \= initial heart rhythm \\
|
|
\> \> 1= VF/VT 2=EMD 3=Asystole 4=Other \\ \\
|
|
\> $X_2 $ = initial response to defibrillation \\
|
|
\> \>1=Improved 2=No change 3=Worse \\ \\
|
|
\> $X_3 $ = initial response to drugs \\
|
|
\> \> 1=Improved 2=No change 3=Worse
|
|
\end{tabbing}
|
|
|
|
The other 11 variables did not appear in the final model.
|
|
This procedure seems to work especially well for variables such as
|
|
$X_1 $ ,
|
|
where there is a definite ordering, but spacings are not necessarily equal.
|
|
|
|
The tree is built by the following process: first the single variable
|
|
is found which best splits the data into two groups (`best' will be defined
|
|
later). The data is separated, and then this process is applied
|
|
{\em separately}
|
|
to each sub-group, and so on recursively until the subgroups either reach
|
|
a minimum size (5 for this data) or until no improvement can be made.
|
|
|
|
The resultant model is, with near certainty, too complex, and the question
arises, as it does with all stepwise procedures, of when to stop. The
|
|
second stage of the procedure consists of using cross-validation to
|
|
trim back the full tree. In the medical example above the
|
|
full tree had ten terminal regions. A cross validated estimate of risk
|
|
was computed for a nested set of sub trees; the final model was that sub tree
|
|
with the lowest estimate of risk.
|
|
|
|
\section{Notation}
|
|
The partitioning method can be applied to many different kinds of
|
|
data. We will start by looking at the classification problem,
|
|
which is one of the more instructive cases (but also has the
|
|
most complex equations).
|
|
The sample population consists of $n$ observations from $C$ classes. A given
|
|
model will break these observations into $k$ terminal groups;
|
|
to each of these groups is assigned a predicted class.
|
|
In an actual application most parameters will be estimated from
the data; such estimates are indicated by $\approx$ in the formulae below.
|
|
|
|
\begin{quote}
|
|
\begin{tabbing}
|
|
$\pi_i $\qquad\qquad \=$i=1,2,...,C$ \qquad prior probabilities of each class \\
|
|
\\
|
|
$L(i,j)$ \> $i=1,2,...,C$ \quad Loss matrix for incorrectly classifying \\
|
|
\> an $i$ as a $j$. $L(i,i)\equiv 0 $ \\
|
|
\\
|
|
$A$ \>some node of the tree \\
|
|
\> Note that A represents both a set of individuals in \\
|
|
\> the sample data, and, via the tree that produced it, \\
|
|
\> a classification rule for future data. \\
|
|
\\
|
|
$\tau(x)$ \> true class of an observation $ x$, where $x$ is the \\
|
|
\> vector of predictor variables \\
|
|
\\
|
|
$\tau( A ) $ \>
|
|
the class assigned to A, if A were to be taken as a \\
|
|
\> final node \\
|
|
\\
|
|
$n_i, \,n_A$ \>
|
|
number of observations in the sample that are class $i$, \\
|
|
\> number of obs in node $A$ \\
|
|
\\
|
|
$P(A) $ \>probability of $A$ (for future observations) \\
|
|
\>$=\sum_{i=1}^C \pi_i \,P\{ x \in A \mid \tau( x) =i \}$ \\
|
|
\>$ \approx \sum_{i=1}^C \pi_i n_{iA} / n_i $ \\
|
|
\\
|
|
$p(i|A)$ \> $P\{ \tau( x) =i \mid x \in A \} $ (for future observations) \\
|
|
\>$= \pi_i P \{ x \in A\mid\tau( x)=i\} /P \{ x \in A\} $ \\
|
|
\>$ \approx \pi_i (n_{iA} / n_i) / \sum_j \pi_j (n_{jA} / n_j) $\\
|
|
\\
|
|
$R(A) $ \>risk of $A$ \\
|
|
\> $= \sum_{i=1}^C p(i|A) L(i, \tau(A)) $ \\
|
|
\> where $\tau(A) $ is chosen to minimize this risk \\
|
|
\\
|
|
$R(T)$ \>risk of a model (or tree) T \\
|
|
\>$ =\sum_{j=1}^k P(A_j) R(A_j) $ \\
|
|
\> where $A_j $ are the terminal nodes of the tree \\
|
|
\end{tabbing}
|
|
\end{quote}
|
|
|
|
If $L(i,j)=1$ for all $i \ne j$, and we set the prior probabilities
|
|
$\pi$ equal to the observed class frequencies in the sample
|
|
then $p(i|A)= n_{iA}/n_A$ and
|
|
$R(T)$ is the proportion misclassified.
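
As a small illustration of these definitions (a sketch added here, not part of
the original report), the following R function computes $p(i|A)$, $\tau(A)$ and
$R(A)$ for a single node from its class counts, the overall class counts, the
priors and a loss matrix; the function name and the numbers in the example are
hypothetical.
<<eval = FALSE>>=
## Sketch: p(i|A), tau(A) and R(A) for one node, given the class counts in
## the node, the class counts overall, the priors, and a loss matrix L with
## L[i,j] = loss of calling a true class i a j.
node.risk <- function(n.iA, n.i, prior, L) {
    p <- prior * n.iA / n.i          # pi_i * P{x in A | class i}
    p <- p / sum(p)                  # p(i|A)
    risk <- colSums(p * L)           # risk of assigning each class to A
    list(p.iA = p, tau = which.min(risk), R.A = min(risk))
}
## hypothetical two-class node: 15 of 100 class-1 and 5 of 100 class-2
## observations, equal priors, 0/1 loss
node.risk(c(15, 5), c(100, 100), c(.5, .5), matrix(c(0, 1, 1, 0), 2))
@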
|
|
\section{Building the tree}
|
|
\subsection{Splitting criteria}
|
|
|
|
If we split a node $A$ into two sons $A_L $ and $A_R $
|
|
(left and right sons), we will have
|
|
$$
|
|
P(A_L ) R(A_L )+ P(A_R ) R(A_R ) \le P(A) R(A)
|
|
$$
|
|
(this is proven in \cite{Breiman83}).
|
|
Using this, one obvious way to build a tree is
|
|
to choose the split that maximizes $\Delta R$, the decrease in risk.
|
|
There are defects with this, however, as the following example shows:
|
|
\begin{quote}
|
|
Suppose losses are equal and that the data is 80\% class 1's, and that some
|
|
trial split results in $A_L $ being 54\% class 1's and
|
|
$A_R $ being 100\% class 1's.
|
|
Since the minimum risk prediction for both the left and right son is
|
|
$\tau (A_L ) = \tau (A_R ) =1$, this split will
|
|
have $\Delta R = 0$, yet
|
|
scientifically this is a very informative division of the sample.
|
|
In real data with such a majority, the first few splits very often can
|
|
do no better than this.
|
|
\end{quote}
|
|
|
|
A more serious defect is that the risk reduction is essentially linear.
|
|
If there were two competing splits, one separating the data into groups
|
|
of 85\% and 50\% purity respectively, and the other into 70\%-70\%, we would
|
|
usually prefer the former, if for no other reason than because it better
|
|
sets things up for the next splits.
|
|
|
|
One way around both of these problems is to use look-ahead rules; but
|
|
these are computationally very expensive. Instead {\rpart} uses one of
|
|
several measures of impurity, or diversity, of a node. Let $f$ be some
|
|
impurity function and define the impurity of a node A as
|
|
$$
|
|
I(A)= \sum_{i=1}^C f(p_{iA} )
|
|
$$
|
|
where $p_{iA} $ is the proportion of those in $A$ that belong to class
|
|
$i$ for future samples.
|
|
Since we would like $I(A) = 0$ when $A$ is pure, $f$ must be concave with
|
|
$f(0)=f(1)=0$.
|
|
|
|
Two candidates for $f$ are the information index $f(p)=-p\log(p)$ and
|
|
the Gini
|
|
index $f(p)=p(1-p)$. We then use that split with maximal impurity
|
|
reduction
|
|
$$
|
|
\Delta I =p(A)I(A)-p(A_L )I(A_L )-p(A_R )I(A_R )
|
|
$$
|
|
The two impurity functions are plotted in figure \ref{ginip},
|
|
along with a rescaled version of the Gini measure.
|
|
For the two class problem the measures differ only slightly, and
|
|
will nearly always choose the same split point.
|
|
\begin{figure}
|
|
\myfig{longintro-impurity}
|
|
\caption{Comparison of Gini and Information impurity for two groups.}
|
|
\label{ginip}
|
|
\end{figure}
|
|
<<impurity, echo=FALSE, fig=TRUE, include=FALSE>>=
|
|
ptemp <- seq(0, 1, length = 101)[2:100]
|
|
gini <- 2* ptemp *(1-ptemp)
|
|
inform <- -(ptemp*log(ptemp) + (1-ptemp)*log(1-ptemp))
|
|
sgini <- gini *max(inform)/max(gini)
|
|
matplot(ptemp, cbind(gini, inform, sgini), type = 'l', lty = 1:3,
|
|
xlab = "P", ylab = "Impurity", col = 1)
|
|
legend(.3, .2, c("Gini", "Information", "rescaled Gini"),
|
|
lty = 1:3, col = 1, bty = 'n')
|
|
@
|
|
|
|
Another criterion, not quite of the above class, is twoing, for which
|
|
$$
|
|
I(A)=\min_{C_1C_2} [f(p_{C_1} )+f(p_{C_2} )]
|
|
$$
|
|
where $C_1 $,$C_2 $ is some partition of the $C$ classes into
|
|
two disjoint sets. If $C=2$ twoing is equivalent to the usual impurity index
|
|
for $f$.
|
|
Surprisingly, twoing can be calculated almost as efficiently as the
|
|
usual impurity index.
|
|
One potential advantage of twoing is that the output may give the user
|
|
additional insight concerning the structure of the data. It can be viewed
|
|
as the partition of $C$ into two superclasses which are in some sense the
|
|
most dissimilar for those observations in $A$.
|
|
For certain problems there may be a natural ordering of the response
|
|
categories (e.g. level of education), in which case ordered twoing can be
|
|
naturally defined, by restricting $C_1$ to be an interval $[1,2,\ldots,k]$
|
|
of classes.
|
|
Twoing is not part of {\rpart}.
|
|
|
|
\subsection{Incorporating losses}
|
|
|
|
One salutary aspect of the risk reduction criterion not found
|
|
in the impurity measures is inclusion of the loss function. Two different
|
|
ways of extending the impurity criteria
|
|
to also include losses
|
|
are implemented in CART, the
|
|
generalized Gini index and altered priors.
|
|
The {\rpart} software implements only the altered priors method.
|
|
|
|
\subsubsection{Generalized Gini index}
|
|
|
|
The Gini index has the following interesting interpretation. Suppose
|
|
an object is selected at random from one of C classes according to the
|
|
probabilities $(p_1 ,p_2 ,...,p_C ) $
|
|
and is randomly assigned to a class using the same distribution. The
|
|
probability of misclassification is
|
|
$$
|
|
\sum_i \sum_{j \ne i} p_i p_j =
\sum_i \sum_j p_i p_j - \sum_i p_i^2
= 1 - \sum_i p_i^2 = \hbox{Gini index for $p$}
|
|
$$
|
|
Let $L(i,j)$ be the loss of assigning class $j$ to an object which actually
|
|
belongs to class $i$. The expected cost of misclassification is
|
|
$\sum \sum L(i,j)p_i p_j $. This suggests defining a
|
|
{\em generalized Gini} index of impurity by
|
|
$$
|
|
G( p) = \sum_ i \sum_ j L(i,j)p_i p_j
|
|
$$
|
|
|
|
The corresponding splitting criterion appears to be promising for applications
|
|
involving variable misclassification costs. But there are several
|
|
reasonable objections to it. First, $G( p) $ is not necessarily a
|
|
concave function of $ p$, which was the motivating factor behind
|
|
impurity measures. More seriously, $G$ symmetrizes the loss
|
|
matrix before using it. To see this note that
|
|
$$
|
|
G( p) = (1/2) \sum \sum [L(i,j)+L(j,i)] \, p_i p_j
|
|
$$
|
|
In particular, for two-class problems, $G$ in effect ignores the
|
|
loss matrix.
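
A quick numerical check of this symmetrization (an illustrative sketch with
made-up losses, not part of the original report) is shown below.
<<eval = FALSE>>=
## G(p) = sum_i sum_j L(i,j) p_i p_j; for two classes only L(1,2) + L(2,1)
## matters, so an asymmetric loss gives the same index as its symmetrized form
gengini <- function(p, L) sum(outer(p, p) * L)
L.asym <- matrix(c(0, 1,
                   5, 0), 2, 2, byrow = TRUE)   # L(1,2) = 1, L(2,1) = 5
L.sym  <- matrix(c(0, 3,
                   3, 0), 2, 2, byrow = TRUE)   # both off-diagonals = 3
p <- c(0.7, 0.3)
c(gengini(p, L.asym), gengini(p, L.sym))        # identical values
@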
|
|
|
|
|
|
\subsubsection{Altered priors}
|
|
|
|
Remember the definition of $R(A)$
|
|
\begin{eqnarray*}
|
|
R(A) &\equiv& \sum_{i=1}^C p_{iA}L(i, \tau (A)) \\
|
|
&=& \sum_{i=1}^C \pi_i L(i, \tau (A))
|
|
(n_{iA} / n_i )( n / n_A )
|
|
\end{eqnarray*}
|
|
Assume there exist $\tilde \pi $ and $\tilde L $ such that
|
|
$$
|
|
\tilde \pi_i \tilde L (i,j) = \pi_i L(i,j)\qquad \forall i,j \in C
|
|
$$
|
|
Then $R(A)$ is unchanged under the new losses and priors. If $\tilde L$
|
|
is proportional to the zero-one loss matrix then the priors $\tilde \pi$
|
|
should be used in the splitting criteria. This is possible only if $L$ is
|
|
of the form
|
|
$$
|
|
L(i,j) = \left\{ \begin{array}{ll}
|
|
L_i & i \ne j \\
|
|
|
|
0 & i=j
|
|
\end{array} \right .
|
|
$$
|
|
in which case
|
|
$$
|
|
\tilde \pi_i = \frac{\pi_i L_i } { \sum_j \pi_j L_j }
|
|
$$
|
|
This is always possible when $C=2$,
|
|
and hence altered priors are exact for the two class problem.
|
|
For arbitrary loss matrix of dimension $C> 2$,
|
|
{\rpart} uses the above formula with
|
|
$L_i = \sum_ j L(i,j)$.
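
A sketch of this computation (illustrative only; {\rpart} performs it
internally, and the loss matrix below is hypothetical) is:
<<eval = FALSE>>=
## altered priors for a C > 2 loss matrix, using L_i = sum_j L(i,j)
altered.prior <- function(prior, L) {
    Li <- rowSums(L)
    prior * Li / sum(prior * Li)
}
## class 1 misclassifications cost twice as much as those for classes 2 and 3
L3 <- matrix(c(0, 2, 2,
               1, 0, 1,
               1, 1, 0), 3, 3, byrow = TRUE)
altered.prior(rep(1/3, 3), L3)    # (1/2, 1/4, 1/4)
@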
|
|
|
|
|
|
A second justification for altered priors is this. An impurity index
|
|
$I(A) = \sum f(p_i )$ has its maximum at $p_1= p_2 =\ldots =
|
|
p_C = 1/C$. If a problem had, for instance, a misclassification
|
|
loss for class 1 which was twice the loss for a class 2 or 3
|
|
observation, one would wish $I(A)$ to have its maximum at $p_1 = 1/5$,
$p_2 = p_3 = 2/5$, since this is the worst possible set of proportions
|
|
on which to decide a node's class. The altered priors technique does
|
|
exactly this, by shifting the $p_i $.
|
|
|
|
Two final notes
|
|
\begin{itemize}
|
|
\item When altered priors are used, they affect only the choice of split.
|
|
The ordinary losses and priors are used to compute the risk of the node.
|
|
The altered priors simply help the impurity rule choose splits that
|
|
are likely to be ``good'' in terms of the risk.
|
|
\item The argument for altered priors is valid for both the Gini and
|
|
information splitting rules.
|
|
\end{itemize}
|
|
|
|
\subsection{Example: Stage C prostate cancer (\Co{class} method)}
|
|
\begin{figure}
|
|
\myfig{longintro-gini1}
|
|
\caption{Classification tree for the Stage C data}
|
|
\label{fgini1}
|
|
\end{figure}
|
|
|
|
This first example is based on a data set of 146 stage C prostate
|
|
cancer patients \cite{Nativ88}.
|
|
The main clinical endpoint of interest is whether the
|
|
disease recurs after initial surgical removal of the prostate, and
|
|
the time interval to that progression (if any).
|
|
The endpoint in this example is \Co{status}, which takes
|
|
on the value 1 if the disease has progressed and 0 if not.
|
|
Later we'll analyze
|
|
the data using the \Co{exp} method, which will take into account
|
|
time to progression. A short description of each of the variables is
|
|
listed below. The main predictor variable of interest in
|
|
this study was DNA ploidy, as determined by flow cytometry.
|
|
For diploid and tetraploid tumors, the flow cytometry method was
|
|
also able to estimate the percent of tumor cells in a $G_2$ (growth)
|
|
stage of their cell cycle; $G_2$\% is systematically missing
|
|
for most aneuploid tumors.
|
|
|
|
The variables in the data set are
|
|
\begin{center}
|
|
\begin{tabular}{ll}
|
|
pgtime & time to progression, or last follow-up free of progression \\
|
|
pgstat & status at last follow-up (1=progressed, 0=censored)\\
|
|
age & age at diagnosis \\
|
|
eet & early endocrine therapy (1=no, 0=yes) \\
|
|
ploidy & diploid/tetraploid/aneuploid DNA pattern \\
|
|
g2 & \% of cells in $G_2$ phase \\
|
|
grade & tumor grade (1-4) \\
|
|
gleason & Gleason grade (3-10) \\
|
|
\end{tabular}
|
|
\end{center}
|
|
|
|
|
|
The model is fit by using the {\rpart} function. The first
|
|
argument of the function is a model formula,
|
|
with the $\sim$ symbol standing for ``is modeled as''.
|
|
The \Co{print} function gives an abbreviated output, as for other R models.
|
|
The \Co{plot} and \Co{text} commands plot the tree and then label the
plot; the result is shown in figure \ref{fgini1}.
|
|
<<gini1, fig=TRUE, include=FALSE>>=
|
|
progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
|
|
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
|
|
data = stagec, method = 'class')
|
|
print(cfit)
|
|
par(mar = rep(0.1, 4))
|
|
plot(cfit)
|
|
text(cfit)
|
|
@
|
|
|
|
\begin{itemize}
|
|
\item The creation of a labeled factor variable as the response
|
|
improves the labeling of the printout.
|
|
\item We have explicitly directed the routine to treat \Co{progstat} as
|
|
a categorical variable by asking for \Co{method='class'}.
|
|
(Since \Co{progstat} is a factor this would have been the default choice).
|
|
Since no optional classification
|
|
parameters are specified the routine will use the Gini
|
|
rule for splitting, prior probabilities that are proportional
|
|
to the observed data frequencies, and 0/1 losses.
|
|
\item The child nodes of node $x$ are always $2x$ and $2x+1$, to help
|
|
in navigating the printout (compare the printout to figure \ref{fgini1}).
|
|
\item Other items in the list are the definition of the split used
|
|
to create a node,
|
|
the number of subjects at the node,
|
|
the loss or error at the node (for this example, with proportional
|
|
priors and unit losses this will be the number misclassified),
|
|
and the predicted class for the node.
|
|
\item * indicates that the node is terminal.
|
|
\item Grades 1 and 2 go to the left, grades 3 and 4 go to the right.
|
|
The tree is arranged so that the branches with the largest
|
|
``average class'' go to the right.
|
|
\end{itemize}
|
|
|
|
\subsection{Variable importance}
|
|
The long form of the printout for the stage C data, obtained with the
|
|
summary function, contains further information on the surrogates.
|
|
The \texttt{cp} option of the summary function instructs it to prune
|
|
the printout, but it does not prune the tree.
|
|
<<>>=
summary(cfit, cp = 0.06)
@
|
|
For each node up to 5 surrogate splits (default) will be printed,
|
|
but only those whose utility is greater than the baseline ``go with the
|
|
majority'' surrogate.
|
|
The first surrogate for the first split is based on the following
|
|
table:
|
|
<<>>=
|
|
temp <- with(stagec, table(cut(grade, c(0, 2.5, 4)),
|
|
cut(gleason, c(2, 5.5, 10)),
|
|
exclude = NULL))
|
|
temp
|
|
@
|
|
The surrogate sends \Sexpr{sum(diag(temp))} of the \Sexpr{sum(temp)}
|
|
observations the correct direction for an agreement of
|
|
\Sexpr{round(sum(diag(temp))/sum(temp), 3)}.
|
|
The majority rule gets \Sexpr{zz <-max(cfit$frame$n[2:3]); zz} correct, and
|
|
the adjusted agreement is (\Sexpr{sum(diag(temp))} - \Sexpr{zz})/
|
|
(\Sexpr{sum(temp)} - \Sexpr{zz}).
|
|
|
|
A variable may appear in the tree many times, either as a primary or a
|
|
surrogate variable.
|
|
An overall measure of variable importance is the sum of the goodness of split
|
|
measures for each split for which it was the primary variable, plus
|
|
goodness * (adjusted agreement) for all splits in which it was a surrogate.
|
|
In the printout these are scaled to sum to 100 and the rounded values are shown,
|
|
omitting any variable whose proportion is less than 1\%.
|
|
Imagine two variables which were essentially duplicates of each other;
|
|
if we did not count surrogates they would split the importance with neither
|
|
showing up as strongly as it should.
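
The importances themselves are returned as part of the fitted object; the
short sketch below (using the \Co{cfit} object from the earlier fit)
reproduces the scaling used in the printout.
<<eval = FALSE>>=
## variable.importance holds the unscaled sums described above;
## the printout scales them to total 100 and rounds
imp <- cfit$variable.importance
round(100 * imp / sum(imp))
@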
|
|
|
|
|
|
\section{Pruning the tree}
|
|
\subsection{Definitions}
|
|
We have built a complete tree, possibly quite large and/or
|
|
complex, and must now decide how much of that model to retain. In
|
|
stepwise regression, for instance, this issue is addressed sequentially
|
|
and the fit is stopped when the F test fails to achieve some level
|
|
$\alpha $.
|
|
|
|
Let $T_1$, $T_2$,....,$T_k$ be the terminal nodes of a tree T.
|
|
Define
|
|
\begin{quote}
|
|
$|T|$ = number of terminal nodes \\
|
|
risk of $T$ = $R(T)$ = $\sum_{i=1}^k P(T_i )R(T_i )$
|
|
\end{quote}
|
|
In comparison to regression, $|T|$ is analogous to the degrees of freedom
|
|
and $R(T)$ to the residual sum of squares.
|
|
|
|
Now let $\alpha$ be some number between 0 and $\infty$ which measures
|
|
the 'cost' of adding another variable to the model; $\alpha$ will be called a
|
|
complexity parameter. Let $R(T_0)$ be the risk for the zero split tree.
|
|
Define
|
|
\begin{equation}
R_\alpha (T) =R(T) + \alpha |T|
\label{cp1}
\end{equation}
|
|
to be the cost for the tree, and define $T_\alpha $ to be that
|
|
sub tree of the full model which has minimal cost. Obviously $T_0 $
|
|
= the full model and $T_\infty $ = the model with no splits at all.
|
|
The following results are shown in \cite{Breiman83}.
|
|
\begin{enumerate}
|
|
\item If $T_1 $ and $T_2 $ are sub trees of $T$ with
|
|
$R_\alpha (T_1 )= R_\alpha (T_2 ) $,
|
|
then either $T_1 $ is a sub tree of $T_2 $ or $T_2 $ is
|
|
a sub tree of $T_1 $; hence either $|T_1| < |T_2|$ or
|
|
$|T_2| < |T_1|$.
|
|
\item If $\alpha > \beta$ then either $T_\alpha = T_\beta $ or
|
|
$T_\alpha $ is a strict sub tree of $T_\beta $.
|
|
\item Given some set of numbers $\alpha_1 ,\alpha_2 ,\ldots,\alpha_m $;
|
|
both $T_{\alpha_1} ,\ldots,T_{\alpha_m} $ and
|
|
$R(T_{\alpha_1} )$, $\ldots$, $R(T_{\alpha_m} )$
|
|
can be computed efficiently.
|
|
\end{enumerate}
|
|
Using the first result, we can uniquely define $T_\alpha$ as the smallest
|
|
tree $T$ for which $R_\alpha (T) $ is minimized.
|
|
|
|
Since any sequence of nested trees based on $T$ has at most
|
|
$|T|$ members, result 2 implies that all possible values of $\alpha$
|
|
can be grouped into $m$ intervals, $m \le |T| $
|
|
\begin{eqnarray*}
|
|
I_1 &=& [0, \alpha_1 ] \\
|
|
I_2 &=& ( \alpha_1 , \alpha_2 ] \\
|
|
\vdots \\
|
|
I_m &=& ( \alpha_{m-1} , \infty]
|
|
\end{eqnarray*}
|
|
where all $\alpha \in I_i $ share the same minimizing sub tree.
|
|
|
|
\subsection{Cross-validation}
|
|
Cross-validation is used to choose a best value for
|
|
$\alpha$ by the following steps:
|
|
\begin{enumerate}
|
|
\item \begin{tabbing}
|
|
\= Fit the full model on the data set \\
|
|
\> compute $I_1 ,I_2,...,I_m$ \\
|
|
\> set \= $\beta_1=0 $ \\
|
|
\> \> $\beta_2=\sqrt {\alpha_1 \alpha_2 } $ \\
|
|
\> \> $\beta_3=\sqrt {\alpha_2 \alpha_3 } $ \\
|
|
\> \> $\vdots $ \\
|
|
\> \> $\beta_{m-1} = \sqrt {\alpha_{m-2} \alpha_{m-1} } $ \\
|
|
\> \> $\beta_m=\infty$ \\
|
|
\> each $\beta_i $ is a `typical value' for its $I_i $ \\
|
|
\end{tabbing}
|
|
\item Divide the data set into $s$ groups $G_1 ,G_2 ,...,G_s $
|
|
each of size $n/s$, and for each group separately:
|
|
\begin{itemize}
|
|
\item fit a full model on the data set `everyone except $G_i $'
|
|
and determine $T_{\beta_1} , T_{\beta_2} ,...,T_{\beta_m} $
|
|
for this reduced data set,
|
|
\item compute the predicted class for each observation in $G_i$,
|
|
under each of the models $T_{\beta_j}$ for $1 \le j \le m$,
|
|
\item from this compute the risk for each subject.
|
|
\end{itemize}
|
|
\item
|
|
Sum over the $G_i $ to get an estimate of risk for each
|
|
$\beta_j $.
|
|
For that $\beta$ (complexity parameter) with smallest risk
|
|
compute $T_\beta $ for the full data set; this is
|
|
chosen as the best trimmed tree.
|
|
\end{enumerate}
|
|
|
|
In actual practice, we may use instead the 1-SE rule. A plot of $\beta$
|
|
versus risk often has an initial sharp drop followed by a relatively flat
|
|
plateau and then a slow rise. The choice of $\beta$ among those
|
|
models on the
|
|
plateau can be essentially random. To avoid this, both an estimate of the
|
|
risk and its standard error are computed during the cross-validation. Any
|
|
risk within one standard error of the achieved minimum is marked as
|
|
being equivalent to the minimum (i.e. considered to be part of the flat
|
|
plateau). Then the simplest model, among all those ``tied'' on the
|
|
plateau, is chosen.
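
For a fitted {\rpart} object the quantities needed for the 1-SE rule are
contained in its \Co{cptable} component; the following sketch (for a
hypothetical fit called \Co{fit}) picks the simplest tree whose
cross-validated error is within one standard error of the minimum.
<<eval = FALSE>>=
## 1-SE rule applied to the CP table of a hypothetical rpart fit 'fit'
ct <- fit$cptable
imin <- which.min(ct[, "xerror"])                 # position of the minimum
thresh <- ct[imin, "xerror"] + ct[imin, "xstd"]   # minimum + 1 SE
cp.1se <- ct[min(which(ct[, "xerror"] <= thresh)), "CP"]
fit.1se <- prune(fit, cp = cp.1se)
@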
|
|
|
|
In the usual definition of cross-validation we would have taken $s=n$ above, i.e.,
|
|
each of the $G_i $ would contain exactly one observation, but for
|
|
moderate $n$ this is computationally prohibitive. A value of $s = 10$ has
|
|
been found to be sufficient, but users can vary this if they wish.
|
|
|
|
In Monte-Carlo trials, this method of pruning has proven very reliable for
|
|
screening out `pure noise' variables in the data set.
|
|
|
|
\subsection{Example: The Stochastic Digit Recognition Problem}
|
|
|
|
This example is found in section 2.6 of \cite{Breiman83}, and
|
|
used as a running example throughout much of their book.
|
|
Consider the segments of an unreliable digital readout
|
|
\begin{center}
|
|
\begin{picture}(350,160)(-100,0)
|
|
\put(27,120){1}
|
|
\put(10,115){\line(1,0){40}}
|
|
\put(10, 75){\line(1,0){40}}
|
|
\put(10, 35){\line(1,0){40}}
|
|
\put(10, 35){\line(0,1){80}}
|
|
\put(50, 35){\line(0,1){80}}
|
|
\put(27, 23){7}
|
|
\put(27, 80){4}
|
|
\put(0 , 50){5}
|
|
\put(0, 90){2}
|
|
\put(60, 90){3}
|
|
\put(60, 50){6}
|
|
\end{picture}
|
|
\end{center}
|
|
where each light is correct with probability 0.9, e.g., if the true digit is
|
|
a 2, the lights 1, 3, 4, 5, and 7 are on with probability 0.9 and lights
|
|
2 and 6 are on with probability 0.1. Construct test data where
|
|
$Y\in \{0,1,...,9\}$, each with proportion 1/10 and the $X_i$, $i=1,\ldots,7$
|
|
are i.i.d. Bernoulli variables with parameter depending on $Y$.
|
|
$X_8, \ldots, X_{24}$
are generated as i.i.d.\ Bernoulli with $P\{X_i =1\} = 0.5$, and are
|
|
independent of $Y$. They correspond to embedding the readout in a larger
|
|
rectangle of random lights.
|
|
|
|
\begin{figure}
|
|
\myfig{longintro-dig1}
|
|
\caption{Optimally pruned tree for the stochastic digit recognition data}
|
|
\label{figdig}
|
|
\end{figure}
|
|
|
|
|
|
A sample of size 200 was generated accordingly and the
|
|
procedure applied using the Gini index (see 3.2.1) to build the tree.
|
|
The R code to compute the simulated data and the fit are shown below.
|
|
<<dig1, fig=TRUE, include=FALSE>>=
|
|
set.seed(1953) # An auspicious year
|
|
n <- 200
|
|
y <- rep(0:9, length = 200)
|
|
temp <- c(1,1,1,0,1,1,1,
|
|
0,0,1,0,0,1,0,
|
|
1,0,1,1,1,0,1,
|
|
1,0,1,1,0,1,1,
|
|
0,1,1,1,0,1,0,
|
|
1,1,0,1,0,1,1,
|
|
0,1,0,1,1,1,1,
|
|
1,0,1,0,0,1,0,
|
|
1,1,1,1,1,1,1,
|
|
1,1,1,1,0,1,0)
|
|
|
|
lights <- matrix(temp, 10, 7, byrow = TRUE) # The true light pattern 0-9
|
|
temp1 <- matrix(rbinom(n*7, 1, 0.9), n, 7) # Noisy lights
|
|
temp1 <- ifelse(lights[y+1, ] == 1, temp1, 1-temp1)
|
|
temp2 <- matrix(rbinom(n*17, 1, 0.5), n, 17) # Random lights
|
|
x <- cbind(temp1, temp2)
|
|
|
|
dfit <- rpart(y ~ x, method='class',
|
|
control = rpart.control(xval = 10, minbucket = 2, cp = 0))
|
|
printcp(dfit)
|
|
|
|
fit9 <- prune(dfit, cp = 0.02)
|
|
par(mar = rep(0.1, 4))
|
|
plot(fit9, branch = 0.3, compress = TRUE)
|
|
text(fit9)
|
|
@
|
|
|
|
This table differs from that in section 3.5 of \cite{Breiman83} in
|
|
several ways, the last two of which are substantive.
|
|
\begin{itemize}
|
|
\item The actual values are different, of course, because of different
|
|
random number generators in the two runs.
|
|
\item The complexity table is printed from the smallest tree (no splits) to the
|
|
largest one (28 splits).
|
|
We find it easier to compare one tree to another when they start at
|
|
the same place.
|
|
\item The number of splits is listed, rather than the number of nodes.
|
|
The number of nodes is always 1 + the number of splits.
|
|
\item For easier reading, the error columns have been scaled so that the
|
|
first node has an error of 1. Since in this example the model with no
|
|
splits must make 180/200 misclassifications, multiply columns 3-5 by
|
|
180 to get a result in terms of absolute error.
|
|
(Computations are done on the absolute error scale, and printed on
|
|
relative scale).
|
|
\item The complexity parameter column has been similarly scaled.
|
|
\end{itemize}
|
|
|
|
Looking at the table, we see that the best tree has 10 terminal
|
|
nodes (9 splits), based on cross-validation.
|
|
This sub tree is extracted with a call to \Co{prune} and saved in \Co{fit9}.
|
|
The pruned tree is shown in figure \ref{figdig}.
|
|
Two options have been used in the plot. The \Co{compress} option
|
|
tries to narrow the printout by vertically overlapping portions
|
|
of the plot.
|
|
The \Co{branch} option controls the shape of the branches that
|
|
connect a node to its children.
|
|
|
|
The largest tree, with 35 terminal nodes, correctly classifies 170/200 = 85\%
|
|
of the observations, but uses several of the random predictors in
|
|
doing so and seriously overfits the data.
|
|
If the number of observations per terminal node (minbucket)
|
|
had been set to 1 instead of
|
|
2, then every observation would be classified correctly in the final model,
|
|
many in terminal nodes of size 1.
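
A confusion table of the pruned fit against the true digits (a quick check,
not part of the original report) can be obtained as follows.
<<eval = FALSE>>=
## apparent (resubstitution) performance of the 9-split tree
table(predicted = predict(fit9, type = "class"), truth = y)
@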
|
|
|
|
\section{Missing data}
|
|
\subsection{Choosing the split}
|
|
|
|
Missing values are one of the curses of statistical models and analysis.
|
|
Most procedures deal with them by refusing to deal with them --
|
|
incomplete observations are tossed out. \texttt{Rpart} is somewhat more
|
|
ambitious. Any observation with values for the dependent variable and
|
|
at least one independent variable will participate in the modeling.
|
|
|
|
|
|
The quantity to be maximized is still
|
|
$$
|
|
\Delta I = p(A)I(A)-p(A_L )I(A_L ) - p(A_R ) I(A_R)
|
|
$$
|
|
The leading term is the same for all variables and splits irrespective of
|
|
missing data, but the right two terms are somewhat modified. Firstly, the
|
|
impurity indices $I(A_R )$ and $I (A_L )$ are calculated only over
|
|
the observations which are not missing a particular predictor. Secondly,
|
|
the two probabilities $p(A_L )$ and $p(A_R )$ are also calculated
|
|
only over the relevant observations, but they are then adjusted so that
|
|
they sum to $p(A)$. This entails some extra bookkeeping as the tree is
|
|
built, but ensures that the terminal node probabilities sum to 1.
|
|
|
|
In the extreme case of a variable for which only 2 observations are
|
|
non-missing, the impurity of the two sons will both be zero when splitting
|
|
on that variable. Hence $\Delta I $ will be maximal, and this `almost all
|
|
missing' coordinate is guaranteed to be chosen as best; the method is
|
|
certainly flawed in this extreme case. It is difficult to say whether this
|
|
bias toward missing coordinates carries through to the non-extreme cases,
|
|
however, since a more complete variable also offers more possible
values at which to split.
|
|
|
|
\subsection{Surrogate variables}
|
|
|
|
Once a splitting variable and a split point for it have been decided,
|
|
what {\em is}
|
|
to be done with observations missing that variable? One approach is to
|
|
estimate the missing datum using the other independent variables; {\rpart}
|
|
uses a variation of this to define
|
|
{\em surrogate} variables.
|
|
|
|
As an example, assume that the split (age $<$40, age $\ge $40) has
|
|
been chosen. The surrogate variables are found by re-applying the
|
|
partitioning algorithm (without recursion) to predict the two categories
|
|
`age $<$40' vs. `age $\ge $40' using the other independent variables.
|
|
|
|
%$$
|
|
%\begin{tabular}{l|r|r|}
|
|
% & & \multicolumn{2}{c}{Age $<$ 40} \\
|
|
% & & True & False \\ \hline
|
|
% Other X $<$ cutpoint & True & True Positive & False Positive \\ \cline{3-4}
|
|
% & False & False Negative & True Negative \\ \hline
|
|
%\end{tabular}
|
|
%$$
|
|
|
|
For each predictor an optimal split point and a misclassification error
|
|
are computed. (Losses and priors do not enter in --- none are defined for
|
|
the age groups --- so the risk is simply \#misclassified / n.) Also evaluated
|
|
is the blind rule `go with the majority' which has misclassification
|
|
error $\min(p, 1-p)$ where
|
|
\begin{center}
|
|
$p$ = (\# in A with age $<$ 40) / $n_A $.
|
|
\end{center}
|
|
The surrogates are ranked, and any variables which do no better than the
|
|
blind rule are discarded from the list.
|
|
|
|
Assume that the majority of subjects have age $\le$ 40 and that there is
|
|
another variable $x$ which is uncorrelated to
|
|
age; however, the subject with the largest value of $x$ is also over
|
|
40 years of age. Then the surrogate variable
|
|
$x < \max$ versus $x \ge \max$ will have one less error than the
blind rule,
|
|
sending 1 subject to the right and $n-1$ to the left.
|
|
A continuous variable that is completely unrelated to age has probability
|
|
$1-p^2$ of generating such a
|
|
trim-one-end surrogate by chance alone. For this reason
|
|
the {\rpart} routines impose one more constraint during the construction
|
|
of the surrogates: a candidate split must send at least 2 observations to
|
|
the left and at least 2 to the right.
|
|
|
|
Any observation which is missing the split variable is then classified using
|
|
the first surrogate variable, or if missing that, the second surrogate
|
|
is used, and so on. If an observation is missing all the surrogates the blind
|
|
rule is used. Other strategies for these `missing everything' observations
|
|
can be convincingly argued, but there should be few or no observations
|
|
of this type (we hope).
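
Surrogate behavior is governed by the \Co{maxsurrogate} and \Co{usesurrogate}
arguments of \Co{rpart.control}, described in the section on program options.
As an illustrative sketch (argument values chosen for illustration only), the
stage C model could be refit with no surrogates at all, so that observations
missing the primary split variable simply stop at that node:
<<eval = FALSE>>=
## no surrogates: observations missing a split variable halt at that node
cfit.nosurr <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
                     data = stagec, method = 'class',
                     control = rpart.control(maxsurrogate = 0,
                                             usesurrogate = 0))
@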
|
|
|
|
\subsection{Example: Stage C prostate cancer (cont.)}
|
|
Let us return to the stage C prostate cancer data of the
|
|
earlier example.
|
|
For a more detailed listing of the rpart object, we use the \Co{summary}
|
|
function. It includes the information from the CP table (not repeated below),
|
|
plus information about each node. It is easy to print a
|
|
sub tree based on a different cp value using the \Co{cp} option.
|
|
Any value between 0.0555 and 0.1049 would produce the same
|
|
result as is listed below, that is, the tree with 3 splits.
|
|
Because the printout is long, the \Co{file} option of \Co{summary.rpart}
|
|
is often useful.
|
|
|
|
<<>>=
|
|
printcp(cfit)
|
|
|
|
summary(cfit, cp = 0.06)
|
|
@
|
|
|
|
\begin{itemize}
|
|
\item There are 54 progressions (class 1) and 92 non-progressions, so
|
|
the first node has an expected loss of $54/146 \approx 0.37$.
|
|
(The computation is this simple only for the default priors and losses).
|
|
\item Grades 1 and 2 go to the left, grades 3 and 4 to the right.
|
|
The tree is arranged so that the ``more severe'' nodes go to the
|
|
right.
|
|
\item The improvement is $n$ times the change in impurity index.
|
|
In this instance, the largest improvement is for the variable \Co{grade},
|
|
with an improvement of 10.36.
|
|
The next best choice is Gleason score, with an improvement of 8.4.
|
|
The actual values of the improvement are not so important, but their
|
|
relative size gives an indication of the comparative utility of the
|
|
variables.
|
|
\item Ploidy is a categorical variable, with values of
|
|
diploid, tetraploid, and aneuploid, in that order.
|
|
(To check the order, type \Co{table(stagec\$ploidy)}).
|
|
All three possible splits were attempted:
|
|
aneuploid+diploid vs. tetraploid, aneuploid+tetraploid vs. diploid,
|
|
and aneuploid vs. diploid + tetraploid.
|
|
The best split sends diploid to the right and the others to the left.
|
|
\item The 2 by 2 table of diploid/non-diploid vs grade= 1-2/3-4 has
|
|
64\% of the observations on the diagonal. \\
|
|
$$
|
|
\begin{tabular}{r|rr}
|
|
& 1-2 & 3-4 \\ \hline
|
|
diploid & 38 & 29 \\
|
|
tetraploid & 22 & 46 \\
|
|
aneuploid & 1 & 10
|
|
\end{tabular}
|
|
$$
|
|
\item For node 3, the primary split variable is missing on 6 subjects.
|
|
All 6 are split based on the first surrogate, ploidy. Diploid and
|
|
aneuploid tumors are sent to the left, tetraploid to the right.
|
|
As a surrogate,
|
|
eet was no better than 45/85 (go with the majority), and was not retained.
|
|
$$
|
|
\begin{tabular}{l|c|c|}
|
|
& g2 $<$ 13.2 & g2 $>$ 13.2 \\ \hline
|
|
Diploid/aneuploid & 71 & 1 \\
|
|
Tetraploid & 3 & 64 \\
|
|
\end{tabular}
|
|
$$
|
|
\end{itemize}
|
|
|
|
\section{Further options}
|
|
\subsection{Program options}
|
|
|
|
The central fitting function is {\rpart}, whose main arguments are
|
|
\begin{itemize}
|
|
\item \Co{formula}: the model formula, as in \Co{lm} and other R
|
|
model fitting functions. The right hand side may contain both
|
|
continuous and categorical (factor) terms. If the outcome $y$ has more
|
|
than two levels, then categorical predictors must be fit by exhaustive
|
|
enumeration, which can take a very long time.
|
|
\item \Co{data, weights, subset}: as for other R models.
|
|
\item \Co{method}: the type of splitting rule to use. Options
|
|
at this point are classification, anova, Poisson, and exponential.
|
|
\item \Co{parms}: a list of method specific optional parameters.
|
|
For classification, the
|
|
list can contain any of: the vector of prior probabilities
|
|
(component \Co{prior}), the loss matrix (component \Co{loss})
|
|
or the splitting index (component \Co{split}).
|
|
The priors must be positive and sum to 1. The loss matrix must have
|
|
zeros on the diagonal and positive off-diagonal elements.
|
|
The splitting index can be \Co{"gini"} or \Co{"information"}.
|
|
\item \Co{na.action}: the action for missing values. The default action
|
|
for rpart is \Co{na.rpart}, this default is not overridden by the
|
|
\Co{options(na.action)} global option. The default action removes only
|
|
those rows for which either the response $y$ or \emph{all} of the predictors
|
|
are missing.
|
|
This ability to retain partially missing observations is perhaps the single
|
|
most useful feature of {\rpart} models.
|
|
\item \Co{control}: a list of control parameters, usually the result of
|
|
the \Co{rpart.control} function. The list must contain
|
|
\begin{itemize}
|
|
\item \Co{minsplit}: The minimum number of observations in a node for which
|
|
the routine will even try to compute a split. The default is 20.
|
|
This parameter can save computation time, since smaller nodes
|
|
are almost always pruned away by cross-validation.
|
|
\item \Co{minbucket}: The minimum number of observations in a terminal node.
|
|
This defaults to \Co{minsplit/3}.
|
|
\item \Co{maxcompete}: It is often useful in the printout to see not only
|
|
the variable that gave the best split at a node, but also the second, third,
|
|
etc best. This parameter controls the number that will be printed. It has
|
|
no effect on computational time, and a small effect on the amount of
|
|
memory used. The default is 4.
|
|
\item \Co{xval}: The number of cross-validations to be done. Usually set to
|
|
zero during exploratory phases of the analysis. A value of 10, for instance,
|
|
increases the compute time to 11-fold over a value of 0.
|
|
\item \Co{maxsurrogate}: The maximum number of surrogate variables to retain
|
|
at each node. (No surrogate that does worse than ``go with the majority''
|
|
is printed or used). Setting this to zero will cut the computation time in
|
|
half, and set \Co{usesurrogate} to zero. The default is 5.
|
|
Surrogates give different information than competitor splits. The competitor
|
|
list asks ``which other splits would have as many correct classifications'',
|
|
surrogates ask ``which other splits would classify the same subjects in the
|
|
same way'', which is a harsher criterion.
|
|
\item \Co{usesurrogate}: A value of usesurrogate=2, the default, splits
|
|
subjects in the way described previously. This is similar to CART.
|
|
If the value is 0, then a subject who is missing the primary split variable
|
|
does not progress further down the tree. A value of 1 is intermediate:
|
|
all surrogate variables except ``go with the majority'' are used to
|
|
send a case further down the tree.
|
|
\item \Co{cp}: The threshold complexity parameter.
|
|
\end{itemize}
|
|
\end{itemize}
|
|
|
|
The complexity parameter \Co{cp} is, like \Co{minsplit}, an advisory parameter,
|
|
but is considerably more useful. It is specified according to the
|
|
formula
|
|
$$
|
|
R_{cp}(T) \equiv R(T) + cp* |T| * R(T_1)
|
|
$$
|
|
where $T_1$ is the tree with no splits, $|T|$ is the
|
|
number of splits for a tree, and $R$ is the risk.
|
|
This scaled version is much more user friendly than the original
|
|
CART formula (\ref{cp1}) since it is unitless.
|
|
A value of \Co{cp = 1} will always result in a tree with no splits.
|
|
For regression models (see next section) the scaled cp has a very
|
|
direct interpretation: if any split does not increase the overall
|
|
${\rm R}^2$ of the model by at least $cp$
|
|
(where ${\rm R}^2$ is the usual linear-models definition)
|
|
then that split is decreed to be, a priori, not worth pursuing.
|
|
The program does not split said branch any further, and
|
|
saves considerable computational effort.
|
|
The default value of .01 has been reasonably successful at
|
|
`pre-pruning' trees so that the cross-validation step need only
|
|
remove 1 or 2 layers, but it sometimes overprunes,
|
|
particularly for large data sets.
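
As an illustrative sketch (the argument values are arbitrary choices, not
recommendations), a typical exploratory call that combines several of these
options for the stage C data is:
<<eval = FALSE>>=
## grow a deliberately large tree: small cp, no cross-validation yet,
## information splitting; pruning can be done afterwards with prune()
fit.big <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
                 data = stagec, method = 'class',
                 parms = list(split = 'information'),
                 control = rpart.control(minsplit = 10, cp = 0.001,
                                         xval = 0))
@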
|
|
|
|
|
|
\subsection{Example: Consumer Report Auto Data}
|
|
A second example using the \Co{class} method demonstrates the outcome
|
|
for a response with multiple ($>2$) categories. We also explore
|
|
the difference between Gini and information splitting rules.
|
|
The dataset cu.summary contains a collection of variables from the April,
|
|
1990 Consumer Reports summary on 117 cars. For our purposes, car
|
|
Reliability will be treated as the response. The variables are:
|
|
$$
|
|
\begin{tabular}{ll}
|
|
Reliability & an ordered factor (contains NAs): \\
|
|
& Much worse $<$ worse $<$ average $<$ better
|
|
$<$ Much Better \\
|
|
Price & numeric: list price in dollars, with standard equipment\\
|
|
Country & factor: country where car manufactured \\
|
|
Mileage & numeric: gas mileage in miles/gallon, contains NAs \\
|
|
Type & factor: Small, Sporty, Compact, Medium, Large, Van \\
|
|
\end{tabular}
|
|
$$
|
|
In the analysis we are treating reliability as an unordered outcome.
|
|
Nodes potentially can be classified as Much worse, worse, average,
|
|
better, or Much better, though there are none that are labeled as
|
|
just ``better''. The 32 cars with missing response (listed as NA)
|
|
were not used in the analysis.
|
|
Two fits are made, one using the Gini index and the other the
|
|
information index.
|
|
|
|
\begin{figure}[tb]
|
|
\myfig{longintro-cars}
|
|
\caption{Displays the rpart-based model relating automobile
|
|
Reliability to car type, price, and country of origin. The figure
|
|
on the left uses the \texttt{gini} splitting
|
|
index and the figure on the right uses the \texttt{information} splitting
|
|
index.}
|
|
\label{ginifig5}
|
|
\end{figure}
|
|
|
|
<<cars, fig=TRUE, include=FALSE>>=
|
|
fit1 <- rpart(Reliability ~ Price + Country + Mileage + Type,
|
|
data = cu.summary, parms = list(split = 'gini'))
|
|
fit2 <- rpart(Reliability ~ Price + Country + Mileage + Type,
|
|
data = cu.summary, parms = list(split = 'information'))
|
|
|
|
par(mfrow = c(1,2), mar = rep(0.1, 4))
|
|
plot(fit1, margin = 0.05); text(fit1, use.n = TRUE, cex = 0.8)
|
|
plot(fit2, margin = 0.05); text(fit2, use.n = TRUE, cex = 0.8)
|
|
@
|
|
Due to the wide labels at the bottom, we had to increase the figure space
|
|
slightly and decrease the character size in order to avoid truncation at the
|
|
left and right edges.
|
|
Details for the first two splits in the Gini tree are
|
|
<<>>=
|
|
summary(fit1, cp = 0.06)
|
|
@
|
|
|
|
And for the information splitting the first split is
|
|
<<>>=
|
|
fit3 <- rpart(Reliability ~ Price + Country + Mileage + Type,
|
|
data=cu.summary, parms=list(split='information'),
|
|
maxdepth=2)
|
|
summary(fit3)
|
|
@
|
|
|
|
The first 3 countries (Brazil, England, France) had only one or two
|
|
cars in the listing, all of which were missing the reliability
|
|
variable. There are no entries for these countries in the first node,
|
|
leading to the $-$ symbol for the rule.
|
|
The information measure has larger ``improvements'', consistent with
|
|
the difference in scaling between the information and Gini criteria
|
|
shown in figure 2, but the relative merits of different splits are
|
|
fairly stable.
|
|
|
|
The two rules do not choose the same primary split at node 2.
|
|
The data at this point are
|
|
\begin{center}
|
|
\begin{tabular}{rcccccc}
|
|
&Compact & Large & Medium &Small & Sporty & Van \\
|
|
Much worse & 2 & 2 & 4 & 2 & 7 & 1 \\
|
|
worse & 5 & 0 & 4 & 3 & 0 & 0 \\
|
|
average & 3 & 5 & 8 & 2 & 2 & 3 \\
|
|
better & 2 & 0 & 0 & 3 & 0 & 0 \\
|
|
Much better & 0 & 0 & 0 & 0 & 0 & 0
|
|
\end{tabular}
|
|
\end{center}
|
|
Since there are 6 different categories, all $2^5 = 32$ different
|
|
combinations were explored, and as it turns out there are several
|
|
with a nearly identical improvement. The Gini and information
|
|
criteria make different ``random'' choices from this set of near
|
|
ties. For the Gini index, \emph{Sporty vs others} and
|
|
\emph{Compact/Small vs others} have improvements of 3.19 and 3.12,
|
|
respectively.
|
|
For the information index, the improvements are
|
|
6.21 versus 9.28.
|
|
Thus the Gini index picks the first rule and information the second.
|
|
Interestingly, the two splitting criteria arrive at exactly the
|
|
same final nodes, for the full tree, although by different paths.
|
|
(Compare the class counts of the terminal nodes).
|
|
|
|
We have said that for a categorical predictor with $m$ levels,
|
|
all $2^{m-1}$ different possible splits are tested.
|
|
When there are a large number of categories for the predictor,
|
|
the computational burden of evaluating all of these subsets can become
|
|
large.
|
|
For instance, the call \Co{rpart(Reliability \textasciitilde\ ., data=car90)}
|
|
does not return for a \emph{long,} long time: one of the predictors
|
|
in that data set is an unordered factor with 30 levels!
|
|
Luckily, for any ordered outcome there is a computational
|
|
shortcut that allows the program to find the best split using
|
|
only $m-1$ comparisons. This includes the classification method
|
|
when there are only two categories, along with the
|
|
anova and Poisson methods to be introduced later.
|
|
|
|
|
|
\subsection{Example: Kyphosis data}
|
|
|
|
A third \Co{class} method example explores
|
|
the parameters \Co{prior} and \Co{loss}.
|
|
The dataset kyphosis has 81 rows representing data on 81 children who
|
|
have had corrective spinal surgery. The variables are:
|
|
$$
|
|
\begin{tabular}{ll}
|
|
Kyphosis & factor: postoperative deformity is present/absent \\
|
|
Age & numeric: age of child in months \\
|
|
Number & numeric: number of vertebrae involved in operation \\
|
|
Start & numeric: beginning of the range of vertebrae involved \\
|
|
\end{tabular}
|
|
$$
|
|
|
|
\begin{figure}
|
|
\myfig{longintro-kyphos}
|
|
\caption{Displays the rpart-based models for the
|
|
presence/absence of kyphosis. The figure
|
|
on the left uses the default {prior} (0.79,0.21) and {loss}; the
|
|
middle figure uses the user-defined {prior} (0.65,0.35) and default
|
|
{loss}; and the third figure uses the default {prior} and the
|
|
user-defined {loss} ($L(1,2)=4,\, L(2,1)=3$).}
|
|
\label{kyphfig}
|
|
\end{figure}
|
|
|
|
<<kyphos, fig=TRUE, include=FALSE>>=
|
|
lmat <- matrix(c(0,3, 4,0), nrow = 2, ncol = 2, byrow = FALSE)
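# loss matrix: rows = true class (absent, present); L(1,2) = 4, L(2,1) = 3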
|
|
fit1 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
|
|
|
|
fit2 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
|
|
parms = list(prior = c(0.65, 0.35)))
|
|
fit3 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
|
|
parms = list(loss = lmat))
|
|
|
|
par(mfrow = c(1, 3), mar = rep(0.1, 4))
|
|
plot(fit1); text(fit1, use.n = TRUE, all = TRUE, cex = 0.8)
|
|
plot(fit2); text(fit2, use.n = TRUE, all = TRUE, cex = 0.8)
|
|
plot(fit3); text(fit3, use.n = TRUE, all = TRUE, cex = 0.8)
|
|
@
|
|
|
|
This example shows how even the initial split changes depending on the
|
|
prior and loss that are specified. The first and third fits have the
|
|
same initial split (Start $< 8.5$), but the improvement differs. The
|
|
second fit splits Start at 12.5 which moves 46 people to the left
|
|
instead of 62.
|
|
|
|
Looking at the leftmost tree, we see that the sequence of
|
|
splits on the left hand branch yields only a single node classified
|
|
as {\em present}. For any loss greater than 4 to 3, the routine will
|
|
instead classify this node as {\em absent}, and the entire left side of
|
|
the tree collapses, as seen in the right hand figure.
|
|
This is not unusual --- the most common effect of alternate loss
|
|
matrices is to change the amount of pruning in the tree, more pruning in
|
|
some branches and less in others, rather than to change the
|
|
choice of splits.
|
|
|
|
The first node from the default tree is
|
|
\begin{verbatim}
|
|
|
|
Node number 1: 81 observations, complexity param=0.1765
|
|
predicted class= absent expected loss= 0.2099
|
|
class counts: 64 17
|
|
probabilities: 0.7901 0.2099
|
|
left son=2 (62 obs) right son=3 (19 obs)
|
|
Primary splits:
|
|
Start < 8.5 to the right, improve=6.762, (0 missing)
|
|
Number < 5.5 to the left, improve=2.867, (0 missing)
|
|
Age < 39.5 to the left, improve=2.250, (0 missing)
|
|
Surrogate splits:
|
|
Number < 6.5 to the left, agree=0.8025, (0 split)
|
|
\end{verbatim}
|
|
|
|
|
|
The fit using the prior (0.65,0.35) is
|
|
\begin{Verbatim}
|
|
Node number 1: 81 observations, complexity param=0.302
|
|
predicted class= absent expected loss= 0.35
|
|
class counts: 64 17
|
|
probabilities: 0.65 0.35
|
|
left son=2 (46 obs) right son=3 (35 obs)
|
|
Primary splits:
|
|
Start < 12.5 to the right, improve=10.900, (0 missing)
|
|
Number < 4.5 to the left, improve= 5.087, (0 missing)
|
|
Age < 39.5 to the left, improve= 4.635, (0 missing)
|
|
Surrogate splits:
|
|
Number < 3.5 to the left, agree=0.6667, (0 split)
|
|
\end{Verbatim}
|
|
|
|
And the first split under 4/3 losses is
|
|
|
|
\begin{Verbatim}
|
|
Node number 1: 81 observations, complexity param=0.01961
|
|
predicted class= absent expected loss= 0.6296
|
|
class counts: 64 17
|
|
probabilities: 0.7901 0.2099
|
|
left son=2 (62 obs) right son=3 (19 obs)
|
|
Primary splits:
|
|
Start < 8.5 to the right, improve=5.077, (0 missing)
|
|
Number < 5.5 to the left, improve=2.165, (0 missing)
|
|
Age < 39.5 to the left, improve=1.535, (0 missing)
|
|
Surrogate splits:
|
|
Number < 6.5 to the left, agree=0.8025, (0 split)
|
|
\end{Verbatim}
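
For two classes, losses and priors are interchangeable (section 3.2.2): the
4/3 loss matrix above corresponds to an altered prior, and the arithmetic is
simple enough to verify directly. The following is an illustrative sketch of
that computation, not part of the fitting code.
<<eval = FALSE>>=
## default priors are the observed frequencies (64/81, 17/81);
## with L(1,2) = 4 and L(2,1) = 3 the altered prior is pi_i L_i / sum(pi_j L_j)
p <- c(64, 17)/81
L <- c(4, 3)
p * L / sum(p * L)     # the prior that would give the same splits
@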
|
|
|
|
|
|
|
|
\section{Regression}
|
|
\label{method}
|
|
\subsection{Definition}
|
|
|
|
Up to this point the classification problem has been used to define and
|
|
motivate our formulae. However, the partitioning procedure is quite
|
|
general and can be extended by specifying 5 ``ingredients'':
|
|
\begin{itemize}
|
|
\item A splitting criterion, which is used to decide which variable
|
|
gives the best split.
|
|
For classification this was either the Gini or log-likelihood function.
|
|
In the \Co{anova} method the splitting criterion is
|
|
$SS_T - (SS_L + SS_R)$, where
|
|
$SS_T = \sum(y_i - \bar y)^2$
|
|
is the sum of squares for the node, and $SS_R$, $SS_L$ are the
|
|
sums of squares for the right and left son, respectively.
|
|
This is equivalent to choosing the split to maximize the between-groups
|
|
sum-of-squares in a simple analysis of variance. This rule is
|
|
identical to the regression option for {\tree}.
|
|
\item A summary statistic or vector, which is used to describe a
|
|
node. The first element of the vector is considered to be the
|
|
fitted value. For the anova method this is the mean of the node;
|
|
for classification the response is the predicted class followed by
|
|
the vector of class probabilities.
|
|
\item The error of a node. This will be the variance of $y$ for anova,
|
|
and the predicted loss for classification.
|
|
\item The prediction error for a new observation, assigned to the node.
|
|
For anova this is $(y_{new} - \bar y)$.
|
|
\item Any necessary initialization.
|
|
\end{itemize}
|
|
|
|
The \Co{anova} method leads to regression trees; it is the
|
|
default method if $y$ is a simple numeric vector, i.e., not a
|
|
factor, matrix, or survival object.
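
As a small illustration of the anova splitting criterion (a self-contained
sketch using simulated data, not part of the original report):
<<eval = FALSE>>=
## SS_T - (SS_L + SS_R) for one candidate split of y on x at the point 'cut'
ss <- function(y) sum((y - mean(y))^2)
anova.gain <- function(y, x, cut) {
    left <- x < cut
    ss(y) - (ss(y[left]) + ss(y[!left]))
}
set.seed(42)
x <- runif(100)
y <- ifelse(x < 0.5, 1, 3) + rnorm(100)
anova.gain(y, x, 0.5)     # large gain at the true change point
anova.gain(y, x, 0.9)     # much smaller gain elsewhere
@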
|
|
|
|
\subsection{Example: Consumer Report car data}
|
|
The dataset \Co{car90} contains a collection of variables from the
|
|
April, 1990 Consumer Reports; it has 34 variables on 111 cars.
|
|
We've excluded 3 variables: tire size and model name because they are factors
|
|
with a very large number of levels whose printout does not fit well in
|
|
this report's page size, and rim size because it is too good a predictor of
|
|
price and leads to a less interesting illustration. (Tiny cars are cheaper
|
|
and have small rims.)
|
|
<<>>=
|
|
cars <- car90[, -match(c("Rim", "Tires", "Model2"), names(car90))]
|
|
carfit <- rpart(Price/1000 ~ ., data=cars)
|
|
carfit
|
|
printcp(carfit)
|
|
@
|
|
|
|
<<echo=FALSE>>=
|
|
temp <- carfit$cptable
|
|
@
|
|
\begin{itemize}
|
|
\item The relative error is $1-{\rm R}^2$, similar to linear regression.
|
|
The xerror is related to the PRESS statistic.
|
|
The first split appears to improve the fit the most.
|
|
The last split adds little improvement to the apparent error,
|
|
and increases the cross-validated error.
|
|
\item The 1-SE rule would choose a tree with 3 splits.
|
|
\item This is a case where the default cp value of .01 may
|
|
have overpruned the tree, since the cross-validated error
|
|
is barely at a minimum. A rerun with the cp threshold at .001
|
|
gave little change, however.
|
|
\item For any CP value between \Sexpr{round(temp[3,1],2)} and
|
|
\Sexpr{round(temp[2,1],2)} the best
|
|
model has one split;
|
|
for any CP value between \Sexpr{round(temp[4,1],2)} and
|
|
\Sexpr{round(temp[3,1],2)} the best model is with
|
|
2 splits; and so on.
|
|
\end{itemize}
|
|
|
|
The \Co{print} and \Co{summary} commands also recognizes the \Co{cp} option, which
|
|
allows the user to look at only the top few splits.
|
|
|
|
<<>>=
|
|
summary(carfit, cp = 0.1)
|
|
@
|
|
|
|
The first split on displacement partitions the
|
|
\Sexpr{carfit$frame$n[1]} observations into groups of
|
|
\Sexpr{carfit$frame$n[2]} and \Sexpr{carfit$frame$n[3]}
|
|
(nodes 2 and 3) with mean prices of
|
|
\Sexpr{round(carfit$frame$yval[2])} and
|
|
\Sexpr{round(carfit$frame$yval[3])} thousand dollars, respectively.
|
|
|
|
\begin{itemize}
|
|
\item The improvement listed is the percent change in SS for this split,
|
|
i.e., $1 - (SS_{right} + SS_{left})/SS_{parent}$, which is the gain in
|
|
$R^2$ for the fit.
|
|
\item The data set has displacement of the engine in both cubic inches
|
|
(Disp) and liters (Disp2). The second is a perfect surrogate split for
|
|
the first, obviously.
|
|
\item The weight and displacement are very closely related,
|
|
as shown by the surrogate split agreement of 91\%.
|
|
\item Not all the countries are represented in node 3, e.g., there are no
|
|
larger cars from Brazil.
|
|
This is indicated by a \Co{-} in the list of split directions.
|
|
\end{itemize}
|
|
|
|
|
|
\begin{figure}
\myfig{longintro-anova2}
\caption{Both plots were obtained using the function
\Co{rsq.rpart(carfit)}. The figure on the left shows that the first split
offers the most information. The figure on the right suggests that
the tree should be pruned to include only 1 or 2 splits.}
\label{anovafig2}
\end{figure}
<<anova2, fig=TRUE, include=FALSE, echo=FALSE>>=
par(mfrow=c(1,2))
rsq.rpart(carfit)
par(mfrow=c(1,1))
@
Other plots can be used to help determine the best cp value for this
model. The function \Co{rsq.rpart} plots the jackknifed error versus
the number of splits. Of interest is the smallest error, but any
number of splits within the ``error bars'' (1-SE rule) is considered
a reasonable number of splits (in this case, 1 or 2 splits seem to be
sufficient). As is often true with modeling, simpler is usually
better. Another useful plot is the ${\rm R}^2$ versus number of
splits. The (1 - apparent error) and (1 - relative error) show how much
is gained with additional splits. This plot highlights the
differences between the ${\rm R}^2$ values.
\begin{figure}
\myfig{longintro-anova3}
\caption{This plot shows the (observed-expected) cost of cars
versus the predicted cost of cars based on the nodes/leaves in which the cars
landed. There appears to be more variability in node 7 than in some of
the other leaves.}
\label{anovafig3}
\end{figure}
Finally, it is possible to look at the residuals from this model, just
as with a regular linear regression fit, as shown in
figure \ref{anovafig3}, produced by the following code.
<<anova3, fig=TRUE, include=FALSE>>=
plot(predict(carfit), jitter(resid(carfit)))
temp <- carfit$frame[carfit$frame$var == '<leaf>',]
axis(3, at = temp$yval, as.character(row.names(temp)))
mtext('leaf number', side = 3, line = 3)
abline(h = 0, lty = 2)
@
\subsection{Example: Stage C data (\Co{anova} method)}
The stage C prostate cancer data of the earlier section can also
be fit using the anova method, by treating the status variable
as though it were continuous.
<<>>=
cfit2 <- rpart(pgstat ~ age + eet + g2 + grade + gleason + ploidy,
               data = stagec)

printcp(cfit2)

print(cfit2, cp = 0.03)
@
If this tree is compared to the earlier results, we see that it
has chosen exactly the same variables and split points as before.
The only addition is further splitting of node 2, the upper left
``No'' of figure \ref{fgini1}. This is no accident: for the two-class
case the Gini splitting rule reduces to $2p(1-p)$, which is
proportional to the variance of a node.

The two methods differ in their evaluation and pruning, however.
Note that nodes 4 and 5, the two children of node 2, contain
2/40 and 7/21 progressions, respectively. For classification
purposes both nodes have the same predicted value (No) and the
split will be discarded since the error (\# of misclassifications)
with and without the split is identical.
In the regression context the two predicted values of .05 and .33
{\em are} different --- the split has identified a nearly pure
subgroup of significant size.

This setup is known as \emph{odds regression}, and may be a more
sensible way to evaluate a split when the emphasis of the model
is on understanding/explanation rather than on prediction error
per se. Extension of this rule to the multiple class problem
is appealing, but has not yet been implemented in {\rpart}.
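
One quick way to see the equivalence of the split choices for a binary
outcome is to refit with the classification method and compare the root
splits; a sketch (not run):
<<eval=FALSE>>=
cfit3 <- rpart(factor(pgstat) ~ age + eet + g2 + grade + gleason + ploidy,
               data = stagec, method = 'class')
cfit3$frame$var[1]   # root split variable, classification fit
cfit2$frame$var[1]   # root split variable, anova fit above
@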
\section{Poisson regression}
\subsection{Definition}
The Poisson splitting method attempts to extend {\rpart} models to
event rate data.
The model in this case is
$$
\lambda = f(x)
$$
where $\lambda$ is an event rate and $x$ is some set of predictors. As an
example consider hip fracture rates. For each county in the United States
we can obtain
\begin{itemize}
\item number of fractures in patients age 65 or greater (from Medicare files)
\item population of the county (US census data)
\item potential predictors such as
\begin{itemize}
\item socio-economic indicators
\item number of days below freezing
\item ethnic mix
\item physicians/1000 population
\item etc.
\end{itemize}
\end{itemize}

Such data would usually be approached by using Poisson regression; can we
find a tree-based analogue?
In adding criteria for rates regression to this ensemble, the guiding
principle was the following: the between-groups sum-of-squares is not a
very robust measure, yet tree-based regression works fairly well for such
data.
So do the simplest (statistically valid) thing possible.

Let $c_i$ be the observed event count for observation $i$, $t_i$ be the
observation time, and $x_{ij}, j=1,\ldots,p$ be the predictors.
The $y$ variable for the program will be a 2 column matrix.

Splitting criterion: The likelihood ratio test for two Poisson groups
$$
D_{\hbox{parent}} - \left( D_{\hbox{left son}} + D_{\hbox{right son}} \right)
$$

Summary statistics: The observed event rate and the number of events.
$$
\hat\lambda = \frac{\hbox{\# events}}{\hbox{total time}}
            = \frac{\sum c_i}{\sum t_i}
$$

Error of a node: The within node deviance.
$$
D = \sum \left[ c_i \log \left(\frac{c_i}{\hat\lambda t_i} \right)
    - (c_i - \hat\lambda t_i) \right]
$$

Prediction error: The deviance contribution for a new observation, using
$\hat \lambda$ of the node as the predicted rate.
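
The deviance and the splitting criterion are simple enough to compute
directly. The sketch below (not run) does so using the unshrunk rate
$\hat\lambda$ of each node; the counts, times and candidate split are
invented purely for illustration.
<<eval=FALSE>>=
poisson.dev <- function(ci, ti) {
    lambda <- sum(ci)/sum(ti)       # unshrunk rate for the node
    sum(ifelse(ci == 0, 0, ci * log(ci/(lambda * ti))) - (ci - lambda * ti))
}
ci <- c(0, 1, 3, 0, 2, 5)           # made-up event counts
ti <- rep(1, 6)                     # made-up observation times
left <- 1:3; right <- 4:6           # a candidate split
poisson.dev(ci, ti) -
    (poisson.dev(ci[left], ti[left]) + poisson.dev(ci[right], ti[right]))
@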
\subsection{Improving the method}
There is a problem with the criterion just proposed, however:
cross-validation of a model often produces an infinite value for the
deviance. The simplest case where this occurs is easy to understand.
Assume that some terminal node of the tree has 20 subjects, but only 1 of
the 20 has experienced any events. The cross-validated error (deviance)
estimate for that node will have one subset --- the one where the subject
with an event is left out --- which has $\hat\lambda=0$.
When we use the prediction for the subjects who were set aside (10\% of the
data under the default 10-fold cross-validation),
the deviance contribution of the subject with an event is
$$
\ldots + c_i \log(c_i / 0) + \ldots
$$
which is infinite since $c_i >0$. The problem is that when $\hat\lambda=0$
the occurrence of an event is infinitely improbable, and, using the deviance
measure, the corresponding model is then infinitely bad.

One might expect this phenomenon to be fairly rare, but unfortunately it is
not so. One given of tree-based modeling is that a right-sized model is
arrived at by purposely over-fitting the data and then pruning back the
branches. A program that aborts due to a numeric exception during the first
stage is uninformative, to say the least.
Of more concern is that this edge effect does not seem to be limited to
the pathological case detailed above. Any near approach to the boundary value
$\lambda=0$ leads to large values of the deviance, and the procedure
tends to discourage any final node with a small number of events.

An ad hoc solution is to use the revised estimate
$$
\hat{\hat \lambda} = \max \left(\hat\lambda, \frac{k}{\sum t_i} \right)
$$
where $k$ is 1/2 or 1/6.
That is, pure nodes are given a partial event.
(This is similar to the starting estimates used in
the GLM program for a Poisson regression.) This is unsatisfying, however, and
we propose instead using a shrinkage estimate.

Assume that the true rates $\lambda_j$ for the leaves of the tree are
random values from a Gamma$(\mu, \sigma)$ distribution. Set
$\mu$ to the observed overall event rate $\sum c_i / \sum t_i$, and
let the user choose as a prior the coefficient of variation
$k =\sigma / \mu$. A value of $k=0$ represents extreme pessimism
(``the leaf nodes will all give the same result''), whereas $k=\infty$
represents extreme optimism. The Bayes estimate of the event rate for a
node works out to be
$$
\hat \lambda_k = \frac{\alpha + \sum c_i}{\beta + \sum t_i},
$$
where $\alpha= 1/k^2$ and $\beta = \alpha/ \hat\lambda$.

This estimate is scale invariant, has a simple interpretation, and
shrinks least those nodes with a large amount of information. In practice,
a value of $k=10$ does essentially no shrinkage.
For \Co{method='poisson'}, the optional parameters list is the single
number $k$, with a default value of 1.
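
In code, the shrunken estimate for a single node is only a few lines; a
sketch (not run), where \Co{lambda.hat} is the overall event rate used to
center the Gamma prior and \Co{k} is the prior coefficient of variation:
<<eval=FALSE>>=
shrunk.rate <- function(cvec, tvec, lambda.hat, k = 1) {
    alpha <- 1/k^2              # prior "events"
    beta  <- alpha/lambda.hat   # prior "time"
    (alpha + sum(cvec))/(beta + sum(tvec))
}
@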
\subsection{Example: solder.balance data}
The solder.balance data frame, as explained in the R help file, is a
dataset with 900 observations which are the results of an
experiment varying 5 factors relevant to the wave-soldering procedure
for mounting components on printed circuit boards. The full version of the
data (unbalanced) is available in the \Co{survival} package. The response
variable, skips, is a count of how many solder skips appeared upon
visual inspection. The other variables are listed below:

$$
\begin{tabular}{ll}
  Opening & factor: amount of clearance around
            the mounting pad (S $<$ M $<$ L) \\
  Solder  & factor: amount of solder used (Thin $<$ Thick) \\
  Mask    & factor: type of solder mask used (5 possible) \\
  PadType & factor: mounting pad used (10 possible) \\
  Panel   & factor: panel (1, 2 or 3) on board being counted \\
\end{tabular}
$$

In this call, the \Co{rpart.control} options are modified:
\Co{maxcompete = 2} means that only 2 other competing splits are
listed (default is 4); \Co{cp = .05} means that a smaller
tree will be built initially (default is .01).
The $y$ variable for Poisson partitioning may be a two column
matrix containing the observation time in column 1 and the
number of events in column 2, or it may be a vector of event
counts alone.
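
For reference, the matrix form of the response would look like the sketch
below (not run); the exposure variable \Co{days} is hypothetical, since the
solder data record only counts, which is why the example that follows uses
the simpler vector form.
<<eval=FALSE>>=
# `days' is a hypothetical exposure time, shown only to illustrate the syntax
rpart(cbind(days, skips) ~ Opening + Solder + Mask + PadType + Panel,
      data = solder.balance, method = 'poisson')
@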
<<>>=
sfit <- rpart(skips ~ Opening + Solder + Mask + PadType + Panel,
              data = solder.balance, method = 'poisson',
              control = rpart.control(cp = 0.05, maxcompete = 2))
sfit
@
\begin{itemize}
\item The response value is the expected event rate
(with a time variable), or in this case the expected number of skips.
The values are shrunk towards the global estimate of 5.53 skips/observation.
\item The deviance is the same as the null deviance (equivalently, the
residual deviance of an intercept-only model) that you'd get when
fitting a Poisson glm to the observations in that node.
\end{itemize}

<<>>=
summary(sfit, cp = 0.1)
@

\begin{itemize}
\item The improvement is ${\rm Deviance_{parent} - (Deviance_{left} +
Deviance_{right})}$, which is the likelihood ratio test
for comparing two Poisson samples.
\item The cross-validated error has been found to be
overly pessimistic when describing how much the error
is improved by each split. This is likely a consequence of the
boundary effect mentioned earlier, but more research is needed.
\item The variation xstd is not as useful, given the bias of xerror.
\end{itemize}
\begin{figure}
\myfig{longintro-poisson1}
\caption{The first figure shows the solder.balance data, fit with the
\Co{poisson} method, using a \Co{cp} value of 0.05. The second figure
shows the same fit, but with a \Co{cp} value of 0.10. The function
\Co{prune.rpart} was used to produce the smaller tree.}
\label{poissonfig1}
\end{figure}
<<poisson1, fig=TRUE, include=FALSE>>=
par(mar = rep(0.1, 4))
plot(sfit)
text(sfit, use.n = TRUE, min = 3)

fit.prune <- prune(sfit, cp = 0.10)
plot(fit.prune)
text(fit.prune, use.n = TRUE, min = 2)
@

The \Co{use.n = TRUE} option specifies that the number of events / total N
should be listed along with the predicted rate (number of events/person-years).
The function \Co{prune} trims the tree \Co{sfit} to the \Co{cp} value
$0.10$. The same tree could have been created by specifying
\Co{cp = 0.10} in the original call to {\rpart}.
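
A sketch (not run) of checking that claim, comparing the split structure of
the pruned tree with a direct refit at the larger threshold:
<<eval=FALSE>>=
fit.direct <- rpart(skips ~ Opening + Solder + Mask + PadType + Panel,
                    data = solder.balance, method = 'poisson',
                    control = rpart.control(cp = 0.10, maxcompete = 2))
identical(as.character(fit.prune$frame$var),
          as.character(fit.direct$frame$var))   # same splits expected
@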
\subsection{Example: Stage C Prostate cancer, survival method}
One special case of the Poisson model is of particular interest
for medical consulting (such as the authors do).
Assume that we have survival data, i.e., each subject has either 0 or 1 event.
Further, assume that the time values have been pre-scaled so as to
fit an exponential model. That is, stretch the time axis so that a
Kaplan-Meier plot of the data will be a straight line when plotted on
the logarithmic scale.
An approximate way to do this is
<<>>=
require(survival)
temp <- coxph(Surv(pgtime, pgstat) ~ 1, stagec)
newtime <- predict(temp, type = 'expected')
@
and then do the analysis using the \Co{newtime} variable.
(This replaces each time value by $\Lambda(t)$, where $\Lambda$ is the
cumulative hazard function.)
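
For example, the rescaled times could then be handed to the Poisson method
directly, pairing each subject's rescaled time with the 0/1 progression
indicator; a sketch (not run):
<<eval=FALSE>>=
sdata <- cbind(stagec, newtime = newtime)
pfit0 <- rpart(cbind(newtime, pgstat) ~ age + eet + g2 + grade +
                   gleason + ploidy, data = sdata, method = 'poisson')
@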

A slightly more sophisticated version of this, which we
will call \emph{exponential scaling},
gives a straight line curve for log(survival) under a
parametric exponential model.
The only difference from the approximate scaling above is that
a subject who is censored between observed death times
will receive ``credit'' for the intervening interval, i.e.,
we assume the baseline hazard to be linear between observed deaths.
If the data is pre-scaled in this way, then the
Poisson model above is equivalent to the \emph{local full likelihood}
tree model of LeBlanc and Crowley \cite{Leblanc92}.
They show that this model is more efficient than the earlier suggestion
of Therneau et al. \cite{Therneau90} to use the martingale residuals
from a Cox model as input to a regression tree (anova method).
Exponential scaling, or \Co{method = 'exp'}, is
the default if $y$ is a \Co{Surv} object.

Let us again return to the stage C cancer example.
Besides
the variables explained previously, we will use pgtime, which is time
to tumor progression.
<<exp3, fig=TRUE, include=FALSE>>=
require(survival)
pfit <- rpart(Surv(pgtime, pgstat) ~ age + eet + g2 + grade +
                  gleason + ploidy, data = stagec)
print(pfit)

pfit2 <- prune(pfit, cp = 0.016)
par(mar = rep(0.2, 4))
plot(pfit2, uniform = TRUE, branch = 0.4, compress = TRUE)
text(pfit2, use.n = TRUE)
@
Note that the primary split on grade is the same as when
status was used as a dichotomous endpoint,
but that the splits thereafter differ.

Suppose that we wish to simplify this tree, so that only
four splits remain.
Looking at the table of complexity parameters, we see that
\Co{prune(pfit, cp = 0.016)} would give the desired result,
which is shown in figure \ref{expfig3}.
From the figure, node 4 (the leftmost terminal node) has only 1 event
for 33 subjects, an event rate that is .133 times the
overall rate, and is defined by grade = 1--2 and
g2 $<$ 11.36.
\begin{figure}
\myfig{longintro-exp3}
\caption{Survival fit to the stage C prostate data.}
\label{expfig3}
\end{figure}
For a final summary of the model, it can be helpful to plot the
probability of survival based on the final bins in which the subjects
landed. New variables based on the rpart groupings can be created
using the \Co{where} component of the fit, as
shown in figure \ref{expfig4}.
We'll further prune the tree down to four nodes by removing the
split at node 6.
\begin{figure}
\myfig{longintro-exp4}
\caption{Survival plot based on snipped rpart object. The
probability of tumor progression is greatest in node 8, which has
patients who are older and have a higher initial tumor grade.}
\label{expfig4}
\end{figure}
<<exp4, fig=TRUE, include=FALSE>>=
temp <- snip.rpart(pfit2, 6)
km <- survfit(Surv(pgtime, pgstat) ~ temp$where, stagec)
plot(km, lty = 1:4, mark.time = FALSE,
     xlab = "Years", ylab = "Progression")
legend(10, 0.3, paste('node', c(4,5,6,7)), lty = 1:4)
@
\subsection{Open issues}
The default value of the shrinkage parameter $k$ is 1. This corresponds
to a prior coefficient of variation of 1 for the estimated $\lambda_j$.
We do not yet have nearly enough experience to decide if this is a good value.
(It does stop the $\log(0)$ message, though.)

Cross-validation does not work very well. The procedure gives very
conservative results, and quite often declares the no-split tree to
be the best. This may be another artifact of the edge effect.
\section{Plotting options}
\label{plotsect}

This section examines the various options that are available when
plotting an rpart object. For simplicity, the same model (data from
Example 1) will be used throughout.
You have doubtless already noticed the use of \texttt{par(mar =)}
in the examples.
The plot function for rpart uses the general \texttt{plot} function
to set up the plot region. By default, this leaves space for
axes, legends or titles on the bottom, left, and top.
Room for axes is not needed in general for rpart plots, and
for this report we also do not have top titles.
For the small plots in this report it was important to use all of
the page, so we reset these margins for each plot.
(Due to the way that Sweave works, each plot is a separate
environment, so the \texttt{par()} parameters do not persist from plot
to plot.)

The simplest labeled plot is called by using \Co{plot} and \Co{text}
without changing any of the defaults. This is useful for a first
look, but sometimes you'll want more information about each of the
nodes.
\begin{figure}
\myfig{longintro-plots1}
\caption{\Co{plot(fit); text(fit)}}
\label{plotfig1}
\end{figure}

<<plots1, fig=TRUE, include=FALSE>>=
fit <- rpart(pgstat ~ age + eet + g2 + grade + gleason + ploidy,
             stagec, control = rpart.control(cp = 0.025))
par(mar = rep(0.2, 4))
plot(fit)
text(fit)
@
\begin{figure}
\myfig{longintro-plots2}
\caption{\Co{plot(fit, uniform = TRUE); text(fit, use.n = TRUE, all = TRUE)}}
\label{plotfig2}
\end{figure}

The next plot has uniform stem lengths (\Co{uniform = TRUE}), has extra
information (\Co{use.n = TRUE}) specifying the number of subjects at each node,
and has labels on all the nodes, not just the terminal nodes (\Co{all = TRUE}).
<<plots2, fig=TRUE, include=FALSE>>=
par(mar = rep(0.2, 4))
plot(fit, uniform = TRUE)
text(fit, use.n = TRUE, all = TRUE)
@
\begin{figure}
\myfig{longintro-plots3}
\caption{\Co{plot(fit, branch = 0); text(fit, use.n = TRUE)}}
\label{plotfig3}
\end{figure}

Fancier plots can be created by modifying the \Co{branch} option,
which controls the shape of branches that connect a node to its
children. The default for the plots is to have square-shouldered trees
(\Co{branch = 1.0}). This can be taken to the other extreme with
no shoulders at all (\Co{branch = 0}).
<<plots3, fig=TRUE, include=FALSE>>=
par(mar = rep(0.2, 4))
plot(fit, branch = 0)
text(fit, use.n = TRUE)
@

\begin{figure}
\myfig{longintro-plots4}
\caption{\Co{plot(fit, branch = 0.4, uniform = TRUE, compress = TRUE)}}
\label{plotfig4}
\end{figure}
These options can be combined with others to create the plot that fits your
particular needs.
The default plot may be inefficient in its use of space: the terminal
nodes will always lie at x-coordinates of 1, 2, \ldots.
The \Co{compress} option attempts to improve this by overlapping some nodes.
It has little effect on figure \ref{plotfig4}, but in figure \ref{figdig}
it allows the lowest branch to ``tuck under'' those above.
If you want to play around with the
spacing with compress, try using \Co{nspace}, which regulates the space
between a terminal node and a split.

<<plots4, fig=TRUE, include=FALSE>>=
par(mar = rep(0.2, 4))
plot(fit, branch = 0.4, uniform = TRUE, compress = TRUE)
text(fit, all = TRUE, use.n = TRUE)
@
\begin{figure}
\myfig{longintro-plots5}
\caption{Fancier plot}
\label{plotfig5}
\end{figure}

Several of these options were combined into a function called
\Co{post.rpart}, whose default action of creating a .ps file
in the current directory is now rather dated.
Identical results to the function can be obtained by the collection
of options shown below; the result is displayed in figure \ref{plotfig5}.
The code is essentially
<<plots5, fig=TRUE, include=FALSE>>=
par(mar = rep(0.2, 4))
plot(fit, uniform = TRUE, branch = 0.2, compress = TRUE, margin = 0.1)
text(fit, all = TRUE, use.n = TRUE, fancy = TRUE, cex = 0.9)
@
The \Co{fancy} option of \Co{text} creates the ellipses and
rectangles, and moves the splitting rule to the midpoints of
the branches.
The \Co{margin} option shrinks the plotting region
slightly so that the \Co{text} boxes don't run over the edge of the
plot.
The \Co{branch} option makes the lines exit the ellipse
at a ``good'' angle.
A separate package, \texttt{rpart.plot}, carries these ideas much further.
\section{Other functions}
A more general approach to cross-validation can be obtained using the
\Co{xpred.rpart} function.
Given an rpart fit, a vector of $k$ complexity parameters,
and the desired number of cross-validation groups,
this function returns an $n$ by $k$ matrix containing the
predicted value $\hat y_{(-i)}$ for each subject, from the model
that was fit without that subject.
The $cp$ vector defaults to the geometric means of the $cp$
sequence for the pruned tree on the full data set.

Here is an example that uses the mean absolute deviation for
continuous data rather than the usual mean square error.
<<>>=
carfit <- rpart(Price/1000 ~ ., cars)
carfit$cptable

price2 <- cars$Price[!is.na(cars$Price)]/1000
temp <- xpred.rpart(carfit)
errmat <- price2 - temp
abserr <- colMeans(abs(errmat))
rbind(abserr, relative=abserr/mean(abs(price2-mean(price2))))
@
We see that on the absolute error scale the relative error
improvement is not quite so large,
though this is to be expected given that the optimal split was
not chosen to minimize absolute error.
\section{Test Cases}
\subsection{Classification}
The definitions for classification trees can get the most complex,
especially with respect to priors and loss matrices.
In this section we lay out a simple example, in great detail.
(This was done to debug the R functions.)

\begin{table}
  \centering
  \begin{tabular}{r|rrrrrrrr}
    y & 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 \\
    x1& 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\
    x2& 1 & 2 & 3 & 4 & 5 & 6 & 1 & 2 \\
    x3& NA& 22& 38& 12& NA& 48& 14& 32 \\
    \multicolumn{2}{c}{} \\ % tricky way to get a blank line with no vertical bar
    y & 3 & 1 & 2 & 3 & 1 & 2 & 1 \\
    x1& 9 & 10 & 11 & 12 & 13 & 14 & 15 \\
    x2& 3 & 4 & 5 & 6 & 1 & 2 & 3 \\
    x3& 40& NA & 30& 46 & 28 & 34 & 48 \\
  \end{tabular}
  \caption{Data set for the classification test case}
  \label{datatab}
\end{table}
Let $n=15$, and the data be as given in table \ref{datatab}.
The loss matrix is defined as
$$
L = \left( \begin{array}{ccc}
      0 & 2 & 2 \\
      2 & 0 & 6 \\
      1 & 1 & 0 \\ \end{array} \right)
$$
where rows represent the true class and columns the assigned class.
Thus the error in mistakenly calling a class 2 observation a 3 is quite
large in this data set.
The prior probabilities for the study are assumed to be
$\pi = .2, .3, .5$,
so class 1 is most prevalent in the input data ($n_i$=6, 5, and 4 observations,
respectively), but class 3 is the most prevalent in the external population.

Splits are chosen using the Gini index with altered priors,
as defined in equation (4.15) of Breiman et al.\ \cite{Breiman83}.
\begin{eqnarray*}
  \tilde\pi_1 &=& \pi_1(0+2+2) / \sum \tilde \pi_i = 4/21 \\
  \tilde\pi_2 &=& \pi_2(2+0+6) / \sum \tilde \pi_i = 12/21 \\
  \tilde\pi_3 &=& \pi_3(1+1+0) / \sum \tilde \pi_i = 5/21 \\
\end{eqnarray*}

For a given node $T$, the Gini impurity will be
$\sum_j p(j|T)[1-p(j|T)] = 1- \sum p(j|T)^2$,
where $p(j|T)$ is the expected proportion of class $j$ in the node:
$$
p(j|T) = \frac{\tilde\pi_j \, n_j(T)/n_j}{\sum_i \tilde\pi_i \, n_i(T)/n_i}.
$$

Starting with the top node,
for each possible predicted class we have the following loss
$$
\begin{tabular}{cc}
  predicted class & E(loss) \\
  1 & .2*0 + .3*2 + .5*1 = 1.1 \\
  2 & .2*2 + .3*0 + .5*1 = 0.9 \\
  3 & .2*2 + .3*6 + .5*0 = 2.2 \\
\end{tabular}
$$
The best choice is class 2, with an expected loss of 0.9.
The Gini impurity for the node, using altered priors, is
$G = 1 - (16+144+25)/21^2 = 256/441 \approx .5805$.
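
These numbers are easy to reproduce; a sketch (not run):
<<eval=FALSE>>=
prior <- c(.2, .3, .5)
loss <- matrix(c(0, 2, 2,
                 2, 0, 6,
                 1, 1, 0), nrow = 3, byrow = TRUE)
aprior <- prior * rowSums(loss)
aprior <- aprior/sum(aprior)  # altered priors 4/21, 12/21, 5/21
drop(prior %*% loss)          # expected loss 1.1, 0.9, 2.2 by predicted class
1 - sum(aprior^2)             # Gini impurity of the top node, 256/441
@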

Assume that variable $x1$ is split at 12.5, which is, as it turns out,
the optimal split point for that variable under the constraint that
at least 2 observations are in each terminal node.
Then the right node will have class counts of
(4,4,4) and the left node of (2,1,0).
For the right node (node 3 in the tree) \\
\begin{tabular}{rcl}
  $P(R)$ &=& .2(4/6) + .3(4/5) + .5(4/4) = 131/150 \\
         &=& \mbox{probability of the node (in the population)} \\
  $p(i|R)$ &=& (.2(4/6), .3(4/5), .5(4/4))/$P(R)$ \\
  $\widetilde P(R)$ &=& (4/21)(4/6) + (12/21)(4/5) + (5/21)(4/4)
                        = 259/315 = 37/45 \\
  $\tilde p(i|R)$ &=& [(4/21)(4/6), (12/21)(4/5), (5/21)(4/4)] (45/37) \\
  $G(R)$ &=& $1 - \sum \tilde p(i|R)^2 \approx .5832$
\end{tabular}

For the left node (node 2) \\
\begin{tabular}{rcl}
  $P(L)$ &=& .2(2/6) + .3(1/5) + .5(0/4) = 19/150 \\
  $p(i|L)$ &=& (.2(2/6), .3(1/5), 0)/$P(L)$ = (10/19, 9/19, 0) \\
  $\widetilde P(L)$ &=& $1- \widetilde P(R) = 8/45$ \\
  $\tilde p(i|L)$ &=& [(4/21)(2/6), (12/21)(1/5), (5/21)(0/4)]/
                      $\widetilde P(L)$ \\
  $G(L)$ &=& $1 - \sum \tilde p(i|L)^2 \approx .459$ \\
\end{tabular}

The total improvement for the split involves the
change in impurity between the parent and the two child nodes
$$
n(G - [\widetilde P(L)* G(L) + \widetilde P(R)* G(R)] )
\approx .2905
$$
where $n=15$ is the total sample size.
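
The same arithmetic in R, as a sketch (not run); the node labels follow the
text above:
<<eval=FALSE>>=
aprior <- c(4, 12, 5)/21     # altered priors
ntot <- c(6, 5, 4)           # class totals in the full data
gnode <- function(counts) {  # weight and impurity of a node, altered priors
    p <- aprior * counts/ntot
    list(P = sum(p), G = 1 - sum((p/sum(p))^2))
}
right <- gnode(c(4, 4, 4)); left <- gnode(c(2, 1, 0))
15 * ((1 - sum(aprior^2)) -
      (left$P * left$G + right$P * right$G))    # approximately .2905
@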

For variable $x2$ the best split occurs at 5.5, splitting the data
into left and right nodes with class counts of (2,2,1) and
(4,3,3), respectively.
Computations exactly like the above give an improvement of
1.912.

For variable $x3$ there are 3 missing values, and the computations are
similar to what would be found for a node further down the tree with
only 12/15 observations.
The best split point is at 3.6, giving class counts of
(3,4,0) and (1,0,4) in the left and right nodes, respectively.

For the right node (node 3 in the tree) \\
\begin{tabular}{rcl}
  $\widetilde P(R)$ &=& (4/21)(3/6) + (12/21)(4/5) + (5/21)(0/4) = 174/315 \\
  $\tilde p(i|R)$ &=& [(4/21)(3/6), (12/21)(4/5), (5/21)(0/4)] (315/174) \\
                  &=& (5/29, 24/29, 0) \\
  $G(R)$ &=& $1 - (25+576)/29^2 = 240/841 \approx .2854$
\end{tabular}

For the left node (node 2) \\
\begin{tabular}{rcl}
  $\widetilde P(L)$ &=& (4/21)(1/6) + (12/21)(0/5) + (5/21)(4/4) = 85/315 \\
  $\tilde p(i|L)$ &=& [(4/21)(1/6), (12/21)(0/5), (5/21)(4/4)] (315/85) \\
                  &=& (2/17, 0, 15/17) \\
  $G(L)$ &=& $1 - (4 + 225)/17^2 = 60/289 \approx .2076$ \\
\end{tabular}

The overall impurity for the node involves only 12 of the 15
observations, giving the following values for the top node: \\
\begin{tabular}{rcl}
  $\widetilde P(T)$ &=& 174/315 + 85/315 = 259/315 \\
  $\tilde p(i|T)$ &=& [(4/21)(4/6), (12/21)(4/5), (5/21)(4/4)] (315/259) \\
                  &=& (40/259, 144/259, 75/259) \\
  $G(T)$ &=& $1- (40^2 + 144^2 + 75^2)/259^2 = 39120/67081$
\end{tabular}

The total improvement for the split involves the impurity $G$ of all three
nodes, weighted by the probabilities of the nodes under the alternate priors.
$$
15 * \{(259/315)(39120/67081) - [(174/315)(240/841) + (85/315)(60/289)]\}
\approx 3.9876
$$
As is true in general with the rpart routines, variables with missing
values are penalized with respect to choosing a split -- a factor of
259/315 or about 12/15 in the case of $x3$ at the top node.
Table \ref{tab:cutpoint} shows the statistics for all of the cutpoints
of $x3$.
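
The corresponding computation for the $x3$ split, again as a sketch (not
run); only the 12 observations with $x3$ present enter, and the
missing-value penalty shows up through the node weights:
<<eval=FALSE>>=
aprior <- c(4, 12, 5)/21
ntot <- c(6, 5, 4)
wt <- function(counts) sum(aprior * counts/ntot)
gi <- function(counts) { p <- aprior * counts/ntot; 1 - sum((p/sum(p))^2) }
top <- c(4, 4, 4)                      # class counts among the 12 observed x3
kid1 <- c(3, 4, 0); kid2 <- c(1, 0, 4) # class counts in the two children
15 * (wt(top) * gi(top) -
      (wt(kid1) * gi(kid1) + wt(kid2) * gi(kid2)))   # approximately 3.99
@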

\begin{table}
  \centering
  \begin{tabular}{c|rrrrr}
    Cutpoint & $P(L)$ & $P(R)$ & $G(L)$ & $G(R)$ & $\Delta I$ \\ \hline
    1.3 & 0.03 & 0.97 & 0.00 & 0.56 & 0.55 \\
    1.8 & 0.06 & 0.94 & 0.00 & 0.53 & 1.14 \\
    2.5 & 0.18 & 0.82 & 0.46 & 0.57 & 0.45 \\
    2.9 & 0.21 & 0.79 & 0.50 & 0.53 & 0.73 \\
    3.1 & 0.32 & 0.68 & 0.42 & 0.56 & 1.01 \\
    3.3 & 0.44 & 0.56 & 0.34 & 0.52 & 1.96 \\
    3.6 & 0.55 & 0.45 & 0.29 & 0.21 & 3.99 \\
    3.9 & 0.61 & 0.39 & 0.41 & 0.26 & 2.64 \\
    4.3 & 0.67 & 0.33 & 0.48 & 0.33 & 1.56 \\
    4.7 & 0.73 & 0.27 & 0.53 & 0.45 & 0.74 \\
    4.8 & 0.79 & 0.21 & 0.56 & 0.00 & 0.55 \\
  \end{tabular}
  \caption{Cut points and statistics for variable $x3$, top node}
  \label{tab:cutpoint}
\end{table}

Because $x3$ has missing values, the next step is choice of a surrogate
split. Priors and losses currently play no role in the computations for
selecting surrogates. For all the prior computations, the effect of
priors is identical to that of adding case weights to the observations
such that the apparent frequencies are equal to the chosen priors;
since surrogate computations do account for case weights, one could argue
that they should also then make use of priors.
The argument has not yet been found compelling enough to add this to the
code.

Note to me: the cp is not correct for usesurrogate=0. The error
after a split is not (left error + right error) -- it also needs to
have a term (parent error for those obs that weren't split).

\bibliographystyle{plain}
\bibliography{refer}

\end{document}