579 lines
23 KiB
Plaintext
579 lines
23 KiB
Plaintext
|
\documentclass{article}
|
||
|
\usepackage[pdftex]{graphicx}
|
||
|
\usepackage{Sweave}
|
||
|
\usepackage{amsmath}
|
||
|
\addtolength{\textwidth}{1in}
|
||
|
\addtolength{\oddsidemargin}{-.5in}
|
||
|
\setlength{\evensidemargin}{\oddsidemargin}
|
||
|
|
||
|
\SweaveOpts{keep.source=TRUE, fig=FALSE}
|
||
|
%\VignetteIndexEntry{User Written Split Functions}
|
||
|
%\VignetteDepends{rpart}
|
||
|
% Ross Ihaka suggestions
|
||
|
\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em}
|
||
|
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
|
||
|
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
|
||
|
\fvset{listparameters={\setlength{\topsep}{0pt}}}
|
||
|
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
|
||
|
|
||
|
\newcommand{\myfig}[1]{\resizebox{\textwidth}{!}
|
||
|
{\includegraphics{#1.pdf}}}
|
||
|
\title{User written splitting functions for RPART}
|
||
|
\author{Terry Therneau \\Mayo Clinic}
|
||
|
|
||
|
\begin{document}
|
||
|
\maketitle
|
||
|
<<echo=FALSE>>=
|
||
|
options(continue=" ", width=60)
|
||
|
options(SweaveHooks=list(fig=function() par(mar=c(5.1, 4.1, .3, 1.1))))
|
||
|
@
|
||
|
\section{Splitting functions}
|
||
|
The rpart code was written in a modular fashion with the idea that the C
|
||
|
code would be extended to include more splitting functions.
|
||
|
As a consequence all of the splitting functions are coordinated through a
|
||
|
single structure in the file \texttt{func\_table.h}, shown below:
|
||
|
\begin{verbatim}
|
||
|
static struct {
|
||
|
int (*init_split)();
|
||
|
void (*choose_split)();
|
||
|
void (*eval)();
|
||
|
double (*error)();
|
||
|
} func_table [] = {
|
||
|
{anovainit, anova, anovass, anovapred},
|
||
|
{poissoninit, poisson, poissondev, poissonpred},
|
||
|
{giniinit, gini, ginidev, ginipred},
|
||
|
{usersplit_init, usersplit, usersplit_eval, usersplit_pred}
|
||
|
};
|
||
|
|
||
|
#define NUM_METHODS 4 /*size of the above structure */
|
||
|
\end{verbatim}
|
||
|
Adding a new splitting method requires the definition of four functions which
|
||
|
initialize, choose a split, compute the predicted value or values for a node,
|
||
|
and
|
||
|
compute a prediction error for a new observation assigned to the node.
|
||
|
The last of these is used only in cross-validation.
|
||
|
|
||
|
To add new splitting rules to the main routine
|
||
|
four new C functions need to be defined, the
|
||
|
\texttt{func\_table.h} file expanded, the R function \texttt{rpart} updated to
|
||
|
recognize the new
|
||
|
option (it calls the C routine with the row number of func\_table),
|
||
|
and finally an initialization function written.
|
||
|
See the \texttt{rpart.class}, \texttt{rpart.anova}, and \texttt{rpart.exp}
|
||
|
functions for examples of the latter.
|
||
|
The lion's share of the code, which deals with all the necessary bookkeeping,
|
||
|
remains unchanged.
|
||
|
|
||
|
An easier route is to add a user defined splitting rules as a set of R functions.
|
||
|
The resulting code will be considerably slower than the built-in functions;
|
||
|
nevertheless this allows an easy way to extend \texttt{rpart}, and to
|
||
|
test out new
|
||
|
ideas with less effort than additions to the C base.
|
||
|
For user defined splits only the first three functions are defined,
|
||
|
and cross-validation does not happen automatically.
|
||
|
(User defined methods are many times slower than the built in ones, so
|
||
|
it may be preferable to skip cross-validation during the initial evaluation
|
||
|
of a model.)
|
||
|
|
||
|
\section{Anova function}
|
||
|
As the first illustration we show how to add anova splitting as a user-written
|
||
|
addition.
|
||
|
This method is also one of those built into the C code, so we can also
|
||
|
check the correctness of the results.
|
||
|
|
||
|
\subsection{Initialization function}
|
||
|
<<>>=
|
||
|
itemp <- function(y, offset, parms, wt) {
|
||
|
if (is.matrix(y) && ncol(y) > 1)
|
||
|
stop("Matrix response not allowed")
|
||
|
if (!missing(parms) && length(parms) > 0)
|
||
|
warning("parameter argument ignored")
|
||
|
if (length(offset)) y <- y - offset
|
||
|
sfun <- function(yval, dev, wt, ylevel, digits ) {
|
||
|
paste(" mean=", format(signif(yval, digits)),
|
||
|
", MSE=" , format(signif(dev/wt, digits)),
|
||
|
sep = '')
|
||
|
}
|
||
|
environment(sfun) <- .GlobalEnv
|
||
|
list(y = c(y), parms = NULL, numresp = 1, numy = 1, summary = sfun)
|
||
|
}
|
||
|
@
|
||
|
|
||
|
On input the function will be called with
|
||
|
\begin{description}
|
||
|
\item[y] the response value as found in the formula.
|
||
|
Note that rpart will normally
|
||
|
have removed any observations with a missing response.
|
||
|
\item[offset] the offset term, if any, found on the right hand side of the formula
|
||
|
\item[parms] the vector or list (if any) supplied by the user as a
|
||
|
\texttt{parms} argument to the call.
|
||
|
\item[wt] the weight vector from the call, if any
|
||
|
\end{description}
|
||
|
|
||
|
The last two arguments are optional.
|
||
|
The initialization function needs to perform any data checks,
|
||
|
deal with the offset
|
||
|
vector if present, and return a list containing
|
||
|
\begin{description}
|
||
|
\item[y] the value of the response, possibly updated
|
||
|
\item[numy] the number of columns of y
|
||
|
\item[numresp] the length of the prediction vector for each node
|
||
|
\item[summary] optional: a function which will produce a 1--3 line summary for the
|
||
|
node, to be used by \texttt{summary.rpart}.
|
||
|
\item[print] optional: a function which will produce a one line summary for
|
||
|
the \texttt{print} function.
|
||
|
\item[text] optional: a function which will produce a short label for the node,
|
||
|
used by the \texttt{plot} function.
|
||
|
\end{description}
|
||
|
|
||
|
The parameters vector can contain whatever might be appropriate for
|
||
|
the method, e.g. the variance of a prior distribution.
|
||
|
The vector of parameters passed forward need not be he same as the set
|
||
|
passed into the routine.
|
||
|
In anova splitting there are no extra parameters.
|
||
|
|
||
|
The summary function will be called with arguments which are
|
||
|
\begin{itemize}
|
||
|
\item yval= the response value for the node
|
||
|
\item dev = the deviance for the node
|
||
|
\item wt = the sum of weights at the node (number of observations)
|
||
|
\item ylevel = the levels of y, if y is categorical
|
||
|
\item digits = the current setting for digits
|
||
|
\end{itemize}
|
||
|
It should return a character vector.
|
||
|
The text function will be called with these arguments plus two more
|
||
|
\begin{itemize}
|
||
|
\item n= the number of observations in the node
|
||
|
\item use.n = TRUE/FALSE, see the help page for \texttt{text.rpart}
|
||
|
\end{itemize}
|
||
|
The print function is called only with yval, ylevels, and digits.
|
||
|
|
||
|
The only puzzling line may be \texttt{environment(sfun) <- .GlobalEnv}.
|
||
|
R ensures that any function can access variables that are \emph{not}
|
||
|
passed in as arguments via a mechanism called environments.
|
||
|
As a trivial example
|
||
|
<<>>=
|
||
|
temp <- 4
|
||
|
fun1 <- function(x) {
|
||
|
q <- 15
|
||
|
z <- 10
|
||
|
fun2 <- function(y) y + z + temp
|
||
|
fun2(x^2)
|
||
|
}
|
||
|
fun1(5)
|
||
|
@
|
||
|
The definition of fun2 essentially contains a copy of \texttt{z} (and
|
||
|
\texttt{q} as well) to ensure that this works.
|
||
|
The exception to this is objects at the top level such as
|
||
|
\texttt{temp} above; the user is responsible
|
||
|
for the retention of those via their answer to the ``save'' question at
|
||
|
the end of an R session.
|
||
|
|
||
|
The consequence of this is that the summary function created by a
|
||
|
call to itemp will by default have copies of all of the external variables
|
||
|
that it \emph{might} make use of, in this case all of the input arguments.
|
||
|
The summary function is in fact self-contained and makes reference only to
|
||
|
its input arguments (doing otherwise is bad programming, in my
|
||
|
opinion) and so needs none of these.
|
||
|
Setting the environment of the function to \texttt{.GlobalEnv} in essence
|
||
|
tells R to treat the function as though it had been defined at top level,
|
||
|
i.e., the input prompt, and thus not attach these extraneous copies.
|
||
|
I first became aware of this issue when using a huge data set, and found that
|
||
|
the saved output of the fit took up an inordinate amount of disk space due
|
||
|
to data copies attached to the summary function.
|
||
|
|
||
|
Do not take this as a critique of R environments in general.
|
||
|
The rules governing them need to be responsive to a large ensemble of
|
||
|
conditions; in this particular case the results are not ideal but in the main
|
||
|
they are the best compromise possible.
|
||
|
|
||
|
\subsection{Evaluation function}
|
||
|
The evaluation function is called once per node.
|
||
|
It needs to produce a deviance for the node along with a vector to be
|
||
|
used as a label for the node.
|
||
|
<<>>=
|
||
|
etemp <- function(y, wt, parms) {
|
||
|
wmean <- sum(y*wt)/sum(wt)
|
||
|
rss <- sum(wt*(y-wmean)^2)
|
||
|
list(label = wmean, deviance = rss)
|
||
|
}
|
||
|
@
|
||
|
|
||
|
As an example of a longer label, the gini splitting method for
|
||
|
categorical outcomes returns both the chosen group and the full vector
|
||
|
of estimated group probabilities.
|
||
|
The deviance value that is returned does not need to be an actual
|
||
|
deviance associated with a log-likelihood, any impurity measure for
|
||
|
the node will work.
|
||
|
However, the pruning algorithm used during tree construction will be most
|
||
|
efficient if the value 0 corresponds to a perfectly pure node.
|
||
|
(The pruning code decides when a branch would be futile and further
|
||
|
computations on it can thus be avoided.)
|
||
|
|
||
|
\subsection{Splitting function}
|
||
|
The splitting function is where the work lies.
|
||
|
It will be called once for every covariate at each potential split.
|
||
|
The input arguments are
|
||
|
\begin{description}
|
||
|
\item[y] vector or matrix of response values
|
||
|
\item[wt] vector of weights
|
||
|
\item[x] vector of x values
|
||
|
\item[parms] vector of user parameters, passed forward
|
||
|
\item[continuous] if TRUE the x variable should be treated as continuous
|
||
|
\end{description}
|
||
|
|
||
|
The data will have been subset to include only the non-missing subset
|
||
|
of x and y at the node, and if x is continuous all values will be in the
|
||
|
sorted order of the x variable.
|
||
|
If x is continuous the routine should return two vectors each of
|
||
|
length $n-1$, where $n$ is the length of x:
|
||
|
\begin{description}
|
||
|
\item[goodness] the utility of the split, where larger numbers are
|
||
|
better. A value of 0 signifies that no worthwhile split could be
|
||
|
found (for instance if y were a constant). The $i$th value of
|
||
|
goodness compares a split of observations 1 to $i$ versus $i+1$ to $n$.
|
||
|
\item[direction] A vector of the same length with values of -1
|
||
|
and +1, where -1 suggests that values with y $<$ cutpoint be
|
||
|
sent to the left side of the tree, and a value of +1 that values
|
||
|
with y $<$ cutpoint be sent to the right.
|
||
|
This is not critical, but sending larger values of y to the
|
||
|
right, as is done in the code below, seems to make the final tree
|
||
|
easier to read.
|
||
|
\end{description}
|
||
|
|
||
|
The reason for returning an entire vector of goodness values is that
|
||
|
the parent code is responsible for handling the minimum node size
|
||
|
constraint, and also for dealing with ties. When x is continuous the
|
||
|
split routine actually has no reason to look at x.
|
||
|
|
||
|
<<>>=
|
||
|
stemp <- function(y, wt, x, parms, continuous)
|
||
|
{
|
||
|
# Center y
|
||
|
n <- length(y)
|
||
|
y <- y- sum(y*wt)/sum(wt)
|
||
|
|
||
|
if (continuous) {
|
||
|
# continuous x variable
|
||
|
temp <- cumsum(y*wt)[-n]
|
||
|
left.wt <- cumsum(wt)[-n]
|
||
|
right.wt <- sum(wt) - left.wt
|
||
|
lmean <- temp/left.wt
|
||
|
rmean <- -temp/right.wt
|
||
|
goodness <- (left.wt*lmean^2 + right.wt*rmean^2)/sum(wt*y^2)
|
||
|
list(goodness = goodness, direction = sign(lmean))
|
||
|
} else {
|
||
|
# Categorical X variable
|
||
|
ux <- sort(unique(x))
|
||
|
wtsum <- tapply(wt, x, sum)
|
||
|
ysum <- tapply(y*wt, x, sum)
|
||
|
means <- ysum/wtsum
|
||
|
|
||
|
# For anova splits, we can order the categories by their means
|
||
|
# then use the same code as for a non-categorical
|
||
|
ord <- order(means)
|
||
|
n <- length(ord)
|
||
|
temp <- cumsum(ysum[ord])[-n]
|
||
|
left.wt <- cumsum(wtsum[ord])[-n]
|
||
|
right.wt <- sum(wt) - left.wt
|
||
|
lmean <- temp/left.wt
|
||
|
rmean <- -temp/right.wt
|
||
|
list(goodness= (left.wt*lmean^2 + right.wt*rmean^2)/sum(wt*y^2),
|
||
|
direction = ux[ord])
|
||
|
}
|
||
|
}
|
||
|
@
|
||
|
|
||
|
The code above does the computations for all the split points at once
|
||
|
by making use of two tricks. The first is to center the y values
|
||
|
at zero (so the grand mean is zero), and the second takes advantage of the
|
||
|
many ways to write the ``effect'' sum of squares for a simple two
|
||
|
group anova.
|
||
|
The key identity is
|
||
|
\begin{equation}
|
||
|
\sum (y_i - c)^2 = \sum (y_i - \overline y)^2 + n(c - \overline y)^2
|
||
|
\end{equation}
|
||
|
If you have an old enough statistics book, this is used to show that
|
||
|
for a 1-way anova the between groups sum of squares $SS_B$ is
|
||
|
\begin{equation}
|
||
|
SS_B = n_l(\overline y_l - \overline y)^2 +
|
||
|
n_r(\overline y_r - \overline y)^2 \label{ssb}
|
||
|
\end{equation}
|
||
|
where $n_l$ is the number of observations in the left group,
|
||
|
$\overline y_l$ the
|
||
|
mean of $y$ in the left group, $\overline y$ the overall mean,
|
||
|
and $n_r$ $\overline y_r$ the corresponding terms for the right hand group.
|
||
|
|
||
|
Centering at zero makes $\overline y$ zero in \eqref{ssb}, and the terms
|
||
|
then can be computed for all splits at once using the cumsum function.
|
||
|
Extension of the formulas to case weights is left as an exercise for
|
||
|
the reader.
|
||
|
|
||
|
If the predictor variable x is categorical with $k$ classes then there are
|
||
|
potentially $2^{k-1} -1$ different ways to split the node.
|
||
|
However, for most splitting rules the optimal rule can be found by first
|
||
|
ordering the groups by their average y value and then using the
|
||
|
usual splitting rule on this ordered variable.
|
||
|
For user mode rules this is assumed to be the case.
|
||
|
The variable x is supplied as integer values 1, 2, \ldots, k.
|
||
|
On return the direction vector should have $k$ values giving the ordering
|
||
|
of the groups, and the goodness vector $k-1$ values giving the utility of
|
||
|
the splits.
|
||
|
|
||
|
|
||
|
\subsection{Test}
|
||
|
We can test the above code to make sure that it gives the same answer as
|
||
|
the built-in anova splitting method.
|
||
|
<<>>=
|
||
|
library(rpart)
|
||
|
mystate <- data.frame(state.x77, region=state.region)
|
||
|
names(mystate) <- casefold(names(mystate)) #remove mixed case
|
||
|
ulist <- list(eval = etemp, split = stemp, init = itemp)
|
||
|
fit1 <- rpart(murder ~ population + illiteracy + income + life.exp +
|
||
|
hs.grad + frost + region, data = mystate,
|
||
|
method = ulist, minsplit = 10)
|
||
|
fit2 <- rpart(murder ~ population + illiteracy + income + life.exp +
|
||
|
hs.grad + frost + region, data = mystate,
|
||
|
method = 'anova', minsplit = 10, xval = 0)
|
||
|
all.equal(fit1$frame, fit2$frame)
|
||
|
all.equal(fit1$splits, fit2$splits)
|
||
|
all.equal(fit1$csplit, fit2$csplit)
|
||
|
all.equal(fit1$where, fit2$where)
|
||
|
all.equal(fit1$cptable, fit2$cptable)
|
||
|
@
|
||
|
The all.equal test can't be done on fit1 vs fit2 as a whole since their
|
||
|
\texttt{call} component will differ.
|
||
|
|
||
|
\section{Cross validation}
|
||
|
To do cross-validation on user written rules one needs to use the
|
||
|
\texttt{xpred.rpart} routine.
|
||
|
This routine returns the predicted value(s) for each observation,
|
||
|
predicted from a fit that did not include that observation.
|
||
|
The result is a matrix with one row per subject and one column for
|
||
|
each complexity parameter value.
|
||
|
As an example we will replicate the cross-validation results
|
||
|
for the anova model above.
|
||
|
In order to get the same groupings we fix the xval group membership
|
||
|
in advance.
|
||
|
<<>>=
|
||
|
xgroup <- rep(1:10, length = nrow(mystate))
|
||
|
xfit <- xpred.rpart(fit1, xgroup)
|
||
|
xerror <- colMeans((xfit - mystate$murder)^2)
|
||
|
|
||
|
fit2b <- rpart(murder ~ population + illiteracy + income + life.exp +
|
||
|
hs.grad + frost + region, data = mystate,
|
||
|
method = 'anova', minsplit = 10, xval = xgroup)
|
||
|
topnode.error <- (fit2b$frame$dev/fit2b$frame$wt)[1]
|
||
|
|
||
|
xerror.relative <- xerror/topnode.error
|
||
|
all.equal(xerror.relative, fit2b$cptable[, 4], check.attributes = FALSE)
|
||
|
@
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
\begin{figure}
|
||
|
\myfig{usercode-fig1}
|
||
|
\caption{Goodness of split for predicting income from illiteracy, using
|
||
|
the state data.}
|
||
|
\label{fig:state}
|
||
|
\end{figure}
|
||
|
|
||
|
<<fig1, echo=FALSE, fig=TRUE, include=FALSE>>=
|
||
|
tdata <- mystate[order(mystate$illiteracy), ]
|
||
|
n <- nrow(tdata)
|
||
|
temp <- stemp(tdata$income, wt = rep(1, n), tdata$illiteracy,
|
||
|
parms = NULL, continuous = TRUE)
|
||
|
xx <- (tdata$illiteracy[-1] + tdata$illiteracy[-n])/2
|
||
|
plot(xx, temp$goodness, xlab = "Illiteracy cutpoint",
|
||
|
ylab = "Goodness of split")
|
||
|
lines(smooth.spline(xx, temp$goodness, df = 4), lwd = 2, lty = 2)
|
||
|
@
|
||
|
|
||
|
\subsection{Smoothed anova}
|
||
|
For any particular covariate consider a plot of
|
||
|
x= split point vs y= goodness of split;
|
||
|
figure \ref{fig:state} shows an example for the state data
|
||
|
set, along with a smoothed curve.
|
||
|
The plot points are very erratic.
|
||
|
At one time we entertained the idea that a more stable split
|
||
|
point would be found by finding the maximum value for a
|
||
|
smoothed version of the plot.
|
||
|
Exactly how to smooth is of course a major issue in such an
|
||
|
endeavor.
|
||
|
|
||
|
Modification of the anova splitting routine to test this
|
||
|
is quite easy --- simply add a few lines to the stemp routine:
|
||
|
\begin{verbatim}
|
||
|
...
|
||
|
goodness= (left.wt*lmean^2 + right.wt*rmean^2)/sum(wt*y^2)
|
||
|
rx <- rank(x[-1]) #use only the ranks of x, to preserve invariance
|
||
|
fit <- smooth.spline(rx, goodness, df=4)
|
||
|
list(goodness= predict(fit, rx)$y, direction=sign(lmean))
|
||
|
}
|
||
|
else {
|
||
|
...
|
||
|
\end{verbatim}
|
||
|
|
||
|
\section{Alternating logistic regression}
|
||
|
Chen \cite{Chen07} used a mixed logistic regression model to
|
||
|
fit clinical and genetic marker data.
|
||
|
\begin{equation}
|
||
|
E(y) = \frac{e^{\eta_1 + \eta_2}}{1+e^{\eta_1 + \eta_2}}
|
||
|
\end{equation}
|
||
|
where $\eta_1 = X\beta$ is a standard linear predictor based
|
||
|
on clinical variables $X$ and $\eta_2$ is based on a tree model
|
||
|
using a set of genetic markers $Z$.
|
||
|
The solution was computed using alternate steps of an ordinary
|
||
|
logistic regression on $X$ treating $\eta_2$ as an offset,
|
||
|
and an rpart fit for $Z$ treating $\eta_1$ as an offset.
|
||
|
|
||
|
For many splitting rules the initialization function can take
|
||
|
care of offset terms once and for all by modifying the y
|
||
|
vector, but this is not the case for logistic regression.
|
||
|
In order to make the offset visible to the splitting function
|
||
|
we create a 2 column ``response'' consisting of the original
|
||
|
y vector in the first column and the offset in the second.
|
||
|
<<>>=
|
||
|
loginit <- function(y, offset, parms, wt)
|
||
|
{
|
||
|
if (is.null(offset)) offset <- 0
|
||
|
if (any(y != 0 & y != 1)) stop ('response must be 0/1')
|
||
|
|
||
|
sfun <- function(yval, dev, wt, ylevel, digits ) {
|
||
|
paste("events=", round(yval[,1]),
|
||
|
", coef= ", format(signif(yval[,2], digits)),
|
||
|
", deviance=" , format(signif(dev, digits)),
|
||
|
sep = '')}
|
||
|
environment(sfun) <- .GlobalEnv
|
||
|
list(y = cbind(y, offset), parms = 0, numresp = 2, numy = 2,
|
||
|
summary = sfun)
|
||
|
}
|
||
|
|
||
|
logeval <- function(y, wt, parms)
|
||
|
{
|
||
|
tfit <- glm(y[,1] ~ offset(y[,2]), binomial, weight = wt)
|
||
|
list(label= c(sum(y[,1]), tfit$coef), deviance = tfit$deviance)
|
||
|
}
|
||
|
@
|
||
|
The evaluation function returns the number of positive responses
|
||
|
along with the fitted coefficient.
|
||
|
|
||
|
The splitting function attempts every possible logistic regression.
|
||
|
A much faster version could almost certainly be written based on
|
||
|
a score test, however.
|
||
|
<<>>=
|
||
|
logsplit <- function(y, wt, x, parms, continuous)
|
||
|
{
|
||
|
if (continuous) {
|
||
|
# continuous x variable: do all the logistic regressions
|
||
|
n <- nrow(y)
|
||
|
goodness <- double(n-1)
|
||
|
direction <- goodness
|
||
|
temp <- rep(0, n)
|
||
|
for (i in 1:(n-1)) {
|
||
|
temp[i] <- 1
|
||
|
if (x[i] != x[i+1]) {
|
||
|
tfit <- glm(y[,1] ~ temp + offset(y[,2]), binomial, weight = wt)
|
||
|
goodness[i] <- tfit$null.deviance - tfit$deviance
|
||
|
direction[i] <- sign(tfit$coef[2])
|
||
|
}
|
||
|
}
|
||
|
} else {
|
||
|
# Categorical X variable
|
||
|
# First, find out what order to put the categories in, which
|
||
|
# will be the order of the coefficients in this model
|
||
|
tfit <- glm(y[,1] ~ factor(x) + offset(y[,2]) - 1, binomial, weight = wt)
|
||
|
ngrp <- length(tfit$coef)
|
||
|
direction <- rank(rank(tfit$coef) + runif(ngrp, 0, 0.1)) #break ties
|
||
|
# breaking ties -- if 2 groups have exactly the same p-hat, it
|
||
|
# does not matter which order I consider them in. And the calling
|
||
|
# routine wants an ordering vector.
|
||
|
#
|
||
|
xx <- direction[match(x, sort(unique(x)))] #relabel from small to large
|
||
|
goodness <- double(length(direction) - 1)
|
||
|
for (i in 1:length(goodness)) {
|
||
|
tfit <- glm(y[,1] ~ I(xx > i) + offset(y[,2]), binomial, weight = wt)
|
||
|
goodness[i] <- tfit$null.deviance - tfit$deviance
|
||
|
}
|
||
|
}
|
||
|
list(goodness=goodness, direction=direction)
|
||
|
}
|
||
|
@
|
||
|
|
||
|
\section{Categorical splits}
|
||
|
A common question from users concerns the splitting rules for a categorical
|
||
|
predictor.
|
||
|
If a categorical predictor has 5 levels, why don't we need to evaluate all
|
||
|
$2^4 = 16$ possible ways of splitting this into two groups?
|
||
|
The result is based on a fairly simple observation.
|
||
|
Think of the choice of a split as two problems: selection of a pair of
|
||
|
predicted values $p1$ and $p2$, and selection of the observations into
|
||
|
two groups $g1$ and $g2$.
|
||
|
Then the problem is to find
|
||
|
\begin{equation*}
|
||
|
\min_{g1, g2} \left[\min_{p1, p2} \sum_{i \in g1} E(y_i, p1) +
|
||
|
\sum_{i \in g2} E(y_i, p2) \right]
|
||
|
\end{equation*}
|
||
|
where $E$ is our error in prediction function.
|
||
|
The inner optimization is trivial for most problems, so we normally
|
||
|
don't think about it.
|
||
|
For a continuous $y$ variable for instance the error is
|
||
|
$E(y,p) = (y-p)^2$ and $p1$ is the mean $y$ value of all subjects in
|
||
|
group 1.
|
||
|
|
||
|
Now reverse the order of the above equation, and instead think of the
|
||
|
``outer loop'' of the optimization as first a selection of the two predicted
|
||
|
values $p1$ and $p2$, followed by finding the optimal grouping for those
|
||
|
predicted values.
|
||
|
When the predictor $x$ is categorical this reduces to checking
|
||
|
whether $\sum E(y, p1)$ or $\sum E(y, p2)$ is smaller,
|
||
|
when summed over all the observations having each particular level of $x$,
|
||
|
and then assigning that level to group $g1$ or $g2$ accordingly.
|
||
|
|
||
|
For squared error loss it is easy to see that this last step will be to
|
||
|
assign a subset to group $g1$ if mean($y$), over that subset, is closer to
|
||
|
$p1$ than to $p2$.
|
||
|
Thus, whatever the values of $p1$ and $p2$, the best split of the data
|
||
|
divides the groups according to low versus high values of $\overline y$.
|
||
|
If $x$ has $k$ levels, this shows that the only \emph{admissible} splits
|
||
|
are the $k-1$ groupings that first order the data according to the
|
||
|
within level means: as $p1$ and $p2$ range over the entire real plane these
|
||
|
are the only inner loop solutions that will arise.
|
||
|
Knowing this, the computer code does not have to try all values of
|
||
|
$p1$ and $p2$, it uses the first form of the minimum but can restrict the
|
||
|
outer loop to this small set of cases.
|
||
|
|
||
|
Each different type of $y$ value and loss function requires it's own version
|
||
|
of the paragraph just above. For Poisson loss the optimal ordering turns out
|
||
|
to be by the within level event rate, for instance.
|
||
|
A similar partition can be shown for many loss functions.
|
||
|
For categorical $y$ and the Gini criterion the result is particularly
|
||
|
interesting.
|
||
|
Assume that the response $y$ has $m$ levels, and summarize each of the
|
||
|
distinct groups of observations corresponding to a categorical predictor
|
||
|
$x$ by it proportion vector: the proportion with $y=1$, proportion with $y=2$
|
||
|
etc. If $x$ has $k$ levels, these proportions are $k$ points lying on the
|
||
|
$m$ dimensional simplex,
|
||
|
and the set of admissible splits are all partitions of these points by a
|
||
|
plane. For the information criterion the points lie on a curved surface in
|
||
|
$m$ space but again the admissible set are all partitions by a plane.
|
||
|
When $m=2$ the points lie on a 1-dimensional curve and the problem reduces to
|
||
|
ordering the groups by mean($y$).
|
||
|
For $m>2$ the total number of admissible partitions of $k$ points can be much
|
||
|
smaller than $2^k$ and efficient methods for enumerating
|
||
|
them have been described \cite{Sleumer}, but this has not been
|
||
|
incorporated into rpart.
|
||
|
|
||
|
\begin{thebibliography}{9}
|
||
|
\bibitem{Chen07} Chen J, Yu K, Hsing A and Therneau TM.
|
||
|
\emph{A partially linear tree-based regression model for assessing complex
|
||
|
joint gene-gene and gene-environment effects},
|
||
|
Genet Epidemiol, 2007(3), 238-51.
|
||
|
\bibitem{Sleumer} Nora Sleumer, \emph{Hyperplane arrangements: construction
|
||
|
visualization and applications}, PhD dissertation, Swiss Federal Institute
|
||
|
of Technology, 1969.
|
||
|
\end{thebibliography}
|
||
|
\end{document}
|