\documentclass[nojss]{jss}
\usepackage[english]{babel}
%\documentclass[fleqn, a4paper]{article}
%\usepackage{a4wide}
%\usepackage[round,longnamesfirst]{natbib}
%\usepackage{graphicx,keyval,thumbpdf,url}
%\usepackage{hyperref}
%\usepackage{Sweave}
\SweaveOpts{strip.white=true}
\AtBeginDocument{\setkeys{Gin}{width=0.6\textwidth}}
\usepackage[utf8]{inputenc}
%% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{amsmath}
\usepackage{amsfonts}
%\newcommand{\strong}[1]{{\normalfont\fontseries{b}\selectfont #1}}
\newcommand{\class}[1]{\mbox{\textsf{#1}}}
\newcommand{\func}[1]{\mbox{\texttt{#1()}}}
%\newcommand{\code}[1]{\mbox{\texttt{#1}}}
%\newcommand{\pkg}[1]{\strong{#1}}
\newcommand{\samp}[1]{`\mbox{\texttt{#1}}'}
%\newcommand{\proglang}[1]{\textsf{#1}}
\newcommand{\set}[1]{\mathcal{#1}}
\newcommand{\sQuote}[1]{`{#1}'}
\newcommand{\dQuote}[1]{``{#1}''}
\newcommand\R{{\mathbb{R}}}
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
%% almost as usual
\author{Michael Hahsler\\Southern Methodist University \And
Kurt Hornik\\Wirtschaftsuniversit\"at Wien \AND
Christian Buchta\\Wirtschaftsuniversit\"at Wien}
\title{Getting Things in Order:\\
An Introduction to the \proglang{R}~Package~\pkg{seriation}}
%% for pretty printing and a nice hypersummary also set:
\Plainauthor{Michael Hahsler, Kurt Hornik, Christian Buchta} %% comma-separated
\Plaintitle{Getting Things in Order: An Introduction to the R Package seriation} %% without formatting
\Shorttitle{Getting Things in Order} %% a short title (if necessary)
%% an abstract and keywords
\Abstract{Seriation, i.e., finding a suitable linear order for a set of
objects given data and a loss or merit function, is a basic problem in
data analysis. Due to the problem's combinatorial nature, it is
hard to solve for all but very small sets. Nevertheless, both exact
solution methods and heuristics are available. In this paper we
present the package~\pkg{seriation} which provides an infrastructure
for seriation with \proglang{R}. The infrastructure comprises data
structures to represent linear orders as permutation vectors, a wide
array of seriation methods using a consistent interface, a method to
calculate the value of various loss and merit functions, and several
visualization techniques which build on seriation. To illustrate how
easily the package can be applied for a variety of applications, a
comprehensive collection of examples is presented.}
\Keywords{combinatorial data analysis, seriation, permutation, \proglang{R}}
\Plainkeywords{combinatorial data analysis, seriation, permutation, R} %% without formatting
\Address{
Michael Hahsler\\
Engineering Management, Information, and Systems\\
Lyle School of Engineering\\
Southern Methodist University\\
P.O. Box 750123 \\
Dallas, TX 75275-0123\\
E-mail: \email{mhahsler@lyle.smu.edu}\\
URL: \url{http://lyle.smu.edu/~mhahsler}

Kurt Hornik\\
Department f\"ur Statistik \& Mathematik\\
Wirtschaftsuniversit\"at Wien\\
1090 Wien, Austria\\
E-mail: \email{kurt.hornik@wu.ac.at}\\
URL: \url{http://statmath.wu.ac.at/~hornik/}

Christian Buchta\\
Department f\"ur Welthandel\\
Wirtschaftsuniversit\"at Wien\\
1090 Wien, Austria\\
E-mail: \email{christian.buchta@wu.ac.at}\\
URL: \url{http://www.wu.ac.at/itf/institute/staff/buchta}
}
\hyphenation{Brusco}
\sloppy
%% \VignetteIndexEntry{An Introduction to the R package seriation}
\begin{document}
%\title{Getting Things in Order: An introduction to the
%R~package~\pkg{seriation}}
%\author{Michael Hahsler, Kurt Hornik and Christian Buchta}
\maketitle
%\abstract{Seriation, i.e., finding a suitable linear order for a set of
% objects given data and a loss or merit function, is a basic problem in
% data analysis. Caused by the problem's combinatorial nature, it is
% hard to solve for all but very small sets. Nevertheless, both exact
% solution methods and heuristics are available. In this paper we
% present the package~\pkg{seriation} which provides an infrastructure
% for seriation with \proglang{R}. The infrastructure comprises data
% structures to represent linear orders as permutation vectors, a wide
% array of seriation methods using a consistent interface, a method to
% calculate the value of various loss and merit functions, and several
% visualization techniques which build on seriation. To illustrate how
% easily the package can be applied for a variety of applications, a
% comprehensive collection of examples is presented.}
%
<<echo=FALSE>>=
options(scipen=3, digits=4)
### for sampling
set.seed(1234)
@
\section{Introduction}
A basic problem in data analysis, called \emph{seriation} or sometimes
\emph{sequencing}, is to arrange all objects in a set in a linear order
given available data and some loss or merit function in order to
reveal structural information. Together with
cluster analysis and variable selection, seriation is an important
problem in the field of \emph{combinatorial data
analysis}~\citep{seriation:Arabie:1996}. Solving problems in
combinatorial data analysis requires the solution of discrete
optimization problems which, in the most general case, involves
evaluating all feasible solutions. Due to the combinatorial nature, the
number of possible solutions grows with problem size (number of objects, $n$)
on the order of~$O(n!)$. This makes a brute-force enumerative approach
infeasible for all but very small problems. To solve larger problems
(currently with up to 40 objects), partial enumeration methods can be used. For
example, \cite{seriation:Hubert:2001} propose dynamic programming and
\cite{seriation:Brusco:2005} use a branch-and-bound strategy. For even
larger problems only heuristics can be employed.

Seriation has a rich history in archaeology.
\cite{seriation:Petrie:1899} was the first to use seriation as a formal method.
He applied it to find a chronological order for graves discovered in the Nile
area given objects found there. He used a cross-tabulation of grave sites and
objects and rearranged the table using row and column permutations till all
large values were close to the diagonal. In the rearranged table graves with
similar objects are closer to each other. Together with the assumption that
different objects continuously come into and go out of fashion, the order of
graves in the rearranged table suggests a chronological order.
Initially, the rearrangement of rows and columns of this
contingency table was done manually and the adequacy was only judged
subjectively by the researcher. Later, \cite{seriation:Robinson:1951},
\cite{seriation:Kendall:1971} and others proposed measures of agreement
between rows to quantify optimality of the resulting table. A
comprehensive description of the development of seriation in archaeology
is presented by \cite{seriation:Ihm:2005}.

Techniques related to seriation are also popular in several other
fields. Especially in ecology, scaling techniques are used under the
name \emph{ordination}. For these applications several \proglang{R} packages
already exist (e.g.,
\pkg{ade4}~\citep{seriation:Chessel:2007,seriation:Dray:2007} and
\pkg{vegan}~\citep{seriation:Oksanen:2007}). This paper describes the new
package \pkg{seriation} which differs from existing packages in the
following ways:
\begin{itemize}
\item \pkg{seriation} provides a flexible infrastructure for seriation;
\item \pkg{seriation} focuses on seriation as a combinatorial
optimization problem.
\end{itemize}
This paper starts with a formal introduction of the seriation problem as
a combinatorial optimization problem in Section~\ref{sec:seriation}. In
Section~\ref{sec:methods} we give an overview of seriation methods. In
Section~\ref{sec:infrastructure} we present the infrastructure provided
by the package~\pkg{seriation}. Several examples and applications for
seriation are given in Section~\ref{sec:example}.
Section~\ref{sec:conclusion} concludes.

A previous version of this manuscript was published in the \emph{Journal
of Statistical Software} \citep{seriation:Hahsler+Hornik:2008}.
\section{Seriation as a combinatorial optimization problem}
\label{sec:seriation}
To seriate a set of $n$ objects $\{O_1,\dots,O_n\}$ one typically starts
with an $n \times n$ symmetric dissimilarity matrix~$\mathbf{D} =
(d_{ij})$ where $d_{ij}$ for $1 \le i,j \le n$ represents the
dissimilarity between objects $O_i$ and $O_j$, and $d_{ii} = 0$ for
all~$i$. We define a permutation function $\Psi$ as a function which
reorders the objects in $\mathbf{D}$ by simultaneously permuting rows
and columns. The seriation problem is to find a permutation function
$\Psi^*$
%$\{1,\dots,n\} \rightarrow \{1,\dots,n\}$, i.e. a
%bijection that maps the set of indices of the objects (and equally of rows and
%columns of $\mathbf{D}$) onto itself,
which optimizes the value of a given loss function~$L$ or merit function~$M$.
This results in the optimization problems
\begin{equation}
\Psi^* = \argmin_\Psi L(\Psi(\mathbf{D})) \quad \text{or} \quad
\Psi^* = \argmax_\Psi M(\Psi(\mathbf{D})),
\end{equation}
respectively.
%This is clearly a hard discrete optimization problem since the number of
%possible permutations is $n!$ which makes an exhaustive
%search for sets with a medium to large number of objects infeasible.
%Partial enumeration methods and heuristics can be used. Such methods are
%presented in Section~\ref{sec:methods}.
%But first, we review commonly used loss functions in the following section.
%\marginpar{two-mode data missing}

A symmetric dissimilarity matrix is known as \emph{two-way one-mode}
data since it has columns and rows (two-way) but only represents one set
of objects (one-mode). Seriation is also possible for two-way two-mode
data which are represented by a general nonnegative matrix. In such data
columns and rows represent two sets of objects which are reordered
simultaneously. For loss/merit functions for two-way two-mode data, the
optimal order of columns can depend on the order of rows and vice
versa, or it can be independent, which allows breaking the optimization
down into two separate problems, one for the columns and one for the
rows. Another way to deal with the seriation for two-way two-mode data
is to calculate two dissimilarity matrices, one for each mode, and then
solve two seriation problems for two-way one-mode data. Furthermore,
seriation can be generalized to $k$-way $k$-mode data in the form of a
$k$-dimensional array by defining suitable loss/merit functions for such data
or by breaking the problem down into several lower dimensional
independent problems.

To assess the complexity of seriation of $k$-way $k$-mode data, let us
assume the data is a $k$-dimensional array with the dimensions
containing $n_1, n_2, \ldots, n_k$ objects. If the loss/merit function allows
for separating the problem into $k$ independent problems, the problem
size is just the sum of the individual problems. By using complete
enumeration the size is $O(\sum_{i=1}^k{n_i!})$. If the problem is not
separable and the optimal seriation of each dimension depends on the
order of the objects of the other dimensions, the problem size is
$O((\sum_{i=1}^k{n_i})!)$. For example, for $k=5$ and all dimensions
containing 5 objects, the search space for separable dimensions is only
600, while without separability it is larger than $10^{25}$, clearly too
big to be solvable in reasonable time. This shows that for data with
even only a few dimensions and a few objects each, finding the optimal
solution is infeasible and loss/merit functions which allow for separating the
problem are highly desirable.
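As a quick check of these figures (a worked calculation added here for
illustration), with $k=5$ and $n_i=5$ for all dimensions the two search-space
sizes are
\begin{displaymath}
\sum_{i=1}^{5} n_i! = 5 \cdot 5! = 5 \cdot 120 = 600
\quad \text{and} \quad
\Bigl(\sum_{i=1}^{5} n_i\Bigr)! = 25! \approx 1.55 \times 10^{25}.
\end{displaymath}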
In the following subsections, we review some commonly employed loss/merit
functions. Most functions are used for two-way one-mode data but the
measure of effectiveness and stress can be also used for two-way
two-mode data. For the implementation of various loss or merit measures
see function~\func{criterion} in Section~\ref{sec:infrastructure}.
%\section{Loss functions}
%\label{sec:criteria}
%In the literature several loss functions are suggested.
%We review the most commonly used functions.
\subsection{Column/row gradient measures}
A symmetric dissimilarity matrix where the values in all rows and
columns only increase when moving away from the main diagonal is called
a perfect \emph{anti-Robinson matrix} after the statistician
\cite{seriation:Robinson:1951}. Formally, an $n \times n$ dissimilarity
matrix $\mathbf{D}$ is in anti-Robinson form if and only if the
following two gradient conditions hold~\citep{seriation:Hubert:2001}:
\begin{align}
\text{within rows:} & \quad d_{ik} \le d_{ij} \quad \text{for}
\quad 1 \le i < k < j \le n; \\
\text{within columns:} & \quad d_{kj} \le d_{ij} \quad \text{for}
\quad 1 \le i < k < j \le n.
\end{align}
In an anti-Robinson matrix the smallest dissimilarity values appear
close to the main diagonal; therefore, the closer objects are together
in the order of the matrix, the higher their similarity. This provides
a natural objective for seriation.

Note that $\mathbf{D}$ can be brought into a perfect
anti-Robinson form by row and column permutation whenever $\mathbf{D}$ is an
ultrametric or $\mathbf{D}$ has an exact Euclidean representation in a single
dimension~\citep{seriation:Hubert:2001}. However, for most data only an
approximation to the anti-Robinson form is possible.
A suitable merit measure which quantifies the divergence of a matrix from the
anti-Robinson form was given by \cite{seriation:Hubert:2001} as
\begin{equation}
M(\mathbf{D}) =
\sum_{i<k<j}f(d_{ik}, d_{ij}) + \sum_{i<k<j}f(d_{kj}, d_{ij})
\label{equ:gradient}
\end{equation}
where $f(\cdot,\cdot)$ is a function which defines how a violation or
satisfaction of a gradient condition for an object triple ($O_i$, $O_k$ and
$O_j$) is counted. \cite{seriation:Hubert:2001} suggest two functions. The
first function is given by:
\begin{equation}
f(z,y) = \mathrm{sign}(y-z) =
\begin{cases}
+1 \quad \text{if} \quad z < y; \\
\phantom{+}0 \quad \text{if} \quad z = y; \\
-1 \quad \text{if} \quad z > y.
\end{cases}
\end{equation}
It results in the raw number of triples satisfying the gradient
constraints minus triples which violate the constraints.
The second function is defined as:
\begin{equation}
f(z,y) = |y-z|\mathrm{sign}(y-z) = y-z.
\end{equation}
It weighs each satisfaction or violation by its
magnitude given by the absolute difference between the values.
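To illustrate both variants, consider a small example that is not taken from
the cited sources: three objects with $d_{12}=1$, $d_{23}=2$ and $d_{13}=3$
(e.g., points at positions 0, 1 and 3 on a line). For $n=3$ the only triple is
$i=1$, $k=2$, $j=3$, so that in the order $(O_1,O_2,O_3)$
\begin{align*}
\text{sign version:} & \quad
M = \mathrm{sign}(d_{13}-d_{12}) + \mathrm{sign}(d_{13}-d_{23}) = 1 + 1 = 2, \\
\text{weighted version:} & \quad
M = (d_{13}-d_{12}) + (d_{13}-d_{23}) = 2 + 1 = 3.
\end{align*}
Reordering to $(O_1,O_3,O_2)$ moves the largest dissimilarity next to the
diagonal (the permuted matrix has $d'_{12}=3$, $d'_{23}=2$, $d'_{13}=1$) and
gives $M=-2$ and $M=-3$, respectively, so both variants prefer the first order.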
\subsection{Anti-Robinson events}
An even simpler loss function can be created in the same way as the gradient
measures above by concentrating on violations only.
\begin{equation}
L(\mathbf{D}) =
\sum_{i<k<j}f(d_{ik}, d_{ij}) + \sum_{i<k<j}f(d_{kj}, d_{ij})
\end{equation}
To only count the violations we use
\begin{equation}
f(z, y) = I(z, y) =
\begin{cases}
1 \quad \text{if} \quad z > y; \\
0 \quad \text{otherwise.}
\end{cases}
\end{equation}
$I(\cdot,\cdot)$ is an indicator function returning $1$ only for violations.
\cite{seriation:Chen:2002} presented a formulation for an equivalent
loss function and called the violations \emph{anti-Robinson events}.
\cite{seriation:Chen:2002} also introduced a weighted version of the loss
function resulting in
\begin{equation}
f(z, y) = |y-z|I(z, y)
\end{equation}
using the absolute deviations as weights.
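As a small illustration with hypothetical numbers, take three objects with
$d_{12}=1$, $d_{23}=2$ and $d_{13}=3$. In the order $(O_1,O_2,O_3)$ no gradient
condition is violated, whereas the order $(O_1,O_3,O_2)$, with permuted entries
$d'_{12}=3$, $d'_{23}=2$ and $d'_{13}=1$, violates both:
\begin{align*}
(O_1,O_2,O_3): & \quad L = I(1,3) + I(2,3) = 0 + 0 = 0, \\
(O_1,O_3,O_2): & \quad L = I(3,1) + I(2,1) = 1 + 1 = 2.
\end{align*}
With the weighted version the second order yields $L = |1-3| + |1-2| = 3$.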
\subsection{Hamiltonian path length}
The dissimilarity matrix $\mathbf{D}$ can be represented as a finite weighted
graph $G = (\Omega,E)$ where the set of objects~$\Omega$ constitutes the
vertices and each edge~$e_{ij} \in E$ between the objects $O_i, O_j \in \Omega$
has a weight~$w_{ij}$ associated which represents the dissimilarity~$d_{ij}$.
Such a graph can be used for seriation~\citep[see,
e.g.,][]{seriation:Hubert:1974,seriation:Caraux:2005}. An order~$\Psi$
of the objects can be seen as a path through the graph where each node
is visited exactly once, i.e., a Hamiltonian path. Minimizing the
Hamiltonian path length results in a seriation optimal with respect to
dissimilarities between neighboring objects. The loss function based on
the Hamiltonian path length is:
\begin{equation}
L(\mathbf{D}) = \sum_{i=1}^{n-1} d_{i,i+1}.
\end{equation}
Note that the length of the Hamiltonian path is equal to the value of the
\emph{minimal span loss function} \citep[as used by][]{seriation:Chen:2002},
and both notions are related to the \emph{traveling salesperson
problem}~\citep{seriation:Gutin:2002}.
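For example, for three objects with $d_{12}=1$, $d_{23}=2$ and $d_{13}=3$
(illustrative values, not from the cited sources), the path lengths of the
three essentially different orders are
\begin{align*}
(O_1,O_2,O_3): & \quad L = d_{12} + d_{23} = 1 + 2 = 3, \\
(O_2,O_1,O_3): & \quad L = d_{21} + d_{13} = 1 + 3 = 4, \\
(O_1,O_3,O_2): & \quad L = d_{13} + d_{32} = 3 + 2 = 5.
\end{align*}
The remaining three orders are reversals of these and have the same lengths,
so $(O_1,O_2,O_3)$ and its reverse are optimal.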
\subsection{Inertia criterion}
Another way to look at the seriation problem is not to focus on placing small
dissimilarity values close to the diagonal, but to push large values away from
it. A function to quantify this is the moment of inertia of dissimilarity
values around the diagonal \citep{seriation:Caraux:2005} defined as
\begin{equation}
M(\mathbf{D}) = \sum_{i=1}^n \sum_{j=1}^n d_{ij}|i-j|^2.
\end{equation}
$|i-j|^2$ is used as a measure for the distance to the diagonal and $d_{ij}$
gives the weight. This is a merit function since the sum increases when higher
dissimilarity values are placed farther away from the diagonal.
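As an illustration with made-up values, consider three objects with
$d_{12}=1$, $d_{23}=2$ and $d_{13}=3$. Since $\mathbf{D}$ is symmetric with a
zero diagonal, each pair contributes twice:
\begin{align*}
(O_1,O_2,O_3): & \quad
M = 2\,(1 \cdot 1^2 + 2 \cdot 1^2 + 3 \cdot 2^2) = 30, \\
(O_1,O_3,O_2): & \quad
M = 2\,(3 \cdot 1^2 + 2 \cdot 1^2 + 1 \cdot 2^2) = 18.
\end{align*}
The first order, which places the largest dissimilarity farthest from the
diagonal, has the higher merit.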
\subsection{Least squares criterion}
Another natural loss function for seriation is to quantify the deviations
between the dissimilarities in $\mathbf{D}$ and the rank differences of the
objects. Such deviations can be measured, e.g., by the sum of squares of
deviations \citep{seriation:Caraux:2005} defined by
\begin{equation}
L(\mathbf{D}) = \sum_{i=1}^n \sum_{j=1}^n (d_{ij} - |i-j|)^2,
\end{equation}
where $|i-j|$ is the rank difference or gap between $O_i$ and $O_j$.
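For instance, for three objects with $d_{12}=1$, $d_{23}=2$ and $d_{13}=3$
(values chosen only for illustration), the diagonal terms vanish and each pair
is counted twice, giving
\begin{align*}
(O_1,O_2,O_3): & \quad
L = 2\,\bigl((1-1)^2 + (2-1)^2 + (3-2)^2\bigr) = 4, \\
(O_1,O_3,O_2): & \quad
L = 2\,\bigl((3-1)^2 + (2-1)^2 + (1-2)^2\bigr) = 12.
\end{align*}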
The least squares criterion defined here is related to uni-dimensional
scaling~\citep{seriation:Leeuw:2005}, where the objective is to place all
$n$ objects on a straight line using a position vector~$\mathbf{z} =
(z_1,z_2,\ldots,z_n)$ such that the dissimilarities
in $\mathbf{D}$ are
preserved by the relative positions in the best possible way. The optimization
problem of uni-dimensional scaling is to find the
position vector~$\mathbf{z^*}$ which minimizes $\sum_{i=1}^n \sum_{j=1}^n
(d_{ij} - |z_i-z_j|)^2$.
This is close to the seriation problem, but in
addition to the ranking of the objects also takes the distances between objects
on the resulting scale into account.
Note that if Euclidean distance is used to calculate $\mathbf{D}$ from a data
matrix~$\mathbf{X}$, using the order of the elements in $\mathbf{X}$ as they
occur projected on the first principal component of $\mathbf{X}$
minimizes the loss function of uni-dimensional scaling (using squared
distances). Using this order also provides a good solution
for the least squares seriation criterion.
\subsection{Linear seriation criterion}
The linear seriation criterion (Hubert and Schultz 1976)
weights the distances with the negative absolute rank differences:
\begin{equation}
L(\mathbf{D}) = \sum_{i=1}^n \sum_{j=1}^n d_{ij} (-|i-j|).
\end{equation}
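As a small worked example with illustrative values, for three objects with
$d_{12}=1$, $d_{23}=2$ and $d_{13}=3$ the linear seriation criterion evaluates
to
\begin{align*}
(O_1,O_2,O_3): & \quad L = -2\,(1 \cdot 1 + 2 \cdot 1 + 3 \cdot 2) = -18, \\
(O_1,O_3,O_2): & \quad L = -2\,(3 \cdot 1 + 2 \cdot 1 + 1 \cdot 2) = -14,
\end{align*}
so the first order, with the smaller loss, is preferred.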
\subsection{2-Sum problem}
The 2-Sum loss criterion \citep{seriation:Barnard:1993}
multiplies the similarity between objects
with the squared rank differences:
\begin{equation}
L(\mathbf{D}) = \sum_{i=1}^n \sum_{j=1}^n \frac{1}{1+d_{ij}} (i-j)^2,
\end{equation}
where $s_{ij} = \frac{1}{1+d_{ij}}$ represents the similarity between
objects $O_i$ and $O_j$.
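To make this concrete with illustrative values, take three objects with
$d_{12}=1$, $d_{23}=2$ and $d_{13}=3$, i.e., similarities
$s_{12}=\frac{1}{2}$, $s_{23}=\frac{1}{3}$ and $s_{13}=\frac{1}{4}$. Counting
each pair twice gives
\begin{align*}
(O_1,O_2,O_3): & \quad
L = 2\,\bigl(\tfrac{1}{2} \cdot 1 + \tfrac{1}{3} \cdot 1 +
\tfrac{1}{4} \cdot 4\bigr) = \tfrac{11}{3} \approx 3.67, \\
(O_1,O_3,O_2): & \quad
L = 2\,\bigl(\tfrac{1}{4} \cdot 1 + \tfrac{1}{3} \cdot 1 +
\tfrac{1}{2} \cdot 4\bigr) = \tfrac{31}{6} \approx 5.17.
\end{align*}
The first order keeps the most similar objects at small rank differences and
therefore has the smaller loss.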
\subsection{Measure of effectiveness}
\label{sec:ME}
\cite{seriation:McCormick:1972} defined the
\emph{measure of effectiveness (ME)} for an $n \times m$ matrix~$\mathbf{X} =
(x_{ij})$ as
\begin{equation}
M(\mathbf{X}) =
\frac{1}{2}
\sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij}[x_{i,j+1}+x_{i,j-1}+
x_{i+1,j}+x_{i-1,j}]
\label{equ:ME}
\end{equation}
with, by convention, $x_{0,j}=x_{n+1,j}=x_{i,0}=x_{i,m+1}=0$. ME is maximized
if each element is as closely related numerically to its four neighboring
elements as possible.
ME was developed for two-way two-mode data; however, it can also be used for a
symmetric matrix (one-mode data), where it is maximal only if all large values are
grouped together around the main diagonal.
Note that the definition in equation~(\ref{equ:ME})
can be rewritten as
\begin{equation}
M(\mathbf{X}) =
\frac{1}{2}
\sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij}[x_{i,j+1}+x_{i,j-1}] +
\frac{1}{2}
\sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij}[x_{i+1,j}+x_{i-1,j}],
\end{equation}
showing that the contributions of column and row order to the merit function
are independent.
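For intuition, consider a single-row matrix ($n=1$, $m=3$), a toy case chosen
here purely for illustration. All vertical neighbors are zero by convention,
so only products of horizontally adjacent entries contribute:
\begin{align*}
\mathbf{X} = (1, 0, 1): & \quad
M = \tfrac{1}{2}\,\bigl[1 \cdot 0 + 0 \cdot (1+1) + 1 \cdot 0\bigr] = 0, \\
\mathbf{X}' = (1, 1, 0): & \quad
M = \tfrac{1}{2}\,\bigl[1 \cdot 1 + 1 \cdot (0+1) + 0 \cdot 1\bigr] = 1.
\end{align*}
Moving the two nonzero entries next to each other increases ME.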
\subsection{Stress}
\label{sec:stress}
Stress measures the conciseness of the presentation of a matrix
(two-mode data) and can be seen as a purity function which compares the
values in a matrix with their neighbors. The stress measures used here
are computed as the sum of squared distances of each matrix entry from
its adjacent entries. \cite{seriation:Niermann:2005} defined for an $n
\times m$ matrix~$\mathbf{X} = (x_{ij})$ two types of neighborhoods:
\begin{itemize}
\item The Moore neighborhood comprises the (at most) eight adjacent entries.
The local stress measure for element~$x_{ij}$ is defined as
\begin{equation}
\sigma_{ij} = \sum_{k=\max(1,i-1)}^{\min(n,i+1)}
\sum_{l=\max(1,j-1)}^{\min(m,j+1)}
(x_{ij} - x_{kl})^2
\end{equation}
\item The Neumann neighborhood comprises the (at most) four adjacent entries
resulting in the local stress of $x_{ij}$ of
\begin{equation}
\sigma_{ij} =
\sum_{k=\max(1,i-1)}^{\min(n,i+1)} (x_{ij} - x_{kj})^2 +
\sum_{l=\max(1,j-1)}^{\min(m,j+1)} (x_{ij} - x_{il})^2
%(x_{ij} - x(i-1,j))^2 + (x_{ij} - x(i+1,j))^2 +
%(x_{ij} - x(i,j-1))^2 + (x_{ij} - x(i,j+1))^2
\end{equation}
\end{itemize}
Both local stress measures can be used to construct a global measure for the
whole matrix by summing over all entries which can be used as a loss function:
\begin{equation}
L(\mathbf{X}) =
\sum_{i=1}^n \sum_{j=1}^m \sigma_{ij}
\end{equation}
The major difference between the Moore and the Neumann neighborhood is
that for the latter the contributions of row and column order to
stress are independent.
Stress can be also used as a loss function for
symmetric proximity matrices (one-mode data).
%,
%since it can only be optimal, if large values are
%concentrated around the main diagonal.
Note also that stress with Neumann
neighborhood is related to the measure of effectiveness defined above (in
Section~\ref{sec:ME}) since both measures are optimal if for each cell the cell
and its four neighbors are numerically as similar as possible.
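As a toy illustration (values chosen for this example only), consider the
single-row matrix $\mathbf{X} = (1, 0, 1)$ with $n=1$ and $m=3$. With the
Neumann neighborhood the local stress values are
$\sigma_{11} = (1-0)^2 = 1$, $\sigma_{12} = (0-1)^2 + (0-1)^2 = 2$ and
$\sigma_{13} = (1-0)^2 = 1$, where the zero terms for $k=i$ and $l=j$ are
omitted. Hence
\begin{displaymath}
L(\mathbf{X}) = 1 + 2 + 1 = 4,
\quad \text{while for} \quad \mathbf{X}' = (1, 1, 0) \quad
L(\mathbf{X}') = 0 + 1 + 1 = 2.
\end{displaymath}
Placing the equal entries next to each other reduces stress.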
\section{Seriation methods}
\label{sec:methods}
Solving the discrete optimization problem for seriation with most loss/merit
functions is clearly very hard. The number of possible permutations for $n$
objects is $n!$ which makes an exhaustive search for sets with a medium to
large number of objects infeasible. In this section, we describe some methods
(partial enumeration, heuristics and other methods) which are typically used
for seriation. For each method we state for which type of loss/merit functions
it is suitable and whether it finds the optimum or is a heuristic. For the
implementation of various seriation methods see function~\func{seriate} in
Section~\ref{sec:infrastructure}.
\subsection{Partial enumeration methods}
Partial enumeration methods search for the exact solution of a
combinatorial optimization problem. Exploiting properties of the search
space, only a subset of the enormous number of possible combinations has
to be evaluated. Popular partial enumeration methods which are used for
seriation are \emph{dynamic programming}~\citep{seriation:Hubert:2001}
and \emph{branch-and-bound}~\citep{seriation:Brusco:2005}.
Dynamic programming recursively searches for the optimal solution checking and
storing $2^n-1$ results. Although $2^n-1$ grows at a lower rate than $n!$ and
is for $n \gg 3$ considerably smaller, the storage requirements of $2^n-1$
results still grow fast, limiting the maximal problem size severely. For
example, for $n=30$ more than one billion results have to be calculated and
stored, clearly a number too large for the main memory capacity of most current
computers.
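The figures for $n=30$ can be verified directly (a worked calculation added
for illustration):
\begin{displaymath}
2^{30} - 1 = 1{,}073{,}741{,}823 \approx 1.07 \times 10^{9}
\quad \text{compared to} \quad
30! \approx 2.65 \times 10^{32}.
\end{displaymath}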
Branch-and-bound has only very moderate storage requirements. The
forward-branching procedure~\citep{seriation:Brusco:2005} starts to build
partial permutations from left (first position) to right. At each step, it is
checked if the permutation is valid and several fathoming tests are performed
to check if the algorithm should continue with the partial permutation. The
most important fathoming test is the boundary test, which checks if the partial
permutation can possibly lead to a complete permutation with a better solution
than the currently best one. In this way large parts of the search space can
be omitted. However, in contrast to the dynamic programming approach, the
reduction of search space is strongly data dependent and poorly structured
data can lead to very poor performance. With branch-and-bound slightly larger
problems can be solved than with dynamic programming in reasonable time.
\cite{seriation:Brusco:2005} state that depending on the data, in some cases
proximity matrices with 40 or more objects can be handled with current
hardware.
Partial enumeration methods can be used to find the exact solution
independently of the loss/merit function. However, partial enumeration is
limited to only relatively small problems.
\subsection{Traveling salesperson problem solver}
Seriation by minimizing the length of a Hamiltonian path through a graph
is equivalent to solving a traveling salesperson problem. The traveling
salesperson or salesman problem (TSP) is a well known and well
researched combinatorial optimization
problem~\citep[see, e.g.,][]{seriation:Gutin:2002}. The goal is to find the
shortest tour that, starting from a given city, visits each city in a
given list exactly once and then returns to the starting city.
In graph theory a TSP tour is called
a \emph{Hamiltonian cycle.} But for the seriation problem, we are
looking for a Hamiltonian path. \cite{seriation:Garfinkel:1985}
described a simple transformation of the TSP to find the shortest
Hamiltonian path. An additional row and column of 0's is added
(sometimes this is referred to as a \emph{dummy city}) to the original
$n \times n$ dissimilarity matrix~$\mathbf{D}$. The solution of this
$(n+1)$-city TSP gives the shortest path where the city representing
the added row/column cuts the cycle into a linear path.
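As a sketch of this transformation with illustrative values, take three
objects with $d_{12}=1$, $d_{23}=2$ and $d_{13}=3$ and denote the dummy city by
$O_0$ (a label introduced here). Adding it as first row and column gives
\begin{displaymath}
\mathbf{D}' =
\begin{pmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 1 & 3 \\
0 & 1 & 0 & 2 \\
0 & 3 & 2 & 0
\end{pmatrix}.
\end{displaymath}
The tour $O_0 \to O_1 \to O_2 \to O_3 \to O_0$ has length $0 + 1 + 2 + 0 = 3$,
which is shorter than the tours through $(O_2,O_1,O_3)$ with length 4 and
$(O_1,O_3,O_2)$ with length 5. Cutting the optimal tour at $O_0$ yields the
path $(O_1,O_2,O_3)$ of length 3.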
Like the general seriation problem, the TSP is difficult to solve. In the
seriation case with $n+1$ cities, $n!$ tours have to be checked. However,
despite this vast search space, small instances can be solved efficiently
using dynamic programming \citep{seriation:Held:1962} and larger instances of
several hundred objects can be solved using \emph{branch-and-cut}
algorithms~\citep{seriation:Padberg:1990}. For even larger instances or if
running time is critical, a wide array of heuristics is available, ranging
from simple nearest neighbor approaches to construct a
tour~\citep{seriation:Rosenkrantz:1977} to complex heuristics like the
Lin-Kernighan heuristic~\citep{seriation:Lin:1973}. A comprehensive overview
of heuristics and exact methods can be found in \cite{seriation:Gutin:2002}.
\subsection{Bond energy algorithm}
The \emph{bond energy algorithm}~\citep[BEA;][]{seriation:McCormick:1972} is a
simple heuristic to rearrange columns and rows of a matrix
(two-way two-mode data) such that each entry
is as closely numerically related to its four neighbors as possible. To
achieve this, BEA tries to maximize the measure of effectiveness (ME) defined
in Section~\ref{sec:ME}. For optimizing the ME, columns and rows can be
treated separately since changing the order of rows does not influence the ME
contributions of the columns and vice versa. BEA consists of the
following three steps:
\begin{enumerate}
\item Place one randomly chosen column.
\item Try to place each remaining column at each possible position
left, right and between the already placed columns and calculate every
time the increase in ME. Choose the column and position which gives
the largest increase in ME and place the column. Repeat till all
columns are placed.
\item Repeat procedure with rows.
\end{enumerate}
This greedy algorithm is fast, and its result depends only on the choice of the
first column/row. This dependence can be reduced by repeating the
procedure several times with different choices and returning the solution
with the highest ME.
Although \cite{seriation:McCormick:1972} use BEA also for non-binary data,
\cite{seriation:Arabie:1990} argue that the measure of effectiveness only
serves its intended purpose of finding an arrangement which is
close to Robinson form for binary data and should therefore only be
used for binary data.
\cite{seriation:Lenstra:1974} notes that the optimization problem of BEA
can be stated as two independent traveling salesperson problems (TSPs).
For example, the row TSP for an $n \times m$ matrix~$\mathbf{X}$
consists of $n$ cities with an $n \times n$ distance matrix~$\mathbf{D}$
where the distances are
\begin{displaymath}
d_{ij} = -\sum_{k=1}^m x_{ik}x_{jk}.
\end{displaymath}
BEA is in fact a simple suboptimal TSP heuristic using these distances,
and instead of BEA any TSP solver can be used to obtain an order.
With an exact TSP solver, the optimal solution can be found.
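As an illustration with a hypothetical binary matrix, let
\begin{displaymath}
\mathbf{X} =
\begin{pmatrix}
1 & 1 & 0 \\
0 & 1 & 1 \\
1 & 1 & 0
\end{pmatrix},
\quad \text{so that} \quad
d_{12} = -1, \quad d_{13} = -2, \quad d_{23} = -1.
\end{displaymath}
The row order $(1,3,2)$ gives a path length of $d_{13} + d_{32} = -3$, whereas
the original order $(1,2,3)$ gives $d_{12} + d_{23} = -2$. The shorter path
places the two identical rows next to each other.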
\subsection{Hierarchical clustering}
\label{sec:hierarchical_clustering}
Hierarchical clustering produces a series of nested clusterings which
can be visualized by a dendrogram, a tree where each internal node
represents a split into subtrees and has a measure of
similarity/dissimilarity attached to it. As a simple heuristic to find
a linear order of objects, the order of the leaf nodes in a dendrogram
structure can be used. This idea is used, e.g., by heat maps to reorder
rows and columns with the aim to place more similar objects and
variables closer together.
%For hierarchical clustering several methods are available (e.g.,
%single linkage, average linkage, complete linkage, ward method) resulting in
%different dendrograms.
%However,
The order of leaf nodes in a dendrogram is not unique. A binary
(two-way splits only) dendrogram for $n$ objects has $n-1$ internal
nodes and at each internal node the left and right subtree (or leaves)
can be swapped resulting in $2^{n-1}$ distinct leaf orderings. To find
a unique or optimal order, an additional criterion has to be defined.
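For instance, consider a hypothetical dendrogram for $n=3$ objects which first
merges $O_1$ and $O_2$ and then joins $O_3$. Swapping the subtrees at its two
internal nodes yields the $2^2 = 4$ leaf orders
\begin{displaymath}
(O_1,O_2,O_3), \quad (O_2,O_1,O_3), \quad (O_3,O_1,O_2), \quad (O_3,O_2,O_1).
\end{displaymath}
The orders $(O_1,O_3,O_2)$ and $(O_2,O_3,O_1)$ are not compatible with this
dendrogram because they separate $O_1$ and $O_2$.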
\cite{seriation:Gruvaeus:1972} suggest obtaining a unique order by
requiring that the leaf nodes are ordered such that at each level the object at
the edge of each cluster is adjacent to that object outside the cluster
to which it is nearest.
\cite{seriation:Bar-Joseph:2001} suggest rearranging the dendrogram such that
the length of the Hamiltonian path connecting the leaves is minimized and call
this the optimal leaf order. The authors also present a fast algorithm with time
complexity $O(n^4)$ to solve this optimization problem. Note that this problem
is related to the TSP described above; however, the given dendrogram structure
significantly reduces the number of permissible permutations, making the problem
easier.
Although hierarchical clustering solves an optimization problem different to
the seriation problem discussed in this paper, hierarchical clustering still
can produce useful orderings, e.g., for visualization.
\subsection{Rank-two ellipse seriation}
\cite{seriation:Chen:2002} proposes generating
a sequence of correlation matrices
$R^1, R^2, \ldots$. $R^1$ is the correlation matrix
of the original distance matrix $\mathbf{D}$ and
\begin{equation}
R^{n+1} = \phi(R^n),
\end{equation}
where $\phi(\cdot)$ calculates the correlation matrix of its argument and
$n$ here indexes the position in the sequence.
\cite{seriation:Chen:2002} shows that the
rank of the matrix $R^n$ falls with increasing $n$. If the sequence
is continued until the first matrix with a rank of 2 is reached, then
projecting all points in this matrix onto its first two eigenvectors
places all points on an ellipse.
\cite{seriation:Chen:2002} suggests using the order of the points on
this ellipse as a seriation where the ellipse can be cut at either of the
two intersection points (top or bottom) with the vertical axis.
Although the rank-two ellipse seriation procedure does not try to solve a
combinatorial optimization problem, it still provides a useful
ordering in some cases.
\subsection{Spectral Seriation}
Spectral seriation uses a relaxation to minimize the 2-Sum Problem
\citep{seriation:Barnard:1993}.
Rewriting the minimization problem using a permutation vector $\pi$,
its inverse, rescaling to $\mathbf{q}$ and using a Lagrangian
multiplier for the constraint on the permutation yields \citep{seriation:Ding:2004} the following equivalent optimization problem:
$$\min_\mathbf{q} \frac{\mathbf{q}^T L_\mathbf{S}\mathbf{q}}{\mathbf{q}^T\mathbf{q}}$$
where $L_\mathbf{S}$ is the Laplacian of $\mathbf{S}$.
The optimal order can be recovered by the sorting order of
the Fiedler vector (i.e., the eigenvector associated with the second
smallest eigenvalue of the Laplacian of the similarity matrix).
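
For illustration, the spectral order can be computed with base \proglang{R} as
sketched below. The conversion of dissimilarities into similarities via
$1/(1+d)$ is an assumption made for this example (other conversions are
possible), and the function name is hypothetical. The Laplacian is formed by
subtracting the similarity matrix from the diagonal matrix of its row sums.
<<eval=FALSE>>=
spectral_sketch <- function(d) {
  ## assumption: convert dissimilarities into similarities
  S <- 1 / (1 + as.matrix(d))
  ## graph Laplacian: diagonal matrix of the row sums minus S
  L <- diag(rowSums(S)) - S
  ## eigen() sorts eigenvalues in decreasing order, so the Fiedler
  ## vector (second smallest eigenvalue) is the second to last column
  ev <- eigen(L, symmetric = TRUE)
  fiedler <- ev$vectors[, ncol(L) - 1L]
  ## the sorting order of the Fiedler vector gives the seriation
  order(fiedler)
}

ord <- spectral_sketch(dist(as.matrix(iris[-5])))
head(ord)
@
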
\subsection{Quadratic Assignment Problem}
Both the linear seriation criterion and the 2-Sum problem formulation
can be written as a Quadratic Assignment Problem (QAP). However,
the QAP is in general NP-hard. Solution methods include
QIP, linearization, branch and bound and cutting planes as well as
heuristics including Tabu search, simulated annealing, genetic algorithms, and
ant systems \citep{seriation:Burkard:1998}.
\section{The package infrastructure}
\label{sec:infrastructure}
The \pkg{seriation} package provides the data structures and some algorithms
to efficiently handle seriation with \proglang{R}. As
input data for seriation,
\proglang{R}
already provides
\begin{itemize}
\item for two-way one-mode data the class \code{dist},
\item for two-way two-mode data the class \code{matrix}, and
\item for $k$-way $k$-mode data the class \code{array}.
\end{itemize}
\begin{figure}[tp]
\centerline{
%\includegraphics[width=12cm]{infrastructure}}
\includegraphics[width=10cm]{classes}}
\caption{UML class diagram of the data structures for permutations provided by \pkg{seriation}}
\label{fig:infrastructure}
\end{figure}
However, \proglang{R} provides no classes for representing permutation vectors.
\pkg{seriation} adds the necessary data structure (using the S3 class
system) as depicted in the UML class diagram \citep{seriation:Fowler:2004} in
Figure~\ref{fig:infrastructure}. In this diagram classes are represented by
rectangles and different symbols are used to state the type of
relationship between the classes. The class
\code{ser\_permutation}
in Figure~\ref{fig:infrastructure}
represents the permutation information for $k$-mode
data (including the cases of $k=1$ and $k=2$). It consists of $k$ permutation
vectors (class \code{ser\_permutation\_vector}). This relationship is
represented by the solid diamond and the star above the connection between the
two classes. Class \code{ser\_permutation\_vector} is defined \emph{abstract}
and only its concrete implementations (classes connected with the triangle
symbol) are used to store a permutation vector.
This design with an abstract class was chosen to allow
different representations for the permutation vectors.
Currently, the permutation vector can be stored as a simple
integer vector or as an object of class \code{hclust} (defined in
package \pkg{stats}). \code{hclust} describes a hierarchical clustering
tree (dendrogram) including an ordering for the tree's leaf nodes which
provides a permutation for all objects (see
Section~\ref{sec:hierarchical_clustering}).
Class \code{ser\_permutation\_vector} has a constructor
\func{ser\_permutation\_vector} which converts data into the correct concrete
subclass of \code{ser\_permutation\_vector} and checks if it contains a proper
permutation vector. For \code{ser\_permutation\_vector} the methods
\func{print}, \func{length} for the length of the permutation vector,
\func{get\_method} to get the method used to generate the permutation, and
\func{get\_order} to access the raw (integer) permutation vector are available.
To use an additional class to represent permutations as a concrete subclass of
\code{ser\_permutation\_vector} only an appropriate accessor method
\func{get\_order} has to be implemented for the new class.
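
A minimal sketch of how these functions fit together is given below. The
permutation, the data subset and the method labels are made up for
illustration; the calls assume that the constructor accepts an optional
\code{method} label.
<<eval=FALSE>>=
library("seriation")

## an integer permutation vector; the method label is free text
pv <- ser_permutation_vector(c(3L, 1L, 4L, 2L), method = "manual")
pv
length(pv)      # length of the permutation vector
get_method(pv)  # label of the method that produced the order
get_order(pv)   # raw integer permutation

## an hclust object is also accepted; its leaf order is the permutation
hc <- hclust(dist(as.matrix(iris[1:10, -5])))
pv_hc <- ser_permutation_vector(hc, method = "hclust")
get_order(pv_hc)
@
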
For \code{ser\_permutation} a constructor is provided which can bind $k$
\code{ser\_permutation\_vector} objects together into an object for $k$-mode
data. \code{ser\_permutation} is implemented as a list of length~$k$ and each
element contains a \code{ser\_permutation\_vector} object. Methods like
\func{length}, accessing elements with \code{[[},
% ]]
\code{[[<-},
% ]]
subsetting with \code{[}, and combining with \func{c} work as expected.
Also a \func{print} method is provided. Finally, direct access to the
raw permutation vectors is available using \func{get\_order}. Here a
second argument (which defaults to $1$) specifies the dimension (mode)
for which the order vector is requested.
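
The following sketch shows these operations for a small two-mode example.
The permutations and object names are hypothetical and only serve to
illustrate the interface described above.
<<eval=FALSE>>=
library("seriation")

rows <- ser_permutation_vector(c(3L, 1L, 2L), method = "manual")
cols <- ser_permutation_vector(c(2L, 1L), method = "manual")

## bind two permutation vectors into an object for two-mode data
p <- ser_permutation(rows, cols)
length(p)        # number of modes
p[[2]]           # the ser_permutation_vector for the second mode
get_order(p, 2)  # raw order vector for the second mode

## c() combines permutation objects
p2 <- c(ser_permutation(rows), ser_permutation(cols))
length(p2)
@
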
All seriation algorithms are available via the function \func{seriate}
defined as:
\begin{quotation}
\code{seriate(x, method = NULL, control = NULL, ...)}
\end{quotation}
where \code{x} is the input data, \code{method} is a string defining the
seriation method to be used and \code{control} can contain a list with
additional information for the algorithm. \func{seriate} returns an object
of class \code{ser\_permutation} with a length conforming to the number of
dimensions of~\code{x}.
Typical input data are a dissimilarity matrix (class~\code{dist};
see package \pkg{stats} for more information) for one-mode two-way data,
\code{matrix} for two-mode two-way data and \code{array} for $k$-mode $k$-way
data.
For \code{matrix} and \code{array} the additional argument
\code{margin} can be used to restrict the dimensions which should be seriated
(e.g., with \code{margin = 1} only the first dimension,
i.e., the rows of a matrix, is seriated).
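
As a brief sketch of the interface for two-mode data, the following calls
seriate a data matrix with and without the \code{margin} argument. The
example assumes the method \code{"PCA"} listed in Table~\ref{tab:methods};
the data and object names are chosen for illustration only.
<<eval=FALSE>>=
library("seriation")

m <- scale(as.matrix(iris[-5]))

## seriate all dimensions of the two-mode data matrix
o_both <- seriate(m, method = "PCA")
length(o_both)

## restrict the seriation to the first dimension
o_first <- seriate(m, method = "PCA", margin = 1)
head(get_order(o_first, 1))
@
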
%\begin{landscape}
\begin{table}[tp]
\centering
\begin{tabular}{p{5cm}p{3cm}p{4cm}l}
\hline
Algorithm & \code{method} & Optimizes & Input data \\
\hline
Simulated annealing & \code{"ARSA"} & Linear seriation crit.&\code{dist} \\
Branch-and-bound & \code{"BBURCG"} & Gradient measure &\code{dist} \\
Branch-and-bound & \code{"BBWRCG"} & Gradient measure (weighted)& \code{dist} \\
TSP solver & \code{"TSP"} & Hamiltonian path length& \code{dist} \\
Optimal leaf ordering & \code{"OLO"}
\code{"OLO_single"}
\code{"OLO_average"}
\code{"OLO_complete"}
& Hamiltonian path length (restricted)& \code{dist} \\
Gruvaeus and Wainer & \code{"GW"}
\code{"GW_single"}
\code{"GW_average"}
\code{"GW_complete"}
& Hamiltonian path length (restricted) & \code{dist} \\
MDS & \code{"MDS"}
\code{"MDS_metric"}
\code{"MDS_nonmetric"}
\code{"MDS_angle"}
& Least square crit.& \code{dist} \\
Spectral seriation & \code{"Spectral"}
\code{"Spectral_norm"}
& 2-Sum crit. & \code{dist} \\
QAP & \code{"QAP_2SUM"}
& 2-Sum crit. & \code{dist} \\
& \code{"QAP_LS"}
& Linear seriation crit. & \code{dist} \\
& \code{"QAP_BAR"}
& Banded AR form & \code{dist} \\
& \code{"QAP_Inertia"}
& Inertia crit. & \code{dist} \\
Genetic Algorithm & \code{"GA"}*
& various & \code{dist} \\
DendSer & \code{"DendSer"}*
& various & \code{dist} \\
Hierarchical clustering & \code{"HC"}
\code{"HC_single"}
\code{"HC_average"}
\code{"HC_complete"}
& Other& \code{dist} \\
Rank-two ellipse seriation & \code{"R2E"} & Other& \code{dist} \\
Sorting Points Into Neighborhoods & \code{"SPIN_NH"}
\code{"SPIN_STS"} & Other& \code{dist} \\
Visual Assessment of (Clustering) Tendency & \code{"VAT"}& Other& \code{dist} \\
\hline
Bond Energy Algorithm & \code{"BEA"} &
Measure of effectiveness & \code{matrix} \\
TSP to optimize ME & \code{"BEA\_TSP"} &
Measure of effectiveness& \code{matrix} \\
Principal component analysis& \code{"PCA"}
\code{"PCA_angle"}&
Least square crit.& \code{matrix} \\
\hline
\end{tabular}
\caption{Currently implemented methods for \func{seriate} (* methods need to be registered).}
\label{tab:methods}
\end{table}
%\end{landscape}
Various seriation methods were already introduced in this paper in
Section~\ref{sec:methods}. In Table~\ref{tab:methods} we summarize the
methods currently available in the package for seriation. The code for
the simulated annealing heuristic~\citep{seriation:Brusco:2007} and the
two branch-and-bound implementations~\citep{seriation:Brusco:2005} was
obtained from the authors. The TSP solvers (exact solvers and a variety
of heuristics) are provided by package
\pkg{TSP}~\citep{seriation:Hahsler:2007, seriation:Hahsler:2007b}. For
optimal leaf ordering we implemented the algorithm
by~\cite{seriation:Bar-Joseph:2001}. The BEA code was kindly provided by
Fionn Murtagh. For the Gruvaeus and Wainer algorithm, the
implementation in package \pkg{gclus}~\citep{seriation:Hurley:2007} is
used. For the rank-two ellipse seriation we implemented the algorithm
by~\cite{seriation:Chen:2002}.
Spectral seriation is described by~\cite{seriation:Ding:2004}.
Note that some methods implemented
(e.g., the rank-two ellipse seriation) do not fall within the
combinatorial optimization framework of this paper and thus are not
dealt with here in detail. They are included in the package since they
can be useful for various applications.
A detailed empirical comparison of seriation methods and criteria can be found
in the study by \cite{hahsler:Hahsler2016d}.
%Over time more methods will be
%added to the package.
To calculate the value of a loss/merit function for data and
a certain permutation, the function
\begin{quotation}
\code{criterion(x, order = NULL, method = NULL, ...)}
\end{quotation}
is provided. \code{x} is the data object, \code{order} contains a suitable
object of class \code{ser\_permutation} (if omitted no permutation is
performed) and \code{method} specifies the type of loss/merit function. A
vector of several methods can be used resulting in a named vector
with the values of the requested functions. If \code{method} is omitted
(\code{method = NULL}), the values for all applicable loss/merit functions are
calculated and returned. We already defined different loss/merit functions for
seriation in Section~\ref{sec:seriation}. In Table~\ref{tab:criteria} we
indicate the loss/merit functions currently available in the package.
\begin{table}[t]
\centering
\begin{tabular}{llll}
\hline
Name & \code{method} & merit/loss & Input data \\
\hline
Anti-Robinson events& \code{"AR\_events"} &
loss & \code{dist} \\
Anti-Robinson deviations& \code{"AR\_deviations"} &
loss & \code{dist} \\
Banded Anti-Robinson& \code{"BAR"} &
loss & \code{dist} \\
Gradient measure& \code{"Gradient\_raw"} &
merit & \code{dist} \\
Gradient measure (weighted)& \code{"Gradient\_weighted"} &
merit & \code{dist} \\
Hamiltonian path length & \code{"Path\_length"} & loss & \code{dist} \\
Inertia criterion& \code{"Inertia"} & merit & \code{dist} \\
Least squares criterion& \code{"Least\_squares"} & loss & \code{dist} \\
Linear Seriation criterion& \code{"LS"} & loss & \code{dist} \\
2-Sum criterion& \code{"2SUM"} & loss & \code{dist} \\
\hline
Measure of effectiveness& \code{"ME"} &
merit & \code{matrix} \\
Stress (Moore neighborhood)& \code{"Moore\_stress"} &
loss & \code{matrix} \\
Stress (Neumann neighborhood)& \code{"Neumann\_stress"} &
loss & \code{matrix} \\
\hline
\end{tabular}
\caption{Implemented loss/merit functions in function \func{criterion}.}
\label{tab:criteria}
\end{table}
All methods for \func{seriate} and \func{criterion} are managed by a
registry mechanism which makes the seriation framework easily extensible
for users. For example, a new seriation method can be registered using
\func{set\_seriation\_method} and then used in the same way as the
built-in methods with \func{seriate}. All available methods in the
registry can be viewed using \func{list\_seriation\_methods} and
\func{show\_seriation\_methods}. For criterion methods, the same
interface is available by just substituting `seriation' by `criterion'
in the function names. An example of how to add new methods can be
found in Section~\ref{sec:registering} of this paper.
In addition, the package offers the (generic) function
\begin{quotation}
\code{permute(x, order)}
\end{quotation}
where \code{x} is the data (a \code{dist} object, a matrix, an
array, a list or a numeric vector) to be reordered and \code{order} is a
\code{ser\_permutation} object of suitable length.
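
A small sketch of \func{permute} for a matrix follows. The toy matrix and
the permutations are made up; the comparison with plain indexing is included
only to make the effect of the permutation object explicit.
<<eval=FALSE>>=
library("seriation")

m <- matrix(1:6, nrow = 3,
  dimnames = list(c("a", "b", "c"), c("u", "v")))

## reverse the rows and swap the two columns
p <- ser_permutation(
  ser_permutation_vector(3:1),
  ser_permutation_vector(2:1)
)
permute(m, p)

## the same rearrangement using the raw order vectors
m[get_order(p, 1), get_order(p, 2)]
@
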
%The permutation for
%\code{dist} objects uses package \pkg{proxy}~\citep{seriation:Meyer:2007}.
For visualization, the package offers several options:
\begin{itemize}
\item Matrix shading with \func{pimage}. In contrast to the
standard \func{image} in package~\pkg{graphics}, \func{pimage}
displays the matrix as is with the first element in the top
left-hand corner and using a gamma-corrected gray scale.
\item Different heat maps (e.g., with optimally reordered
dendrograms) with \func{hmap}.
\item Visualization of data matrices in the spirit
of~\cite{seriation:Bertin:1981} with \func{bertinplot}.
\item \emph{Dissimilarity plot}, a new visualization to judge the
quality of a clustering using matrix shading
and seriation with \func{dissplot}.
\end{itemize}
We will introduce the package usage and the visualization options
in the examples in the next section.
\section{Examples and applications}
\label{sec:example}
We start this section with a simple first session to demonstrate the basic
usage of the package. Then we present and discuss several seriation
applications.
\subsection{A first session using seriation}
In the following example, we use the well known iris data set
(from \proglang{R}'s \pkg{datasets} package) which gives the
measurements in centimeters of the variables sepal length and width and petal
length and width, respectively, for 50 flowers from each of 3 species of the
iris family (Iris Setosa, Versicolor and Virginica).
First, we load the package \pkg{seriation} and the iris data set. We
remove the species classification and reorder the objects randomly since
they are already sorted by species in the data set. Then we calculate
the Euclidean distances between objects.
<<echo=FALSE>>=
set.seed(1234)
@
<<>>=
library("seriation")
data("iris")
x <- as.matrix(iris[-5])
x <- x[sample(seq_len(nrow(x))),]
d <- dist(x)
@
To seriate the objects given the dissimilarities, we just call
\func{seriate} with the default settings.
<<>>=
o <- seriate(d)
o
@
The result is an object of class \code{ser\_permutation} for
one-mode data. The permutation vector length is $150$ for the
$150$ objects in the iris data set and the used seriation method is
\code{"ARSA"}, a simulated annealing heuristic
(see~Table~\ref{tab:methods}). The actual order can be accessed
using \func{get\_order}. In the following we show the first
15 elements in the permutation vector.
<<>>=
head(get_order(o), 15)
@
To visually inspect the effect of seriation on the distance matrix, we use
matrix shading with \func{pimage} (the result is shown in
Figure~\ref{fig:pimage1}).
<<label=pimage1, fig=TRUE, include=FALSE, width=7.5>>=
pimage(d, main = "Random")
@
<<label=pimage1-2, fig=TRUE, include=FALSE, width=7.5>>=
pimage(d, o, main = "Reordered")
@
\begin{figure}
\centering
\includegraphics[width=7.5cm]{seriation-pimage1}
\includegraphics[width=7.5cm]{seriation-pimage1-2}
\caption{Matrix shading of the distance matrix for the iris data.}
\label{fig:pimage1}
\end{figure}
We can also compare the improvement for different loss/merit functions
using \func{criterion}.
<<>>=
cbind(random = criterion(d), reordered = criterion(d, o))
@
Naturally, the reordered dissimilarity matrix achieves better values for all
criteria. Note that the gradient measures, inertia and the measure of
effectiveness are merit functions and for these measures larger values are
better (use \code{show\_criterion\_methods("dist")} to find out which measures
are loss and merit functions).
To visually compare the original data matrix and the
result of seriation, we can also use \func{pimage}.
We standardize the data using \func{scale} such that the visualized value
is the number of standard deviations an object differs from the
variable mean. For matrices containing negative values, \func{pimage}
automatically uses a diverging palette.
After using \func{pimage} for the original random data matrix,
we create a suitable \code{ser\_permutation} object for the original
two-mode data. Since the seriation above only produced an order for the rows
of the data, we add an identity permutation vector
for the columns (represented by \code{NA})
to the permutations object using the combine
function \func{c}. This new permutation object for $2$-mode data is used
for displaying the reordered scaled data. The two plots are shown in
Figure~\ref{fig:pimage2}.
<<label=pimage2, fig=TRUE, include=FALSE, width=7.5>>=
pimage(scale(x), main = "Random", prop = FALSE)
@
<<label=pimage2-2, fig=TRUE, include=FALSE, width=7.5>>=
o_2mode <- c(o, NA)
pimage(scale(x), o_2mode, main = "Reordered", prop = FALSE)
@
\begin{figure}
\centering
\includegraphics[width=7.5cm]{seriation-pimage2}
\includegraphics[width=7.5cm]{seriation-pimage2-2}
\caption{Matrix shading of the iris data matrix.}
\label{fig:pimage2}
\end{figure}
\subsection{Comparing different seriation methods}
To compare different seriation methods we use again the randomized iris data
set and the distance matrix \code{d} from the previous example. We include in
the comparison several seriation methods for dissimilarity matrices described
in Section~\ref{sec:methods}.
<<>>=
methods <- c("TSP","R2E", "ARSA", "HC", "GW", "OLO")
o <- sapply(methods, FUN = function(m) seriate(d, m))
@
<<echo=FALSE>>=
timing <- sapply(methods, FUN = function(m) system.time(seriate(d, m)),
simplify = FALSE)
@
\begin{table}
\centering
\begin{tabular}{lcccccc}
\hline
Seriation Method &
\Sexpr{methods[1]}&
\Sexpr{methods[2]}&
\Sexpr{methods[3]}&
\Sexpr{methods[4]}&
\Sexpr{methods[5]}&
\Sexpr{methods[6]} \\
\hline
Execution time [sec] &
\Sexpr{round(timing[[methods[1]]][1],4)}&
\Sexpr{round(timing[[methods[2]]][1],4)}&
\Sexpr{round(timing[[methods[3]]][1],4)}&
\Sexpr{round(timing[[methods[4]]][1],4)}&
\Sexpr{round(timing[[methods[5]]][1],4)}&
\Sexpr{round(timing[[methods[6]]][1],4)}\\
\hline
\end{tabular}
%%% fix me: for the vignette we need something else
\caption{Execution time of seriation of the iris data set for different
methods.}
\label{tab:timings}
\end{table}
Table~\ref{tab:timings} contains the execution times for running
seriation with the different methods. Except for the simulated annealing
method (ARSA) the seriation only takes a fraction of a second.
The direction of the resulting orderings is first normalized (aligned)
and then the orderings are displayed using matrix shading
(see Figure~\ref{fig:pimage3}).
<<label=pimage3-pre, eval=FALSE>>=
o <- ser_align(o)
for(s in o) pimage(d, s, main = get_method(s), key = FALSE)
@
<<label=pimage3, echo=FALSE, fig=TRUE, include=FALSE>>=
o <- ser_align(o)
for(i in 1:length(o)) {
pdf(file=paste("seriation-pimage_comp_", i , ".pdf", sep=""))
pimage(d, o[[i]], main = get_method(o[[i]]), key = FALSE)
dev.off()
}
@
\begin{figure}
\centering
\includegraphics[width=.3\linewidth]{seriation-pimage_comp_1.pdf}
\includegraphics[width=.3\linewidth]{seriation-pimage_comp_2.pdf}
\includegraphics[width=.3\linewidth]{seriation-pimage_comp_3.pdf}\\
\includegraphics[width=.3\linewidth]{seriation-pimage_comp_4.pdf}
\includegraphics[width=.3\linewidth]{seriation-pimage_comp_5.pdf}
\includegraphics[width=.3\linewidth]{seriation-pimage_comp_6.pdf}
\caption{Image plot of the distance matrix for the iris data
using rearrangement by different seriation methods.}
\label{fig:pimage3}
\end{figure}
The first row of matrices in Figure~\ref{fig:pimage3} contains the
orders obtained by a TSP solver, the rank-two ellipse seriation by Chen,
and the simulated annealing method (ARSA). The results of Chen and
ARSA are very similar (except that the order is reversed). The TSP
solver produces a smoother image with some lighter lines visible. The
reason for these lines is that the TSP only optimizes distances locally
between two neighboring objects. Therefore it is possible that in a
quite homogeneous block several objects are enclosed gradually getting
more different and then getting more similar again (see, e.g., the light
line close to the upper left corner of the TSP image in
Figure~\ref{fig:pimage3}).
The second row of Figure~\ref{fig:pimage3} contains three images based on
hierarchical clustering. The visual impression gets better from left (just
hierarchical clustering) to right (first using the Gruvaeus Wainer heuristic
and then optimal leaf ordering to rearrange the branches of the dendrogram
obtained by hierarchical clustering). The most striking feature in the
image for hierarchical clustering (HC in Figure~\ref{fig:pimage3}) is the
distinct cross going right through the center of the plot. This indicates that
several relatively dissimilar objects are caught in an otherwise homogeneous
block. This effect vanishes after rearranging the dendrogram branches
(see GW and OLO in Figure~\ref{fig:pimage3}).
%' To investigate this effect,
%' we plot the dendrogram obtained by hierarchical clustering which is used
%' to order the objects and compare it to the dendrogram rearranged
%' using the Gruvaeus Wainer heuristic.
%'
%' <<label=pimage3_dend, eval=FALSE>>=
%' plot(o[["HC"]], labels = FALSE, main = "Dendrogram HC")
%' plot(o[["GW"]], labels = FALSE, main = "Dendrogram GW")
%' @
%' <<echo=FALSE, fig=FALSE, include=FALSE, width=9>>=
%' def.par <- par(no.readonly = TRUE)
%' pdf(file="seriation-pimage3_dendrogram.pdf", width=9, height=4)
%' layout(t(1:2))
%' plot(o[["HC"]], labels = FALSE, main = "Dendrogram HC")
%' symbols(74.7,.5, rect = matrix(c(4, 3), ncol=2), add= TRUE,
%' inches = FALSE, lwd =2)
%'
%' plot(o[["GW"]], labels = FALSE, main = "Dendrogram GW")
%' symbols(98.7,.5, rect = matrix(c(4, 3), ncol=2), add= TRUE,
%' inches = FALSE, lwd =2)
%' par(def.par)
%' tmp <- dev.off()
%' @
%'
%' \begin{figure}
%' \centering
%' \includegraphics[width=\linewidth, trim=0 80 0 0, clip=TRUE]{seriation-pimage3_dendrogram}
%' \caption{Dendrograms for the seriation with HC and GW.}
%' \label{fig:pimage3_dendrogram}
%' \end{figure}
%'
%' Comparing the two dendrograms in Figure~\ref{fig:pimage3_dendrogram}, we see
%' that the branch left from the top is almost unchanged. The branch which is
%' responsible for the light cross in the shaded image is highlighted by a box.
%' The Gruvaeus Wainer heuristic rotates the highlighted branch towards the right
%' since the objects in it are more similar to the objects in there.
Finally, we compare the values of the loss/merit functions
for the different seriation methods.
<<>>=
crit <- sapply(o, FUN = function(x) criterion(d, x))
t(crit)
@
<<echo=FALSE, fig=TRUE, include=FALSE, label=crit1, width=6, height=8>>=
def.par <- par(no.readonly = TRUE)
m <- c("Path_length", "AR_events", "Moore_stress")
layout(matrix(seq_along(m), ncol=1))
#tmp <- apply(crit[m,], 1, dotchart, sub = m)
tmp <- lapply(m, FUN = function(i) dotchart(crit[i,], sub = i))
par(def.par)
@
\begin{figure}
\centering
\includegraphics[width=14cm]{seriation-crit1}
\caption{Comparison of different methods and seriation criteria}
\label{fig:crit1}
\end{figure}
For easier comparison, Figure~\ref{fig:crit1} contains a plot of the criteria
Hamiltonian path length, anti-Robinson events (\code{AR\_events}) and stress
using the Moore neighborhood. Clearly, the methods which directly try to
minimize the Hamiltonian path length (hierarchical clustering with optimal leaf
ordering (\code{OLO}) and the TSP heuristic) provide the best results
concerning the path length. For the number of anti-Robinson events, using the
simulated annealing heuristic (\code{ARSA}) provides the best result. Regarding stress, the
simulated annealing heuristic also provides the best result, although it does
not directly minimize this loss function.
\subsection{Registering new methods}
\label{sec:registering}
New methods to calculate criterion values and to compute a seriation can
be easily added by the user via the method registry mechanism provided
in \pkg{seriation}. Here we give a simple example of how to implement and
register a new seriation method.
In the registry we distinguish between methods for different types of
input data. With the following two commands we produce a list
of the available seriation methods for input data of class \code{dist}
and \code{matrix}.
<<>>=
list_seriation_methods("dist")
list_seriation_methods("matrix")
@
To get detailed information on a seriation method use the following.
<<>>=
get_seriation_method("dist", name = "ARSA")
@
To add a new seriation method, we first have to implement the seriation code as
a function with the two formal arguments \code{x} and \code{control}, and for
arrays also an additional argument \code{margin}.
\code{x} is the data
object and \code{control} contains a list with additional information for the
method passed on from \func{seriate}. The function has to return a list
of objects which can be coerced into \code{ser\_permutation\_vector}
objects (e.g., a list of integer vectors). The elements in the list
have to be in order corresponding to the dimensions of \code{x}.
In this example we just create a method to return a permutation
which reverses the original order of the objects, i.e., which returns
the reverse identity order.
<<>>=
seriation_method_reverse <- function(x, control = NULL,
  margin = seq_along(dim(x))) {
  lapply(seq_along(dim(x)), function(i)
    if (i %in% margin) rev(seq_len(dim(x)[i]))
    else NA)
}
@
The function produces integer sequences of the correct lengths,
one for each dimension of \code{x} (\code{control} is not used).
Since the function works for \code{matrix} and \code{array}, we can register
it for both data types under the short name `New\_Reverse'.
<<>>=
set_seriation_method("matrix", "New_Reverse", seriation_method_reverse,
"Reverse identity order")
set_seriation_method("array", "New_Reverse", seriation_method_reverse,
"Reverse identity order")
@
Now the new seriation method is registered and can be found by the user
and applied to data.
<<>>=
list_seriation_methods("matrix")
o <- seriate(matrix(1, ncol = 3, nrow = 4), "New_Reverse")
o
get_order(o, 1)
get_order(o, 2)
@
Criterion methods can be added in the same way. We refer the interested reader
to the documentation accompanying the package for detailed information and
an example.
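
As a sketch of the analogous steps for criteria, the following registers a
hypothetical criterion for matrices which sums the diagonal entries of the
reordered matrix. The sketch assumes that \func{set\_criterion\_method}
takes the kind of input data, a name, the function, a description and a flag
\code{merit}, and that the criterion function receives the data \code{x} and
the permutation \code{order}; the criterion itself and its name are made up
for illustration.
<<eval=FALSE>>=
library("seriation")

## hypothetical criterion: sum of the diagonal of the reordered matrix
criterion_method_diag_sum <- function(x, order, ...) {
  if (!is.null(order)) x <- permute(x, order)
  sum(diag(x))
}

## register the function as a loss function for matrices
set_criterion_method("matrix", "New_DiagSum", criterion_method_diag_sum,
  description = "Sum of the diagonal entries", merit = FALSE)

m <- matrix(1:9, nrow = 3)
criterion(m, method = "New_DiagSum")
@
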
If you have implemented a new criterion or seriation method, please consider
submitting the code to one of the maintainers of \pkg{seriation} for
inclusion in a future release of the package.
\subsection{Heat maps}
A heat map is a shaded/color coded data matrix with a dendrogram added to one
side and to the top to indicate the order of rows and columns. Typically,
reordering is done according to row or column means within the restrictions
imposed by the dendrogram. Heat maps recently became popular for visualizing
large scale genome expression data obtained via DNA microarray technology
\citep[see, e.g.,][]{seriation:Eisen:1998}.
From Section~\ref{sec:hierarchical_clustering} we know that it is possible to
find the optimal ordering of the leaf nodes of a dendrogram which minimizes
the distances between adjacent objects in reasonable time. Such an order might
provide an improvement over using simple reordering such as the row or column
means with respect to presentation. In \pkg{seriation} we provide
the function \func{hmap} which uses optimal ordering and can also use
seriation directly on distance matrices without using hierarchical
clustering to produce dendrograms first.
For the following example, we use again the randomly reordered iris data set
\code{x} from the examples above. To make the variables (columns) comparable,
we use standard scaling.
<<>>=
x <- scale(x, center = FALSE)
@
To produce a heat map with optimally reordered dendrograms (using
Optimal Leaf Ordering by default), the function
\func{hmap} can be used with its default settings.
<<eval=FALSE>>=
hmap(x, margin = c(7, 4), cexCol = 1, row_labels = FALSE)
@
With these settings, the
Euclidean distances between rows and between columns are calculated (with
\func{dist}), hierarchical clustering (\func{hclust}) is performed, the
resulting dendrograms are optimally reordered, and \func{heatmap.2} in package
\pkg{gplots} is used for plotting
(see Figure~\ref{fig:heatmap}(a) for the resulting plot).
<<eval=FALSE>>=
hmap(x, method = "MDS")
@
If a seriation method is used that does not depend on dendrograms,
seriation is performed on the dissimilarity matrices for rows and columns
instead of hierarchical clustering, and the reordered matrix is displayed
with the reordered dissimilarity matrices to the left and on top
(see Figure~\ref{fig:heatmap}(b)). The \code{method} argument can be used to choose different
seriation methods.
<<echo=FALSE, fig=FALSE, include=FALSE>>=
#bitmap(file = "seriation-heatmap1.png", type = "pnggray",
# height = 6, width = 6, res = 300, pointsize=14)
pdf(file = "seriation-heatmap1.pdf")
hmap(x, margin = c(7, 4), row_labels = FALSE, cexCol = 1)
tmp <- dev.off()
@
<<echo=FALSE, fig=FALSE, include=FALSE>>=
pdf(file = "seriation-heatmap2.pdf")
hmap(x, method="MDS")
tmp <- dev.off()
@
\begin{figure}
\begin{minipage}[b]{.48\linewidth}
\centering
\includegraphics[width=\linewidth]{seriation-heatmap1} \\
(a)
\end{minipage}
\begin{minipage}[b]{.48\linewidth}
\centering
\includegraphics[width=\linewidth]{seriation-heatmap2} \\
(b)
\end{minipage}
\caption{Two presentations of the rearranged iris data matrix. (a) as an
optimally reordered heat map and (b) as a seriated data matrix with reordered
dissimilarity matrices to the left and on top.}
\label{fig:heatmap}
\end{figure}
\subsection{Bertin's permutation matrix}
\cite{seriation:Bertin:1981,seriation:Bertin:1999}
introduced permutation matrices to analyze
multivariate data with medium to low sample size. The idea is to reveal a more
homogeneous structure in a data matrix~$\mathbf{X}$ by simultaneously
rearranging rows and columns. The rearranged matrix is displayed and cases and
variables can be grouped manually to gain a better understanding of the data.
%To quantify homogeneity, a purity function
%\begin{displaymath}
% \phi = \Phi(\mathbf{X})
%\end{displaymath}
%is defined. Let $\Pi$ be the set of all permutation functions
%$\pi$ for matrix $\mathbf{X}$.
%Note that function $\pi$ performs row and column permutations on a matrix.
%The optimal permutation with respect to
%purity can be found by
%\begin{displaymath}
% \pi^* = \argmax\nolimits_{\pi \in \Pi} \Phi(\pi(\mathbf{X})).
%\end{displaymath}
%Since, depending on the purity function, finding the optimal
%solution can be hard, often a near optimal solution is also acceptable
%for visualization.
%
%A possible purity function $\Phi$ is:
%Given distances between rows and columns of the data matrix, define purity as
%the sum of distances of adjacent rows/columns. Using this purity function,
%finding the optimal permutation $\pi^*$ means solving two (independent) TSPs,
%one for the columns and one for the rows.
To find a rearrangement of columns and rows which reveals structure, a
purity function is used. A possible purity function is: Given distances between rows and columns of the data matrix, define purity as
the sum of distances of adjacent rows/columns. Using this purity function,
finding the optimal permutation means solving two (independent) TSPs,
one for the columns and one for the rows, which can be done very conveniently
using the infrastructure provided by \pkg{seriation}.
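
The purity function described above can be sketched in base \proglang{R} as
follows. The function names, the toy matrix and the use of Euclidean
distances (the default of \func{dist}) are assumptions made for this
illustration; a smaller sum means that adjacent rows and columns are more
similar.
<<eval=FALSE>>=
## sum of the distances between adjacent objects for a given order
adjacent_dist_sum <- function(d, ord) {
  m <- as.matrix(d)[ord, ord, drop = FALSE]
  n <- nrow(m)
  if (n < 2L) return(0)
  ## entries directly above the diagonal are the adjacent distances
  sum(m[cbind(seq_len(n - 1L), seq_len(n - 1L) + 1L)])
}

## purity of a matrix for given row and column orders
purity <- function(x, row_order, col_order, ...) {
  adjacent_dist_sum(dist(x, ...), row_order) +
    adjacent_dist_sum(dist(t(x), ...), col_order)
}

x_toy <- matrix(c(1, 5, 2, 6, 1, 5), nrow = 3)
purity(x_toy, 1:3, 1:2)         # original arrangement
purity(x_toy, c(1, 3, 2), 1:2)  # rows 2 and 3 exchanged
@
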
As an example, we use the results of $8$ constitutional referenda for $41$
Irish communities~\citep{seriation:Falguerolles:1997}\footnote{The Irish data
set is included in this package. The original data and the text of the
referenda can be obtained from~\url{http://www.electionsireland.org/}}. To
make values comparable across columns (variables), the ranks of the values for
each variable are used instead of the original values.
<<>>=
data("Irish")
orig_matrix <- apply(Irish[,-6], 2, rank)
@
For seriation, we calculate distances between rows and between columns using
the sum of absolute rank differences (this is equal to the Minkowski distance
with power $1$). Then we apply seriation (using a TSP heuristic) to both
distance matrices and combine the two resulting \code{ser\_permutation} objects
into one object for two-mode data. The original and the reordered matrix are
plotted using \func{bertinplot}.
<<>>=
o <- c(
seriate(dist(orig_matrix, "minkowski", p = 1), method = "TSP"),
seriate(dist(t(orig_matrix), "minkowski", p = 1), method = "TSP")
)
o
@
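
The quantity that the TSP heuristic tries to make small is the length of the
path that visits all objects in the given order, i.e., the sum of distances
between neighboring objects. The following minimal sketch computes it by hand
for a toy example; the helper \code{path\_length()} and the toy data are
hypothetical and not part of \pkg{seriation}.

<<eval=FALSE>>=
## Hypothetical helper (not part of seriation): length of the path that
## visits the objects in the given order (sum of adjacent distances).
path_length <- function(d, order) {
  m <- as.matrix(d)
  idx <- cbind(order[-length(order)], order[-1])
  sum(m[idx])
}

## Toy data: five points on a line, stored in shuffled order.
x <- c(a = 3, b = 1, c = 5, d = 2, e = 4)
d <- dist(x, "minkowski", p = 1)

path_length(d, seq_along(x))   # original order: 11
path_length(d, order(x))       # sorted order: 4
@

For the toy data, the shuffled order gives a path length of $11$, while the
sorted order gives $4$. For dissimilarities, the package offers this measure
via \func{criterion} with \code{method = "Path\_length"}.
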
In newer versions of the package, this can also be done in a single call using
the heatmap seriation method for matrices.
<<>>=
get_seriation_method("matrix", name = "heatmap")
o <- seriate(orig_matrix, method = "heatmap",
  dist_fun = function(d) dist(d, "minkowski", p = 1),
  seriation_method = "TSP")
o
@
<<eval=FALSE>>=
bertinplot(orig_matrix)
bertinplot(orig_matrix, o)
@
<<echo=FALSE, fig=TRUE, include=FALSE, label=bertin1, width=10>>=
bertinplot(orig_matrix)
@
<<echo=FALSE, fig=TRUE, include=FALSE, label=bertin2, width=10>>=
bertinplot(orig_matrix, o)
@
\begin{figure}
\centering
\includegraphics[width=15cm, trim=60 60 0 0]{seriation-bertin1} \\
(a)
\includegraphics[width=15cm, trim=60 60 0 0]{seriation-bertin2} \\
(b)
\caption{Bertin plot for the (a) original arrangement and the (b)
reordered Irish data set.}
\label{fig:bertin}
\end{figure}
The original matrix and the rearranged matrix are shown in
Figure~\ref{fig:bertin} as a matrix of bars where high values are highlighted
(filled blocks). Note that following Bertin, the cases (communities) are
displayed as the columns and the variables (referenda) as rows. Depending on
the number of cases and variables, columns and rows can be exchanged to obtain
a better visualization.
Although the columns are already ordered (communities in the same city
appear consecutively) in the original data matrix in
Figure~\ref{fig:bertin}(a), it takes some effort to find structure in
the data. For example, it seems that the variables `Marriage',
`Divorce', `Right to Travel' and `Right to Information' are correlated
since the values are all high in the block made up by the columns of the
communities in Dublin. The reordered matrix confirms this and makes the
structure much more apparent. In particular, the contribution of low values
(which are not highlighted) to the overall structure becomes visible only
after rearrangement.
\subsection{Binary data matrices}
Binary or $0$-$1$ data matrices are quite common. Often such matrices are
called \emph{incidence matrices} since a $1$ in a cell indicates the incidence
of an event. In archaeology such an event could be that a special type of
artifact was found at a certain archaeological site.
This can be seen as a simplification of a so-called \emph{abundance matrix}
which codes in each cell the (relative) frequency or quantity of an artifact
type at a site. See \cite{seriation:Ihm:2005} for a comparison of
incidence and abundance matrices in archaeology.
Here we are interested in binary data.
For the example we use an artificial data set from~\cite{seriation:Bertin:1981}
called \emph{Townships}. The data set contains $9$ binary characteristics
(e.g., has a veterinary or has a high school) for $16$ townships. The idea of
the data set is that townships evolve from a rural to an urban environment over
time.
After loading the data set (which comes with the package), we use
\func{bertinplot} to visualize the data (\func{pimage} could also be used
but \func{bertinplot} allows for a nicer visualization).
Bars, the standard visualization of \func{bertinplot}, do not
make much sense for binary data. We therefore use the
panel function \func{panel.tiles}
to plot filled tiles.
<<fig=TRUE, include=FALSE, label=binary1, width=9>>=
data("Townships")
bertinplot(Townships, panel = panel.tiles)
@
The original data in Figure~\ref{fig:binary}(a) does not reveal structure in
the data. To improve the display, we run the bond energy algorithm (BEA) for
columns and rows $10$ times with random starting points and report the best
solution.
<<echo=FALSE>>=
## to get consistent results
set.seed(10)
@
<<fig=TRUE, include=FALSE, label=binary2, width=9>>=
o <- seriate_rep(Townships, method = "BEA", criterion = "ME", rep = 10)
bertinplot(Townships, o, panel = panel.tiles)
@
The reordered matrix is displayed in Figure~\ref{fig:binary}(b). A
clear structure is visible. The variables (rows in a Bertin plot) can be
split into the three categories describing different evolution states of
townships:
\begin{enumerate}
\item Rural: No doctor, one-room school and possibly also
no water supply
\item Intermediate: Land reallocation, veterinary and agricultural
cooperative
\item Urban: Railway station, high school and police station
\end{enumerate}
The townships also clearly fall into these three groups, which
can tentatively be called villages (first~$7$), towns (next~$5$) and
cities (final~$2$). The townships B and C are on the transition to the
next higher group.
\begin{figure}
\centering
\includegraphics[width=12cm, trim=0 40 0 30]{seriation-binary1} \\
(a)
\includegraphics[width=12cm, trim=0 40 0 30]{seriation-binary2} \\
(b)
\caption{The townships data set in original order (a) and
reordered using BEA (b).}
\label{fig:binary}
\end{figure}
<<>>=
rbind(
original = criterion(Townships),
reordered = criterion(Townships, o)
)
@
BEA tries to maximize the measure of effectiveness (ME), which is
much higher in the reordered matrix (in fact, 65 is the maximum for
the data set). The two types of stress are also improved
substantially.
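
To make the measure of effectiveness concrete, the following sketch computes
it by hand, assuming the usual definition of ME as half the sum, over all
cells, of the cell value times the sum of its four neighbors (cells outside
the matrix count as $0$). This is equivalent to summing the products of all
horizontally and vertically adjacent pairs of cells. The helper \code{me()}
and the toy matrix are hypothetical and not part of \pkg{seriation}.

<<eval=FALSE>>=
## Hypothetical helper (not part of seriation): measure of effectiveness
## as the sum of products of all adjacent pairs of cells.
me <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)
  m <- ncol(x)
  horiz <- sum(x[, -m, drop = FALSE] * x[, -1, drop = FALSE])
  vert <- sum(x[-n, , drop = FALSE] * x[-1, , drop = FALSE])
  horiz + vert
}

## Toy data: a 4 x 4 checkerboard of zeros and ones.
x <- outer(1:4, 1:4, function(i, j) (i + j) %% 2 == 0) * 1
p <- c(1, 3, 2, 4)

me(x)         # checkerboard: 0
me(x[p, p])   # rows and columns reordered into two blocks: 8
@

The checkerboard has no adjacent ones and therefore an ME of $0$. After
reordering rows and columns, the ones form two $2 \times 2$ blocks and the ME
increases to $8$.
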
\subsection{Dissimilarity plot}
Assessing the quality of an obtained cluster solution has been a
research topic since the invention of cluster analysis. This is
especially important since all popular cluster algorithms produce a
clustering even for data without a ``cluster'' structure.
%A method to judge the quality of a cluster solution is by inspecting a
%visualization. For hierarchical clustering
%dendrogramms~\cite{seriation:Hartigan:1967} are available which show the
%hierarchical structure of the clustering as a binary tree and cluster quality
%can be judged by looking at the dissimilarities between objects in a cluster
%and objects in other clusters. However, such a visualization is
%only possible for heirarchical/nested clusterings.
%
%\marginpar{Cite Pison et al 1999 and Kaufmann and Rousseeuw}
%For the an arbitrary partitional clustering, the original objects can
%be displayed in a 2 dimensional scatter plot
%after using dimensionality reduction (e.g., PCA, MDS).
%Objects belonging to the same cluster can be marked and thus, if the
%dimensionality reduction preserves a large proportion of the
%variavility in the original data, the separation between clusters can be
%visually judged.
%
%Silhouettes
Matrix shading is an old technique to visualize clusterings by
displaying the rearranged matrices~\citep[see,
e.g.,][]{seriation:Sneath:1973,seriation:Ling:1973,seriation:Gale:1984}.
Initially matrix shading was used in connection with hierarchical
clustering, where the order of the dendrogram leaf nodes was used to
arrange the matrix. However, with some extensions, matrix shading can
also be used with any partitional clustering method.
\cite{seriation:Strehl:2003} suggest a matrix shading visualization called
\emph{CLUSION} where the dissimilarity matrix is arranged such that all objects
pertaining to a single cluster appear in consecutive order in the matrix. The
authors call this \emph{coarse seriation}. The result of a ``good'' clustering
should be a matrix with low dissimilarity values forming blocks around the main
diagonal. However, using coarse seriation, the order of the clusters has to be
predefined and the objects within each cluster are unordered.
The dissimilarity plots implemented by the function \func{dissplot} in
\pkg{seriation} improve \emph{CLUSION} using seriation methods. They aim
at visualizing global structure (similarity between different clusters
is reflected by their position relative to each other) as well as the
micro structure within each cluster (position of objects).
To position the clusters in the dissimilarity plot, an inter-cluster
dissimilarity matrix is calculated using the average between-cluster
dissimilarities. \func{seriate} is applied to this inter-cluster dissimilarity
matrix to arrange the clusters relative to each other, so that clusters which
are on average more similar appear closer together in the plot. Within each
cluster, \func{seriate} is applied again to the sub-matrix of the dissimilarity
matrix that contains only the objects in the cluster.
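
The first step, computing average between-cluster dissimilarities, can be
sketched in base \proglang{R} as follows. The helper
\code{avg\_cluster\_diss()} and the toy data are hypothetical; the
implementation in \func{dissplot} may differ in its details.

<<eval=FALSE>>=
## Hypothetical helper (not part of seriation): average dissimilarity
## between the members of each pair of clusters. The diagonal holds the
## within-cluster averages (including the zero self-dissimilarities).
avg_cluster_diss <- function(d, labels) {
  m <- as.matrix(d)
  cl <- sort(unique(labels))
  res <- sapply(cl, function(a) sapply(cl, function(b)
    mean(m[labels == a, labels == b, drop = FALSE])))
  dimnames(res) <- list(cl, cl)
  res
}

## Toy data: six points on a line in three clusters.
x <- c(1, 2, 10, 11, 5, 6)
labels <- c(1, 1, 2, 2, 3, 3)
cd <- avg_cluster_diss(dist(x), labels)

cd["1", "2"]   # 9
cd["1", "3"]   # 4
cd["2", "3"]   # 5
@

In this toy example, clusters 1 and 2 are on average farther from each other
than either is from cluster 3, so an order such as 1, 3, 2 places similar
clusters next to each other.
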
For the example, we again use Euclidean distances between the objects in the
iris data set.
<<>>=
data("iris")
iris <- iris[sample(seq_len(nrow(iris))), ]
x_iris <- iris[, -5]
d_iris <- dist(x_iris, method = "euclidean")
@
First, we use \func{dissplot} without a clustering. We set \code{method} to
\code{NA} to prevent reordering and display the original matrix (see
Figure~\ref{fig:dissplot1}(a)). Then we omit the \code{method} argument, which
results in using the default seriation technique of \func{seriate}. Since we
did not provide a clustering, the whole matrix is reordered
in one piece. The result shown in Figure~\ref{fig:dissplot1}(b) reveals
a clear structure in the data which suggests a two-cluster
solution.
<<eval=FALSE, label=dissplot1>>=
## plot original matrix
dissplot(d_iris, method = NA)
@
<<eval=FALSE, label=dissplot2>>=
## plot reordered matrix
dissplot(d_iris, main = "Dissimilarity plot with seriation")
@
<<echo=FALSE, fig=FALSE, include=FALSE>>=
pdf(file = "seriation-dissplot1.pdf")
<<dissplot1>>
tmp <- dev.off()
pdf(file = "seriation-dissplot2.pdf")
<<dissplot2>>
tmp <- dev.off()
@
\begin{figure}
\begin{minipage}[b]{.48\linewidth}
\centering
\includegraphics[width=\linewidth]{seriation-dissplot1} \\
(a)
\end{minipage}
\begin{minipage}[b]{.48\linewidth}
\centering
\includegraphics[width=\linewidth]{seriation-dissplot2} \\
(b)
\end{minipage}
\caption{Two dissimilarity plots.
(a) the original dissimilarity matrix and
(b) the seriated dissimilarity matrix.}
\label{fig:dissplot1}
\end{figure}
Next, we create a cluster solution using the $k$-means algorithm.
Although we know
that the data set should contain $3$ groups representing the three species
of iris, we let $k$-means produce a $10$ cluster solution to study how such a
misspecification can be spotted using \func{dissplot}.
<<echo=FALSE>>=
set.seed(1234)
@
<<>>=
l <- kmeans(x_iris, 10)$cluster
#$
@
We create a standard dissimilarity plot by providing the cluster
solution as a vector of labels. The function rearranges the matrix and
plots the result. Since rearrangement can be time-consuming for
large matrices, the rearranged matrix and all
information needed for plotting are returned as the result.
<<eval=FALSE, label=dissplot3>>=
res <- dissplot(d_iris, labels = l,
main = "Dissimilarity plot - standard")
@
<<echo=FALSE, fig=FALSE, include=FALSE>>=
pdf(file = "seriation-dissplot3.pdf")
## visualize the clustering
<<dissplot3>>
tmp <- dev.off()
pdf(file = "seriation-dissplot4.pdf")
## threshold
plot(res, main = "Dissimilarity plot - threshold",
threshold = 3)
tmp <- dev.off()
@
\begin{figure}
\centering
\includegraphics[width=10cm]{seriation-dissplot3}\\
(a)
\includegraphics[width=10cm]{seriation-dissplot4}\\
(b)
\caption{Dissimilarity plot for $k$-means solution with 10 clusters.
(a) standard plot and (b) plot with threshold.}
\label{fig:dissplot3}
\end{figure}
<<>>=
res
@
The resulting plot is shown in Figure~\ref{fig:dissplot3}(a). The
inter-cluster dissimilarities are shown as solid
gray blocks and the average object
dissimilarity within each cluster as gray triangles below the main diagonal of
the matrix. Since the clusters are arranged such that more similar clusters
are closer together, it is easy to see in Figure~\ref{fig:dissplot3}(a)
that clusters 6, 3 and 1 as well as clusters 10, 9, 5, 7, 8, 4 and 2
are very similar and form two blocks.
This again suggests that a two-cluster solution would
be reasonable.
Since slight variations of gray values are hard to distinguish,
we plot the matrix again (using \func{plot} on the result above) and
use a threshold on the dissimilarity to suppress high dissimilarity
values in the plot.
<<eval=FALSE>>=
plot(res, main = "Dissimilarity plot - threshold",
  threshold = 3)
@
In the resulting plot in Figure~\ref{fig:dissplot3}(b), we see that the
block containing clusters 10, 9, 5, 7, 8, 4 and 2 is very well defined and
cleanly separated from the other block. This suggests that these clusters
should together form a single cluster in a solution with fewer clusters.
The other block is less well defined. There is considerable overlap between
clusters 6 and 3, and clusters 3 and 1 also share similar objects.
Using the information stored in the result of \func{dissplot} and
the class information available for the iris data set, we can analyze
the cluster solution and the interpretations of the dissimilarity plot.
<<>>=
#names(res)
table(iris[res$order, 5], res$label)[,res$cluster_order]
#$
@
As the plot in Figure~\ref{fig:dissplot3} indicated, the clusters 10, 9, 5, 7,
8, 4 and 2 should be a single cluster containing only flowers of the species
Iris Setosa. The clusters 6, 3 and 1 are more problematic since they contain a
mixture of Iris Versicolor and Virginica.
To illustrate the results of the dissimilarity plot in case a clustering
with a $k$ smaller than the actual number of groups in the data is used,
we use the Ruspini data set which consists of 75 points in four groups and
is also often used to illustrate clustering techniques. We load the data set,
calculate distances, perform $k$-means clustering with $k=3$ (although the
real number of groups is 4) and produce
a dissimilarity plot.
<<label=ruspini, fig=TRUE, include=FALSE>>=
data("ruspini", package = "cluster")
d <- dist(ruspini)
l <- kmeans(ruspini, 3)$cluster
dissplot(d, labels = l)
@
\begin{figure}
\centering
\includegraphics[width=10cm]{seriation-ruspini}\\
\caption{Dissimilarity plot for $k$-means solution with 3 clusters
for the Ruspini data set with 4 groups.}
\label{fig:ruspini}
\end{figure}
The dissimilarity plot in Figure~\ref{fig:ruspini} shows that
cluster 3 actually should be two separate clusters represented
by the two clearly visible darker triangles next to the main diagonal.
The dissimilarity plot using seriation is a useful tool to inspect the result
of clustering. It is especially useful to spot
misspecifications of the number of clusters employed.
A more detailed treatment of dissimilarity plots as a tool for exploring
partitional clustering can be found in
\cite{seriation:Hahsler+Kornik:2011}.
\section{Conclusion}
\label{sec:conclusion}
In this paper we presented the infrastructure provided by the
package~\pkg{seriation}. The infrastructure contains the necessary data
structures to store the linear order for one-, two- and $k$-mode data.
It also provides a wide array of seriation methods for different input
data, e.g., dissimilarities, binary and general data matrices focusing
on combinatorial optimization. New seriation methods can be easily
incorporated into the \pkg{seriation} framework by the user with the
method registry mechanism provided.
Based on seriation, \pkg{seriation} features several visualization
techniques. In particular, the optimally reordered heat map, the
Bertin plot and the dissimilarity plot present clear improvements over
standard plots.
A natural extension to \pkg{seriation} is the synthesis of ensembles of
seriations into a ``consensus'' one. Such ensembles arise not only
when using different seriation methods, but also when varying data or
control parameters to obtain more robust solutions (see,
e.g.,~\cite{seriation:Jurman:2008} for an application of such ideas
in a molecular profiling context). The \proglang{R}~extension package
\pkg{relations}~\citep{seriation:Hornik+Meyer:2008} contains a variety
of methods for obtaining consensus \emph{relations}, covering consensus
seriation (where the relations are linear orders on the objects) as a
special case.
Future work on \pkg{seriation} will focus on adding further seriation
methods, such as methods for higher-dimensional arrays and
methods for block seriation, which aim at finding simultaneous partitions
of rows and columns in a data matrix~\citep[see,
e.g.,][]{seriation:Marcotorchino:1987}.
\section*{Acknowledgments}
The authors would like to thank Michael Brusco, Hans-Friedrich K{\"o}hn
and Stephanie Stahl for their seriation code, Fionn Murtagh for his BEA
implementation and the anonymous reviewers for their valuable comments
and suggestions.
%
%\bibliographystyle{abbrvnat}
\bibliography{seriation}
%
\end{document}