2025-01-12 04:36:52 +08:00

619 lines
21 KiB
Plaintext

%\VignetteIndexEntry{An Overview of the IRanges package}
%\VignetteDepends{IRanges}
%\VignetteKeywords{Ranges,IntegerRanges,IRanges,IRangesList,Views,AtomicList}
%\VignettePackage{IRanges}
\documentclass{article}
\usepackage[authoryear,round]{natbib}
<<style, echo=FALSE, results=tex>>=
BiocStyle::latex(use.unsrturl=FALSE)
@
\title{An Overview of the \Biocpkg{IRanges} package}
\author{Patrick Aboyoun, Michael Lawrence, Herv\'e Pag\`es}
\date{Edited: February 2018; Compiled: \today}
\begin{document}
\maketitle
\tableofcontents
<<options,echo=FALSE>>=
options(width=72)
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
When analyzing sequences, we are often interested in particular
consecutive subsequences. For example, the \Sexpr{letters} vector could
be considered a sequence of lower-case letters, in alphabetical order.
We would call the first five letters (\textit{a} to \textit{e}) a
consecutive subsequence, while the subsequence containing only the
vowels would not be consecutive. It is not uncommon for an analysis
task to focus only on the geometry of the regions, while
ignoring the underlying sequence values. A list of indices would be a
simple way to select a subsequence. However, a sparser representation
for consecutive subsequences would be a range, a pairing of a start
position and a width, as used when extracting sequences with
\Rfunction{window}.
Two central classes are available in Bioconductor for representing
ranges: the \Rclass{IRanges} class defined in the \Biocpkg{IRanges}
package for representing ranges defined on a single space, and the
\Rclass{GRanges} class defined in the \Biocpkg{GenomicRanges} package
for representing ranges defined on multiple spaces.
In this vignette, we will focus on \Rclass{IRanges} objects. We will rely
on simple, illustrative example datasets, rather than large, real-world data,
so that each data structure and algorithm can be explained in an intuitive,
graphical manner. We expect that packages that apply \Biocpkg{IRanges} to
a particular problem domain will provide vignettes with relevant, realistic
examples.
The \Biocpkg{IRanges} package is available at bioconductor.org and can be
downloaded via \Rfunction{BiocManager::install}:
<<install, eval=FALSE>>=
if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("IRanges")
@
<<initialize, results=hide>>=
library(IRanges)
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{\Rclass{IRanges} objects}
To construct an \Rclass{IRanges} object, we call the \Rfunction{IRanges}
constructor. Ranges are normally specified by passing two out of the three
parameters: start, end and width (see \Rcode{help(IRanges)} for more
information).
%
<<iranges-constructor>>=
ir1 <- IRanges(start=1:10, width=10:1)
ir1
ir2 <- IRanges(start=1:10, end=11)
ir3 <- IRanges(end=11, width=10:1)
identical(ir1, ir2) && identical(ir1, ir3)
ir <- IRanges(c(1, 8, 14, 15, 19, 34, 40),
width=c(12, 6, 6, 15, 6, 2, 7))
ir
@
%
All of the above calls construct the same \Rclass{IRanges} object, using
different combinations of the \Rcode{start}, \Rcode{end} and
\Rcode{width} parameters.
Accessing the starts, ends and widths is supported via the \Rfunction{start},
\Rfunction{end} and \Rfunction{width} getters:
<<iranges-start>>=
start(ir)
@
<<iranges-end>>=
end(ir)
@
<<iranges-width>>=
width(ir)
@
Subsetting an \Rclass{IRanges} object is supported, by numeric and logical
indices:
<<iranges-subset-numeric>>=
ir[1:4]
@
<<iranges-subset-logical>>=
ir[start(ir) <= 15]
@
In order to illustrate range operations, we'll create a function to plot
ranges.
<<plotRanges>>=
plotRanges <- function(x, xlim=x, main=deparse(substitute(x)),
col="black", sep=0.5, ...)
{
height <- 1
if (is(xlim, "IntegerRanges"))
xlim <- c(min(start(xlim)), max(end(xlim)))
bins <- disjointBins(IRanges(start(x), end(x) + 1))
plot.new()
plot.window(xlim, c(0, max(bins)*(height + sep)))
ybottom <- bins * (sep + height) - height
rect(start(x)-0.5, ybottom, end(x)+0.5, ybottom + height, col=col, ...)
title(main)
axis(1)
}
@
<<ir-plotRanges, fig=TRUE, include=FALSE, eps=FALSE, height=2.25>>=
plotRanges(ir)
@
\begin{figure}[tb]
\begin{center}
\includegraphics[width=0.5\textwidth]{IRangesOverview-ir-plotRanges}
\caption{\label{fig-ir-plotRanges}%
Plot of original ranges.}
\end{center}
\end{figure}
\subsection{Normality}
Sometimes, it is necessary to formally represent a
subsequence, where no elements are repeated and order is
preserved. Also, it is occasionally useful to think of an
\Rclass{IRanges} object as a set of integers, where no elements are
repeated and order does not matter.
The \Rclass{NormalIRanges} class formally represents a set of integers.
By definition an \Rclass{IRanges} object is said to be \textit{normal}
when its ranges are: (a) not empty (i.e. they have a non-null width);
(b) not overlapping; (c) ordered from left to right; (d) not even
adjacent (i.e. there must be a non empty gap between 2 consecutive
ranges).
There are three main advantages of using a \textit{normal}
\Rclass{IRanges} object: (1) it guarantees a subsequence encoding
or set of integers, (2) it is compact in terms of the number
of ranges, and (3) it uniquely identifies its information, which
simplifies comparisons.
The \Rfunction{reduce} function reduces any \Rclass{IRanges}
object to a \Rclass{NormalIRanges} by merging redundant ranges.
<<ranges-reduce, fig=TRUE, include=FALSE, eps=FALSE, height=2.25>>=
reduce(ir)
plotRanges(reduce(ir))
@
\begin{figure}[tb]
\begin{center}
\includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-reduce}
\caption{\label{fig-ranges-reduce}%
Plot of reduced ranges.}
\end{center}
\end{figure}
\subsection{Lists of \Rclass{IRanges} objects}
It is common to manipulate collections of \Rclass{IRanges} objects
during an analysis. Thus, the \Biocpkg{IRanges} package defines some
specific classes for working with multiple \Rclass{IRanges} objects.
The \Rclass{IRangesList} class asserts that each element is an
\Rclass{IRanges} object and provides convenience methods, such
as \Rfunction{start}, \Rfunction{end} and \Rfunction{width} accessors
that return \Rclass{IntegerList} objects, aligning with the
\Rclass{IRangesList} object. Note that \Rclass{IntegerList} objects
will be covered later in more details in the ``Lists of Atomic Vectors''
section of this document.
To explicitly construct an \Rclass{IRangesList}, use the
\Rfunction{IRangesList} function.
<<rangeslist-contructor>>=
rl <- IRangesList(ir, rev(ir))
@
%
<<rangeslist-start>>=
start(rl)
@
\subsection{Vector Extraction}
As the elements of an \Rclass{IRanges} object encode consecutive
subsequences, they may be used directly in sequence extraction. Note
that when a \textit{normal} \Rclass{IRanges} is given as the index,
the result is a subsequence, as no elements are repeated or reordered.
If the sequence is a \Rclass{Vector} subclass (i.e. not an ordinary
\Rclass{vector}), the canonical \Rfunction{[} function accepts an
\Rclass{IRanges} object.
%
<<bracket-ranges>>=
set.seed(0)
lambda <- c(rep(0.001, 4500), seq(0.001, 10, length=500),
seq(10, 0.001, length=500))
xVector <- rpois(1e7, lambda)
yVector <- rpois(1e7, lambda[c(251:length(lambda), 1:250)])
xRle <- Rle(xVector)
yRle <- Rle(yVector)
irextract <- IRanges(start=c(4501, 4901) , width=100)
xRle[irextract]
@
%
\subsection{Finding Overlapping Ranges}
The function \Rfunction{findOverlaps} detects overlaps between two
\Rclass{IRanges} objects.
<<overlap-ranges>>=
ol <- findOverlaps(ir, reduce(ir))
as.matrix(ol)
@
\subsection{Counting Overlapping Ranges}
The function \Rfunction{coverage} counts the number of ranges over each
position.
<<ranges-coverage, fig=TRUE, include=FALSE, eps=FALSE, height=2.25>>=
cov <- coverage(ir)
plotRanges(ir)
cov <- as.vector(cov)
mat <- cbind(seq_along(cov)-0.5, cov)
d <- diff(cov) != 0
mat <- rbind(cbind(mat[d,1]+1, mat[d,2]), mat)
mat <- mat[order(mat[,1]),]
lines(mat, col="red", lwd=4)
axis(2)
@
\begin{figure}[tb]
\begin{center}
\includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-coverage}
\caption{\label{fig-ranges-coverage}%
Plot of ranges with accumulated coverage.}
\end{center}
\end{figure}
\subsection{Finding Neighboring Ranges}
The \Rfunction{nearest} function finds the nearest neighbor ranges (overlapping
is zero distance). The \Rfunction{precede} and \Rfunction{follow} functions
find the non-overlapping nearest neighbors on a specific side.
\subsection{Transforming Ranges}
Utilities are available for transforming an \Rclass{IRanges} object
in a variety of ways. Some transformations, like \Rfunction{reduce}
introduced above, can be dramatic, while others are simple per-range
adjustments of the starts, ends or widths.
\subsubsection{Adjusting starts, ends and widths}
Perhaps the simplest transformation is to adjust the start values by a
scalar offset, as performed by the \Rfunction{shift} function. Below,
we shift all ranges forward 10 positions.
%
<<ranges-shift>>=
shift(ir, 10)
@
There are several other ways to transform ranges. These include
\Rfunction{narrow}, \Rfunction{resize}, \Rfunction{flank},
\Rfunction{reflect}, \Rfunction{restrict}, and \Rfunction{threebands}.
For example \Rfunction{narrow} supports the adjustment of start, end
and width values, which should be relative to each range. These adjustments
are vectorized over the ranges. As its name suggests, the ranges can only be
narrowed.
%
<<ranges-narrow>>=
narrow(ir, start=1:5, width=2)
@
The \Rfunction{restrict} function ensures every range falls within a
set of bounds. Ranges are contracted as necessary, and the ranges that
fall completely outside of but not adjacent to the bounds are dropped,
by default.
%
<<ranges-restrict>>=
restrict(ir, start=2, end=3)
@
The \Rfunction{threebands} function extends \Rfunction{narrow} so that
the remaining left and right regions adjacent to the narrowed region
are also returned in separate \Rclass{IRanges} objects.
%
<<ranges-threebands>>=
threebands(ir, start=1:5, width=2)
@
The arithmetic operators \Rfunction{+}, \Rfunction{-} and
\Rfunction{*} change both the start and the end/width by symmetrically
expanding or contracting each range. Adding or subtracting a numeric
(integer) vector to an \Rclass{IRanges} causes each range to be
expanded or contracted on each side by the corresponding value in the
numeric vector.
<<ranges-plus>>=
ir + seq_len(length(ir))
@
%
The \Rfunction{*} operator symmetrically magnifies an \Rclass{IRanges}
object by a factor, where positive contracts (zooms in) and negative
expands (zooms out).
%
<<ranges-asterisk>>=
ir * -2 # double the width
@
WARNING: The semantic of these arithmetic operators might be revisited
at some point. Please restrict their use to the context of interactive
visualization (where they arguably provide some convenience) but avoid
to use them programmatically.
\subsubsection{Making ranges disjoint}
A more complex type of operation is making a set of ranges disjoint,
\textit{i.e.} non-overlapping. For example, \Rfunction{threebands}
returns a disjoint set of three ranges for each input range.
The \Rfunction{disjoin} function makes an \Rclass{IRanges} object
disjoint by fragmenting it into the widest ranges where the set of
overlapping ranges is the same.
<<ranges-disjoin, fig=TRUE, include=FALSE, eps=FALSE, height=2.25>>=
disjoin(ir)
plotRanges(disjoin(ir))
@
\begin{figure}[tb]
\begin{center}
\includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-disjoin}
\caption{\label{fig-ranges-disjoin}%
Plot of disjoined ranges.}
\end{center}
\end{figure}
A variant of \Rfunction{disjoin} is \Rfunction{disjointBins}, which
divides the ranges into bins, such that the ranges in each bin are
disjoint. The return value is an integer vector of the bins.
<<ranges-disjointBins>>=
disjointBins(ir)
@
\subsubsection{Other transformations}
Other transformations include \Rfunction{reflect} and
\Rfunction{flank}. The former ``flips'' each range within a set of common
reference bounds.
<<ranges-reflect>>=
reflect(ir, IRanges(start(ir), width=width(ir)*2))
@
%
The \Rfunction{flank} returns ranges of a specified width that flank,
to the left (default) or right, each input range. One use case of this
is forming promoter regions for a set of genes.
<<ranges-flank>>=
flank(ir, width=seq_len(length(ir)))
@
%
\subsection{Set Operations}
Sometimes, it is useful to consider an \Rclass{IRanges} object as a
set of integers, although there is always an implicit ordering. This is
formalized by \Rclass{NormalIRanges}, above, and we now present
versions of the traditional mathematical set operations
\textit{complement}, \textit{union}, \textit{intersect}, and
\textit{difference} for \Rclass{IRanges} objects. There are two
variants for each operation. The first treats each \Rclass{IRanges}
object as a set and returns a \textit{normal} value, while the other
has a ``parallel'' semantic like \Rfunction{pmin}/\Rfunction{pmax} and
performs the operation for each range pairing separately.
The \textit{complement} operation is implemented by the
\Rfunction{gaps} and \Rfunction{pgap} functions. By default,
\Rfunction{gaps} will return the ranges that fall between the ranges
in the (normalized) input. It is possible to specify a set of bounds,
so that flanking ranges are included.
<<ranges-gaps, fig=TRUE, include=FALSE, eps=FALSE, height=2.25>>=
gaps(ir, start=1, end=50)
plotRanges(gaps(ir, start=1, end=50), c(1,50))
@
\begin{figure}[tb]
\begin{center}
\includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-gaps}
\caption{\label{fig-ranges-gap}%
Plot of gaps from ranges.}
\end{center}
\end{figure}
\Rfunction{pgap} considers each parallel pairing between two
\Rclass{IRanges} objects and finds the range, if any, between
them. Note that the function name is singular, suggesting that only
one range is returned per range in the input.
<<ranges-pgap>>=
@
The remaining operations, \textit{union}, \textit{intersect} and
\textit{difference} are implemented by the \Rfunction{[p]union},
\Rfunction{[p]intersect} and \Rfunction{[p]setdiff} functions,
respectively. These are relatively self-explanatory.
<<ranges-union>>=
@
<<ranges-punion>>=
@
<<ranges-intersect>>=
@
<<ranges-pintersect>>=
@
<<ranges-setdiff>>=
@
<<ranges-psetdiff>>=
@
% \subsection{Mapping Ranges Between Vectors}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Vector Views}
The \Biocpkg{IRanges} package provides the virtual \Rclass{Views} class,
which stores a vector-like object, referred to as the ``subject'', together
with an \Rclass{IRanges} object defining ranges on the subject. Each range
is said to represent a \textit{view} onto the subject.
Here, we will demonstrate the \Rclass{RleViews} class, where the subject
is of class \Rclass{Rle}. Other \Rclass{Views} implementations exist,
such as \Rclass{XStringViews} in the \Biocpkg{Biostrings} package.
\subsection{Creating Views}
There are two basic constructors for creating views: the \Rfunction{Views}
function based on indicators and the \Rfunction{slice} based on numeric
boundaries.
<<Views-constructors>>=
xViews <- Views(xRle, xRle >= 1)
xViews <- slice(xRle, 1)
xRleList <- RleList(xRle, 2L * rev(xRle))
xViewsList <- slice(xRleList, 1)
@
Note that \Rclass{RleList} objects will be covered later in more details
in the ``Lists of Atomic Vectors'' section of this document.
\subsection{Aggregating Views}
While \Rfunction{sapply} can be used to loop over each window, the native
functions \Rfunction{viewMaxs}, \Rfunction{viewMins}, \Rfunction{viewSums},
and \Rfunction{viewMeans} provide fast looping to calculate their respective
statistical summaries.
<<views-looping>>=
head(viewSums(xViews))
viewSums(xViewsList)
head(viewMaxs(xViews))
viewMaxs(xViewsList)
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Lists of Atomic Vectors}
In addition to the range-based objects described in the previous sections,
the \Biocpkg{IRanges} package provides containers for storing lists of
atomic vectors such as \Rclass{integer} or \Rclass{Rle} objects.
The \Rclass{IntegerList} and \Rclass{RleList} classes represent lists of
\Rclass{integer} vectors and \Rclass{Rle} objects respectively.
They are subclasses of the \Rclass{AtomicList} virtual class which is
itself a subclass of the \Rclass{List} virtual class defined in the
\Biocpkg{S4Vectors} package.
<<AtomicList-intro>>=
showClass("RleList")
@
As the class definition above shows, the \Rclass{RleList} class is virtual
with subclasses \Rclass{SimpleRleList} and \Rclass{CompressedRleList}. A
\Rclass{SimpleRleList} class uses an ordinary \R{} list to store the
underlying elements and the \Rclass{CompressedRleList} class stores the
elements in an unlisted form and keeps track of where the element breaks
are. The former ``simple list" class is useful when the \Rclass{Rle} elements
are long and the latter ``compressed list" class is useful when the list is
long and/or sparse (i.e. a number of the list elements have length 0).
In fact, all of the atomic vector types (logical, integer, numeric,
complex, character, raw, and factor) have similar list classes that derive
from the \Rclass{List} virtual class. For example, there is an
\Rclass{IntegerList} virtual class with subclasses
\Rclass{SimpleIntegerList} and \Rclass{CompressedIntegerList}.
Each of the list classes for atomic sequences, be they stored as vectors
or \Rclass{Rle} objects, have a constructor function with a name of the
appropriate list virtual class, such as \Rclass{IntegerList}, and an
optional argument \Rcode{compress} that takes an argument to specify
whether or not to create the simple list object type or the compressed
list object type. The default is to create the compressed list object
type.
<<list-construct>>=
args(IntegerList)
cIntList1 <- IntegerList(x=xVector, y=yVector)
cIntList1
sIntList2 <- IntegerList(x=xVector, y=yVector, compress=FALSE)
sIntList2
## sparse integer list
xExploded <- lapply(xVector[1:5000], function(x) seq_len(x))
cIntList2 <- IntegerList(xExploded)
sIntList2 <- IntegerList(xExploded, compress=FALSE)
object.size(cIntList2)
object.size(sIntList2)
@
The \Rfunction{length} function returns the number of elements in a
\Rclass{Vector}-derived object and, for a \Rclass{List}-derived object
like ``simple list" or ``compressed list", the \Rfunction{lengths}
function returns an integer vector containing the lengths of each of the
elements:
<<list-length>>=
length(cIntList2)
Rle(lengths(cIntList2))
@
Just as with ordinary \R{} \Rclass{list} objects, \Rclass{List}-derived
object support the \Rfunction{[[} for element extraction, \Rfunction{c}
for concatenating, and \Rfunction{lapply}/\Rfunction{sapply} for looping.
When looping over sparse lists, the ``compressed list" classes can be much
faster during computations since only the non-empty elements are looped over
during the \Rfunction{lapply}/\Rfunction{sapply} computation and all the empty
elements are assigned the appropriate value based on their status.
<<list-lapply>>=
system.time(sapply(xExploded, mean))
system.time(sapply(sIntList2, mean))
system.time(sapply(cIntList2, mean))
identical(sapply(xExploded, mean), sapply(sIntList2, mean))
identical(sapply(xExploded, mean), sapply(cIntList2, mean))
@
Unlike ordinary \R{} \Rclass{list} objects, \Rclass{AtomicList} objects support
the \Rfunction{Ops} (e.g. \Rfunction{+}, \Rfunction{==}, \Rfunction{\&}),
\Rfunction{Math} (e.g. \Rfunction{log}, \Rfunction{sqrt}),
\Rfunction{Math2} (e.g. \Rfunction{round}, \Rfunction{signif}),
\Rfunction{Summary} (e.g. \Rfunction{min}, \Rfunction{max}, \Rfunction{sum}),
and \Rfunction{Complex} (e.g. \Rfunction{Re}, \Rfunction{Im}) group generics.
<<list-groupgenerics>>=
xRleList > 0
yRleList <- RleList(yRle, 2L * rev(yRle))
xRleList + yRleList
sum(xRleList > 0 | yRleList > 0)
@
Since these atomic lists inherit from \Rclass{List}, they can also use
the looping function \Rfunction{endoapply} to perform endomorphisms.
<<list-endoapply>>=
safe.max <- function(x) { if(length(x)) max(x) else integer(0) }
endoapply(sIntList2, safe.max)
endoapply(cIntList2, safe.max)
endoapply(sIntList2, safe.max)[[1]]
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Session Information}
Here is the output of \Rcode{sessionInfo()} on the system on which this
document was compiled:
<<SessionInfo, echo=FALSE>>=
sessionInfo()
@
\end{document}