clustuneR
Implementing clustering algorithms on genetic data and finding optimal parameters through the performance of predictive growth models.
clustuneR builds clusters from inputted sequence alignments and/or phylogenetic trees, allowing users to choose between multiple cluster-building algorithms implemented in the package. These algorithms can be further augmented through the selection of parameters, such as a required similarity for cluster formation, or a required level of certainty. The package also takes in meta-data associated with sequences such as a known collection date or subtype/variant classification. These can also allow users to identify cluster-level characteristics, such as the range of collection dates or the most common subtype/variant within a cluster.
If a subset of sequences are specified as “New”, then clustuneR simulates cluster growth by building clusters in two stages: first clusters are built from sequences which are not specified as new, then the new sequences are added to clusters. Depending on the clustering method used, this second step may include compromises to insure that new sequences do not retroactively change the membership of clusters. For example, if a single new sequence forms a cluster with two, previously separate clusters, then those two clusters would have ambiguous growth. Pairing cluster-level meta-data, with the growth of clusters is a common goal in research and clustuneR contains some functions to help test predictive models based on cluster data. Furthermore, clustuneR facilitates the assignment of multiple cluster sets from the same data using different methods and parameters. Pairing these with the effectiveness of growth models can be useful in method/parameter selection.
Installation
Because clustuneR uses pplacer to graft new sequences onto a phylogenetic tree, it can currently only be run on Linux systems.
If you have the git
version control system
installed on your computer, you can clone the repository by navigating
to a location of your filesystem where the package will be copied, and
then running
git clone https://github.com/PoonLab/clustuneR.git
If you do not have git
installed, then you can download the most
recent (developmental version) package as a ZIP archive at this link:
https://github.com/PoonLab/clustuneR/archive/refs/heads/master.zip
or from the Releases page: https://github.com/PoonLab/clustuneR/releases
If you have downloaded a .zip
or .tar.gz
archive, you can use
unzip
or tar -zvxf
on the command line, or double-click on the
archive file in your desktop environment.
Use cd clustuneR
to enter the package directory and run the following
command to install the package into R:
R CMD INSTALL .
You should see something like this on your console:
* installing to library ‘/Library/Frameworks/R.framework/Versions/4.0/Resources/library’
* installing *source* package ‘clustuneR’ ...
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (clustuneR)
Usage example
Building a tree
We start with a multiple sequence alignment of sequences that are
labelled with sample collection dates. An example of anonymized public
domain HIV-1 sequences from a study based in northern Alberta (Canada)
is provided in data/na.fasta
. First, we use an R script to exclude the
sequences collected in the most recent year:
require(clustuneR)
require(ape)
require(lubridate)
setwd("~/git/clustuneR")
seqs <- ape::read.FASTA("data/na.fasta", type="DNA")
# parse sequence headers (alternatively import from another file)
seq.info <- pull.headers(seqs, sep="_", var.names=c('accession', 'coldate', 'subtype'),
var.transformations=c(as.character, as.Date, as.factor))
max.year <- max(year(seq.info$coldate))
old.seqs <- seqs[year(seq.info$coldate) < max.year]
write.FASTA(old.seqs, file="data/na-old.fasta")
Next, we use IQ-TREE to reconstruct a maximum likelihood tree relating the “old” sequences:
iqtree -bb 1000 -m GTR -nstop 200 -s na-old.fasta
Note we’ve specified the generalized time reversible model of nucleotide
substitution to bypass the model selection stage. Even so, this is a
time-consuming step - to speed things up, we’ve provided IQ-TREE output
files at data/na.nwk
and data/na.log
.
Grafting new sequences
Next, we import both the sequence alignment and the ML tree into R. We
will use clustuneR
to graft the sequences from the most recent year
using the program pplacer
and the output files from IQ-TREE.
phy <- ape::read.tree("data/na.nwk")
# use pplacer to graft new sequences onto old tree
phy.extend <- extend.tree(phy, seq.info, seqs, mc.cores=4, log.file="data/na.log")
Finding the optimal threshold
Next, we want to configure clustuneR
to fit two Poisson regression
models to the distribution of new cases among clusters, for a range of
genetic distance thresholds:
# generate cluster sets under varying parameter settings
param.list <- lapply(seq(0.001, 0.04, 0.001), function(x) list(t=phy.extend, branch.thresh=x, boot.thresh=0.95))
cluster.sets <- multi.cluster(step.cluster, param.list)
# configure Poisson regression models
p.models = list(
"NullModel" = function(x){
glm(Growth~Size, data=x, family="poisson")
},
"TimeModel" = function(x){
glm(Growth~Size+coldate, data=x, family="poisson")
}
)
p.trans = list( # average sample collection dates across nodes in each cluster
"coldate" = function(x){mean(x)}
)
res <- fit.analysis(cluster.sets, predictive.models=p.models,
predictor.transformations=p.trans)
AICs <- get.AIC(res)
delta.AIC <- AICs$TimeModelAIC - AICs$NullModelAIC
We can visualize the difference in AICs between models as a function of the distance threshold:
cutoffs <- sapply(param.list, function(x) x$branch.thresh)
par(mar=c(5,5,1,1))
plot(cutoffs, delta.AIC, type='l', col='cadetblue', lwd=2)
abline(h=0, lty=2)
Explanation
The optimal distance threshold is associated with the lowest value of
delta.AIC
. We expect that adding information on sample collection
dates should improve our ability to predict where the next infections
will occur. However, this improvement will depend on how we have
partitioned the database of known infections into clusters. If every
known infection is merged into a single giant cluster, then there is no
meaningful way to predict where new cases will occur, since there is no
variation for a sample of one cluster. If every infection each becomes a
cluster of one, then there will be excessive information loss due to
random variation in sampling dates. At the threshold that minimizes
delta.AIC
, the known infections are partitioned into clusters in such
a way that minimizes the information loss associated with incorporating
sample dates into the predictive model.
References
This package includes the binaries for pplacer and guppy (https://matsen.fhcrc.org/pplacer, released under the GPLv3 license), which are used to add new tips onto a fixed tree to simulate cluster growth prospectively.
- Matsen FA, Kodner RB, Armbrust EV. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC bioinformatics. 2010 Dec;11(1):1-6.
As an example, this package includes a subset of a larger published
HIV-1 pol sequence data set. These sequences were originally published
in a study by Vrancken et al. (2017) and publicly accessible in the
GenBank database under the PopSet accession 1033910942
.
-
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL. GenBank. Nucleic acids research. 2000 Jan 1;28(1):15-8.
-
Vrancken B, Adachi D, Benedet M, Singh A, Read R, Shafran S, Taylor GD, Simmonds K, Sikora C, Lemey P, Charlton CL. The multi-faceted dynamics of HIV-1 transmission in Northern Alberta: A combined analysis of virus genetic and public health data. Infection, Genetics and Evolution. 2017 Aug 1;52:100-5.