bayroot

ArtPoon, RouxCil 2021

bayroot: a Bayesian approach to root-to-tip regression

bayroot is a simple R package that provides functions for a Bayesian approach to root-to-tip regression.

Description

Root-to-tip regression (RTT) makes use of the molecular clock assumption that the rate of evolution is roughly constant over time. Starting with a phylogenetic tree relating genetic sequences that have been observed at different points in time, this method regresses the divergence of each sequence from an inferred common ancestor at the root of the tree against the time of sampling. If the molecular clock assumption holds, then we expect to see a linear increase in divergence over time. The standard approach to fitting this linear model is to minimize a cost function such as least squares or R-squared (Rambaut et al 2016).

A more complex approach to this problem is to use Bayesian methods to sample trees from the posterior distribution defined by the sequences, sampling dates, model of evolution, and prior belief. However, there are applications where a full Bayesian treatment may be difficult to compute, such as for large numbers of sequences, or when many sequences have unknown sampling dates to be estimated. Therefore we developed bayroot to provide a solution that is intermediate between these two extremes.

Dependencies

bayroot was developed in R version 3.6.2. It requires the following packages:

  • phytools for the reroot function
  • msm for the truncated normal distribution function (rtnorm)
  • ggfree for analyzing and plotting trees
  • chemCal for inverse.predict function
  • twt (optional, for simulating test data)

Usage

Fitting a root-to-tip regression

# load example tree (101 influenza HA sequences from 2001-2006) from Russell et al. 2008)
phy <- read.tree("data/h3n2.nwk")

# first show root-to-tip regression using ape::rtt
tip.dates <- get.dates(phy, format="%d-%b-%Y")
rooted <- rtt(phy, as.integer(tip.dates))
div <- node.depth.edgelength(rooted)[1:Ntip(rooted)]
plot(tip.dates, div, xlim=c(as.Date("1999-01-01"), max(tip.dates)), 
     ylim=c(0, max(div)))
abline(lm(div ~ tip.dates))

# now carry out Bayesian sampling, using the RTT tree to initialize the chain
settings <- list(seq.len=987, format="%d-%b-%Y", 
                 mindate=as.Date("1995-01-01"), maxdate=as.Date("2000-01-01"),
                 meanlog=-5, sdlog=2,
                 root.delta=0.03, date.sd=60, rate.delta=0.002)
params <- list(phy=rooted, rate=0.01, origin=as.Date("1995-01-01"))
res <- bayroot(nstep=1e5, skip=100, params=params, settings=settings, echo=T)  # about 8 minutes
plot(res, burnin=100)  # calls a generic S3 method

# since we can't make a trace log of trees, bayroot provides a plotting option for viewing a specific step in the chain sample
plot(res, settings, step=1000)

# another way of looking at the MCMC results
plot(tip.dates, div, xlim=c(as.Date("1999-01-01"), max(tip.dates)), 
      ylim=c(0, max(div)), type='n')
abline(fit)
for (i in seq(100, nrow(res$log), length.out=100)) {
  row <- res$log[i,]
  clock <- row$rate / settings$seq.len
  segments(
    x0=row$origin, x1=max(tip.dates), 
    y0=0, y1=clock*as.integer(max(tip.dates)-row$origin),
    col=rgb(0,0,1,0.2)
  )
}
points(tip.dates, div, pch=21, cex=1, bg='white')

Estimating unknown tip dates

bayroot was actually developed for our study of the latent HIV reservoir. HIV converts its RNA genome into DNA and then integrates that DNA into the host genome. In a small number of host cells, that HIV DNA can remain dormant for years until the cell becomes “reactivated” and starts to produce new virus copies. We are interested in reconstructing when copies of HIV in this latent reservoir became integrated.

Since the HIV DNA basically stops evolving upon integration, this problem is equivalent to imputing missing dates on some tips of a time-scaled tree. A Bayesian approach is well-suited to this problem! We implemented a compartmental model in the R package twt to simulate trees relating a mixture of actively evolving HIV RNA and integrated HIV DNA. Let’s reconstruct the integration dates for one of these simulations.

First, we’ll load the bayroot functions and a couple of packages, and then import the simulated data. We wrote some utility functions that were designed to operate on files containing only one tree and integration times for one replicate, but we didn’t want to upload hundreds of these files to the repository. Instead we’re going to extract one replicate and generate the expected file formats:

source("bayroot.R")
require(ape)
require(lubridate)

# 50 trees generated from a simulation of cell dynamics
trees <- read.tree("data/testdata-latent2.nwk")
times <- read.csv("data/testdata-latent2.csv")

# export one of the replicates to new files
rep <- 1
tf <- "data/example.nwk"
write.tree(trees[[rep]], file=tf)

cf <- "data/example.csv"
int.times <- times$int.time[times$rep==rep]
names(int.times) <- times$sample[times$rep==rep]
write.csv(as.data.frame(int.times), file=cf)

Next, we’ll configure our analysis:

phy <- read.tree(tf)
settings <- list(
  seq.len=1233,  # AY772699
  format="%Y-%m-%d",
  # we arbitrarily assigned 2000-01-01 as the actual origin date
  mindate=as.Date("1999-12-01"),  # one month before origin
  maxdate=as.Date("2000-02-01"),  # one month after origin
  meanlog=-5, sdlog=2,  # hyperparameters for lognormal prior on rate
  root.delta=0.01,  # proposal on root location
  date.sd=10,  # days, origin proposal
  rate.delta=0.01,  # proposal on clock rate
  censored = phy$tip.label[grepl("^Latent", phy$tip.label)]
)

This utility function will initialize the model parameters and then run a chain sample for 20,000 steps, only keeping every 20th step. If you want to see the chain being updated in real time, you can add an argument echo=TRUE - note this will generate a lot of console output! Otherwise just wait a couple of minutes for the chain sample to complete.

source("scripts/validate.R")  # loads package chemCal
set.seed(1)  # make this reproducible
res <- fit.bayroot(tf, cf, settings=settings, nstep=2e4, skip=20, 
                   max.date=as.Date("2001-04-01"))

Note we set the maximum integration date to the above date because ART was initiated 15 months post-infection in this set of simulations.

Here I’m displaying the contents of the list returned from fit.bayroot:

> summary(res)
           Length Class   Mode   
phy         5     phylo   list   
pred.dates 10     -none-  list   
tip.dates  40     Date    numeric
censored   40     Date    numeric
chain       2     bayroot list   
  • phy is just the original tree
  • pred.dates is a list containing data frames of integration date estimates for every censored tip
  • tip.dates is a vector of sampling times associated with each tip as Date objects, including censored tips
  • censored replaces Date values in tip.dates with missing values (NA) for censored tips
  • chain is a custom S3 object of class bayroot that contains our chain samples of model parameters and rooted trees from the posterior distribution

If we call the generic plot method on the bayroot object, R will display a composite set of plots summarizing the posterior traces of our chain sample. We’ll discard the first 2,000 steps as burnin:

plot(res$chain, burnin=100)

Let’s compare our posterior sample of integration dates to standard root-to-tip regression using a couple of utility functions that we’ve provided in scripts/validate.R:

rt <- root2tip(phy)
true.vals <- get.true.values(rt, cf)
est <- get.estimates(res, rt)

and generate a plot summarizing this comparison:

par(mar=c(5,5,1,1))
plot(rt, true.vals=true.vals, xlab="Time since infection",
     ylab="Divergence", xlim=c(0, 20), ylim=c(0, max(rt$div)),
     las=1, cex.axis=0.8)
points(est$est, est$div, pch=19, col=rgb(0,0,1,0.5), cex=0.8)
segments(x0=est$lo95, x1=est$hi95, y0=est$div, col=rgb(0,0,1,0.3), lwd=5)
abline(v=15, lty=2)
legend(x=0, y=0.035, legend=c("RTT", "bayroot", "true date"), cex=0.8,
       col=c('red', 'blue', 'black'), pch=c(19, 19, 3), pt.lwd=2, bty='n')

Note the open circles represent RNA sequences used to calibrate the molecular clock.

References

  • Rambaut, A., Lam, T. T., Max Carvalho, L., & Pybus, O. G. (2016). Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen). Virus evolution, 2(1), vew007.
  • Russell, C. A., Jones, T. C., Barr, I. G., Cox, N. J., Garten, R. J., Gregory, V., … & Smith, D. J. (2008). The global circulation of seasonal influenza A (H3N2) viruses. Science, 320(5874), 340-346.