Highlighter
is a web-based tool maintained by the Los Alamos National Laboratory
(LANL) HIV Sequence Database that makes it easy for investigators to
visually scan their sequence alignments for compositional differences.
However, it is inconvenient for users to use Highlighter to process a
large number of alignment files as the tool only accepts a single file
as input. In addition, the LANL website limits users to a maximum of 500
sequences per file, which can be problematic for users working with
alignments derived from next-generation sequencing platforms.
Furthermore, the tool does not visualize the frequencies of different
sequence variants in a dataset.
highlineR is a free, open-source R package that provides users with
accessible functions for batch-processing of sequence alignments,
including NGS data, to generate plots similar to Highlighter. In
addition, highlineR extends the Highlighter plot by varying line
widths to represent variant frequencies in the data (see Usage).
Installation
The simplest method to install highlineR is to use the R devtools
package:
# install.packages("devtools") # if not already installed
require(devtools)
devtools::install_github("PoonLab/highlineR")
Usage
As a basic example of using highlineR
, we’re going to load a number of
anonymized data
sets
sets from a published
study
of HIV-1 diversity within patients:
require(highlineR)
#> Loading required package: highlineR
# use glob to retrieve paths to FASTA files
files <- Sys.glob('~/git/highlineR/inst/extdata/*.fasta')
highline(files[1:2]) # render the first two alignments
Each horizontal grey band represents a sequence variant. The area of
the band is proportional to the number of times that variant occurs in
the alignment. (It may be more intuitive to make height proportional
to variant counts, but this causes problems when a variant occurs 1000
times or more!) By default, highlineR
selects the most common variant
to be the master (reference) sequence and places it at the top of the
plot — it is also labeled with an (m)
.
Each band is decorated with coloured tick marks to indicate the
locations of nucleotide (or amino acid) differences relative to the
master sequence. A colour reference key (legend) is provided at the top
of the plot.
Note that for the above example, the files were loaded from a developer
directory. To load these same files from your installed package as a
user, you’d have to replace the following line:
# files <- Sys.glob('~/git/highlineR/inst/extdata/*.fasta')
files <- Sys.glob(paste0(system.file(package='highlineR'), '/extdata/*.fasta'))
Background
A multiple sequence alignment (MSA) is a hypothesis about how residues
(nucleotides or amino acids) in two or more sequences were derived from
residues in a common ancestral sequence. By convention, each sequence is
arranged horizontally in rows to be read from left to right, and
evolutionarily-related (homologous) residues are arranged vertically in
a column. It is possible for two or more sequences to be exactly
identical. When this occurs, we refer to the shared sequence as the
“sequence variant”, and the number of times this variant appears in
the data as its “count”. (Note there is no established terminology for
these features.) It is common practice to reduce a sequence alignment to
the unique variants, especially when certain variants predominate the
sample population. By doing so, however, we lose information on the
relative abundance of variants.
If an MSA comprises a large number of sequences, it may become difficult
to visualize the composition of the alignment. Highlighter plots were
developed to reduce the information in the MSA by marking the location
of residues that are different from the reference sequence, which has
been selected by the user or the program. We refer to this reference
sequence the master sequence. By default, the highline
function
selects the most common sequence variant to be the master.
The highline
function is actually a high-level wrapper for several
functions in highlineR
. These lower level functions are exposed to the
user, so you have the option of doing more extensive customization than
highline
will allow. Detailed instructions for using (and potentially
modifying) these functions are provided in the
CONTRIBUTING.md document.
In summary, the pipeline for generating a plot in highlineR
are:
-
Initialize a highlineR
session. This creates an R
environment for storing
the sequence alignment data sets.
-
Import one or more sequence alignments from the respective files.
highlineR
supports FASTA, FASTQ and CSV file formats.
-
Parse the raw sequence data from each file.
-
Compress the sequence data so that identical sequences are
represented by a single copy annotated by the number of instances in
the alignment.
-
Calculate the sequence differences from the “master sequence” and
generate the Highlighter style plot.