sup

DBecker7, davidchampredon, ArtPoon, ConnorChato, GopiGugan 2019
GitHub

sup - Sequencing Uncertainty Propagation

Objectives

Propagate sequencing uncertainty in phylogenetic analysis.

Sequencing is a multi-step process which is prone to errors. If the output for the sequencing of a biological sample gives ATTGCTATGC, what is the error probability associated with this result, at each position (for example, what is the probability that the second base is indeed a T)? How can we propagate this uncertainty in downstream phylogenic analysis?

  1. Raw short read files (e.g. SAM)
    • Sequences are read little bits at a time. Each read is recorded and stored in a file (along with its alignment to a reference sequence).
    • The sequence reads include information on the read quality. This is encoded as a Phred score, which uses unicode characters to encode uncertainty about a given base call.
    • The called base is assigned a probability according to the Phred score. The remaining probability is assigned equally to the remaining possible bases.
      • For example, if P(T) = 0.7 from the Phred score, then P(A) = P(C) = P(G) = 0.1.
    • Given many short reads, the probabilities for each base pair at each site are added together.
      • If, say, T is always 100% certain, then the the sum of the probabilities will represent the number of reads.
  2. Consensus FASTQ Files
    • SAM files aren’t always practical (or available), so often FASTQ files are used instead.
    • These contain a Phred score for each base call. This phred score is calculated similarly to the process for SAM files, but only the uncertainty for the most probable base is reported.
  3. Consensus FASTA files
    • Contain no information about sequence uncertainty.
    • However, uncertainty can be generated around FASTA to investigate how sensitive the methods are to the assumption of certainty in the conseq.

Running

Dependencies:

  • sratoolkit
  • pangolin
  • R packages: install.packages(c("here", "dplyr", "ggplot2", "ape", "gtools", "tidyr", "rmarkdown", "patchwork", "scales", "stringr", "xtable"))
# Make required binaries (custom C program by Gopi Gugan)
cd src/parse-sam-c
make
cd ../..

# Download and parse SAM files into uncertainty matrices
Rscript src/downloader.R # Downloads the SAM files
bash src/run_all.sh # runs parse-sam-c on all SAM files

# Sample from the uncertainty matrix, produce reports
bash scripts/sample-S.sh