gromstole | Medical Bioinformatics @Western

Gromstole is collection of Python and R scripts to estimate the relative frequencies of different SARS-CoV-2 lineages, including specific “variants of concern”, from next-generation sequence data (FASTQ) derived from wastewater samples. This includes:

a wrapper script (minimap2.py) to rapidly stream output from the reference mapping program minimap to extract coverage and mutation frequency statistics for each sample;
an R script (estimate-freqs.R) that uses quasibinomial regression to estimate the frequency of a variant of concern from the frequencies of mutations based on the associated constellation file
a second R script (make-barplots.R) for visualizing these variant frequency estimates across a set of samples as a barplot

Dependencies

cutadapt 1.18+
minimap2 version 2.17+
Python version 3.6+
R

Usage

The following summarizes a workflow based on a pair of FASTQ files associated with a study of wastewater by the Nevada State Public Health Laboratory. These data can be obtained from the NCBI Sequence Read Archive using the command line tool fasterq-dump that is distributed with the sra-tools package.

The consellations submodule needs to be initialized prior to running the scripts for the first time.

git submodule init; git submodule update

Prior to running the scripts, the submodule should be updated to use the latest constellation JSON files.

git submodule foreach git pull origin main

art@Langley:~/git/gromstole$ ~/src/sratoolkit.2.11.3-ubuntu64/bin/fasterq-dump -p SRR17724299
join   :|-------------------------------------------------- 100%   
concat :|-------------------------------------------------- 100%   
spots read      : 614,374
reads read      : 1,228,748
reads written   : 1,228,748
art@Langley:~/git/gromstole$ python3 scripts/minimap2.py SRR17724299_1.fastq SRR17724299_2.fastq -o testrun
Defaulting output prefix stem to SRR17724299_1.fastq
50000.0 reads, 46769.0 (94%) mapped
100000.0 reads, 93735.0 (94%) mapped
150000.0 reads, 140548.0 (94%) mapped
200000.0 reads, 187560.0 (94%) mapped
250000.0 reads, 234509.5 (94%) mapped
300000.0 reads, 281501.0 (94%) mapped
350000.0 reads, 328490.0 (94%) mapped
400000.0 reads, 375566.0 (94%) mapped
450000.0 reads, 422431.5 (94%) mapped
500000.0 reads, 469568.5 (94%) mapped
550000.0 reads, 516504.5 (94%) mapped
600000.0 reads, 563587.5 (94%) mapped
art@Langley:~/git/gromstole$ Rscript scripts/estimate-freqs.R testrun constellations/constellations/definitions/cB.1.617.2.json \
testrun/SRR17724299_1.delta.json 
Loading required package: lubridate

Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union

Loading required package: jsonlite
Loading required package: seqinr
Loading required package: here
here() starts at /home/art/git/gromstole

This yields the following JSON file:

{
  "results": [
    {
      "sample": "SRR17724299_1",
      "mutation": "S:T19R",
      "nucleotide": "C21618G",
      "count": 643,
      "coverage": 653
    },
    {
      "sample": "SRR17724299_1",
      "mutation": "S:P681R",
      "nucleotide": "C23604G",
      "count": 2010,
      "coverage": 2025
    },
    {
      "sample": "SRR17724299_1",
      "mutation": "S:D950N",
      "nucleotide": "G24410A",
      "count": 1629,
      "coverage": 2078
    },
    {
      "sample": "SRR17724299_1",
      "mutation": "ORF3a:S26L",
      "nucleotide": "C25469T",
      "count": 1539,
      "coverage": 1556
    },
    {
      "sample": "SRR17724299_1",
      "mutation": "M:I82T",
      "nucleotide": "T26767C",
      "count": 1915,
      "coverage": 1922
    },
    {
      "sample": "SRR17724299_1",
      "mutation": "ORF7a:V82A",
      "nucleotide": "T27638C",
      "count": 501,
      "coverage": 502
    },
    {
      "sample": "SRR17724299_1",
      "mutation": "ORF7a:T120I",
      "nucleotide": "C27752T",
      "count": 563,
      "coverage": 567
    },
    {
      "sample": "SRR17724299_1",
      "mutation": "N:D63G",
      "nucleotide": "A28461G",
      "count": 2056,
      "coverage": 2078
    },
    {
      "sample": "SRR17724299_1",
      "mutation": "N:R203M",
      "nucleotide": "G28881T",
      "count": 845,
      "coverage": 852
    },
    {
      "sample": "SRR17724299_1",
      "mutation": "N:D377Y",
      "nucleotide": "G29402T",
      "count": 646,
      "coverage": 655
    }
  ],
  "metadata": [
    {
      "sample": "SRR17724299_1",
      "lab": "gromstole"
    }
  ],
  "estimate": [
    {
      "est": 0.9576,
      "lower.95": 0.8888,
      "upper.95": 0.9903,
      "_row": "SRR17724299_1"
    }
  ],
  "lineage": ["B.1.617.2"],
  "run.dir": ["testrun"]
}

Description

Our current modelling strategy consists of a quasibinomial GLM for the count and coverage of the mutations that define a VOC, which are parsed from constellation JSON files curated by cov-lineages.

The script minimap2.py currently accepts paired-end reads in separate FASTQ files and outputs [prefix]-mapped.csv and [prefix]-coverage.csv into the specified output directory. It uses data/NC043312.fa as a reference genome, but an alternative FASTA file can be specified by the user. By default, it uses cutadapt to trim adapter sequences from the data.

Use python3 scripts/minimap2.py -h to see all of the options.

python3 scripts/minimap2.py r1.fastq r2.fastq --outdir results --prefix name-of-sample

Given a directory containing one or more subdirectories that each contain paired-end Illumina FASTQ fiiles, the following R script runs the binomial regression and outputs a json file that contains all relevant information:

the counts of each mutation of the lineages/ file
the coverage at every position on the reference genome,
the metadata that was used as input (optional),
the estimate of the proportion (including 95% confidence interval),
the lineage name, and
the name of the input directory.

Rscript scripts/estimate-freqs.R results/name-of-sample constellations/constellations/definitions/cBA.1.csv results/outfile.json path/to/metadata.csv