Skip to contents

Treat each bam file as a bulk sample and identify variants against the reference

Usage

find_variants(
  bam_path,
  reference,
  annotation,
  min_nucleotide_depth = 100,
  homopolymer_window = 3,
  annotated_region_only = FALSE,
  names_from = "gene_name",
  threads = 1
)

Arguments

bam_path

character(1) or character(n): path to the bam file(s) aligned to the reference genome (NOT the transcriptome!).

reference

DNAStringSet: the reference genome

annotation

GRanges: the annotation of the reference genome. You can load a GTF/GFF annotation file with anno <- rtracklayer::import(file).

min_nucleotide_depth

integer(1): minimum read depth for a position to be considered a variant.

homopolymer_window

integer(1): the window size to calculate the homopolymer percentage. The homopolymer percentage is calculated as the percentage of the most frequent nucleotide in a window of -homopolymer_window to homopolymer_window nucleotides around the variant position, excluding the variant position itself. Calculation of the homopolymer percentage is skipped when homopolymer_window = 0. This is useful for filtering out Nanopore sequencing errors in homopolymer regions.

annotated_region_only

logical(1): whether to only consider variants outside annotated regions. If TRUE, only variants outside annotated regions will be returned. If FALSE, all variants will be returned, which could take significantly longer time.

names_from

character(1): the column name in the metadata column of the annotation (mcols(annotation)[, names_from]) to use for the region column in the output.

threads

integer(1): number of threads to use. Threading is done over each annotated region and (if annotated_region_only = FALSE) unannotated gaps for each bam file.

Value

A tibble with columns: seqnames, pos, nucleotide, count, sum, freq, ref, region, homopolymer_pct, bam_path The homopolymer percentage is calculated as the percentage of the most frequent nucleotide in a window of homopolymer_window nucleotides around the variant position, excluding the variant position itself.

Details

Each bam file is treated as a bulk sample to perform pileup and identify variants. You can run sc_mutations with the variants identified with this function to get single-cell allele counts. Note that reference genome FASTA files may have the chromosome names field as `>chr1 1` instead of `>chr1`. You may need to remove the trailing number to match the chromosome names in the bam file, for example with names(ref) <- sapply(names(ref), function(x) strsplit(x, " ")[[1]][1]).

Examples

outdir <- tempfile()
dir.create(outdir)
genome_fa <- file.path(outdir, "rps24.fa")
R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE)
download.file("https://raw.githubusercontent.com/mritchielab/FLAMES/devel/tests/testthat/demultiplexed.fq",
  destfile = file.path(outdir, "demultipelxed.fq")
) # can't be bothered to run demultiplexing again
if (!any(is.na(sys_which(c("minimap2", "k8"))))) {
  minimap2_align( # align to genome
    config = jsonlite::fromJSON(system.file("extdata/SIRV_config_default.json", package = "FLAMES")),
    fa_file = genome_fa,
    fq_in = file.path(outdir, "demultipelxed.fq"),
    annot = system.file("extdata/rps24.gtf.gz", package = "FLAMES"),
    outdir = outdir
  )
  variants <- find_variants(
    bam_path = file.path(outdir, "align2genome.bam"),
    reference = genome_fa,
    annotation = GenomicRanges::GRanges("chr14", IRanges::IRanges(1, 1)),
    min_nucleotide_depth = 10
  )
  head(variants)
}