Treat each bam file as a bulk sample and identify variants against the reference
Usage
find_variants(
bam_path,
reference,
annotation,
min_nucleotide_depth = 100,
homopolymer_window = 3,
annotated_region_only = FALSE,
names_from = "gene_name",
threads = 1
)
Arguments
- bam_path
character(1) or character(n): path to the bam file(s) aligned to the reference genome (NOT the transcriptome!).
- reference
DNAStringSet: the reference genome
- annotation
GRanges: the annotation of the reference genome. You can load a GTF/GFF annotation file with
anno <- rtracklayer::import(file)
.- min_nucleotide_depth
integer(1): minimum read depth for a position to be considered a variant.
- homopolymer_window
integer(1): the window size to calculate the homopolymer percentage. The homopolymer percentage is calculated as the percentage of the most frequent nucleotide in a window of
-homopolymer_window
tohomopolymer_window
nucleotides around the variant position, excluding the variant position itself. Calculation of the homopolymer percentage is skipped whenhomopolymer_window = 0
. This is useful for filtering out Nanopore sequencing errors in homopolymer regions.- annotated_region_only
logical(1): whether to only consider variants outside annotated regions. If
TRUE
, only variants outside annotated regions will be returned. IfFALSE
, all variants will be returned, which could take significantly longer time.- names_from
character(1): the column name in the metadata column of the annotation (
mcols(annotation)[, names_from]
) to use for theregion
column in the output.- threads
integer(1): number of threads to use. Threading is done over each annotated region and (if
annotated_region_only = FALSE
) unannotated gaps for each bam file.
Value
A tibble with columns: seqnames, pos, nucleotide, count, sum, freq, ref, region,
homopolymer_pct, bam_path The homopolymer percentage is calculated as the percentage of the
most frequent nucleotide in a window of homopolymer_window
nucleotides around
the variant position, excluding the variant position itself.
Details
Each bam file is treated as a bulk sample to perform pileup and identify variants.
You can run sc_mutations
with the variants identified with this function
to get single-cell allele counts. Note that reference genome FASTA files may have
the chromosome names field as `>chr1 1` instead of `>chr1`. You may need to remove
the trailing number to match the chromosome names in the bam file, for example with
names(ref) <- sapply(names(ref), function(x) strsplit(x, " ")[[1]][1])
.
Examples
outdir <- tempfile()
dir.create(outdir)
genome_fa <- file.path(outdir, "rps24.fa")
R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE)
download.file("https://raw.githubusercontent.com/mritchielab/FLAMES/devel/tests/testthat/demultiplexed.fq",
destfile = file.path(outdir, "demultipelxed.fq")
) # can't be bothered to run demultiplexing again
if (!any(is.na(sys_which(c("minimap2", "k8"))))) {
minimap2_align( # align to genome
config = jsonlite::fromJSON(system.file("extdata/SIRV_config_default.json", package = "FLAMES")),
fa_file = genome_fa,
fq_in = file.path(outdir, "demultipelxed.fq"),
annot = system.file("extdata/rps24.gtf.gz", package = "FLAMES"),
outdir = outdir
)
variants <- find_variants(
bam_path = file.path(outdir, "align2genome.bam"),
reference = genome_fa,
annotation = GenomicRanges::GRanges("chr14", IRanges::IRanges(1, 1)),
min_nucleotide_depth = 10
)
head(variants)
}