Treat each bam file as a bulk sample and identify variants against the reference
Usage
find_variants(
bam_path,
reference,
annotation,
min_nucleotide_depth = 100,
homopolymer_window = 3,
annotated_region_only = FALSE,
names_from = "gene_name",
threads = 1
)
Arguments
- bam_path
character(1) or character(n): path to the bam file(s) aligned to the reference genome (NOT the transcriptome!).
- reference
DNAStringSet: the reference genome
- annotation
GRanges: the annotation of the reference genome. You can load a GTF/GFF annotation file with
anno <- rtracklayer::import(file)
.- min_nucleotide_depth
integer(1): minimum read depth for a position to be considered a variant.
- homopolymer_window
integer(1): the window size to calculate the homopolymer percentage. The homopolymer percentage is calculated as the percentage of the most frequent nucleotide in a window of
-homopolymer_window
tohomopolymer_window
nucleotides around the variant position, excluding the variant position itself. Calculation of the homopolymer percentage is skipped whenhomopolymer_window = 0
. This is useful for filtering out Nanopore sequencing errors in homopolymer regions.- annotated_region_only
logical(1): whether to only consider variants outside annotated regions. If
TRUE
, only variants outside annotated regions will be returned. IfFALSE
, all variants will be returned, which could take significantly longer time.- names_from
character(1): the column name in the metadata column of the annotation (
mcols(annotation)[, names_from]
) to use for theregion
column in the output.- threads
integer(1): number of threads to use. Threading is done over each annotated region and (if
annotated_region_only = FALSE
) unannotated gaps for each bam file.
Value
A tibble with columns: seqnames, pos, nucleotide, count, sum, freq, ref, region,
homopolymer_pct, bam_path The homopolymer percentage is calculated as the percentage of the
most frequent nucleotide in a window of homopolymer_window
nucleotides around
the variant position, excluding the variant position itself.
Details
Each bam file is treated as a bulk sample to perform pileup and identify variants.
You can run sc_mutations
with the variants identified with this function
to get single-cell allele counts. Note that reference genome FASTA files may have
the chromosome names field as `>chr1 1` instead of `>chr1`. You may need to remove
the trailing number to match the chromosome names in the bam file, for example with
names(ref) <- sapply(names(ref), function(x) strsplit(x, " ")[[1]][1])
.
Examples
outdir <- tempfile()
dir.create(outdir)
genome_fa <- system.file("extdata", "rps24.fa.gz", package = "FLAMES")
minimap2_align( # align to genome
config = jsonlite::fromJSON(
system.file("extdata", "config_sclr_nanopore_3end.json", package = "FLAMES")),
fa_file = genome_fa,
fq_in = system.file("extdata", "fastq", "demultiplexed.fq.gz", package = "FLAMES"),
annot = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"),
outdir = outdir
)
#> 02:13:22 AM Fri Oct 25 2024 minimap2_align
#> total mapped primary
#> /tmp/RtmpoS7Kzz/file1f6d16a7deed/align2genome.bam 10 10 10
#> secondary
#> /tmp/RtmpoS7Kzz/file1f6d16a7deed/align2genome.bam 0
variants <- find_variants(
bam_path = file.path(outdir, "align2genome.bam"),
reference = genome_fa,
annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"),
min_nucleotide_depth = 4
)
#> 02:13:23 Reading reference ...
#> 02:13:23 Reading annotation ...
#> 02:13:23 Adding unannotated gaps ...
#> 02:13:23 Got 1 bam file, parallelizing over each region ...
#>
|
| | 0%
|
|======================= | 33%
|
|=============================================== | 67%
|
|======================================================================| 100%
#>
#> 02:13:25 Merging results ...
#> 02:13:25 Calculating homopolymer percentages ...
#>
|
| | 0%
|
|=== | 4%
|
|====== | 8%
|
|========= | 12%
|
|============ | 17%
|
|=============== | 21%
|
|================== | 25%
|
|==================== | 29%
|
|======================= | 33%
|
|========================== | 38%
|
|============================= | 42%
|
|================================ | 46%
|
|=================================== | 50%
|
|====================================== | 54%
|
|========================================= | 58%
|
|============================================ | 62%
|
|=============================================== | 67%
|
|================================================== | 71%
|
|==================================================== | 75%
|
|======================================================= | 79%
|
|========================================================== | 83%
|
|============================================================= | 88%
|
|================================================================ | 92%
|
|=================================================================== | 96%
|
|======================================================================| 100%
#>
head(variants)
#> # A tibble: 6 × 10
#> seqnames pos nucleotide count sum freq ref bam_path region
#> <fct> <dbl> <fct> <int> <dbl> <dbl> <fct> <chr> <chr>
#> 1 chr14 1084 G 4 9 0.444 A /tmp/RtmpoS7Kzz/file… Rps24
#> 2 chr14 1085 C 4 10 0.4 T /tmp/RtmpoS7Kzz/file… Rps24
#> 3 chr14 1217 A 4 10 0.4 G /tmp/RtmpoS7Kzz/file… Rps24
#> 4 chr14 1384 - 4 10 0.4 A /tmp/RtmpoS7Kzz/file… Rps24
#> 5 chr14 2724 + 4 9 0.444 G /tmp/RtmpoS7Kzz/file… Rps24
#> 6 chr14 2789 - 5 10 0.5 A /tmp/RtmpoS7Kzz/file… Rps24
#> # ℹ 1 more variable: homopolymer_pct <dbl>