Treat each bam file as a bulk sample and identify variants against the reference
Usage
find_variants(
bam_path,
reference,
annotation,
min_nucleotide_depth = 100,
homopolymer_window = 3,
annotated_region_only = FALSE,
names_from = "gene_name",
threads = 1
)
Arguments
- bam_path
character(1) or character(n): path to the bam file(s) aligned to the reference genome (NOT the transcriptome!).
- reference
DNAStringSet: the reference genome
- annotation
GRanges: the annotation of the reference genome. You can load a GTF/GFF annotation file with
anno <- rtracklayer::import(file)
.- min_nucleotide_depth
integer(1): minimum read depth for a position to be considered a variant.
- homopolymer_window
integer(1): the window size to calculate the homopolymer percentage. The homopolymer percentage is calculated as the percentage of the most frequent nucleotide in a window of
-homopolymer_window
tohomopolymer_window
nucleotides around the variant position, excluding the variant position itself. Calculation of the homopolymer percentage is skipped whenhomopolymer_window = 0
. This is useful for filtering out Nanopore sequencing errors in homopolymer regions.- annotated_region_only
logical(1): whether to only consider variants outside annotated regions. If
TRUE
, only variants outside annotated regions will be returned. IfFALSE
, all variants will be returned, which could take significantly longer time.- names_from
character(1): the column name in the metadata column of the annotation (
mcols(annotation)[, names_from]
) to use for theregion
column in the output.- threads
integer(1): number of threads to use. Threading is done over each annotated region and (if
annotated_region_only = FALSE
) unannotated gaps for each bam file.
Value
A tibble with columns: seqnames, pos, nucleotide, count, sum, freq, ref, region,
homopolymer_pct, bam_path The homopolymer percentage is calculated as the percentage of the
most frequent nucleotide in a window of homopolymer_window
nucleotides around
the variant position, excluding the variant position itself.
Details
Each bam file is treated as a bulk sample to perform pileup and identify variants.
You can run sc_mutations
with the variants identified with this function
to get single-cell allele counts. Note that reference genome FASTA files may have
the chromosome names field as `>chr1 1` instead of `>chr1`. You may need to remove
the trailing number to match the chromosome names in the bam file, for example with
names(ref) <- sapply(names(ref), function(x) strsplit(x, " ")[[1]][1])
.
Examples
ppl <- example_pipeline("SingleCellPipeline")
#> Writing configuration parameters to: /tmp/RtmpgpEV0i/file25bf2842cb6e/config_file_9663.json
#> Configured steps:
#> barcode_demultiplex: TRUE
#> genome_alignment: TRUE
#> gene_quantification: TRUE
#> isoform_identification: TRUE
#> read_realignment: TRUE
#> transcript_quantification: TRUE
#> samtools not found, will use Rsamtools package instead
ppl <- run_step(ppl, "genome_alignment")
#> Running step: genome_alignment
#> Creating junction bed file from GFF3 annotation.
#> Aligning sample /__w/_temp/Library/FLAMES/extdata/fastq/musc_rps24.fastq.gz -> /tmp/RtmpgpEV0i/file25bf2842cb6e/align2genome.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 8 threads...
#> Indexing bam files
variants <- find_variants(
bam_path = ppl@genome_bam,
reference = ppl@genome_fa,
annotation = ppl@annotation,
min_nucleotide_depth = 4
)
#> 03:22:37 Reading reference ...
#> 03:22:37 Reading annotation ...
#> Merging overlapping genes ...
#> 0 overlapping regions, their gene_name(s) are merged with `, ` as separator
#> 03:22:37 Adding unannotated gaps ...
#> 03:22:37 Got 1 bam file, parallelizing over each region ...
#>
|
| | 0%
|
|======================================================================| 100%
#>
#> 03:22:42 Merging results ...
#> 03:22:42 Calculating homopolymer percentages ...
#>
|
| | 0%
|
| | 1%
|
|= | 1%
|
|= | 2%
|
|== | 2%
|
|== | 3%
|
|== | 4%
|
|=== | 4%
|
|=== | 5%
|
|==== | 5%
|
|==== | 6%
|
|===== | 6%
|
|===== | 7%
|
|===== | 8%
|
|====== | 8%
|
|====== | 9%
|
|======= | 9%
|
|======= | 10%
|
|======= | 11%
|
|======== | 11%
|
|======== | 12%
|
|========= | 12%
|
|========= | 13%
|
|========= | 14%
|
|========== | 14%
|
|========== | 15%
|
|=========== | 15%
|
|=========== | 16%
|
|============ | 16%
|
|============ | 17%
|
|============ | 18%
|
|============= | 18%
|
|============= | 19%
|
|============== | 19%
|
|============== | 20%
|
|============== | 21%
|
|=============== | 21%
|
|=============== | 22%
|
|================ | 22%
|
|================ | 23%
|
|================ | 24%
|
|================= | 24%
|
|================= | 25%
|
|================== | 25%
|
|================== | 26%
|
|=================== | 26%
|
|=================== | 27%
|
|=================== | 28%
|
|==================== | 28%
|
|==================== | 29%
|
|===================== | 29%
|
|===================== | 30%
|
|===================== | 31%
|
|====================== | 31%
|
|====================== | 32%
|
|======================= | 32%
|
|======================= | 33%
|
|======================= | 34%
|
|======================== | 34%
|
|======================== | 35%
|
|========================= | 35%
|
|========================= | 36%
|
|========================== | 36%
|
|========================== | 37%
|
|========================== | 38%
|
|=========================== | 38%
|
|=========================== | 39%
|
|============================ | 39%
|
|============================ | 40%
|
|============================ | 41%
|
|============================= | 41%
|
|============================= | 42%
|
|============================== | 42%
|
|============================== | 43%
|
|============================== | 44%
|
|=============================== | 44%
|
|=============================== | 45%
|
|================================ | 45%
|
|================================ | 46%
|
|================================= | 46%
|
|================================= | 47%
|
|================================= | 48%
|
|================================== | 48%
|
|================================== | 49%
|
|=================================== | 49%
|
|=================================== | 50%
|
|=================================== | 51%
|
|==================================== | 51%
|
|==================================== | 52%
|
|===================================== | 52%
|
|===================================== | 53%
|
|===================================== | 54%
|
|====================================== | 54%
|
|====================================== | 55%
|
|======================================= | 55%
|
|======================================= | 56%
|
|======================================== | 56%
|
|======================================== | 57%
|
|======================================== | 58%
|
|========================================= | 58%
|
|========================================= | 59%
|
|========================================== | 59%
|
|========================================== | 60%
|
|========================================== | 61%
|
|=========================================== | 61%
|
|=========================================== | 62%
|
|============================================ | 62%
|
|============================================ | 63%
|
|============================================ | 64%
|
|============================================= | 64%
|
|============================================= | 65%
|
|============================================== | 65%
|
|============================================== | 66%
|
|=============================================== | 66%
|
|=============================================== | 67%
|
|=============================================== | 68%
|
|================================================ | 68%
|
|================================================ | 69%
|
|================================================= | 69%
|
|================================================= | 70%
|
|================================================= | 71%
|
|================================================== | 71%
|
|================================================== | 72%
|
|=================================================== | 72%
|
|=================================================== | 73%
|
|=================================================== | 74%
|
|==================================================== | 74%
|
|==================================================== | 75%
|
|===================================================== | 75%
|
|===================================================== | 76%
|
|====================================================== | 76%
|
|====================================================== | 77%
|
|====================================================== | 78%
|
|======================================================= | 78%
|
|======================================================= | 79%
|
|======================================================== | 79%
|
|======================================================== | 80%
|
|======================================================== | 81%
|
|========================================================= | 81%
|
|========================================================= | 82%
|
|========================================================== | 82%
|
|========================================================== | 83%
|
|========================================================== | 84%
|
|=========================================================== | 84%
|
|=========================================================== | 85%
|
|============================================================ | 85%
|
|============================================================ | 86%
|
|============================================================= | 86%
|
|============================================================= | 87%
|
|============================================================= | 88%
|
|============================================================== | 88%
|
|============================================================== | 89%
|
|=============================================================== | 89%
|
|=============================================================== | 90%
|
|=============================================================== | 91%
|
|================================================================ | 91%
|
|================================================================ | 92%
|
|================================================================= | 92%
|
|================================================================= | 93%
|
|================================================================= | 94%
|
|================================================================== | 94%
|
|================================================================== | 95%
|
|=================================================================== | 95%
|
|=================================================================== | 96%
|
|==================================================================== | 96%
|
|==================================================================== | 97%
|
|==================================================================== | 98%
|
|===================================================================== | 98%
|
|===================================================================== | 99%
|
|======================================================================| 99%
|
|======================================================================| 100%
#>
head(variants)
#> # A tibble: 6 × 10
#> seqnames pos nucleotide count sum freq ref bam_path region
#> <fct> <dbl> <fct> <int> <dbl> <dbl> <fct> <chr> <chr>
#> 1 chr14 8 G 4 30 0.133 A /tmp/RtmpgpEV0i/fil… Rps24
#> 2 chr14 11 A 4 36 0.111 G /tmp/RtmpgpEV0i/fil… Rps24
#> 3 chr14 15 A 4 37 0.108 G /tmp/RtmpgpEV0i/fil… Rps24
#> 4 chr14 22 - 12 227 0.0529 T /tmp/RtmpgpEV0i/fil… Rps24
#> 5 chr14 26 + 6 302 0.0199 C /tmp/RtmpgpEV0i/fil… Rps24
#> 6 chr14 26 - 8 302 0.0265 C /tmp/RtmpgpEV0i/fil… Rps24
#> # ℹ 1 more variable: homopolymer_pct <dbl>