Calculate the per-gene UMI count matrix by parsing the genome alignment file.
Usage
quantify_gene(
annotation,
outdir,
infq,
n_process,
pipeline = "sc_single_sample",
samples = NULL
)
Arguments
- annotation
The file path to the annotation file in GFF3 format.
- outdir
The path to directory to store all output files.
- infq
The input FASTQ file.
- n_process
The number of processes to use for parallelization.
- pipeline
The pipeline type as a character string, either
sc_single_sample
(single-cell, single-sample),
bulk
(bulk, single or multi-sample), or
sc_multi_sample
(single-cell, multiple samples).
- samples
A vector of sample names, default to the file names of input fastq files, or folder names if
fastqs
is a vector of folders.
Details
After the genome alignment step (do_genome_align), the alignment file is parsed to
generate the per-gene UMI count matrix. For each gene in the annotation file, reads whose
mapped ranges overlap the gene's genome coordinates are assigned to that gene. If a read
can be assigned to multiple genes, it is assigned to the gene with the highest number of
overlapping nucleotides. If a read overlaps multiple genes with the same number of
overlapping nucleotides, the read is not assigned.
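The assignment rule described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper name `assign_read_to_gene`, the `genes` data frame layout (columns `gene_id`, `start`, `end`), and the choice to ignore strand and spliced alignments are all assumptions made for the example.

```r
# Hypothetical helper (not part of the package): assign one read to the gene
# with the largest overlap, or NA if there is no overlap or the best is tied.
# Assumes 1-based inclusive coordinates on a single chromosome, strand ignored.
assign_read_to_gene <- function(read_start, read_end, genes) {
  # Number of overlapping nucleotides between the read and each gene
  overlap <- pmin(read_end, genes$end) - pmax(read_start, genes$start) + 1L
  overlap[overlap < 0] <- 0

  best <- max(overlap)
  if (best == 0) {
    return(NA_character_)          # read overlaps no gene
  }
  winners <- genes$gene_id[overlap == best]
  if (length(winners) > 1) {
    return(NA_character_)          # tie on overlap length: leave unassigned
  }
  winners
}

genes <- data.frame(
  gene_id = c("geneA", "geneB", "geneC"),
  start   = c(100L, 150L, 400L),
  end     = c(200L, 300L, 500L),
  stringsAsFactors = FALSE
)

unique_hit <- assign_read_to_gene(120L, 180L, genes)  # overlaps geneA more than geneB
tied_hit   <- assign_read_to_gene(140L, 210L, genes)  # equal overlap with geneA and geneB
no_hit     <- assign_read_to_gene(600L, 700L, genes)  # overlaps nothing
```

In this toy annotation the first read goes to `geneA`, while the second and third reads stay unassigned.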
After the read-to-gene assignment, the per-gene UMI count matrix is generated. Specifically, for each gene, reads with similar mapping coordinates of transcript termination sites (TTS, i.e. the end of the read with a polyT or polyA) are grouped together. UMIs of reads in the same group are collapsed to generate the UMI counts for each gene.
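One possible reading of the grouping and collapsing step, for the reads of a single gene, is sketched below. The function name `count_umis_for_gene`, the `max_dist` tolerance used to decide that two TTS coordinates are "similar", and the use of exact UMI matching are assumptions for illustration; the text does not state the actual thresholds or whether UMI mismatches are tolerated.

```r
# Hypothetical sketch: count UMIs for one gene by grouping reads on TTS position.
# tts:      integer vector of TTS coordinates, one per read
# umi:      character vector of UMI sequences, one per read
# max_dist: assumed gap (in bp) above which a new TTS group starts
count_umis_for_gene <- function(tts, umi, max_dist = 10L) {
  if (length(tts) == 0) {
    return(0L)
  }
  ord <- order(tts)
  tts <- tts[ord]
  umi <- umi[ord]

  # Start a new group whenever the gap to the previous sorted TTS exceeds max_dist
  group <- cumsum(c(TRUE, diff(tts) > max_dist))

  # Collapse identical UMIs within each group, then count what remains
  nrow(unique(data.frame(group = group, umi = umi, stringsAsFactors = FALSE)))
}

tts <- c(1000L, 1003L, 1005L, 2000L, 2002L)
umi <- c("AAAC", "AAAC", "GGTT", "AAAC", "AAAC")
n_umi <- count_umis_for_gene(tts, umi)
```

Here the first three reads form one TTS group and the last two form another, so `AAAC` is counted once in each group and `GGTT` once, giving a count of 3 under these assumptions.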
Finally, a new FASTQ file of deduplicated reads is written, keeping the longest read for each UMI.
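The deduplication rule can be illustrated as follows. This is a sketch only: the helper `keep_longest_per_umi` and the `reads` data frame with `umi` and `seq` columns are hypothetical, reads are keyed on the UMI alone for simplicity, and no FASTQ file is read or written.

```r
# Hypothetical sketch: keep only the longest read for each UMI.
# reads: data.frame with character columns `umi` and `seq`
keep_longest_per_umi <- function(reads) {
  read_len <- nchar(reads$seq)
  # Sort by UMI, then by decreasing read length, so the longest read comes first
  ord <- order(reads$umi, -read_len)
  sorted <- reads[ord, , drop = FALSE]
  # Keep the first (longest) read seen for each UMI
  sorted[!duplicated(sorted$umi), , drop = FALSE]
}

reads <- data.frame(
  umi = c("AAAC", "AAAC", "GGTT"),
  seq = c("ACGT", "ACGTACGT", "AC"),
  stringsAsFactors = FALSE
)
dedup <- keep_longest_per_umi(reads)
```

With this input, the 8-base read is kept for `AAAC` and the only read is kept for `GGTT`.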