Calculate the per gene UMI count matrix by parsing the genome alignment file.

Usage

quantify_gene(
  annotation,
  outdir,
  infq,
  n_process,
  pipeline = "sc_single_sample",
  samples = NULL
)

Arguments

annotation

The file path to the annotation file in GFF3 format.

outdir

The path to the directory in which all output files are stored.

infq

The input FASTQ file.

n_process

The number of processes to use for parallelization.

pipeline

The pipeline type as a character string, either sc_single_sample (single-cell, single-sample), bulk (bulk, single or multi-sample), or sc_multi_sample (single-cell, multiple samples).

samples

A vector of sample names. Defaults to the file names of the input fastq files, or to the folder names if fastqs is a vector of folders.

Value

The count matrix will be saved in the output folder as transcript_count.csv.gz.

Details

After the genome alignment step (do_genome_align), the alignment file is parsed to generate the per gene UMI count matrix. For each gene in the annotation file, reads whose mapped ranges overlap the gene's genomic coordinates are assigned to that gene. A read that overlaps multiple genes is assigned to the gene with the highest number of overlapping nucleotides. If several genes tie for the highest number of overlapping nucleotides, the read is not assigned.
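
The assignment rule above can be illustrated with a minimal R sketch. This is not the package's implementation: assign_read_to_gene is a hypothetical helper, the gene table is made-up data, and the sketch assumes 1-based inclusive coordinates on a single chromosome and ignores strand.

```r
# Hypothetical helper (not part of the package): assign one read to a gene
# by maximum overlap, leaving ties unassigned.
# genes: data.frame with columns gene_id, start, end (1-based, inclusive)
assign_read_to_gene <- function(read_start, read_end, genes) {
  # Overlap width in nucleotides between the read and each gene (0 if disjoint)
  overlap <- pmax(0, pmin(read_end, genes$end) - pmax(read_start, genes$start) + 1)
  best <- max(overlap)
  if (best == 0) return(NA_character_)            # read overlaps no gene
  winners <- genes$gene_id[overlap == best]
  if (length(winners) > 1) return(NA_character_)  # tie: read is not assigned
  winners
}

# Made-up annotation for illustration
genes <- data.frame(
  gene_id = c("geneA", "geneB", "geneC"),
  start   = c(100, 150, 400),
  end     = c(200, 300, 500),
  stringsAsFactors = FALSE
)

unique_hit <- assign_read_to_gene(120, 180, genes)  # 61 nt on geneA vs 31 nt on geneB
tied_hit   <- assign_read_to_gene(141, 209, genes)  # 60 nt on both geneA and geneB
no_hit     <- assign_read_to_gene(310, 350, genes)  # overlaps nothing
```

Here unique_hit is "geneA", while tied_hit and no_hit are both NA, matching the rule that tied reads are left unassigned.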

After the read-to-gene assignment, the per gene UMI count matrix is generated. Specifically, for each gene, reads with similar mapping coordinates of their transcript termination sites (TTS, i.e. the end of the read with a polyT or polyA) are grouped together. UMIs of reads in the same group are collapsed to generate the UMI counts for each gene.
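
A rough R sketch of the grouping and collapsing step follows. It rests on assumptions that the text above does not specify: count_umis_per_gene and its tts_window argument are hypothetical, "similar" TTS coordinates are taken to mean a gap of at most tts_window bases between neighbouring sites, and only identical UMI sequences are collapsed.

```r
# Hypothetical sketch (not the package's implementation).
# reads: data.frame with columns gene_id, tts (integer coordinate), umi
count_umis_per_gene <- function(reads, tts_window = 10) {
  vapply(split(reads, reads$gene_id), function(g) {
    g <- g[order(g$tts), ]
    # Start a new TTS group whenever the gap to the previous site exceeds the window
    group <- cumsum(c(TRUE, diff(g$tts) > tts_window))
    # Within a TTS group, identical UMIs are counted as one molecule
    nrow(unique(data.frame(group = group, umi = g$umi)))
  }, integer(1))
}

# Made-up reads for illustration
reads <- data.frame(
  gene_id = c("geneA", "geneA", "geneA", "geneA", "geneB"),
  tts     = c(1000L, 1003L, 1500L, 1002L, 2000L),
  umi     = c("AACG", "AACG", "AACG", "TTGC", "AACG"),
  stringsAsFactors = FALSE
)

counts <- count_umis_per_gene(reads)
```

For geneA, the reads ending at 1000, 1002 and 1003 form one TTS group containing two distinct UMIs, and the read ending at 1500 forms a second group, so counts holds 3 for geneA and 1 for geneB.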

Finally, a new fastq file of deduplicated reads is written, keeping the longest read for each UMI.
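
The selection of the longest read per UMI might look like the following R sketch. keep_longest_per_umi and the umi_key column (a made-up key combining gene, TTS group and UMI) are hypothetical names used for illustration only, and writing the selected reads back to a fastq file is omitted.

```r
# Hypothetical sketch (not the package's implementation).
# reads: data.frame with columns read_id, umi_key, sequence
keep_longest_per_umi <- function(reads) {
  len <- nchar(reads$sequence)
  # Sort by key, longest read first, then keep the first row for each key
  sorted <- reads[order(reads$umi_key, -len), ]
  sorted[!duplicated(sorted$umi_key), ]
}

# Made-up reads for illustration
reads <- data.frame(
  read_id  = c("r1", "r2", "r3"),
  umi_key  = c("geneA|1|AACG", "geneA|1|AACG", "geneA|1|TTGC"),
  sequence = c("ACGT", "ACGTACGT", "ACG"),
  stringsAsFactors = FALSE
)

dedup <- keep_longest_per_umi(reads)
```

In this example r1 and r2 share a UMI key, so only the longer read r2 is kept alongside r3.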