Semi-supervised isoform detection and annotation for long read data.
This variant is for single cell data. By default, this pipeline demultiplexes input
fastq data (match_cell_barcode = TRUE
). Specific parameters relating to
analysis can be changed either through function arguments, or through a
configuration JSON file.
Usage
sc_long_pipeline(
annotation,
fastq,
genome_bam = NULL,
outdir,
genome_fa,
minimap2 = NULL,
k8 = NULL,
barcodes_file = NULL,
expect_cell_number = NULL,
config_file = NULL
)
Arguments
- annotation
The file path to the annotation file in GFF3 format
- fastq
The file path to input fastq file
- genome_bam
Optional file path to a bam file to use instead of fastq file (skips initial alignment step)
- outdir
The path to directory to store all output files.
- genome_fa
The file path to genome fasta file.
- minimap2
Path to minimap2, if it is not in PATH. Only required if either or both of
do_genome_align
anddo_read_realign
areTRUE
.- k8
Path to the k8 Javascript shell binary. Only required if
do_genome_align
isTRUE
.- barcodes_file
The file path to the reference csv used for demultiplexing in flexiplex. If not specified, the demultiplexing will be performed using BLAZE. Default is
NULL
.- expect_cell_number
Expected number of cells for identifying the barcode list in BLAZE. This could be just a rough estimate. E.g., the targeted number of cells. Required if the
do_barcode_demultiplex
areTRUE
in the the JSON configuration file andbarcodes_file
is not specified. Default isNULL
.- config_file
File path to the JSON configuration file. If specified,
config_file
overrides all configuration parameters
Value
if do_transcript_quantification
set to true, sc_long_pipeline
returns a SingleCellExperiment
object, containing a count
matrix as an assay, gene annotations under metadata, as well as a list of the other
output files generated by the pipeline. The pipeline also outputs a number of output
files into the given outdir
directory. These output files generated by the pipeline are:
- transcript_count.csv.gz
- a transcript count matrix (also contained in the SingleCellExperiment)
- isoform_annotated.filtered.gff3
- isoforms in gff3 format (also contained in the SingleCellExperiment)
- transcript_assembly.fa
- transcript sequence from the isoforms
- align2genome.bam
- sorted BAM file with reads aligned to genome
- realign2transcript.bam
- sorted realigned BAM file using the transcript_assembly.fa as reference
- tss_tes.bedgraph
- TSS TES enrichment for all reads (for QC)
if do_transcript_quantification
set to false, nothing will be returned
Details
By default FLAMES use minimap2 for read alignment. After the genome alignment step (do_genome_align
), FLAMES summarizes the alignment for each read by grouping reads
with similar splice junctions to get a raw isoform annotation (do_isoform_id
). The raw isoform
annotation is compared against the reference annotation to correct potential splice site
and transcript start/end errors. Transcripts that have similar splice junctions
and transcript start/end to the reference transcript are merged with the
reference. This process will also collapse isoforms that are likely to be truncated
transcripts. If isoform_id_bambu
is set to TRUE
, bambu::bambu
will be used to generate the updated annotations.
Next is the read realignment step (do_read_realign
), where the sequence of each transcript from the update annotation is extracted, and
the reads are realigned to this updated transcript_assembly.fa
by minimap2. The
transcripts with only a few full-length aligned reads are discarded.
The reads are assigned to transcripts based on both alignment score, fractions of
reads aligned and transcript coverage. Reads that cannot be uniquely assigned to
transcripts or have low transcript coverage are discarded. The UMI transcript
count matrix is generated by collapsing the reads with the same UMI in a similar
way to what is done for short-read scRNA-seq data, but allowing for an edit distance
of up to 2 by default. Most of the parameters, such as the minimal distance to splice site and minimal percentage of transcript coverage
can be modified by the JSON configuration file (config_file
).
The default parameters can be changed either through the function
arguments are through the configuration JSON file config_file
. the pipeline_parameters
section specifies which steps are to be executed in the pipeline - by default, all
steps are executed. The isoform_parameters
section affects isoform detection - key
parameters include:
Min_sup_cnt
which causes transcripts with less reads aligned than it's value to be discarded
MAX_TS_DIST
which merges transcripts with the same intron chain and TSS/TES distace less than
MAX_TS_DIST
strand_specific
which specifies if reads are in the same strand as the mRNA (1), or the reverse complemented (-1) or not strand specific (0), which results in strand information being based on reference annotation.
See also
bulk_long_pipeline()
for bulk long data,
SingleCellExperiment()
for how data is outputted
Examples
outdir <- tempfile()
dir.create(outdir)
bc_allow <- file.path(outdir, "bc_allow.tsv")
genome_fa <- file.path(outdir, "rps24.fa")
R.utils::gunzip(
filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"),
destname = bc_allow, remove = FALSE
)
R.utils::gunzip(
filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"),
destname = genome_fa, remove = FALSE
)
if (!any(is.na(find_bin(c("minimap2", "k8"))))) {
sce <- FLAMES::sc_long_pipeline(
genome_fa = genome_fa,
fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"),
annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"),
outdir = outdir,
barcodes_file = bc_allow
)
}
#> No config file provided, creating a default config in /tmp/RtmpoS7Kzz/file1f6d48152c2a
#> Writing configuration parameters to: /tmp/RtmpoS7Kzz/file1f6d48152c2a/config_file_8045.json
#> 02:14:15 AM Fri Oct 25 2024 Start running
#> 02:14:15 AM Fri Oct 25 2024 Demultiplexing using flexiplex...
#> Matching cell barcodes...
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be replaced
#> Setting number of threads to 8
#> Search pattern:
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpoS7Kzz/file1f6d48152c2a/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /__w/_temp/Library/FLAMES/extdata/fastq/musc_rps24.fastq.gz
#> Searching for barcodes...
#> Number of reads processed: 393
#> Number of reads where at least one barcode was found: 368
#> Number of reads with exactly one barcode match: 364
#> Number of chimera reads: 1
#> All done!
#> 02:14:15 AM Fri Oct 25 2024 Demultiplex done
#> Running FLAMES pipeline...
#> #### Input parameters:
#> {
#> "comment": ["this is the default config for nanopore single cell long read data using 10x 3'end kit. use splice annotation in alignment. "],
#> "pipeline_parameters": {
#> "seed": [2022],
#> "threads": [8],
#> "do_barcode_demultiplex": [true],
#> "do_gene_quantification": [true],
#> "do_genome_alignment": [true],
#> "do_isoform_identification": [true],
#> "bambu_isoform_identification": [false],
#> "multithread_isoform_identification": [false],
#> "do_read_realignment": [true],
#> "do_transcript_quantification": [true],
#> "oarfish_quantification": [true]
#> },
#> "barcode_parameters": {
#> "max_bc_editdistance": [2],
#> "max_flank_editdistance": [8],
#> "pattern": {
#> "primer": ["CTACACGACGCTCTTCCGATCT"],
#> "BC": ["NNNNNNNNNNNNNNNN"],
#> "UMI": ["NNNNNNNNNNNN"],
#> "polyT": ["TTTTTTTTT"]
#> },
#> "strand": ["-"],
#> "TSO_seq": ["AAGCAGTGGTATCAACGCAGAGTACATGGG"],
#> "TSO_prime": [3],
#> "full_length_only": [false]
#> },
#> "isoform_parameters": {
#> "generate_raw_isoform": [false],
#> "max_dist": [10],
#> "max_ts_dist": [100],
#> "max_splice_match_dist": [10],
#> "min_fl_exon_len": [40],
#> "max_site_per_splice": [3],
#> "min_sup_cnt": [5],
#> "min_cnt_pct": [0.001],
#> "min_sup_pct": [0.2],
#> "bambu_trust_reference": [true],
#> "strand_specific": [0],
#> "remove_incomp_reads": [4],
#> "downsample_ratio": [1]
#> },
#> "alignment_parameters": {
#> "use_junctions": [true],
#> "no_flank": [false]
#> },
#> "realign_parameters": {
#> "use_annotation": [true]
#> },
#> "transcript_counting": {
#> "min_tr_coverage": [0.4],
#> "min_read_coverage": [0.4]
#> }
#> }
#> gene annotation: /__w/_temp/Library/FLAMES/extdata/rps24.gtf.gz
#> genome fasta: /tmp/RtmpoS7Kzz/file1f6d48152c2a/rps24.fa
#> input fastq: /tmp/RtmpoS7Kzz/file1f6d48152c2a/matched_reads.fastq
#> output directory: /tmp/RtmpoS7Kzz/file1f6d48152c2a
#> minimap2 path:
#> k8 path:
#> #### Aligning reads to genome using minimap2
#> 02:14:15 AM Fri Oct 25 2024 minimap2_align
#> 02:14:15 AM Fri Oct 25 2024 Start gene quantification and UMI deduplication
#> 02:14:15 AM Fri Oct 25 2024 quantify genes
#> Found genome alignment file(s): align2genome.bam
#> 02:14:15 AM Fri Oct 25 2024 Gene quantification and UMI deduplication done!
#> 02:14:15 AM Fri Oct 25 2024 Start isoform identificaiton
#> 02:14:15 AM Fri Oct 25 2024 find_isoform
#> 02:14:15 AM Fri Oct 25 2024 Isoform identificaiton done
#> #### Realigning deduplicated reads to transcript using minimap2
#> 02:14:15 AM Fri Oct 25 2024 minimap2_realign
#> Sorting by CB
#> #### Generating transcript count matrix