Skip to contents

Semi-supervised isoform detection and annotation for long read data. This variant is for multi-sample single cell data. By default, this pipeline demultiplexes input fastq data (match_cell_barcode = TRUE). Specific parameters relating to analysis can be changed either through function arguments, or through a configuration JSON file.

Usage

sc_long_multisample_pipeline(
  annotation,
  fastqs,
  outdir,
  genome_fa,
  minimap2 = NULL,
  k8 = NULL,
  barcodes_file = NULL,
  expect_cell_numbers = NULL,
  config_file = NULL
)

Arguments

annotation

The file path to the annotation file in GFF3 format

fastqs

The input fastq files for multiple samples. Should be a named vector of file paths (eithr to FASTQ files or directories containing FASTQ files). The names of the vector will be used as the sample names.

outdir

The path to directory to store all output files.

genome_fa

The file path to genome fasta file.

minimap2

Path to minimap2, if it is not in PATH. Only required if either or both of do_genome_align and do_read_realign are TRUE.

k8

Path to the k8 Javascript shell binary. Only required if do_genome_align is TRUE.

barcodes_file

The file path to the reference csv used for demultiplexing in flexiplex. If not specified, the demultiplexing will be performed using BLAZE. Default is NULL.

expect_cell_numbers

A vector of roughly expected numbers of cells in each sample E.g., the targeted number of cells. Required if using BLAZE for demultiplexing, specifically, when the do_barcode_demultiplex are TRUE in the the JSON configuration file and barcodes_file is not specified. Default is NULL.

config_file

File path to the JSON configuration file. If specified, config_file overrides all configuration parameters

Value

If "do_transcript_quantification" set to true, a list with two elements:

metadata

A list of metadata from the pipeline run.

sces

A list of SingleCellExperiment objects, one for each sample.

Details

By default FLAMES use minimap2 for read alignment. After the genome alignment step (do_genome_align), FLAMES summarizes the alignment for each read in every sample by grouping reads with similar splice junctions to get a raw isoform annotation (do_isoform_id). The raw isoform annotation is compared against the reference annotation to correct potential splice site and transcript start/end errors. Transcripts that have similar splice junctions and transcript start/end to the reference transcript are merged with the reference. This process will also collapse isoforms that are likely to be truncated transcripts. If isoform_id_bambu is set to TRUE, bambu::bambu will be used to generate the updated annotations (Not implemented for multi-sample yet). Next is the read realignment step (do_read_realign), where the sequence of each transcript from the update annotation is extracted, and the reads are realigned to this updated transcript_assembly.fa by minimap2. The transcripts with only a few full-length aligned reads are discarded (Not implemented for multi-sample yet). The reads are assigned to transcripts based on both alignment score, fractions of reads aligned and transcript coverage. Reads that cannot be uniquely assigned to transcripts or have low transcript coverage are discarded. The UMI transcript count matrix is generated by collapsing the reads with the same UMI in a similar way to what is done for short-read scRNA-seq data, but allowing for an edit distance of up to 2 by default. Most of the parameters, such as the minimal distance to splice site and minimal percentage of transcript coverage can be modified by the JSON configuration file (config_file).

The default parameters can be changed either through the function arguments are through the configuration JSON file config_file. the pipeline_parameters section specifies which steps are to be executed in the pipeline - by default, all steps are executed. The isoform_parameters section affects isoform detection - key parameters include:

Min_sup_cnt

which causes transcripts with less reads aligned than it's value to be discarded

MAX_TS_DIST

which merges transcripts with the same intron chain and TSS/TES distace less than MAX_TS_DIST

strand_specific

which specifies if reads are in the same strand as the mRNA (1), or the reverse complemented (-1) or not strand specific (0), which results in strand information being based on reference annotation.

See also

bulk_long_pipeline() for bulk long data, SingleCellExperiment() for how data is outputted

Examples

reads <- ShortRead::readFastq(
  system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES")
)
outdir <- tempfile()
dir.create(outdir)
dir.create(file.path(outdir, "fastq"))
bc_allow <- file.path(outdir, "bc_allow.tsv")
genome_fa <- file.path(outdir, "rps24.fa")
R.utils::gunzip(
  filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"),
  destname = bc_allow, remove = FALSE
)
R.utils::gunzip(
  filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"),
  destname = genome_fa, remove = FALSE
)
ShortRead::writeFastq(reads[1:100],
  file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE)
reads <- reads[-(1:100)]
ShortRead::writeFastq(reads[1:100],
  file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE)
reads <- reads[-(1:100)]
ShortRead::writeFastq(reads,
  file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE)

sce_list <- FLAMES::sc_long_multisample_pipeline(
  annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"),
  fastqs = c("sampleA" = file.path(outdir, "fastq"),
    "sample1" = file.path(outdir, "fastq", "sample1.fq.gz"),
    "sample2" = file.path(outdir, "fastq", "sample2.fq.gz"),
    "sample3" = file.path(outdir, "fastq", "sample3.fq.gz")),
  outdir = outdir,
  genome_fa = genome_fa,
  barcodes_file = rep(bc_allow, 4)
)
#> No config file provided, creating a default config in /tmp/RtmpoS7Kzz/file1f6d5a5aef91 
#> Writing configuration parameters to:  /tmp/RtmpoS7Kzz/file1f6d5a5aef91/config_file_8045.json 
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be  replaced
#> Setting number of threads to 8
#> Search pattern: 
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpoS7Kzz/file1f6d5a5aef91/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /tmp/RtmpoS7Kzz/file1f6d5a5aef91/fastq/sample1.fq.gz
#> Searching for barcodes...
#> Processing file: /tmp/RtmpoS7Kzz/file1f6d5a5aef91/fastq/sample2.fq.gz
#> Searching for barcodes...
#> Processing file: /tmp/RtmpoS7Kzz/file1f6d5a5aef91/fastq/sample3.fq.gz
#> Searching for barcodes...
#> Number of reads processed: 393
#> Number of reads where at least one barcode was found: 368
#> Number of reads with exactly one barcode match: 364
#> Number of chimera reads: 1
#> All done!
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be  replaced
#> Setting number of threads to 8
#> Search pattern: 
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpoS7Kzz/file1f6d5a5aef91/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /tmp/RtmpoS7Kzz/file1f6d5a5aef91/fastq/sample1.fq.gz
#> Searching for barcodes...
#> Number of reads processed: 100
#> Number of reads where at least one barcode was found: 92
#> Number of reads with exactly one barcode match: 91
#> Number of chimera reads: 1
#> All done!
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be  replaced
#> Setting number of threads to 8
#> Search pattern: 
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpoS7Kzz/file1f6d5a5aef91/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /tmp/RtmpoS7Kzz/file1f6d5a5aef91/fastq/sample2.fq.gz
#> Searching for barcodes...
#> Number of reads processed: 100
#> Number of reads where at least one barcode was found: 95
#> Number of reads with exactly one barcode match: 94
#> Number of chimera reads: 0
#> All done!
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be  replaced
#> Setting number of threads to 8
#> Search pattern: 
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpoS7Kzz/file1f6d5a5aef91/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /tmp/RtmpoS7Kzz/file1f6d5a5aef91/fastq/sample3.fq.gz
#> Searching for barcodes...
#> Number of reads processed: 193
#> Number of reads where at least one barcode was found: 181
#> Number of reads with exactly one barcode match: 179
#> Number of chimera reads: 0
#> All done!
#> Running FLAMES pipeline...
#> #### Input parameters:
#> {
#>   "comment": ["this is the default config for nanopore single cell long read data using 10x 3'end kit. use splice annotation in alignment. "],
#>   "pipeline_parameters": {
#>     "seed": [2022],
#>     "threads": [8],
#>     "do_barcode_demultiplex": [true],
#>     "do_gene_quantification": [true],
#>     "do_genome_alignment": [true],
#>     "do_isoform_identification": [true],
#>     "bambu_isoform_identification": [false],
#>     "multithread_isoform_identification": [false],
#>     "do_read_realignment": [true],
#>     "do_transcript_quantification": [true],
#>     "oarfish_quantification": [true]
#>   },
#>   "barcode_parameters": {
#>     "max_bc_editdistance": [2],
#>     "max_flank_editdistance": [8],
#>     "pattern": {
#>       "primer": ["CTACACGACGCTCTTCCGATCT"],
#>       "BC": ["NNNNNNNNNNNNNNNN"],
#>       "UMI": ["NNNNNNNNNNNN"],
#>       "polyT": ["TTTTTTTTT"]
#>     },
#>     "strand": ["-"],
#>     "TSO_seq": ["AAGCAGTGGTATCAACGCAGAGTACATGGG"],
#>     "TSO_prime": [3],
#>     "full_length_only": [false]
#>   },
#>   "isoform_parameters": {
#>     "generate_raw_isoform": [false],
#>     "max_dist": [10],
#>     "max_ts_dist": [100],
#>     "max_splice_match_dist": [10],
#>     "min_fl_exon_len": [40],
#>     "max_site_per_splice": [3],
#>     "min_sup_cnt": [5],
#>     "min_cnt_pct": [0.001],
#>     "min_sup_pct": [0.2],
#>     "bambu_trust_reference": [true],
#>     "strand_specific": [0],
#>     "remove_incomp_reads": [4],
#>     "downsample_ratio": [1]
#>   },
#>   "alignment_parameters": {
#>     "use_junctions": [true],
#>     "no_flank": [false]
#>   },
#>   "realign_parameters": {
#>     "use_annotation": [true]
#>   },
#>   "transcript_counting": {
#>     "min_tr_coverage": [0.4],
#>     "min_read_coverage": [0.4]
#>   }
#> }gene annotation: /__w/_temp/Library/FLAMES/extdata/rps24.gtf.gz 
#> genome fasta: /tmp/RtmpoS7Kzz/file1f6d5a5aef91/rps24.fa 
#> input fastqs: /tmp/RtmpoS7Kzz/file1f6d5a5aef91/sampleA_matched_reads.fastq
#>  /tmp/RtmpoS7Kzz/file1f6d5a5aef91/sample1_matched_reads.fastq
#>  /tmp/RtmpoS7Kzz/file1f6d5a5aef91/sample2_matched_reads.fastq
#>  /tmp/RtmpoS7Kzz/file1f6d5a5aef91/sample3_matched_reads.fastq
#>  
#> output directory: /tmp/RtmpoS7Kzz/file1f6d5a5aef91 
#> minimap2 path: 
#> k8 path: 
#> #### Aligning reads to genome using minimap2
#> 	Aligning sample  sampleA ...
#> 02:14:10 AM Fri Oct 25 2024 minimap2_align
#> 	Aligning sample  sample1 ...
#> 02:14:11 AM Fri Oct 25 2024 minimap2_align
#> 	Aligning sample  sample2 ...
#> 02:14:11 AM Fri Oct 25 2024 minimap2_align
#> 	Aligning sample  sample3 ...
#> 02:14:11 AM Fri Oct 25 2024 minimap2_align
#> 02:14:11 AM Fri Oct 25 2024 quantify genes 
#> Found genome alignment file(s): 	sample1_align2genome.bam
#> 	sample2_align2genome.bam
#> 	sample3_align2genome.bam
#> 	sampleA_align2genome.bam
#> 02:14:12 AM Fri Oct 25 2024 find_isoform
#> #### Realign to transcript using minimap2
#> 	Realigning sample  sampleA ...
#> 02:14:12 AM Fri Oct 25 2024 minimap2_realign
#> Sorting by  CB 
#> 	Realigning sample  sample1 ...
#> 02:14:12 AM Fri Oct 25 2024 minimap2_realign
#> Sorting by  CB 
#> 	Realigning sample  sample2 ...
#> 02:14:12 AM Fri Oct 25 2024 minimap2_realign
#> Sorting by  CB 
#> 	Realigning sample  sample3 ...
#> 02:14:12 AM Fri Oct 25 2024 minimap2_realign
#> Sorting by  CB 
#> #### Generating transcript count matrix