Pipeline for multi-sample long-read scRNA-seq data

Semi-supervised isofrom detection and annotation for long read data. This variant is meant for multi-sample scRNA-seq data. Specific parameters can be configured in the config file (see create_config), input files are specified via arguments.

Usage

MultiSampleSCPipeline(
  config_file,
  outdir,
  fastq,
  annotation,
  genome_fa,
  genome_mmi,
  minimap2,
  samtools,
  barcodes_file,
  expect_cell_number,
  controllers
)

Arguments

config_file: Path to the JSON configuration file. See create_config for creating one.
outdir: Path to the output directory. If it does not exist, it will be created.
fastq: A named vector of fastq file (or folder) paths. Each element of the vector will be treated as a sample. The names of the vector will be used as the sample names. If not named, the sample names will be generated from the file names.
annotation: The file path to the annotation file in GFF3 / GTF format.
genome_fa: The file path to the reference genome in FASTA format.
genome_mmi: (optional) The file path to minimap2's index reference genome.
minimap2: (optional) The path to the minimap2 binary. If not provided, FLAMES will use a copy from bioconda via basilisk.
samtools: (optional) The path to the samtools binary. If not provided, FLAMES will use a copy from bioconda via basilisk.
barcodes_file: The file with expected cell barcodes, with each barcode on a new line.
expect_cell_number: The expected number of cells in the sample. This is used if barcodes_file is not provided. See BLAZE for more details.
controllers: (optional, experimental) A crew_class_controller object for running certain steps

Value

A FLAMES.MultiSampleSCPipeline object. The pipeline can be run using the run_FLAMES function. The resulting list of SingleCellExperiment objects can be accessed using the experiment method.

Details

By default the pipeline starts with demultiplexing the input fastq data. If the cell barcodes are known apriori (e.g. via coupled short-read sequencing), the barcodes_file argument can be used to specify a file containing the cell barcodes, and a modified Rcpp version of flexiplex will be used; otherwise, expect_cell_number need to be provided, and BLAZE will be used to generate the cell barcodes. The pipeline then aligns the reads to the genome using minimap2. The alignment is then used for isoform detection (either using FLAMES or bambu, can be configured). The reads are then realigned to the detected isoforms. Finally, a transcript count matrix is generated (either using FLAMES's simplistic counting or oarfish's Expectation Maximization algorithm, can be configured). The results can be accssed with experiment(pipeline). If the pipeline errored out / new steps were configured, it can be resumed by calling resume_FLAMES(pipeline)

Examples

reads <- ShortRead::readFastq(
  system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES")
)
outdir <- tempfile()
dir.create(outdir)
dir.create(file.path(outdir, "fastq"))
bc_allow <- file.path(outdir, "bc_allow.tsv")
genome_fa <- file.path(outdir, "rps24.fa")
R.utils::gunzip(
  filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"),
  destname = bc_allow, remove = FALSE
)
R.utils::gunzip(
  filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"),
  destname = genome_fa, remove = FALSE
)
ShortRead::writeFastq(reads[1:100],
  file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE)
reads <- reads[-(1:100)]
ShortRead::writeFastq(reads[1:100],
  file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE)
reads <- reads[-(1:100)]
ShortRead::writeFastq(reads,
  file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE)
ppl <- MultiSampleSCPipeline(
  config_file = create_config(outdir, type = "sc_3end", threads = 1, no_flank = TRUE),
  outdir = outdir,
  fastq = c("sampleA" = file.path(outdir, "fastq"),
    "sample1" = file.path(outdir, "fastq", "sample1.fq.gz"),
    "sample2" = file.path(outdir, "fastq", "sample2.fq.gz"),
    "sample3" = file.path(outdir, "fastq", "sample3.fq.gz")),
  annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"),
  genome_fa = genome_fa,
  barcodes_file = rep(bc_allow, 4)
)
#> Writing configuration parameters to:  /tmp/Rtmp3Wjsxa/file358a74837864/config_file_13706.json 
#> Configured steps: 
#> 	barcode_demultiplex: TRUE
#> 	genome_alignment: TRUE
#> 	gene_quantification: TRUE
#> 	isoform_identification: TRUE
#> 	read_realignment: TRUE
#> 	transcript_quantification: TRUE
#> samtools not found, will use Rsamtools package instead
ppl <- run_FLAMES(ppl)
#> ── Running step: barcode_demultiplex @ Fri Jul  4 03:18:37 2025 ────────────────
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be  replaced
#> Setting number of threads to 1
#> Search pattern: 
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/Rtmp3Wjsxa/file358a74837864/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /tmp/Rtmp3Wjsxa/file358a74837864/fastq/sample1.fq.gz
#> Searching for barcodes...
#> Processing file: /tmp/Rtmp3Wjsxa/file358a74837864/fastq/sample2.fq.gz
#> Searching for barcodes...
#> Processing file: /tmp/Rtmp3Wjsxa/file358a74837864/fastq/sample3.fq.gz
#> Searching for barcodes...
#> Number of reads processed: 393
#> Number of reads where at least one barcode was found: 368
#> Number of reads with exactly one barcode match: 364
#> Number of chimera reads: 1
#> All done!
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be  replaced
#> Setting number of threads to 1
#> Search pattern: 
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/Rtmp3Wjsxa/file358a74837864/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /tmp/Rtmp3Wjsxa/file358a74837864/fastq/sample1.fq.gz
#> Searching for barcodes...
#> Number of reads processed: 100
#> Number of reads where at least one barcode was found: 92
#> Number of reads with exactly one barcode match: 91
#> Number of chimera reads: 1
#> All done!
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be  replaced
#> Setting number of threads to 1
#> Search pattern: 
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/Rtmp3Wjsxa/file358a74837864/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /tmp/Rtmp3Wjsxa/file358a74837864/fastq/sample2.fq.gz
#> Searching for barcodes...
#> Number of reads processed: 100
#> Number of reads where at least one barcode was found: 95
#> Number of reads with exactly one barcode match: 94
#> Number of chimera reads: 0
#> All done!
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be  replaced
#> Setting number of threads to 1
#> Search pattern: 
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/Rtmp3Wjsxa/file358a74837864/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /tmp/Rtmp3Wjsxa/file358a74837864/fastq/sample3.fq.gz
#> Searching for barcodes...
#> Number of reads processed: 193
#> Number of reads where at least one barcode was found: 181
#> Number of reads with exactly one barcode match: 179
#> Number of chimera reads: 0
#> All done!
#> ── Running step: genome_alignment @ Fri Jul  4 03:18:38 2025 ───────────────────
#> Creating junction bed file from GFF3 annotation.
#> Aligning sample /tmp/Rtmp3Wjsxa/file358a74837864/sampleA_matched_reads.fastq.gz -> /tmp/Rtmp3Wjsxa/file358a74837864/sampleA_align2genome.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 1 threads...
#> Indexing bam files
#> Aligning sample /tmp/Rtmp3Wjsxa/file358a74837864/sample1_matched_reads.fastq.gz -> /tmp/Rtmp3Wjsxa/file358a74837864/sample1_align2genome.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 1 threads...
#> Indexing bam files
#> Aligning sample /tmp/Rtmp3Wjsxa/file358a74837864/sample2_matched_reads.fastq.gz -> /tmp/Rtmp3Wjsxa/file358a74837864/sample2_align2genome.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 1 threads...
#> Indexing bam files
#> Aligning sample /tmp/Rtmp3Wjsxa/file358a74837864/sample3_matched_reads.fastq.gz -> /tmp/Rtmp3Wjsxa/file358a74837864/sample3_align2genome.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 1 threads...
#> Indexing bam files
#> ── Running step: gene_quantification @ Fri Jul  4 03:18:39 2025 ────────────────
#> 03:18:39 AM Fri Jul 04 2025 quantify genes 
#> Using BAM(s): /tmp/Rtmp3Wjsxa/file358a74837864/sampleA_align2genome.bam,
#> /tmp/Rtmp3Wjsxa/file358a74837864/sample1_align2genome.bam,
#> /tmp/Rtmp3Wjsxa/file358a74837864/sample2_align2genome.bam, and
#> /tmp/Rtmp3Wjsxa/file358a74837864/sample3_align2genome.bam
#> ── Running step: isoform_identification @ Fri Jul  4 03:18:40 2025 ─────────────
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> ── Running step: read_realignment @ Fri Jul  4 03:18:41 2025 ───────────────────
#> Checking for fastq file(s) /tmp/Rtmp3Wjsxa/file358a74837864/fastq, /tmp/Rtmp3Wjsxa/file358a74837864/fastq/sample1.fq.gz, /tmp/Rtmp3Wjsxa/file358a74837864/fastq/sample2.fq.gz, /tmp/Rtmp3Wjsxa/file358a74837864/fastq/sample3.fq.gz
#> 	files found
#> Checking for fastq file(s) /tmp/Rtmp3Wjsxa/file358a74837864/sampleA_matched_reads.fastq.gz, /tmp/Rtmp3Wjsxa/file358a74837864/sample1_matched_reads.fastq.gz, /tmp/Rtmp3Wjsxa/file358a74837864/sample2_matched_reads.fastq.gz, /tmp/Rtmp3Wjsxa/file358a74837864/sample3_matched_reads.fastq.gz
#> 	files found
#> Checking for fastq file(s) /tmp/Rtmp3Wjsxa/file358a74837864/sampleA_matched_reads_dedup.fastq.gz, /tmp/Rtmp3Wjsxa/file358a74837864/sample1_matched_reads_dedup.fastq.gz, /tmp/Rtmp3Wjsxa/file358a74837864/sample2_matched_reads_dedup.fastq.gz, /tmp/Rtmp3Wjsxa/file358a74837864/sample3_matched_reads_dedup.fastq.gz
#> 	files found
#> Realigning sample /tmp/Rtmp3Wjsxa/file358a74837864/sampleA_matched_reads_dedup.fastq.gz -> /tmp/Rtmp3Wjsxa/file358a74837864/sampleA_realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by 1 with CB threads...
#> Realigning sample /tmp/Rtmp3Wjsxa/file358a74837864/sample1_matched_reads_dedup.fastq.gz -> /tmp/Rtmp3Wjsxa/file358a74837864/sample1_realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by 1 with CB threads...
#> Realigning sample /tmp/Rtmp3Wjsxa/file358a74837864/sample2_matched_reads_dedup.fastq.gz -> /tmp/Rtmp3Wjsxa/file358a74837864/sample2_realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by 1 with CB threads...
#> Realigning sample /tmp/Rtmp3Wjsxa/file358a74837864/sample3_matched_reads_dedup.fastq.gz -> /tmp/Rtmp3Wjsxa/file358a74837864/sample3_realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by 1 with CB threads...
#> ── Running step: transcript_quantification @ Fri Jul  4 03:18:42 2025 ──────────
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
experiment(ppl)
#> $sampleA
#> class: SingleCellExperiment 
#> dim: 14 137 
#> metadata(0):
#> assays(1): counts
#> rownames(14): ENSMUST00000169826.2 ENSMUSG00000025290.17_19_5159_1 ...
#>   ENSMUSG00000025290.17_19_5159_9 ENSMUST00000225023.1
#> rowData names(2): transcript_id gene_id
#> colnames(137): AACCATGAGTCGTTTG AACTCTTGTCACCTAA ... TTGTAGGTCTTACCTA
#>   TTTATGCAGACTAGAT
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> 
#> $sample1
#> class: SingleCellExperiment 
#> dim: 14 62 
#> metadata(0):
#> assays(1): counts
#> rownames(14): ENSMUST00000169826.2 ENSMUSG00000025290.17_19_5159_1 ...
#>   ENSMUSG00000025290.17_19_5159_9 ENSMUST00000225023.1
#> rowData names(2): transcript_id gene_id
#> colnames(62): AAGCCGCGTGTGAATA AAGGAGCGTGCTGTAT ... TTGTAGGTCAGTGTTG
#>   TTTATGCAGACTAGAT
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> 
#> $sample2
#> class: SingleCellExperiment 
#> dim: 14 67 
#> metadata(0):
#> assays(1): counts
#> rownames(14): ENSMUST00000169826.2 ENSMUSG00000025290.17_19_5159_1 ...
#>   ENSMUSG00000025290.17_19_5159_9 ENSMUST00000225023.1
#> rowData names(2): transcript_id gene_id
#> colnames(67): AACCATGAGTCGTTTG AAGCCGCGTGTGAATA ... TTGTAGGTCTTACCTA
#>   TTTATGCAGACTAGAT
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> 
#> $sample3
#> class: SingleCellExperiment 
#> dim: 14 99 
#> metadata(0):
#> assays(1): counts
#> rownames(14): ENSMUST00000169826.2 ENSMUSG00000025290.17_19_5159_1 ...
#>   ENSMUSG00000025290.17_19_5159_9 ENSMUST00000225023.1
#> rowData names(2): transcript_id gene_id
#> colnames(99): AACTCTTGTCACCTAA AACTTTCCACAGACTT ... TTGTAGGGTGGGTATG
#>   TTGTAGGTCAGTGTTG
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#>

Usage

Arguments

Value

Details

See also

Examples