Skip to contents

Semi-supervised isofrom detection and annotation for long read data. This variant is meant for multi-sample scRNA-seq data. Specific parameters can be configured in the config file (see create_config), input files are specified via arguments.

Usage

MultiSampleSCPipeline(
  config_file,
  outdir,
  fastq,
  annotation,
  genome_fa,
  minimap2,
  samtools,
  barcodes_file,
  expect_cell_number
)

Arguments

config_file

Path to the JSON configuration file. See create_config for creating one.

outdir

Path to the output directory. If it does not exist, it will be created.

fastq

A named vector of fastq file (or folder) paths. Each element of the vector will be treated as a sample. The names of the vector will be used as the sample names. If not named, the sample names will be generated from the file names.

annotation

The file path to the annotation file in GFF3 / GTF format.

genome_fa

The file path to the reference genome in FASTA format.

minimap2

(optional) The path to the minimap2 binary. If not provided, FLAMES will use a copy from bioconda via basilisk.

samtools

(optional) The path to the samtools binary. If not provided, FLAMES will use a copy from bioconda via basilisk. provided, FLAMES will use a copy from bioconda via basilisk.

barcodes_file

The file with expected cell barcodes, with each barcode on a new line.

expect_cell_number

The expected number of cells in the sample. This is used if barcodes_file is not provided. See BLAZE for more details.

Value

A FLAMES.MultiSampleSCPipeline object. The pipeline can be run using the run_FLAMES function. The resulting list of SingleCellExperiment objects can be accessed using the experiment method.

Details

By default the pipeline starts with demultiplexing the input fastq data. If the cell barcodes are known apriori (e.g. via coupled short-read sequencing), the barcodes_file argument can be used to specify a file containing the cell barcodes, and a modified Rcpp version of flexiplex will be used; otherwise, expect_cell_number need to be provided, and BLAZE will be used to generate the cell barcodes. The pipeline then aligns the reads to the genome using minimap2. The alignment is then used for isoform detection (either using FLAMES or bambu, can be configured). The reads are then realigned to the detected isoforms. Finally, a transcript count matrix is generated (either using FLAMES's simplistic counting or oarfish's Expectation Maximization algorithm, can be configured). The results can be accssed with experiment(pipeline). If the pipeline errored out / new steps were configured, it can be resumed by calling resume_FLAMES(pipeline)

See also

SingleCellPipeline for single-sample long data and more details on the pipeline output, create_config for creating a configuration file, BulkPipeline for bulk long data.

Examples

reads <- ShortRead::readFastq(
  system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES")
)
outdir <- tempfile()
dir.create(outdir)
dir.create(file.path(outdir, "fastq"))
bc_allow <- file.path(outdir, "bc_allow.tsv")
genome_fa <- file.path(outdir, "rps24.fa")
R.utils::gunzip(
  filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"),
  destname = bc_allow, remove = FALSE
)
R.utils::gunzip(
  filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"),
  destname = genome_fa, remove = FALSE
)
ShortRead::writeFastq(reads[1:100],
  file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE)
reads <- reads[-(1:100)]
ShortRead::writeFastq(reads[1:100],
  file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE)
reads <- reads[-(1:100)]
ShortRead::writeFastq(reads,
  file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE)
ppl <- MultiSampleSCPipeline(
  config_file = create_config(outdir, type = "sc_3end", threads = 1, no_flank = TRUE),
  outdir = outdir,
  fastq = c("sampleA" = file.path(outdir, "fastq"),
    "sample1" = file.path(outdir, "fastq", "sample1.fq.gz"),
    "sample2" = file.path(outdir, "fastq", "sample2.fq.gz"),
    "sample3" = file.path(outdir, "fastq", "sample3.fq.gz")),
  annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"),
  genome_fa = genome_fa,
  barcodes_file = rep(bc_allow, 4)
)
#> Writing configuration parameters to:  /tmp/RtmpgpEV0i/file25bf4840fba9/config_file_9663.json 
#> Configured steps: 
#> 	barcode_demultiplex: TRUE
#> 	genome_alignment: TRUE
#> 	gene_quantification: TRUE
#> 	isoform_identification: TRUE
#> 	read_realignment: TRUE
#> 	transcript_quantification: TRUE
#> samtools not found, will use Rsamtools package instead
ppl <- run_FLAMES(ppl)
#> Running step: barcode_demultiplex
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be  replaced
#> Setting number of threads to 1
#> Search pattern: 
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpgpEV0i/file25bf4840fba9/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample1.fq.gz
#> Searching for barcodes...
#> Processing file: /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample2.fq.gz
#> Searching for barcodes...
#> Processing file: /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample3.fq.gz
#> Searching for barcodes...
#> Number of reads processed: 393
#> Number of reads where at least one barcode was found: 368
#> Number of reads with exactly one barcode match: 364
#> Number of chimera reads: 1
#> All done!
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be  replaced
#> Setting number of threads to 1
#> Search pattern: 
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpgpEV0i/file25bf4840fba9/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample1.fq.gz
#> Searching for barcodes...
#> Number of reads processed: 100
#> Number of reads where at least one barcode was found: 92
#> Number of reads with exactly one barcode match: 91
#> Number of chimera reads: 1
#> All done!
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be  replaced
#> Setting number of threads to 1
#> Search pattern: 
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpgpEV0i/file25bf4840fba9/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample2.fq.gz
#> Searching for barcodes...
#> Number of reads processed: 100
#> Number of reads where at least one barcode was found: 95
#> Number of reads with exactly one barcode match: 94
#> Number of chimera reads: 0
#> All done!
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be  replaced
#> Setting number of threads to 1
#> Search pattern: 
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpgpEV0i/file25bf4840fba9/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample3.fq.gz
#> Searching for barcodes...
#> Number of reads processed: 193
#> Number of reads where at least one barcode was found: 181
#> Number of reads with exactly one barcode match: 179
#> Number of chimera reads: 0
#> All done!
#> Running step: genome_alignment
#> Creating junction bed file from GFF3 annotation.
#> Aligning sample /tmp/RtmpgpEV0i/file25bf4840fba9/sampleA_matched_reads.fastq -> /tmp/RtmpgpEV0i/file25bf4840fba9/sampleA_align2genome.bam
#> Your fastq file appears to have tags, but you did not provide the -y option to minimap2 to include the tags in the output.
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 1 threads...
#> Indexing bam files
#> Aligning sample /tmp/RtmpgpEV0i/file25bf4840fba9/sample1_matched_reads.fastq -> /tmp/RtmpgpEV0i/file25bf4840fba9/sample1_align2genome.bam
#> Your fastq file appears to have tags, but you did not provide the -y option to minimap2 to include the tags in the output.
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 1 threads...
#> Indexing bam files
#> Aligning sample /tmp/RtmpgpEV0i/file25bf4840fba9/sample2_matched_reads.fastq -> /tmp/RtmpgpEV0i/file25bf4840fba9/sample2_align2genome.bam
#> Your fastq file appears to have tags, but you did not provide the -y option to minimap2 to include the tags in the output.
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 1 threads...
#> Indexing bam files
#> Aligning sample /tmp/RtmpgpEV0i/file25bf4840fba9/sample3_matched_reads.fastq -> /tmp/RtmpgpEV0i/file25bf4840fba9/sample3_align2genome.bam
#> Your fastq file appears to have tags, but you did not provide the -y option to minimap2 to include the tags in the output.
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 1 threads...
#> Indexing bam files
#> Running step: gene_quantification
#> 03:22:07 AM Wed May 21 2025 quantify genes 
#> Found genome alignment file(s): 	sample1_align2genome.bam
#> 	sample2_align2genome.bam
#> 	sample3_align2genome.bam
#> 	sampleA_align2genome.bam
#> Running step: isoform_identification
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Running step: read_realignment
#> Checking for fastq file(s) /tmp/RtmpgpEV0i/file25bf4840fba9/fastq, /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample1.fq.gz, /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample2.fq.gz, /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample3.fq.gz
#> 	files found
#> Checking for fastq file(s) /tmp/RtmpgpEV0i/file25bf4840fba9/sampleA_matched_reads.fastq, /tmp/RtmpgpEV0i/file25bf4840fba9/sample1_matched_reads.fastq, /tmp/RtmpgpEV0i/file25bf4840fba9/sample2_matched_reads.fastq, /tmp/RtmpgpEV0i/file25bf4840fba9/sample3_matched_reads.fastq
#> 	files found
#> Checking for fastq file(s) /tmp/RtmpgpEV0i/file25bf4840fba9/sampleA_matched_reads_dedup.fastq, /tmp/RtmpgpEV0i/file25bf4840fba9/sample1_matched_reads_dedup.fastq, /tmp/RtmpgpEV0i/file25bf4840fba9/sample2_matched_reads_dedup.fastq, /tmp/RtmpgpEV0i/file25bf4840fba9/sample3_matched_reads_dedup.fastq
#> 	files found
#> Realigning sample /tmp/RtmpgpEV0i/file25bf4840fba9/sampleA_matched_reads_dedup.fastq -> /tmp/RtmpgpEV0i/file25bf4840fba9/sampleA_realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by 1 with CB threads...
#> Realigning sample /tmp/RtmpgpEV0i/file25bf4840fba9/sample1_matched_reads_dedup.fastq -> /tmp/RtmpgpEV0i/file25bf4840fba9/sample1_realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by 1 with CB threads...
#> Realigning sample /tmp/RtmpgpEV0i/file25bf4840fba9/sample2_matched_reads_dedup.fastq -> /tmp/RtmpgpEV0i/file25bf4840fba9/sample2_realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by 1 with CB threads...
#> Realigning sample /tmp/RtmpgpEV0i/file25bf4840fba9/sample3_matched_reads_dedup.fastq -> /tmp/RtmpgpEV0i/file25bf4840fba9/sample3_realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by 1 with CB threads...
#> Running step: transcript_quantification
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
experiment(ppl)
#> $sampleA
#> class: SingleCellExperiment 
#> dim: 14 137 
#> metadata(0):
#> assays(1): counts
#> rownames(14): ENSMUST00000169826.2 ENSMUSG00000025290.17_19_5159_1 ...
#>   ENSMUSG00000025290.17_19_5159_9 ENSMUST00000225023.1
#> rowData names(2): transcript_id gene_id
#> colnames(137): AACCATGAGTCGTTTG AACTCTTGTCACCTAA ... TTGTAGGTCTTACCTA
#>   TTTATGCAGACTAGAT
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> 
#> $sample1
#> class: SingleCellExperiment 
#> dim: 14 62 
#> metadata(0):
#> assays(1): counts
#> rownames(14): ENSMUST00000169826.2 ENSMUSG00000025290.17_19_5159_1 ...
#>   ENSMUSG00000025290.17_19_5159_9 ENSMUST00000225023.1
#> rowData names(2): transcript_id gene_id
#> colnames(62): AAGCCGCGTGTGAATA AAGGAGCGTGCTGTAT ... TTGTAGGTCAGTGTTG
#>   TTTATGCAGACTAGAT
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> 
#> $sample2
#> class: SingleCellExperiment 
#> dim: 14 67 
#> metadata(0):
#> assays(1): counts
#> rownames(14): ENSMUST00000169826.2 ENSMUSG00000025290.17_19_5159_1 ...
#>   ENSMUSG00000025290.17_19_5159_9 ENSMUST00000225023.1
#> rowData names(2): transcript_id gene_id
#> colnames(67): AACCATGAGTCGTTTG AAGCCGCGTGTGAATA ... TTGTAGGTCTTACCTA
#>   TTTATGCAGACTAGAT
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> 
#> $sample3
#> class: SingleCellExperiment 
#> dim: 14 99 
#> metadata(0):
#> assays(1): counts
#> rownames(14): ENSMUST00000169826.2 ENSMUSG00000025290.17_19_5159_1 ...
#>   ENSMUSG00000025290.17_19_5159_9 ENSMUST00000225023.1
#> rowData names(2): transcript_id gene_id
#> colnames(99): AACTCTTGTCACCTAA AACTTTCCACAGACTT ... TTGTAGGGTGGGTATG
#>   TTGTAGGTCAGTGTTG
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#>