Pipeline for multi-sample long-read scRNA-seq data
Source:R/sc_long_multisample_pipeline.R
MultiSampleSCPipeline.Rd
Semi-supervised isofrom detection and annotation for long read data. This variant is
meant for multi-sample scRNA-seq data. Specific parameters can be configured in
the config file (see create_config
), input files are specified via
arguments.
Usage
MultiSampleSCPipeline(
config_file,
outdir,
fastq,
annotation,
genome_fa,
minimap2,
samtools,
barcodes_file,
expect_cell_number
)
Arguments
- config_file
Path to the JSON configuration file. See
create_config
for creating one.- outdir
Path to the output directory. If it does not exist, it will be created.
- fastq
A named vector of fastq file (or folder) paths. Each element of the vector will be treated as a sample. The names of the vector will be used as the sample names. If not named, the sample names will be generated from the file names.
- annotation
The file path to the annotation file in GFF3 / GTF format.
- genome_fa
The file path to the reference genome in FASTA format.
- minimap2
(optional) The path to the minimap2 binary. If not provided, FLAMES will use a copy from bioconda via
basilisk
.- samtools
(optional) The path to the samtools binary. If not provided, FLAMES will use a copy from bioconda via
basilisk
. provided, FLAMES will use a copy from bioconda viabasilisk
.- barcodes_file
The file with expected cell barcodes, with each barcode on a new line.
- expect_cell_number
The expected number of cells in the sample. This is used if
barcodes_file
is not provided. SeeBLAZE
for more details.
Value
A FLAMES.MultiSampleSCPipeline
object. The pipeline can be run using
the run_FLAMES
function. The resulting list of SingleCellExperiment
objects can be accessed using the experiment
method.
Details
By default the pipeline starts with demultiplexing the input fastq data. If the
cell barcodes are known apriori (e.g. via coupled short-read sequencing), the
barcodes_file
argument can be used to specify a file containing the cell
barcodes, and a modified Rcpp version of flexiplex
will be used; otherwise,
expect_cell_number
need to be provided, and BLAZE
will be used to
generate the cell barcodes. The pipeline then aligns the reads to the genome using
minimap2
. The alignment is then used for isoform detection (either using
FLAMES
or bambu
, can be configured). The reads are then realigned
to the detected isoforms. Finally, a transcript count matrix is generated (either
using FLAMES
's simplistic counting or oarfish
's Expectation
Maximization algorithm, can be configured). The results can be accssed with
experiment(pipeline)
. If the pipeline errored out / new steps were configured,
it can be resumed by calling resume_FLAMES(pipeline)
See also
SingleCellPipeline
for single-sample long data and more details on the
pipeline output,
create_config
for creating a configuration file,
BulkPipeline
for bulk long data.
Examples
reads <- ShortRead::readFastq(
system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES")
)
outdir <- tempfile()
dir.create(outdir)
dir.create(file.path(outdir, "fastq"))
bc_allow <- file.path(outdir, "bc_allow.tsv")
genome_fa <- file.path(outdir, "rps24.fa")
R.utils::gunzip(
filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"),
destname = bc_allow, remove = FALSE
)
R.utils::gunzip(
filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"),
destname = genome_fa, remove = FALSE
)
ShortRead::writeFastq(reads[1:100],
file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE)
reads <- reads[-(1:100)]
ShortRead::writeFastq(reads[1:100],
file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE)
reads <- reads[-(1:100)]
ShortRead::writeFastq(reads,
file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE)
ppl <- MultiSampleSCPipeline(
config_file = create_config(outdir, type = "sc_3end", threads = 1, no_flank = TRUE),
outdir = outdir,
fastq = c("sampleA" = file.path(outdir, "fastq"),
"sample1" = file.path(outdir, "fastq", "sample1.fq.gz"),
"sample2" = file.path(outdir, "fastq", "sample2.fq.gz"),
"sample3" = file.path(outdir, "fastq", "sample3.fq.gz")),
annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"),
genome_fa = genome_fa,
barcodes_file = rep(bc_allow, 4)
)
#> Writing configuration parameters to: /tmp/RtmpgpEV0i/file25bf4840fba9/config_file_9663.json
#> Configured steps:
#> barcode_demultiplex: TRUE
#> genome_alignment: TRUE
#> gene_quantification: TRUE
#> isoform_identification: TRUE
#> read_realignment: TRUE
#> transcript_quantification: TRUE
#> samtools not found, will use Rsamtools package instead
ppl <- run_FLAMES(ppl)
#> Running step: barcode_demultiplex
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be replaced
#> Setting number of threads to 1
#> Search pattern:
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpgpEV0i/file25bf4840fba9/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample1.fq.gz
#> Searching for barcodes...
#> Processing file: /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample2.fq.gz
#> Searching for barcodes...
#> Processing file: /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample3.fq.gz
#> Searching for barcodes...
#> Number of reads processed: 393
#> Number of reads where at least one barcode was found: 368
#> Number of reads with exactly one barcode match: 364
#> Number of chimera reads: 1
#> All done!
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be replaced
#> Setting number of threads to 1
#> Search pattern:
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpgpEV0i/file25bf4840fba9/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample1.fq.gz
#> Searching for barcodes...
#> Number of reads processed: 100
#> Number of reads where at least one barcode was found: 92
#> Number of reads with exactly one barcode match: 91
#> Number of chimera reads: 1
#> All done!
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be replaced
#> Setting number of threads to 1
#> Search pattern:
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpgpEV0i/file25bf4840fba9/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample2.fq.gz
#> Searching for barcodes...
#> Number of reads processed: 100
#> Number of reads where at least one barcode was found: 95
#> Number of reads with exactly one barcode match: 94
#> Number of chimera reads: 0
#> All done!
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be replaced
#> Setting number of threads to 1
#> Search pattern:
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpgpEV0i/file25bf4840fba9/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample3.fq.gz
#> Searching for barcodes...
#> Number of reads processed: 193
#> Number of reads where at least one barcode was found: 181
#> Number of reads with exactly one barcode match: 179
#> Number of chimera reads: 0
#> All done!
#> Running step: genome_alignment
#> Creating junction bed file from GFF3 annotation.
#> Aligning sample /tmp/RtmpgpEV0i/file25bf4840fba9/sampleA_matched_reads.fastq -> /tmp/RtmpgpEV0i/file25bf4840fba9/sampleA_align2genome.bam
#> Your fastq file appears to have tags, but you did not provide the -y option to minimap2 to include the tags in the output.
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 1 threads...
#> Indexing bam files
#> Aligning sample /tmp/RtmpgpEV0i/file25bf4840fba9/sample1_matched_reads.fastq -> /tmp/RtmpgpEV0i/file25bf4840fba9/sample1_align2genome.bam
#> Your fastq file appears to have tags, but you did not provide the -y option to minimap2 to include the tags in the output.
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 1 threads...
#> Indexing bam files
#> Aligning sample /tmp/RtmpgpEV0i/file25bf4840fba9/sample2_matched_reads.fastq -> /tmp/RtmpgpEV0i/file25bf4840fba9/sample2_align2genome.bam
#> Your fastq file appears to have tags, but you did not provide the -y option to minimap2 to include the tags in the output.
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 1 threads...
#> Indexing bam files
#> Aligning sample /tmp/RtmpgpEV0i/file25bf4840fba9/sample3_matched_reads.fastq -> /tmp/RtmpgpEV0i/file25bf4840fba9/sample3_align2genome.bam
#> Your fastq file appears to have tags, but you did not provide the -y option to minimap2 to include the tags in the output.
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 1 threads...
#> Indexing bam files
#> Running step: gene_quantification
#> 03:22:07 AM Wed May 21 2025 quantify genes
#> Found genome alignment file(s): sample1_align2genome.bam
#> sample2_align2genome.bam
#> sample3_align2genome.bam
#> sampleA_align2genome.bam
#> Running step: isoform_identification
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Running step: read_realignment
#> Checking for fastq file(s) /tmp/RtmpgpEV0i/file25bf4840fba9/fastq, /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample1.fq.gz, /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample2.fq.gz, /tmp/RtmpgpEV0i/file25bf4840fba9/fastq/sample3.fq.gz
#> files found
#> Checking for fastq file(s) /tmp/RtmpgpEV0i/file25bf4840fba9/sampleA_matched_reads.fastq, /tmp/RtmpgpEV0i/file25bf4840fba9/sample1_matched_reads.fastq, /tmp/RtmpgpEV0i/file25bf4840fba9/sample2_matched_reads.fastq, /tmp/RtmpgpEV0i/file25bf4840fba9/sample3_matched_reads.fastq
#> files found
#> Checking for fastq file(s) /tmp/RtmpgpEV0i/file25bf4840fba9/sampleA_matched_reads_dedup.fastq, /tmp/RtmpgpEV0i/file25bf4840fba9/sample1_matched_reads_dedup.fastq, /tmp/RtmpgpEV0i/file25bf4840fba9/sample2_matched_reads_dedup.fastq, /tmp/RtmpgpEV0i/file25bf4840fba9/sample3_matched_reads_dedup.fastq
#> files found
#> Realigning sample /tmp/RtmpgpEV0i/file25bf4840fba9/sampleA_matched_reads_dedup.fastq -> /tmp/RtmpgpEV0i/file25bf4840fba9/sampleA_realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by 1 with CB threads...
#> Realigning sample /tmp/RtmpgpEV0i/file25bf4840fba9/sample1_matched_reads_dedup.fastq -> /tmp/RtmpgpEV0i/file25bf4840fba9/sample1_realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by 1 with CB threads...
#> Realigning sample /tmp/RtmpgpEV0i/file25bf4840fba9/sample2_matched_reads_dedup.fastq -> /tmp/RtmpgpEV0i/file25bf4840fba9/sample2_realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by 1 with CB threads...
#> Realigning sample /tmp/RtmpgpEV0i/file25bf4840fba9/sample3_matched_reads_dedup.fastq -> /tmp/RtmpgpEV0i/file25bf4840fba9/sample3_realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by 1 with CB threads...
#> Running step: transcript_quantification
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
experiment(ppl)
#> $sampleA
#> class: SingleCellExperiment
#> dim: 14 137
#> metadata(0):
#> assays(1): counts
#> rownames(14): ENSMUST00000169826.2 ENSMUSG00000025290.17_19_5159_1 ...
#> ENSMUSG00000025290.17_19_5159_9 ENSMUST00000225023.1
#> rowData names(2): transcript_id gene_id
#> colnames(137): AACCATGAGTCGTTTG AACTCTTGTCACCTAA ... TTGTAGGTCTTACCTA
#> TTTATGCAGACTAGAT
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#>
#> $sample1
#> class: SingleCellExperiment
#> dim: 14 62
#> metadata(0):
#> assays(1): counts
#> rownames(14): ENSMUST00000169826.2 ENSMUSG00000025290.17_19_5159_1 ...
#> ENSMUSG00000025290.17_19_5159_9 ENSMUST00000225023.1
#> rowData names(2): transcript_id gene_id
#> colnames(62): AAGCCGCGTGTGAATA AAGGAGCGTGCTGTAT ... TTGTAGGTCAGTGTTG
#> TTTATGCAGACTAGAT
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#>
#> $sample2
#> class: SingleCellExperiment
#> dim: 14 67
#> metadata(0):
#> assays(1): counts
#> rownames(14): ENSMUST00000169826.2 ENSMUSG00000025290.17_19_5159_1 ...
#> ENSMUSG00000025290.17_19_5159_9 ENSMUST00000225023.1
#> rowData names(2): transcript_id gene_id
#> colnames(67): AACCATGAGTCGTTTG AAGCCGCGTGTGAATA ... TTGTAGGTCTTACCTA
#> TTTATGCAGACTAGAT
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#>
#> $sample3
#> class: SingleCellExperiment
#> dim: 14 99
#> metadata(0):
#> assays(1): counts
#> rownames(14): ENSMUST00000169826.2 ENSMUSG00000025290.17_19_5159_1 ...
#> ENSMUSG00000025290.17_19_5159_9 ENSMUST00000225023.1
#> rowData names(2): transcript_id gene_id
#> colnames(99): AACTCTTGTCACCTAA AACTTTCCACAGACTT ... TTGTAGGGTGGGTATG
#> TTGTAGGTCAGTGTTG
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#>