Semi-supervised isofrom detection and annotation for long read data. This variant is
meant for single sample scRNA-seq data. Specific parameters can be configured in
the config file (see create_config
), input files are specified via
arguments.
Usage
SingleCellPipeline(
config_file,
outdir,
fastq,
annotation,
genome_fa,
minimap2,
samtools,
barcodes_file,
expect_cell_number
)
Arguments
- config_file
Path to the JSON configuration file. See
create_config
for creating one.- outdir
Path to the output directory. If it does not exist, it will be created.
- fastq
Path to the FASTQ file or a directory containing FASTQ files. Each file will be processed as an individual sample.
- annotation
The file path to the annotation file in GFF3 / GTF format.
- genome_fa
The file path to the reference genome in FASTA format.
- minimap2
(optional) The path to the minimap2 binary. If not provided, FLAMES will use a copy from bioconda via
basilisk
.- samtools
(optional) The path to the samtools binary. If not provided, FLAMES will use a copy from bioconda via
basilisk
. provided, FLAMES will use a copy from bioconda viabasilisk
.- barcodes_file
The file with expected cell barcodes, with each barcode on a new line.
- expect_cell_number
The expected number of cells in the sample. This is used if
barcodes_file
is not provided. SeeBLAZE
for more details.
Value
A FLAMES.SingleCellPipeline
object. The pipeline can be run using
run_FLAMES(pipeline)
. The results can be accessed with experiment(pipeline)
.
The pipeline also outputs a number of output files into the given outdir
directory.
Some of these output files include:
- matched_reads.fastq
- fastq file with reads demultiplexed
- align2genome.bam
- sorted BAM file with reads aligned to genome
- matched_reads_dedup.fastq
- demultiplexed and UMI-deduplicated fastq file
- transcript_assembly.fa
- transcript sequence from the isoforms
- isoform_annotated.filtered.gff3
- isoforms in gff3 format (also contained in the SingleCellExperiment)
- realign2transcript.bam
- sorted realigned BAM file using the transcript_assembly.fa as reference
Details
By default the pipeline starts with demultiplexing the input fastq data. If the
cell barcodes are known apriori (e.g. via coupled short-read sequencing), the
barcodes_file
argument can be used to specify a file containing the cell
barcodes, and a modified Rcpp version of flexiplex
will be used; otherwise,
expect_cell_number
need to be provided, and BLAZE
will be used to
generate the cell barcodes. The pipeline then aligns the reads to the genome using
minimap2
. The alignment is then used for isoform detection (either using
FLAMES
or bambu
, can be configured). The reads are then realigned
to the detected isoforms. Finally, a transcript count matrix is generated (either
using FLAMES
's simplistic counting or oarfish
's Expectation
Maximization algorithm, can be configured). The results can be accssed with
experiment(pipeline)
. If the pipeline errored out / new steps were configured,
it can be resumed by calling resume_FLAMES(pipeline)
See also
create_config
for creating a configuration file,
BulkPipeline
for bulk long data,
MultiSampleSCPipeline
for multi sample single cell pipelines.
Examples
outdir <- tempfile()
dir.create(outdir)
bc_allow <- file.path(outdir, "bc_allow.tsv")
genome_fa <- file.path(outdir, "rps24.fa")
R.utils::gunzip(
filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"),
destname = bc_allow, remove = FALSE
)
R.utils::gunzip(
filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"),
destname = genome_fa, remove = FALSE
)
ppl <- SingleCellPipeline(
config_file = create_config(outdir, gene_quantification = FALSE),
outdir = outdir,
fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"),
annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"),
genome_fa = genome_fa,
barcodes_file = bc_allow
)
#> Writing configuration parameters to: /tmp/RtmpgpEV0i/file25bf554404c4/config_file_9663.json
#> Configured steps:
#> barcode_demultiplex: TRUE
#> genome_alignment: TRUE
#> gene_quantification: TRUE
#> isoform_identification: TRUE
#> read_realignment: TRUE
#> transcript_quantification: TRUE
#> samtools not found, will use Rsamtools package instead
ppl <- run_FLAMES(ppl)
#> Running step: barcode_demultiplex
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be replaced
#> Setting number of threads to 8
#> Search pattern:
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpgpEV0i/file25bf554404c4/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /__w/_temp/Library/FLAMES/extdata/fastq/musc_rps24.fastq.gz
#> Searching for barcodes...
#> Number of reads processed: 393
#> Number of reads where at least one barcode was found: 368
#> Number of reads with exactly one barcode match: 364
#> Number of chimera reads: 1
#> All done!
#> Running step: genome_alignment
#> Creating junction bed file from GFF3 annotation.
#> Aligning sample /tmp/RtmpgpEV0i/file25bf554404c4/matched_reads.fastq -> /tmp/RtmpgpEV0i/file25bf554404c4/align2genome.bam
#> Your fastq file appears to have tags, but you did not provide the -y option to minimap2 to include the tags in the output.
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 8 threads...
#> Indexing bam files
#> Running step: gene_quantification
#> 03:22:16 AM Wed May 21 2025 quantify genes
#> Found genome alignment file(s): align2genome.bam
#> Running step: isoform_identification
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Running step: read_realignment
#> Checking for fastq file(s) /__w/_temp/Library/FLAMES/extdata/fastq/musc_rps24.fastq.gz
#> files found
#> Checking for fastq file(s) /tmp/RtmpgpEV0i/file25bf554404c4/matched_reads.fastq
#> files found
#> Checking for fastq file(s) /tmp/RtmpgpEV0i/file25bf554404c4/matched_reads_dedup.fastq
#> files found
#> Realigning sample /tmp/RtmpgpEV0i/file25bf554404c4/matched_reads_dedup.fastq -> /tmp/RtmpgpEV0i/file25bf554404c4/realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by 8 with CB threads...
#> Running step: transcript_quantification
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
experiment(ppl)
#> class: SingleCellExperiment
#> dim: 10 137
#> metadata(0):
#> assays(1): counts
#> rownames(10): ENSMUST00000169826.2 ENSMUSG00000025290.17_19_5159_1 ...
#> ENSMUSG00000025290.17_19_5159_8 ENSMUST00000225023.1
#> rowData names(2): transcript_id gene_id
#> colnames(137): AACTCTTGTCACCTAA AACCATGAGTCGTTTG ... TTGTAGGTCAGTGTTG
#> TTTATGCAGACTAGAT
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):