Skip to contents

Semi-supervised isofrom detection and annotation for long read data. This variant is meant for single sample scRNA-seq data. Specific parameters can be configured in the config file (see create_config), input files are specified via arguments.

Usage

SingleCellPipeline(
  config_file,
  outdir,
  fastq,
  annotation,
  genome_fa,
  minimap2,
  samtools,
  barcodes_file,
  expect_cell_number
)

Arguments

config_file

Path to the JSON configuration file. See create_config for creating one.

outdir

Path to the output directory. If it does not exist, it will be created.

fastq

Path to the FASTQ file or a directory containing FASTQ files. Each file will be processed as an individual sample.

annotation

The file path to the annotation file in GFF3 / GTF format.

genome_fa

The file path to the reference genome in FASTA format.

minimap2

(optional) The path to the minimap2 binary. If not provided, FLAMES will use a copy from bioconda via basilisk.

samtools

(optional) The path to the samtools binary. If not provided, FLAMES will use a copy from bioconda via basilisk. provided, FLAMES will use a copy from bioconda via basilisk.

barcodes_file

The file with expected cell barcodes, with each barcode on a new line.

expect_cell_number

The expected number of cells in the sample. This is used if barcodes_file is not provided. See BLAZE for more details.

Value

A FLAMES.SingleCellPipeline object. The pipeline can be run using run_FLAMES(pipeline). The results can be accessed with experiment(pipeline). The pipeline also outputs a number of output files into the given outdir directory. Some of these output files include:

matched_reads.fastq

- fastq file with reads demultiplexed

align2genome.bam

- sorted BAM file with reads aligned to genome

matched_reads_dedup.fastq

- demultiplexed and UMI-deduplicated fastq file

transcript_assembly.fa

- transcript sequence from the isoforms

isoform_annotated.filtered.gff3

- isoforms in gff3 format (also contained in the SingleCellExperiment)

realign2transcript.bam

- sorted realigned BAM file using the transcript_assembly.fa as reference

Details

By default the pipeline starts with demultiplexing the input fastq data. If the cell barcodes are known apriori (e.g. via coupled short-read sequencing), the barcodes_file argument can be used to specify a file containing the cell barcodes, and a modified Rcpp version of flexiplex will be used; otherwise, expect_cell_number need to be provided, and BLAZE will be used to generate the cell barcodes. The pipeline then aligns the reads to the genome using minimap2. The alignment is then used for isoform detection (either using FLAMES or bambu, can be configured). The reads are then realigned to the detected isoforms. Finally, a transcript count matrix is generated (either using FLAMES's simplistic counting or oarfish's Expectation Maximization algorithm, can be configured). The results can be accssed with experiment(pipeline). If the pipeline errored out / new steps were configured, it can be resumed by calling resume_FLAMES(pipeline)

See also

create_config for creating a configuration file, BulkPipeline for bulk long data, MultiSampleSCPipeline for multi sample single cell pipelines.

Examples

outdir <- tempfile()
dir.create(outdir)
bc_allow <- file.path(outdir, "bc_allow.tsv")
genome_fa <- file.path(outdir, "rps24.fa")
R.utils::gunzip(
  filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"),
  destname = bc_allow, remove = FALSE
)
R.utils::gunzip(
  filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"),
  destname = genome_fa, remove = FALSE
)
ppl <- SingleCellPipeline(
  config_file = create_config(outdir, gene_quantification = FALSE),
  outdir = outdir,
  fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"),
  annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"),
  genome_fa = genome_fa,
  barcodes_file = bc_allow
)
#> Writing configuration parameters to:  /tmp/RtmpgpEV0i/file25bf554404c4/config_file_9663.json 
#> Configured steps: 
#> 	barcode_demultiplex: TRUE
#> 	genome_alignment: TRUE
#> 	gene_quantification: TRUE
#> 	isoform_identification: TRUE
#> 	read_realignment: TRUE
#> 	transcript_quantification: TRUE
#> samtools not found, will use Rsamtools package instead
ppl <- run_FLAMES(ppl)
#> Running step: barcode_demultiplex
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be  replaced
#> Setting number of threads to 8
#> Search pattern: 
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpgpEV0i/file25bf554404c4/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /__w/_temp/Library/FLAMES/extdata/fastq/musc_rps24.fastq.gz
#> Searching for barcodes...
#> Number of reads processed: 393
#> Number of reads where at least one barcode was found: 368
#> Number of reads with exactly one barcode match: 364
#> Number of chimera reads: 1
#> All done!
#> Running step: genome_alignment
#> Creating junction bed file from GFF3 annotation.
#> Aligning sample /tmp/RtmpgpEV0i/file25bf554404c4/matched_reads.fastq -> /tmp/RtmpgpEV0i/file25bf554404c4/align2genome.bam
#> Your fastq file appears to have tags, but you did not provide the -y option to minimap2 to include the tags in the output.
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 8 threads...
#> Indexing bam files
#> Running step: gene_quantification
#> 03:22:16 AM Wed May 21 2025 quantify genes 
#> Found genome alignment file(s): 	align2genome.bam
#> Running step: isoform_identification
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Running step: read_realignment
#> Checking for fastq file(s) /__w/_temp/Library/FLAMES/extdata/fastq/musc_rps24.fastq.gz
#> 	files found
#> Checking for fastq file(s) /tmp/RtmpgpEV0i/file25bf554404c4/matched_reads.fastq
#> 	files found
#> Checking for fastq file(s) /tmp/RtmpgpEV0i/file25bf554404c4/matched_reads_dedup.fastq
#> 	files found
#> Realigning sample /tmp/RtmpgpEV0i/file25bf554404c4/matched_reads_dedup.fastq -> /tmp/RtmpgpEV0i/file25bf554404c4/realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by 8 with CB threads...
#> Running step: transcript_quantification
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
experiment(ppl)
#> class: SingleCellExperiment 
#> dim: 10 137 
#> metadata(0):
#> assays(1): counts
#> rownames(10): ENSMUST00000169826.2 ENSMUSG00000025290.17_19_5159_1 ...
#>   ENSMUSG00000025290.17_19_5159_8 ENSMUST00000225023.1
#> rowData names(2): transcript_id gene_id
#> colnames(137): AACTCTTGTCACCTAA AACCATGAGTCGTTTG ... TTGTAGGTCAGTGTTG
#>   TTTATGCAGACTAGAT
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):